Floating-Point Representation

Floating-point numbers approximate real numbers with a finite number of bits. You can see the bits representing a floating-point number with the BitViewer tool. The value those bits encode is given by the following formula. The representation is binary, so the base is 2. Each b_n is a binary digit (0 or 1). The precision P is the number of bits in the significand (the nonexponential part of the number), and E is the exponent. With these parameters, binary floating-point numbers approximate real numbers with values of the form:

$$(-1)^s \times b_0 . b_1 b_2 \ldots b_{P-1} \times 2^E$$

where s is 0 (positive) or 1 (negative), and $E_{min} \le E \le E_{max}$.
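The formula can be evaluated directly. The following Python sketch is illustrative only; the function name fp_value and its argument layout are assumptions made for this example, not part of any tool mentioned here. It uses exact fractions so that no rounding hides the arithmetic.

```python
from fractions import Fraction

def fp_value(s, bits, E):
    """Return (-1)^s * b0.b1b2...b(P-1) * 2^E exactly, as a Fraction.

    s    -- sign bit, 0 or 1
    bits -- significand bits [b0, b1, ..., b(P-1)], each 0 or 1
    E    -- unbiased exponent
    """
    # b0 has weight 2^0, b1 has weight 2^-1, b2 has weight 2^-2, and so on.
    significand = sum(Fraction(b, 2 ** i) for i, b in enumerate(bits))
    return (-1) ** s * significand * Fraction(2) ** E

# 1.101 (binary) x 2^2 = 1.625 x 4 = 6.5
print(fp_value(0, [1, 1, 0, 1], 2))   # 13/2
# -1.000 (binary) x 2^-1 = -0.5
print(fp_value(1, [1, 0, 0, 0], -1))  # -1/2
```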

The following table gives the standard values for these parameters for single, double, and extended-double formats and the resulting bit widths for the sign, the exponent, and the full number.

Parameters for IEEE Floating-Point Formats

Parameter                 Single    Double    Extended Double
Sign width in bits        1         1         1
P                         24        53        64
Emax                      +127      +1023     +16383
Emin                      -126      -1022     -16382
Exponent bias             +127      +1023     +16383
Exponent width in bits    8         11        15
Format width in bits      32        64        80
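The rows of this table are not independent. As a rough consistency check, the sketch below recomputes the bias, Emax, Emin, and format width from P and the exponent width alone. It assumes that the bias is 2^(w-1) - 1 for an exponent field of w bits, and that the leading significand bit is stored only in the extended-double format; the names FORMATS and derived are hypothetical.

```python
# (P, exponent width, is the leading bit b0 stored explicitly?)
FORMATS = {
    "single":          (24, 8,  False),
    "double":          (53, 11, False),
    "extended double": (64, 15, True),
}

def derived(P, exp_width, explicit_b0):
    """Return (bias, Emax, Emin, format width) for one format."""
    bias = 2 ** (exp_width - 1) - 1
    emax = bias
    emin = 1 - bias
    # With an implicit leading bit, only P - 1 significand bits are stored.
    stored_significand = P if explicit_b0 else P - 1
    width = 1 + exp_width + stored_significand   # sign + exponent + significand
    return bias, emax, emin, width

for name, params in FORMATS.items():
    print(name, derived(*params))
# single (127, 127, -126, 32)
# double (1023, 1023, -1022, 64)
# extended double (16383, 16383, -16382, 80)
```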

The standard requires that numbers in the single and double formats be stored normalized, so b0 is always 1 and does not need to be stored. Precisions of 24 and 53 therefore take only 23 and 52 stored bits, respectively, because the leading 1 is implicit. (Zero and denormalized numbers, which are encoded with a reserved exponent value, are the exception.)
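To see the implicit leading bit in practice, the following sketch pulls the 23 stored significand bits out of a single-precision value and restores b0 by hand. It relies on Python's standard struct module, whose ">f" code packs a value in IEEE single format; the helper name single_fraction_field is an assumption for this example, and the reconstruction shown applies to normalized numbers only.

```python
import struct

def single_fraction_field(x):
    """Return the 23 stored significand bits b1..b23 of x in single format."""
    # Pack as a big-endian IEEE single, then reread the same 4 bytes as an integer.
    (pattern,) = struct.unpack(">I", struct.pack(">f", x))
    return pattern & 0x7FFFFF   # keep the low 23 bits

# 6.5 = 1.101 (binary) x 2^2, so the stored bits start with 101.
fraction = single_fraction_field(6.5)
print(format(fraction, "023b"))      # 101 followed by 20 zeros

# Restore the implicit b0 = 1 to recover the full 24-bit significand.
significand = 1 + fraction / 2 ** 23
print(significand)                   # 1.625
```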

The extended-double format need not be normalized, so it stores b0 explicitly and uses the full 64 bits for the significand.

A bias is added to every exponent so that the stored exponent is always a positive integer. This expedites comparisons of exponent values. The stored exponent e is:

e = E + bias
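As an illustration, the sketch below reads the stored exponent e out of several double-format values and subtracts the bias to recover E. It assumes the double layout from the table (an 11-bit exponent field above 52 stored significand bits); the helper name stored_exponent is hypothetical.

```python
import struct

BIAS = 1023  # exponent bias for the double format

def stored_exponent(x):
    """Return the 11-bit stored (biased) exponent e of x in double format."""
    # Pack as a big-endian IEEE double, then reread the same 8 bytes as an integer.
    (pattern,) = struct.unpack(">Q", struct.pack(">d", x))
    return (pattern >> 52) & 0x7FF   # skip 52 significand bits, drop the sign

# Includes the smallest and largest normalized powers of two.
for x in (1.0, 0.5, 6.5, 2.0 ** -1022, 2.0 ** 1023):
    e = stored_exponent(x)
    print(x, "e =", e, "E =", e - BIAS)

# Stored exponents are plain positive integers, so they compare directly.
print(stored_exponent(0.5) < stored_exponent(1.0) < stored_exponent(6.5))  # True
```

For these inputs e runs from 1 (E = Emin = -1022) to 2046 (E = Emax = +1023), so no stored exponent is negative.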

For more information: