Floating-Point Representation

Floating-point numbers approximate real numbers with a finite number of bits. You can view the bits that represent a floating-point number with the BitViewer tool. The bits are interpreted as shown in the following formula. The representation is binary, so the base is 2. Each bit bₙ is a binary digit (0 or 1). The precision P is the number of bits in the nonexponential part of the number (the significand), and E is the exponent. With these parameters, binary floating-point numbers approximate real numbers with values of the form:

(-1)^s × b₀.b₁b₂...b_(P-1) × 2^E

where s is 0 (positive) or 1 (negative), and E_min <= E <= E_max.
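As a rough illustration of the formula, the sketch below evaluates a sign, a list of significand bits, and an exponent exactly. The function names `significand_value` and `fp_value` are hypothetical and exist only for this example; exact rational arithmetic is used so that the result is not itself rounded.

```python
from fractions import Fraction

def significand_value(bits):
    """Value of b0.b1b2...b(P-1), where bits[n] is the binary digit b_n."""
    # Bit b_n carries the weight 2**(-n): b0 is the units digit,
    # b1 the halves digit, b2 the quarters digit, and so on.
    return sum(Fraction(b, 2 ** n) for n, b in enumerate(bits))

def fp_value(s, bits, E):
    """Evaluate (-1)**s * b0.b1b2...b(P-1) * 2**E exactly."""
    return (-1) ** s * significand_value(bits) * Fraction(2) ** E

# s = 1, significand 1.101 (binary) = 1.625, E = 2  ->  -1.625 * 4
print(float(fp_value(1, [1, 1, 0, 1], 2)))  # -6.5
```

Only four significand bits are used here to keep the arithmetic easy to follow; a real single-precision value would have P = 24 of them.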

The following table gives the standard values for these parameters for single, double, and extended-double formats and the resulting bit widths for the sign, the exponent, and the full number.

Parameters for IEEE Floating-Point Formats

Parameter	Single	Double	Extended Double
Sign width in bits	1	1	1
P	24	53	64
E_max	+127	+1023	+16383
E_min	-126	-1022	-16382
Exponent bias	+127	+1023	+16383
Exponent width in bits	8	11	15
Format width in bits	32	64	80

The standard requires that the single and double formats be normalized, so b₀ is always 1. Because b₀ is implicit and is not stored, the precisions 24 and 53 need only 23 and 52 stored bits, respectively.
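The bit counts in the table can be cross-checked with a few lines of arithmetic. The sketch below is only a consistency check on the table values: the dictionary layout and the `stored_b0` flag are illustrative names, and it assumes that extended double stores its leading bit explicitly, which is consistent with the 80-bit width in the table.

```python
# Parameters copied from the table. "stored_b0" is an illustrative flag
# marking whether the leading significand bit b0 occupies a stored bit.
FORMATS = {
    "single":          {"P": 24, "exp_bits": 8,  "width": 32, "stored_b0": False},
    "double":          {"P": 53, "exp_bits": 11, "width": 64, "stored_b0": False},
    "extended double": {"P": 64, "exp_bits": 15, "width": 80, "stored_b0": True},
}

def stored_significand_bits(fmt):
    """Significand bits that are actually stored (P, or P - 1 if b0 is implicit)."""
    p = FORMATS[fmt]
    return p["P"] if p["stored_b0"] else p["P"] - 1

def total_width(fmt):
    """Sign bit + exponent bits + stored significand bits."""
    return 1 + FORMATS[fmt]["exp_bits"] + stored_significand_bits(fmt)

def bias(fmt):
    """The tabulated biases equal 2**(exponent width - 1) - 1."""
    return 2 ** (FORMATS[fmt]["exp_bits"] - 1) - 1

for name in FORMATS:
    print(name, stored_significand_bits(name), total_width(name), bias(name))
```

For each format the computed total matches the "Format width in bits" row, and the computed bias matches the "Exponent bias" row.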

The extended-double format need not be normalized, so it stores b₀ explicitly and uses the full 64 bits for the significand. A bias is added to every exponent so that the stored exponent is never negative. This expedites comparisons of exponent values. The stored exponent is:

e = E + bias
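To see the biased exponent in practice, the following sketch splits a double-precision value into its sign bit, stored exponent e, and 52 stored significand bits using Python's standard `struct` module. The helper name `decode_double` is hypothetical, and recovering E as e - bias is meaningful only for normalized values (stored exponents 1 through 2046); zeros, denormals, infinities, and NaNs use the reserved exponent values and are not handled here.

```python
import struct

DOUBLE_BIAS = 1023  # exponent bias for the double format, from the table

def decode_double(x):
    """Return (sign, stored_exponent, stored_fraction_bits) for a double."""
    # Reinterpret the 8 bytes of the double as a 64-bit unsigned integer.
    # Using big-endian for both pack and unpack keeps the bit order
    # independent of the host byte order.
    (raw,) = struct.unpack(">Q", struct.pack(">d", x))
    sign = raw >> 63                   # 1 sign bit
    e = (raw >> 52) & 0x7FF            # 11 exponent bits (biased)
    fraction = raw & ((1 << 52) - 1)   # 52 stored bits b1..b52
    return sign, e, fraction

sign, e, fraction = decode_double(-6.5)
print(sign, e, e - DOUBLE_BIAS)  # 1 1025 2
```

Here -6.5 is -1.625 × 2^2, so the true exponent E is 2 and the stored exponent is 2 + 1023 = 1025.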

For more information:

On the floating-point representation, see Native IEEE Floating-Point Representations.
On using the BitViewer tool, see Viewing Floating-Point Representations with BitViewer.
On reading or writing floating-point data other than native IEEE little-endian data, see Converting Unformatted Numeric Data.