The first IEEE 754 standard was published in 1985 and covered only binary representation. Its basic formats were single and double precision. A second version, published in 2008, added the representation of numbers in decimal, with two basic formats, as well as a basic binary format with quadruple precision. A third version, with minor modifications, was published in 2019.

The five basic formats and their most important parameters are

$$ \begin{array}{|c|ccc|cc|} \hline & \mathrm{Binary}& \mathrm{formats} & (b = 2)& \mathrm{Decimal}\;\mathrm{formats}&(b= 10)\\ \hline \mathrm{parameter} & \mathrm{binary32} & \mathrm{binary64} & \mathrm{binary128} & \mathrm{decimal64} & \mathrm{decimal128}\\ \hline \hline \mathrm{precision} (p) & 24 & 53 & 113 & 16 & 34\\ e_{max} & +127 & +1023 & +16383 & +384 & +6144\\ \hline \end{array} $$

The default format in Numerical Computing is currently double precision binary. We will study single and double precision binary formats in more detail.
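As a quick sanity check of the binary64 column of the table, the following minimal sketch reads the parameters that Python reports for its own `float` type. It assumes that `float` is an IEEE 754 binary64 number, which holds on mainstream platforms but is not guaranteed by the language. Note that `sys.float_info.max_exp` follows a convention shifted by one with respect to $e_{max}$, hence the subtraction.

```python
import sys

info = sys.float_info

# Number of base-2 digits in the significand (the precision p).
precision = info.mant_dig

# float_info.max_exp is the largest e such that radix**(e - 1) is finite,
# so the e_max of the table is one less.
e_max = info.max_exp - 1

print(info.radix, precision, e_max)
```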

Normalized format¶

Decimal number¶

Given the decimal number $314.15$, we normalize it as follows:

We move the point so that a single non-zero digit appears to its left.
Then we multiply by $10^n$, where $n$ is the number of positions the point has moved to the left, or by $10^{-n}$, where $n$ is the number of positions the point has moved to the right.
We add the sign.

Thus, the normalized form of the previous number is

$$+3.1415\times10^2$$

whose components are

Sign: $+$
Mantissa: $3.1415$
Exponent: $2$

We must add that the base (or radix) is $10$.
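The normalization steps can be sketched with Python's `decimal` module. The function name `normalize_decimal` is hypothetical and introduced only for illustration. It relies on `Decimal.adjusted()`, which returns the exponent of the leading digit, and on `Decimal.scaleb()`, which shifts the point.

```python
from decimal import Decimal

def normalize_decimal(text):
    """Return (sign, mantissa, exponent) of a nonzero decimal number given as text."""
    d = Decimal(text)
    if d.is_zero():
        raise ValueError("zero has no normalized form")
    sign = "-" if d.is_signed() else "+"
    # Exponent of the leading non-zero digit, i.e. how far the point must move.
    exponent = d.adjusted()
    # Shift the point so that a single non-zero digit is left of it.
    mantissa = abs(d).scaleb(-exponent)
    return sign, mantissa, exponent

print(normalize_decimal("314.15"))
```

For $314.15$ this gives the sign `+`, the mantissa `3.1415` and the exponent `2`, matching the components listed above.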

Binary number¶

Given the binary number $10101.11001$, we apply the same steps but must take into account that the base is now 2. The normalized number is

$$+1.010111001\times2^4$$

with

Sign: $+$
Mantissa: $1.010111001$
Exponent: $4$ (later, we will express this in binary)

And the base (or radix) is $2$.
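The same procedure for base 2 can be sketched by working directly on the string of binary digits. The helper `normalize_binary` is a hypothetical name used only for this illustration, and it assumes the input contains only the characters `0`, `1`, an optional point and an optional sign.

```python
def normalize_binary(text):
    """Return (sign, mantissa, exponent) for a nonzero binary numeral such as '10101.11001'."""
    sign = "-" if text.startswith("-") else "+"
    digits = text.lstrip("+-")
    int_part, _, frac_part = digits.partition(".")
    all_bits = int_part + frac_part
    first_one = all_bits.find("1")
    if first_one < 0:
        raise ValueError("zero has no normalized form")
    # Positions the point moves: to the left if positive, to the right if negative.
    exponent = len(int_part) - 1 - first_one
    # Digits after the leading 1, without trailing zeros.
    rest = all_bits[first_one + 1:].rstrip("0")
    mantissa = "1." + (rest or "0")
    return sign, mantissa, exponent

print(normalize_binary("10101.11001"))
```

For $10101.11001$ the point moves four positions to the left, so the sketch returns the mantissa `1.010111001` and the exponent `4`.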

The representation of the exponent¶

The exponent in this standard is always an integer stored in biased representation. Let's see how it works with an example. Suppose that we have $m = 4$ bits for the exponent. They can hold $2^m=2^4=16$ different combinations. Subtracting the bias $2^{m-1}-1=7$ from the face value of each combination, and reserving the first and the last combinations, we can represent the following numbers

$$ \begin{array}{cccc} \hline \mathrm{Binary} & \mathrm{Face}& & \mathrm{Signed}\,\mathrm{integer}\\ \mathrm{representation} & \mathrm{value}& & \mathrm{(Exponent)}\\ (m=4\; bits)& & & bias=2^{m-1}-1 \\ \hline \mathtt{0000} & 0 & & Reserved\\ \mathtt{0001} & 1 & & -6\\ \mathtt{0010} & 2 & & -5\\ \mathtt{0011} & 3 & & -4\\ \mathtt{0100} & 4 & & -3\\ \mathtt{0101} & 5 &bias & -2\\ \mathtt{0110} & 6 & \longrightarrow& -1\\ \mathtt{0111} & 7 & -7& 0\\ \mathtt{1000} & 8 & & 1\\ \mathtt{1001} & 9 & & 2\\ \mathtt{1010} & 10 & & 3\\ \mathtt{1011} & 11 & & 4\\ \mathtt{1100} & 12 & & 5\\ \mathtt{1101} & 13 & & 6\\ \mathtt{1110} & 14 & & 7\\ \mathtt{1111} & 15 & & Reserved\\ \hline \end{array} $$
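The table can be reproduced with a few lines of Python. This is a minimal sketch: the function name `decode_exponent` is hypothetical, and reserved patterns are reported as `None` simply as a convention for this example.

```python
def decode_exponent(bits):
    """Decode a biased exponent given as a bit string; None marks a reserved pattern."""
    m = len(bits)
    bias = 2 ** (m - 1) - 1
    face = int(bits, 2)
    # The all-zeros and all-ones patterns are reserved.
    if face == 0 or face == 2 ** m - 1:
        return None
    return face - bias

# Print the whole table for m = 4 bits.
for face in range(16):
    bits = format(face, "04b")
    print(bits, face, decode_exponent(bits))
```

Because the bias is computed from the length of the bit string, the same function also works for the 8-bit and 11-bit exponent fields.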

Binary32 (single precision)¶

It uses 32 bits (4 bytes):

1 bit for the sign.
8 bits for the exponent.
23 bits for the mantissa (or significand).

Sign¶

It is $\mathtt{0}$ for a positive number and $\mathtt{1}$ for a negative one.

Exponent¶

We have $m=8$ bits for the exponent. Therefore there are $2^m=2^8=256$ different combinations, with face values running from $0$ to $255$. The first one, $\mathtt{0000\,0000}$, and the last one, $\mathtt{1111\,1111}$, are reserved (we will see later for what). Since the representation is biased, we subtract the bias, which is

$$bias=2^{m-1}-1=2^{8-1}-1=2^7-1=128-1=127$$

to get the represented value

$$ \begin{array}{cccc} \hline \mathrm{Binary}\; \mathrm{number} & \mathrm{Face}\; \mathrm{value} & & \mathrm{Represented}\; \mathrm{value}\\ \hline \mathtt{0000\,0000}& 0& & R\\ \mathtt{0000\,0001}& 1& & -126\\ \mathtt{0000\,0010}& 2& & -125\\ \mathtt{0000\,0011}& 3& & -124\\ \cdots & \cdots & -127 & \cdots\\ \cdots & \cdots & \longrightarrow & \cdots\\ \mathtt{1111\,1100}& 252 & & 125 \\ \mathtt{1111\,1101}& 253 & & 126 \\ \mathtt{1111\,1110}& 254 & & 127 \\ \mathtt{1111\,1111}& 255 & & R\\ \hline \end{array} $$

Thus, the minimum represented exponent is $e_{min} = -126$ and the maximum $e_{max} = 127.$

Mantissa¶

In binary format, in the normalized representation, the digit to the left of the point is always $\mathtt{1}$. It is not stored and is called the hidden bit. So even though only $32-1-8=23$ binary digits are stored, the significand effectively has one more and the precision is $24.$
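To look at the three fields of an actual binary32 number, one option is Python's `struct` module, which can pack a value as a 32-bit float and reread the same bytes as an integer. The following is a sketch under that approach, and `binary32_fields` is a hypothetical helper name.

```python
import struct

def binary32_fields(x):
    """Return the (sign, exponent, mantissa) bit strings of x stored as binary32."""
    # Pack as a big-endian 32-bit float, then reread the 4 bytes as an unsigned integer.
    (word,) = struct.unpack(">I", struct.pack(">f", x))
    bits = format(word, "032b")
    # 1 sign bit, 8 exponent bits, 23 stored mantissa bits (the hidden bit is absent).
    return bits[0], bits[1:9], bits[9:]

# 0.15625 is 1.01 (binary) times 2^-3.
sign, exponent, mantissa = binary32_fields(0.15625)
print(sign, exponent, mantissa)
print(int(exponent, 2) - 127)  # face value minus the bias: -3
```

The stored mantissa starts with `01`, the digits to the right of the point in $1.01$. The leading $\mathtt{1}$ does not appear because it is the hidden bit.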

Binary64 (double precision)¶

It uses 64 bits (8 bytes):

1 bit for the sign.
11 bits for the exponent.
52 bits for the mantissa (or significand).

Sign¶

As before, it is $\mathtt{0}$ for a positive number and $\mathtt{1}$ for a negative one.

Exponent¶

We have $m=11$ bits for the exponent. Therefore there are $2^m=2^{11}=2048$ different combinations, with face values running from $0$ to $2047$. The first one, $\mathtt{0000\,0000\,000}$, and the last one, $\mathtt{1111\,1111\,111}$, are reserved. Since the representation is biased, we subtract the bias, which is

$$bias=2^{m-1}-1=2^{11-1}-1=2^{10}-1=1024-1 = 1023$$

to get the represented value.

$$ \begin{array}{cccc} \hline \mathrm{Binary}\; \mathrm{number} & \mathrm{Face}\; \mathrm{value} & & \mathrm{Represented}\; \mathrm{value}\\ \hline \mathtt{0000\,0000\,000}& 0& & R\\ \mathtt{0000\,0000\,001}& 1& & -1022\\ \mathtt{0000\,0000\,010}& 2& & -1021\\ \mathtt{0000\,0000\,011}& 3& & -1020\\ \cdots & \cdots & -1023 & \cdots\\ \cdots & \cdots & \longrightarrow & \cdots\\ \mathtt{1111\,1111\,100}& 2044 & & 1021 \\ \mathtt{1111\,1111\,101}& 2045 & & 1022 \\ \mathtt{1111\,1111\,110}& 2046 & & 1023 \\ \mathtt{1111\,1111\,111}& 2047 & & R\\ \hline \end{array} $$

Thus, the minimum represented exponent is $e_{min} = -1022$ and the maximum $e_{max} = 1023.$

Mantissa¶

In binary format, in the normalized representation, the digit to the left of the point is always $\mathtt{1}$. It is not stored and is called the hidden bit. So even though only $64-1-11=52$ binary digits are stored, the significand effectively has one more and the precision is $53.$
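The binary64 layout can be inspected in the same spirit with Python's `struct` module, now using 64-bit codes. This is a minimal sketch, and `binary64_fields` is a hypothetical helper name.

```python
import struct

def binary64_fields(x):
    """Return the (sign, exponent, mantissa) bit strings of x stored as binary64."""
    # Pack as a big-endian 64-bit float, then reread the 8 bytes as an unsigned integer.
    (word,) = struct.unpack(">Q", struct.pack(">d", x))
    bits = format(word, "064b")
    # 1 sign bit, 11 exponent bits, 52 stored mantissa bits.
    return bits[0], bits[1:12], bits[12:]

# 0.15625 is 1.01 (binary) times 2^-3.
sign, exponent, mantissa = binary64_fields(0.15625)
print(sign, exponent, mantissa)
print(int(exponent, 2) - 1023)  # face value minus the bias: -3
```

The exponent field holds the face value $1020$, and subtracting the bias $1023$ recovers the exponent $-3$.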

Exercise¶

If the number

$$ \begin{array}{|c|c|c|} \hline \mathrm{Sign} & \mathrm{Exponent} & \mathrm{Mantissa}\\ 1\,\mathrm{bit} & 8\,\mathrm{bits} & 23\,\mathrm{bits} \\ \hline \mathtt{1}&\mathtt{1000\,1101}&\mathtt{0110\,1000\,0000\,0000\,0000\,000}\\ \hline \end{array} $$

follows the IEEE 754 single precision floating point representation, give its representation in the decimal base.

Sign¶

As the bit for the sign is $\mathtt{{\color{red}1}}$, the sign is negative.

Exponent¶

The bits of the exponent are $\mathtt{{\color{red}{1000\,1101}}}$. If we take into account the position of the digits

$$ \begin{array}{cccccccc} \tiny{(7)}&\tiny{(6)}&\tiny{(5)}&\tiny{(4)}& \,\tiny{(3)}&\tiny{(2)}&\tiny{(1)}&\tiny{(0)}\\ \mathtt{1}&\mathtt{0}&\mathtt{0}&\mathtt{0}&\,\mathtt{1}&\mathtt{1}&\mathtt{0}&\mathtt{1} \end{array} $$

we have

$$2^7+2^3+2^2+2^0=128+8+4+1=141$$

And, as we have $m=8$ bits, $bias = 2^{m-1}-1 = 2^{8-1}-1=2^7-1=128-1=127$

And the exponent value is $141-127=14$.

$$ \begin{array}{cccc} \hline \mathrm{Binary}\; \mathrm{number} & \mathrm{Face}\; \mathrm{value} & & \mathrm{Represented}\; \mathrm{value}\\ \hline \mathtt{0000\,0000}& 0& & R\\ \mathtt{0000\,0001}& 1& & -126\\ \mathtt{0000\,0010}& 2& & -125\\ \mathtt{0000\,0011}& 3& & -124\\ \cdots & \cdots & -127 & \cdots\\ \cdots & \cdots & \longrightarrow & \cdots\\ \mathtt{{\color{red}{1000\,1101}}}& {\color{red}{141}}& & {\color{red}{14}}\\ \cdots & \cdots & & \cdots\\ \cdots & \cdots & & \cdots\\ \mathtt{1111\,1100}& 252 & & 125 \\ \mathtt{1111\,1101}& 253 & & 126 \\ \mathtt{1111\,1110}& 254 & & 127 \\ \mathtt{1111\,1111}& 255 & & R\\ \hline \end{array} $$

Mantissa¶

The digits for the mantissa are $\mathtt{{\color{ForestGreen}{0110\,1000\,0000\,0000\,0000\,000}}}$. Taking into account the hidden bit, which is $\mathtt{1}$, the significand is

$$\mathtt{1.{\color{ForestGreen}{0110\,1000\,0000\,0000\,0000\,000}}}$$

or, dropping the trailing zeros,

$$\mathtt{1.{\color{ForestGreen}{0110\,1}}}$$

$$ \begin{array}{ccccccc} \tiny{(0)}&&\tiny{(-1)}&\tiny{(-2)}&\tiny{(-3)}&\tiny{(-4)}&\tiny{(-5)}\\ \mathtt{1}&.&\mathtt{{\color{ForestGreen}0}}&\mathtt{{\color{ForestGreen}1}}&\mathtt{\color{ForestGreen}1}&\,\mathtt{{\color{ForestGreen}0}}&\mathtt{\color{ForestGreen}1} \end{array} $$

Number¶

Summing up, the number is

$${\color{red}-}\mathtt{1.{\color{ForestGreen}{0110\,1}}}\times 2^{\color{red}{14}} \quad \longrightarrow \quad -(1+2^{-2}+2^{-3}+2^{-5})\times2^{14} = \fbox{$-$23040} $$
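The result of the exercise can be cross-checked in two independent ways: by applying the formula by hand and by letting Python's `struct` module reinterpret the 32 bits as a float. The function names below are hypothetical, and the by-hand version handles only normalized numbers, that is, exponent fields that are not reserved.

```python
import struct

def decode_by_formula(sign, exponent, mantissa):
    """Apply (-1)^s * (1 + fraction) * 2^(face - bias) to the three bit strings."""
    bias = 2 ** (len(exponent) - 1) - 1
    # Value of the stored digits, all of them to the right of the point.
    fraction = int(mantissa, 2) / 2 ** len(mantissa)
    value = (1 + fraction) * 2.0 ** (int(exponent, 2) - bias)
    return -value if sign == "1" else value

def decode_by_struct(sign, exponent, mantissa):
    """Reinterpret the 32 bits as a big-endian binary32 number."""
    word = int(sign + exponent + mantissa, 2)
    (value,) = struct.unpack(">f", struct.pack(">I", word))
    return value

fields = ("1", "10001101", "01101000000000000000000")
print(decode_by_formula(*fields))  # -23040.0
print(decode_by_struct(*fields))   # -23040.0
```

Both routes give $-23040$, in agreement with the computation above.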
