
Introduction

The IEEE 754 standard

The most commonly used floating-point format is defined by the IEEE 754 standard. IEEE stands for Institute of Electrical and Electronics Engineers. Almost all current processors implement this standard.

The first IEEE 754 standard was published in 1985 and included only the binary representation. Its basic formats were single and double precision. In 2008, a second version was published, which added the representation of numbers in decimal, with two basic formats, and a basic binary format with quadruple precision. In 2019, a third version, with minor modifications, was published.

The five basic formats and their most important parameters are

$$ \begin{array}{|c|ccc|cc|} \hline & \mathrm{Binary}& \mathrm{formats} & (b = 2)& \mathrm{Decimal}\;\mathrm{formats}&(b= 10)\\ \hline \mathrm{parameter} & \mathrm{binary32} & \mathrm{binary64} & \mathrm{binary128} & \mathrm{decimal64} & \mathrm{decimal128}\\ \hline \hline \mathrm{precision} (p) & 24 & 53 & 113 & 16 & 34\\ e_{max} & +127 & +1023 & +16383 & +384 & +6144\\ \hline \end{array} $$

The default format in Numerical Computing is currently double precision binary. We will study single and double precision binary formats in more detail.

Normalized format

Decimal number

Given the decimal number $314.15$, we normalize it as follows:

  1. We move the point so that a single non-zero digit appears to its left.
  2. We multiply by $10^n$ if we have moved the point $n$ positions to the left, or by $10^{-n}$ if we have moved it $n$ positions to the right.
  3. We add the sign.

Thus, the previous number, normalized is

$$+3.1415\times10^2$$

whose components are

  • Sign: $+$
  • Mantissa: $3.1415$
  • Exponent: $2$

We must add that the base (or radix) is $10$.
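The three steps above can be sketched in Python. This is only an illustration: the function name `normalize_decimal` is hypothetical, and the sketch relies on the standard `decimal` module, whose `adjusted()` method returns the exponent of the leading digit.

```python
from decimal import Decimal

def normalize_decimal(text):
    """Return (sign, mantissa, exponent) for a nonzero decimal literal."""
    d = Decimal(text)
    if d == 0:
        raise ValueError("zero has no normalized form")
    sign = "-" if d.is_signed() else "+"
    exponent = d.adjusted()              # power of 10 of the leading digit
    mantissa = abs(d).scaleb(-exponent)  # move the point next to that digit
    return sign, str(mantissa), exponent

print(normalize_decimal("314.15"))  # ('+', '3.1415', 2)
```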

Binary number

Given the binary number $10101.11001$, we apply the same steps but must take into account that the base is now 2. The normalized number is

$$+1.010111001\times2^4$$

with

  • Sign: $+$
  • Mantissa: $1.010111001$
  • Exponent: $4$ (later, we will express this in binary)

And the base (or radix) is $2$.
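The same procedure for base 2 can be sketched as follows. The helper `normalize_binary` is a hypothetical name; it works on the digits as text, assumes an unsigned input, and drops trailing zeros from the mantissa.

```python
def normalize_binary(text):
    """Normalize an unsigned binary literal such as '10101.11001'.

    Returns (mantissa, exponent), where mantissa has the form '1.xxx'.
    """
    integer_part, _, fraction_part = text.partition(".")
    digits = integer_part + fraction_part
    first_one = digits.find("1")
    if first_one < 0:
        raise ValueError("zero has no normalized form")
    # Positions the point moves: positive to the left, negative to the right.
    exponent = len(integer_part) - first_one - 1
    tail = digits[first_one + 1:].rstrip("0") or "0"
    return "1." + tail, exponent

print(normalize_binary("10101.11001"))  # ('1.010111001', 4)
print(normalize_binary("0.00101"))      # ('1.01', -3)
```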

The representation of the exponent

In this standard the exponent is always an integer stored in biased representation. Let's see how it works with an example. Suppose that we have $m = 4$ bits for the exponent. We can store $2^m=2^4=16$ different bit combinations, and subtracting the bias from the face value of each one gives the following numbers

$$ \begin{array}{cccc} \hline \mathrm{Binary} & \mathrm{Face}& & \mathrm{Signed}\,\mathrm{integer}\\ \mathrm{representation} & \mathrm{value}& & \mathrm{(Exponent)}\\ (m=4\; bits)& & & bias=2^{m-1}-1 \\ \hline \mathtt{0000} & 0 & & Reserved\\ \mathtt{0001} & 1 & & -6\\ \mathtt{0010} & 2 & & -5\\ \mathtt{0011} & 3 & & -4\\ \mathtt{0100} & 4 & & -3\\ \mathtt{0101} & 5 &bias & -2\\ \mathtt{0110} & 6 & \longrightarrow& -1\\ \mathtt{0111} & 7 & -7& 0\\ \mathtt{1000} & 8 & & 1\\ \mathtt{1001} & 9 & & 2\\ \mathtt{1010} & 10 & & 3\\ \mathtt{1011} & 11 & & 4\\ \mathtt{1100} & 12 & & 5\\ \mathtt{1101} & 13 & & 6\\ \mathtt{1110} & 14 & & 7\\ \mathtt{1111} & 15 & & Reserved\\ \hline \end{array} $$
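The table can be reproduced with a few lines of Python. This is a minimal sketch: `decode_exponent` is a hypothetical helper that takes the exponent field as a string of bits and returns `None` for the two reserved patterns.

```python
def decode_exponent(bits):
    """Map an m-bit exponent field to its signed value, or None if reserved."""
    m = len(bits)
    bias = 2 ** (m - 1) - 1
    face_value = int(bits, 2)
    if face_value == 0 or face_value == 2 ** m - 1:
        return None  # all zeros and all ones are reserved
    return face_value - bias

# Print the 16 rows of the 4-bit table: bits, face value, exponent.
for face_value in range(16):
    bits = format(face_value, "04b")
    print(bits, face_value, decode_exponent(bits))
```

Because the bias is computed from the length of the bit string, the same function also applies to the 8-bit and 11-bit exponent fields.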

Binary32 (single precision)

 

It uses 32 bits (4 bytes):

  • 1 bit for the sign.
  • 8 bits for the exponent.
  • 23 bits for the mantissa (or significand).

Sign

It is $\mathtt{0}$ for a positive number and $\mathtt{1}$ for a negative one.

Exponent

We have $m=8$ bits for the exponent. Therefore there are $2^m=2^8=256$ different combinations and, in principle, we can represent $256$ numbers. As the face values start at $0$, they end at $255$. The first combination, $\mathtt{0000\,0000}$, and the last one, $\mathtt{1111\,1111}$, are reserved (we will see later what for). Since the representation is biased, we subtract the bias, which is

$$bias=2^{m-1}-1=2^{8-1}-1=2^7-1=128-1=127$$

to get the represented value

$$ \begin{array}{cccc} \hline \mathrm{Binary}\; \mathrm{number} & \mathrm{Face}\; \mathrm{value} & & \mathrm{Represented}\; \mathrm{value}\\ \hline \mathtt{0000\,0000}& 0& & R\\ \mathtt{0000\,0001}& 1& & -126\\ \mathtt{0000\,0010}& 2& & -125\\ \mathtt{0000\,0011}& 3& & -124\\ \cdots & \cdots & -127 & \cdots\\ \cdots & \cdots & \longrightarrow & \cdots\\ \mathtt{1111\,1100}& 252 & & 125 \\ \mathtt{1111\,1101}& 253 & & 126 \\ \mathtt{1111\,1110}& 254 & & 127 \\ \mathtt{1111\,1111}& 255 & & R\\ \hline \end{array} $$

Thus, the minimum represented exponent is $e_{min} = -126$ and the maximum $e_{max} = 127.$

Mantissa

In binary format, in the normalized representation, the digit to the left of the point is always $\mathtt{1}$. It is not stored and is called the hidden bit. So even though only $32-1-8=23$ binary digits are stored, the mantissa effectively has one more digit and the precision is $24.$
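To look at the three fields of an actual binary32 number, one option is the standard `struct` module. This is a sketch under assumptions: `binary32_fields` is a hypothetical helper, and the argument is a Python float that is rounded to single precision when packed.

```python
import struct

def binary32_fields(x):
    """Return the (sign, exponent, fraction) bit strings of x as binary32."""
    # Pack x as a 4-byte single and reinterpret those bytes as an integer.
    (word,) = struct.unpack(">I", struct.pack(">f", x))
    bits = format(word, "032b")
    return bits[0], bits[1:9], bits[9:]

sign, exponent, fraction = binary32_fields(1.0)
print(sign, exponent, fraction)
# 0 01111111 00000000000000000000000
print(int(exponent, 2) - 127)  # 0
```

For $1.0 = 1.0\times2^0$ the stored fraction is all zeros, because the leading $\mathtt{1}$ is the hidden bit, and the exponent field holds the face value $127$.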


Binary64 (double precision)

It uses 64 bits (8 bytes):

  • 1 bit for the sign.
  • 11 bits for the exponent.
  • 52 bits for the mantissa (or significand).

Sign

It is $\mathtt{0}$ for a positive number and $\mathtt{1}$ for a negative one.

Exponent

We have $m=11$ bits for the exponent. Therefore there are $2^m=2^{11}=2048$ different combinations and, in principle, we can represent $2048$ numbers. As the face values start at $0$, they end at $2047$. The first combination, $\mathtt{0000\,0000\,000}$, and the last one, $\mathtt{1111\,1111\,111}$, are reserved. Since the representation is biased, we subtract the bias, which is

$$bias=2^{m-1}-1=2^{11-1}-1=2^{10}-1=1024-1 = 1023$$

to get the represented value.

$$ \begin{array}{cccc} \hline \mathrm{Binary}\; \mathrm{number} & \mathrm{Face}\; \mathrm{value} & & \mathrm{Represented}\; \mathrm{value}\\ \hline \mathtt{0000\,0000\,000}& 0& & R\\ \mathtt{0000\,0000\,001}& 1& & -1022\\ \mathtt{0000\,0000\,010}& 2& & -1021\\ \mathtt{0000\,0000\,011}& 3& & -1020\\ \cdots & \cdots & -1023 & \cdots\\ \cdots & \cdots & \longrightarrow & \cdots\\ \mathtt{1111\,1111\,100}& 2044 & & 1021 \\ \mathtt{1111\,1111\,101}& 2045 & & 1022 \\ \mathtt{1111\,1111\,110}& 2046 & & 1023 \\ \mathtt{1111\,1111\,111}& 2047 & & R\\ \hline \end{array} $$

And the minimum exponent is $e_{min} = -1022$ and the maximum $e_{max} = 1023.$

Mantissa

In binary format, in the normalized representation, the digit to the left of the point is always $\mathtt{1}$. It is not stored and is called the hidden bit. So even though only $64-1-11=52$ binary digits are stored, the mantissa effectively has one more digit and the precision is $53.$
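The effect of the 53-digit precision can be checked directly in Python, assuming (as is the case on common platforms) that the built-in `float` type is binary64. The helper `binary64_fields` is a hypothetical name used only for this sketch.

```python
import struct

def binary64_fields(x):
    """Return the (sign, exponent, fraction) bit strings of x as binary64."""
    (word,) = struct.unpack(">Q", struct.pack(">d", x))
    bits = format(word, "064b")
    return bits[0], bits[1:12], bits[12:]

sign, exponent, fraction = binary64_fields(1.0)
print(int(exponent, 2) - 1023, len(fraction))  # 0 52

# With p = 53 binary digits, 2**53 - 1 is stored exactly,
# but 2**53 + 1 would need 54 digits and is rounded.
print(float(2**53 - 1) == 2**53 - 1)  # True
print(float(2**53 + 1) == 2**53 + 1)  # False
```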


Exercise

If the number

$$ \begin{array}{|c|c|c|} \hline \mathrm{Sign} & \mathrm{Exponent} & \mathrm{Mantissa}\\ 1\,\mathrm{bit} & 8\,\mathrm{bits} & 23\,\mathrm{bits} \\ \hline \mathtt{1}&\mathtt{1000\,1101}&\mathtt{0110\,1000\,0000\,0000\,0000\,000}\\ \hline \end{array} $$

follows the IEEE 754 single precision floating-point representation, give its value in base 10.


Sign

As the bit for the sign is $\mathtt{{\color{red}1}}$ $\longrightarrow$ the sign is negative.

Exponent

The bits of the exponent are $\mathtt{{\color{red}{1000\,1101}}}$. To obtain their face value, we take into account the position of each digit

$$ \begin{array}{cccccccc} \tiny{(7)}&\tiny{(6)}&\tiny{(5)}&\tiny{(4)}& \,\tiny{(3)}&\tiny{(2)}&\tiny{(1)}&\tiny{(0)}\\ \mathtt{1}&\mathtt{0}&\mathtt{0}&\mathtt{0}&\,\mathtt{1}&\mathtt{1}&\mathtt{0}&\mathtt{1} \end{array} $$

we have

$$2^7+2^3+2^2+2^0=128+8+4+1=141$$

And, as we have $m=8$ bits, $bias = 2^{m-1}-1 = 2^{8-1}-1=2^7-1=128-1=127$

And the exponent value is $141-127=14$.

$$ \begin{array}{cccc} \hline \mathrm{Binary}\; \mathrm{number} & \mathrm{Face}\; \mathrm{value} & & \mathrm{Represented}\; \mathrm{value}\\ \hline \mathtt{0000\,0000}& 0& & R\\ \mathtt{0000\,0001}& 1& & -126\\ \mathtt{0000\,0010}& 2& & -125\\ \mathtt{0000\,0011}& 3& & -124\\ \cdots & \cdots & -127 & \cdots\\ \cdots & \cdots & \longrightarrow & \cdots\\ \mathtt{{\color{red}{1000\,1101}}}& {\color{red}{141}}& & {\color{red}{14}}\\ \cdots & \cdots & & \cdots\\ \cdots & \cdots & & \cdots\\ \mathtt{1111\,1100}& 252 & & 125 \\ \mathtt{1111\,1101}& 253 & & 126 \\ \mathtt{1111\,1110}& 254 & & 127 \\ \mathtt{1111\,1111}& 255 & & R\\ \hline \end{array} $$

Mantissa

The stored digits of the mantissa are $\mathtt{{\color{ForestGreen}{0110\,1000\,0000\,0000\,0000\,000}}}$. Adding the hidden bit, which is $\mathtt{1}$, the mantissa is

$$\mathtt{1.{\color{ForestGreen}{0110\,1000\,0000\,0000\,0000\,000}}}$$

or, dropping the trailing zeros,

$$\mathtt{1.{\color{ForestGreen}{0110\,1}}}$$

where the position of each digit is

$$ \begin{array}{ccccccc} \tiny{(0)}&&\tiny{(-1)}&\tiny{(-2)}&\tiny{(-3)}&\tiny{(-4)}&\tiny{(-5)}\\ \mathtt{1}&.&\mathtt{{\color{ForestGreen}0}}&\mathtt{{\color{ForestGreen}1}}&\mathtt{\color{ForestGreen}1}&\,\mathtt{{\color{ForestGreen}0}}&\mathtt{\color{ForestGreen}1} \end{array} $$

Number

Summing up, the number is

$${\color{red}-}\mathtt{1.{\color{ForestGreen}{0110\,1}}}\times 2^{\color{red}{14}} \quad \longrightarrow \quad -(1+2^{-2}+2^{-3}+2^{-5})\times2^{14} = \fbox{$-$23040} $$
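The result can be cross-checked in Python. The sketch below first repeats the hand computation on the three bit fields and then, as an independent check, lets the standard `struct` module interpret the same 32 bits as a single precision number. The variable names are chosen only for this illustration.

```python
import struct

sign_bit = "1"
exponent_bits = "10001101"
fraction_bits = "01101000000000000000000"

# Hand computation: sign, unbiased exponent, mantissa with the hidden bit.
sign = -1 if sign_bit == "1" else 1
exponent = int(exponent_bits, 2) - 127
mantissa = 1 + int(fraction_bits, 2) / 2 ** 23
value = sign * mantissa * 2 ** exponent
print(exponent, mantissa, value)  # 14 1.40625 -23040.0

# Independent check: reinterpret the 32 bits as a binary32 number.
word = int(sign_bit + exponent_bits + fraction_bits, 2)
(check,) = struct.unpack(">f", word.to_bytes(4, "big"))
print(check == value)  # True
```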