Exercise

  1. Represent 0.3 in single precision, rounding to the nearest (ties to even).
  2. Obtain the relative and absolute error in base 10. Check that the relative error is less than the machine $\epsilon.$

Represent 0.3 in single precision, rounding to the nearest (ties to even).

Decimal to binary

This number has no integer part, only a fractional part. To convert it to base $2$ we multiply by $2$, record the integer part of the result as the next binary digit, and subtract it. Then we repeat the process with the remaining fractional part.

$$\begin{array}{lccccccc} 0.3 &\times& 2 &= &0.6 &\rightarrow & 0 & \downarrow\\ \hline 0.6&\times& 2& = &1.2 &\rightarrow & 1& \downarrow\\ 0.2&\times& 2& = &0.4&\rightarrow & 0& \downarrow\\ 0.4&\times& 2& = &0.8&\rightarrow & 0& \downarrow\\ 0.8&\times& 2& = &1.6&\rightarrow & 1& \downarrow\\ \hline 0.6&\times& 2& = &1.2 &\rightarrow & 1& \downarrow\\ 0.2&\times& 2& = &0.4&\rightarrow & 0& \downarrow\\ 0.4&\times& 2& = &0.8&\rightarrow & 0& \downarrow\\ 0.8&\times& 2& = &1.6&\rightarrow & 1& \downarrow\\ \hline \vdots & & & &\vdots & & \vdots & \end{array}$$

In this case, the result is a binary number with a periodic fractional part: the block $\mathtt{1001}$ repeats indefinitely. Reading the digits from top to bottom, we get

$$(0.3)_{10} = (\mathtt{0.0\,1001\,1001\,1001\,1001\,1001\,1001\,1001\ldots})_2$$
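The repeated-doubling procedure above can be sketched in a few lines of Python. This is only an illustration: the function name `fraction_to_binary_digits` is made up for this example, and `Fraction` is used so that $0.3$ is treated as the exact value $3/10$ rather than as a floating-point approximation.

```python
from fractions import Fraction

def fraction_to_binary_digits(x, n_digits):
    """Return the first n_digits binary digits after the point, for 0 <= x < 1."""
    digits = []
    for _ in range(n_digits):
        x *= 2
        bit = int(x)        # the integer part is the next binary digit
        digits.append(bit)
        x -= bit            # continue with the fractional part only
    return digits

digits = fraction_to_binary_digits(Fraction(3, 10), 13)
print("0." + "".join(map(str, digits)))  # 0.0100110011001
```

The printed digits show the leading $\mathtt{0}$ followed by the repeating block $\mathtt{1001}$.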

Normalization

  1. We move the point so that a single non-zero digit appears to its left.
  2. Then, since we are working in base $2$, we multiply by $2^n$, where $n$ is the number of positions that we have moved the point to the left, or by $2^{-n}$, where $n$ is the number of positions that we have moved the point to the right.
  3. We add the sign.

The normalized number is

$$+\mathtt{1.001\,1001\,1001\,1001\,1001\,1001\,1001\ldots}\times2^{-2}$$

with

  • Sign: $+$
  • Mantissa: $\mathtt{1.001\,1001\,1001\,1001\,1001\,1001\,1001\ldots}$
  • Exponent: $-2$
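As a quick check of the normalization, the following sketch finds the exponent by repeatedly halving or doubling until the value lies in $[1, 2)$. The helper name `normalize` is hypothetical and the code assumes a positive input.

```python
from fractions import Fraction

def normalize(x):
    """Return (mantissa, exponent) with x == mantissa * 2**exponent
    and 1 <= mantissa < 2. Assumes x > 0."""
    exponent = 0
    while x >= 2:       # point moves to the left
        x /= 2
        exponent += 1
    while x < 1:        # point moves to the right
        x *= 2
        exponent -= 1
    return x, exponent

mantissa, exponent = normalize(Fraction(3, 10))
print(mantissa, exponent)  # 6/5 -2
```

Here $6/5 = 1.2$ is the decimal value of the binary mantissa $\mathtt{1.0011\,0011\ldots}$, and the exponent $-2$ agrees with the result above.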

Sign

As it is a positive number $\longrightarrow$ the bit for the sign is $\mathtt{0}$

Exponent

We have $m=8$ bits for the exponent. Therefore there are $2^m=2^8=256$ different bit patterns, with face values from $0$ to $255$. The first pattern, $\mathtt{0000\,0000}$, and the last one, $\mathtt{1111\,1111}$, are reserved. Since the representation is biased, we subtract the bias $$bias=2^{m-1}-1 = 128-1 = 127$$ from the face value to get the represented value.

The exponent value is $-2$. To obtain the face value we add the bias, $-2 + 127 = 125$, and convert the result to binary by repeated division by $2$:

$$ \begin{array}{ccccc} \hline \mathrm{Dividend} & \mathrm{Divisor} & \mathrm{Quotient} & \mathrm{Remainder} & \\ \hline 125 & 2 & 62 & 1 & \uparrow \\ 62 & 2 & 31 & 0 & \uparrow \\ 31 & 2 & 15 & 1 & \uparrow \\ 15 & 2 & 7 & 1 & \uparrow \\ 7 & 2 & 3 & 1 & \uparrow \\ 3 & 2 & 1 & 1 & \uparrow \\ 1 & 2 & 0 & 1 & \uparrow \\ \hline \end{array} $$

That is, reading the remainders from bottom to top,

$$(125)_{10} = (\mathtt{111\,1101})_2$$
$$ \begin{array}{cccc} \hline \mathrm{Binary}\; \mathrm{number} & \mathrm{Face}\; \mathrm{value} & & \mathrm{Represented}\; \mathrm{value}\\ \hline \mathtt{0000\,0000}& 0& & R\\ \mathtt{0000\,0001}& 1& & -126\\ \mathtt{0000\,0010}& 2& & -125\\ \mathtt{0000\,0011}& 3& & -124\\ \cdots & \cdots & -127 & \cdots\\ \cdots & \cdots & \longrightarrow & \cdots\\ \mathtt{{\color{red}{0111\,1101}}}& {\color{red}{125}}& & {\color{red}{-2}}\\ \cdots & \cdots & +127 & \cdots\\ \cdots & \cdots & \longleftarrow & \cdots\\ \mathtt{1111\,1100}& 252 & & 125 \\ \mathtt{1111\,1101}& 253 & & 126 \\ \mathtt{1111\,1110}& 254 & & 127 \\ \mathtt{1111\,1111}& 255 & & R\\ \hline \end{array} $$
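The biasing step can be illustrated with a small sketch. The names `EXPONENT_BITS`, `BIAS` and `encode_exponent` are chosen for this example only; the range check reflects the two reserved patterns mentioned above.

```python
EXPONENT_BITS = 8
BIAS = 2 ** (EXPONENT_BITS - 1) - 1        # 127 in single precision

def encode_exponent(e):
    """Return the 8-bit pattern (face value) that stores the exponent e."""
    face_value = e + BIAS
    # Face values 0 and 255 are reserved, so only 1..254 are valid here.
    if not 1 <= face_value <= 2 ** EXPONENT_BITS - 2:
        raise ValueError("exponent outside the normalized range")
    return format(face_value, "08b")

print(encode_exponent(-2))  # 01111101
```

Note that the result is padded on the left with a zero to fill the $8$ bits of the exponent field.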

Mantissa

$$(0.3)_{10}$$

The mantissa is $$\mathtt{1.{\color{ForestGreen}{001\,1001\,1001\,1001\,1001\,1001}\,{\color{red}{1001}\ldots}}}.$$ We have to take into account the hidden bit, which we do not store.

We do not store the bits $\mathtt{{\color{red}{1001}\ldots}}$ but we must take them into account to round the number.

We round to the nearest representable number; in case of a tie, we choose the candidate whose last stored bit is even ($\mathtt{0}$).

  • Truncation: We drop the trailing digits to leave 24 digits (the hidden bit plus the 23 stored bits) $$\mathtt{1.{001\,\color{ForestGreen}{1001\,1001\,1001\,1001}\,1001}}$$
  • Rounding: We take the next number in this precision, obtained by adding one unit in the last place, which is $$ \begin{array}{rl} & \mathtt{1.{001\,\color{ForestGreen}{1001\,1001\,1001\,1001}\,1001}} \\ + & \mathtt{0.{000\,\color{ForestGreen}{0000\,0000\,0000\,0000}\,0001}} \\ \hline & \mathtt{1.001\,{\color{ForestGreen}{1001\,1001\,1001\,1001}\,1010}} \end{array} $$ The midpoint between these two numbers is $$\mathtt{{\color{red}{1.001}}{\color{ForestGreen}{\,1001\,1001\,1001\,1001}\,{\color{red}{1001\,1}}}}$$ Since the number we want to round is greater than this midpoint, it rounds up to $$\mathtt{1.001\,{\color{ForestGreen}{1001\,1001\,1001\,1001}\,1010}}$$

We store 23 digits. The $1$ to the left of the point is not stored. It is the hidden bit.
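The rounding decision can also be checked numerically. The sketch below scales the exact mantissa $6/5$ by $2^{23}$, truncates, and compares the discarded part with one half; `round_mantissa` is a hypothetical helper written for this exercise, not a library function.

```python
from fractions import Fraction

def round_mantissa(mantissa, fraction_bits=23):
    """Round 1 <= mantissa < 2 to fraction_bits binary digits, ties to even.

    Returns the mantissa as an integer number of units of 2**-fraction_bits.
    If the result reaches 2**(fraction_bits + 1), the number would have to be
    renormalized; that case is not handled in this sketch.
    """
    scaled = mantissa * 2 ** fraction_bits
    truncated = scaled.numerator // scaled.denominator   # truncation
    discarded = scaled - truncated                       # part we cannot store
    half = Fraction(1, 2)
    if discarded > half or (discarded == half and truncated % 2 == 1):
        truncated += 1                                   # round up
    return truncated

rounded = round_mantissa(Fraction(6, 5))
print(format(rounded, "024b"))  # 100110011001100110011010
```

The first digit of the output is the hidden bit; the remaining 23 digits are the stored mantissa.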

Number

The number is $0.3$ and in single precision is stored as

$$ \begin{array}{|c|c|c|} \hline \mathtt{sign}&\mathtt{exponent}&\mathtt{mantissa}\\ \hline \mathtt{0}&\mathtt{0111\,1101}&\mathtt{001\,1001\,1001\,1001\,1001\,1010}\\ \hline \end{array} $$
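One way to verify the three fields is to let Python pack $0.3$ as a single-precision number and inspect the bits. This sketch relies on the standard `struct` module and assumes IEEE 754 floats on the platform. Note that the Python literal `0.3` is already a double-precision approximation, which in this case rounds to the same single-precision value.

```python
import struct

# Pack as big-endian binary32, then reinterpret the 4 bytes as an unsigned integer.
bits = struct.unpack(">I", struct.pack(">f", 0.3))[0]
pattern = format(bits, "032b")

sign, exponent, mantissa = pattern[0], pattern[1:9], pattern[9:]
print(sign, exponent, mantissa)  # 0 01111101 00110011001100110011010
```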

Absolute and relative error in base 10

The stored number is

$$\mathtt{1.001\,1001\,1001\,1001\,1001\,1010}\times 2^{-2}$$

which in base 10 is

$$\left(1+2^{-3}+2^{-4}+2^{-7}+2^{-8}+2^{-11}+2^{-12}+2^{-15}+2^{-16}+2^{-19}+2^{-20}+2^{-22}\right)\times 2^{-2}$$

that is

$$x^* = 0.30000001192092896$$

But we were trying to store $x = 0.3$

  • Absolute error $$e_a=|x-x^*|=0.00000001192092896\approx 1.2\times10^{-8}$$
  • Relative error $$e_r=\frac{e_a}{|x|}\approx 4 \times10^{-8}$$

The relative error should be smaller than the machine $\epsilon.$ In single precision, the machine $\epsilon$ is

$$\epsilon = 2^{-23}\approx 1.2 \times 10^{-7}$$

then

$$e_r \approx 4\times 10^{-8}\lt 12 \times 10^{-8}= 1.2 \times 10^{-7} \approx \epsilon$$
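The errors can be reproduced with exact rational arithmetic. This is a sketch under the same assumptions as before: `struct` is used to obtain the single-precision value, and the variable names `x_star`, `abs_err`, `rel_err` and `eps` are illustrative.

```python
import struct
from fractions import Fraction

x = Fraction(3, 10)                                   # the exact value 0.3
# Round-trip through binary32 to get the stored value as an exact fraction.
x_star = Fraction(struct.unpack(">f", struct.pack(">f", 0.3))[0])

abs_err = abs(x - x_star)
rel_err = abs_err / abs(x)
eps = Fraction(1, 2 ** 23)                            # machine epsilon, single precision

print(float(x_star))
print(float(abs_err), float(rel_err), float(eps))
print(rel_err < eps)
```

With exact fractions the relative error turns out to be exactly $\epsilon/3$, consistent with $4\times10^{-8} \lt 1.2\times 10^{-7}$.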