Exercise

Represent 0.3 in single precision, rounding to nearest (ties to even).
Obtain the absolute and relative errors in base 10. Check that the relative error is less than the machine $\epsilon$.

Represent 0.3 in single precision, rounding to nearest (ties to even)

Decimal to binary

This number has no integer part, only a fractional part. To convert it to base $2$ we multiply by $2$, record the integer part of the result ($0$ or $1$), subtract it, and repeat with the fractional part that remains. The recorded integer parts are the binary digits, most significant first.

$$\begin{array}{lccccccc} 0.3 &\times& 2 &= &0.6 &\rightarrow & 0 & \downarrow\\ \hline 0.6&\times& 2& = &1.2 &\rightarrow & 1& \downarrow\\ 0.2&\times& 2& = &0.4&\rightarrow & 0& \downarrow\\ 0.4&\times& 2& = &0.8&\rightarrow & 0& \downarrow\\ 0.8&\times& 2& = &1.6&\rightarrow & 1& \downarrow\\ \hline 0.6&\times& 2& = &1.2 &\rightarrow & 1& \downarrow\\ 0.2&\times& 2& = &0.4&\rightarrow & 0& \downarrow\\ 0.4&\times& 2& = &0.8&\rightarrow & 0& \downarrow\\ 0.8&\times& 2& = &1.6&\rightarrow & 1& \downarrow\\ \hline \vdots & & & &\vdots & & \vdots & \end{array}$$

In this case the binary fractional part is periodic: the block $\mathtt{1001}$ repeats indefinitely. Reading the digits from top to bottom gives

$$(0.3)_{10} = (\mathtt{0.0\,1001\,1001\,1001\,1001\,1001\,1001\,1001\ldots})_2$$
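The multiply-by-2 procedure above can be sketched in Python. This is a minimal illustration, not part of the original exercise: the function name `fraction_to_binary_digits` is hypothetical, and `fractions.Fraction` is used so that $0.3$ is represented exactly rather than as an already-rounded float.

```python
from fractions import Fraction


def fraction_to_binary_digits(x, n_digits):
    """Return the first n_digits binary digits of a fraction 0 <= x < 1."""
    frac = Fraction(x)  # exact rational arithmetic, no float rounding
    digits = []
    for _ in range(n_digits):
        frac *= 2
        bit = int(frac)   # the integer part (0 or 1) is the next binary digit
        digits.append(bit)
        frac -= bit       # keep only the fractional part and repeat
    return digits


bits = fraction_to_binary_digits(Fraction(3, 10), 13)
print("0." + "".join(str(b) for b in bits))  # 0.0100110011001
```

The printed digits reproduce the start of the periodic expansion, with the block `1001` repeating after the leading `0`.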

Normalization

We move the point so that a single non-zero digit appears to its left.
Then we multiply by $2^{n}$, where $n$ is the number of positions the point has moved to the left, or by $2^{-n}$, where $n$ is the number of positions the point has moved to the right.
Finally, we add the sign.

The normalized number is

$$+\mathtt{1.001\,1001\,1001\,1001\,1001\,1001\,1001\ldots}\times2^{-2}$$

with

Sign: $+$
Mantissa: $\mathtt{1.001\,1001\,1001\,1001\,1001\,1001\,1001\ldots}$
Exponent: $-2$
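As a cross-check of the normalization step, the following sketch scales a positive number by powers of $2$ until it lies in $[1, 2)$. The helper name `normalize` is an assumption made for illustration, and exact fractions are used to avoid rounding.

```python
from fractions import Fraction


def normalize(x):
    """Return (significand, exponent) with x == significand * 2**exponent
    and 1 <= significand < 2. Only positive x is handled in this sketch."""
    x = Fraction(x)
    if x <= 0:
        raise ValueError("this sketch only handles positive numbers")
    exponent = 0
    while x >= 2:      # point moves to the left: exponent increases
        x /= 2
        exponent += 1
    while x < 1:       # point moves to the right: exponent decreases
        x *= 2
        exponent -= 1
    return x, exponent


significand, exponent = normalize(Fraction(3, 10))
print(significand, exponent)  # 6/5 -2
```

The significand $6/5 = 1.2$ is the decimal value of $\mathtt{1.0011\,0011\ldots}$ in binary, and the exponent matches the $-2$ found above.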

Sign

The number is positive, so the sign bit is $\mathtt{0}$.

Exponent

We have $m=8$ bits for the exponent. Therefore there are $2^m=2^8=256$ different combinations and, in principle, we can represent $256$ numbers. Starting at $0$, the stored values run up to $255$. The first one, $\mathtt{0000\,0000}$, and the last one, $\mathtt{1111\,1111}$, are reserved. Since the representation is biased, we subtract the bias $$bias=2^{m-1}-1 = 128-1 = 127$$ from the stored value to get the represented value.

The exponent value is $-2$. To obtain the face (stored) value we add the bias: $-2 + 127 = 125$. We convert $125$ to binary by dividing repeatedly by $2$ and reading the remainders from bottom to top

$$ \begin{array}{ccccc} \hline \mathrm{Dividend} & \mathrm{Divisor} & \mathrm{Quotient} & \mathrm{Remainder} & \\ \hline 125 & 2 & 62 & 1 & \uparrow \\ 62 & 2 & 31 & 0 & \uparrow \\ 31 & 2 & 15 & 1 & \uparrow \\ 15 & 2 & 7 & 1 & \uparrow \\ 7 & 2 & 3 & 1 & \uparrow \\ 3 & 2 & 1 & 1 & \uparrow \\ 1 & 2 & 0 & 1 & \uparrow \\ \hline \end{array} $$

That is

$$(125)_{10} = (\mathtt{111\,1101})_2$$
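The repeated-division table can be reproduced with a few lines of Python. This is only an illustrative sketch; the function name `to_binary` is hypothetical.

```python
def to_binary(n):
    """Convert a non-negative integer to a binary string by repeated division by 2."""
    if n == 0:
        return "0"
    remainders = []
    while n > 0:
        n, r = divmod(n, 2)       # quotient feeds the next step, remainder is a digit
        remainders.append(str(r))
    # The remainders come out least significant first, so read them in reverse
    return "".join(reversed(remainders))


print(to_binary(125))  # 1111101
```

Padding the result on the left with a zero gives the 8-bit exponent field $\mathtt{0111\,1101}$.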

$$ \begin{array}{cccc} \hline \mathrm{Binary}\; \mathrm{number} & \mathrm{Face}\; \mathrm{value} & & \mathrm{Represented}\; \mathrm{value}\\ \hline \mathtt{0000\,0000}& 0& & R\\ \mathtt{0000\,0001}& 1& & -126\\ \mathtt{0000\,0010}& 2& & -125\\ \mathtt{0000\,0011}& 3& & -124\\ \cdots & \cdots & -127 & \cdots\\ \cdots & \cdots & \longrightarrow & \cdots\\ \mathtt{{\color{red}{0111\,1101}}}& {\color{red}{125}}& & {\color{red}{-2}}\\ \cdots & \cdots & +127 & \cdots\\ \cdots & \cdots & \longleftarrow & \cdots\\ \mathtt{1111\,1100}& 252 & & 125 \\ \mathtt{1111\,1101}& 253 & & 126 \\ \mathtt{1111\,1110}& 254 & & 127 \\ \mathtt{1111\,1111}& 255 & & R\\ \hline \end{array} $$
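The mapping between represented and stored exponents in the table can be sketched as follows. The names `EXPONENT_BITS`, `BIAS` and `encode_exponent` are assumptions introduced for this example.

```python
EXPONENT_BITS = 8
BIAS = 2 ** (EXPONENT_BITS - 1) - 1  # 127 for single precision


def encode_exponent(e):
    """Return the 8-bit exponent field for a normalized number with exponent e."""
    stored = e + BIAS
    # Stored values 0 and 255 are reserved, so normalized numbers use 1..254
    if not 1 <= stored <= 2 ** EXPONENT_BITS - 2:
        raise ValueError("exponent out of range for normalized numbers")
    return format(stored, f"0{EXPONENT_BITS}b")


print(encode_exponent(-2))  # 01111101
```

The valid range of represented exponents for normalized numbers is therefore $-126$ to $127$, as in the table.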

Mantissa

$$(0.3)_{10}$$

The mantissa is $$\mathtt{1.{\color{ForestGreen}{001\,1001\,1001\,1001\,1001\,1001}\,{\color{red}{1001}\ldots}}}.$$ We have to take into account the hidden bit, which we do not store.

We do not store the bits $\mathtt{{\color{red}{1001}\ldots}}$ but we must take them into account to round the number.

We round to the nearest representable number; a tie would be resolved towards the candidate whose last stored bit is even ($\mathtt{0}$). The two candidates are:

- Truncation: we drop the trailing digits and keep 24 digits (the hidden bit plus 23 fraction bits) $$\mathtt{1.{001\,\color{ForestGreen}{1001\,1001\,1001\,1001}\,1001}}$$
- Rounding up: we take the next number in this precision, which is $$ \begin{array}{rl} & \mathtt{1.{001\,\color{ForestGreen}{1001\,1001\,1001\,1001}\,1001}} \\ + & \mathtt{0.{000\,\color{ForestGreen}{0000\,0000\,0000\,0000}\,0001}} \\ \hline & \mathtt{1.001\,{\color{ForestGreen}{1001\,1001\,1001\,1001}\,1010}} \end{array} $$

The midpoint between these two candidates is $$\mathtt{{\color{red}{1.001}}{\color{ForestGreen}{\,1001\,1001\,1001\,1001}\,{\color{red}{1001\,1}}}}$$ The number we want to round continues with $\mathtt{1001\ldots}$ after the 23rd fraction bit, whereas the midpoint continues with $\mathtt{1000\ldots}$, so the number is greater than the midpoint and rounds up to $$\mathtt{1.001\,{\color{ForestGreen}{1001\,1001\,1001\,1001}\,1010}}$$

We store 23 digits. The $1$ to the left of the point is not stored. It is the hidden bit.
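The rounding decision can be checked with exact arithmetic. The sketch below assumes the significand is already normalized to $[1, 2)$ (here $6/5$, the exact value of $0.3 \times 2^{2}$); the name `round_significand` is hypothetical. It does not handle the case where rounding up carries the significand to $2$, which would need a renormalization step.

```python
from fractions import Fraction

FRACTION_BITS = 23  # stored mantissa bits in single precision


def round_significand(significand):
    """Round a significand in [1, 2) to 23 fraction bits, nearest with ties to even.

    Returns the 24-bit integer (hidden bit included) equal to the rounded
    significand times 2**23.
    """
    scaled = Fraction(significand) * 2 ** FRACTION_BITS
    truncated = scaled.numerator // scaled.denominator  # candidate 1: truncation
    discarded = scaled - truncated                       # what truncation throws away
    half = Fraction(1, 2)                                # midpoint between candidates
    if discarded > half or (discarded == half and truncated % 2 == 1):
        truncated += 1                                   # candidate 2: round up
    return truncated


rounded = round_significand(Fraction(6, 5))
print(format(rounded, "024b")[1:])  # 00110011001100110011010
```

Dropping the leading character of the 24-bit string removes the hidden bit and leaves the 23 stored bits.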

Number

The number $0.3$ is stored in single precision as

$$ \begin{array}{|c|c|c|} \hline \mathtt{sign}&\mathtt{exponent}&\mathtt{mantissa}\\ \hline \mathtt{0}&\mathtt{0111\,1101}&\mathtt{001\,1001\,1001\,1001\,1001\,1010}\\ \hline \end{array} $$
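The three fields can be verified on a machine with Python's standard `struct` module, which packs the `f` format as IEEE 754 binary32. The helper name `float32_bits` is an assumption for this sketch.

```python
import struct


def float32_bits(x):
    """Return the (sign, exponent, mantissa) bit strings of x stored as binary32."""
    # Pack as a big-endian 32-bit float, then reinterpret the 4 bytes as an integer
    (as_int,) = struct.unpack(">I", struct.pack(">f", x))
    bits = format(as_int, "032b")
    return bits[0], bits[1:9], bits[9:]


sign, exponent, mantissa = float32_bits(0.3)
print(sign, exponent, mantissa)  # 0 01111101 00110011001100110011010
```

The output agrees with the table: sign $\mathtt{0}$, exponent $\mathtt{0111\,1101}$ and the 23 rounded mantissa bits.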

Absolute and relative error in base 10

The stored number is

$$\mathtt{1.001\,1001\,1001\,1001\,1001\,1010}\times 2^{-2}$$

which in base 10 is

$$\left(1+2^{-3}+2^{-4}+2^{-7}+2^{-8}+2^{-11}+2^{-12}+2^{-15}+2^{-16}+2^{-19}+2^{-20}+2^{-22}\right)\times 2^{-2}$$

that is

$$x^* = 0.30000001192092896$$

But we were trying to store $x = 0.3$.

Absolute error: $$e_a=|x-x^*|=0.00000001192092896\approx 1.2\times10^{-8}$$
Relative error: $$e_r=\frac{e_a}{|x|}\approx 4 \times10^{-8}$$

The relative error should be smaller than the machine $\epsilon$. In single precision, the machine $\epsilon$ is

$$\epsilon = 2^{-23}\approx 1.2 \times 10^{-7}$$

so

$$e_r \approx 4\times 10^{-8}\lt 12 \times 10^{-8}= 1.2 \times 10^{-7} \approx \epsilon$$
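The errors can also be computed exactly as a sanity check. This sketch assumes the stored value is obtained by round-tripping $0.3$ through the binary32 format with `struct`; the variable names are illustrative, and `Fraction` keeps every quantity exact.

```python
import struct
from fractions import Fraction

x = Fraction(3, 10)  # the number we wanted to store

# Round-trip through binary32 to get the value that is actually stored
(stored,) = struct.unpack(">f", struct.pack(">f", 0.3))
x_star = Fraction(stored)  # exact rational value of the stored float

abs_err = abs(x - x_star)
rel_err = abs_err / abs(x)
eps = Fraction(1, 2 ** 23)  # machine epsilon in single precision

print(float(abs_err))   # roughly 1.2e-08
print(float(rel_err))   # roughly 4e-08
print(rel_err < eps)    # True
```

In exact terms the absolute error is $2^{-24}/5$ and the relative error is $2^{-23}/3$, that is, one third of $\epsilon$.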
