Exercise¶

A computer stores floating point numbers with 10 bits. The first bit is for the sign. The next four bits are for the biased exponent, and the last five bits are for the mantissa. Using a standard similar to IEEE 754:

\begin{array}{|c|c|c|} \hline \mathtt{sign} & \mathtt{exponent} & \mathtt{mantissa}\\ \hline \mathtt{s} & \mathtt{e_{1}e_{2}e_{3}e_{4}} & \mathtt{m_{1}m_{2}m_{3}m_{4}m_{5}} \\\hline \end{array}

Convert the number $\left(\mathtt{1011011010}\right)_{2}$ written with this standard, to base 10. $$\begin{array}{|c|c|c|} \hline \mathtt{sign} & \mathtt{exponent} & \mathtt{mantissa}\\ \hline \mathtt{1} & \mathtt{0110} & \mathtt{11010} \\\hline \end{array}$$
What is the machine $\epsilon$ in base 10?
Calculate the maximum integer that can be represented exactly such that the next integer cannot be represented exactly in this standard.
What are the smallest and the largest normal positive numbers? Give their binary representation. Give their precision. What would be the gap with the next and previous, respectively, representable numbers?
How many normal positive numbers can we represent exactly with this standard?
What are the smallest and the largest subnormal positive numbers? Give their binary representation. What is their precision? What would be the gap with the next and previous, respectively, representable numbers?
How many subnormal positive numbers can we represent with this standard?
How do we represent $0$, $+\infty$, $-\infty$?
Give an example for $\mathtt{NaN}$ representation.
Represent the number $-1.5625$ in this standard.

Convert the number 1011011010 to base 10.¶

\begin{array}{|c|c|c|} \hline \mathtt{sign} & \mathtt{exponent} & \mathtt{mantissa}\\ \hline \mathtt{1} & \mathtt{0110} & \mathtt{11010} \\\hline \end{array}

Sign¶

As we have $\mathtt{{\color{red}1}}$ $\longrightarrow$ it is a negative number.

Exponent¶

We have $m=4$ bits for the exponent. Therefore there are $2^m = 2^4 = 16$ different combinations and we can represent $16$ numbers. As we start at $0$ it will end at $15$. The first number, $\mathtt{0000}$, and the last one, $\mathtt{1111}$ are reserved. And since the representation is biased, we subtract the bias $$bias=2^{m-1}-1=2^{4-1}-1 =2^3-1= 8-1 = 7$$ to get the represented value

\begin{array}{cccc} \hline \mathrm{Binary}\; \mathrm{number} & \mathrm{Face}\; \mathrm{value} & & \mathrm{Represented}\; \mathrm{value}\\ \hline \mathtt{0000}& 0& & R\\ \mathtt{0001}& 1& & -6\\ \mathtt{0010}& 2& & -5\\ \vdots & \vdots& bias & \vdots\\ \mathtt{{\color{red}{0110}}}& {\color{red}6}& -7 & {\color{red}{-1}}\\ \vdots & \vdots& \longrightarrow & \vdots\\ \mathtt{1101}& 13& & 6\\ \mathtt{1110}& 14& & 7\\ \mathtt{1111}& 15& & R\\ \hline \end{array}

The exponent face value is $\mathtt{{\color{red}{0110}}}$. Taking into account the position of the digits

$$ \begin{array}{cccc} \tiny{(3)}&\tiny{(2)}&\tiny{(1)}&\tiny{(0)}\\ \mathtt{0}&\mathtt{1}&\mathtt{1}&\mathtt{0} \end{array} $$

the face value is

$$2^2+2^1=4+2=6$$

As $bias = 7$, the exponent represented value is $6-7=-1$.
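The bias step above can be sketched in a few lines of Python. This is only an illustration under the assumptions of this exercise; the function name `exponent_value` is hypothetical and not part of any library.

```python
def exponent_value(field: str) -> int:
    """Return the represented exponent for a biased exponent field such as '0110'."""
    m = len(field)                    # number of exponent bits (4 in this exercise)
    bias = 2 ** (m - 1) - 1           # 7 when m = 4
    face = int(field, 2)              # face value of the bit pattern
    if face in (0, 2 ** m - 1):       # all zeros and all ones are reserved
        raise ValueError("reserved exponent field")
    return face - bias

print(exponent_value("0110"))  # -1
```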

Mantissa¶

The mantissa stores the digits $\mathtt{{\color{ForestGreen}{11010}}}$. If we take into account the hidden bit, the mantissa is

$$\mathtt{1.{\color{ForestGreen}{11010}}}$$

The positions of the digits are

$$ \begin{array}{ccccccc} \tiny{(0)}&&\tiny{(-1)}&\tiny{(-2)}&\tiny{(-3)}&\tiny{(-4)}&\tiny{(-5)}\\ \mathtt{1}&.&\mathtt{{\color{ForestGreen}1}}&\mathtt{{\color{ForestGreen}1}}&\mathtt{\color{ForestGreen}0}&\,\mathtt{{\color{ForestGreen}1}}&\mathtt{\color{ForestGreen}0} \end{array} $$

Number¶

Combining the sign, the exponent and the mantissa, the number is

$${\color{red}-}\mathtt{1.{\color{ForestGreen}{11010}}}\times 2^{\color{red}{-1}} \quad \longrightarrow \quad -(1+2^{-1}+2^{-2}+2^{-4})\times2^{-1} = \fbox{$-$0.90625} $$
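The whole conversion can be checked with a short Python sketch. The function `decode` and its parameter names are assumptions made for illustration; it follows the rules of this exercise (hidden bit, bias $7$, reserved exponents $\mathtt{0000}$ and $\mathtt{1111}$).

```python
def decode(bits: str, exp_bits: int = 4, man_bits: int = 5) -> float:
    """Decode a sign/exponent/mantissa bit string into a Python float."""
    assert len(bits) == 1 + exp_bits + man_bits
    sign = -1.0 if bits[0] == "1" else 1.0
    face = int(bits[1:1 + exp_bits], 2)       # exponent face value
    frac = int(bits[1 + exp_bits:], 2)        # stored mantissa digits as an integer
    bias = 2 ** (exp_bits - 1) - 1            # 7 for 4 exponent bits
    if face == 2 ** exp_bits - 1:             # 1111: infinity or NaN
        return sign * float("inf") if frac == 0 else float("nan")
    if face == 0:                             # 0000: zero or denormal, hidden bit is 0
        return sign * frac / 2 ** man_bits * 2.0 ** (1 - bias)
    return sign * (1 + frac / 2 ** man_bits) * 2.0 ** (face - bias)

print(decode("1011011010"))  # -0.90625
```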

Machine epsilon¶

Let's remember that the format is

\begin{array}{|c|c|c|} \hline \mathtt{sign} & \mathtt{exponent} & \mathtt{mantissa}\\ \hline \mathtt{s} & \mathtt{e_{1}e_{2}e_{3}e_{4}} & \mathtt{1.m_{1}m_{2}m_{3}m_{4}m_{5}} \\\hline \end{array}

The machine $\epsilon$ is defined as the distance between the number $1$ and the next number that can be exactly represented in this standard.

In this standard, $1$ and the following number are

\begin{array}{l} \mathtt{1.00000}\times2^{0} \\ \mathtt{1.00001}\times2^{0} \end{array}

We calculate the distance by subtracting these two values

\begin{array}{rll} & \mathtt{1.00001}\times2^{0} &\\ -&&\\[-40pt] & \mathtt{1.00000}\times2^{0} & \\ \hline \epsilon\rightarrow & 0.00001\times2^{0}\rightarrow & \fbox{$2^{-5}\times 2^0 =2^{-5}=0.03125$} \end{array}
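As a quick check of this subtraction, the following Python sketch uses exact fractions. The names `MAN_BITS`, `next_after_one` and `eps` are chosen here for illustration only.

```python
from fractions import Fraction

MAN_BITS = 5                                                 # stored mantissa bits
one = Fraction(1)                                            # 1.00000 x 2^0
next_after_one = Fraction(2 ** MAN_BITS + 1, 2 ** MAN_BITS)  # 1.00001 x 2^0
eps = next_after_one - one

print(eps, float(eps))  # 1/32 0.03125
```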

Maximum integer¶

We want to calculate the maximum integer that can be represented exactly so that the next integer cannot be represented exactly.

Remember that our system is

\begin{array}{|c|c|c|} \hline \mathtt{sign} & \mathtt{exponent} & \mathtt{mantissa}\\ \hline \mathtt{s} & \mathtt{e_{1}e_{2}e_{3}e_{4}} & \mathtt{1.m_{1}m_{2}m_{3}m_{4}m_{5}} \\\hline \end{array}

In this format, we have $5$ bits plus the hidden bit to store digits. Therefore, the largest integer whose digits we can all store is $63$:

$$\mathtt{1\,11111} \quad\rightarrow\quad \mathtt{1.\,11111}\times 2^{5} $$

The next number is

$$\mathtt{10\overbrace{0000{\color{red}0}}^{5\,bits}}= \mathtt{1\overbrace{00000}^{5\,bits}{\color{red}0}}\quad\rightarrow\quad \mathtt{1.\,\overbrace{00000}^{5\,bits}}\times 2^{6}$$

and we must drop the last $\mathtt{{\color{red}0}}.$ Because it is a zero, there is no error. This number, in decimal, is

$$\fbox{$2^{6}=64$}$$

The following integer number is

$$\mathtt{10\overbrace{0000{\color{red}1}}^{5\,bits}}= \mathtt{1\overbrace{00000}^{5\,bits}{\color{red}1}}\quad\rightarrow\quad \mathtt{1.\,\overbrace{00000}^{5\,bits}}\times 2^{6}$$

We cannot store the last $\mathtt{{\color{red}1}}$, so this number, $65$, has no exact representation and must be rounded to the previous representable number, $64$, or to the next one, $66$ (in this case, to the previous one).
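The result can be confirmed by brute force. The sketch below enumerates every positive normal number of this format with exact fractions and walks up the integers until the next one is missing; the helper `positive_normals` is a hypothetical name introduced for this check.

```python
from fractions import Fraction

EXP_BITS, MAN_BITS = 4, 5
BIAS = 2 ** (EXP_BITS - 1) - 1

def positive_normals():
    """Return the set of all positive normal values of the 10-bit format."""
    values = set()
    for face in range(1, 2 ** EXP_BITS - 1):          # skip reserved 0000 and 1111
        for frac in range(2 ** MAN_BITS):
            mantissa = Fraction(2 ** MAN_BITS + frac, 2 ** MAN_BITS)  # 1.m1...m5
            values.add(mantissa * Fraction(2) ** (face - BIAS))
    return values

representable = positive_normals()
n = 1
while n + 1 in representable:   # stop at the first integer whose successor is missing
    n += 1
print(n)  # 64
```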

Normal numbers¶

\begin{array}{|c|c|c|} \hline \mathtt{sign} & \mathtt{exponent} & \mathtt{mantissa}\\ \hline \mathtt{s} & \mathtt{e_{1}e_{2}e_{3}e_{4}} & \mathtt{1.m_{1}m_{2}m_{3}m_{4}m_{5}} \\\hline \end{array}

In (a) we saw that the minimum exponent is $\begin{array}{|c|}\hline e_{min}=-6\\ \hline\end{array}$ and the maximum $\begin{array}{|c|}\hline e_{max}=7\\ \hline\end{array}.$

The smallest positive normal number¶

The mantissa and exponent will be the minimum. In this format

\begin{array}{|c|c|c|} \hline \mathtt{sign} & \mathtt{exponent} & \mathtt{mantissa}\\ \hline \mathtt{0} & \mathtt{0001} & \mathtt{00000} \\\hline \end{array}

We cannot have $\mathtt{0000}$ as exponent because it is reserved. This number is

$$\mathtt{1.00000}\times 2^{-6}\quad\longrightarrow\quad 2^{-6}=\fbox{0.015625}$$

Its precision is $\fbox{$p=6$}$ because it has 6 digits, 5 stored plus the hidden bit.

The next number we can represent exactly is

$$\mathtt{1.00001}\times 2^{-6}$$

We obtain the gap between these two numbers by subtracting them:

$$\mathtt{0.00001}\times 2^{-6}\quad\longrightarrow\quad 2^{-5}\times 2^{-6}=2^{-11}= \fbox{0.00048828125}$$

The largest positive normal number¶

The mantissa and exponent will be the maximum. In this format

\begin{array}{|c|c|c|} \hline \mathtt{sign} & \mathtt{exponent} & \mathtt{mantissa}\\ \hline \mathtt{0} & \mathtt{1110} & \mathtt{11111} \\\hline \end{array}

We cannot have $\mathtt{1111}$ as exponent because it is reserved. This number is

$$\mathtt{1.11111}\times 2^{7}\quad\longrightarrow\quad (1+2^{-1}+2^{-2}+2^{-3}+2^{-4}+2^{-5})\times 2^{7}=\fbox{252}$$

Its precision is $\fbox{$p=6$}$ because it has 6 digits, 5 stored plus the hidden bit.

The previous number we can represent exactly is

$$\mathtt{1.11110}\times 2^{7}$$

We obtain the gap between these two numbers by subtracting them:

$$\mathtt{0.00001}\times 2^{7}\quad\longrightarrow\quad 2^{-5}\times 2^{7}=2^{2}= \fbox{4}$$
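The extreme normal values and their gaps follow directly from the exponent range and the number of mantissa bits. A minimal Python sketch, with variable names chosen here for illustration:

```python
EXP_BITS, MAN_BITS = 4, 5
BIAS = 2 ** (EXP_BITS - 1) - 1
E_MIN = 1 - BIAS                         # face value 0001 -> -6
E_MAX = (2 ** EXP_BITS - 2) - BIAS       # face value 1110 -> 7

smallest_normal = 2.0 ** E_MIN                            # 1.00000 x 2^-6
largest_normal = (2 - 2.0 ** -MAN_BITS) * 2.0 ** E_MAX    # 1.11111 x 2^7
gap_at_smallest = 2.0 ** (E_MIN - MAN_BITS)               # 0.00001 x 2^-6
gap_at_largest = 2.0 ** (E_MAX - MAN_BITS)                # 0.00001 x 2^7

print(smallest_normal, largest_normal)   # 0.015625 252.0
print(gap_at_smallest, gap_at_largest)   # 0.00048828125 4.0
```

The gap grows with the exponent: it is the value of the last mantissa bit, $2^{-5}$, scaled by $2^{e}$.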

How many positive normal numbers are there?¶

For $m = 4$ bits for the exponent, we have $2^m$ different numbers. But as the first, $\mathtt{0000}$, and the last, $\mathtt{1111}$, are reserved, we have

$$2^m-2=2^4-2=16-2=14\;\mathrm{exponents}$$

For $n = 5$ bits for the mantissa, we have $2^n$ different numbers. As the hidden bit is always 1, it does not count and we have

$$2^n=2^5=32\;\mathrm{mantissas}$$

And the total number of positive normal numbers is

$$14\,\mathrm{exponents}\times32\,\mathrm{mantissas} = \fbox{448 positive normal numbers}$$
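The count can also be obtained by enumerating every exponent and mantissa pattern and keeping those with a non-reserved exponent. This is only a sketch; the name `normal_count` is hypothetical.

```python
from itertools import product

EXP_BITS, MAN_BITS = 4, 5

# Count (exponent, mantissa) pairs whose exponent field is neither 0000 nor 1111.
normal_count = sum(
    1
    for face, frac in product(range(2 ** EXP_BITS), range(2 ** MAN_BITS))
    if 1 <= face <= 2 ** EXP_BITS - 2
)
print(normal_count)  # 448
```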

Denormal numbers¶

\begin{array}{|c|c|c|} \hline \mathtt{sign} & \mathtt{exponent} & \mathtt{mantissa}\\ \hline \mathtt{s} & \mathtt{0000} & \mathtt{0.m_{1}m_{2}m_{3}m_{4}m_{5}} \\\hline \end{array}

For the denormal numbers:

All the digits in the exponent are zero.
The represented exponent is the minimum of the system $e_{min}=-6.$
The hidden bit is zero.

The smallest positive denormal number¶

All the digits in the exponent are zero and the mantissa is the minimum (but not all zeros, because that is a special value, the zero).

\begin{array}{|c|c|c|} \hline \mathtt{sign} & \mathtt{exponent} & \mathtt{mantissa}\\ \hline \mathtt{0} & \mathtt{0000} & \mathtt{00001} \\\hline \end{array}

This number, in decimal is

$$\mathtt{0.00001}\times 2^{-6}\quad\longrightarrow\quad 2^{-5}\times2^{-6}=2^{-11}=\fbox{0.00048828125}$$

Its precision is $\fbox{$p=1$}$ as there is only one significant digit (zeros to the left do not count).

The next number we can represent exactly is

$$\mathtt{0.00010}\times 2^{-6}$$

We obtain the gap between these two numbers by subtracting them:

$$\mathtt{0.00001}\times 2^{-6}\quad\longrightarrow\quad 2^{-5}\times 2^{-6}=2^{-11}= \fbox{0.00048828125}$$

The largest positive denormal number¶

All the digits in the exponent are zero and the mantissa will be maximum

\begin{array}{|c|c|c|} \hline \mathtt{sign} & \mathtt{exponent} & \mathtt{mantissa}\\ \hline \mathtt{0} & \mathtt{0000} & \mathtt{11111} \\\hline \end{array}

It represents

$$\mathtt{0.11111}\times 2^{-6}\quad\longrightarrow\quad (2^{-1}+2^{-2}+2^{-3}+2^{-4}+2^{-5})\times 2^{-6}=\fbox{0.01513671875}$$

Its precision is $\fbox{$p=5$}$ because we have 5 significant digits.

The previous number we can represent exactly is

$$\mathtt{0.11110}\times 2^{-6}$$

We obtain the gap between these two numbers by subtracting them:

$$\mathtt{0.00001}\times 2^{-6}\quad\longrightarrow\quad 2^{-5}\times 2^{-6}=2^{-11}= \fbox{0.00048828125}$$

How many positive denormal numbers are there?¶

There is only one exponent that is $\mathtt{0000}.$

With $n = 5$ bits for the mantissa we have $2^n$ different numbers. The hidden bit is always 0, and it does not add any new number.

So we have $$2^n-1=2^5-1=32-1\;\mathrm{mantissas}$$ because we must remove the one with all zeros, that represents the special value zero.

And, therefore

$$1\,\mathrm{exponent}\times31\,\mathrm{mantissas} = \fbox{31 positive denormal numbers}$$
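The denormal results above can be verified by listing all of them with exact fractions. A small sketch, assuming $e_{min}=-6$ and a hidden bit of $0$ as stated for this format:

```python
from fractions import Fraction

MAN_BITS = 5
E_MIN = -6

# 0.m1m2m3m4m5 x 2^-6 for every mantissa pattern except all zeros (that one is zero).
denormals = [
    Fraction(frac, 2 ** MAN_BITS) * Fraction(2) ** E_MIN
    for frac in range(1, 2 ** MAN_BITS)
]
gaps = {b - a for a, b in zip(denormals, denormals[1:])}

print(len(denormals))                               # 31
print(float(min(denormals)), float(max(denormals))) # 0.00048828125 0.01513671875
print(gaps == {Fraction(1, 2 ** 11)})               # True
```

Unlike the normal numbers, the denormals are evenly spaced: every gap equals $2^{-11}$.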

Special values¶

Let's remember that the format is

$$ \begin{array}{|c|c|c|} \hline \mathtt{sign} & \mathtt{exponent} & \mathtt{mantissa}\\ \hline \mathtt{s} & \mathtt{e_1e_2e_3e_4} & \mathtt{m_1m_2m_3m_4m_5}\\ \hline \end{array} $$

Zero is represented by setting all the bits of the exponent and the mantissa to zero. The sign bit distinguishes $+0$ from $-0$.

$$ \begin{array}{|c|c|c|c|} \hline &\mathtt{sign} & \mathtt{exponent} & \mathtt{mantissa}\\ \hline +0&\mathtt{0} & \mathtt{0000} & \mathtt{00000}\\ \hline -0&\mathtt{1} & \mathtt{0000} & \mathtt{00000}\\ \hline \end{array} $$

Infinity is represented by setting all the bits of the exponent to 1 and all the bits of the mantissa to 0.

$$ \begin{array}{|c|c|c|c|} \hline &\mathtt{sign} & \mathtt{exponent} & \mathtt{mantissa}\\ \hline +\infty&\mathtt{0} & \mathtt{1111} & \mathtt{00000}\\ \hline -\infty&\mathtt{1} & \mathtt{1111} & \mathtt{00000}\\ \hline \end{array} $$

$\mathtt{NaN}$ (Not a Number) is represented with all the bits of the exponent 1 and the bits of the mantissa with any combination other than all zeros, for example

$$ \begin{array}{|c|c|c|c|} \hline &\mathtt{sign} & \mathtt{exponent} & \mathtt{mantissa}\\ \hline \mathtt{NaN}&\mathtt{0} & \mathtt{1111} & \mathtt{01101}\\ \hline \end{array} $$
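These rules can be summarized in a small classifier. The function `classify` and the labels it returns are assumptions made for illustration, and the field widths are hard-coded for this 10-bit format.

```python
def classify(bits: str) -> str:
    """Classify a 10-bit pattern (1 sign bit, 4 exponent bits, 5 mantissa bits)."""
    sign, exponent, mantissa = bits[0], bits[1:5], bits[5:]
    if exponent == "1111":                  # reserved: infinity or NaN
        if mantissa == "00000":
            return "-inf" if sign == "1" else "+inf"
        return "NaN"
    if exponent == "0000":                  # reserved: zero or denormal
        if mantissa == "00000":
            return "-0" if sign == "1" else "+0"
        return "denormal"
    return "normal"

print(classify("0111101101"))  # NaN
```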

Represent -1.5625 in this system¶

Decimal to binary¶

Integer part¶

The integer part is $1$ in base $2$ and in base $10$

Fractional part¶

We multiply by $2$, then we subtract the integer part, and repeat until the fractional part is zero.

$$\begin{array}{lccccccc} 0.5625 &\times& 2 &= &1.125 &\rightarrow & 1 & \downarrow\\ 0.125 &\times& 2 &= &0.25 &\rightarrow & 0 & \downarrow\\ 0.25 &\times& 2 &= &0.5 &\rightarrow & 0 & \downarrow\\ 0.5&\times& 2& = &1.0&\rightarrow & 1& \downarrow \end{array}$$

We construct this part from top to bottom and the binary number is $\mathtt{0.1001}$

And the full number is

$$(1.5625)_{10} = (\mathtt{1.1001})_2$$
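The repeated multiplication used for the fractional part can be sketched as follows. The function `fraction_to_binary` and its `max_digits` cut-off are hypothetical; exact fractions are used so that no rounding interferes with the digits.

```python
from fractions import Fraction

def fraction_to_binary(x, max_digits=16):
    """Binary digits of a fraction 0 <= x < 1, obtained by repeated doubling."""
    x = Fraction(x)
    digits = []
    while x != 0 and len(digits) < max_digits:
        x *= 2
        digit = int(x)          # the integer part is the next binary digit
        digits.append(str(digit))
        x -= digit              # keep only the fractional part
    return "".join(digits)

print(fraction_to_binary("0.5625"))  # 1001
```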

Normalization¶

We move the point so that a single non-zero digit appears to its left.
Then we multiply by $2^{n}$, where $n$ is the number of positions that we have moved the point to the left, or by $2^{-n}$, where $n$ is the number of positions that we have moved the point to the right.
We add the sign.

This number, normalized is

$$-\mathtt{1.1001}\times2^0$$

with

Sign: $-$
Mantissa: $\mathtt{1.1001}$
Exponent: $0$

Sign¶

As it is a negative number $\longrightarrow$ the bit for the sign is $\mathtt{1}$

Exponent¶

We have $m=4$ bits for the exponent. Therefore there are $2^m=2^4=16$ different combinations and, in principle, we can represent $16$ numbers. As we start at $0$ it will end at $15$. The first number, $\mathtt{0000}$, and the last one, $\mathtt{1111}$ are reserved. And since the representation is biased, we subtract the $bias=2^{m-1}-1=2^{4-1}-1 =2^3-1= 8-1 = 7$ to get the represented value.

The exponent value is $0$. To get its face value we add the bias, $0+7=7$, and convert the result to binary by repeated division by $2$, reading the remainders from bottom to top:

$$ \begin{array}{ccccc} \hline \mathrm{Dividend} & \mathrm{Divisor} & \mathrm{Quotient} & \mathrm{Remainder} & \\ \hline 7 & 2 & 3 & 1 & \uparrow \\ 3 & 2 & 1 & 1 & \uparrow \\ 1 & 2 & 0 & 1 & \uparrow \\ \hline \end{array} $$

that is

$$(7)_{10} = (\mathtt{111})_2$$

We pad with a zero on the left to fill the four exponent bits, which gives $\mathtt{0111}$:

\begin{array}{cccc} \hline \mathrm{Binary}\; \mathrm{number} & \mathrm{Face}\; \mathrm{value} & & \mathrm{Represented}\; \mathrm{value}\\ \hline \mathtt{0000}& 0& & R\\ \mathtt{0001}& 1& & -6\\ \mathtt{0010}& 2& & -5\\ \vdots & \vdots& bias & \vdots\\ \mathtt{{\color{red}{0111}}}& {\color{red}7}& +7 & {\color{red}{0}}\\ \vdots & \vdots& \longleftarrow & \vdots\\ \mathtt{1101}& 13& & 6\\ \mathtt{1110}& 14& & 7\\ \mathtt{1111}& 15& & R\\ \hline \end{array}
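The repeated division and the left padding can be sketched like this. The function `integer_to_binary` and its `width` parameter are names introduced here for illustration.

```python
def integer_to_binary(n: int, width: int = 0) -> str:
    """Binary digits of a non-negative integer, obtained by repeated division by 2."""
    digits = []
    while n > 0:
        n, remainder = divmod(n, 2)
        digits.append(str(remainder))        # remainders come out least significant first
    text = "".join(reversed(digits))         # so we read them from bottom to top
    return text.rjust(width, "0") or "0"     # pad with zeros on the left

print(integer_to_binary(7))           # 111
print(integer_to_binary(7, width=4))  # 0111
```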

Mantissa¶

The mantissa is $$\mathtt{1.{\color{ForestGreen}{1001}}}.$$ We do not store the hidden bit, and we pad with zeros on the right until we have 5 bits, which gives $\mathtt{10010}$.

Number¶

The number $-1.5625$ in this system is

$$ \begin{array}{|c|c|c|} \hline \mathtt{sign}&\mathtt{exponent}&\mathtt{mantissa}\\ \hline \mathtt{1}&\mathtt{0\color{red}{111}}&\mathtt{\color{ForestGreen}{1001}0}\\ \hline \end{array} $$
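The full encoding procedure (sign, normalization, biased exponent, stored mantissa) can be sketched in Python. The function `encode` is hypothetical and deliberately limited: it assumes the value is a nonzero normal number of this format that needs no rounding, and it raises an error otherwise.

```python
import math

def encode(x: float, exp_bits: int = 4, man_bits: int = 5) -> str:
    """Encode a value that is exactly representable as a normal number."""
    if x == 0 or not math.isfinite(x):
        raise ValueError("only nonzero finite values are handled in this sketch")
    bias = 2 ** (exp_bits - 1) - 1
    sign = "1" if x < 0 else "0"
    m, e = math.frexp(abs(x))          # abs(x) = m * 2**e with 0.5 <= m < 1
    m, e = m * 2, e - 1                # rescale so that 1 <= m < 2 (hidden bit is 1)
    face = e + bias                    # biased exponent
    if not 1 <= face <= 2 ** exp_bits - 2:
        raise ValueError("exponent outside the normal range")
    frac = (m - 1) * 2 ** man_bits     # stored mantissa digits as a number
    if frac != int(frac):
        raise ValueError("value would need rounding")
    return sign + format(face, f"0{exp_bits}b") + format(int(frac), f"0{man_bits}b")

print(encode(-1.5625))  # 1011110010
```

The output groups as $\mathtt{1}\,|\,\mathtt{0111}\,|\,\mathtt{10010}$, matching the table above.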
