1 Normal numbers
- 1.1 Maximum and minimum exponent
- 1.2 Normal positive numbers
2 Denormal numbers
- 2.1 Single precision denormal numbers
- 2.2 Denormal numbers in this standard
3 The machine epsilon
4 Maximum integer
- 4.1 Maximum integer in single precision
- 4.2 Maximum integer in double precision
5 Special values

Exercise¶

A computer stores floating point numbers in binary format with 6 bits using a standard similar to IEEE 754. The first bit is for the sign, the next three bits are for the biased exponent and the last two bits are for the mantissa.

$$ \begin{array}{|c|c|c|} \hline \mathtt{sign} & \mathtt{exponent} & \mathtt{mantissa}\\ \hline \mathtt{s} & \mathtt{e_1e_2e_3} & \mathtt{m_1m_2}\\ \hline \end{array} $$

Calculate the maximum and minimum exponents.
Compute the representable normal positive numbers and draw them on the real line. How many numbers do we get with this standard?
Compute the representable subnormal positive numbers. How many subnormal numbers do we get with this standard?
Calculate the machine $\epsilon$ for this standard.
Draw a graph of the absolute and relative errors that we would make when we store real numbers that are between one and the maximum representable normal number for both rounding and truncation.
Calculate the maximum integer that can be represented exactly so the next integer cannot be represented exactly with this standard.
How would you represent the zero with this standard? And infinity? And $\mathtt{NaN}$?

Normal numbers¶

Maximum and minimum exponent¶

The format is

$$ \begin{array}{|c|c|c|} \hline \mathtt{sign} & \mathtt{exponent} & \mathtt{mantissa}\\ \hline \mathtt{s} & \mathtt{e_1e_2e_3} & \mathtt{1.m_1m_2}\\ \hline \end{array} $$

We have $m=3$ bits for the exponent, so there are $2^m=2^3=8$ different bit patterns, with face values from $0$ to $7$. The first pattern, $\mathtt{000}$, and the last one, $\mathtt{111}$, are reserved (marked $R$ in the table). Since the representation is biased, we subtract the bias $$\mathrm{bias}=2^{m-1}-1=2^{3-1}-1=2^2-1=4-1=3$$ from the face value to get the represented value

\begin{array}{cccc} \hline \mathrm{Binary}\; \mathrm{number} & \mathrm{Face}\; \mathrm{value} & & \mathrm{Represented}\; \mathrm{value}\\ \hline \mathtt{000}& 0& & R\\ \mathtt{001}& 1& & -2\\ \mathtt{010}& 2& & -1\\ \mathtt{011}& 3& bias & 0\\ \mathtt{100}& 4& -3 & 1\\ \mathtt{101}& 5& \longrightarrow & 2\\ \mathtt{110}& 6& & 3\\ \mathtt{111}& 7& & R\\ \hline \end{array}

Thus, in this standard, the minimum exponent is $\begin{array}{|c|}\hline e_{min}=-2\\ \hline\end{array}$ and the maximum is $\begin{array}{|c|}\hline e_{max}=3\\ \hline\end{array}.$
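The bias and the exponent limits follow mechanically from the number of exponent bits. As a minimal sketch of that computation (the function name `exponent_range` is a hypothetical helper, not part of any library), assuming the IEEE 754 convention that the all-zeros and all-ones exponent fields are reserved:

```python
def exponent_range(m):
    """Return (bias, e_min, e_max) for m exponent bits, IEEE 754 style."""
    bias = 2 ** (m - 1) - 1
    # Face values 0 and 2**m - 1 are reserved, so usable ones run from 1 to 2**m - 2.
    e_min = 1 - bias
    e_max = (2 ** m - 2) - bias
    return bias, e_min, e_max


print(exponent_range(3))    # toy format: (3, -2, 3)
print(exponent_range(8))    # single precision: (127, -126, 127)
print(exponent_range(11))   # double precision: (1023, -1022, 1023)
```

The same function reproduces the exponent limits of single and double precision quoted later in the text.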

Normal positive numbers¶

Let's remember that the format is

$$ \begin{array}{|c|c|c|} \hline \mathtt{sign} & \mathtt{exponent} & \mathtt{mantissa}\\ \hline \mathtt{s} & \mathtt{e_1e_2e_3} & \mathtt{1.m_1m_2}\\ \hline \end{array} $$

As we are taking into account only the positive numbers, the first bit, the sign, is a zero.
The three following bits correspond to the exponent.
If we have $n=2$ bits for the mantissa, we write each exponent with every possible mantissa. Let's remember that the hidden bit is not stored and that, for a normal number, it is always $1$.

\begin{array}{|lcll|} \hline \mathrm{Num\;(bin)} & \mathrm{Value} & \mathrm{Num\;(dec)} & \mathrm{Gap}\\ \hline \mathtt{0\:001\:00} & 1.00\times2^{-2} & 0.25 & 2^{-4}=0.0625 \\ \mathtt{0\:001\:01} & 1.01\times2^{-2} & 0.3125 & 2^{-4}=0.0625 \\ \mathtt{0\:001\:10} & 1.10\times2^{-2} & 0.375 & 2^{-4}=0.0625 \\ \mathtt{0\:001\:11} & 1.11\times2^{-2} & 0.4375 & 2^{-4}=0.0625 \\ & & & \\ \mathtt{0\:010\:00} & 1.00\times2^{-1} & 0.5 & 2^{-3}=0.125\\ \mathtt{0\:010\:01} & 1.01\times2^{-1} & 0.625 & 2^{-3}=0.125\\ \mathtt{0\:010\:10} & 1.10\times2^{-1} & 0.75 & 2^{-3}=0.125\\ \mathtt{0\:010\:11} & 1.11\times2^{-1} & 0.875 & 2^{-3}=0.125\\ & & & \\ \mathtt{0\:011\:00} & 1.00\times2^{0} & 1 & 2^{-2}=0.25\\ \mathtt{0\:011\:01} & 1.01\times2^{0} & 1.25 & 2^{-2}=0.25\\ \mathtt{0\:011\:10} & 1.10\times2^{0} & 1.5 & 2^{-2}=0.25\\ \mathtt{0\:011\:11} & 1.11\times2^{0} & 1.75 & 2^{-2}=0.25\\ & & & \\ \mathtt{0\:100\:00} & 1.00\times2^{1} & 2 & 2^{-1}=0.5\\ \mathtt{0\:100\:01} & 1.01\times2^{1} & 2.5 & 2^{-1}=0.5\\ \mathtt{0\:100\:10} & 1.10\times2^{1} & 3 & 2^{-1}=0.5\\ \mathtt{0\:100\:11} & 1.11\times2^{1} & 3.5 & 2^{-1}=0.5\\ & & & \\ \mathtt{0\:101\:00} & 1.00\times2^{2} & 4 & 2^{0}=1\\ \mathtt{0\:101\:01} & 1.01\times2^{2} & 5 & 2^{0}=1\\ \mathtt{0\:101\:10} & 1.10\times2^{2} & 6 & 2^{0}=1\\ \mathtt{0\:101\:11} & 1.11\times2^{2} & 7 & 2^{0}=1\\ & & & \\ \mathtt{0\:110\:00} & 1.00\times2^{3} & 8 & 2^{1}=2\\ \mathtt{0\:110\:01} & 1.01\times2^{3} & 10 & 2^{1}=2\\ \mathtt{0\:110\:10} & 1.10\times2^{3} & 12 & 2^{1}=2\\ \mathtt{0\:110\:11} & 1.11\times2^{3} & 14 & \\ \hline \end{array}

With $m = 3$ bits for the exponent, we have $2^m$ different bit patterns. Since the first one, $\mathtt{000}$, and the last one, $\mathtt{111}$, are reserved, we are left with

$$2^m-2=2^3-2=8-2=6\;\mathrm{exponents}$$

We have $n = 2$ bits for the mantissa and that means $2^n$ different combinations. The hidden bit is always 1, and it does not provide new values. Thus

$$2^n=2^2=4\;\mathrm{mantissas}$$

and

$$6\,\mathrm{exponents}\times4\,\mathrm{mantissas} = \fbox{24 normal positive numbers}$$

If we represent these numbers on the real line

We can see that:

The space between numbers increases when we move to the right. In fact, every time we change the exponent, the space between numbers doubles.
There is a relatively large gap between the smallest normal number and zero.
We have marked in green the number $1$ and the next number that can be represented in this standard.
The smallest normal number is $0.25$ and the largest is $14.$
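The table and the count of normal numbers can be reproduced by looping over every usable exponent and every mantissa. The following is a minimal sketch under the assumptions of this exercise ($3$ exponent bits, $2$ mantissa bits); the names `M_BITS`, `N_BITS` and `normal_positive_numbers` are illustrative only.

```python
M_BITS, N_BITS = 3, 2                 # exponent bits, stored mantissa bits
BIAS = 2 ** (M_BITS - 1) - 1


def normal_positive_numbers():
    """List every positive normal number of the toy format in increasing order."""
    values = []
    for stored_exp in range(1, 2 ** M_BITS - 1):      # skip reserved 000 and 111
        for mant in range(2 ** N_BITS):
            significand = 1 + mant / 2 ** N_BITS      # the hidden bit is 1
            values.append(significand * 2.0 ** (stored_exp - BIAS))
    return values


normals = normal_positive_numbers()
gaps = sorted({b - a for a, b in zip(normals, normals[1:])})

print(len(normals), normals[0], normals[-1])   # 24 0.25 14.0
print(gaps)                                    # [0.0625, 0.125, 0.25, 0.5, 1.0, 2.0]
```

The list of distinct gaps shows the spacing doubling each time the exponent increases by one.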

Denormal numbers¶

The normal numbers are

$$ \begin{array}{|c|c|c|} \hline \mathtt{sign} & \mathtt{exponent} & \mathtt{mantissa}\\ \hline \mathtt{s} & \mathtt{e_1e_2e_3} & \mathtt{1.m_1m_2}\\ \hline \end{array} $$

while the denormal (or denormalized or subnormal) numbers are

$$ \begin{array}{|c|c|c|} \hline \mathtt{sign} & \mathtt{exponent} & \mathtt{mantissa}\\ \hline \mathtt{s} & \mathtt{000} & \mathtt{0.m_1m_2}\\ \hline \end{array} $$

Single precision denormal numbers¶

If we consider that the hidden bit is zero, this allows us to represent numbers smaller than the minimum normal number.

In the IEEE 754 standard, these numbers are stored with an exponent field of $\mathtt{0000\,0000}$ in single precision and $\mathtt{0000\,0000\,000}$ in double precision. However, their exponent is interpreted as the minimum exponent, that is, $-126$ in single precision and $-1022$ in double precision.

The downside of these numbers is that their precision is less than $24$ bits in single precision and less than $53$ bits in double precision.
Their advantage is that they extend the range of representable numbers by filling in the space between the smallest normal number and zero.

Let's see an example

The following number is represented in single precision according to the IEEE 754 standard.

$$ \begin{array}{|c|c|c|} \hline \texttt{sign} & {\texttt{exponent}} & {\texttt{mantissa}}\\ \hline \texttt{0} & \mathtt{0000\,0000} & \mathtt{\color{red}{000}1\,0110\,0000\,0000\,0000\,000}\\ \hline \end{array} $$

What is its value in base $10$? What is the precision of the represented number?

Since its exponent is $\mathtt{0000\,0000}$ and its mantissa is not zero, it is a denormal number:

Its exponent is the minimum exponent of the standard, $-126.$
Its hidden bit is 0.

Therefore, it represents the number

$$ \mathtt{0.0001011}\cdot2^{-126} $$

which corresponds to the following number in base $10$

$$ (2^{-4}+2^{-6}+2^{-7})\cdot2^{-126}\approx1.0102\cdot10^{-39}. $$

This number has a precision of only $20$ bits (not $24$, as for normal numbers in single precision) since, for precision purposes, the three zeros to the left of the first one in the mantissa do not count (the three zeros in red $\mathtt{\color{red}{000}}$).
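The worked example can be cross-checked by reinterpreting the same $32$ bits as a single precision number. This is a minimal sketch using Python's standard `struct` module; the variable names are illustrative, and the precision is computed here as the number of mantissa bits from the leading one to the last stored bit, which is the counting rule used in the text.

```python
import struct

sign = "0"
exponent = "00000000"
mantissa = "0001 0110 0000 0000 0000 000".replace(" ", "")
word = int(sign + exponent + mantissa, 2)

# Reinterpret the 32-bit pattern as an IEEE 754 single precision number.
value = struct.unpack(">f", struct.pack(">I", word))[0]

# Manual evaluation: hidden bit 0, exponent fixed at -126.
fraction = int(mantissa, 2) / 2 ** 23          # 0.0001011 in binary
manual = fraction * 2.0 ** -126

# Leading zeros of the mantissa do not count towards the precision.
precision = int(mantissa, 2).bit_length()

print(value == manual)   # True
print(precision)         # 20
```

Both routes give the same value, approximately $1.0102\cdot10^{-39}$.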

Denormal numbers in this standard¶

We return to our standard. Remember that the format is

$$ \begin{array}{|c|c|c|} \hline \mathtt{sign} & \mathtt{exponent} & \mathtt{mantissa}\\ \hline \mathtt{s} & \mathtt{e_1e_2e_3} & \mathtt{1.m_1m_2}\\ \hline \end{array} $$

and for denormal numbers

$$ \begin{array}{|c|c|c|} \hline \mathtt{sign} & \mathtt{exponent} & \mathtt{mantissa}\\ \hline \mathtt{s} & \mathtt{000} & \mathtt{0.m_1m_2}\\ \hline \end{array} $$

Since the numbers that we are going to represent are denormal, their exponent is $\mathtt{000}$ and their mantissa cannot be zero:

The value of the exponent is the minimum exponent of the standard, $-2.$
Its hidden bit is $0.$

In this standard, the denormal numbers are

\begin{array}{|clll|} \hline \mathrm{Num\;(bin)} & \mathrm{Value} & \mathrm{Num\;(dec)} & \mathrm{Gap} \\ \hline \mathtt{0\:000\:01} & 0.01\times2^{-2} & 0.0625 & 2^{-4}=0.0625 \\ \mathtt{0\:000\:10} & 0.10\times2^{-2} & 0.125 & 2^{-4}=0.0625 \\ \mathtt{0\:000\:11} & 0.11\times2^{-2} & 0.1875 & \\ \hline \end{array}

Since the exponent, the sign (we are only counting positive numbers) and the hidden bit do not vary, we only need to count the combinations of the $n=2$ mantissa bits, excluding the case where the mantissa is all zeros. So

$$2^n-1=2^2-1=\fbox{3 denormal numbers}$$

We can see that:

The gap between numbers is constant and matches the space between the smallest normal numbers.
The smallest denormal number is $0.0625$ and the largest is $0.1875$. Both lie between zero and the smallest normal number.
The precision is $1$ bit for the first number and $2$ bits for the next two (the precision of this standard for normal numbers is $3$ bits: the two stored bits plus the hidden bit).
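The denormal numbers of the toy format, together with their precision, can be listed in the same way. A minimal sketch, assuming $n=2$ mantissa bits and $e_{min}=-2$ as derived above (the helper name `subnormal_positive_numbers` is hypothetical):

```python
N_BITS = 2     # stored mantissa bits
E_MIN = -2     # denormals use the minimum exponent


def subnormal_positive_numbers():
    """Return (mantissa bits, value, precision in bits) for each positive denormal."""
    rows = []
    for mant in range(1, 2 ** N_BITS):                 # mantissa 00 is zero, not a denormal
        value = (mant / 2 ** N_BITS) * 2.0 ** E_MIN    # the hidden bit is 0
        precision = mant.bit_length()                  # leading zeros do not count
        rows.append((format(mant, f"0{N_BITS}b"), value, precision))
    return rows


for row in subnormal_positive_numbers():
    print(row)
# ('01', 0.0625, 1)
# ('10', 0.125, 2)
# ('11', 0.1875, 2)
```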

The machine epsilon¶

Let's remember that our toy system is

$$ \begin{array}{|c|c|c|} \hline \mathtt{sign} & \mathtt{exponent} & \mathtt{mantissa}\\ \hline \mathtt{s} & \mathtt{e_1e_2e_3} & \mathtt{1.m_1m_2}\\ \hline \end{array} $$

The machine $\epsilon$ is defined as the distance between the number $1$ and the next number that can be exactly represented in this standard.

Remember that in this standard $1$ and the following number are

\begin{array}{ll} \mathtt{1.00}\times2^{0} & 1 \\ \mathtt{1.01}\times2^{0} & 1.25 \end{array}

We calculate the distance by subtracting these two values

$$ \begin{array}{rll} & \mathtt{1.01}\times2^{0} & 1.25\\ -&&\\[-40pt] & \mathtt{1.00}\times2^{0} & 1 \\ \hline \epsilon\rightarrow & 0.01\times2^{0}\rightarrow & 0.25 \end{array} $$

That is, for this standard $$\fbox{$\epsilon=0.25$}$$

The machine $\epsilon$ is an upper bound on the relative rounding error that we make when we store a number in this standard. We will check this in the section on rounding errors.

The machine epsilon in single precision¶

For single precision, the representation of one and the next exactly representable number, taking into account that we have $23$ bits of mantissa plus the hidden bit, is

\begin{array}{l} \mathtt{1.0000\,0000\,0000\,0000\,0000\,000}\times2^{0} \\ \mathtt{1.0000\,0000\,0000\,0000\,0000\,001}\times2^{0} \end{array}

If we subtract these two numbers

\begin{array}{rll} & \mathtt{1.0000\,0000\,0000\,0000\,0000\,001}\times2^{0}\\ -&\\[-40pt] & \mathtt{1.0000\,0000\,0000\,0000\,0000\,000}\times2^{0}\\ \hline \epsilon\rightarrow & \mathtt{0.0000\,0000\,0000\,0000\,0000\,001}\times2^{0}\rightarrow & \fbox{$2^{-23}\approx 1.19 \times 10^{-7}$} \end{array}

The machine epsilon in double precision¶

If we apply the definition to double precision, the representation of one and the next exactly representable number, considering that we have $52$ bits of mantissa plus the hidden bit, is

\begin{array}{l} \mathtt{1.\overbrace{0000\,0000\cdots00}^{52\,bits}}\times2^{0} \\ \mathtt{1.0000\,0000\cdots01}\times2^{0} \end{array}

and the gap between the two numbers is

\begin{array}{rll} & \mathtt{1.\overbrace{0000\,0000\cdots01}^{52\,bits}}\times2^{0} \\ -&\\[-40pt] & \mathtt{1.0000\,0000\cdots00}\times2^{0}\\ \hline \epsilon\rightarrow & \mathtt{0.0000\,0000\cdots01}\times2^{0}\rightarrow & \fbox{$2^{-52}\approx 2.22 \times 10^{-16}$} \end{array}
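In all three formats the subtraction leaves a single one in the last mantissa position, so $\epsilon=2^{-n}$ for $n$ stored mantissa bits. The sketch below checks this against the machine, assuming a platform where Python floats are IEEE 754 double precision; `machine_epsilon` and `next_float32_after_one` are hypothetical helper names, while `struct` and `sys.float_info.epsilon` belong to the standard library.

```python
import struct
import sys


def machine_epsilon(n):
    """Gap between 1 and the next representable number with n stored mantissa bits."""
    return 2.0 ** -n


def next_float32_after_one():
    """Next single precision number after 1, obtained by adding 1 to the bit pattern."""
    word = struct.unpack(">I", struct.pack(">f", 1.0))[0]
    return struct.unpack(">f", struct.pack(">I", word + 1))[0]


print(machine_epsilon(2))                                       # 0.25
print(machine_epsilon(23) == next_float32_after_one() - 1.0)    # True
print(machine_epsilon(52) == sys.float_info.epsilon)            # True
```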

Rounding errors¶

For now, we will plot the absolute and relative errors that appear when we store numbers with this standard. We have seen that we can represent numbers between $0$ and $14.$ When we want to store a number in this range, if it does not match any of the numbers in the table, we store it as one of the closest numbers in the table.

If we store it as the closest number below, we have the truncation error, or rounding towards zero error (we will clarify these names later), which is the distance from the number we want to represent to the closest representable number below it.
The maximum error occurs for numbers just below a number that can be exactly represented.
If we store it as the nearest number, we have the rounding to nearest error (ties go to the number whose mantissa is even), which is the distance to the nearest representable number.
The maximum error occurs at the midpoint between two consecutive numbers that can be exactly represented.

We can see that:

The absolute error grows with the magnitude of the number, but the relative error, which is the important one, stays bounded.
The maximum absolute error for truncation is twice the maximum absolute error for rounding.
All relative errors are less than $\epsilon=0.25$ as we said.
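These observations can be checked numerically without the plot. The following is a rough sketch that stores sample points of $[1, 14]$ in the toy format by truncation and by rounding to nearest; `store` and `max_errors` are hypothetical helpers, and the sampling step of $0.001$ is an arbitrary choice, so the reported maxima are only approximations of the true suprema.

```python
import math

N_BITS = 2  # stored mantissa bits of the toy format


def store(x, mode):
    """Store a positive x from the normal range by truncation or rounding to nearest."""
    _, e = math.frexp(x)              # x = f * 2**e with 0.5 <= f < 1
    gap = 2.0 ** (e - 1 - N_BITS)     # spacing of representable numbers around x
    scaled = x / gap                  # exact, since gap is a power of two
    # Python's round() sends ties to the even integer, matching ties-to-even.
    k = math.floor(scaled) if mode == "truncate" else round(scaled)
    return k * gap


def max_errors(mode):
    """Largest absolute and relative error over a grid of step 0.001 on [1, 14]."""
    xs = [1 + i / 1000 for i in range(13001)]
    abs_err = [abs(x - store(x, mode)) for x in xs]
    rel_err = [a / x for a, x in zip(abs_err, xs)]
    return max(abs_err), max(rel_err)


for mode in ("truncate", "nearest"):
    print(mode, max_errors(mode))
```

On this grid the largest relative error stays below $\epsilon=0.25$ for truncation and at or below $\epsilon/2$ for rounding to nearest.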

Maximum integer¶

We want to calculate the maximum integer that can be represented exactly so that the next integer cannot be represented exactly.

Remember that our system is

$$ \begin{array}{|c|c|c|} \hline \mathtt{sign} & \mathtt{exponent} & \mathtt{mantissa}\\ \hline \mathtt{s} & \mathtt{e_1e_2e_3} & \mathtt{1.m_1m_2}\\ \hline \end{array} $$

If we look at the representable integers, we see that we can represent every integer from $1$ to $8$, but we cannot represent $9$ exactly. Therefore, the answer to the question is $8$.

How can we reason about this for any similar system? The idea is that if we can store all the digits of a number in the format, there is no error. If we must drop a digit, there may be an error. In the current format, we have two bits of mantissa plus the hidden bit, so we can store $3$ binary digits.

\begin{array}{|crc|} \hline \mathrm{int} & \mathrm{bin} &\\ \hline 1&\mathtt{1} & \mathtt{1.00}\times2^{0} \\ 2&\mathtt{10} & \mathtt{1.00}\times2^{1} \\ 3&\mathtt{11} & \mathtt{1.10}\times2^{1} \\ 4&\mathtt{100} & \mathtt{1.00}\times2^{2} \\ 5&\mathtt{101} & \mathtt{1.01}\times2^{2} \\ 6&\mathtt{110} & \mathtt{1.10}\times2^{2} \\ 7&\mathtt{111} & \mathtt{1.11}\times2^{2} \\ 8&\mathtt{100{\color{red}0}} & \mathtt{1.00}\times2^{3} \\ \hline 9&\mathtt{100{\color{red}1}} & {\color{ForestGreen}{\mathtt{1.00}\times2^{3}}} \\ 10&\mathtt{101{\color{red}0}} & \mathtt{1.01}\times2^{3} \\ 11&\mathtt{101{\color{red}1}} & {\color{ForestGreen}{\mathtt{1.10}\times2^{3}}} \\ 12&\mathtt{110{\color{red}0}} & \mathtt{1.10}\times2^{3}\\ 13&\mathtt{110{\color{red}1}} & {\color{ForestGreen}{\mathtt{1.10}\times2^{3}}} \\ 14&\mathtt{111{\color{red}0}} & \mathtt{1.11}\times2^{3}\\ \hline \end{array}

We can see that:

Up to number $7$ there is no problem because we have space to store all the significant digits.
For number $8$ we lack space to store the last digit, but since it is a $\mathtt{\color{red}0}$, no error is made.
The $9$ can no longer be stored exactly because the digit we cannot store is a $\mathtt{\color{red}1}$. We have to round it to $8$ or $10.$ As the distance from $9$ to these two numbers is the same, we choose the one whose mantissa ends in $\mathtt{0}$, in this case $8$, that is, ${\color{ForestGreen}{1.00\times2^{3}}}$.
For $10$ the case is similar to $8$. We do not have space for all the digits, but since the one that we drop is a $\mathtt{\color{red}0}$, no error is made.
$11$ cannot be stored exactly because the digit we cannot store is a $\mathtt{\color{red}1}$. We have to round it to $10$ or $12.$ As the distance from $11$ to these two numbers is the same, we choose the one whose mantissa ends in $\mathtt{0}$, in this case $12$, that is, ${\color{ForestGreen}{1.10\times2^{3}}}$.
And so on.

This value, $8$ for this format, gives us an idea of the capacity of this format to store integers exactly.
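With $n$ stored mantissa bits plus the hidden bit, the argument above gives $2^{n+1}$ as the last integer before a gap appears. The sketch below checks this by brute force, enumerating the representable numbers from $1$ upwards; `max_consecutive_integer` is a hypothetical helper, and it assumes $e_{max}$ is large enough for the gap to fall inside the normal range.

```python
def max_consecutive_integer(n_bits, e_max):
    """Largest k such that 1, 2, ..., k are representable but k + 1 is not."""
    representable = set()
    for e in range(e_max + 1):
        for mant in range(2 ** n_bits):
            # Significand 1.m times 2**e, written as an exact ratio of integers.
            representable.add((2 ** n_bits + mant) * 2 ** e / 2 ** n_bits)
    k = 1
    while k + 1 in representable:
        k += 1
    return k


print(max_consecutive_integer(2, 3))    # toy format: 8
print(2 ** (2 + 1))                     # formula 2**(n+1): 8
```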

Maximum integer in single precision¶

Let's find the maximum integer stored exactly so that all the integers below it are stored exactly.

If we do the same reasoning as in the previous case, taking into account that in this format we have $23$ bits plus the hidden bit, we can store $24$ significant digits

\begin{array}{|crc|} \hline \mathrm{int} & \mathrm{bin} &\\ \hline 1&\mathtt{1} & \mathtt{1.0000\,0000\,0000\,0000\,0000\,000}\times2^{0} \\ 2&\mathtt{10} & \mathtt{1.0000\,0000\,0000\,0000\,0000\,000}\times2^{1} \\ 3&\mathtt{11} & \mathtt{1.1000\,0000\,0000\,0000\,0000\,000}\times2^{1} \\ 4&\mathtt{100} & \mathtt{1.0000\,0000\,0000\,0000\,0000\,000}\times2^{2} \\ 5&\mathtt{101} & \mathtt{1.0100\,0000\,0000\,0000\,0000\,000}\times2^{2} \\ 6&\mathtt{110} & \mathtt{1.1000\,0000\,0000\,0000\,0000\,000}\times2^{2} \\ 7&\mathtt{111} & \mathtt{1.1100\,0000\,0000\,0000\,0000\,000}\times2^{2} \\ \vdots & \vdots & \vdots \\ 16777215 &\mathtt{1\,1111\,1111\,1111\,1111\,1111\,111} & \mathtt{1.1111\,1111\,1111\,1111\,1111\,111}\times2^{23}\\ 16777216 &\mathtt{1\,0000\,0000\,0000\,0000\,0000\,000{\color{red}0}} & \mathtt{1.0000\,0000\,0000\,0000\,0000\,000}\times2^{24} \\ \hline 16777217 &\mathtt{1\,0000\,0000\,0000\,0000\,0000\,000{\color{red}1}} & {\color{ForestGreen}{\mathtt{1.0000\,0000\,0000\,0000\,0000\,000}\times2^{24}}} \\ 16777218 &\mathtt{1\,0000\,0000\,0000\,0000\,0000\,001{\color{red}0}} & \mathtt{1.0000\,0000\,0000\,0000\,0000\,001}\times2^{24} \\ 16777219 &\mathtt{1\,0000\,0000\,0000\,0000\,0000\,001{\color{red}1}} & {\color{ForestGreen}{\mathtt{1.0000\,0000\,0000\,0000\,0000\,010}\times2^{24}}}\\ 16777220 &\mathtt{1\,0000\,0000\,0000\,0000\,0000\,010{\color{red}0}} & \mathtt{1.0000\,0000\,0000\,0000\,0000\,010}\times2^{24} \\ \vdots & \vdots & \vdots \\ \hline \end{array}

That is, we have $23$ bits plus the hidden bit to store digits. Therefore, the largest integer for which we can store all the digits is

$$\mathtt{1\,\,1111\,1111\,1111\,1111\,1111\,111}$$

that is

$$\mathtt{1\,\,\overbrace{1111\cdots111}^{23\,bits}} \quad\rightarrow\quad \mathtt{1.\,\overbrace{1111\cdots111}^{23\,bits}}\times 2^{23} $$

The next number is

$$\mathtt{10\overbrace{0000\cdots00{\color{red}0}}^{23\,bits}}= \mathtt{1\overbrace{0000\cdots000}^{23\,bits}{\color{red}0}}\quad\rightarrow\quad \mathtt{1.\,\overbrace{0000\cdots000}^{23\,bits}}\times 2^{24}$$

and we must drop the last $\mathtt{{\color{red}0}}.$ Because it is a zero, there is no error. This number, in decimal, is

$$\fbox{$2^{24}=16777216$}$$

The following integer number is

$$\mathtt{10\overbrace{0000\cdots00{\color{red}1}}^{23\,bits}}= \mathtt{1\overbrace{0000\cdots000}^{23\,bits}{\color{red}1}}\quad\rightarrow\quad \mathtt{1.\,\overbrace{0000\cdots000}^{23\,bits}}\times 2^{24}$$

We cannot store the last $\mathtt{{\color{red}1}}$, so this number has no exact representation and we must round it to the previous representable number or to the next one (in this case, to the previous one).
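The rows of the table around $2^{24}$ can be reproduced by forcing a number through a single precision round trip. This is a minimal sketch that uses `struct` with the `"f"` format code to convert to single precision and back; `to_float32` is a hypothetical helper name, and the rounding is assumed to be the platform's default round to nearest, ties to even.

```python
import struct


def to_float32(x):
    """Round a Python float to single precision and return the result as a float."""
    return struct.unpack("f", struct.pack("f", x))[0]


for k in (16777215, 16777216, 16777217, 16777218, 16777219):
    print(k, int(to_float32(float(k))))
# 16777215 16777215
# 16777216 16777216
# 16777217 16777216
# 16777218 16777218
# 16777219 16777220
```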

Maximum integer in double precision¶

Let's find the maximum integer stored exactly so that all the integers below it are stored exactly.

If we follow the same reasoning as in the previous case, taking into account that in this format we have $52$ bits plus the hidden bit, we can store $53$ significant binary digits.

Therefore, the largest integer for which we can store all the digits is

$$\mathtt{1\,\,\overbrace{1111\cdots11}^{52\,bits}} \quad\rightarrow\quad \mathtt{1.\,\overbrace{1111\cdots11}^{52\,bits}}\times 2^{52} $$

The next number is

$$\mathtt{10\overbrace{0000\cdots0{\color{red}0}}^{52\,bits}}= \mathtt{1\overbrace{0000\cdots00}^{52\,bits}{\color{red}0}}\quad\rightarrow\quad \mathtt{1.\,\overbrace{0000\cdots00}^{52\,bits}}\times 2^{53}$$

We must drop the last $\mathtt{{\color{red}0}}$ but, because it is a zero, there is no error. This number, in decimal, is

$$\fbox{$2^{53}=9007199254740992$}$$

The following integer number is

$$\mathtt{10\overbrace{0000\cdots0{\color{red}1}}^{52\,bits}}= \mathtt{1\overbrace{0000\cdots00}^{52\,bits}{\color{red}1}}\quad\rightarrow\quad \mathtt{1.\,\overbrace{0000\cdots00}^{52\,bits}}\times 2^{53}$$

We cannot store the last $\mathtt{{\color{red}1}}$, so this number has no exact representation and we must round it to the previous representable number or to the next one (in this case, to the previous one).
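Since Python floats are double precision on common platforms (an assumption of this sketch), the limit $2^{53}$ can be observed directly by converting exact integers to `float` and comparing them back. The name `LIMIT` is illustrative.

```python
LIMIT = 2 ** 53   # 9007199254740992

print(float(LIMIT) == LIMIT)            # True: 2**53 is stored exactly
print(float(LIMIT + 1) == LIMIT + 1)    # False: 2**53 + 1 has no exact representation
print(float(LIMIT + 1) == LIMIT)        # True: it is rounded to the previous number
print(float(LIMIT + 2) == LIMIT + 2)    # True: even integers are still exact here
```

Python compares an `int` with a `float` exactly, which is why the second comparison detects the rounding.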

Special values¶

Our system is

$$ \begin{array}{|c|c|c|} \hline \mathtt{sign} & \mathtt{exponent} & \mathtt{mantissa}\\ \hline \mathtt{s} & \mathtt{e_1e_2e_3} & \mathtt{1.m_1m_2}\\ \hline \end{array} $$

Zero is represented by setting all the bits of the exponent and the mantissa to zero.

$$ \begin{array}{|c|c|c|c|} \hline &\mathtt{sign} & \mathtt{exponent} & \mathtt{mantissa}\\ \hline +0&\mathtt{0} & \mathtt{000} & \mathtt{00}\\ \hline -0&\mathtt{1} & \mathtt{000} & \mathtt{00}\\ \hline \end{array} $$

Infinity is represented by setting all the bits of the exponent to $1$ and all the bits of the mantissa to $0.$

$$ \begin{array}{|c|c|c|c|} \hline &\mathtt{sign} & \mathtt{exponent} & \mathtt{mantissa}\\ \hline +\infty&\mathtt{0} & \mathtt{111} & \mathtt{00}\\ \hline -\infty&\mathtt{1} & \mathtt{111} & \mathtt{00}\\ \hline \end{array} $$

$\mathtt{NaN}$ (Not a Number) is represented by setting all the bits of the exponent to $1$ and the bits of the mantissa to any combination other than all zeros, for example

$$ \begin{array}{|c|c|c|c|} \hline &\mathtt{sign} & \mathtt{exponent} & \mathtt{mantissa}\\ \hline \mathtt{NaN}&\mathtt{0} & \mathtt{111} & \mathtt{01}\\ \hline \end{array} $$
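Putting the pieces together, every $6$-bit pattern of the toy format can be classified as zero, denormal, normal, infinity or $\mathtt{NaN}$. The decoder below is a sketch under the rules described in this exercise, not a reference implementation; `decode` is a hypothetical name and it expects a string of six `0`/`1` characters.

```python
import math

M_BITS, N_BITS = 3, 2                 # exponent bits, stored mantissa bits
BIAS = 2 ** (M_BITS - 1) - 1


def decode(bits):
    """Decode a 6-character bit string 's eee mm' (without spaces) into a float."""
    sign = -1.0 if bits[0] == "1" else 1.0
    stored_exp = int(bits[1:1 + M_BITS], 2)
    mant = int(bits[1 + M_BITS:], 2)
    if stored_exp == 2 ** M_BITS - 1:          # exponent all ones: infinity or NaN
        return sign * math.inf if mant == 0 else math.nan
    if stored_exp == 0:                        # zero or denormal: hidden bit 0
        return sign * (mant / 2 ** N_BITS) * 2.0 ** (1 - BIAS)
    return sign * (1 + mant / 2 ** N_BITS) * 2.0 ** (stored_exp - BIAS)


for pattern in ("000000", "100000", "000001", "000100", "011011", "011100", "011101"):
    print(pattern, decode(pattern))
# 000000 0.0
# 100000 -0.0
# 000001 0.0625
# 000100 0.25
# 011011 14.0
# 011100 inf
# 011101 nan
```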
