A computer stores floating point numbers in binary format with 6 bits using a standard similar to IEEE 754. The first bit is for the sign, the next three bits are for the biased exponent and the last two bits are for the mantissa.
$$ \begin{array}{|c|c|c|} \hline \mathtt{sign} & \mathtt{exponent} & \mathtt{mantissa}\\ \hline \mathtt{s} & \mathtt{e_1e_2e_3} & \mathtt{m_1m_2}\\ \hline \end{array} $$We have $m=3$ bits for the exponent. Therefore there are $2^m=2^3=8$ different combinations and, in principle, we can represent $8$ exponent values, from $0$ to $7$. The first one, $\mathtt{000}$, and the last one, $\mathtt{111}$, are reserved, which leaves the stored values $1$ to $6$. Since the representation is biased, we subtract the bias $$bias=2^{m-1}-1=2^{3-1}-1=2^2-1=4-1=3$$ from each stored value to get the represented exponent: $1-3=-2$ for the smallest and $6-3=3$ for the largest.
Thus, in this standard, the minimum exponent is $\begin{array}{|c|}\hline e_{min}=-2\\ \hline\end{array}$ and the maximum is $\begin{array}{|c|}\hline e_{max}=3\\ \hline\end{array}.$
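The bias and exponent range above can be reproduced with a few lines of Python. This is a minimal sketch; the helper name `exponent_range` is hypothetical, and it assumes only the rule stated in the text (all-zeros and all-ones exponents are reserved).

```python
def exponent_range(m: int) -> tuple[int, int, int]:
    """Return (bias, e_min, e_max) for an IEEE-like format with m exponent bits."""
    bias = 2 ** (m - 1) - 1
    # Stored exponents 0 (all zeros) and 2**m - 1 (all ones) are reserved,
    # so the usable stored values run from 1 to 2**m - 2.
    e_min = 1 - bias
    e_max = (2 ** m - 2) - bias
    return bias, e_min, e_max


print(exponent_range(3))   # (3, -2, 3)        toy format
print(exponent_range(8))   # (127, -126, 127)  single precision
print(exponent_range(11))  # (1023, -1022, 1023) double precision
```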
Let's remember that the format is
With $m = 3$ bits for the exponent, we have $2^m$ different numbers. But as the first one is $\mathtt{000}$ and the last one $\mathtt{111}$ and they are reserved, we have
$$2^m-2=2^3-2=8-2=6\;\mathrm{exponents}$$We have $n = 2$ bits for the mantissa and that means $2^n$ different combinations. The hidden bit is always 1, and it does not provide new values. Thus
$$2^n=2^2=4\;\mathrm{mantissas}$$and
$$6\,\mathrm{exponents}\times4\,\mathrm{mantissas} = \fbox{24 normal positive numbers}$$If we represent these numbers on the real line
We can see that:
The normal numbers are
while the denormal (or denormalized or subnormal) numbers are
If we consider that the hidden bit is zero, this allows us to represent numbers smaller than the minimum normal number.
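For the 6-bit toy format, the positive normal and denormal numbers can be listed explicitly. The sketch below uses exact fractions and assumes the parameters derived in the text ($n=2$ mantissa bits, exponents from $-2$ to $3$); the variable names are illustrative.

```python
from fractions import Fraction

E_MIN, E_MAX, N = -2, 3, 2  # toy format parameters from the text

# Normal numbers: hidden bit 1, value (1 + k/4) * 2**e.
normals = sorted(
    (1 + Fraction(k, 2 ** N)) * Fraction(2) ** e
    for e in range(E_MIN, E_MAX + 1)
    for k in range(2 ** N)
)

# Denormal numbers: hidden bit 0, exponent fixed at E_MIN, mantissa not all zeros.
denormals = sorted(
    Fraction(k, 2 ** N) * Fraction(2) ** E_MIN for k in range(1, 2 ** N)
)

print(len(normals), float(normals[0]), float(normals[-1]))  # 24 0.25 14.0
print([float(x) for x in denormals])                        # [0.0625, 0.125, 0.1875]
```

Every denormal value is below the smallest normal number, $0.25$, which is the point of setting the hidden bit to zero.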
In the IEEE 754 standard, denormal numbers are represented with the exponent field $\mathtt{0000\,0000}$ in single precision and $\mathtt{0000\,0000\,000}$ in double precision. However, their exponent is interpreted as the minimum exponent, that is, $-126$ in single precision and $-1022$ in double precision.
Let's see an example
The following number is represented in single precision according to the IEEE 754 standard.
$$ \begin{array}{|c|c|c|} \hline \texttt{sign} & {\texttt{exponent}} & {\texttt{mantissa}}\\ \hline \texttt{0} & \mathtt{0000\,0000} & \mathtt{\color{red}{000}1\,0110\,0000\,0000\,0000\,000}\\ \hline \end{array} $$What is its value in base $10$? What is the precision of the represented number?
Since its exponent is $\mathtt{0000\,0000}$ and its mantissa is not zero, it is a denormal number:
Therefore, it represents the number
$$ \mathtt{0.0001011}\cdot2^{-126} $$which corresponds to the following number in base $10$
$$ (2^{-4}+2^{-6}+2^{-7})\cdot2^{-126}\approx1.0102\cdot10^{-39}. $$This number has a precision of only $20$ bits (not $24$, as corresponds to normal numbers in single precision) because, for precision purposes, the three zeros to the left of the first $\mathtt{1}$ of the mantissa (the three zeros in red, $\mathtt{\color{red}{000}}$) do not count.
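The value and the precision of this example can be cross-checked by reinterpreting the bit pattern with Python's `struct` module. This is only a sketch; the variable names are illustrative, and the precision is counted, as in the text, as the mantissa bits that remain after the leading zeros.

```python
import struct

sign = "0"
exponent = "00000000"
mantissa = "0001" "0110" "0000" "0000" "0000" "000"  # 23 bits, as in the table

# Reinterpret the 32-bit pattern as a big-endian IEEE 754 single.
bits = int(sign + exponent + mantissa, 2)
value = struct.unpack(">f", bits.to_bytes(4, "big"))[0]

# Same value from the formula in the text: 0.0001011 (binary) * 2**-126.
by_formula = (2 ** -4 + 2 ** -6 + 2 ** -7) * 2.0 ** -126

# Significant bits: the 23 mantissa bits minus the leading zeros.
precision = len(mantissa.lstrip("0"))

print(value == by_formula)  # True
print(f"{value:.4e}")       # 1.0102e-39
print(precision)            # 20
```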
We return to our standard. Remember that the format is
and for denormal numbers
Since the numbers that we are going to represent are denormal, their exponent is $\mathtt{000}$ and their mantissa cannot be zero:
In this standard, the denormal numbers are
Since the exponent, the sign (we are only counting the positive numbers) and the hidden bit do not vary, we only have to take into account the number of mantissa bits, $n=2$, and remove the case where the mantissa is all zeros. So
$$2^n-1=2^2-1=\fbox{3 denormal numbers}$$We can see that:
Let's remember that our toy system is
The machine $\epsilon$ is defined as the distance between the number $1$ and the next number that can be exactly represented in this standard.
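Remember that in this standard $1$ and the following number are
$$\mathtt{1.00}\times2^{0}=1\qquad\mathrm{and}\qquad\mathtt{1.01}\times2^{0}=1.25$$
We calculate the distance by subtracting these two values
$$\mathtt{1.01}\times2^{0}-\mathtt{1.00}\times2^{0}=\mathtt{0.01}\times2^{0}=2^{-2}=0.25$$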
That is, for this standard $$\fbox{$\epsilon=0.25$}$$
The machine $\epsilon$ is an upper bound on the relative rounding error that we have when we store a number in this standard. We will check this in the next section.
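For single precision, the representation of one and the next exactly representable number, taking into account that we have $23$ bits of mantissa plus the hidden bit, is
$$\mathtt{1.\,\overbrace{0000\cdots000}^{23\,bits}}\times 2^{0}\qquad\mathrm{and}\qquad\mathtt{1.\,\overbrace{0000\cdots001}^{23\,bits}}\times 2^{0}$$
If we subtract these two numbers, we get
$$\epsilon=2^{-23}\approx1.19\cdot10^{-7}$$
If we apply the definition to double precision, the representation of one and the next exactly representable number, considering that we have $52$ bits of mantissa plus the hidden bit, is
$$\mathtt{1.\,\overbrace{0000\cdots00}^{52\,bits}}\times 2^{0}\qquad\mathrm{and}\qquad\mathtt{1.\,\overbrace{0000\cdots01}^{52\,bits}}\times 2^{0}$$
and the gap between the two numbers is
$$\epsilon=2^{-52}\approx2.22\cdot10^{-16}$$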
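As a numerical cross-check of the single and double precision gaps, one can add $1$ to the bit pattern of $1.0$ and subtract. This is a minimal sketch; `next_after_one` is a hypothetical helper built on the standard `struct` module, and it relies on the fact that incrementing the bit pattern of a positive float gives the next representable number.

```python
import struct
import sys


def next_after_one(float_fmt: str, int_fmt: str) -> float:
    """Return the representable number right after 1.0 in the given format."""
    (pattern,) = struct.unpack(int_fmt, struct.pack(float_fmt, 1.0))
    (value,) = struct.unpack(float_fmt, struct.pack(int_fmt, pattern + 1))
    return value


eps_single = next_after_one(">f", ">I") - 1.0  # 32-bit float, 32-bit unsigned int
eps_double = next_after_one(">d", ">Q") - 1.0  # 64-bit float, 64-bit unsigned int

print(eps_single == 2.0 ** -23, eps_single)    # True 1.1920928955078125e-07
print(eps_double == 2.0 ** -52)                # True
print(eps_double == sys.float_info.epsilon)    # True
```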
For now, we will plot the absolute and relative errors that appear when we store numbers with this standard. We have seen that we can represent numbers between $0$ and $14.$ When we want to store a number in this range and it does not match any of the numbers in the table, we store it as one of its two neighbors in the table: the closest one if we round, or the one immediately below if we truncate.
We can see that:
The absolute error grows with the number, but the relative error, which is the important one, does not.
The maximum absolute error for truncation is twice the maximum absolute error for rounding.
All relative errors are less than $\epsilon=0.25$ as we said.
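These observations can be checked numerically for the toy format. The sketch below samples the normal range $[0.25, 14]$ on a grid of step $1/64$ (an arbitrary choice) and stores each sample by rounding or by truncation; the `store` helper is hypothetical, and in this sketch ties in rounding go to the lower neighbor.

```python
from fractions import Fraction

# Positive normal numbers of the toy format: (1 + k/4) * 2**e, e in [-2, 3].
table = sorted((1 + Fraction(k, 4)) * Fraction(2) ** e
               for e in range(-2, 4) for k in range(4))


def store(x, mode):
    """Store x as the nearest table entry ("round") or the nearest one below ("truncate")."""
    if mode == "truncate":
        return max(t for t in table if t <= x)
    return min(table, key=lambda t: abs(t - x))


# Sample the normal range [0.25, 14] with step 1/64.
xs = [Fraction(i, 64) for i in range(16, 14 * 64 + 1)]

results = {}
for mode in ("round", "truncate"):
    errors = [abs(store(x, mode) - x) for x in xs]
    max_abs = max(errors)
    max_rel = max(e / x for e, x in zip(errors, xs))
    results[mode] = (max_abs, max_rel)
    print(mode, float(max_abs), float(max_rel))
```

On this grid the largest absolute error for truncation stays just under twice the one for rounding, and both relative errors stay below $\epsilon=0.25$.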
We want to calculate the maximum integer that can be represented exactly so that the next integer cannot be represented exactly.
Remember that our system is
If we look at the represented integers we see that we can represent all the integers from $1$ to $8$ but we cannot represent $9$ exactly. Therefore, the answer to the question would be "8".
How can we reason about this for any similar system? The idea is that if we can store all the digits of a number in a format, there is no error. If we must drop a digit, there could be an error. The current format has two mantissa bits plus the hidden bit, so we can store $3$ significant binary digits.
We can see that:
This value, $8$ for this format, gives us an idea of the capacity of this format to store integers exactly.
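The value $8$ can also be found by brute force. The following sketch rebuilds the set of positive normal numbers of the toy format and walks through the integers until the first gap; the variable names are illustrative.

```python
from fractions import Fraction

# Positive normal numbers of the toy format (2 mantissa bits, exponents -2..3).
representable = {(1 + Fraction(k, 4)) * Fraction(2) ** e
                 for e in range(-2, 4) for k in range(4)}

n = 1
while n + 1 in representable:  # stop at the first integer that is missing
    n += 1

print(n)                  # 8
print(n == 2 ** (2 + 1))  # True: 2**(mantissa bits + 1)
```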
Let's find the maximum integer stored exactly in single precision such that all the integers below it are also stored exactly.
If we follow the same reasoning as in the previous case, taking into account that in this format we have $23$ mantissa bits plus the hidden bit, we can store $24$ significant binary digits.
Therefore, the largest integer for which we can store all the digits is
$$\mathtt{1\,\,1111\,1111\,1111\,1111\,1111\,111}$$that is
$$\mathtt{1\,\,\overbrace{1111\cdots111}^{23\,bits}} \quad\rightarrow\quad \mathtt{1.\,\overbrace{1111\cdots111}^{23\,bits}}\times 2^{23} $$The next number is
$$\mathtt{10\overbrace{0000\cdots00{\color{red}0}}^{23\,bits}}= \mathtt{1\overbrace{0000\cdots000}^{23\,bits}{\color{red}0}}\quad\rightarrow\quad \mathtt{1.\,\overbrace{0000\cdots000}^{23\,bits}}\times 2^{24}$$and we must drop the last $\mathtt{{\color{red}0}}.$ Because it is a zero, there is no error. This number, in decimal, is
$$\fbox{$2^{24}=16777216$}$$The following integer number is
$$\mathtt{10\overbrace{0000\cdots00{\color{red}1}}^{23\,bits}}= \mathtt{1\overbrace{0000\cdots000}^{23\,bits}{\color{red}1}}\quad\rightarrow\quad \mathtt{1.\,\overbrace{0000\cdots000}^{23\,bits}}\times 2^{24}$$We cannot store the last $\mathtt{{\color{red}1}}$, so we do not have an exact representation for this number and we must round it to the previous representable number or to the next one (in this case, to the previous one).
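The behavior around $2^{24}$ can be observed by round-tripping values through a 32-bit float. This is a sketch using the standard `struct` module; `to_single` is a hypothetical helper, and it assumes the platform converts to single precision with the default round-to-nearest mode.

```python
import struct


def to_single(x: float) -> float:
    """Round x to the nearest IEEE 754 single and return it as a Python float."""
    return struct.unpack(">f", struct.pack(">f", x))[0]


print(to_single(2.0 ** 24) == 2 ** 24)          # True: 16777216 is stored exactly
print(to_single(2.0 ** 24 + 1) == 2 ** 24)      # True: 16777217 becomes 16777216
print(to_single(2.0 ** 24 + 2) == 2 ** 24 + 2)  # True: the next even integer still fits
```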
Let's find the maximum integer stored exactly in double precision such that all the integers below it are also stored exactly.
If we follow the same reasoning as in the previous case, taking into account that in this format we have $52$ mantissa bits plus the hidden bit, we can store $53$ significant binary digits.
Therefore, the largest integer for which we can store all the digits is
$$\mathtt{1\,\,\overbrace{1111\cdots11}^{52\,bits}} \quad\rightarrow\quad \mathtt{1.\,\overbrace{1111\cdots11}^{52\,bits}}\times 2^{52} $$The next number is
$$\mathtt{10\overbrace{0000\cdots0{\color{red}0}}^{52\,bits}}= \mathtt{1\overbrace{0000\cdots00}^{52\,bits}{\color{red}0}}\quad\rightarrow\quad \mathtt{1.\,\overbrace{0000\cdots00}^{52\,bits}}\times 2^{53}$$and we must drop the last $\mathtt{{\color{red}0}}.$ Because it is a zero, there is no error. This number, in decimal, is
$$\fbox{$2^{53}=9007199254740992$}$$The following integer number is
$$\mathtt{10\overbrace{0000\cdots0{\color{red}1}}^{52\,bits}}= \mathtt{1\overbrace{0000\cdots00}^{52\,bits}{\color{red}1}}\quad\rightarrow\quad \mathtt{1.\,\overbrace{0000\cdots00}^{52\,bits}}\times 2^{53}$$We cannot store the last $\mathtt{{\color{red}1}}$, so we do not have an exact representation for this number and we must round it to the previous representable number or to the next one (in this case, to the previous one).
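Python's `float` is a double precision number on common platforms, so the limit $2^{53}$ can be observed directly. A short sketch, with `limit` as an illustrative name:

```python
limit = 2 ** 53

print(float(limit) == limit)          # True: 2**53 is stored exactly
print(float(limit + 1) == limit)      # True: 2**53 + 1 is rounded down to 2**53
print(int(float(limit + 1)))          # 9007199254740992
print(float(limit - 1) == limit - 1)  # True: the integer just below is exact
```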
Our system is
Zero is represented with all the bits of the exponent and of the mantissa set to $0$.
Infinity is represented with all the bits of the exponent set to $1$ and all the bits of the mantissa set to $0$.
$\mathtt{NaN}$ (Not a Number) is represented with all the bits of the exponent set to $1$ and any combination of mantissa bits other than all zeros, for example $\mathtt{0\;111\;01}$.
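The rules for zero, infinity, $\mathtt{NaN}$, denormal and normal numbers can be gathered into a small decoder for the 6-bit toy format. This is only a sketch under the conventions described in this text (bias $3$, denormal exponent $-2$); the function name `decode` and the spaced pattern strings are illustrative choices.

```python
def decode(bits: str) -> float:
    """Decode a 6-bit pattern 's eee mm' of the toy format into a Python float."""
    bits = bits.replace(" ", "")
    sign = -1.0 if bits[0] == "1" else 1.0
    e, m = int(bits[1:4], 2), int(bits[4:6], 2)
    if e == 0b111:
        # Reserved exponent: infinity if the mantissa is zero, NaN otherwise.
        return sign * float("inf") if m == 0 else float("nan")
    if e == 0:
        # Zero or denormal: hidden bit 0, exponent interpreted as -2.
        return sign * (m / 4) * 2.0 ** -2
    # Normal number: hidden bit 1, stored exponent minus the bias 3.
    return sign * (1 + m / 4) * 2.0 ** (e - 3)


for pattern in ("0 000 00", "0 111 00", "1 111 00",
                "0 111 01", "0 011 00", "0 110 11"):
    print(pattern, decode(pattern))
```

The six patterns decode to `0.0`, `inf`, `-inf`, `nan`, `1.0` and `14.0`, in that order.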