The most widely used floating-point format is defined by the IEEE 754 standard. IEEE stands for Institute of Electrical and Electronics Engineers. This standard is currently used by almost all processors.
The first IEEE 754 standard was published in 1985 and included only the binary representation. Its basic formats were single and double precision. In 2008, a second version was published, which also included the representation of numbers in decimal, with two basic formats. A basic binary format with quadruple precision was added as well. In 2019, a third version, with minor modifications, was published.
The five basic formats and their most important parameters are
The default format in numerical computing is currently double-precision binary. We will study the single- and double-precision binary formats in more detail.
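Given the decimal number $314.15$, we normalize it by moving the decimal point until exactly one nonzero digit remains to its left, and compensating with a power of the base.
Thus, the previous number, normalized, is
$$+3.1415\times10^2$$whose components are the sign ($+$), the mantissa ($3.1415$) and the exponent ($2$).
We must add that the base (or radix) is $10$.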
Given the binary number $10101.11001$, we apply the same steps but must take into account that the base is now 2. The normalized number is
$$+1.010111001\times2^4$$with sign $+$, mantissa $1.010111001$ and exponent $4$.
And the base (or radix) is $2$.
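The binary normalization above can be checked with a short Python sketch. The helper name `normalize_binary` is hypothetical, and the sketch assumes the input is a finite, non-zero number; it relies on `math.frexp`, which returns a fraction in $[0.5, 1)$, so the result is rescaled to have exactly one digit to the left of the point.

```python
import math

def normalize_binary(x):
    """Return (sign, significand, exponent) with 1 <= significand < 2.

    Hypothetical helper for illustration; only finite non-zero inputs.
    """
    if x == 0 or math.isinf(x) or math.isnan(x):
        raise ValueError("only finite non-zero values can be normalized")
    sign = -1 if x < 0 else 1
    # math.frexp gives m in [0.5, 1) and e such that abs(x) == m * 2**e
    m, e = math.frexp(abs(x))
    # Shift one binary place so that the leading digit is 1
    return sign, m * 2, e - 1

# 10101.11001 in base 2: integer part plus fractional part over 2**5
value = int("10101", 2) + int("11001", 2) / 2**5
print(normalize_binary(value))  # (1, 1.361328125, 4)
```

The significand $1.361328125$ is the decimal value of $1.010111001$ in base 2, and the exponent $4$ matches the normalized form above.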
The exponent in this standard is always an integer stored in biased representation. Let's see how it works with an example. Suppose that the exponent field has $m = 4$ bits. It can hold $2^m=2^4=16$ different bit patterns; the first and the last are reserved, and the remaining ones represent the following exponents
$$ \begin{array}{cccc} \hline \mathrm{Binary} & \mathrm{Face}& & \mathrm{Signed}\,\mathrm{integer}\\ \mathrm{representation} & \mathrm{value}& & \mathrm{(Exponent)}\\ (m=4\; bits)& & & bias=2^{m-1}-1 \\ \hline \mathtt{0000} & 0 & & Reserved\\ \mathtt{0001} & 1 & & -6\\ \mathtt{0010} & 2 & & -5\\ \mathtt{0011} & 3 & & -4\\ \mathtt{0100} & 4 & & -3\\ \mathtt{0101} & 5 &bias & -2\\ \mathtt{0110} & 6 & \longrightarrow& -1\\ \mathtt{0111} & 7 & -7& 0\\ \mathtt{1000} & 8 & & 1\\ \mathtt{1001} & 9 & & 2\\ \mathtt{1010} & 10 & & 3\\ \mathtt{1011} & 11 & & 4\\ \mathtt{1100} & 12 & & 5\\ \mathtt{1101} & 13 & & 6\\ \mathtt{1110} & 14 & & 7\\ \mathtt{1111} & 15 & & Reserved\\ \hline \end{array} $$
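The table can be regenerated for any field width with a few lines of Python. This is a minimal sketch; the function name `biased_exponents` is hypothetical, and `None` is used here to mark the two reserved patterns.

```python
def biased_exponents(m):
    """Map each m-bit pattern to the exponent it represents.

    The all-zeros and all-ones patterns are reserved (marked None).
    """
    bias = 2 ** (m - 1) - 1
    table = {}
    for face in range(2 ** m):
        bits = format(face, "0{}b".format(m))  # zero-padded binary string
        reserved = face == 0 or face == 2 ** m - 1
        table[bits] = None if reserved else face - bias
    return table

for bits, exponent in biased_exponents(4).items():
    print(bits, "Reserved" if exponent is None else exponent)
```

With `m = 8` or `m = 11` the same function gives the exponent ranges of the single and double precision formats.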
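The single-precision format uses 32 bits (4 bytes): 1 bit for the sign, 8 bits for the exponent and 23 bits for the mantissa.
The sign bit is $\mathtt{0}$ for a positive number and $\mathtt{1}$ for a negative one.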
We have $m=8$ bits for the exponent. Therefore there are $2^m=2^8=256$ different combinations and, in principle, we can represent $256$ numbers. As we start at $0$, we end at $255$. The first number, $\mathtt{0000\,0000}$, and the last one, $\mathtt{1111\,1111}$, are reserved (we will see later for what). And since the representation is biased, we subtract the bias, which is
$$bias = 2^{m-1}-1 = 2^{8-1}-1 = 2^7-1 = 127$$
to get the represented value.
Thus, the minimum represented exponent is $e_{min} = -126$ and the maximum $e_{max} = 127.$
In binary format, in the normalized representation, the digit to the left of the point is always $\mathtt{1}$. It is not stored and is called the hidden bit. So even though only $32-1-8=23$ binary digits are stored, we effectively use one more, and the precision is $24.$
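To see the three fields of a single precision number, the bit pattern can be extracted with Python's standard `struct` module. This is a minimal sketch; the function name `float32_fields` is hypothetical, and the input is rounded to single precision when it is packed.

```python
import struct

def float32_fields(x):
    """Return (sign bit, exponent face value, 23-bit mantissa field)."""
    # Pack as big-endian single precision, then reread as a 32-bit integer
    (bits,) = struct.unpack(">I", struct.pack(">f", x))
    sign = bits >> 31                # 1 bit
    exponent = (bits >> 23) & 0xFF   # 8 bits, still biased
    mantissa = bits & 0x7FFFFF       # 23 bits, hidden bit not included
    return sign, exponent, mantissa

sign, exponent, mantissa = float32_fields(21.78125)
print(sign, exponent - 127, format(mantissa, "023b"))
# 0 4 01011100100000000000000
```

The value $21.78125$ is $1.010111001\times2^4$ in binary, so the stored mantissa starts with the digits to the right of the point and the leading $\mathtt{1}$ does not appear.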
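The double-precision format uses 64 bits (8 bytes): 1 bit for the sign, 11 bits for the exponent and 52 bits for the mantissa.
The sign bit is $\mathtt{0}$ for a positive number and $\mathtt{1}$ for a negative one.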
We have $m=11$ bits for the exponent. Therefore there are $2^m=2^{11}=2048$ different combinations and, in principle, we can represent $2048$ numbers. As we start at $0$, we end at $2047$. The first number, $\mathtt{0000\,0000\,000}$, and the last one, $\mathtt{1111\,1111\,111}$, are reserved. And since the representation is biased, we subtract the bias, which is
$$bias = 2^{m-1}-1 = 2^{11-1}-1 = 2^{10}-1 = 1023$$
to get the represented value.
Thus, the minimum represented exponent is $e_{min} = -1022$ and the maximum $e_{max} = 1023.$
In binary format, in the normalized representation, the digit to the left of the point is always $\mathtt{1}$. It is not stored and is called the hidden bit. So even though only $64-1-11=52$ binary digits are stored, we effectively use one more, and the precision is $53.$
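The double precision parameters can be cross-checked against the values Python reports in `sys.float_info`. This sketch assumes that the Python `float` type is an IEEE 754 double, which holds on common platforms but is an assumption about the interpreter; the variable names are illustrative.

```python
import sys

m = 11                        # exponent bits
bias = 2 ** (m - 1) - 1       # 1023
e_min = 1 - bias              # smallest non-reserved face value is 1
e_max = (2 ** m - 2) - bias   # largest non-reserved face value is 2**m - 2
precision = 52 + 1            # stored digits plus the hidden bit

# Smallest and largest normalized numbers built from these parameters
smallest_normal = 2.0 ** e_min
largest_normal = (2.0 - 2.0 ** (1 - precision)) * 2.0 ** e_max

print(e_min, e_max, precision)
print(smallest_normal == sys.float_info.min)
print(largest_normal == sys.float_info.max)
print(precision == sys.float_info.mant_dig)
```

Note that `sys.float_info.min_exp` and `sys.float_info.max_exp` follow a different convention (significand in $[0.5, 1)$), so they are one larger than $e_{min}$ and $e_{max}$ as defined here.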
If the number
$$ \begin{array}{|c|c|c|} \hline \mathrm{Sign} & \mathrm{Exponent} & \mathrm{Mantissa}\\ 1\,\mathrm{bit} & 8\,\mathrm{bits} & 23\,\mathrm{bits} \\ \hline \mathtt{1}&\mathtt{1000\,1101}&\mathtt{0110\,1000\,0000\,0000\,0000\,000}\\ \hline \end{array} $$follows the IEEE 754 single precision floating point representation, give its representation in the decimal base.
The sign bit is $\mathtt{{\color{red}1}}$, so the sign is negative.
The face value of the exponent is $\mathtt{{\color{red}{1000\,1101}}}$. If we take into account the position of the digits
$$ \begin{array}{cccccccc} \tiny{(7)}&\tiny{(6)}&\tiny{(5)}&\tiny{(4)}& \,\tiny{(3)}&\tiny{(2)}&\tiny{(1)}&\tiny{(0)}\\ \mathtt{1}&\mathtt{0}&\mathtt{0}&\mathtt{0}&\,\mathtt{1}&\mathtt{1}&\mathtt{0}&\mathtt{1} \end{array} $$we have
$$2^7+2^3+2^2+2^0=128+8+4+1=141$$And, as we have $m=8$ bits, $bias = 2^{m-1}-1 = 2^{8-1}-1=2^7-1=128-1=127$
And the exponent value is $141-127=14$.
$$ \begin{array}{cccc} \hline \mathrm{Binary}\; \mathrm{number} & \mathrm{Face}\; \mathrm{value} & & \mathrm{Represented}\; \mathrm{value}\\ \hline \mathtt{0000\,0000}& 0& & R\\ \mathtt{0000\,0001}& 1& & -126\\ \mathtt{0000\,0010}& 2& & -125\\ \mathtt{0000\,0011}& 3& & -124\\ \cdots & \cdots & -127 & \cdots\\ \cdots & \cdots & \longrightarrow & \cdots\\ \mathtt{{\color{red}{1000\,1101}}}& {\color{red}{141}}& & {\color{red}{14}}\\ \cdots & \cdots & & \cdots\\ \cdots & \cdots & & \cdots\\ \mathtt{1111\,1100}& 252 & & 125 \\ \mathtt{1111\,1101}& 253 & & 126 \\ \mathtt{1111\,1110}& 254 & & 127 \\ \mathtt{1111\,1111}& 255 & & R\\ \hline \end{array} $$The digits for the mantissa are $\mathtt{{\color{ForestGreen}{0110\,1000\,0000\,0000\,0000\,000}}}$ and taking into account the hidden bit, that is $\mathtt{1}$
$$\mathtt{1.{\color{ForestGreen}{0110\,1000\,0000\,0000\,0000\,000}}}$$or, dropping the trailing zeros,
$$\mathtt{1.{\color{ForestGreen}{0110\,1}}}$$$$ \begin{array}{ccccccc} \tiny{(0)}&&\tiny{(-1)}&\tiny{(-2)}&\tiny{(-3)}&\tiny{(-4)}&\tiny{(-5)}\\ \mathtt{1}&.&\mathtt{{\color{ForestGreen}0}}&\mathtt{{\color{ForestGreen}1}}&\mathtt{\color{ForestGreen}1}&\,\mathtt{{\color{ForestGreen}0}}&\mathtt{\color{ForestGreen}1} \end{array} $$Summing up, the number is
$${\color{red}-}\mathtt{1.{\color{ForestGreen}{0110\,1}}}\times 2^{\color{red}{14}} \quad \longrightarrow \quad -(1+2^{-2}+2^{-3}+2^{-5})\times2^{14} = \fbox{$-$23040} $$
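The same decoding steps can be written as a short Python sketch and compared with the result of reinterpreting the 32 bits through the standard `struct` module. The function name `decode_float32` is hypothetical, and it handles only normalized numbers, that is, exponent fields other than the two reserved patterns.

```python
import struct

def decode_float32(sign_bit, exponent_bits, mantissa_bits):
    """Decode a normalized single precision number from its three bit strings."""
    sign = -1 if sign_bit == "1" else 1
    exponent = int(exponent_bits, 2) - 127            # subtract the bias
    significand = 1 + int(mantissa_bits, 2) / 2**23   # add the hidden bit
    return sign * significand * 2.0 ** exponent

sign_bit = "1"
exponent_bits = "10001101"
mantissa_bits = "01101" + "0" * 18   # 23 bits in total

value = decode_float32(sign_bit, exponent_bits, mantissa_bits)
print(value)  # -23040.0

# Cross-check: reinterpret the same 32 bits as a single precision float
bits = int(sign_bit + exponent_bits + mantissa_bits, 2)
(check,) = struct.unpack(">f", struct.pack(">I", bits))
print(check == value)  # True
```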