In the **decimal system** the number $105.8125$ means:

$$105.8125=1\cdot 10^2+0\cdot 10^1+5\cdot 10^0+8\cdot 10^{-1}+1\cdot 10^{-2}+2\cdot 10^{-3}+5\cdot 10^{-4}$$

i.e., it is a linear combination of powers of $10$, each multiplied by one of the $10$ digits $0,1,\ldots,9$.

Computers use the **binary system** because it matches the natural way an electronic device works: a switch is either on or off. Therefore, only $0$s and $1$s need to be stored.

In the binary system numbers are represented as linear combinations of powers of $2$, multiplied by $0$s and $1$s:

$$(105.8125)_{10}=2^6+2^5+2^3+2^0+2^{-1}+2^{-2}+2^{-4}=(1101001.1101)_2 \qquad (1)$$

Conversion of the integer part is achieved by repeatedly dividing by $2$. The remainders of these divisions are the digits in base $2$, from least to most significant.

$$ \begin{array}{lrrrrrrr} \text{Quotients} & 105 & 52 & 26 & 13 & 6 & 3 & 1\\ \text{Remainders} & 1 & 0 & 0 & 1 & 0 & 1 & 1 \end{array} $$

Reading the remainders from the last to the first, $(105)_{10}$ in the binary system is $(1101001)_2$.
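The repeated-division procedure can be sketched as follows (a hypothetical implementation of the `de2bi_a` function used below; the actual `Exercise1.py` may differ):

```python
def de2bi_a(x):
    """Integer part of x as a list of binary digit characters."""
    n = int(x)
    if n == 0:
        return ['0']
    digits = []
    while n > 0:
        digits.append(str(n % 2))  # remainder of the division by 2
        n //= 2                    # the quotient feeds the next division
    digits.reverse()               # remainders come out least significant first
    return digits

print(''.join(de2bi_a(105.8125)))  # 1101001
```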

**Remark.** Python's function `bin` performs this conversion.

In [1]:

```
bin(105)
```

Out[1]:

```
'0b1101001'
```

In [2]:

```
%run Exercise1.py
x = 105.8125
binary1 = de2bi_a(x)
print ''.join(binary1) # if binary1 is stored as a list of strings
```

To convert the decimal part, we multiply by $2$, drop the integer part (which is the next binary digit) and repeat until the fractional part becomes $0$.

$$ \begin{array}{lrrrr} \text{Fractional part} & 0.8125& 0.625 & 0.25 & 0.5 \\ \text{Integer part} & 1 & 1 & 0 & 1 \end{array} $$

Write a function, `de2bi_b.py`, to convert the decimal part of a base $10$ number into a binary number. Use it for `x = 105.8125`. Limit the maximum number of fractional digits in binary to `23`. Try also with `x = 14.1`.
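A possible sketch of `de2bi_b` (an assumption of how the exercise script might look, not the authors' solution):

```python
def de2bi_b(x, max_digits=23):
    """Fractional part of x as a list of binary digit characters."""
    frac = x - int(x)
    digits = []
    while frac > 0 and len(digits) < max_digits:
        frac *= 2
        digit = int(frac)          # the integer part is the next binary digit
        digits.append(str(digit))
        frac -= digit              # keep only the fractional part
    return digits

print(''.join(de2bi_b(105.8125)))  # 1101
print(len(de2bi_b(14.1)))          # 23: the expansion of 0.1 never terminates
```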

In [3]:

```
%run Exercise2.py
x = 105.8125
binary2 = de2bi_b(x)
print ''.join(binary2)
```

And the number `105.8125` (in decimal) is written, in binary:

In [4]:

```
print ''.join(binary1 + ['.'] + binary2)
```

In [5]:

```
x = 14.1
binary3 = de2bi_a(x)
binary4 = de2bi_b(x)
print ''.join(binary3 + ['.'] + binary4)
```

Using $(1)$ for converting $(1101001.1101)_2$ to decimal base we get

In [6]:

```
1*(2**6)+1*(2**5)+0*(2**4)+1*(2**3)+0*(2**2)+0*(2**1)+1*(2**0)+1*(2**-1)+1*(2**-2)+0*(2**-3)+1*(2**-4)
```

Out[6]:

```
105.8125
```

If we have $m$ digits or memory bits then we may store $2^m$ different binary numbers.

If we only consider the positive integers then we may represent numbers between

$$ (00\ldots 00)_2=(0)_{10} \quad \mathrm{and} \quad (11\ldots 11)_2=(2^m-1)_{10}. $$

For instance, for $m=3$ bits we may represent positive integers from $0$ to $7$:

$$ \begin{array}{|c|c|} \hline Base \; 10 & Base \; 2\\ \hline \mathtt{0} & \mathtt{000}\\ \mathtt{1} & \mathtt{001}\\ \mathtt{2} & \mathtt{010}\\ \mathtt{3} & \mathtt{011}\\ \mathtt{4} & \mathtt{100}\\ \mathtt{5} & \mathtt{101}\\ \mathtt{6} & \mathtt{110}\\ \mathtt{7} & \mathtt{111}\\ \hline \end{array} $$

In *signed integers* (sign-magnitude representation), the first bit is used to store the sign: $0$ for positive and $1$ for negative. The other $m-1$ digits are used for the unsigned number and, therefore, we may write $2^{m-1}-1$ positive numbers, the same amount of negative numbers and two zeros, one with positive sign and the other with negative sign. Therefore, the range of representation is $[-2^{m-1}+1, 2^{m-1}-1]$. For instance, if $m=3$ we may represent numbers from $-3$ to $3$:

$$ \begin{array}{|c|c|} \hline Base \; 10 & Base \; 2\\ \hline \mathtt{-3} & \mathtt{111}\\ \mathtt{-2} & \mathtt{110}\\ \mathtt{-1} & \mathtt{101}\\ \mathtt{-0} & \mathtt{100}\\ \mathtt{+0} & \mathtt{000}\\ \mathtt{1} & \mathtt{001}\\ \mathtt{2} & \mathtt{010}\\ \mathtt{3} & \mathtt{011}\\ \hline \end{array} $$

**Example**

Compute how $(-80)_{10}$ is stored when using a signed (sign-magnitude) 8-bit binary representation.

*Solution*: the representation is similar to that of $(80)_{10}=(1010000)_2$, but with the first digit equal to $1$, due to the negative sign. So we have
$(-80)_{10}=(11010000)_2$.
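Sign-magnitude encoding is easy to sketch in code (a small illustrative helper, not part of the course exercises):

```python
def sign_magnitude(n, m=8):
    """m-bit sign-magnitude pattern: 1 sign bit + (m-1) magnitude bits."""
    sign = '1' if n < 0 else '0'
    return sign + format(abs(n), '0%db' % (m - 1))

print(sign_magnitude(-80))  # 11010000
print(sign_magnitude(80))   # 01010000
```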

To avoid the double representation of zero, *two's complement* defines negative integers by taking the digits of the corresponding positive integer, changing $0$s to $1$s and $1$s to $0$s, and adding $1$ to the result. In this way, the sum of a number and its opposite is always $0$.

$$ \begin{array}{|c|c|} \hline Base \; 10 & Base \; 2\\ \hline \mathtt{-4} & \mathtt{100}\\ \mathtt{-3} & \mathtt{101}\\ \mathtt{-2} & \mathtt{110}\\ \mathtt{-1} & \mathtt{111}\\ \mathtt{0} & \mathtt{000}\\ \mathtt{1} & \mathtt{001}\\ \mathtt{2} & \mathtt{010}\\ \mathtt{3} & \mathtt{011}\\ \hline \end{array} $$

**Example.** To represent $(-2)_{10}$ we start writing $(2)_{10}$ in binary form, $(010)_2$. Then, we

- invert its digits $\rightarrow (101)_2$
- add $001$ (take into account that $0+0=0, \quad 0+1=1, \quad 1+1=10$) $\rightarrow (110)_2$

So the property $(010)_2+(110)_2=(000)_2$ holds (the carry out of the leftmost bit is discarded).

In this case, the first bit of a negative number is $1$ and that of a positive number is $0$. We may represent integers in the range $[-2^{m-1},2^{m-1}-1]$.

**Example.** Compute how $(-80)_{10}$ is stored when using a two's complement 8-bit binary representation.

*Solution*: since $(80)_{10}=(01010000)_2$,

- we interchange $0$s and $1$s $\rightarrow 10101111,$
- and adding $1$ we obtain the representation $(-80)_{10}=(10110000)_2$.
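In code, the $m$-bit two's-complement pattern of $n$ is simply $n \bmod 2^m$ written in binary (an illustrative helper with a hypothetical name):

```python
def twos_complement(n, m=8):
    """m-bit two's-complement bit pattern of integer n."""
    return format(n % (1 << m), '0%db' % m)

print(twos_complement(-80))    # 10110000
print(twos_complement(2, 3))   # 010
print(twos_complement(-2, 3))  # 110
```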

**Biased representation.** Negative numbers are represented as consecutive positive values, starting from the lowest negative integer. Positive numbers follow. The representation is obtained by adding the *bias* $2^{m-1}$ to the number $x$, i.e., setting
$x_r=x+2^{m-1}\in[0,2^m-1]$.

We see that the representable range is again $[-2^{m-1},2^{m-1}-1]$.
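A sketch of the biased encoding (hypothetical helper); note that the resulting bit strings compare in the same order as the integers they represent:

```python
def biased(n, m=8):
    """m-bit biased (excess 2**(m-1)) representation of integer n."""
    return format(n + (1 << (m - 1)), '0%db' % m)

print(biased(-4, 3))  # 000
print(biased(0, 3))   # 100
print(biased(3, 3))   # 111
# Lexicographic order of the bit patterns matches numeric order:
print(biased(-2, 3) < biased(1, 3))  # True
```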

Which form is actually used for integer storage? Most machines use two's complement for integer numbers, and biased representation for the exponents of floating point numbers (which are integers).

Why? The main reason to use biased representation is its efficiency for comparing numbers. Codes compare floating point numbers very often, and this is done by first comparing their exponents. Only if they are equal, comparison of their mantissas is performed.

In Python, numerical variables may be `int`, `float`, `long` and `complex`. Boolean variables are a subtype of integers.

*Single integers* or `int` have at least 32-bit precision, normally 32 or 64 bits, depending on the computer and operating system we are using.

The maximum single integer value in our computer is given by

In [7]:

```
import sys
sys.maxint
```

Out[7]:

```
9223372036854775807
```

That is, if $m=64$ is the number of bits, the largest integer is given by $2^{m-1}-1.$

In [8]:

```
2**(64-1)-1
```

Out[8]:

```
9223372036854775807L
```

And for negative values, the minimum integer is $-2^{m-1}$

In [9]:

```
-sys.maxint-1
```

Out[9]:

```
-9223372036854775808
```

In [10]:

```
-2**(64-1)
```

Out[10]:

```
-9223372036854775808L
```

Thus, in our case single integers have 64-bit precision.

If we need to represent a larger integer, Python uses `long` integers. Their precision is only limited by the available memory. They are recognized by their last character: `L`.

In [11]:

```
sys.maxint+1
```

Out[11]:

```
9223372036854775808L
```

In [12]:

```
-sys.maxint-2
```

Out[12]:

```
-9223372036854775809L
```

For real numbers, **floating point** representation with **base** $\beta=2$ is used:

$$x=(-1)^s \cdot m \cdot \beta^{e}$$

where:

- $s$ is the **sign**: $1$ for negative numbers and $0$ for positive.
- $m$ is the **mantissa** $(a_1.a_2a_3\ldots)_\beta$, which when *normalized* has values in $1\leq m < \beta$ and a nonzero first digit. In binary base this digit can only be $1$, i.e. in a normalized mantissa we always have $a_1=1$, so it is unnecessary to store it (*hidden bit method*).
- $e$ is the **exponent**, a signed integer (biased binary representation).

Numbers are stored in words of 32 bits (single precision), 64 bits (double precision), or 128 bits (extended precision). In most computers, for Python, the default `float` precision is double precision. In this case the bits are used as follows:

- 1 bit for the sign.
- 11 bits for the exponent,
- 52 bits for the mantissa.
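We can inspect this 1/11/52 layout directly by reinterpreting the bytes of a `float` (a sketch using the standard `struct` module; not part of the course scripts):

```python
import struct

def double_bits(x):
    """Return the (sign, exponent, mantissa) bit fields of a float64."""
    word = struct.unpack('>Q', struct.pack('>d', x))[0]  # 64-bit integer view
    bits = format(word, '064b')
    return bits[0], bits[1:12], bits[12:]

# sign '0', exponent '01111111111' (1023, i.e. e = 1023 - 1023 = 0), mantissa all zeros
print(double_bits(1.0))
```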

In [13]:

```
%matplotlib inline
%run Double_precision.py
```

With $11$ bits for the signed exponent, we have room for $2^{11}=2048$ binary numbers, $0 \leq E \leq 2047$. The first number, $00000000000$, is reserved for zeros and denormalized numbers; and the last, $11111111111$, for `Inf` and `NaN`.

Thus, the exponent takes values $1 \leq E \leq 2046$ and, since the bias is $1023$, these values correspond to the exponents $-1022 \leq E - 1023 \leq 1023$. Therefore, the maximum exponent is $E_{max}=1023$, and the minimum is $E_{min}=-1022$. Dividing $1$ by the smallest normalized number, we get

$$\frac{1}{x_{min}}=\frac{1}{m\beta^{E_{min}}}=\frac{1}{m\beta^{-1022}}=\frac{1}{m}\beta^{1022}<\frac{1}{m}\beta^{1023}$$

so the maximum value is not reached, i.e. there is no overflow.

Write a script `Exercise3.py` that calculates the single precision binary representation of `105.8125`, `120.875`, `7.1` and `-1.41`, following the IEEE 754 standard. Use the scripts written in exercises 1 and 2. Consider only the case where the integer part is greater than zero (the number, in absolute value, is greater than $1$). Round the numbers using truncation.

**Note:** Remember that single precision uses `32` bits: `1` for the sign, `8` for the exponent and `23` for the mantissa. The exponent bias is `127`.
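The output of `Exercise3.py` can be checked against the bit pattern Python itself produces for a 32-bit float (a verification sketch with `struct` and an assumed helper name, independent of the exercise scripts):

```python
import struct

def single_bits(x):
    """Return the (sign, exponent, mantissa) bit fields of a float32."""
    word = struct.unpack('>I', struct.pack('>f', x))[0]  # 32-bit integer view
    bits = format(word, '032b')
    return bits[0], bits[1:9], bits[9:]

# 105.8125 = (1.1010011101)_2 * 2^6, exponent field 6 + 127 = 133 = (10000101)_2
print(single_bits(105.8125))
```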

In [14]:

```
%run Exercise3.py
```

The features of our machine's `float` type are obtained using the command `sys.float_info`.

In [15]:

```
sys.float_info
```

Out[15]:

The number of digits in the mantissa, `mant_dig`, is 53 (52 stored bits, but 53 digits because of the hidden bit). That is, Python's `float` uses 64 bits and is double precision.

But `max_exp=1024` and `min_exp=-1021` do not seem to agree with what was written above. However, if we execute `help(sys.float_info)` we obtain

| max_exp | DBL_MAX_EXP -- maximum int e such that radix**(e-1) is representable

| min_exp | DBL_MIN_EXP -- minimum int e such that radix**(e-1) is a normalized float

The largest double precision normalized number that we may store in binary representation is

$$(+1)\cdot (1.11\ldots 11) \cdot 2^{1023}$$

Since we do not have to keep the first $1$, there are 52 bits left to store the $1$s of $0.11\ldots 11$. Therefore, in base $10$ this number is

$$(1+1\cdot 2^{-1}+1\cdot 2^{-2}+1\cdot 2^{-3}+\cdots+1\cdot 2^{-52})\cdot 2^{1023}$$

Write a code, `Exercise4a.py`, to compute this sum. Perform the sum from the lowest to the largest terms (we shall see why later on). Its value should coincide with that obtained with `sys.float_info.max`. Define the variable `output` with the output value.
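A possible sketch of `Exercise4a.py` (an assumption, not the official solution), summing from the smallest term as requested:

```python
import sys

total = 0.0
for k in range(52, 0, -1):       # smallest terms first: 2^-52, ..., 2^-1
    total += 2.0 ** (-k)
output = (1.0 + total) * 2.0 ** 1023

print('%.16e' % output)
print(output == sys.float_info.max)  # True
```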

In [16]:

```
import sys
%run Exercise4a.py
print '%.16e' % output # exponential format with 16 decimal digits
```

In [17]:

```
sys.float_info.max
```

Out[17]:

```
1.7976931348623157e+308
```

We check that they are equal

In [18]:

```
print output == sys.float_info.max
```

(b) Using a similar procedure to that in Exercise 4a, write a code `Exercise4b.py` to compute the lowest representable normalized floating point number using the expression

$$(+1)\cdot (1.00\ldots 00) \cdot 2^{-1022}$$

Its value should coincide with that obtained from `sys.float_info.min`.

In [19]:

```
%run Exercise4b.py
print '%.16e' % output
```

In [20]:

```
sys.float_info.min
```

Out[20]:

```
2.2250738585072014e-308
```

**The largest consecutive integer which can be stored exactly** in binary form with floating point representation is

$$(+1)\cdot (1.11\ldots 11) \cdot 2^{52}=2^{53}-1$$

Therefore, we have $53$ precision digits in binary form. The next integer,

$$(+1)\cdot (1.00\ldots 00) \cdot 2^{53}$$

can also be stored exactly. But not the next one, since we would need an extra bit in the mantissa.

In [21]:

```
print(2**53)
```

Hence, all 15-digit integers and most 16-digit integers may be stored exactly.
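We can check the loss of exactness just above $2^{53}$ directly:

```python
n = 2 ** 53
print(float(n) == n)             # True: 2^53 is exactly representable
print(float(n + 1) == float(n))  # True: 2^53 + 1 rounds back to 2^53
print(float(n + 2) == n + 2)     # True: 2^53 + 2 = (1.0...01)_2 * 2^53 is exact
```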

Observe that, actually, the number

$$(+1)\cdot (1.11\ldots 11) \cdot 2^{62}$$

is an integer, storable, and larger than the largest consecutive integer which can be stored! However, the ten last digits of this number are necessarily zeros. Therefore, from the point of view of precision they are negligible, since they cannot be changed.

**Denormalized numbers**

What happens to numbers with exponents out of the range $[-1022,1023]$?

If the exponent is lower than $-1022$, then the number is denormalized or zero, and the exponent bits take the special value $00000000000$. The hidden bit is then $0$, instead of $1$.

In [22]:

```
x = 0.5*(2**-1023)
print(x) # denormalized number
print(1/x)
```

In [23]:

```
print(2**-1080) # Underflow
```

The smallest positive denormalized number is

$$(+1)\cdot(0.000\ldots 01)\times 2^{-1022}$$

that is, $2^{-1022-52}$

In [24]:

```
num_min = 2.**(-1022-52)
print(num_min)
```

For any smaller value, we get zero (underflow):

In [25]:

```
2.**(-1022-53) # Underflow
```

Out[25]:

```
0.0
```

If the exponent is larger than $1023$ we get an `OverflowError`.

In [26]:

```
2.**1024 # Overflow
```

Let us compute the lowest number we may add to $1$ using double precision floating point binary representation.

The number $1$, in normalized double precision floating point representation, is

$$(+1)\cdot (1.00\ldots 00) \cdot 2^{0}$$

with $52$ zeros in the mantissa. The lowest number we may add in non-normalized floating point representation is

$$\epsilon = (+1)\cdot (0.00\ldots 01) \cdot 2^{0}$$

which in base $10$ is

In [27]:

```
1*2.**(-52)
```

Out[27]:

```
2.220446049250313e-16
```

which is obtained from Python by

In [28]:

```
sys.float_info.epsilon
```

Out[28]:

```
2.220446049250313e-16
```

This value, the lowest number $\epsilon$ such that $1+\epsilon>1$, is called the *machine precision*, and it gives the precision of the floating point representation.

Since, in double precision, $\epsilon \approx 2.22\cdot 10^{-16}$, it corresponds to, approximately, $16$ decimal digits.
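This definition can be verified directly: adding $\epsilon$ to $1$ changes the result, while adding $\epsilon/2$ does not (it falls exactly on a tie and rounds back to $1$):

```python
import sys

eps = sys.float_info.epsilon
print(1.0 + eps > 1.0)       # True
print(1.0 + eps / 2 == 1.0)  # True: eps/2 is lost in the rounding
```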

Between

$$1=(+1)\cdot (1.00\ldots 00) \cdot 2^{0}$$

and

$$1+\epsilon=(+1)\cdot (1.00\ldots 01) \cdot 2^{0}$$

we cannot represent exactly (in floating point) any other real number.

To compute the lowest number that, added to $x$, gives a result different from $x$, we use the spacing `eps`:

In [29]:

```
import numpy as np
x = 10.
eps = np.spacing(x)  # gap between x and the next representable float
print(eps)
```

In [30]:

```
x = 100.
eps = np.spacing(x)
print(eps)
```

In [31]:

```
x = 1000.
eps = np.spacing(x)
print(eps)
```

`eps` increases with the absolute value of $x$. This means that the difference between two consecutive numbers exactly representable in floating point increases as we depart from zero. Therefore, the *density* of exactly representable floating point numbers is not uniform: it is larger close to zero and smaller far from it. Recall that floating point numbers are actually rational numbers.

Some operations lead to special results:

In [32]:

```
x = 1.e200
y = x*x
print(y)
```

In [33]:

```
y/y
```

Out[33]:

```
nan
```

`nan` means "not a number". In general, it arises when an invalid operation has been performed.

**Overflow**. It happens when the result of an operation is finite but not representable (larger than the largest representable number).

In [34]:

```
x = 1.e300
x**2
```

Often, there is no exact floating point representation of a real number $x$, which then lies between two consecutive representable numbers $x^-$ and $x^+$. Several rounding criteria exist:

- Round toward $0$ ("truncation").
- Round toward $+ \infty$.
- Round toward $- \infty$.
- Round to the nearest number, resolving ties away from zero.
- Round to the nearest number, resolving ties to the even one ("rounding").

The last method is the most usual.

Suppose that we are using a three digits decimal representation and we want to sum the numbers $a=123$ and $b=1.25$. The exact result is $s=124.25$ but, since only three digits are available, we shall round to, for instance, $s_r=124$. An error arises. In general, finite arithmetic operations involve errors.
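The three-digit machine can be simulated by rounding every result to three significant decimal digits (an illustrative sketch, not an exercise script):

```python
def round_sig(x, digits=3):
    """Round x to the given number of significant decimal digits."""
    return float('%.*e' % (digits - 1, x))

s = 123 + 1.25       # exact sum: 124.25
print(round_sig(s))  # 124.0: the three-digit machine stores s_r = 124
```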

**Example.** The operation $3\times 0.1 \times 100$ returns:

In [35]:

```
3*0.1*100
```

Out[35]:

```
30.000000000000004
```

The result is not the right one ($30$) due to rounding errors. These errors may propagate and, in some cases, substantially affect the result. In this case the error is due, as we already saw, to the fact that $x=0.1$ is periodic in binary base, so its representation in the computer is not exact.

**Example.** Computing

$$\sum_{i=1}^{10000} 0.1$$

gives

In [36]:

```
Sum = 0.
for i in range(1,10001):
    Sum = Sum + 0.1
print '%.16f' % Sum # decimal format with 16 decimals
```

with absolute error given by

In [37]:

```
abs(Sum-1000.)
```

Out[37]:

**Example**

If we add or subtract numbers which are very different in magnitude, we always lose accuracy due to rounding error. For instance,

In [38]:

```
a = 1.e+9
epsilon = 1.e-8
Sum = a
for i in range(1,10001):
    Sum = Sum + epsilon
print '%.16e' % Sum # exponential format with 16 decimals
```

However, the result should have been

In [39]:

```
Sum = a + 10000*epsilon
print '%.16e'% Sum
```

**Example**

Let us sum the first $N$ terms of the harmonic series

$$\sum_{n=1}^N \frac{1}{n}.$$

Using single precision (`np.float32`) would make the effect even more evident. The exact result for the first $N=10000$ terms is

In [40]:

```
import numpy as np
from scipy import special
N = 10000
Sum = special.polygamma(0,N+1)-special.polygamma(0,1)
print '%.20f'% Sum
```

Starting the sum from the term $n=1$, we get

In [41]:

```
Sum1 = 0.
for n in range(1,N+1):
    Sum1 += 1./n
print '%.20f' % Sum1
```

In [42]:

```
error1 = abs(Sum1 - Sum)
print error1
```

Starting from the last term, we get

In [43]:

```
Sum2 = 0.
for n in range(N,0,-1):
    Sum2 += 1./n
print '%.20f' % Sum2
```

Thus, the absolute error is now

In [44]:

```
error2 = abs(Sum2 - Sum)
print error2
```

**Cancellation error.** It arises in the subtraction of large numbers of similar value. For instance,

$$\sqrt{x^2+\epsilon^2}-x$$

In [45]:

```
import numpy as np
x = 1000000.
ep = 0.0001  # ep plays the role of epsilon**2
a = np.sqrt(x**2 + ep) - x
print a
```

In this case, cancellation error may be avoided by using an equivalent mathematical expression

$$\sqrt{x^2+\epsilon^2}-x=\frac{(\sqrt{x^2+\epsilon^2}-x)(\sqrt{x^2+\epsilon^2}+x)}{\sqrt{x^2+\epsilon^2}+x}=\frac{\epsilon^2}{\sqrt{x^2+\epsilon^2}+x}\approx \frac{\epsilon^2}{\sqrt{x^2}+x}=\frac{\epsilon^2}{2x} $$

In [46]:

```
b = ep / (np.sqrt(x**2 + ep) + x)
print b
```

The relative error with the first expression is

In [47]:

```
Er = abs((a - b)/b)
print Er
```

(a) When solving $ax^2+bx+c=0$, with $a=1$, $b=10^8$ and $c=1$, using the formula

$$x_1=\frac{-b+\sqrt{b^2-4ac}}{2a}$$

we obtain

In [48]:

```
a = 1.; b = 10.**8; c = 1.
x1 = (-b + np.sqrt(b**2 - 4*a*c)) / (2*a)
print x1
```

But after substituting back into the polynomial we find that the residual is far from zero:

In [49]:

```
residual1 = a*x1**2 + b*x1 + c
print residual1
```

Write a script, `Exercise5a.py`, that computes this root with an equivalent expression that avoids the cancellation, together with its residual `residual2`.

In [50]:

```
%run Exercise5a.py
print x2
print residual2
```

(b) When solving the same equation, now with $b=-10^8$, with the formula

$$x_2=\frac{-b-\sqrt{b^2-4ac}}{2a}$$

we obtain

In [51]:

```
a = 1.; b = -10.**8; c = 1
x3 = (- b - np.sqrt(b**2 - 4*a*c)) / (2*a)
print x3
```

but if we substitute this value in the equation

In [52]:

```
residual3 = a*x3**2 + b*x3 + c
print residual3
```

which, again, reveals a big relative error.

Write a script, `Exercise5b.py`, that computes this root with an equivalent expression that avoids the cancellation, together with its residual `residual4`.

In [53]:

```
%run Exercise5b.py
print x4
print residual4
```
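One standard cancellation-free approach (a sketch of what the Exercise5 scripts might do, assuming they exploit the relation $x_1 x_2 = c/a$; the official solutions may differ): compute the large-magnitude root with the sign choice that avoids cancellation, then obtain the small root from the product of roots.

```python
import math

def stable_roots(a, b, c):
    """Roots of a*x^2 + b*x + c = 0 avoiding subtractive cancellation."""
    disc = math.sqrt(b * b - 4 * a * c)
    q = -(b + math.copysign(disc, b)) / 2.0  # no cancellation in b + sign(b)*disc
    return q / a, c / q                      # small root from x1 * x2 = c/a

x1, x2 = stable_roots(1.0, 1e8, 1.0)
print(x2)                                 # close to -1e-8
print(abs(1.0 * x2**2 + 1e8 * x2 + 1.0)) # tiny residual
```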

**Example**

Given

$$e^x=\sum_{n=0}^{\infty}\frac{x^n}{n!},$$

let us approximate the value of the number $e$, using $x = 1$, with a finite number of terms of the sum, and see how the error evolves.

In [1]:

```
import math
import numpy as np

suma = 0.
x = 1.
for n in range(100):
    suma += x**n / math.factorial(n)
    if n in np.array([5,10,15,20,40,60,80,100])-1:
        error = abs(np.exp(1.) - suma)
        print "Number of terms %i" % (n+1)
        print "Error %.16e" % error
        print ""
```

The lowest error we can hope for is the spacing around $e$:

In [54]:

```
Ea = np.spacing(np.exp(1.))
print Ea
```

Let's compute the number of terms we need to get $e^1$ with the lowest possible error.

In [55]:

```
import math
Ea = np.spacing(np.exp(1.))
n = 0
Sum = 1.
itermax = 100
error = np.abs(np.exp(1.) - Sum)
while (error > Ea and n < itermax):
    n += 1
    Sum += 1. / math.factorial(n)
    error = np.abs(np.exp(1.) - Sum)
print "Number of iterations %i" % n
print "Error %.16e" % error
print "Ea %.16e" % Ea
```

So after $17$ iterations the truncation error is negligible at machine precision.

In [2]:

```
%run Exercise6.py
print "Number of iterations %i" % n
print "Error %.16e" % error
print "Spacing around ln(1+x) %.16e" % Ea
```