In the decimal system the number $105.8125$ means:
$$105.8125=1 \cdot 10^2+0\cdot 10^1+5\cdot 10^0+8\cdot10^{-1}+1\cdot10^{-2}+2\cdot10^{-3}+5\cdot10^{-4}$$i.e., it is a linear combination of powers of $10$, each multiplied by one of the $10$ digits $0,1,\ldots,9$.
Computers use the binary system because it is the natural way an electronic device works: switched on or off. Therefore, only $0's$ and $1's$ need to be stored.
In the binary system numbers are represented as linear combinations of powers of $2$, multiplied by $0's$ and $1's$:
$$(105.8125)_{10}=2^6+2^5+2^3+2^0+2^{-1}+2^{-2}+2^{-4}=(1101001.1101)_2 \qquad (1)$$Conversion of the integer part is achieved by sequentially dividing by $2$. The remainders of these divisions are the digits in base $2$, from least to most significant.
$$ \begin{array}{lrrrrrrrr} Quotients & 105 & 52 & 26 & 13 & 6 & 3 & 1\\ Remainders & 1 & 0 & 0 & 1 & 0 & 1 & \swarrow & \end{array} $$The final quotient, $1$, gives the leading digit. Then, $(105)_{10}$ in the binary system is $(1101001)_2$.
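The division chain above is easy to script; here is a minimal sketch (variable names are illustrative):
n = 105
digits = ''
while n > 0:
    n, r = divmod(n, 2)       # quotient and remainder of the division by 2
    digits = str(r) + digits  # remainders appear from least to most significant
print(digits)  # 1101001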
Remark. Python's function bin
performs this conversion.
bin(105)
Write a function, de2bi_a(x)
, that converts the integer part of a base 10 number x
into a binary number.
The output will be stored as a string, initialized as s = '' and updated as s = str(d) + s, where d is 0 or 1.
Use it for x = 105.8125
and x = 310.671875.
Hint: the following functions may be useful:
- np.fix(x): rounds x towards 0.
- a//b: gives the quotient of the division.
- a%b: gives the remainder of the division.
The structure of the script will be
import numpy as np
def de2bi_a(x):
    ...  # <- write here your code

def main():
    x = 105.8125
    binary1 = de2bi_a(x)
    print('The integer part of ', x, ' in binary is ', binary1)

if __name__ == "__main__":
    main()
This way you will be able to import the function de2bi_a from the file Exercise1a.py without executing the code in main again.
%run Exercise1a
To convert the fractional part, we multiply by $2$; the integer part of the result is the next binary digit, which we drop before repeating. The process stops when the fractional part reaches $0$ (or when the desired number of digits is obtained).
$$ \begin{array}{lrrrr} Fractional\;part & 0.8125& 0.625 & 0.25 & 0.5 \\ Binary\;digit & 1 & 1 & 0 & 1 \end{array} $$Write a function, de2bi_b(x), to convert the fractional part of a base 10 number into a binary number.
The output will be stored as a string, initialized as s = '' and updated as s = s + str(d), where d is 0 or 1.
Use it for x = 105.8125
, x = 310.671875
and x = 14.1. Limit the maximum number of fractional digits in binary to 23 (truncation).
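As a hint, the multiplication chain above can be sketched as follows (a minimal sketch, not the full exercise; variable names are illustrative):
f = 0.8125
s = ''
while f != 0. and len(s) < 23:
    f *= 2
    d = int(f)      # the integer part is the next binary digit
    s = s + str(d)
    f -= d          # drop the integer part and repeat
print('0.' + s)  # 0.1101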
Using the functions de2bi_a(x)
and de2bi_b(x)
write the number x in binary for the three numbers above.
The structure of the script will be
import numpy as np
import Exercise1a as ex1a
def de2bi_b(x):
    ...  # <-- write here your code

def main():
    x = 105.8125
    binary1 = ex1a.de2bi_a(x)
    binary2 = de2bi_b(x)
    print('The fractional part of ', x, ' in binary is ', '0.'+binary2)
    print('The number ', x, ' in binary is ', binary1+'.'+binary2)

if __name__ == "__main__":
    main()
%run Exercise1b
Using $(1)$ for converting $(1101001.1101)_2$ to decimal base we get
1*(2**6)+1*(2**5)+0*(2**4)+1*(2**3)+0*(2**2)+0*(2**1)+1*(2**0)+1*(2**-1)+1*(2**-2)+0*(2**-3)+1*(2**-4)
If we have $m$ digits or memory bits then we may store $2^m$ different binary numbers.
If we only consider the positive integers then we may represent numbers between
$$ (00\ldots 00)_2=(0)_{10} \quad \mathrm{and} \quad (11\ldots 11)_2=(2^m-1)_{10}. $$
For instance, for $m=3$ bits we may represent positive integers from $0$ to $7$:
$$ \begin{array}{|c|c|} \hline Base \; 10 & Base \; 2\\ \hline \mathtt{0} & \mathtt{000}\\ \mathtt{1} & \mathtt{001}\\ \mathtt{2} & \mathtt{010}\\ \mathtt{3} & \mathtt{011}\\ \mathtt{4} & \mathtt{100}\\ \mathtt{5} & \mathtt{101}\\ \mathtt{6} & \mathtt{110}\\ \mathtt{7} & \mathtt{111}\\ \hline \end{array} $$In signed integers, the first bit is used to store the sign: $0$ for positive and $1$ for negative. The other $m-1$ digits are used for the unsigned number and, therefore, we may write $2^{m-1}-1$ positive numbers, the same amount of negative numbers and two zeros, one with positive sign and another with negative sign. Therefore, the range of representation is $[-2^{m-1}+1, 2^{m-1}-1]$. For instance, if $m=3$ we may represent numbers from $-3$ to $3$:
$$ \begin{array}{|c|c|} \hline Base \; 10 & Base \; 2\\ \hline \mathtt{-3} & \mathtt{111}\\ \mathtt{-2} & \mathtt{110}\\ \mathtt{-1} & \mathtt{101}\\ \mathtt{-0} & \mathtt{100}\\ \mathtt{+0} & \mathtt{000}\\ \mathtt{+1} & \mathtt{001}\\ \mathtt{+2} & \mathtt{010}\\ \mathtt{+3} & \mathtt{011}\\ \hline \end{array} $$Example
Compute how $(-80)_{10}$ is stored when using a signed 8-bit binary representation.
Solution: the representation is similar to that of the positive number, $(80)_{10}=(1010000)_2$, but with the first digit equal to $1$, due to the negative sign. So we have $(-80)_{10}=(11010000)_2$.
To avoid the double representation of zero, in two's complement representation we define negative integers by taking the digits of the corresponding positive integers, changing $0's$ to $1's$ and $1's$ to $0's$, and adding $1$ to the result. In this way, the sum of a number and its opposite is always $0$.
$$ \begin{array}{|c|c|} \hline Base \; 10 & Base \; 2\\ \hline \mathtt{-4} & \mathtt{100}\\ \mathtt{-3} & \mathtt{101}\\ \mathtt{-2} & \mathtt{110}\\ \mathtt{-1} & \mathtt{111}\\ \mathtt{0} & \mathtt{000}\\ \mathtt{1} & \mathtt{001}\\ \mathtt{2} & \mathtt{010}\\ \mathtt{3} & \mathtt{011}\\ \hline \end{array} $$Example. To represent $(-2)_{10}$ we start writing $(2)_{10}$ in binary form, $(010)_2$. Then, we change the $0's$ to $1's$ and vice versa, obtaining $(101)_2$, and add $1$, which gives $(110)_2$.
So the property $(010)_2+(110)_2=(000)_2$ holds (the carry out of the leftmost bit is discarded).
In this case, the first bit of a negative number is $1$ and that of a positive number is $0$. We may represent integers in the range $[-2^{m-1},2^{m-1}-1]$.
Example. Compute how $(-80)_{10}$ is stored when using a two's complement 8-bit binary representation. Solution: since $(80)_{10}=(01010000)_2$, changing $0's$ to $1's$ and vice versa gives $(10101111)_2$, and adding $1$ yields $(-80)_{10}=(10110000)_2$.
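These operations are easy to reproduce in Python; a minimal sketch (the function name is illustrative):
def twos_complement(x, m=8):
    # m-bit two's complement pattern of x, assuming -2**(m-1) <= x < 2**(m-1)
    return format(x & (2**m - 1), '0' + str(m) + 'b')

print(twos_complement(80))   # 01010000
print(twos_complement(-80))  # 10110000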
In biased representation, negative numbers are represented as consecutive values starting from the lowest negative integer, and positive numbers follow. The representation is obtained by adding the bias $2^{m-1}$ to the number $x$, i.e., setting $x_r=x+2^{m-1}\in[0,2^m-1]$.
$$ \begin{array}{|c|c|} \hline Base \; 10 & Base \; 2\\ \hline \mathtt{-4} & \mathtt{000}\\ \mathtt{-3} & \mathtt{001}\\ \mathtt{-2} & \mathtt{010}\\ \mathtt{-1} & \mathtt{011}\\ \mathtt{0} & \mathtt{100}\\ \mathtt{1} & \mathtt{101}\\ \mathtt{2} & \mathtt{110}\\ \mathtt{3} & \mathtt{111}\\ \hline \end{array} $$We see that the representable range is again $[-2^{m-1},2^{m-1}-1]$.
Which form is actually used for integer storage? Most machines use two's complement for integer numbers and biased representation for the exponents of floating point numbers (which are integers).
Why? The main reason to use biased representation is its efficiency for comparing numbers. Codes compare floating point numbers very often, and this is done by first comparing their exponents. Only if they are equal, comparison of their mantissas is performed.
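A quick sketch illustrates this: with $m=3$ and bias $2^{m-1}=4$, the bit patterns of the biased representation sort in the same order as the numbers they encode.
m = 3
bias = 2**(m - 1)
for x in range(-4, 4):
    print('%2d -> %s' % (x, format(x + bias, '03b')))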
In Python there are three distinct numeric types: integers, floating point numbers, and complex numbers. In addition, Booleans are a subtype of integers. Integers have unlimited precision in Python 3.
For a signed representation with $m=64$ bits, the largest integer is given by $2^{m-1}-1$.
m = 64
print(2**(m-1)-1)
print(type(2**(m-1)-1))
If $m=128$ is the number of bits, the largest integer is
m = 128
print(2**(m-1)-1)
print(type(2**(m-1)-1))
If $m=256$ is the number of bits, the largest integer is
m = 256
print(2**(m-1)-1)
print(type(2**(m-1)-1))
For real numbers, floating point representation of base $\beta=2$ is used:
$$x_r=(-1)^s \cdot m \cdot \beta^e=(-1)^s \cdot (a_1.a_2\ldots a_t) \cdot \beta^e$$where $s\in\{0,1\}$ is the sign bit, $m=(a_1.a_2\ldots a_t)$ is the mantissa, with binary digits $a_i$ and $a_1=1$ for normalized numbers, and $e$ is the integer exponent.
Numbers are stored either in words of 32 bits (single precision), 64 bits (double precision), or 128 bits (extended precision). In most computers, for Python, the default float precision is double precision. In this case, the bits are used as follows ($1$ for the sign, $11$ for the exponent and $52$ for the mantissa):
%run Double_precision
With $11$ bits for the signed exponent, we have room for $2^{11}=2048$ binary numbers, $0 \le E \le 2047$. The first value, $00000000000$, is reserved for zeros and non-normalized numbers; and the last, $11111111111$, for Inf and NaN.
Thus, the exponent takes values $1 \le E \le 2046$ and, since the bias is $1023$, these values correspond to the exponents $-1022 \le E - 1023 \le 1023$. Therefore, the maximum exponent is $E_{max}=1023$ and the minimum is $E_{min}=-1022$. Dividing by the minimum value, we get
$$\frac{1}{x_{min}}=\frac{1}{m\beta^{E_{min}}}=\frac{1}{m\beta^{-1022}}=\frac{1}{m}\beta^{1022}<\frac{1}{m}\beta^{1023}$$so the maximum value is not reached, i.e., there is no overflow when inverting the smallest normalized number.
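We can check this in Python: the reciprocal of the smallest normalized double is finite.
import sys
x_min = sys.float_info.min  # smallest normalized double, 2**-1022
print(1. / x_min)           # finite: no overflow
print(sys.float_info.max)   # still larger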
Write a script that calculates the single binary representation of 105.8125
, 120.875
, 7.1
and -1.41
, following the IEEE 754 standard. Use the scripts written in Exercise 1. Consider only the case where the integer part is greater than zero (the number, in absolute value, is greater than $1$). Round the numbers using truncation.
Note: remember that single precision uses 32 bits: 1 for the sign, 8 for the exponent and 23 bits for the mantissa. The exponent bias is 127.
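A possible cross-check of your results uses the standard struct module, which packs a Python float into IEEE 754 single precision (note that struct rounds to nearest instead of truncating, so the last bit may differ for numbers like 7.1):
import struct

def float_to_bin32(x):
    # Reinterpret the 4 bytes of x, packed as an IEEE 754 single, as an unsigned int
    (bits,) = struct.unpack('>I', struct.pack('>f', x))
    return format(bits, '032b')

for x in [105.8125, 120.875, 7.1, -1.41]:
    print(x, '->', float_to_bin32(x))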
%run Exercise2
The features of our machine's float type are obtained using the command sys.float_info
import sys
sys.float_info
The number of digits in the mantissa, mant_dig
, is 53 (52 bits, but 53 digits because of the hidden bit). That is, Python's float uses 64 bits and is double precision.
But max_exp=1024 and min_exp=-1021 do not seem to agree with what was written above.
But, if we execute
help(sys.float_info)
we obtain
| max_exp | DBL_MAX_EXP -- maximum int e such that radix**(e-1) is representable
| min_exp | DBL_MIN_EXP -- minimum int e such that radix**(e-1) is a normalized float
That is, max_exp - 1 = 1023 = $E_{max}$ and min_exp - 1 = -1022 = $E_{min}$, so both values do agree with the previous discussion.
The largest double precision normalized number that Python may store in binary representation is
$$(+1)\cdot (1.11\ldots 11) \cdot 2^{1023}$$Since we do not have to keep the first $1$, there are 52 bits left to store the $1's$ of $0.11\ldots 11$. Therefore, in base $10$ this number is
$$(1+1\cdot 2^{-1}+1\cdot 2^{-2}+1\cdot 2^{-3}+\cdots+1\cdot 2^{-52})\cdot 2^{1023}$$Write a code to compute this sum. Perform the sum from the lowest to the largest terms (we shall see why later on). Its value should coincide with that obtained with sys.float_info.max. Define the variable output with the output value.
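One possible way to build this sum, adding the smallest terms first, is the following sketch of what Exercise3.py might contain:
output = 0.
for k in range(52, 0, -1):  # add 2**-52 first, 2**-1 last
    output += 2.**(-k)
output = (1. + output) * 2.**1023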
import sys
%run Exercise3
print('%.16e' % output) # exponential format with 16 decimal digits
sys.float_info.max
We check that they are equal
print (output == sys.float_info.max)
The lowest representable normalized floating point number is
$$(+1)\cdot (1.00\ldots 00) \cdot 2^{-1022}$$output = 2**-1022
print ('%.16e' % output)
Its value should coincide with that obtained from sys.float_info.min.
sys.float_info.min
The largest consecutive integer which can be stored exactly in binary form with floating point representation is
$$(+1)\cdot (1.11\ldots 11) \cdot 2^{52}$$Therefore, we have $53$ precision digits in binary form. The next integer,
$$(+1)\cdot (1.00\ldots 00) \cdot 2^{53}$$can also be stored exactly. But not the next one, since we would need an extra bit in the mantissa.
print(2**53)
Hence, all 15-digit integers and most 16-digit integers may be stored exactly.
Observe that, actually, the number
$$(+1)\cdot (1.11\ldots 11) \cdot 2^{62}$$is an integer, is storable, and is larger than the largest consecutive integer that can be stored exactly! However, the last ten binary digits of this number are necessarily zeros. Therefore, from the point of view of precision they are negligible, since they cannot be changed.
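We can verify both claims with a quick check:
x = (2**53 - 1) * 2**10          # (1.11...11)_2 * 2**62 as an exact integer
print(float(x) == x)             # True: it is stored exactly
print(float(x + 1) == float(x))  # True: the last ten bits cannot change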
Denormalized numbers
What happens to numbers with exponents out of the range $[-1022,1023]$?
If the exponent is lower than $-1022$ then the number is non-normalized or zero: the bits of the exponent take the special value $00000000000$, and the hidden bit is now $0$ instead of $1$.
x = 0.5*(2**-1023)
print(x) # denormalized number
print(1/x)
print(2**-1080) # Underflow
The lowest non-normalized number is
$$(+1)(0.000\ldots 01)\times 2^{-1022}$$that is, $2^{-1022-52}$
num_min = 2.**(-1022-52)
print(num_min)
For any smaller value, we get zero (underflow):
2.**(-1022-53) # Underflow
If the exponent is larger than $1023$ we get OverflowError.
2.**1024 # Overflow
Let us compute the lowest number we may add to $1$ using double precision floating point binary representation.
The number $1$, in normalized double precision floating point representation, is
$$(+1)\cdot (1.00\ldots 00) \cdot 2^{0}$$with $52$ zeros in the mantissa. The lowest number we may add in floating point non-normalized representation is
$$\epsilon = (+1)\cdot (0.00\ldots 01) \cdot 2^{0}$$which in base $10$ is
1*2.**(-52)
which is obtained from Python by
sys.float_info.epsilon
This value, the lowest number $\epsilon$ such that $1+\epsilon>1$, is called the machine precision, and it gives the precision of the floating point representation.
Since, in double precision, $\epsilon \approx 2.22\cdot 10^{-16}$, it corresponds to approximately $16$ decimal digits.
Between
$$1=(+1)\cdot (1.00\ldots 00) \cdot 2^{0}$$and
$$1+\epsilon=(+1)\cdot (1.00\ldots 01) \cdot 2^{0}$$we can not represent exactly (and in floating point) any other real number.
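We can check both facts directly:
import sys
eps = sys.float_info.epsilon
print(1. + eps > 1.)     # True
print(1. + eps/2 == 1.)  # True: eps/2 is lost when added to 1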
The lowest number we may add to $x$ to obtain a different representable value is the spacing eps between $x$ and the next floating point number, given by np.spacing:
import numpy as np
x = 10.
eps = np.spacing(x)
print(eps)
x = 100.
eps = np.spacing(x)
print(eps)
x = 1000.
eps = np.spacing(x)
print(eps)
We see that eps
increases with the absolute value of $x$. This means that the difference between two consecutive numbers exactly representable in floating point increases as we depart from zero. Therefore, the density of exactly representable numbers is not uniform: it is higher close to zero and lower far from it. Recall that floating point numbers are actually rational numbers.
Some operations lead to special results
x = 1.e200
y = x*x
print(y)
y/y
nan
means "not a number". In general, it arises when a non valid operation has been performed.
Overflow. It happens when the result of the operation is finite but not representable (larger than the largest representable number).
x = 1.e300
x**2
Often, there is no exact floating point representation of a real number $x$, which lies between two consecutive representable numbers $x^-<x<x^+$. Then, the representation of $x$ is chosen according to the rounding method. IEEE admits five rounding methods:
- Round towards zero (truncation).
- Round towards $+\infty$.
- Round towards $-\infty$.
- Round to nearest, with ties away from zero.
- Round to nearest, with ties to even.
The last method is the most usual.
Suppose that we are using a three digits decimal representation and we want to sum the numbers $a=123$ and $b=1.25$. The exact result is $s=124.25$ but, since only three digits are available, we shall round to, for instance, $s_r=124$. An error arises. In general, finite arithmetic operations involve errors.
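The standard decimal module lets us reproduce this three-digit arithmetic (a small illustration):
from decimal import Decimal, getcontext

getcontext().prec = 3                    # three significant digits
print(Decimal('123') + Decimal('1.25'))  # 124: the exact 124.25 is rounded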
Example. The operation $3\times 0.1 \times 100$ returns:
3*0.1*100
The result is not the right one, $30$, due to rounding error. These errors may propagate and, in some cases, substantially affect the result. In this case the error is due, as we already saw, to the fact that $x=0.1$ has a periodic (infinite) expansion in the binary base. Thus, its representation in the computer is not exact.
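We can display the number actually stored for $0.1$:
from decimal import Decimal
print(Decimal(0.1))  # the exact value of the double closest to 0.1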
Example. Computing
$$\sum_{k=1}^{10000} 0.1$$gives
Sum = 0.
for i in range(1, 10001):
    Sum = Sum + 0.1
print('%.16f' % Sum)  # decimal format with 16 decimals
with absolute error given by
abs(Sum-1000.)
Example
If we add or subtract numbers which are very different in magnitude, we lose accuracy due to rounding error. For instance,
a = 1.e+9
epsilon = 1.e-8
Sum = a
for i in range(1, 10001):
    Sum = Sum + epsilon
print('%.16e' % Sum)  # exponential format with 16 decimals
However, the result should have been
Sum = a + 10000*epsilon
print('%.16e'% Sum)
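As a side note, the standard library provides compensated summation, which recovers the lost digits in cases like this one:
import math
values = [1.e+9] + [1.e-8] * 10000
print('%.16e' % math.fsum(values))  # compensated summation of the same terms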
Example
Let us sum the first $N$ terms of the harmonic series
$$\sum_{n=1}^N \frac{1}{n}.$$We use single precision (numpy's float32) to make the effect more evident. The exact result for the first $10000$ terms is
from scipy import special
N = 10000
Sum = special.polygamma(0,N+1)-special.polygamma(0,1)
print('%.20f'% Sum)
Starting the sum from the term $n=1$, we get
import numpy as np

Sum1 = np.float32(0.)
for n in range(1, N+1):
    Sum1 += np.float32(1.) / np.float32(n)
print('%.20f' % Sum1)
error1 = abs(Sum1 - Sum)
print(error1)
Starting from the last term, we get
Sum2 = np.float32(0.)
for n in range(N, 0, -1):
    Sum2 += np.float32(1.) / np.float32(n)
print('%.20f' % Sum2)
Thus, the absolute difference is
error2 = abs(Sum2 - Sum)
print(error2)
The error is largest in the first sum because the addition starts with the largest terms and continues adding progressively smaller ones. We lose accuracy because of the difference in magnitude between the accumulated sum and the newly added terms.
Cancellation arises, when using finite arithmetic, in the subtraction of numbers of similar value. For instance, working in decimal base with precision 7, if we subtract
$$0.7235523-0.7235519=0.0000004$$we go from precision 7 to precision 1, because only one significant digit is left. We can pad the remaining digits with zeros, but we do not really know their values, because the numbers being subtracted are approximations (of stored values or of results of operations) whose exact digits are unknown.
For instance, if we use the formula
$$\sqrt{x^2+\epsilon}-x \quad \mathrm{with} \quad x \gg \epsilon > 0$$directly, we will have a cancellation error and, consequently, a large error in the result
import numpy as np
x = 1000000.
ep = 0.0001
a = np.sqrt(x**2 + ep) - x
print(a)
In this case, cancellation error may be avoided by using an equivalent mathematical expression
$$\sqrt{x^2+\epsilon}-x=\frac{(\sqrt{x^2+\epsilon}-x)(\sqrt{x^2+\epsilon}+x)}{\sqrt{x^2+\epsilon}+x}=\frac{\epsilon}{\sqrt{x^2+\epsilon}+x}\approx \frac{\epsilon}{\sqrt{x^2}+x}=\frac{\epsilon}{2x} $$b = ep / (np.sqrt(x**2 + ep) + x)
print(b)
We can also use
b = ep /(2*x)
print(b)
The relative error with the first expression is
Er = abs((a - b)/b)
print(Er)
This example shows that equivalent mathematical expressions are not necessarily equivalent computational expressions.
When solving the second order equation
$$x^2+10^{8}x+1=0$$using the well known formula
$$x_1=\frac{-b+\sqrt{b^2-4ac}}{2a}\qquad (1)$$with
$$a=1,\quad b=10^{8},\quad c=1,\qquad b^{2}\gg 4ac,$$a cancellation error is to be expected. The computed solution is
a = 1.; b = 10.**8; c = 1.
x11 = (-b + np.sqrt(b**2 - 4*a*c)) / (2*a)
print(x11)
But after substitution back into the polynomial we get that the residual
residual1 = a*x11**2 + b*x11 + c
print(residual1)
is relatively large. It means that it is not a good solution because we should have
$$ax_{1}^{2}+bx_{1}+c\approx0$$and we are getting
$$ax_{1}^{2}+bx_{1}+c\approx0.255$$We rewrite the formula, multiplying and dividing by the difference $-b-\sqrt{b^2-4ac}$:
$$x_1=\frac{-b+\sqrt{b^2-4ac}}{2a}\cdot\frac{-b-\sqrt{b^2-4ac}}{-b-\sqrt{b^2-4ac}}$$As $(x+y)(x-y)=x^2-y^2$,
$$x_1=\frac{b^2-\left(b^2-4ac\right)}{2a\left(-b-\sqrt{b^2-4ac}\right)}=\frac{-2c}{b+\sqrt{b^2-4ac}}$$This formula is equivalent to $(1)$ and does not have the cancellation problem, since for positive $b$ the denominator is the sum of two positive numbers. As $b^{2}\gg 4ac$, we can assume that $\sqrt{b^2-4ac}\approx \sqrt{b^2} \approx b$ and
$$x_1\approx\frac{-2c}{b+b}=-\frac{c}{b}$$x12 = -c/b
residual12 = a*x12**2 + b*x12 + c
print(x12)
print(residual12)
And we can see that now the residual is much smaller than the solution.
When solving
$$x^2-10^{8}x+1=0$$using the formula
$$x_2=\frac{-b-\sqrt{b^2-4ac}}{2a}\qquad (2)$$with
$$a=1,\quad b=-10^{8},\quad c=1,\qquad b^{2}\gg4ac,$$a cancellation error is to be expected: as $b$ is negative, $-b$ is positive, while $-\sqrt{b^2-4ac}$ is negative and very close to $-b$ in absolute value. If we use $(2)$, we obtain
a = 1.; b = -10.**8; c = 1
x21 = (- b - np.sqrt(b**2 - 4*a*c)) / (2*a)
print(x21)
but if we substitute this value into the equation
residual21 = a*x21**2 + b*x21 + c
print(residual21)
which is quite large.
We rewrite the formula. Take into account that if $b$ is negative then $-b$ is positive, while $-\sqrt{b^2-4ac}$ is always negative; thus, the expression $(-b)-\sqrt{b^2-4ac}$ has a cancellation problem when $b^{2}\gg4ac$. We multiply and divide by the sum $-b+\sqrt{b^2-4ac}$:
$$x_2=\frac{-b-\sqrt{b^2-4ac}}{2a}\cdot\frac{-b+\sqrt{b^2-4ac}}{-b+\sqrt{b^2-4ac}}$$As $(x+y)(x-y)=x^2-y^2$,
$$x_2=\frac{b^2-\left(b^2-4ac\right)}{2a\left(-b+\sqrt{b^2-4ac}\right)}=\frac{2c}{-b+\sqrt{b^2-4ac}}$$This formula is equivalent to $(2)$ and does not have the cancellation problem (the denominator is the sum of two positive numbers, as $b$ is negative). As $b^{2}\gg4ac$, we can consider that $\sqrt{b^2-4ac}\approx \sqrt{b^2} = |b|=-b$ and simplify the formula to
$$x_2\approx\frac{2c}{-b-b}=-\frac{c}{b}$$
x22 = -c/b
residual22 = a*x22**2 + b*x22 + c
print(x22)
print(residual22)
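Both cases can be handled at once by computing the well-conditioned root first and recovering the other one from the product of the roots, $x_1 x_2 = c/a$. A minimal sketch (the function name is illustrative; $b \neq 0$ is assumed):
import numpy as np

def stable_roots(a, b, c):
    # q avoids cancellation: b and sign(b)*sqrt(b**2-4ac) have the same sign
    q = -(b + np.sign(b) * np.sqrt(b**2 - 4*a*c)) / 2.
    return q / a, c / q  # second root from x1*x2 = c/a

print(stable_roots(1., 10.**8, 1.))   # roots of x**2 + 1e8*x + 1 = 0
print(stable_roots(1., -10.**8, 1.))  # roots of x**2 - 1e8*x + 1 = 0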
There are two kinds of numerical methods:
- Direct methods, which provide the solution after a finite number of operations.
- Iterative methods, which construct a sequence of approximations converging to the solution.
Depending on their source, we can distinguish:
- Rounding errors, due to the finite representation of numbers, discussed above.
- Truncation errors, due to the method itself.
The truncation error is the error we get when we apply an iterative method and, instead of generating infinitely many terms of the sequence of approximations, we stop at some point and accept a solution that contains an error but is "good enough".
For example, given
$$e^x=\sum_{n=0}^{\infty}\frac{x^n}{n!},$$we would get the exact value if we computed infinitely many terms (and used infinite precision). If we compute only a finite number of terms, we will have a truncation error.
Let's approximate the value of the number $e$, taking $x = 1$, with a finite number of terms, and see how the error evolves.
suma = 0.
x = 1.
fact = 1.
for n in range(100):
    suma += x**n / fact  # n-th term: x**n / n!
    fact *= n + 1
    if n in np.array([5, 10, 15, 20, 40, 60, 80, 100]) - 1:
        error = abs(np.exp(1.) - suma)
        print("Number of terms %i" % (n + 1))
        print("Error %.16e\n" % error)
The error first decreases but then it stagnates and does not decrease anymore. Why? If we were using symbolic computation, the error would keep decreasing as the number of terms increases. But we are using finite arithmetic, and the error is limited by the spacing between representable numbers, which around the number $e$ is
Ea = np.spacing(np.exp(1.))
print(Ea)
Let's compute the number of terms we need to get $e^1$ with the lowest possible error.
Ea = np.spacing(np.exp(1.))
n = 0
Sum = 1.
fact = 1.
itermax = 100
error = np.abs(np.exp(1.) - Sum)
while error > Ea and n < itermax:
    n += 1
    Sum += 1. / fact
    fact *= n + 1
    error = np.abs(np.exp(1.) - Sum)
print("Number of iterations %i" % n)
print("Error %.16e" % error)
print("Ea %.16e" % Ea)
So after $17$ iterations the truncation error is negligible at machine precision.
%run Exercise4
print("Number of iterations %i" % n)
print("Error %.16e" % error)
print("Spacing around ln(1+x) %.16e" % Ea)
(a)
(b)
(c)
%run Exercise5