Finite arithmetic and error analysis


Decimal and binary systems

In the decimal system the number $105.8125$ means:

$$105.8125=1 \cdot 10^2+5\cdot 10^0+8\cdot10^{-1}+1\cdot10^{-2}+2\cdot10^{-3}+5\cdot10^{-4}$$

i.e., it is a linear combination of powers of $10$ multiplied by one of the $10$ digits in $0,1,\ldots,9$.

Computers use the binary system because it is the natural way an electronic device works: switched on or off. Therefore, only $0's$ and $1's$ need to be stored.

In the binary system numbers are represented as linear combinations of powers of $2$, multiplied by $0's$ and $1's$:

$$(105.8125)_{10}=2^6+2^5+2^3+2^0+2^{-1}+2^{-2}+2^{-4}=(1101001.1101)_2 \qquad (1)$$

Decimal to binary conversion

Conversion of the integer part is achieved by sequentially dividing by $2$. The remainings of these divisions are the digits in base $2$, from lesser to larger significancy.

$$ \begin{array}{lrrrrrrrr} Quotients & 105 & 52 & 26 & 13 & 6 & 3 & 1\\ Remainders & 1 & 0 & 0 & 1 & 0 & 1 & \swarrow & \end{array} $$

Then, $(105)_{10}$ in the binary system is $(1101001)_2$.

Remark. Python's function bin performs this conversion.

In [1]:

Exercise 1

Write a function,, to convert the integer part of a base 10 number into a binary number. Use it for x = 105.8125.

Hint: the following functions may be useful

  • np.fix(x): rounds x towards 0.
  • a//b : gives the quotient of the division.
  • a%b : gives the remainder of the division.
In [2]:
x = 105.8125
binary1 = de2bi_a(x)
print ''.join(binary1)  # if binary1 is stored as a strings' list

To convert the decimal part, we multiply by $2$, drop the integer part and repit till reaching $1$.

$$ \begin{array}{lrrrr} Decimal & 0.8125& 0.625 & 0.25 & 0.5 \\ Integer & 1 & 1 & 0 & 1 \end{array} $$

Ejercicio 2

Write a function,, to convert the decimal part of a base 10 number into a binary number. Use it for x = 105.8125. Limit the maximum number of fracional digits in binary to 23. Try also with x = 14.1

In [3]:
x = 105.8125
binary2 = de2bi_b(x)
print ''.join(binary2)

And number 105.8125 (in decimal) is written, in binary

In [4]:
print ''.join(binary1 + ['.'] + binary2)
In [5]:
x = 14.1

binary3 = de2bi_a(x)
binary4 = de2bi_b(x)

print ''.join(binary3 + ['.'] + binary4)

Binary to decimal conversion

Using $(1)$ for converting $(1101001.1101)_2$ to decimal base we get

In [6]:

Integer representation

If we have $m$ digits or memory bits then we may store $2^m$ different binary numbers.

Positive integers

If we only consider the positive integers then we may represent numbers between

$$ (00\ldots 00)_2=(0)_{10} \quad \mathrm{and} \quad (11\ldots 11)_2=(2^m-1)_{10}. $$

For instance, for $m=3$ bits we may represent positive integers from $0$ to $7$:

$$ \begin{array}{|c|c|} \hline Base \; 10 & Base \; 2\\ \hline \mathtt{0} & \mathtt{000}\\ \mathtt{1} & \mathtt{001}\\ \mathtt{2} & \mathtt{010}\\ \mathtt{3} & \mathtt{011}\\ \mathtt{4} & \mathtt{100}\\ \mathtt{5} & \mathtt{101}\\ \mathtt{6} & \mathtt{110}\\ \mathtt{7} & \mathtt{111}\\ \hline \end{array} $$

Signed integers

In signed integers, the first bit is used to store the sign: $0$ for positive and $1$ for negative. The other $m-1$ digits are used for the unsigned number and, therefore, we may write $2^{m-1}-1$ positive numbers, the same amount of negative numbers and two zeros, one with positive sign and another with negative sign. Therefore, the range of representation is $[-2^{m-1}+1, 2^{m-1}-1]$. For instance, if $m=3$ we may represent numbers from $-3$ to $3$:

$$ \begin{array}{|c|c|} \hline Base \; 10 & Base \; 2\\ \hline \mathtt{-3} & \mathtt{111}\\ \mathtt{-2} & \mathtt{110}\\ \mathtt{-1} & \mathtt{101}\\ \mathtt{-0} & \mathtt{100}\\ \mathtt{+0} & \mathtt{000}\\ \mathtt{+1} & \mathtt{001}\\ \mathtt{+2} & \mathtt{010}\\ \mathtt{+3} & \mathtt{011}\\ \hline \end{array} $$


Compute how $(80)_{10}$ is stored when using a signed 8-bit binary representation.

Solution: The representation is similar than that of Exercise 4, but with the first digit equal to $1$, due to the negative sign. So we have $(-80)_{10}=(11010000)_2$.

Signed integers: Two's complement representation

To avoid the double representation of zero we define negative integers by taking the digits of the corresponding positive integers and changing $0's$ to $1's$, and $1's$ to $0's$, and adding $1$ to the result. In this way, the sum of a number and its oposite is always $0$.

$$ \begin{array}{|c|c|} \hline Base \; 10 & Base \; 2\\ \hline \mathtt{-4} & \mathtt{100}\\ \mathtt{-3} & \mathtt{101}\\ \mathtt{-2} & \mathtt{110}\\ \mathtt{-1} & \mathtt{111}\\ \mathtt{0} & \mathtt{000}\\ \mathtt{1} & \mathtt{001}\\ \mathtt{2} & \mathtt{010}\\ \mathtt{3} & \mathtt{011}\\ \hline \end{array} $$

Example. To represent $(-2)_{10}$ we start writing $(2)_{10}$ in binary form, $(010)_2$. Then, we

  • invert its digits $\rightarrow (101)_2$
  • add $001$ (take into account that $0+0=0, \quad 0+1=1, \quad 1+1=10$) $\rightarrow (110)_2$

So the property $(010)_2+(110)_2=(000)_2$ holds.

In this case, the first bit of the negative is $0$ and that of the positive is $1$. We may represent integers in the range $[-2^{m-1},2^{m-1}-1]$.

Example. Compute how $(80)_{10}$ is stored when using a two's complement 8-bit binary representation. Solution: since $(80)_{10}=(01010000)_2$

  • we interchange $0's$ and $1's$ $\rightarrow 10101111, $
  • and adding $1$ we obtain the representation $(-80)_{10}=(1011000)_2$

Signed integers: biased representation

Negative numbers are represented as consecutive positive values, starting from the lowest negative integer. Positive numbers are the rest. The representation is obtained by adding the bias $2^{m-1}$ to the number $x$, i.e., setting $x_r=x+2^{m-1}\in[0,2^m-1]$.

$$ \begin{array}{|c|c|} \hline Base \; 10 & Base \; 2\\ \hline \mathtt{-4} & \mathtt{000}\\ \mathtt{-3} & \mathtt{001}\\ \mathtt{-2} & \mathtt{010}\\ \mathtt{-1} & \mathtt{011}\\ \mathtt{0} & \mathtt{100}\\ \mathtt{1} & \mathtt{101}\\ \mathtt{2} & \mathtt{110}\\ \mathtt{3} & \mathtt{111}\\ \hline \end{array} $$

We see that the representable rage is again $[-2^{m-1},2^{m-1}-1]$.

Which form is actually used for integer storage? Most machines use two's complement for integer numbers and biased representation for the exponents of floating point numbers (which are integers).

Why? The main reason to use biased representation is its efficiency for comparing numbers. Codes compare floating point numbers very often, and this is done by first comparing their exponents. Only if they are equal, comparison of their mantissas is performed.

Integer representation in Python

In Python, numerical variables may be int, float, long and complex. Boolean variables are a subtype of integers.

Single integers or int have at least 32 bits precission, normally 32 or 64 bits, depending on the computer and operative system we are using.

The maximum single integer value in our computer is given by

In [7]:
import sys

That is, if $m=64$ is the number of bits, the largest integer is given by $2^{m-1}-1.$

In [8]:

And for negative values, the minimum integer is $-2^{m-1}$

In [9]:
In [10]:

Thus, in our case the single integer has a 64 bits precission.

If we need to represent a larger integer, Python uses long integers. Their precission is only limited by the memory available. They are recognized by its last character: L.

In [11]:
In [12]:

Real number representation

For real numbers, floating point representation of base $\beta=2$ is used:

$$x_r=(-1)^s \cdot m \cdot \beta^e=(-1)^s \cdot (a_1.a_2\ldots a_t) \cdot \beta^e$$


  • $s$ is the sign: $1$ for negative numbers and $0$ for positive.
  • $m$ is the mantissa which when normalized has values in $1\leq m < \beta$ and a nonzero first digit. In binary base this digit only can be $1$, i.e. in a normalized mantissa we always have $a_1=1$, so it is unnecesary to store it (hidden bit method).
  • $e$ is the exponent, a signed integer (biased binary representation).

Numbers are stored either in words of 32 bits (single precission), 64 bits (double precission), or 128 bits (extended precission). In most computers, for Pyhthon, the default float precission is double precission. In this case bits are used as follows:

  • 1 bit for the sign.
  • 11 bits for the exponent,
  • 52 bits for the mantissa.
In [13]:
%matplotlib inline

With $11$ bits for the signed exponent, we have room for $2^{11}=2048$ binary numbers, $ 0 < E < 2047 $. The first number, $00000000000$ is reserved for zeros and non-normalized numbers; and the last, $11111111111$, for Inf and NaN.

Thus, the exponent take values $ 1 < E < 2046 $, and since the bias is $1023$, these values correspond to the exponents $ -1022 < E - 1023 < 1023 $. Therefore, the maximum exponent is $E_{max}=1023$, and the minimum is $E_{min}=-1022$. Dividing by the minimum value, we get


so the maximum value is not reached, i.e. there is not overflow.

Exercise 3

Write a script that calculates the single binary representation of 105.8125, 120.875, 7.1 and -1.41, following the standard IEEE 754. Use the scrips written in exercises 1 and 2. Consider only the case where the integer part is greater than zero (the number, in absolute value, is greater than $1$). Round the numbers using truncation.

Note: Remember that single precision uses 32 bits: 1 for the sign , 8 for the exponent and 23 bits for the mantissa. The exponent bias is 127.

In [14]:
105.8125 --->  1101001.1101
[Sign]:  0  [Exponent]:  10000101  [Mantissa]:  10100111010000000000000
120.875  --->  1111000.111
[Sign]:  0  [Exponent]:  10000101  [Mantissa]:  11100011100000000000000
7.1      --->  111.00011001100110011001100
[Sign]:  0  [Exponent]:  10000001  [Mantissa]:  11000110011001100110011
-1.41    --->  1.01101000111101011100001
[Sign]:  1  [Exponent]:  01111111  [Mantissa]:  01101000111101011100001

The features of ours machine float type are obtained using the command sys.float_info

In [15]:
sys.float_info(max=1.7976931348623157e+308, max_exp=1024, max_10_exp=308, min=2.2250738585072014e-308, min_exp=-1021, min_10_exp=-307, dig=15, mant_dig=53, epsilon=2.220446049250313e-16, radix=2, rounds=1)

The number of digits in the mantissa, mant_dig, is 53 (52 bits but 53 digits because the hidden bit). That is, Python's float uses 64 bits and is double precision.


max_exp=1024 and min_exp=-1021

does not seems to agree with what it was written above.

But, if we execute


we obtain

| max_exp | DBL_MAX_EXP -- maximum int e such that radix**(e-1) is representable

| min_exp | DBL_MIN_EXP -- minimum int e such that radix**(e-1) is a normalized float

Ejercicio 4

The largest double precission normalized number that Matlab may store in binary representation is

$$(+1)\cdot (1.11\ldots 11) \cdot 2^{1023}$$

Since we do not have to keep the first $1$, there are 52 bits left to store the $1's$ of $0.11\ldots 11$. Therefore, in base $10$ this number is

$$(1+1\cdot 2^{-1}+1\cdot 2^{-2}+1\cdot 2^{-3}+\cdots+1\cdot 2^{-52})\cdot 2^{1023}$$

Write a code,, to compute this sum. Perform the sum from lowest to largest terms (we shall see why later on). Its value should coincide with that obtained with sys.float_info.max. Define the variable output with the output value.

In [16]:
import sys

print '%.16e' % output      # exponential format with 16 decimal digits
In [17]:

We check that they are equal

In [18]:
print output == sys.float_info.max

(b) Using a similar procedure than in Exercise4a, write a code to compute the lowest representable normalized floating point number using the expression

$$(+1)\cdot (1.00\ldots 00) \cdot 2^{-1022}$$

Its value should coincide with that obtained from sys.float_info.min.

In [19]:
print '%.16e' % output
In [20]:

The largest consecutive integer which can be stored exactly in binary form with floating point representation is

$$(+1)\cdot (1.11\ldots 11) \cdot 2^{52}$$

Therefore, we have $53$ precission digits in binary form. The next integer,

$$(+1)\cdot (1.00\ldots 00) \cdot 2^{53}$$

can also be stored exactly. But not the next one, since we would need an extra bit in the mantissa.

In [21]:

Hence, all 15-digits integers and most of 16-digits integers may be stored exactly.

Observe that, actually, the number

$$(+1)\cdot (1.11\ldots 11) \cdot 2^{62}$$

is integer, storable and larger than the largest integer which can be stored!! However, the ten last digits of this number are necessarily zeros. Therefore, from the point of view of precission they are negligible, since they can not be changed.

Denormalized numbers

What happens to numbers with exponents out of the range $[-1022,1023]$?

If the exponent is lower than $-1022$ then the number is non-normalized or zero, and we are using for the bits corresponding to the exponent the special value $00000000000$. Then, the hidden bit is now $0$, instead of $1$.

In [22]:
x = 0.5*(2**-1023)   
print(x)               # denormalized number
In [23]:
print(2**-1080)        # Underflow

The lowest non-normalized number is

$$(+1)(0.000\ldots 01)\times 2^{-1022}$$

that is, $2^{-1022-52}$

In [24]:
num_min = 2.**(-1022-52)

For any other smaller value, we get zero (underflow):

In [25]:
2.**(-1022-53)    # Underflow

If the exponent is larger than $1023$ we get OverflowError.

In [26]:
2.**1024          # Overflow
OverflowError                             Traceback (most recent call last)
<ipython-input-26-38590efc1dcd> in <module>()
----> 1 2.**1024          # Overflow

OverflowError: (34, 'Result too large')

Machine precission

Let us compute the lowest number we may add to $1$ using double precission floating point binary representation.

The number $1$, in normalized double precission floating point representation, is

$$(+1)\cdot (1.00\ldots 00) \cdot 2^{0}$$

with $52$ zeros in the mantissa. The lowest number we may add in floating point non-normalized representation is

$$\epsilon = (+1)\cdot (0.00\ldots 01) \cdot 2^{0}$$

which in base $10$ is

In [27]:

which is obtained from Python by

In [28]:

This value, the lowest number $\epsilon$ such that $1+\epsilon>1$ is called the machine precission, which gives the floating point representation precission.

Since, in double precission, $\epsilon \approx 2.22\cdot 10^{-16}$ we have that it corresponds to, approximately, $16$ decimal digits.


$$1=(+1)\cdot (1.00\ldots 00) \cdot 2^{0}$$


$$1+\epsilon=(+1)\cdot (1.00\ldots 01) \cdot 2^{0}$$

we can not represent exactly (and in floating point) any other real number.

To compute the lowest number comparable to $x$ we must add the number eps

In [29]:
x = 10.
eps = np.spacing(x)
In [30]:
x = 100.
eps = np.spacing(x)
In [31]:
x = 1000.
eps = np.spacing(x)

We see that eps increases with the absolute value of $x$. This means that the difference between two consecutive numbers exactly representable in floating point representation increases as we depart from zero. Therefore, the density of exact floating point numbers is not uniform. It is larger close to zero and smaller far from it. Remind that floating point numbers are actually rational numbers.

Special cases

Some operations lead to special results

In [32]:
x = 1.e200
y = x*x
In [33]:

nan means "not a number". In general, it arises when a non valid operation has been performed.

Overflow. It happens when the result of the operation is finite but not representable (larger than the largest representable number).

In [34]:
x = 1.e300
OverflowError                             Traceback (most recent call last)
<ipython-input-34-35043931f28e> in <module>()
      1 x = 1.e300
----> 2 x**2

OverflowError: (34, 'Result too large')

Approximation and rounding error


Often, there is no exact floating point representation of a real number $x$, which lies between two consecutive representable numbers $x^-rounding method. IEEE admits five rounding methods:

  • Round toward 0 ("truncation").
  • Round toward $+ \infty$.
  • Round toward $- \infty$.
  • Round toward the nearest number which is furthest from zero.
  • Round toward the nearest even number ("rounding").

The last method is the most usual.

Loss of digits

Suppose that we are using a three digits decimal representation and we want to sum the numbers $a=123$ and $b=1.25$. The exact result is $s=124.25$ but, since only three digits are available, we shall round to, for instance, $s_r=124$. An error arises. In general, finite arithmetic operations involve errors.

Example. The operation $3\times 0.1 \times 100$ returns:

In [35]:

The result is not the right one ($30$) due to rounding error. These errors may propagate and, in some cases, affect substantially the result. In the previous case the error is due to, as we already saw, the periodicity of $x=0.1$ in the binary base. Thus, its representation is not exact in the computer.

Example. Computing

$$\sum_{k=1}^{10000} 0.1$$


In [36]:
Sum = 0.
for i in range(1,10001):
    Sum = Sum + 0.1
print '%.16f'% Sum       # Decimal format with 16 decimals

with absolute error given by

In [37]:


If we add or substract numbers which are very different in magnitude we always lose accuracy due to rounding error. For instance,

In [38]:
a = 1.e+9
epsilon = 1.e-8

Sum = a
for i in range(1,10001):
    Sum = Sum + epsilon
print '%.16e' % Sum      # Exponencial format with 16 decimals

However, the result should have been

In [39]:
Sum = a + 10000*epsilon
print '%.16e'% Sum


Let us sum the $N$-first term of the harmonic series

$$\sum_{n=1}^N \frac{1}{n}.$$

We use single precission (function float32) to make the effect more evident. The exact result for the first $1000$-terms is

In [40]:
import numpy as np
from scipy import special
N = 10000
Sum = special.polygamma(0,N+1)-special.polygamma(0,1)
print '%.20f'% Sum

Starting the sum from the term $n=1$, we get

In [41]:
Sum1 = 0.;
for n in range(1,N+1):
    Sum1 += 1./n
print '%.20f'% Sum1 
In [42]:
error1 = abs(Sum1 - Sum)
print error1

Starting from the last term, we get

In [43]:
Sum2 = 0.;
for n in range(N,0,-1):
    Sum2 += 1./n
print '%.20f'% Sum2

Thus, the absolute difference is

In [44]:
error2 = abs(Sum2 - Sum)
print error2

The error is largest in the first sum because the addition starts with the largest term and continues adding subsequent lower terms. Thus we lose accuracy because the different magnitudes between the accumulated sum and the new added terms.

Cancellation error

It arises in the substraction of large numbers of similar value. For instance,

In [45]:
import numpy as np
x  = 1000000.
ep = 0.0001
a  = np.sqrt(x**2 + ep) - x
print a

In this case, cancellation error may be avoided by using an equivalent mathematical expression

$$\sqrt{x^2+\epsilon^2}-x=\frac{(\sqrt{x^2+\epsilon^2}-x)(\sqrt{x^2+\epsilon^2}+x)}{\sqrt{x^2+\epsilon^2}+x}=\frac{\epsilon^2}{\sqrt{x^2+\epsilon^2}+x}\approx \frac{\epsilon^2}{\sqrt{x^2}+x}=\frac{\epsilon^2}{2x} $$
In [46]:
b = ep / (np.sqrt(x**2 + ep) + x)
print b

The relative error with the first expression is

In [47]:
Er = abs((a - b)/b)
print Er

This example shows that equivalent mathematical expressions are not necessarily equivalent computational expressions.

Ejercicio 5

(a) When solving the second order equation


using the well known formula


one of the solutions is

In [48]:
a = 1.; b = 10.**8; c = 1.
x1 = (-b + np.sqrt(b**2 - 4*a*c)) / (2*a)
print x1

But after substitution back into the polynomial we get that the residual

In [49]:
residual1 = a*x1**2 + b*x1 + c
print residual1

is relatively large. Find an equivalent mathematical expression to improve the result and write a code for it

In [50]:
print x2
print residual2

When solving


with the formula


we obtain

In [51]:
a = 1.; b = -10.**8; c = 1
x3 = (- b - np.sqrt(b**2 - 4*a*c)) / (2*a)
print x3

but is we substitute this value in the equation

In [52]:
residual3 = a*x3**2 + b*x3 + c
print residual3

that, again, it is an big relative error.

(b) Find an equivalent mathematical expression to improve the result and write a code for it

In [53]:
print x4
print residual4




let's approximate the value of the number $e$, using $x = 1$, with a finite number of addition terms and see how the error evolves

In [1]:
import math

suma = 0.
x = 1.

for n in range(100):
    suma += x**n / math.factorial(n)
    if n in np.array([5,10,15,20,40,60,80,100])-1:
        error   = abs(np.exp(1.) - suma)
        print "Number of terms           %i" % (n+1)
        print "Error                     %.16e" % error
        print ""
Number of terms           5
Error                     9.9484951257120535e-03

Number of terms           10
Error                     3.0288585284310443e-07

Number of terms           15
Error                     8.1490370007486490e-13

Number of terms           20
Error                     4.4408920985006262e-16

Number of terms           40
Error                     4.4408920985006262e-16

Number of terms           60
Error                     4.4408920985006262e-16

Number of terms           80
Error                     4.4408920985006262e-16

Number of terms           100
Error                     4.4408920985006262e-16

The error first decreases but then, it stagnates and does not decrease anymore. Why? If we would be using symbolic calculus, the error would decrease as iterations increase. But we are using finite arithmetic, and the error is limited by the spacing between representable numbers, which around number $e$ is

In [54]:
Ea = np.spacing(np.exp(1.))
print Ea

Let's compute the number of terms we need to get $e^1$ with the lowest possible error.

In [55]:
import math

Ea = np.spacing(np.exp(1.))
n = 0
Sum = 1.

itermax = 100
error   = np.abs(np.exp(1.) - Sum)

while (error > Ea and n < itermax):
    n += 1
    Sum += 1. / math.factorial(n)
    error = np.abs(np.exp(1.) - Sum)
print "Number of iterations     %i" % n
print "Error                    %.16e" % error
print "Ea                       %.16e" % Ea
Number of iterations     17
Error                    4.4408920985006262e-16
Ea                       4.4408920985006262e-16

So after $17$ iterations the truncation error is machine-neglegible.

Exercise 6

Taking into account that

$$\log (1+x) = \dfrac{x^1}{1}-\dfrac{x^2}{2}+\dfrac{x^3}{3}-\dfrac{x^4}{4}+\dfrac{x^5}{5}-\cdots \quad \mathrm{para} \quad |x|\lt 1$$

write a code ( to compute the number of terms we need to get $\log 1.5$ with the lowest possible error.

In [2]:
print "Number of iterations     %i" % n
print "Error                    %.16e" % error
print "Spacing around ln(1+x)   %.16e" % Ea
Number of iterations     45
Error                    5.5511151231257827e-17
Spacing around ln(1+x)   5.5511151231257827e-17