
Finite arithmetic and error analysis


Decimal and binary systems

In the decimal system the number $105.8125$ means:

$$105.8125=1 \cdot 10^2+5\cdot 10^0+8\cdot10^{-1}+1\cdot10^{-2}+2\cdot10^{-3}+5\cdot10^{-4}$$

i.e., it is a linear combination of powers of $10$ with coefficients taken from the ten digits $0,1,\ldots,9$.

Computers use the binary system because it is the natural way an electronic device works: switched on or off. Therefore, only $0's$ and $1's$ need to be stored.

In the binary system numbers are represented as linear combinations of powers of $2$, multiplied by $0's$ and $1's$:

$$(105.8125)_{10}=2^6+2^5+2^3+2^0+2^{-1}+2^{-2}+2^{-4}=(1101001.1101)_2 \qquad (1)$$

Decimal to binary conversion

Conversion of the integer part is achieved by dividing repeatedly by $2$. The remainders of these divisions are the binary digits, from least to most significant.

$$ \begin{array}{lrrrrrrrr} Quotients & 105 & 52 & 26 & 13 & 6 & 3 & 1\\ Remainders & 1 & 0 & 0 & 1 & 0 & 1 & \swarrow & \end{array} $$

Then, $(105)_{10}$ in the binary system is $(1101001)_2$.
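
For illustration only (Exercise 1a below asks for a full function), the division steps can be reproduced with a short loop using Python's built-in divmod:

n = 105
while n > 0:
    n, r = divmod(n, 2)     # quotient and remainder of the division by 2
    print('quotient:', n, '  remainder:', r)

Reading the remainders from the last one to the first gives $(1101001)_2$.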

Remark. Python's function bin performs this conversion.

In [1]:
bin(105)
Out[1]:
'0b1101001'

Exercise 1a

Write a function, de2bi_a(x), that converts the integer part of a base 10 number x into a binary number.

The output will be stored as a string

  • Start with an empty string s = ''
  • Add elements to the left of the previous string with s = str(d) + s, where d is 0 or 1

Use it for x = 105.8125 and x = 310.671875.

Hint: the following functions may be useful

  • np.fix(x): rounds x towards 0.
  • a//b : gives the quotient of the division.
  • a%b : gives the remainder of the division.

The structure of the script will be

import numpy as np

def de2bi_a(x):
    ...  # <- write here your code

def main():
    x = 105.8125
    binary1 = de2bi_a(x)
    print('The integer part of ', x, '   in binary is ', binary1)

if __name__ == "__main__":
    main()

This way you will be able to import the function de2bi_a from the file Exercise1a.py without executing the code in main again.

In [2]:
%run Exercise1a
The integer part of  105.8125    in binary is  1101001
The integer part of  310.671875  in binary is  100110110

To convert the fractional part, we repeatedly multiply by $2$ and strip off the integer part of the result; the process stops when the remaining fractional part becomes zero (or when the desired number of digits has been produced).

$$ \begin{array}{lrrrr} Fractional & 0.8125& 0.625 & 0.25 & 0.5 \\ Digit & 1 & 1 & 0 & 1 \end{array} $$
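
Again for illustration only (Exercise 1b below asks for a full function), the steps in the table above can be reproduced with a short loop; for fractions whose binary expansion does not terminate (such as the fractional part of 14.1), the number of digits must be capped, as the exercise requires:

f = 0.8125
while f > 0:
    f = 2 * f
    d = int(f)          # the integer part is the next binary digit
    f = f - d           # keep only the fractional part
    print('digit:', d, '  remaining fraction:', f)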

Exercise 1b

Write a function, de2bi_b(x), to convert the decimal part of a base 10 number into a binary number.

The output will be stored as a string

  • Start with an empty string s = ''
  • Add elements to the right of the previous string with s = s + str(d), where d is 0 or 1

Use it for x = 105.8125, x = 310.671875 and x = 14.1.

Limit the maximum number of fractional digits in binary to 23 (truncation).

Using the functions de2bi_a(x) and de2bi_b(x), write each of the three numbers x in binary.

The structure of the script will be

import numpy as np
import Exercise1a as ex1a

def de2bi_b(x):
    ...   # <-- write here your code               

def main():
    x = 105.8125
    binary1 = ex1a.de2bi_a(x)
    binary2 = de2bi_b(x)
    print('The fractional part of ', x, ' in binary is ', '0.'+binary2)
    print('The number ', x, 'in binary is ', binary1+'.'+binary2)

if __name__ == "__main__":
    main()
In [3]:
%run Exercise1b
The fractional part of  105.8125  in binary is  0.1101
The number  105.8125 in binary is  1101001.1101

The fractional part of  310.671875  in binary is  0.101011
The number  310.671875 in binary is  100110110.101011

The fractional part of  14.1  in binary is  0.00011001100110011001100
The number  14.1 in binary is  1110.00011001100110011001100

Binary to decimal conversion

Using $(1)$ for converting $(1101001.1101)_2$ to decimal base we get

In [4]:
1*(2**6)+1*(2**5)+0*(2**4)+1*(2**3)+0*(2**2)+0*(2**1)+1*(2**0)+1*(2**-1)+1*(2**-2)+0*(2**-3)+1*(2**-4)
Out[4]:
105.8125
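
More generally, a binary string (possibly with a fractional part) can be converted back to base 10 by summing the corresponding powers of $2$. A minimal sketch, where the helper name bi2de is just illustrative:

def bi2de(s):
    # s is a binary string, e.g. '1101001.1101'
    int_part, _, frac_part = s.partition('.')
    value = float(int(int_part, 2))          # integer part, using Python's int with base 2
    for k, d in enumerate(frac_part, start=1):
        value += int(d) * 2.0**(-k)          # add a_k * 2**(-k)
    return value

print(bi2de('1101001.1101'))     # 105.8125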

Integer representation

If we have $m$ digits or memory bits then we may store $2^m$ different binary numbers.

Positive integers

If we only consider the positive integers then we may represent numbers between

$$ (00\ldots 00)_2=(0)_{10} \quad \mathrm{and} \quad (11\ldots 11)_2=(2^m-1)_{10}. $$

For instance, for $m=3$ bits we may represent positive integers from $0$ to $7$:

$$ \begin{array}{|c|c|} \hline Base \; 10 & Base \; 2\\ \hline \mathtt{0} & \mathtt{000}\\ \mathtt{1} & \mathtt{001}\\ \mathtt{2} & \mathtt{010}\\ \mathtt{3} & \mathtt{011}\\ \mathtt{4} & \mathtt{100}\\ \mathtt{5} & \mathtt{101}\\ \mathtt{6} & \mathtt{110}\\ \mathtt{7} & \mathtt{111}\\ \hline \end{array} $$

Signed integers

In signed integers, the first bit is used to store the sign: $0$ for positive and $1$ for negative. The other $m-1$ digits are used for the unsigned number and, therefore, we may write $2^{m-1}-1$ positive numbers, the same amount of negative numbers and two zeros, one with positive sign and another with negative sign. Therefore, the range of representation is $[-2^{m-1}+1, 2^{m-1}-1]$. For instance, if $m=3$ we may represent numbers from $-3$ to $3$:

$$ \begin{array}{|c|c|} \hline Base \; 10 & Base \; 2\\ \hline \mathtt{-3} & \mathtt{111}\\ \mathtt{-2} & \mathtt{110}\\ \mathtt{-1} & \mathtt{101}\\ \mathtt{-0} & \mathtt{100}\\ \mathtt{+0} & \mathtt{000}\\ \mathtt{+1} & \mathtt{001}\\ \mathtt{+2} & \mathtt{010}\\ \mathtt{+3} & \mathtt{011}\\ \hline \end{array} $$

Example

Compute how $(-80)_{10}$ is stored when using a signed 8-bit binary representation.

Solution: the magnitude is $(80)_{10}=(1010000)_2$, which fits in the $7$ bits available for the unsigned part; setting the first bit to $1$ for the negative sign we obtain $(-80)_{10}=(11010000)_2$.

Signed integers: Two's complement representation

To avoid the double representation of zero we define negative integers by taking the digits of the corresponding positive integer, changing $0's$ to $1's$ and $1's$ to $0's$, and adding $1$ to the result. In this way, the sum of a number and its opposite is always $0$ (the carry out of the leftmost bit is discarded).

$$ \begin{array}{|c|c|} \hline Base \; 10 & Base \; 2\\ \hline \mathtt{-4} & \mathtt{100}\\ \mathtt{-3} & \mathtt{101}\\ \mathtt{-2} & \mathtt{110}\\ \mathtt{-1} & \mathtt{111}\\ \mathtt{0} & \mathtt{000}\\ \mathtt{1} & \mathtt{001}\\ \mathtt{2} & \mathtt{010}\\ \mathtt{3} & \mathtt{011}\\ \hline \end{array} $$

Example. To represent $(-2)_{10}$ we start writing $(2)_{10}$ in binary form, $(010)_2$. Then, we

  • invert its digits $\rightarrow (101)_2$
  • add $001$ (take into account that $0+0=0, \quad 0+1=1, \quad 1+1=10$) $\rightarrow (110)_2$

So the property $(010)_2+(110)_2=(000)_2$ holds.

In this representation the first bit of a negative number is $1$ and that of a nonnegative number is $0$. We may represent integers in the range $[-2^{m-1},2^{m-1}-1]$.

Example. Compute how $(-80)_{10}$ is stored when using a two's complement 8-bit binary representation. Solution: since $(80)_{10}=(01010000)_2$,

  • we interchange $0's$ and $1's$ $\rightarrow 10101111$,
  • and adding $1$ we obtain the representation $(-80)_{10}=(10110000)_2$ (checked programmatically below).
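
The same bit pattern can be obtained with a bit mask, since the two's complement pattern of a negative $x$ coincides with the unsigned representation of $x + 2^m$. A minimal sketch (the function name twos_complement is illustrative):

def twos_complement(x, m=8):
    # m-bit two's complement pattern of x, returned as a binary string
    return format(x & (2**m - 1), '0%db' % m)

print(twos_complement(80))      # 01010000
print(twos_complement(-80))     # 10110000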

Signed integers: biased representation

Negative numbers are represented by the lowest bit patterns, in increasing order starting from the most negative integer; nonnegative numbers take the remaining patterns. The representation is obtained by adding the bias $2^{m-1}$ to the number $x$, i.e., setting $x_r=x+2^{m-1}\in[0,2^m-1]$.

$$ \begin{array}{|c|c|} \hline Base \; 10 & Base \; 2\\ \hline \mathtt{-4} & \mathtt{000}\\ \mathtt{-3} & \mathtt{001}\\ \mathtt{-2} & \mathtt{010}\\ \mathtt{-1} & \mathtt{011}\\ \mathtt{0} & \mathtt{100}\\ \mathtt{1} & \mathtt{101}\\ \mathtt{2} & \mathtt{110}\\ \mathtt{3} & \mathtt{111}\\ \hline \end{array} $$

We see that the representable range is again $[-2^{m-1},2^{m-1}-1]$.

Which form is actually used for integer storage? Most machines use two's complement for integer numbers and biased representation for the exponents of floating point numbers (which are integers).

Why? The main reason to use a biased representation is its efficiency for comparing numbers: biased exponents can be compared directly as unsigned integers. Codes compare floating point numbers very often, and this is done by first comparing their exponents; only if these are equal are the mantissas compared.
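
A minimal sketch of the biased encoding and of this comparison property (both function names are illustrative and assume an $m$-bit field with bias $2^{m-1}$):

def to_biased(x, m=3):
    # encode x in [-2**(m-1), 2**(m-1)-1] by adding the bias 2**(m-1)
    return format(x + 2**(m - 1), '0%db' % m)

def from_biased(s):
    # decode by subtracting the bias
    return int(s, 2) - 2**(len(s) - 1)

print(to_biased(-4), to_biased(-1), to_biased(3))   # 000 011 111
print(from_biased('100'))                           # 0
# comparing the bit patterns as unsigned values preserves the ordering of the numbers
print(to_biased(-3) < to_biased(2))                 # True (equal-length strings compare like unsigned integers)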

Integer representation in Python

In Python there are three distinct numeric types: integers, floating point numbers, and complex numbers. In addition, Booleans are a subtype of integers. Integers have unlimited precision in Python 3.

In a signed representation with $m=64$ bits, the largest integer is given by $2^{m-1}-1.$

In [5]:
m = 64
print(2**(m-1)-1)
print(type(2**(m-1)-1))
9223372036854775807
<class 'int'>

If $m=128$ is the number of bits, the largest integer is

In [6]:
m = 128
print(2**(m-1)-1)
print(type(2**(m-1)-1))
170141183460469231731687303715884105727
<class 'int'>

If $m=256$ is the number of bits, the largest integer is

In [7]:
m = 256
print(2**(m-1)-1)
print(type(2**(m-1)-1))
57896044618658097711785492504343953926634992332820282019728792003956564819967
<class 'int'>

Real number representation

For real numbers, floating point representation of base $\beta=2$ is used:

$$x_r=(-1)^s \cdot m \cdot \beta^e=(-1)^s \cdot (a_1.a_2\ldots a_t) \cdot \beta^e$$

where:

  • $s$ is the sign: $1$ for negative numbers and $0$ for positive.
  • $m$ is the mantissa, which when normalized has values $1\leq m < \beta$ and a nonzero first digit. In binary this digit can only be $1$, i.e. in a normalized mantissa we always have $a_1=1$, so it is unnecessary to store it (hidden bit method).
  • $e$ is the exponent, a signed integer (biased binary representation).

Numbers are stored either in words of 32 bits (single precision), 64 bits (double precision), or 128 bits (extended precision). In most computers, Python's default float is double precision. In this case the bits are used as follows:

  • 1 bit for the sign.
  • 11 bits for the exponent,
  • 52 bits for the mantissa.
In [8]:
%run Double_precision

With $11$ bits for the exponent field, we have room for $2^{11}=2048$ binary patterns, $0 \le E \le 2047$. The first pattern, $00000000000$, is reserved for zeros and denormalized numbers; and the last, $11111111111$, for Inf and NaN.

Thus, for normalized numbers the exponent field takes the values $1 \le E \le 2046$ and, since the bias is $1023$, these correspond to the exponents $-1022 \le E - 1023 \le 1023$. Therefore, the maximum exponent is $E_{max}=1023$ and the minimum is $E_{min}=-1022$. Dividing by the smallest normalized value, we get

$$\frac{1}{x_{min}}=\frac{1}{m\,\beta^{E_{min}}}=\frac{1}{m\,\beta^{-1022}}=\frac{1}{m}\,\beta^{1022}\le\beta^{1022}<\beta^{1023}\le x_{max}$$

since $m\ge 1$. So the maximum representable value is not exceeded, i.e. inverting the smallest normalized number does not overflow.


Exercise 2

Write a script that calculates the single precision binary representation of 105.8125, 120.875, 7.1 and -1.41, following the IEEE 754 standard. Use the scripts written in Exercise 1. Consider only the case where the integer part is greater than zero (the number, in absolute value, is greater than $1$). Round the numbers using truncation.

Note: Remember that single precision uses 32 bits: 1 for the sign , 8 for the exponent and 23 bits for the mantissa. The exponent bias is 127.

In [9]:
%run Exercise2
105.8125 --->   1101001.1101
[Sign]:   0  [Exponent]:  10000101  [Mantissa]:  10100111010000000000000
 
120.875  --->   1111000.111
[Sign]:   0  [Exponent]:  10000101  [Mantissa]:  11100011100000000000000
 
7.1      --->   111.00011001100110011001100
[Sign]:   0  [Exponent]:  10000001  [Mantissa]:  11000110011001100110011
 
-1.41    --->   1.01101000111101011100001
[Sign]:   1  [Exponent]:  01111111  [Mantissa]:  01101000111101011100001
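
As an independent check, the bit pattern of the true IEEE 754 single precision representation can be inspected with the standard struct module (the helper name float32_fields is illustrative; note that struct rounds to nearest while the exercise truncates, so the last mantissa bit could differ for numbers such as 7.1 or -1.41):

import struct

def float32_fields(x):
    # pack x as a big-endian IEEE 754 single and split the 32 bits into the three fields
    bits = format(int.from_bytes(struct.pack('>f', x), 'big'), '032b')
    return bits[0], bits[1:9], bits[9:]

print(float32_fields(105.8125))   # ('0', '10000101', '10100111010000000000000')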

The features of our machine's float type are obtained using the command sys.float_info

In [10]:
import sys
sys.float_info
Out[10]:
sys.float_info(max=1.7976931348623157e+308, max_exp=1024, max_10_exp=308, min=2.2250738585072014e-308, min_exp=-1021, min_10_exp=-307, dig=15, mant_dig=53, epsilon=2.220446049250313e-16, radix=2, rounds=1)

The number of digits in the mantissa, mant_dig, is 53 (52 stored bits plus the hidden bit). That is, Python's float uses 64 bits and is double precision.

But

max_exp=1024 and min_exp=-1021

do not seem to agree with what was written above.

But, if we execute

help(sys.float_info)

we obtain

| max_exp | DBL_MAX_EXP -- maximum int e such that radix**(e-1) is representable

| min_exp | DBL_MIN_EXP -- minimum int e such that radix**(e-1) is a normalized float

That is, these constants are defined with an offset of one: max_exp - 1 = 1023 = $E_{max}$ and min_exp - 1 = -1022 = $E_{min}$, so they do agree with the values above.


Exercise 3

The largest double precision normalized number that can be stored in binary representation is

$$(+1)\cdot (1.11\ldots 11) \cdot 2^{1023}$$

Since we do not have to keep the first $1$, there are 52 bits left to store the $1's$ of $0.11\ldots 11$. Therefore, in base $10$ this number is

$$(1+1\cdot 2^{-1}+1\cdot 2^{-2}+1\cdot 2^{-3}+\cdots+1\cdot 2^{-52})\cdot 2^{1023}$$

Write a code to compute this sum. Perform the sum from the smallest to the largest term (we shall see why later on). Its value should coincide with that obtained with sys.float_info.max. Store the result in a variable named output.

In [12]:
import sys

%run Exercise3
print('%.16e' % output)      # exponential format with 16 decimal digits
1.7976931348623157e+308
In [13]:
sys.float_info.max
Out[13]:
1.7976931348623157e+308

We check that they are equal

In [14]:
print (output == sys.float_info.max)
True
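
For reference, a minimal sketch of one possible way to perform this computation (summing from the smallest term, as suggested in the exercise):

output = 0.
for k in range(52, -1, -1):       # k = 52, 51, ..., 1, 0
    output += 2.**(-k)            # 2**(-52) + ... + 2**(-1) + 2**0
output = output * 2.**1023        # scale by the maximum exponent
print('%.16e' % output)           # 1.7976931348623157e+308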

The smallest positive normalized floating point number is

$$(+1)\cdot (1.00\ldots 00) \cdot 2^{-1022}$$
In [15]:
output = 2**-1022
print ('%.16e' % output)
2.2250738585072014e-308

Its value should coincide with that obtained from sys.float_info.min.

In [16]:
sys.float_info.min
Out[16]:
2.2250738585072014e-308

The largest integer whose binary digits all fit in the mantissa is

$$(+1)\cdot (1.11\ldots 11) \cdot 2^{52}$$

Therefore, we have $53$ precision digits in binary form. The next integer,

$$(+1)\cdot (1.00\ldots 00) \cdot 2^{53}$$

can also be stored exactly. But not the next one, since we would need an extra bit in the mantissa.

In [17]:
print(2**53)
9007199254740992

Hence, all 15-digit integers and most 16-digit integers may be stored exactly.
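
A quick check that $2^{53}+1$ is the first integer that is no longer stored exactly:

print(float(2**53 - 1) == float(2**53))     # False: both integers are exactly representable
print(float(2**53) == float(2**53 + 1))     # True:  2**53 + 1 is rounded back to 2**53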

Observe that the number

$$(+1)\cdot (1.11\ldots 11) \cdot 2^{62}$$

is an integer, is exactly storable and is larger than $2^{53}$. However, its last ten binary digits are necessarily zeros; from the point of view of precision they carry no information, since they cannot be changed.

Denormalized numbers

What happens to numbers with exponents out of the range $[-1022,1023]$?

If the exponent is lower than $-1022$ then the number is denormalized (subnormal) or zero, and the exponent bits store the special value $00000000000$. In that case the hidden bit is $0$ instead of $1$.

In [18]:
x = 0.5*(2**-1023)   
print(x)               # denormalized number
print(1/x)             
5.562684646268003e-309
inf
In [19]:
print(2**-1080)        # Underflow
0.0

The smallest positive denormalized number is

$$(+1)(0.000\ldots 01)\times 2^{-1022}$$

that is, $2^{-1022-52}$

In [20]:
num_min = 2.**(-1022-52)
print(num_min)
5e-324

For any smaller value, we get zero (underflow):

In [21]:
2.**(-1022-53)    # Underflow
Out[21]:
0.0

If the exponent is larger than $1023$ we get OverflowError.

In [22]:
2.**1024          # Overflow
---------------------------------------------------------------------------
OverflowError                             Traceback (most recent call last)
/tmp/ipykernel_6224/1756102542.py in <module>
----> 1 2.**1024          # Overflow

OverflowError: (34, 'Numerical result out of range')

Machine precision

Let us compute the lowest number we may add to $1$ using double precision floating point binary representation.

The number $1$, in normalized double precision floating point representation, is

$$(+1)\cdot (1.00\ldots 00) \cdot 2^{0}$$

with $52$ zeros in the mantissa. The smallest increment we can make in its last mantissa bit is

$$\epsilon = (+1)\cdot (0.00\ldots 01) \cdot 2^{0}$$

which in base $10$ is

In [23]:
1*2.**(-52)
Out[23]:
2.220446049250313e-16

which is obtained from Python by

In [24]:
sys.float_info.epsilon
Out[24]:
2.220446049250313e-16

This value, the smallest number $\epsilon$ such that the computed result of $1+\epsilon$ is greater than $1$, is called the machine precision (machine epsilon); it measures the precision of the floating point representation.

Since, in double precision, $\epsilon \approx 2.22\cdot 10^{-16}$, this corresponds to approximately $16$ decimal digits of precision.

Between

$$1=(+1)\cdot (1.00\ldots 00) \cdot 2^{0}$$

and

$$1+\epsilon=(+1)\cdot (1.00\ldots 01) \cdot 2^{0}$$

no other real number can be represented exactly in floating point.
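
A quick check: adding half of $\epsilon$ to $1$ is rounded back to $1$, while adding $\epsilon$ itself is not:

import sys
eps = sys.float_info.epsilon
print(1. + eps > 1.)        # True
print(1. + eps/2 > 1.)      # False: 1 + eps/2 rounds back to 1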

The gap between $x$ and the next representable number is given by np.spacing(x):

In [25]:
x = 10.
eps = np.spacing(x)
print(eps)
1.7763568394002505e-15
In [26]:
x = 100.
eps = np.spacing(x)
print(eps)
1.4210854715202004e-14
In [27]:
x = 1000.
eps = np.spacing(x)
print(eps)
1.1368683772161603e-13

We see that eps increases with the absolute value of $x$. This means that the difference between two consecutive numbers exactly representable in floating point increases as we move away from zero. Therefore, the density of exactly representable floating point numbers is not uniform: it is higher close to zero and lower far from it. Recall that floating point numbers are actually rational numbers.

Special cases

Some operations lead to special results

In [28]:
x = 1.e200
y = x*x
print(y)
inf
In [29]:
y/y
Out[29]:
nan

nan means "not a number". In general, it arises when an invalid operation has been performed.

Overflow. It happens when the result of the operation is finite but not representable (larger than the largest representable number).

In [30]:
x = 1.e300
x**2
---------------------------------------------------------------------------
OverflowError                             Traceback (most recent call last)
/tmp/ipykernel_6224/3996900289.py in <module>
      1 x = 1.e300
----> 2 x**2

OverflowError: (34, 'Numerical result out of range')

Approximation and rounding error

Rounding

Often, there is no exact floating point representation of a real number $x$, which lies between two consecutive representable numbers $x^-<x<x^+$. Then, the representation of $x$ is chosen according to the rounding method. IEEE admits five rounding methods:

  • Round toward $0$ ("truncation").
  • Round toward $+ \infty$.
  • Round toward $- \infty$.
  • Round to the nearest representable number, breaking ties away from zero.
  • Round to the nearest representable number, breaking ties to the one with an even last bit ("round to nearest even").

The last method is the default and the most commonly used.

Loss of digits

Suppose that we are using a three-digit decimal representation and we want to add the numbers $a=123$ and $b=1.25$. The exact result is $s=124.25$ but, since only three digits are available, it is rounded, for instance to $s_r=124$. An error arises: in general, finite arithmetic operations involve errors.

Example. The operation $3\times 0.1 \times 100$ returns:

In [31]:
3*0.1*100
Out[31]:
30.000000000000004

The result is not the exact one, $30$, due to rounding error. These errors may propagate and, in some cases, substantially affect the result. Here the error arises because, as we already saw, $0.1$ is periodic in binary, so its representation in the computer is not exact.

Example. Computing

$$\sum_{k=1}^{10000} 0.1$$

gives

In [32]:
Sum = 0.
for i in range(1,10001):
    Sum = Sum + 0.1
print('%.16f'% Sum)       # Decimal format with 16 decimals
1000.0000000001588205

with absolute error given by

In [33]:
abs(Sum-1000.)
Out[33]:
1.588205122970976e-10

Example

If we add or subtract numbers which are very different in magnitude, we lose accuracy due to rounding error. For instance,

In [34]:
a = 1.e+9
epsilon = 1.e-8

Sum = a
for i in range(1,10001):
    Sum = Sum + epsilon
print('%.16e' % Sum )     # Exponential format with 16 decimals
1.0000000000000000e+09

However, the result should have been

In [35]:
Sum = a + 10000*epsilon
print('%.16e'% Sum)
1.0000000000001000e+09
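
The reason is that epsilon = 1.e-8 is smaller than half the spacing between representable numbers around a = 1.e9, so each individual addition is rounded back to a and the added term is simply lost:

import numpy as np
print(np.spacing(1.e9))        # 1.1920928955078125e-07, much larger than 1.e-8
print(1.e9 + 1.e-8 == 1.e9)    # True: the added term disappears at every step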

Example

Let us sum the first $N$ terms of the harmonic series

$$\sum_{n=1}^N \frac{1}{n}.$$

(The effect would be even more evident using single precision, np.float32; here the default double precision is used.) A reference value for the sum of the first $10000$ terms, computed with the digamma function, is

In [36]:
from scipy import special
N = 10000
Sum = special.polygamma(0,N+1)-special.polygamma(0,1)
print('%.20f'% Sum)
9.78760603604438372827

Starting the sum from the term $n=1$, we get

In [37]:
Sum1 = 0.;
for n in range(1,N+1):
    Sum1 += 1./n
print('%.20f'% Sum1) 
9.78760603604434820113
In [38]:
error1 = abs(Sum1 - Sum)
print(error1)
3.552713678800501e-14

Starting from the last term, we get

In [39]:
Sum2 = 0.;
for n in range(N,0,-1):
    Sum2 += 1./n
print('%.20f'% Sum2)
9.78760603604438550462

and its absolute error is

In [40]:
error2 = abs(Sum2 - Sum)
print(error2)
1.7763568394002505e-15

The error is larger in the first sum because the addition starts with the largest terms: as the accumulated sum grows, the newly added terms become very small compared with it, and accuracy is lost in each addition. Summing from the smallest terms first keeps the addends closer in magnitude.

Cancellation error

When using finite arithmetic, cancellation arises in the subtraction of numbers of similar value. For instance, working in decimal base with precision $7$, if we subtract

\begin{array}{cc} & \mathtt{1.453417}\\ - & \mathtt{1.453416}\\ \hline & \mathtt{0.000001} \end{array}

we go from precision $7$ to precision $1$, because only one significant digit is left. We can pad the result with zeros, but we do not really know their values: the two operands are themselves approximations (of stored numbers, or of results of previous operations) whose exact digits would in general be different.

Example

For instance, in the formula

$$\sqrt{x^2+\epsilon^2}-x \quad \mathrm{with} \quad x \gg \epsilon$$

the subtraction cancels almost all significant digits, so a large error is to be expected:

In [41]:
import numpy as np
x  = 1000000.
ep = 0.0001                     # ep plays the role of epsilon**2 in the formula above
a  = np.sqrt(x**2 + ep) - x
print(a)
1.1641532182693481e-10

In this case, cancellation error may be avoided by using an equivalent mathematical expression

$$\sqrt{x^2+\epsilon^2}-x=\frac{(\sqrt{x^2+\epsilon^2}-x)(\sqrt{x^2+\epsilon^2}+x)}{\sqrt{x^2+\epsilon^2}+x}=\frac{\epsilon^2}{\sqrt{x^2+\epsilon^2}+x}\approx \frac{\epsilon^2}{\sqrt{x^2}+x}=\frac{\epsilon^2}{2x} $$
In [42]:
b = ep / (np.sqrt(x**2 + ep) + x)
print(b)
5e-11

We can also use

In [43]:
b = ep /(2*x)
print(b)
5e-11

The relative error with the first expression is

In [44]:
Er = abs((a - b)/b)
print(Er)
1.328306436538696

This example shows that equivalent mathematical expressions are not necessarily equivalent computational expressions.


Example

When solving the second order equation

$$x^2+10^{8}x+1=0$$

using the well known formula

$$x_1=\frac{-b+\sqrt{b^2-4ac}}{2a}\qquad (1)$$

since

$$a=1,\,b=10^{8},\:c=1\qquad b^{2}\gg 4ac,$$

we have $\sqrt{b^2-4ac}\approx b$ and a cancellation error is to be expected in the numerator. The computed solution is

In [45]:
a = 1.; b = 10.**8; c = 1.
x11 = (-b + np.sqrt(b**2 - 4*a*c)) / (2*a)
print(x11)
-7.450580596923828e-09

But after substitution back into the polynomial we get that the residual

In [46]:
residual1 = a*x11**2 + b*x11 + c
print(residual1)
0.2549419403076172

is relatively large. It means that it is not a good solution because we should have

$$ax_{1}^{2}+bx_{1}+c\approx0$$

and we are getting

$$ax_{1}^{2}+bx_{1}+c\approx0.255$$

We rewrite the formula

$$x_1=\frac{-b+\sqrt{b^2-4ac}}{2a}=\frac{1}{2a}\left(\sqrt{b^2-4ac}-b\right)$$

And we multiply and divide by the difference

$$x_1=\frac{1}{2a}\left(\sqrt{b^2-4ac}-b\right)\frac{\sqrt{b^2-4ac}+b}{\sqrt{b^2-4ac}+b}=\frac{1}{2a}\frac{\left(\sqrt{b^2-4ac}-b\right)\left(\sqrt{b^2-4ac}+b\right)}{\sqrt{b^2-4ac}+b}$$

As $(x+y)(x-y)=x^2-y^2$

$$x_1=\frac{1}{2a}\frac{(b^2-4ac)-b^2}{\sqrt{b^2-4ac}+b}=\frac{1}{2a}\frac{-4ac}{\sqrt{b^2-4ac}+b}$$

The following formula is therefore equivalent to $(1)$ and does not have the cancellation problem

$$x_1=\frac{-2c}{\sqrt{b^2-4ac}+b}$$

As $b^{2}\gg 4ac$ we can assume that $\sqrt{b^2-4ac}\approx \sqrt{b^2} \approx b$ and

$$x_1\approx\frac{-2c}{b+b}=-\frac{c}{b}$$
In [47]:
x12 = -c/b
residual12 = a*x12**2 + b*x12 + c 
print(x12)
print(residual12)
-1e-08
1.1102230246251565e-16

We can see that now the residual is much smaller than the solution.


Example

When solving

$$x^2-10^{8}x+1=0$$

using the formula

$$x_2=\frac{(-b)-\sqrt{b^2-4ac}}{2a}\qquad (2)$$

as we have

$$a=1,\,b=-10^{8},\:c=1\qquad b^{2}\gg4ac$$

Since $b$ is negative, $-b$ is positive, while $-\sqrt{b^2-4ac}$ is negative but very close in magnitude to $-b$, so a cancellation error is to be expected.

If we use the formula directly, we obtain

In [48]:
a = 1.; b = -10.**8; c = 1
x21 = (- b - np.sqrt(b**2 - 4*a*c)) / (2*a)
print(x21)
7.450580596923828e-09

but if we substitute this value into the equation, we get a residual

In [49]:
residual21 = a*x21**2 + b*x21 + c
print(residual21)
0.2549419403076172

which is again quite large.

We rewrite the formula

$$x_2=\frac{(-b)-\sqrt{b^2-4ac}}{2a}=\frac{1}{2a}\left((-b)-\sqrt{b^2-4ac}\right)$$

Take into account that $b$ is negative, so $-b$ is positive, while $-\sqrt{b^2-4ac}$ is always negative. Thus, the expression $(-b)-\sqrt{b^2-4ac}$ has a cancellation problem when $b^{2}\gg4ac$.

We multiply and divide by the addition

$$x_2=\frac{1}{2a}\left((-b)-\sqrt{b^2-4ac}\right)\frac{(-b)+\sqrt{b^2-4ac}}{(-b)+\sqrt{b^2-4ac}}=\frac{1}{2a}\frac{\left((-b)-\sqrt{b^2-4ac}\right)\left((-b)+\sqrt{b^2-4ac}\right)}{(-b)+\sqrt{b^2-4ac}}$$

As $(x+y)(x-y)=x^2-y^2$

$$x_2=\frac{1}{2a}\frac{(-b)^2-(b^2-4ac)}{(-b)+\sqrt{b^2-4ac}}=\frac{1}{2a}\frac{4ac}{(-b)+\sqrt{b^2-4ac}}$$

And the following formula is equivalent to $(2)$ and does not have the cancellation problem (the denominator is the addition of two positive numbers, as $b$ is negative)

$$x_2=\frac{2c}{(-b)+\sqrt{b^2-4ac}}$$

As $b^{2}\gg4ac$ we can consider that $\sqrt{b^2-4ac}\approx \sqrt{b^2} = |b|=-b$ and we can simplify the formula

$$x_2\approx\frac{2c}{(-b)+(-b)}=-\frac{c}{b}$$
In [50]:
x22 = -c/b
residual22 = a*x22**2 + b*x22 + c
print(x22)
print(residual22)
1e-08
1.1102230246251565e-16
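
Both cases can be combined into a single rule: compute with the standard formula the root whose numerator has no cancellation (choosing the sign in front of the square root equal to the sign of $b$), and recover the other root from Vieta's relation $x_1 x_2 = c/a$. A minimal sketch (the function name stable_roots is illustrative; real distinct roots and $b \neq 0$ are assumed):

import numpy as np

def stable_roots(a, b, c):
    # roots of a*x**2 + b*x + c = 0 avoiding cancellation; assumes b**2 >= 4*a*c and b != 0
    sq = np.sqrt(b**2 - 4*a*c)
    q = -0.5 * (b + np.copysign(sq, b))   # b and the square root are combined with the same sign
    return q / a, c / q                   # second root from x1 * x2 = c / a

print(stable_roots(1., 10.**8, 1.))     # roots of x**2 + 1e8 x + 1 = 0
print(stable_roots(1., -10.**8, 1.))    # roots of x**2 - 1e8 x + 1 = 0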

Sources of error

There are two kinds of numerical methods:

  • Direct methods. We arrive at the solution in a finite number of steps, as in the previous example (roots of a second degree polynomial).
  • Iterative methods. We generate a sequence of approximate solutions that tends to the exact solution.

Depending on their source, we can distinguish:

  1. Rounding errors. They affect all methods when finite arithmetic is used.
  2. Errors due to the method. They affect only the iterative methods, not the direct methods.
  3. Errors due to errors in the data. They affect all the methods.

Truncation error

It is an error due to the method: the error we get when we apply an iterative method and, instead of generating infinitely many terms of the sequence of approximations, stop at some point and accept a solution that contains an error but is "good enough".

For example, given

$$e^x=\sum_{n=0}^{\infty}\frac{x^n}{n!},$$

we would get the exact value only by summing infinitely many terms (and using infinite precision). If we compute only a finite number of terms, we have a truncation error.

Let's approximate the number $e$, taking $x = 1$ and a finite number of terms of the series, and see how the error evolves.

In [51]:
suma = 0.
x = 1.
fact = 1.

for n in range(100):
    suma += x**n / fact
    fact *= n+1
    if n in np.array([5,10,15,20,40,60,80,100])-1:
        error   = abs(np.exp(1.) - suma)
        print("Number of terms           %i" % (n+1))
        print("Error                     %.16e\n" % error)
Number of terms           5
Error                     9.9484951257120535e-03

Number of terms           10
Error                     3.0288585284310443e-07

Number of terms           15
Error                     8.1490370007486490e-13

Number of terms           20
Error                     4.4408920985006262e-16

Number of terms           40
Error                     4.4408920985006262e-16

Number of terms           60
Error                     4.4408920985006262e-16

Number of terms           80
Error                     4.4408920985006262e-16

Number of terms           100
Error                     4.4408920985006262e-16

The error first decreases but then stagnates and does not decrease anymore. Why? If we were using exact (symbolic) arithmetic, the error would keep decreasing as more terms are added. But we are using finite arithmetic, and the error is limited by the spacing between representable numbers, which around the number $e$ is

In [52]:
Ea = np.spacing(np.exp(1.))
print(Ea)
4.440892098500626e-16

Let's compute the number of terms we need to get $e^1$ with the lowest possible error.

In [53]:
Ea = np.spacing(np.exp(1.))
n = 0
Sum = 1.
fact = 1.

itermax = 100
error   = np.abs(np.exp(1.) - Sum)

while (error > Ea and n < itermax):
    n += 1
    Sum += 1. / fact
    fact *= n+1
    error = np.abs(np.exp(1.) - Sum)
    
print("Number of iterations     %i" % n)
print("Error                    %.16e" % error)
print("Ea                       %.16e" % Ea)
Number of iterations     17
Error                    4.4408920985006262e-16
Ea                       4.4408920985006262e-16

So after $17$ iterations the truncation error has reached the spacing around $e$, i.e., it is negligible at machine precision.


Exercise 4

Taking into account that

$$\log (1+x) = \dfrac{x^1}{1}-\dfrac{x^2}{2}+\dfrac{x^3}{3}-\dfrac{x^4}{4}+\dfrac{x^5}{5}-\cdots \quad \mathrm{for} \quad |x|\lt 1$$

write a code to compute the number of terms we need to get $\log 1.5$ with the lowest possible error.

In [54]:
%run Exercise4
print("Number of iterations     %i" % n)
print("Error                    %.16e" % error)
print("Spacing around ln(1+x)   %.16e" % Ea)
Number of iterations     45
Error                    5.5511151231257827e-17
Spacing around ln(1+x)   5.5511151231257827e-17

Exercise 5

Write a script to compute, for each function, the number of terms we need to get $f(x_0)$ with the lowest possible error.

(a)

$$\sin x=\sum_{n=0}^{\infty}\frac{(-1)^n}{(2n+1)!}x^{2n+1}=x-\frac{x^3}{6}+\frac{x^5}{120}-\cdots\quad \quad x_0=\frac{\pi}{3}$$

(b)

$$\cos x=\sum_{n=0}^{\infty}\frac{(-1)^n}{(2n)!}x^{2n}=1-\frac{x^2}{2}+\frac{x^4}{24}-\cdots\quad \quad x_0=\frac{\pi}{4}$$

(c)

$$\arcsin x=\sum_{n=0}^{\infty}\frac{(2n)!}{4^n(n!)^2(2n+1)}x^{2n+1}=x+\frac{x^3}{6}+\frac{3x^5}{40}+\cdots\quad \quad x_0=0.7$$
In [55]:
%run Exercise5
-----------------------  a   -----------------------

Number of iterations            8
Error                         1.1102230246251565e-16
Spacing around f(x0)          1.1102230246251565e-16
f(x0)                         8.6602540378443860e-01

-----------------------  b   -----------------------

Number of iterations            8
Error                         1.1102230246251565e-16
Spacing around f(x0)          1.1102230246251565e-16
f(x0)                         7.0710678118654757e-01

-----------------------  c   -----------------------

Number of iterations            40
Error                         1.1102230246251565e-16
Spacing around f(x0)          1.1102230246251565e-16
f(x0)                         7.7539749661075297e-01
