Finite arithmetic and error analysis
Decimal and binary systems
In the decimal system the number \(107.625\) means: \[107.625=1 \cdot 10^2+0\cdot 10^1+7\cdot 10^0+6\cdot10^{-1}+2\cdot10^{-2}+5\cdot10^{-3}\] i.e., it is a linear combination of powers of \(10\), each multiplied by one of the \(10\) digits \(0,1,\ldots,9\).
Computers use the binary system because it matches the natural way an electronic device works: switched on or off. Therefore, only \(0\)'s and \(1\)'s need to be stored. In the binary system numbers are represented as linear combinations of powers of \(2\), each multiplied by \(0\) or \(1\): \[(107.625)_{10}=2^6+2^5+2^3+2^1+2^0+2^{-1}+2^{-3}=(1101011.101)_2 \qquad (1)\]
Decimal to Binary conversion
Conversion of the integer part is achieved by dividing repeatedly by \(2\). The remainders of these divisions are the digits in base \(2\), from least to most significant.
\((107)_{10}\) in the binary system is \((1101011)_2\):
\[ \begin{array}{lrrrrrrrr} \text{Quotients} & 107 & 53 & 26 & 13 & 6 & 3 & 1 & 0\\ \text{Remainders} & & 1 & 1 & 0 & 1 & 0 & 1 & 1 \end{array} \]
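The algorithm above can also be sketched outside Matlab; here is a minimal Python version for illustration (the name dec2bin_int is hypothetical, not the de2bi_a requested below):

```python
def dec2bin_int(x, N):
    """Binary digits (most significant first) of the integer part of x, N digits."""
    q = int(x)                    # like Matlab's fix(x): truncate toward zero
    bits = [0] * N
    for i in range(N):
        bits[N - 1 - i] = q % 2   # remainder of division by 2, like rem(q,2)
        q //= 2
    return bits

print(dec2bin_int(107.625, 8))    # [0, 1, 1, 0, 1, 0, 1, 1]
```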
Remark. Some versions of Matlab use function de2bi to perform these conversions.
Exercise 1. Make a function, de2bi_a.m, to convert the integer part of a base 10 number into a binary number with N digits. Apply it on \((107.625)_{10} \) and on \((0.1)_{10}\), with N=8.
Hint: Use Matlab functions
- fix(x): rounds x toward 0.
- rem(x,n): remainder of x divided by n.
- zeros(1,m): gives a vector of m zeros.
x=107.625; N=8; b=de2bi_a(x,N)
b = 0 1 1 0 1 0 1 1
To convert the decimal part, we multiply by \(2\), drop the integer part and repeat until the decimal part is \(0\). Observe that this algorithm may lead to an infinite loop, for instance, for \( (0.1)_{10} \) .
\[ \begin{array}{lrrrr} \text{Decimal} & 0.625 & 0.25 & 0.5 & 0\\ \text{Integer} & & 1 & 0 & 1 \end{array} \]
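Again, a minimal Python sketch of this multiply-and-drop algorithm (the name dec2bin_frac is hypothetical):

```python
def dec2bin_frac(x, N):
    """First N binary digits of the decimal part of x (most significant first)."""
    f = x - int(x)
    bits = []
    for _ in range(N):
        f *= 2
        bits.append(int(f))   # the integer part produced is the next binary digit
        f -= int(f)           # drop it and continue
    return bits

print(dec2bin_frac(107.625, 5))   # [1, 0, 1, 0, 0]
```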
Exercise 2 Make a function, de2bi_b.m, to convert the decimal part of a base 10 number into a binary number with N digits. Apply it on \((107.625)_{10} \) and on \((0.1)_{10}\), with N=5.
x=107.625; N=5; b=de2bi_b(x,N)
b = 1 0 1 0 0
Exercise 3 Compute the binary representation of \((0.1)_{10}\) with 30 decimal digits.
exercise3
b = Columns 1 through 13 0 0 0 1 1 0 0 1 1 0 0 1 1 Columns 14 through 26 0 0 1 1 0 0 1 1 0 0 1 1 0 Columns 27 through 30 0 1 1 0
Note: \((0.1)_{10}=(0.000110011001100110011001100110\ldots)_2\) is periodic in binary base. Therefore, computers cannot store it exactly.
Binary to Decimal conversion
Using \((1)\) for converting \((1101011.101)_2\) to decimal base we get
1*(2^6)+1*(2^5)+0*(2^4)+1*(2^3)+0*(2^2)+1*(2^1)+1*(2^0)+1*(2^-1)+0*(2^-2)+1*(2^-3)
ans = 1.076250000000000e+02
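The same evaluation can be sketched in Python (the helper bin2dec is hypothetical):

```python
def bin2dec(int_digits, frac_digits):
    """Evaluate a binary expansion; digit lists are most significant first."""
    value = 0.0
    for i, d in enumerate(reversed(int_digits)):
        value += d * 2**i            # integer part: non-negative powers of 2
    for i, d in enumerate(frac_digits, start=1):
        value += d * 2**(-i)         # fractional part: negative powers of 2
    return value

print(bin2dec([1, 1, 0, 1, 0, 1, 1], [1, 0, 1]))   # 107.625
```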
Integer representation
If we have \(m\) digits or memory bits then we may store \(2^m\) different binary numbers.
Positive integers
If we only consider the positive integers then we may represent numbers between \( (00\ldots 00)_2=(0)_{10}\) and \( (11\ldots 11)_2=(2^m-1)_{10}\). For instance, for \(m=3\) bits we may represent positive integers from \(0\) to \(7\):
\[ \begin{array}{rr} Number_{10} & Number_{2}\\ 0 & 000\\ 1 & 001\\ 2 & 010\\ 3 & 011\\ 4 & 100\\ 5 & 101\\ 6 & 110\\ 7 & 111 \end{array} \]
Exercise 4. Write a code, exercise2_4.m, to compute how \((80)_{10}\) is stored when using an unsigned 8-bit binary representation. Hint: Use your function de2bi_a.
Signed integers
In signed integers, the first bit stores the sign: \(0\) for positive and \(1\) for negative. The other \(m-1\) digits store the unsigned number and, therefore, we may represent \(2^{m-1}-1\) positive numbers, the same number of negative numbers, and two zeros, one with positive sign and one with negative sign. Therefore, the range of representation is \([-2^{m-1}+1, 2^{m-1}-1]\). For instance, if \(m=3\) we may represent numbers from \(-3\) to \(3\):
\[ \begin{array}{rr} Number_{10} & Number_{2}\\ -3 & 111\\ -2 & 110\\ -1 & 101\\ -0 & 100\\ +0 & 000\\ +1 & 001\\ +2 & 010\\ +3 & 011 \end{array} \]
Example. Compute how \((-80)_{10}\) is stored when using a signed 8-bit binary representation.
Solution: The representation is similar to that of Exercise 4, but with the first digit equal to \(1\), due to the negative sign. So we have \((-80)_{10}=(11010000)_2\).
Signed integers: Two's complement representation
To avoid the double representation of zero we define negative integers by taking the digits of the corresponding positive integer, changing \(0\)'s to \(1\)'s and \(1\)'s to \(0\)'s, and adding \(1\) to the result. In this way, the sum of a number and its opposite is \(0\) (discarding the carry out of the \(m\)-th bit).
\[ \begin{array}{rr} Number_{10} & Number_{2}\\ -3 & 101\\ -2 & 110\\ -1 & 111\\ 0 & 000\\ 1 & 001\\ 2 & 010\\ 3 & 011 \end{array} \]
Example. To represent \((-2)_{10}\) we start writing \((2)_{10}\) in binary form, \((010)_2\). Then
- invert its digits \(\rightarrow (101)_2\)
- add \(001\) (take into account that \(0+0=0, \quad 0+1=1, \quad 1+1=10\)) \(\rightarrow (110)_2\)
So the property \((010)_2+(110)_2=(000)_2\) holds.
In this case, the first bit of a negative number is \(1\) and that of a positive number is \(0\). We may represent integers in the range \([-2^{m-1},2^{m-1}-1]\).
Example. Compute how \((-80)_{10}\) is stored when using a two's complement 8-bit binary representation.
Solution: since \((80)_{10}=(01010000)_2\)
- we interchange \(0's\) and \(1's\) \(\rightarrow 10101111 \)
- and adding \(1\) we obtain the representation \((-80)_{10}=(10110000)_2\)
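For illustration, a minimal Python sketch: the invert-and-add-1 rule amounts to reducing the integer modulo \(2^m\) (the helper name is hypothetical):

```python
def twos_complement_bits(n, m):
    """m-bit two's complement pattern of the integer n, most significant bit first."""
    pattern = n & (2**m - 1)   # invert-and-add-1 equals taking n modulo 2^m
    return [(pattern >> (m - 1 - i)) & 1 for i in range(m)]

print(twos_complement_bits(-80, 8))   # [1, 0, 1, 1, 0, 0, 0, 0]
print(twos_complement_bits(80, 8))    # [0, 1, 0, 1, 0, 0, 0, 0]
```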
Signed integers: biased representation
Negative numbers are represented as consecutive values starting from zero for the lowest negative integer; positive numbers occupy the remaining values. The representation is obtained by adding the bias \(2^{m-1}\) to the number \(x\), i.e., setting \(x_r=x+2^{m-1}\in[0,2^m-1]\).
\[ \begin{array}{rc} Number_{10} & Number_{2}\\ -4 & 000\\ -3 & 001\\ -2 & 010\\ -1 & 011\\ 0 & 100\\ 1 & 101\\ 2 & 110\\ 3 & 111 \end{array} \]
Example. Compute how \((80)_{10}\) is stored when using a biased 8-bit binary representation.
Solution: we add the bias \(x_r=x+2^{m-1}\). Then \(x_r=80+2^{8-1}=208\)
x=80;
xr=x+2^7;       % Add the bias
b=de2bi(xr);    % Pass to binary
b=b(end:-1:1)   % Invert (recall that Matlab gives the opposite ordering of digits)
b = 1 1 0 1 0 0 0 0
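A minimal Python sketch of the biased encoding (the helper name is hypothetical):

```python
def biased_bits(n, m):
    """m-bit biased representation of the integer n, most significant bit first."""
    xr = n + 2**(m - 1)            # add the bias
    assert 0 <= xr < 2**m, "n out of the representable range"
    return [(xr >> (m - 1 - i)) & 1 for i in range(m)]

print(biased_bits(80, 8))   # [1, 1, 0, 1, 0, 0, 0, 0]
print(biased_bits(-4, 3))   # [0, 0, 0]  -- the lowest negative maps to all zeros
```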
Integer representation in Matlab
Matlab's default representation mode is double precision floating point. To define an integer number we must use a conversion function (Mathworks-integers). Matlab has \(4\) classes of both signed and unsigned integers, stored in \(1\), \(2\), \(4\) and \(8\) bytes.
Matlab uses the last two methods to store integers, depending on their use:
- if the integer is the exponent of a floating point number, then biased representation is used,
- otherwise, following the IEEE standard, two's complement representation is used.
Remark. The main reason to use biased representation is its efficiency when comparing numbers. Matlab codes must compare floating point numbers very often, and this is done by first comparing their exponents; only if these are equal are the mantissas compared.
Exercise 5. Fill the table with the range of values for the following classes.
Class | Range |
Signed-8 bits | \([-2^{7},2^{7}-1]=\left[-128,127\right]\) |
Signed-16 bits | \(\left[-2^{15},2^{15}-1\right]=\cdots\) |
Signed-32 bits | |
Signed-64 bits | |
Unsigned-8 bits | \(\left[0,2^{8}-1\right]=\cdots\) |
Unsigned-16 bits | |
Unsigned-32 bits | |
Unsigned-64 bits |
(Solution in Mathworks-integers)
Observe that when conversion from floating point to integer is performed for numbers out of the representation range, Matlab gives the corresponding limit value:
out1=int8(300);   % maximum
out2=int8(-300);  % minimum
fprintf('out1=%d, out2=%d',out1,out2)
out1=127, out2=-128
Real numbers representation
For real numbers, floating point representation of base \(\beta=2\) is used. \[x_r=(-1)^s \cdot m \cdot \beta^e=(-1)^s \cdot (a_1.a_2\ldots a_t) \cdot \beta^e\] where:
- \(s\) is the sign. Stored with \(1\) for negative numbers and \(0\) for positive.
- \(m\) is the mantissa, which when normalized satisfies \(1\leq m < \beta\), i.e., it has a nonzero first digit. In base \(2\) this digit can only be \(1\), so in a normalized mantissa we always have \(a_1=1\) and it is unnecessary to store it (hidden bit method).
- \(e\) is the exponent, a signed integer (biased binary representation).
Numbers are stored either in words of 32 bits (single precision), 64 bits (double precision), or 128 bits (extended precision). Matlab's default is double precision. In this case the bits are used as follows:
- 1 bit for the sign.
- 52 bits for the mantissa.
- 11 bits for the exponent.
With \(11\) bits for the signed exponent we have values in the range \([-2^{10},2^{10}-1]=[-1024,1023]\). The largest exponent is \(E_{max}=1023\), but the smallest is taken as \(E_{min}=-1022\) so that the quotient \(1/x_{min}\) is well defined:
\[\frac{1}{x_{min}}=\frac{1}{m\beta^{E_{min}}}=\frac{1}{m\beta^{-1022}}=\frac{1}{m}\beta^{1022}\leq\beta^{1022}<\beta^{1023}\]
(since \(m\geq 1\)), so the maximum representable value is not exceeded, i.e., there is no overflow.
Exercise 6. The largest double precision normalized number that Matlab may store in binary representation is \[(+1)\cdot (1.11\ldots 11) \cdot 2^{1023}\] Since we do not have to keep the first \(1\), there are 52 bits left to store the \(1\)'s of \(0.11\ldots 11\). Therefore, in base \(10\) this number is \[(1+1\cdot 2^{-1}+1\cdot 2^{-2}+1\cdot 2^{-3}+\cdots+1\cdot 2^{-52})\cdot 2^{1023}\] Write a code exercise2_6.m to compute this sum. Perform the sum from the smallest to the largest terms (we shall see why later on). Its value should coincide with the Matlab constant realmax
format long
realmax
ans = 1.797693134862316e+308
Exercise 7. The largest integer which can be stored in binary form with full floating point precision is \[(+1)\cdot (1.11\ldots 11) \cdot 2^{52}\] Therefore, we have \(53\) precision digits in binary form. Write a program Exercise2_7.m to compute the above number in base \(10\).
Hence, all 15-digit integers and most 16-digit integers may be stored.
Observe that, actually, the number \[(+1)\cdot (1.11\ldots 11) \cdot 2^{62}\] is an integer, storable, and larger than the bound above! However, the last ten binary digits of this number are necessarily zeros. Therefore, from the point of view of precision they are negligible, since they cannot be changed.
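This threshold can be checked in any language using IEEE double precision; a small Python illustration:

```python
# All integers up to 2^53 are exact in double precision; beyond that, gaps appear.
big = 2.0**53
assert big - 1 != big      # below 2^53 consecutive integers are distinguishable
assert big + 1 == big      # 2^53 + 1 rounds back to 2^53: the odd integer is lost
print(int(2**53 - 1))      # 9007199254740991, the largest full-precision integer
```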
Exercise 8. Using a procedure similar to that of Exercise 7, write a code Exercise2_8.m to compute realmin from the expression \[(+1)\cdot (1.00\ldots 00) \cdot 2^{-1022}\]
realmin % Normalized representation minimum floating point number
ans = 2.225073858507201e-308
What happens to numbers with exponents out of the range \([-1022,1023]\)?
- If the exponent is lower than \(-1022\) then the number is non-normalized or zero.
x=2^-1024 % non-normalized
1/x
x = 5.562684646268003e-309 ans = Inf
2^-1080 %zero
ans = 0
- If the exponent is larger than \(1023\) then the number is set to infinity
2^1024
ans = Inf
-2^1025
ans = -Inf
Machine precision
Let us compute the lowest number we may add to \(1\) using double precision floating point binary representation.
\(1\), in normalized double precision floating point representation, is \[(+1)\cdot (1.00\ldots 00) \cdot 2^{0}\] with \(52\) zeros in the mantissa. The lowest number we may add in floating point non-normalized representation is \[(+1)\cdot (0.00\ldots 01) \cdot 2^{0}\] which in base \(10\) is
1*2^(-52)
ans = 2.220446049250313e-16
which coincides with
eps
ans = 2.220446049250313e-16
This value, the lowest number \(\epsilon\) such that \(1+\epsilon>1\), is called the machine precision, and it gives the precision of the floating point representation.
Since, in double precision, \(\epsilon \approx 2.22\cdot 10^{-16}\), it corresponds to approximately \(16\) decimal digits.
Between \[1=(+1)\cdot (1.00\ldots 00) \cdot 2^{0}\] and \[1+\epsilon=(+1)\cdot (1.00\ldots 01) \cdot 2^{0}\] we cannot represent exactly (in floating point) any other real number.
To compute the lowest number comparable to \(x\) we use the function eps(x), which gives the positive distance from \({\rm abs}(x)\) to the next larger (in magnitude) floating point number of the same precision as \(x\).
x=10; eps(x)
ans = 1.776356839400250e-15
x=100; eps(x)
ans = 1.421085471520200e-14
x=1000; eps(x)
ans = 1.136868377216160e-13
We see that eps increases with the absolute value of \(x\). This means that the difference between two consecutive numbers exactly representable in floating point increases as we depart from zero. Therefore, the density of exactly representable floating point numbers is not uniform: it is larger close to zero and smaller far from it. Recall that floating point numbers are actually rational numbers.
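The same experiment can be reproduced in Python, where math.ulp(x) plays the role of eps(x):

```python
import math

# The spacing between consecutive doubles grows with the magnitude of x
for x in (1.0, 10.0, 100.0, 1000.0):
    print(x, math.ulp(x))

assert math.ulp(1.0) == 2.0**-52            # the machine precision eps
assert math.ulp(1000.0) > math.ulp(1.0)     # spacing grows away from zero
```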
Special cases
Some operations lead to special results
3/0
ans = Inf
-3/0
ans = -Inf
0/0
ans = NaN
NaN means "Not a Number". In general, it arises when an invalid operation has been performed.
Overflow. It happens when the result of the operation is finite but not representable (larger than the largest representable number).
x=1e300
x^2
x = 1.000000000000000e+300 ans = Inf
Underflow. It happens when the result of the operation is lower than the lowest normalized representable number.
1/x^2
ans = 0
Approximation and rounding error
First, a game
Download and run, after clearing all the variables, the code error_rectangle. To get a hint of what is happening, download and run hint.
Rounding
Often, there is no exact floating point representation of a real number \(x\), which lies between two consecutive representable numbers \(x^-<x<x^+\). Then, the representation of \(x\) is chosen according to the rounding method. IEEE admits four rounding methods:
- To the largest.
- To the lowest.
- To zero ("truncation").
- To the closest.
Matlab uses the last one.
Loss of digits
Suppose that we are using a three-digit decimal representation and we want to sum the numbers \(a=123\) and \(b=1.25\). The exact result is \(s=124.25\) but, since only three digits are available, it is rounded to \(s_r=124\). An error arises. In general, arithmetic operations involve errors.
Example. The operation \(3\times 0.1 \times 100\) returns:
3*0.1*100
ans = 30.000000000000004
The result is not the exact one (\(30\)) due to rounding error. These errors may propagate and, in some cases, substantially affect the result. In this case the error is due, as we already saw, to the periodicity of \(x=0.1\) in binary base; thus its representation in the computer is not exact.
Example. Computing \(\sum_{k=1}^{10000} 0.1\) gives
s=0;
for i=1:10000
    s=s+0.1;
end
disp(s)
1.000000000000159e+03
Absolute and relative errors are
AE=abs(s-1000); RE=abs(s-1000)/1000; [AE,RE]
ans = 1.0e-09 * 0.158820512297098 0.000158820512297
Example. If we add or subtract numbers of very different magnitude we always lose accuracy due to rounding error. For instance,
a=1e+9; epsilon=1e-8;
s=a;
for n=1:10000
    s=s+epsilon;
end
disp(s)
1.000000000000000e+09
However, the result should have been
s=a+10000*epsilon
s = 1.000000000000100e+09
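The same loss can be reproduced in Python, which also uses IEEE double precision:

```python
a = 1.0e9
epsilon = 1.0e-8
s = a
for _ in range(10000):
    s += epsilon     # epsilon is below half the spacing at 1e9, so it is rounded away

print(s)                       # 1000000000.0: all 10000 contributions were lost
print(a + 10000 * epsilon)     # slightly larger: the same total added in one step survives
```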
Example. Let us sum the first \(N\) terms of the harmonic series \(\sum_{n=1}^N \frac{1}{n}\). We use single precision (function single) to make the effect more pronounced. The exact result for the first \(1000\) terms is
N=1000; s=single((psi(N+1)-psi(1)))
s = 7.4854708
Starting the sum from the first term, \(n=1\)
s1=single(0.0);
for n=1:N
    s1=s1+single(1)/single(n);
end
error1=abs(s1-s);
fprintf('s1 = %10.8f error1 = %10.8f',s1,error1)
s1 = 7.48547840 error1 = 0.00000763
Starting from the last term
s2=single(0.0);
for n=N:-1:1
    s2=s2+single(1)/single(n);
end
error2=abs(s2-s);
fprintf('s2 = %10.8f error2 = %10.8f',s2,error2)
s2 = 7.48547173 error2 = 0.00000095
The error is larger in the first sum because the addition starts with the largest term and continues adding ever smaller terms. Thus we lose accuracy because of the difference in magnitude between the accumulated sum and the newly added terms.
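The experiment can be reproduced outside Matlab by rounding every operation to single precision; a Python sketch (the helper f32 is a hypothetical stand-in for Matlab's single):

```python
import struct

def f32(x):
    """Round a double to the nearest IEEE single, emulating Matlab's single()."""
    return struct.unpack('f', struct.pack('f', x))[0]

N = 1000
ref = sum(1.0 / n for n in range(1, N + 1))   # double precision reference value

s1 = f32(0.0)
for n in range(1, N + 1):                     # largest terms first
    s1 = f32(s1 + f32(1.0 / n))

s2 = f32(0.0)
for n in range(N, 0, -1):                     # smallest terms first
    s2 = f32(s2 + f32(1.0 / n))

print(abs(s1 - ref), abs(s2 - ref))           # the backward sum is more accurate
```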
Cancellation error
It arises in the subtraction of nearly equal large numbers. For instance, \[\sqrt{x^2+\epsilon}-x\]
x=1000000; ep=0.0001;
a=sqrt(x^2+ep)-x
a = 1.164153218269348e-10
In this case, cancellation error may be avoided by using an equivalent mathematical expression \[\sqrt{x^2+\epsilon}-x=\frac{(\sqrt{x^2+\epsilon}-x)(\sqrt{x^2+\epsilon}+x)}{\sqrt{x^2+\epsilon}+x}=\frac{\epsilon}{\sqrt{x^2+\epsilon}+x}\]
b=ep/(sqrt(x^2+ep)+x)
b = 5.000000000000000e-11
The relative error with the first expression is
Er=abs((a-b)/b)
Er = 1.328306436538696
This example shows that equivalent mathematical expressions are not necessarily equivalent computational expressions.
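The same comparison runs unchanged in Python doubles (a sketch; the variable names mirror the Matlab code above):

```python
import math

x = 1.0e6
ep = 1.0e-4

a = math.sqrt(x*x + ep) - x            # cancellation: barely one significant digit left
b = ep / (math.sqrt(x*x + ep) + x)     # stable rewriting: close to the exact 5e-11

print(a, b)
```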
Exercise 9. When solving the second order equation \[x^2+10^{8}x+1=0\] using the usual formula \[x_1=\frac{-b+\sqrt{b^2-4ac}}{2a}\]
we obtain
a=1;b=10^8;c=1;
x1=(-b+sqrt(b^2-4*a*c))/(2*a)
x1 = -7.450580596923828e-09
But after substitution back into the polynomial we get
pol=a*x1^2+b*x1+c
pol = 0.254941940307617
which is a large absolute error. Find an equivalent mathematical expression to improve the result and write a code for it (exercise2_9.m).
Exercise 10. Similarly, for the equation \[x^2-10^{8}x+1=0\] the root computed according to the formula \[x_2=\frac{-b-\sqrt{b^2-4ac}}{2a}\]
gives
a=1;b=-10^8;c=1;
x2=(-b-sqrt(b^2-4*a*c))/(2*a)
pol=a*x2^2+b*x2+c
x2 = 7.450580596923828e-09 pol = 0.254941940307617
Find again an equivalent mathematical expression to improve the result and write a code for it (exercise2_10.m).
Example. Taking into account that \[e^x=\sum_{n=0}^{\infty}\frac{x^n}{n!}\] compute the number of terms we need to get \(e^1\) with the lowest possible error.
The lowest number comparable to \(e^1\) is
Ea=eps(exp(1))
Ea = 4.440892098500626e-16
And the number of iterations we need is
n=0; s=1;
itermax=100;
error=abs(exp(1)-s);
while( error > Ea && n < itermax )
    n=n+1;
    s=s+1/factorial(n);
    error=abs(exp(1)-s);
end
fprintf('n = %d error = %10.8f',n,error)
n = 17 error = 0.00000000
So after \(17\) iterations the truncation error is negligible at machine precision.
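The same stopping criterion can be mirrored in Python, where math.ulp(math.e) plays the role of eps(exp(1)):

```python
import math

Ea = math.ulp(math.e)          # the Python analogue of Matlab's eps(exp(1))
n, s = 0, 1.0
error = abs(math.e - s)
while error > Ea and n < 100:
    n += 1
    s += 1.0 / math.factorial(n)   # add the next Taylor term of e^1
    error = abs(math.e - s)

print(n, error)                # the loop stops after about 17 terms
```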
Exercise 11. Taking into account that \[\cos x=\sum_{n=0}^{\infty}\frac{(-1)^{n}x^{2n}}{(2n)!}\] write a code (exercise2_11.m) to compute the number of terms we need to get \(\cos 30^o\) with the lowest possible error (recall that arguments of trigonometric functions must be given in radians).