In the **decimal system** the number $105.8125$ means:

$$105.8125=1\cdot 10^2+0\cdot 10^1+5\cdot 10^0+8\cdot 10^{-1}+1\cdot 10^{-2}+2\cdot 10^{-3}+5\cdot 10^{-4}$$

i.e., it is a linear combination of powers of $10$, each multiplied by one of the $10$ digits $0,1,\ldots,9$.

Computers use the **binary system** because it matches the natural way an electronic device works: a switch is either on or off. Therefore, only $0$s and $1$s need to be stored.

In the binary system numbers are represented as linear combinations of powers of $2$, multiplied by $0$s and $1$s:

$$(105.8125)_{10}=2^6+2^5+2^3+2^0+2^{-1}+2^{-2}+2^{-4}=(1101001.1101)_2 \qquad (1)$$

Conversion of the integer part is achieved by repeatedly dividing by $2$. The remainders of these divisions are the digits in base $2$, from least to most significant.

$$ \begin{array}{lrrrrrrr} \text{Quotients} & 105 & 52 & 26 & 13 & 6 & 3 & 1\\ \text{Remainders} & 1 & 0 & 0 & 1 & 0 & 1 & 1 \end{array} $$

Reading the remainders from the last to the first, $(105)_{10}$ in the binary system is $(1101001)_2$.
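The repeated-division procedure can be sketched as follows (a hypothetical implementation of the `de2bi_a` function used below; the actual `Exercise1.py` may differ):

```python
def de2bi_a(x):
    """Integer part of x as a list of binary digit characters."""
    n = int(x)
    if n == 0:
        return ['0']
    digits = []
    while n > 0:
        digits.append(str(n % 2))  # remainder of the division by 2
        n //= 2                    # the quotient feeds the next division
    digits.reverse()               # remainders come out least significant first
    return digits

print(''.join(de2bi_a(105.8125)))  # 1101001
```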

**Remark.** Python's function `bin` performs this conversion.

In [1]:

```
bin(105)
```

Out[1]:

```
'0b1101001'
```

In [2]:

```
%run Exercise1.py
x = 105.8125
binary1 = de2bi_a(x)
print ''.join(binary1) # if binary1 is stored as a list of strings
```

To convert the decimal part, we multiply by $2$, drop the integer part (which is the next binary digit) and repeat until the fractional part becomes $0$.

$$ \begin{array}{lrrrr} \text{Fractional part} & 0.8125& 0.625 & 0.25 & 0.5 \\ \text{Integer part} & 1 & 1 & 0 & 1 \end{array} $$

Write a function, `de2bi_b.py`, to convert the decimal part of a base $10$ number into a binary number. Use it for `x = 105.8125`. Limit the maximum number of fractional digits in binary to `23`. Try also with `x = 14.1`.
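A possible sketch of `de2bi_b` (an assumption of how the exercise script might look, not the authors' solution):

```python
def de2bi_b(x, max_digits=23):
    """Fractional part of x as a list of binary digit characters."""
    frac = x - int(x)
    digits = []
    while frac > 0 and len(digits) < max_digits:
        frac *= 2
        digit = int(frac)          # the integer part is the next binary digit
        digits.append(str(digit))
        frac -= digit              # keep only the fractional part
    return digits

print(''.join(de2bi_b(105.8125)))  # 1101
print(len(de2bi_b(14.1)))          # 23: the expansion of 0.1 never terminates
```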

In [3]:

```
%run Exercise2.py
x = 105.8125
binary2 = de2bi_b(x)
print ''.join(binary2)
```

And the number `105.8125` (in decimal) is written, in binary:

In [4]:

```
print ''.join(binary1 + ['.'] + binary2)
```

In [5]:

```
x = 14.1
binary3 = de2bi_a(x)
binary4 = de2bi_b(x)
print ''.join(binary3 + ['.'] + binary4)
```

Using $(1)$ for converting $(1101001.1101)_2$ to decimal base we get

In [6]:

```
1*(2**6)+1*(2**5)+0*(2**4)+1*(2**3)+0*(2**2)+0*(2**1)+1*(2**0)+1*(2**-1)+1*(2**-2)+0*(2**-3)+1*(2**-4)
```

Out[6]:

```
105.8125
```

If we have $m$ digits or memory bits then we may store $2^m$ different binary numbers.

If we only consider the positive integers then we may represent numbers between

$$ (00\ldots 00)_2=(0)_{10} \quad \mathrm{and} \quad (11\ldots 11)_2=(2^m-1)_{10}. $$

For instance, for $m=3$ bits we may represent positive integers from $0$ to $7$:

$$ \begin{array}{|c|c|} \hline Base \; 10 & Base \; 2\\ \hline \mathtt{0} & \mathtt{000}\\ \mathtt{1} & \mathtt{001}\\ \mathtt{2} & \mathtt{010}\\ \mathtt{3} & \mathtt{011}\\ \mathtt{4} & \mathtt{100}\\ \mathtt{5} & \mathtt{101}\\ \mathtt{6} & \mathtt{110}\\ \mathtt{7} & \mathtt{111}\\ \hline \end{array} $$

In *signed integers* (sign-magnitude representation), the first bit is used to store the sign: $0$ for positive and $1$ for negative. The other $m-1$ digits are used for the unsigned number and, therefore, we may write $2^{m-1}-1$ positive numbers, the same amount of negative numbers and two zeros, one with positive sign and the other with negative sign. Therefore, the range of representation is $[-2^{m-1}+1, 2^{m-1}-1]$. For instance, if $m=3$ we may represent numbers from $-3$ to $3$:

$$ \begin{array}{|c|c|} \hline Base \; 10 & Base \; 2\\ \hline \mathtt{-3} & \mathtt{111}\\ \mathtt{-2} & \mathtt{110}\\ \mathtt{-1} & \mathtt{101}\\ \mathtt{-0} & \mathtt{100}\\ \mathtt{+0} & \mathtt{000}\\ \mathtt{1} & \mathtt{001}\\ \mathtt{2} & \mathtt{010}\\ \mathtt{3} & \mathtt{011}\\ \hline \end{array} $$

**Example**

Compute how $(-80)_{10}$ is stored when using a signed (sign-magnitude) 8-bit binary representation.

*Solution*: the representation is similar to that of $(80)_{10}=(1010000)_2$, but with the first digit equal to $1$, due to the negative sign. So we have
$(-80)_{10}=(11010000)_2$.
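Sign-magnitude encoding is easy to sketch in code (a small illustrative helper, not part of the course exercises):

```python
def sign_magnitude(n, m=8):
    """m-bit sign-magnitude pattern: 1 sign bit + (m-1) magnitude bits."""
    sign = '1' if n < 0 else '0'
    return sign + format(abs(n), '0%db' % (m - 1))

print(sign_magnitude(-80))  # 11010000
print(sign_magnitude(80))   # 01010000
```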

To avoid the double representation of zero, *two's complement* defines negative integers by taking the digits of the corresponding positive integer, changing $0$s to $1$s and $1$s to $0$s, and adding $1$ to the result. In this way, the sum of a number and its opposite is always $0$.

$$ \begin{array}{|c|c|} \hline Base \; 10 & Base \; 2\\ \hline \mathtt{-4} & \mathtt{100}\\ \mathtt{-3} & \mathtt{101}\\ \mathtt{-2} & \mathtt{110}\\ \mathtt{-1} & \mathtt{111}\\ \mathtt{0} & \mathtt{000}\\ \mathtt{1} & \mathtt{001}\\ \mathtt{2} & \mathtt{010}\\ \mathtt{3} & \mathtt{011}\\ \hline \end{array} $$

**Example.** To represent $(-2)_{10}$ we start writing $(2)_{10}$ in binary form, $(010)_2$. Then, we

- invert its digits $\rightarrow (101)_2$
- add $001$ (take into account that $0+0=0, \quad 0+1=1, \quad 1+1=10$) $\rightarrow (110)_2$

So the property $(010)_2+(110)_2=(000)_2$ holds (the carry out of the leftmost bit is discarded).

In this case, the first bit of a negative number is $1$ and that of a positive number is $0$. We may represent integers in the range $[-2^{m-1},2^{m-1}-1]$.

**Example.** Compute how $(-80)_{10}$ is stored when using a two's complement 8-bit binary representation.

*Solution*: since $(80)_{10}=(01010000)_2$,

- we interchange $0$s and $1$s $\rightarrow 10101111,$
- and adding $1$ we obtain the representation $(-80)_{10}=(10110000)_2$.
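In code, the $m$-bit two's-complement pattern of $n$ is simply $n \bmod 2^m$ written in binary (an illustrative helper with a hypothetical name):

```python
def twos_complement(n, m=8):
    """m-bit two's-complement bit pattern of integer n."""
    return format(n % (1 << m), '0%db' % m)

print(twos_complement(-80))    # 10110000
print(twos_complement(2, 3))   # 010
print(twos_complement(-2, 3))  # 110
```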

**Biased representation.** Negative numbers are represented as consecutive positive values, starting from the lowest negative integer. Positive numbers follow. The representation is obtained by adding the *bias* $2^{m-1}$ to the number $x$, i.e., setting
$x_r=x+2^{m-1}\in[0,2^m-1]$.

We see that the representable range is again $[-2^{m-1},2^{m-1}-1]$.
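A sketch of the biased encoding (hypothetical helper); note that the resulting bit strings compare in the same order as the integers they represent:

```python
def biased(n, m=8):
    """m-bit biased (excess 2**(m-1)) representation of integer n."""
    return format(n + (1 << (m - 1)), '0%db' % m)

print(biased(-4, 3))  # 000
print(biased(0, 3))   # 100
print(biased(3, 3))   # 111
# Lexicographic order of the bit patterns matches numeric order:
print(biased(-2, 3) < biased(1, 3))  # True
```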

Which form is actually used for integer storage? Most machines use two's complement for integer numbers, and biased representation for the exponents of floating point numbers (which are integers).

Why? The main reason to use biased representation is its efficiency for comparing numbers. Codes compare floating point numbers very often, and this is done by first comparing their exponents. Only if they are equal, comparison of their mantissas is performed.

In Python, numerical variables may be `int`, `float`, `long` and `complex`. Boolean variables are a subtype of integers.

*Single integers* or `int` have at least 32-bit precision, normally 32 or 64 bits, depending on the computer and operating system we are using.

The maximum single integer value in our computer is given by

In [7]:

```
import sys
sys.maxint
```

Out[7]:

```
9223372036854775807
```

That is, if $m=64$ is the number of bits, the largest integer is given by $2^{m-1}-1.$

In [8]:

```
2**(64-1)-1
```

Out[8]:

```
9223372036854775807L
```

And for negative values, the minimum integer is $-2^{m-1}$

In [9]:

```
-sys.maxint-1
```

Out[9]:

```
-9223372036854775808
```

In [10]:

```
-2**(64-1)
```

Out[10]:

```
-9223372036854775808L
```

Thus, in our case single integers have 64-bit precision.

If we need to represent a larger integer, Python uses `long` integers. Their precision is only limited by the available memory. They are recognized by their last character: `L`.

In [11]:

```
sys.maxint+1
```

Out[11]:

```
9223372036854775808L
```

In [12]:

```
-sys.maxint-2
```

Out[12]:

```
-9223372036854775809L
```

For real numbers, **floating point** representation with **base** $\beta=2$ is used:

$$x=(-1)^s \cdot m \cdot \beta^{e}$$

where:

- $s$ is the **sign**: $1$ for negative numbers and $0$ for positive.
- $m$ is the **mantissa** $(a_1.a_2a_3\ldots)_\beta$, which when *normalized* has values in $1\leq m < \beta$ and a nonzero first digit. In binary base this digit can only be $1$, i.e. in a normalized mantissa we always have $a_1=1$, so it is unnecessary to store it (*hidden bit method*).
- $e$ is the **exponent**, a signed integer (biased binary representation).

Numbers are stored in words of 32 bits (single precision), 64 bits (double precision), or 128 bits (extended precision). In most computers, for Python, the default `float` precision is double precision. In this case the bits are used as follows:

- 1 bit for the sign.
- 11 bits for the exponent,
- 52 bits for the mantissa.
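We can inspect this 1/11/52 layout directly by reinterpreting the bytes of a `float` (a sketch using the standard `struct` module; not part of the course scripts):

```python
import struct

def double_bits(x):
    """Return the (sign, exponent, mantissa) bit fields of a float64."""
    word = struct.unpack('>Q', struct.pack('>d', x))[0]  # 64-bit integer view
    bits = format(word, '064b')
    return bits[0], bits[1:12], bits[12:]

# sign '0', exponent '01111111111' (1023, i.e. e = 1023 - 1023 = 0), mantissa all zeros
print(double_bits(1.0))
```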

In [13]:

```
%matplotlib inline
%run Double_precision.py
```

With $11$ bits for the signed exponent, we have room for $2^{11}=2048$ binary numbers, $0 \leq E \leq 2047$. The first number, $00000000000$, is reserved for zeros and denormalized numbers; and the last, $11111111111$, for `Inf` and `NaN`.

Thus, the exponent takes values $1 \leq E \leq 2046$ and, since the bias is $1023$, these values correspond to the exponents $-1022 \leq E - 1023 \leq 1023$. Therefore, the maximum exponent is $E_{max}=1023$, and the minimum is $E_{min}=-1022$. Dividing $1$ by the smallest normalized number, we get

$$\frac{1}{x_{min}}=\frac{1}{m\beta^{E_{min}}}=\frac{1}{m\beta^{-1022}}=\frac{1}{m}\beta^{1022}<\frac{1}{m}\beta^{1023}$$

so the maximum value is not reached, i.e. there is no overflow.

Write a script `Exercise3.py` that calculates the single precision binary representation of `105.8125`, `120.875`, `7.1` and `-1.41`, following the IEEE 754 standard. Use the scripts written in exercises 1 and 2. Consider only the case where the integer part is greater than zero (the number, in absolute value, is greater than $1$). Round the numbers using truncation.

**Note:** Remember that single precision uses `32` bits: `1` for the sign, `8` for the exponent and `23` for the mantissa. The exponent bias is `127`.
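The output of `Exercise3.py` can be checked against the bit pattern Python itself produces for a 32-bit float (a verification sketch with `struct` and an assumed helper name, independent of the exercise scripts):

```python
import struct

def single_bits(x):
    """Return the (sign, exponent, mantissa) bit fields of a float32."""
    word = struct.unpack('>I', struct.pack('>f', x))[0]  # 32-bit integer view
    bits = format(word, '032b')
    return bits[0], bits[1:9], bits[9:]

# 105.8125 = (1.1010011101)_2 * 2^6, exponent field 6 + 127 = 133 = (10000101)_2
print(single_bits(105.8125))
```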

In [14]:

```
%run Exercise3.py
```

The features of our machine's `float` type are obtained using the command `sys.float_info`.

In [15]:

```
sys.float_info
```

Out[15]:

The number of digits in the mantissa, `mant_dig`, is 53 (52 stored bits, but 53 digits because of the hidden bit). That is, Python's `float` uses 64 bits and is double precision.

But `max_exp=1024` and `min_exp=-1021` do not seem to agree with what was written above. However, if we execute `help(sys.float_info)` we obtain

| max_exp | DBL_MAX_EXP -- maximum int e such that radix**(e-1) is representable

| min_exp | DBL_MIN_EXP -- minimum int e such that radix**(e-1) is a normalized float

The largest double precision normalized number that we may store in binary representation is

$$(+1)\cdot (1.11\ldots 11) \cdot 2^{1023}$$

Since we do not have to keep the first $1$, there are 52 bits left to store the $1$s of $0.11\ldots 11$. Therefore, in base $10$ this number is

$$(1+1\cdot 2^{-1}+1\cdot 2^{-2}+1\cdot 2^{-3}+\cdots+1\cdot 2^{-52})\cdot 2^{1023}$$

Write a code, `Exercise4a.py`, to compute this sum. Perform the sum from the lowest to the largest terms (we shall see why later on). Its value should coincide with that obtained with `sys.float_info.max`. Define the variable `output` with the output value.
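A possible sketch of `Exercise4a.py` (an assumption, not the official solution), summing from the smallest term as requested:

```python
import sys

total = 0.0
for k in range(52, 0, -1):       # smallest terms first: 2^-52, ..., 2^-1
    total += 2.0 ** (-k)
output = (1.0 + total) * 2.0 ** 1023

print('%.16e' % output)
print(output == sys.float_info.max)  # True
```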

In [16]:

```
import sys
%run Exercise4a.py
print '%.16e' % output # exponential format with 16 decimal digits
```

In [17]:

```
sys.float_info.max
```

Out[17]:

```
1.7976931348623157e+308
```

We check that they are equal

In [18]:

```
print output == sys.float_info.max
```

(b) Using a similar procedure to that in Exercise 4a, write a code `Exercise4b.py` to compute the lowest representable normalized floating point number using the expression

$$(+1)\cdot (1.00\ldots 00) \cdot 2^{-1022}$$

Its value should coincide with that obtained from `sys.float_info.min`.

In [19]:

```
%run Exercise4b.py
print '%.16e' % output
```

In [20]:

```
sys.float_info.min
```

Out[20]:

```
2.2250738585072014e-308
```

**The largest consecutive integer which can be stored exactly** in binary form with floating point representation is

$$(+1)\cdot (1.11\ldots 11) \cdot 2^{52}=2^{53}-1$$

Therefore, we have $53$ precision digits in binary form. The next integer,

$$(+1)\cdot (1.00\ldots 00) \cdot 2^{53}$$

can also be stored exactly. But not the next one, since we would need an extra bit in the mantissa.

In [21]:

```
print(2**53)
```

Hence, all 15-digit integers and most 16-digit integers may be stored exactly.
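We can check the loss of exactness just above $2^{53}$ directly:

```python
n = 2 ** 53
print(float(n) == n)             # True: 2^53 is exactly representable
print(float(n + 1) == float(n))  # True: 2^53 + 1 rounds back to 2^53
print(float(n + 2) == n + 2)     # True: 2^53 + 2 = (1.0...01)_2 * 2^53 is exact
```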

Observe that, actually, the number

$$(+1)\cdot (1.11\ldots 11) \cdot 2^{62}$$

is an integer, storable, and larger than the largest consecutive integer which can be stored! However, the ten last digits of this number are necessarily zeros. Therefore, from the point of view of precision they are negligible, since they cannot be changed.

**Denormalized numbers**

What happens to numbers with exponents out of the range $[-1022,1023]$?

If the exponent is lower than $-1022$, then the number is denormalized or zero, and the exponent bits take the special value $00000000000$. The hidden bit is then $0$, instead of $1$.

In [22]:

```
x = 0.5*(2**-1023)
print(x) # denormalized number
print(1/x)
```

In [23]:

```
print(2**-1080) # Underflow
```

The smallest positive denormalized number is

$$(+1)\cdot(0.000\ldots 01)\times 2^{-1022}$$

that is, $2^{-1022-52}$

In [24]:

```
num_min = 2.**(-1022-52)
print(num_min)
```

For any smaller value, we get zero (underflow):

In [25]:

```
2.**(-1022-53) # Underflow
```

Out[25]:

```
0.0
```

If the exponent is larger than $1023$ we get an `OverflowError`.

In [26]:

```
2.**1024 # Overflow
```

Let us compute the lowest number we may add to $1$ using double precision floating point binary representation.

The number $1$, in normalized double precision floating point representation, is

$$(+1)\cdot (1.00\ldots 00) \cdot 2^{0}$$

with $52$ zeros in the mantissa. The lowest number we may add in non-normalized floating point representation is

$$\epsilon = (+1)\cdot (0.00\ldots 01) \cdot 2^{0}$$

which in base $10$ is

In [27]:

```
1*2.**(-52)
```

Out[27]:

```
2.220446049250313e-16
```

which is obtained from Python by

In [28]:

```
sys.float_info.epsilon
```

Out[28]:

```
2.220446049250313e-16
```

This value, the lowest number $\epsilon$ such that $1+\epsilon>1$, is called the *machine precision*, and it gives the precision of the floating point representation.

Since, in double precision, $\epsilon \approx 2.22\cdot 10^{-16}$, it corresponds to, approximately, $16$ decimal digits.
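This definition can be verified directly: adding $\epsilon$ to $1$ changes the result, while adding $\epsilon/2$ does not (it falls exactly on a tie and rounds back to $1$):

```python
import sys

eps = sys.float_info.epsilon
print(1.0 + eps > 1.0)       # True
print(1.0 + eps / 2 == 1.0)  # True: eps/2 is lost in the rounding
```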

Between

$$1=(+1)\cdot (1.00\ldots 00) \cdot 2^{0}$$

and

$$1+\epsilon=(+1)\cdot (1.00\ldots 01) \cdot 2^{0}$$

we cannot represent exactly (in floating point) any other real number.

To compute the lowest number that, added to $x$, gives a result different from $x$, we use the spacing `eps`:

In [29]:

```
import numpy as np
x = 10.
eps = np.spacing(x)  # gap between x and the next representable float
print(eps)
```

In [30]:

```
x = 100.
eps = np.spacing(x)
print(eps)
```

In [31]:

```
x = 1000.
eps = np.spacing(x)
print(eps)
```

`eps` increases with the absolute value of $x$. This means that the difference between two consecutive numbers exactly representable in floating point increases as we depart from zero. Therefore, the *density* of exactly representable floating point numbers is not uniform: it is larger close to zero and smaller far from it. Recall that floating point numbers are actually rational numbers.

Some operations lead to special results:

In [32]:

```
x = 1.e200
y = x*x
print(y)
```

In [33]:

```
y/y
```

Out[33]:

```
nan
```

`nan` means "not a number". In general, it arises when an invalid operation has been performed.

**Overflow**. It happens when the result of an operation is finite but not representable (larger than the largest representable number).

In [34]:

```
x = 1.e300
x**2
```

Often, there is no exact floating point representation of a real number $x$, which then lies between two consecutive representable numbers $x^-$ and $x^+$. Several rounding criteria exist:

- Round toward $0$ ("truncation").
- Round toward $+ \infty$.
- Round toward $- \infty$.
- Round to the nearest number, resolving ties away from zero.
- Round to the nearest number, resolving ties to the even one ("rounding").

The last method is the most usual.

Suppose that we are using a three digits decimal representation and we want to sum the numbers $a=123$ and $b=1.25$. The exact result is $s=124.25$ but, since only three digits are available, we shall round to, for instance, $s_r=124$. An error arises. In general, finite arithmetic operations involve errors.
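The three-digit machine can be simulated by rounding every result to three significant decimal digits (an illustrative sketch, not an exercise script):

```python
def round_sig(x, digits=3):
    """Round x to the given number of significant decimal digits."""
    return float('%.*e' % (digits - 1, x))

s = 123 + 1.25       # exact sum: 124.25
print(round_sig(s))  # 124.0: the three-digit machine stores s_r = 124
```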

**Example.** The operation $3\times 0.1 \times 100$ returns:

In [35]:

```
3*0.1*100
```

Out[35]:

```
30.000000000000004
```

The result is not the right one ($30$) due to rounding errors. These errors may propagate and, in some cases, substantially affect the result. In this case the error is due, as we already saw, to the fact that $x=0.1$ is periodic in binary base, so its representation in the computer is not exact.

**Example.** Computing

$$\sum_{i=1}^{10000} 0.1$$

gives

In [36]:

```
Sum = 0.
for i in range(1,10001):
    Sum = Sum + 0.1
print '%.16f' % Sum # decimal format with 16 decimals
```

with absolute error given by

In [37]:

```
abs(Sum-1000.)
```

Out[37]:

**Example**

If we add or subtract numbers which are very different in magnitude, we always lose accuracy due to rounding error. For instance,

In [38]:

```
a = 1.e+9
epsilon = 1.e-8
Sum = a
for i in range(1,10001):
    Sum = Sum + epsilon
print '%.16e' % Sum # exponential format with 16 decimals
```

However, the result should have been

In [39]:

```
Sum = a + 10000*epsilon
print '%.16e'% Sum
```

**Example**

Let us sum the first $N$ terms of the harmonic series

$$\sum_{n=1}^N \frac{1}{n}.$$

Using single precision (`np.float32`) would make the effect even more evident. The exact result for the first $N=10000$ terms is

In [40]:

```
import numpy as np
from scipy import special
N = 10000
Sum = special.polygamma(0,N+1)-special.polygamma(0,1)
print '%.20f'% Sum
```

Starting the sum from the term $n=1$, we get

In [41]:

```
Sum1 = 0.
for n in range(1,N+1):
    Sum1 += 1./n
print '%.20f' % Sum1
```

In [42]:

```
error1 = abs(Sum1 - Sum)
print error1
```

Starting from the last term, we get

In [43]:

```
Sum2 = 0.
for n in range(N,0,-1):
    Sum2 += 1./n
print '%.20f' % Sum2
```

Thus, the absolute error is now

In [44]:

```
error2 = abs(Sum2 - Sum)
print error2
```

**Cancellation error.** It arises in the subtraction of large numbers of similar value. For instance,

$$\sqrt{x^2+\epsilon^2}-x$$

In [45]:

```
import numpy as np
x = 1000000.
ep = 0.0001  # ep plays the role of epsilon**2
a = np.sqrt(x**2 + ep) - x
print a
```

In this case, cancellation error may be avoided by using an equivalent mathematical expression

$$\sqrt{x^2+\epsilon^2}-x=\frac{(\sqrt{x^2+\epsilon^2}-x)(\sqrt{x^2+\epsilon^2}+x)}{\sqrt{x^2+\epsilon^2}+x}=\frac{\epsilon^2}{\sqrt{x^2+\epsilon^2}+x}\approx \frac{\epsilon^2}{\sqrt{x^2}+x}=\frac{\epsilon^2}{2x} $$

In [46]:

```
b = ep / (np.sqrt(x**2 + ep) + x)
print b
```

The relative error with the first expression is

In [47]:

```
Er = abs((a - b)/b)
print Er
```

(a) When solving $ax^2+bx+c=0$, with $a=1$, $b=10^8$ and $c=1$, using the formula

$$x_1=\frac{-b+\sqrt{b^2-4ac}}{2a}$$

we obtain

In [48]:

```
a = 1.; b = 10.**8; c = 1.
x1 = (-b + np.sqrt(b**2 - 4*a*c)) / (2*a)
print x1
```

But after substituting back into the polynomial we find that the residual is far from zero:

In [49]:

```
residual1 = a*x1**2 + b*x1 + c
print residual1
```

Write a script, `Exercise5a.py`, that computes this root with an equivalent expression that avoids the cancellation, together with its residual `residual2`.

In [50]:

```
%run Exercise5a.py
print x2
print residual2
```

(b) When solving the same equation, now with $b=-10^8$, with the formula

$$x_2=\frac{-b-\sqrt{b^2-4ac}}{2a}$$

we obtain

In [51]:

```
a = 1.; b = -10.**8; c = 1
x3 = (- b - np.sqrt(b**2 - 4*a*c)) / (2*a)
print x3
```

but if we substitute this value in the equation

In [52]:

```
residual3 = a*x3**2 + b*x3 + c
print residual3
```

which, again, reveals a big relative error.

Write a script, `Exercise5b.py`, that computes this root with an equivalent expression that avoids the cancellation, together with its residual `residual4`.

In [53]:

```
%run Exercise5b.py
print x4
print residual4
```
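One standard cancellation-free approach (a sketch of what the Exercise5 scripts might do, assuming they exploit the relation $x_1 x_2 = c/a$; the official solutions may differ): compute the large-magnitude root with the sign choice that avoids cancellation, then obtain the small root from the product of roots.

```python
import math

def stable_roots(a, b, c):
    """Roots of a*x^2 + b*x + c = 0 avoiding subtractive cancellation."""
    disc = math.sqrt(b * b - 4 * a * c)
    q = -(b + math.copysign(disc, b)) / 2.0  # no cancellation in b + sign(b)*disc
    return q / a, c / q                      # small root from x1 * x2 = c/a

x1, x2 = stable_roots(1.0, 1e8, 1.0)
print(x2)                                 # close to -1e-8
print(abs(1.0 * x2**2 + 1e8 * x2 + 1.0)) # tiny residual
```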

**Example**

Given

$$e^x=\sum_{n=0}^{\infty}\frac{x^n}{n!},$$

let us approximate the value of the number $e$, using $x = 1$, with a finite number of terms of the sum, and see how the error evolves.

In [1]:

```
import math
import numpy as np

suma = 0.
x = 1.
for n in range(100):
    suma += x**n / math.factorial(n)
    if n in np.array([5,10,15,20,40,60,80,100])-1:
        error = abs(np.exp(1.) - suma)
        print "Number of terms %i" % (n+1)
        print "Error %.16e" % error
        print ""
```

The lowest error we can hope for is the spacing around $e$:

In [54]:

```
Ea = np.spacing(np.exp(1.))
print Ea
```

Let's compute the number of terms we need to get $e^1$ with the lowest possible error.

In [55]:

```
import math
Ea = np.spacing(np.exp(1.))
n = 0
Sum = 1.
itermax = 100
error = np.abs(np.exp(1.) - Sum)
while (error > Ea and n < itermax):
    n += 1
    Sum += 1. / math.factorial(n)
    error = np.abs(np.exp(1.) - Sum)
print "Number of iterations %i" % n
print "Error %.16e" % error
print "Ea %.16e" % Ea
```

So after $17$ iterations the truncation error is negligible at machine precision.

In [2]:

```
%run Exercise6.py
print "Number of iterations %i" % n
print "Error %.16e" % error
print "Spacing around ln(1+x) %.16e" % Ea
```