*Machine Learning* applies automatic data-driven learning methods to obtain accurate predictions from observations with previous data.

Data automated classification is one of the aims of machine learning. We can consider three types of classification algorithms:

**Supervised classification:**We have a dataset (i.e., handwritten digit images) that we will call*training data*where each datum is associated with a*label*(that tell us which digit, 0, 1,..,9, corresponds to that image). In the training stage, we build a model with this dataset (training dataset) using the labels, that helps us to assess the correct or incorrect classification of an image while building the model. Once we have the model, we can use it to classify new data. At this stage we do not need the labels, unless we want to assess the accuracy of the classification.**Unsupervised classification:**The dataset comes without labels (or we will not use them) and the data is classified using their inner structure (properties, characteristics).**Semi-supervised classification:**we can apply it when some data comes with labels, but not all of them. This is typical when our data consist of images: we have access to many images but they are mostly untagged. These algorithms can be considered a variant of the supervised classification with a strategy to overcome the lack of labels for part of the data.

In this lab we will see some application examples of the unsupervised classification algorithm *k-means* for image classification and processing.

*K-means* is an unsupervised classification algorithm, also called clusterization, that groups objects into *k* groups based on their characteristics. The grouping is done minimizing the sum of the distances between each object and the group or cluster centroid. The distance usually used is the quadratic or euclidean distance.

The algorithm has three steps:

**Initialization:**once the number of groups,*k*has been chosen,*k*centroids are established in the data space, for instance, choosing them randomly.**Assignment of objects to the centroids:**each object of the data is assigned to its nearest centroid.**Centroids update:**The position of the centroid of each group is updated taking as the new centroid the average position of the objects belonging to said group.

Repeat steps 2 and 3 until the centroids do not move, or move below a threshold distance in each step.

The *k-means* algorithm solves an **optimization problem** and the function to be optimized (minimized) is the sum of the quadratic distances from each object to its cluster centroid.

The objects are represented with $d$ dimension vectors $\left(\mathbf{x}_{1},\mathbf{x}_{2},\ldots,\mathbf{x}_{n}\right)$
and the algorithm *k-means* builds $k$ groups where the sum of the distances of the objects to its centroid is minimized within each group $\mathbf{S}=\left\{ S_{1},S_{2},\ldots,S_{k}\right\}$. The problem can be formulated:

$$
\underset{\mathbf{S}}{\mathrm{min}}\; E\left(\boldsymbol{\mu_{i}}\right)=\underset{\mathbf{S}}{\mathrm{min}}\sum_{i=1}^{k}\sum_{\mathbf{x}_{j}\in S_i}\left\Vert \mathbf{x}_{j}-\boldsymbol{\mu}_{i}\right\Vert ^{2}
\quad (1)$$

where $\mathbf{S}$ is the dataset whose elements are the objects $\mathbf{x}_{j}$ represented by vectors, where each of its elements represents a characteristic or attribute. We will have $k$ groups or clusters with their corresponding centroid $\boldsymbol{\mu_{i}}$.

In each centroid update, from the mathematical point of view, we impose the extreme (minimum, in this case) necessary condition to the function $E\left(\boldsymbol{\mu_{i}}\right)$ that, for this quadratic function $(1)$ is

$$
\frac{\partial E}{\partial\boldsymbol{\mu}_{i}}=0\;\Longrightarrow\;\boldsymbol{\mu}_{i}^{(t+1)}=\frac{1}{\left|S_{i}^{(t)}\right|}\sum_{\mathbf{x}_{j}\in S_{i}^{(t)}}\mathbf{x}_{j}
$$

and the solution is to take each group element average as a new centroid. We have used the gradient descent method.

The main advantages of the *k-means* method are that it is simple and fast. But it is necessary to decide the value of $k$ and the final result depends on the initialization of the centroids. Also, it does not necessarily converge to the global minimum but to a local minimum.

Write a script that uses the *k-means* method to classify two dimensional data. Classify the data generated below in three groups.

We generate 2D random data for three clusters with the function

```
def generate_data():
np.random.seed(7)
x1 = np.random.standard_normal((100,2))*0.6+np.ones((100,2))
x2 = np.random.standard_normal((100,2))*0.5-np.ones((100,2))
x3 = np.random.standard_normal((100,2))*0.4-2*np.ones((100,2))+5
X = np.concatenate((x1,x2,x3),axis=0)
return X
```

We generate the $k=3$ initial centroids in $[0,1]\times [0,1]$ with the function

```
def generate_centroids(X,k):
cx = np.random.rand(k)
cy = np.random.rand(k)
centroids = np.zeros((k,2))
centroids[:,0] = cx
centroids[:,1] = cy
return centroids
```

If we want to scale a value $x$ from $[0,1]$ to $[a,b]$

$$x_s = a + (b-a)\,x$$**(a)** Modify the previous function in such a way that the ** cx** values are between the maximum and minimum value of the first coordinate in

`X`

`cy`

`X`

**Note** ** np.min(v)** and

`np.max(v)`

`v`

**(b)** Create a function ** assign_centroid(x,centroids)** that given a point of the data (a row of

`X`

`l`

**(c)** Create a function ** reallocate_centroids(X,labels,centroids)** that recalculates the centroids as average values of the

`X`

**(d)** Create a funcion ** kmeans(k)** that, calling the adequate functions:

- Generates the data.
- Generates the centroids.
- Initializes the one-dimensional numpy array
with the length the number of points in`labels`

with 9.`X`

- Plots the data with the initial centroids.
- Assigns centroids to the points and plot the clusters.
- Reallocates the centroids and plot them with the clusters.

Repeat steps 5 and 6 eight times

Plot the data and the centroids using the following function.

```
def plot_data(X,labels,centroids,s):
plt.figure()
plt.plot(X[labels==9,0],X[labels==9,1],'k.')
plt.plot(X[labels==0,0],X[labels==0,1],'r.', label='cluster 1')
plt.plot(X[labels==1,0],X[labels==1,1],'b.', label='cluster 2')
plt.plot(X[labels==2,0],X[labels==2,1],'g.', label='cluster 3')
plt.plot(centroids[:,0],centroids[:,1],'mo',markersize=8, label='centroids')
plt.legend()
plt.title(s)
plt.show()
```

In [1]:

```
%run Exercise1.py
```

In [2]:

```
import numpy as np
import matplotlib.pyplot as plt
```

We will use `KMeans`

from `sklearn`

library in the other exercises.

In [3]:

```
from sklearn.cluster import KMeans
```

We group the points with *k-means* and $k=3$

In [28]:

```
n = 3
X = generate_data()
k_means = KMeans(n_clusters=n)
model = k_means.fit(X)
```

The result is three ** centroids** around which the points are grouped and each point

`label`

In [5]:

```
centroids = k_means.cluster_centers_
labels= k_means.labels_
```

Now we draw the points and the centroids, using a different color for the points of each cluster.

In [6]:

```
plt.figure()
plt.plot(X[labels==0,0],X[labels==0,1],'r.', label='cluster 1')
plt.plot(X[labels==1,0],X[labels==1,1],'b.', label='cluster 2')
plt.plot(X[labels==2,0],X[labels==2,1],'g.', label='cluster 3')
plt.plot(centroids[:,0],centroids[:,1],'mo',markersize=8, label='centroids')
plt.legend(loc='best')
plt.show()
```

Let us classify digits of the database contained in `sklearn`

library of python using the *k-means* algorithm.

We import the usual libraries

In [7]:

```
import numpy as np
import matplotlib.pyplot as plt
```

and the library that contains the function *k-means*

In [8]:

```
from sklearn.cluster import KMeans
```

We import the library that contains our dataset

In [9]:

```
from sklearn.datasets import load_digits
```

We load the digit images

In [10]:

```
digits = load_digits()
data = digits.data
```

In [11]:

```
print(data.shape)
```

Pixels from the square image of $8\times 8$ pixels have been reshped in a row of $ 64 $ elements. Therefore, each row is an object or data. The characteristics or properties of each object are the gray intensities of each pixel. That is, we have, for each image, $64$ properties.

To improve the visualization, we invert the colors

In [12]:

```
data = 255-data
```

We fix the seed to obtain the initial centroids, so the results obtained here are repeatable.

In [13]:

```
np.random.seed(1)
```

Since we have 10 different digits (from 0 to 9) we choose to group the images in $10$ clusters

In [14]:

```
n = 10
```

We classify the data with *k-means*

In [15]:

```
kmeans = KMeans(n_clusters=n,init='random')
kmeans.fit(data)
Z = kmeans.predict(data)
```

We plot the resulting clusters

In [16]:

```
for i in range(0,n):
row = np.where(Z==i)[0] # row in Z for elements of cluster i
num = row.shape[0] # number of elements for each cluster
r = int(np.floor(num/10.)) # number of rows in the figure of the cluster
print("cluster "+str(i))
print(str(num)+" elements")
plt.figure(figsize=(10,10))
for k in range(0, num):
plt.subplot(r+1, 10, k+1)
image = data[row[k], ]
image = image.reshape(8, 8)
plt.imshow(image, cmap='gray')
plt.axis('off')
plt.show()
```