
Image classification with the ELM method


Introduction

The aim of this lab is pattern recognition; in our case, the recognition of hand-written digits and characters.

To do this, we use two sets of given data: training data and testing data. Both sets contain images of digits together with the correct answer, i.e. the digit written in each image (the labels).

First, we construct a model from the training data, using a machine learning method. Afterwards, we check the quality of this model by applying it to the testing data.

We shall proceed by steps:

  1. We solve a minimization problem, inspired by least squares, to choose the best linear representation of our training data.
  2. With the model obtained in Step 1, we produce a prediction for the testing data set.
  3. Finally, we give an interpretation of these predictions (images) replacing them by the corresponding numbers (labels) that the model assigns to each image.

We shall use the Extreme Learning Machine (ELM) method for image classification. This method minimizes two quantities:

  • the difference between the prediction and the correct answer, for the training data, and
  • the norm of the solution.

Thus, we set the following problem for the training data:

$$ \min_{\boldsymbol{\beta}}\; L_{ELM}=\frac{1}{2}\Vert\boldsymbol{\beta}\Vert^{2}+\frac{C}{2}\sum_{i=1}^{N}\Vert\boldsymbol{\xi}_{i}\Vert^{2} =\frac{1}{2}\Vert\boldsymbol{\beta}\Vert^{2}+\frac{C}{2}\Vert\mathbf{Y}-\mathbf{H}\boldsymbol{\beta}\Vert^{2}. $$

Here, $\Vert\cdot\Vert$ denotes the Frobenius norm of a matrix, $N$ is the number of training samples, $C$ is a regularizing parameter, and

  • $\boldsymbol{\beta}$ is the matrix of coefficients,
  • $\boldsymbol{\xi}_{i}$ are the training errors for each sample,
  • $\mathbf{H}$ is the matrix containing the training data,
  • $\mathbf{H}\boldsymbol{\beta}$ is the prediction of the model,
  • $\mathbf{Y}$ is the matrix of labels for the training data.

Classification with a linear kernel

Since the minimization problem is quadratic, its solution is given by $$ \overset{\star}{\boldsymbol{\beta}}=\mathbf{H}^{T}\mathbf{W}, $$ where the weights $\mathbf{W}$ are defined as $$ \mathbf{W}=\left(\frac{\mathbf{I}}{C}+\mathbf{\Omega}\right)^{-1}\mathbf{Y}, $$ with $\mathbf{\Omega}=\mathbf{H}\mathbf{H}^{T}$ the linear kernel matrix. In expanded form, the elements of $\mathbf{\Omega}$ are written as $$ \mathbf{\Omega}_{ij}=\mathbf{x}_i \centerdot \mathbf{x}_j = \sum_{k=1}^{d} x_{ik} x_{jk}, $$ where $\mathbf{x}_i$ denotes the $i$-th row of $\mathbf{H}$ (the $i$-th training sample) and $d$ is the number of components (pixels) of each sample. We shall see other (nonlinear) choices for the kernel matrix later.
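
For completeness, here is a brief sketch of where this formula comes from. Setting the gradient of $L_{ELM}$ with respect to $\boldsymbol{\beta}$ to zero gives

$$ \boldsymbol{\beta}-C\,\mathbf{H}^{T}\left(\mathbf{Y}-\mathbf{H}\boldsymbol{\beta}\right)=\mathbf{0} \quad\Longleftrightarrow\quad \left(\frac{\mathbf{I}}{C}+\mathbf{H}^{T}\mathbf{H}\right)\boldsymbol{\beta}=\mathbf{H}^{T}\mathbf{Y}, $$

and the matrix identity $\mathbf{H}^{T}\left(\frac{\mathbf{I}}{C}+\mathbf{H}\mathbf{H}^{T}\right)^{-1}=\left(\frac{\mathbf{I}}{C}+\mathbf{H}^{T}\mathbf{H}\right)^{-1}\mathbf{H}^{T}$ then yields $\overset{\star}{\boldsymbol{\beta}}=\mathbf{H}^{T}\left(\frac{\mathbf{I}}{C}+\mathbf{\Omega}\right)^{-1}\mathbf{Y}=\mathbf{H}^{T}\mathbf{W}$.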

The prediction for the testing samples is then given by $$ \mathbf{Y}_{te}=\mathbf{H}_{te} \overset{\star}{\boldsymbol{\beta}} = \mathbf{H}_{te}\mathbf{H}^{T}\mathbf{W}=\mathbf{\Omega}_{te}\mathbf{W}, $$

with

$$ \mathbf{\Omega}_{te}= \mathbf{H}_{te}\mathbf{H}^{T}. $$
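
As a quick numerical sanity check of the two equivalent expressions for $\overset{\star}{\boldsymbol{\beta}}$ used above, here is a minimal sketch on small random matrices (an illustration only, not the lab data):

import numpy as np

rng = np.random.default_rng(0)
H_small = rng.standard_normal((5, 3))    # 5 training samples, 3 features
Y_small = rng.standard_normal((5, 2))    # 2 classes
C = 1.0

# Dual (kernel) form: beta = H^T (I/C + H H^T)^{-1} Y
beta_dual = H_small.T @ np.linalg.solve(np.eye(5)/C + H_small @ H_small.T, Y_small)

# Primal form: (I/C + H^T H) beta = H^T Y
beta_primal = np.linalg.solve(np.eye(3)/C + H_small.T @ H_small,
                              H_small.T @ Y_small)

print(np.allclose(beta_dual, beta_primal))   # True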

In our application, each sample is a $28\times28$ image of a hand-written digit. In the training matrix, each row corresponds to one image, flattened to a vector of $784$ entries.

We have randomly chosen, from the MNIST data set, $900$ images for training and $100$ images for testing. They are saved, together with their corresponding labels, in the file data_numbers.zip.

In [2]:
import numpy as np
import matplotlib.pyplot as plt

We read the data

In [3]:
data_train = np.loadtxt('data_train1.txt')
labels_train = np.loadtxt('labels_train1.txt')

data_test = np.loadtxt('data_test1.txt')
labels_test = np.loadtxt('labels_test1.txt')

Let's have a look at the first $100$ training images. To display them, we reshape each row back to size $28\times28$.

In [4]:
plt.figure(figsize=(8,8))
for k in range(0, 100):
    plt.subplot(10, 10, k+1)
    image = data_train[k, ]
    image = image.reshape(28, 28)
    plt.imshow(image, cmap='gray')
    plt.axis('off')
plt.show()

We define some parameters

In [5]:
C = 1                            # regularization
number_classes = 10              # classes of digits

n = data_train.shape[0]          # number of rows (images) in training matrix
m = data_test.shape[0]           # number of rows (images) in testing matrix  
I = np.identity(n)

We convert the data from integers to floats and normalize both the training data and the test data to $[0,1]$:

In [7]:
H = data_train/255.   #the images are in [0,255]
H_te = data_test/255.

Now we define the label matrix from the labels contained in labels_train, which is a vector with the correct labels of the training samples. This matrix has $10$ columns (the number of classes) and $900$ rows (the training samples); for the testing data we will simply compare the predicted labels with the vector labels_test.

For sample $k$ we fill the $k$-th row with zeros, except in the column corresponding to the correct class (first column for label $0$, second column for label $1$, etc.), where we write a $1$:

In [8]:
Y = np.zeros((n, number_classes))
for i in range(0, n):
    Y[i, int(labels_train[i])] = 1
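
As a side note, the same one-hot matrix can be built without an explicit loop; a minimal equivalent sketch (assuming, as above, that labels_train contains integer values between $0$ and $9$):

Y_alt = np.eye(number_classes)[labels_train.astype(int)]   # one row of the identity per sample
print(np.array_equal(Y, Y_alt))   # True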

We then compute the prediction given by the model,

$$ \mathbf{Y}_{te} = \mathbf{H}_{te}\mathbf{H}^{T}\mathbf{W} = \mathbf{\Omega}_{te}\mathbf{W},\quad \mathrm{with}\quad \mathbf{\Omega}_{te}=\mathbf{H}_{te}\mathbf{H}^{T}, $$

for which we need to compute $\mathbf{W}$, given as the solution to

$$ \left(\frac{\mathbf{I}}{C}+\mathbf{\Omega}\right)\mathbf{W}=\mathbf{Y}\quad \mathrm{with}\quad \mathbf{\Omega}=\mathbf{H}\mathbf{H}^{T}. $$
In [9]:
Omega = np.dot(H, H.transpose())
W = np.linalg.solve(I/C + Omega, Y)

Therefore

In [10]:
Omega_te = np.dot( H_te, H.transpose())
Y_te = np.dot(Omega_te, W)

The label predicted for each testing image is obtained as the position of the maximum value in the corresponding row of Y_te:

In [11]:
predicted_test  = Y_te.argmax(axis=1)

We now check the success percentage:

In [12]:
ttsp = np.sum(predicted_test == labels_test)/float(m)*100.
print('Testing success = %.1f%%' % ttsp)
Testing success = 76.0%

To look at the results in more detail, we compute the confusion matrix.

In [13]:
from sklearn.metrics import confusion_matrix

mc = confusion_matrix(labels_test, predicted_test)

print('Confusion matrix')
print(mc)

plt.figure(figsize=(6,6))
ticks = range(10)
plt.xticks(ticks)
plt.yticks(ticks)
plt.imshow(mc,cmap=plt.cm.Blues)
plt.colorbar(shrink=0.8)
w, h = mc.shape
for i in range(w):
    for j in range(h):
        plt.annotate(str(mc[i][j]), xy=(j, i), 
                    horizontalalignment='center',
                    verticalalignment='center')
plt.xlabel('Predicted label')
plt.ylabel('Actual label')
plt.title('Confusion matrix')
plt.show()
Confusion matrix
[[ 9  0  0  0  0  0  0  0  0  1]
 [ 0 10  0  0  0  0  0  0  0  0]
 [ 0  1  5  1  0  0  2  1  0  0]
 [ 1  0  1  7  0  0  0  0  0  1]
 [ 0  0  0  0  9  0  0  0  1  0]
 [ 0  0  0  0  0  8  0  0  2  0]
 [ 1  0  1  0  0  0  7  0  0  1]
 [ 1  0  0  0  1  0  0  8  0  0]
 [ 0  0  0  2  0  2  0  0  6  0]
 [ 0  0  0  0  2  0  0  1  0  7]]

Let us see the corresponding images:

In [14]:
print('\nLabels predicted for testing samples')
for i in range(0, m):
    if i % 10 == 9:
        print(predicted_test[i])
    else:
        print(predicted_test[i], end=" ")

print('\n')
print('Images corresponding to the above labels')
for k in range(0, 100):
    plt.subplot(10, 10, k+1)
    image = data_test[k, ]
    image = image.reshape(28, 28)
    plt.imshow(image, cmap=plt.cm.gray)
    plt.axis('off')
plt.show()
Labels predicted for testing samples
2 7 1 0 9 9 5 7 1 9
9 5 9 4 3 4 5 7 0 3
1 8 4 3 1 3 5 2 7 7
0 3 5 5 5 8 3 0 4 4
9 7 9 2 4 2 2 9 8 7
6 0 5 3 1 3 7 9 4 6
4 4 6 8 6 1 0 0 8 9
0 6 7 1 0 4 6 0 1 2
6 3 6 1 3 5 7 4 2 8
5 1 1 8 6 8 0 4 8 0


Images corresponding to the above labels

Classification with nonlinear kernels

To compute the kernel matrices $\Omega$ and $\Omega_{te}$ we used the dot product between rows of the training and testing data matrices $H$ and $H_{te}$. In machine learning, this is known as a linear kernel. Generically,

$$ K_{Lin}(\mathbf{x}_i,\mathbf{x}_j)=\mathbf{x}_i \centerdot \mathbf{x}_j = \sum_{k=1}^{d} x_{ik} x_{jk}. $$

However, we may use a large class of kernel functions (satisfying some suitable properties), among which the most usual are the Gaussian kernel, also called RBF (radial basis function) kernel, given by

$$ K_{RBF}(\mathbf{x}_i,\mathbf{x}_j)=\mathrm{exp}\left(-\dfrac{||\mathbf{x}_i-\mathbf{x}_j||^{2}}{\sigma}\right), $$

or the polynomial kernel

$$ K_{Poly}(\mathbf{x}_i,\mathbf{x}_j)=(\mathbf{x}_i\centerdot\mathbf{x}_j+a)^{b}. $$

In these nonlinear kernels we are introducing parameters that must be fixed in advance: $a$ and $b$ for the polynomial kernel and $\sigma$ for the RBF kernel.
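
To illustrate how such kernels translate into code, here is a possible vectorized sketch of the RBF kernel (an illustration only; the polynomial kernel is implemented below as a one-line lambda, and a loop-based RBF version is given in Exercise 1):

def kernel_rbf(X, Z, sigma):
    # Gaussian (RBF) kernel matrix between the rows of X and the rows of Z,
    # using ||x - z||^2 = ||x||^2 + ||z||^2 - 2 x.z to avoid explicit loops
    sq_dists = (np.sum(X**2, axis=1)[:, None]
                + np.sum(Z**2, axis=1)[None, :]
                - 2.0*np.dot(X, Z.T))
    return np.exp(-np.maximum(sq_dists, 0.0)/sigma)   # clamp tiny negatives from round-off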

Let's see if the previous result using the linear kernel may be improved. For instance, using the polynomial kernel,

In [20]:
C = 1    # Regularization
a = 1    # polynomial kernel
b = 3 

We have already constructed the matrices $\mathbf{H}$, $\mathbf{H}_{te}$ and $\mathbf{Y}$.

In [22]:
# Polynomial kernel function

KernelPoly = lambda X, Z: (np.dot(X, Z.T) + a)**b   # second argument named Z to avoid confusion with the label matrix Y

# Omega's 

OmegaP = KernelPoly(H, H)  
W = np.linalg.solve(I/C + OmegaP, Y)

OmegaP_te = KernelPoly(H_te, H)
YP_te = np.dot(OmegaP_te, W)

# prediction

predictedP_test = YP_te.argmax(axis=1)

# success percentage

percent = np.sum(predictedP_test == labels_test)/float(m)*100.
print('Testing success = %.1f%%' % percent)

# confusion matrix
mc = confusion_matrix(labels_test, predictedP_test)

print('Confusion matrix')
print(mc)

plt.figure(figsize=(6,6))
ticks = range(10)
plt.xticks(ticks)
plt.yticks(ticks)
plt.imshow(mc,cmap=plt.cm.Blues)
plt.colorbar(shrink=0.8)
w, h = mc.shape
for i in range(w):
    for j in range(h):
        plt.annotate(str(mc[i][j]), xy=(j, i), 
                    horizontalalignment='center',
                    verticalalignment='center')
plt.xlabel('Predicted label')
plt.ylabel('Actual label')
plt.title('Confusion matrix')
plt.show()

# Viewing results
print('\r')
print('Labels predicted for testing samples')
for i in range(0, m):
    if i % 10 == 9:
        print(predictedP_test[i])
    else:
        print(predictedP_test[i], end=" ")

print('\n')
print('Images corresponding to the above labels')
for k in range(0, 100):
    plt.subplot(10, 10, k+1)
    image = data_test[k, ]
    image = image.reshape(28, 28)
    plt.imshow(image, cmap=plt.cm.gray)
    plt.axis('off')
plt.show()
Testing success = 93.0%
Confusion matrix
[[10  0  0  0  0  0  0  0  0  0]
 [ 0 10  0  0  0  0  0  0  0  0]
 [ 0  0  8  1  0  0  0  0  1  0]
 [ 0  0  0  9  0  1  0  0  0  0]
 [ 0  0  0  0  9  0  1  0  0  0]
 [ 0  0  0  0  0 10  0  0  0  0]
 [ 0  0  0  0  1  0  9  0  0  0]
 [ 0  0  0  0  0  0  0 10  0  0]
 [ 0  0  0  0  0  0  0  0  9  1]
 [ 0  0  0  0  0  0  0  1  0  9]]
Labels predicted for testing samples
2 7 1 0 9 9 5 7 1 9
3 8 0 7 5 7 5 7 0 3
1 5 4 3 1 3 8 2 7 7
0 3 5 5 5 5 3 0 4 4
9 9 9 6 4 3 2 4 8 7
6 0 5 8 1 3 8 9 4 2
4 4 6 9 6 1 3 0 8 9
0 6 7 2 0 6 6 0 1 2
6 8 6 1 3 5 7 9 2 8
5 1 1 4 2 8 6 4 8 7


Images corresponding to the above labels

Exercises


Exercise 1

Using your function with the RBF kernel

def KernelRBF(X,Y,g):
    """ 
    Computes the kernel matrix
    for the gaussian kernel
    """

    m = X.shape[0]
    n = Y.shape[0]
    K = np.zeros((m,n))

    for i in range(m):
        for j in range(n):
            dif = np.linalg.norm(X[i,:]-Y[j,:])
            K[i,j] = np.exp(-dif**2/g)

    return K

choose the best parameters $(C,\sigma)$, in the sense of maximum success percentage, where $(C,\sigma)$ may take the following values

In [23]:
C_list = [ 1., 10., 100., 1000.]
sigma_list = [ 1., 10., 100., 1000.]

Produce a table with the success percentages, like the one below (running may take several minutes), and give the predicted labels.
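
One possible structure for the grid search (a sketch only; it assumes the KernelRBF function above and the objects H, H_te, Y, I, m and labels_test already defined in this notebook):

best_rate, best_C, best_sigma = 0.0, None, None
for C in C_list:
    for sigma in sigma_list:
        # Train with the RBF kernel: solve (I/C + Omega) W = Y
        Omega_rbf = KernelRBF(H, H, sigma)
        W_rbf = np.linalg.solve(I/C + Omega_rbf, Y)
        # Predict on the test set and measure the success percentage
        predicted = np.dot(KernelRBF(H_te, H, sigma), W_rbf).argmax(axis=1)
        rate = np.sum(predicted == labels_test)/float(m)*100.
        print('C = ', C, ' sigma = ', sigma)
        print('Success rate = %.1f%%' % rate)
        if rate > best_rate:
            best_rate, best_C, best_sigma = rate, C, sigma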

In [28]:
%run Exercise1
C =  1.0  sigma =  1.0
Success rate = 89.0%
C =  1.0  sigma =  10.0
Success rate = 94.0%
C =  1.0  sigma =  100.0
Success rate = 95.0%
C =  1.0  sigma =  1000.0
Success rate = 85.0%
C =  10.0  sigma =  1.0
Success rate = 89.0%
C =  10.0  sigma =  10.0
Success rate = 94.0%
C =  10.0  sigma =  100.0
Success rate = 96.0%
C =  10.0  sigma =  1000.0
Success rate = 93.0%
C =  100.0  sigma =  1.0
Success rate = 89.0%
C =  100.0  sigma =  10.0
Success rate = 94.0%
C =  100.0  sigma =  100.0
Success rate = 97.0%
C =  100.0  sigma =  1000.0
Success rate = 92.0%
C =  1000.0  sigma =  1.0
Success rate = 89.0%
C =  1000.0  sigma =  10.0
Success rate = 94.0%
C =  1000.0  sigma =  100.0
Success rate = 96.0%
C =  1000.0  sigma =  1000.0
Success rate = 92.0%

Success rate =  97.0 %, C =  100.0 , sigma =  100.0
Labels predicted for testing samples
2 7 1 0 9 9 5 7 1 9
3 8 0 7 3 7 5 7 0 3
1 5 4 3 1 3 8 2 7 7
0 3 5 5 5 5 3 0 4 4
9 9 9 6 4 3 2 6 8 7
6 0 5 8 1 2 8 9 4 2
4 4 6 8 6 1 3 0 8 9
0 6 7 2 0 6 6 0 1 2
6 8 6 1 3 5 7 9 2 8
5 1 1 4 2 8 6 4 8 7

Exercise 2

The file data_char.txt contains images of size $20\times16$ of hand-written characters, each of them stored as a row. The file labels_char.txt contains the corresponding labels ($26$ labels, from $0$ to $25$, one for each letter of the alphabet).

Use $90\%$ of the samples for training and $10\%$ for testing with an RBF kernel, taking $\sigma = 100$ and $C=1$. Plot the testing images together with the predicted labels.
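
A possible sketch of the data loading and the $90\%/10\%$ split (the file names follow the statement above; the random seed is arbitrary):

data_char = np.loadtxt('data_char.txt')
labels_char = np.loadtxt('labels_char.txt')

rng = np.random.default_rng(0)                      # arbitrary seed
perm = rng.permutation(data_char.shape[0])          # shuffle the sample indices
n_train = int(0.9*data_char.shape[0])               # 90% for training

data_train_c, labels_train_c = data_char[perm[:n_train]], labels_char[perm[:n_train]]
data_test_c,  labels_test_c  = data_char[perm[n_train:]], labels_char[perm[n_train:]]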

In [29]:
%run Exercise2
Success rate = 80.0%
Labels predicted for testing samples
P Q Y W F Y B L M Q
J W V T V U L I L K
F M X K E N K U I X
U Q R A D A C B X Y
P U E K E W G E E V
V W E J X B Q A Z U
O U J O F Y Z E S Z
A N N L P H W X H Y
T P V F T K J V A M
G M G N W Z M E D X

