
Local minimum. Unconstrained optimization. Numerical methods

Approximate, with one iteration, the minimum of the function $f\left(x,y\right)=x^{2}+3y^{2}$, starting from the initial guess $(2,1)$ and using

  1. The gradient method:
    • With the steepest descent.
    • With learning rate 0.1.
  2. Newton's method.
  3. The golden section search to optimize the step of the first iteration of the steepest descent method.

The gradient method

Gradient methods are a family of methods based on using the gradient of the function at each point.


Let us look at some examples.

Steepest descent

Let's find the minimum of a function. Consider a function of two variables, so that we can represent it as a surface.

  • We start at an initial guess point $(x_0, y_0).$
  • We look for the direction of the gradient.
  • We move in the direction opposite to the gradient, since the gradient points in the direction of steepest ascent of the function (upwards) and we are looking for a minimum (downwards).
  • We move along that straight line, walking downhill.
  • As soon as we find the minimum along our straight path (when we start to climb again), we stop. We will be at $(x_1, y_1).$
  • We compute the direction of the gradient again and repeat the previous steps until a stopping condition is met (for example, a fixed number $n$ of iterations or a small enough gradient).

This process is illustrated in the following graphics:

  • In the first graph we see a surface whose minimum is at $(0,0).$ It is a paraboloid.
  • In the second graph we see the gradients at some points on the surface. We can observe that for each point
    • They are oriented in the direction of maximum growth of the function.
    • The modulus depends on how quickly the function grows.
  • The third graph illustrates the Steepest Descent method starting at point $(2,1).$
  • The fourth graph gives the height along the first straight path as a function of $h.$ We go down until we reach the red point, which is the point of minimum height along our trajectory.
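
Before working the example by hand, the loop described above can be sketched in a few lines of Python. This is a minimal sketch, not a reference implementation: the one-dimensional minimization along the descent direction is delegated to `scipy.optimize.minimize_scalar`, and bounding the step by $h\in[0,1]$ is an assumption made for simplicity.

```python
import numpy as np
from scipy.optimize import minimize_scalar

def steepest_descent(f, grad_f, x0, max_iter=20, tol=1e-6):
    """Steepest descent with a line search along -grad f at each step."""
    x = np.asarray(x0, dtype=float)
    for _ in range(max_iter):
        g = grad_f(x)
        if np.linalg.norm(g) < tol:        # stopping condition: small enough gradient
            break
        # find the step h that minimizes f along the half-line x - h*grad f
        res = minimize_scalar(lambda h: f(x - h * g), bounds=(0.0, 1.0), method="bounded")
        x = x - res.x * g
    return x

# Example: f(x, y) = x^2 + 3y^2 with initial guess (2, 1)
f = lambda p: p[0] ** 2 + 3 * p[1] ** 2
grad_f = lambda p: np.array([2 * p[0], 6 * p[1]])
print(steepest_descent(f, grad_f, (2, 1)))   # approaches the minimum (0, 0)
```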

Using the gradient method, let's approximate the minimum of the function $f\left(x,y\right)=x^{2}+3y^{2}$ using the initial guess $(2,1).$

We will iterate with the gradient method using the formula

$$ \left(\begin{array}{c} x_{1}\\ y_{1} \end{array}\right)=\left(\begin{array}{c} x_{0}\\ y_{0} \end{array}\right)-h\nabla f\left(x_{0},y_{0}\right)\qquad (1) $$

and we will search for the value of $h$ for which

$$ g(h)=f\left(x_{0}-h\,f_{x}(x_0,y_0),y_{0}-h\,f_{y}(x_0,y_0)\right) $$

has a minimum.

As

$$ \nabla f=\left(f_{x},f_{y}\right)=\left(2x,6y\right) $$

evaluating at the starting point gives

$$ \nabla f\left(x_{0},y_{0}\right)=\nabla f\left(2,1\right)=\left(4,6\right). $$

Substituting the initial point and the gradient at the initial point into the function $g$,

$$ g(h)=f\left(x_{0}-h\,f_{x}(x_0,y_0),y_{0}-h\,f_{y}(x_0,y_0)\right)=f\left(2-4h,1-6h\right)=\left(2-4h\right)^{2}+3\left(1-6h\right)^{2} $$

Now we would have to find the minimum of this one-variable function numerically. For the sake of simplicity, we will skip this step and do it analytically.

Let's calculate the minimum of this function, taking into account that a necessary condition for a minimum is $g'(h)=0$:

$$ g'(h)=2(-4)\left(2-4h\right)+6(-6)\left(1-6h\right)=248h-52=0\Longrightarrow h=0.2097\approx 0.21 $$

and using this value in equation (1)

$$ \left(\begin{array}{c} x_{1}\\ y_{1} \end{array}\right)=\left(\begin{array}{c} 2\\ 1 \end{array}\right)-0.21\left(\begin{array}{c} 4\\ 6 \end{array}\right)=\left(\begin{array}{r} 1.16\\ -0.26 \end{array}\right) $$

There has been an improvement because $f\left(x_{0},y_{0}\right)=f(2,1)=7$ and $f\left(x_{1},y_{1}\right)=f(1.16,-0.26)=1.55$ which is a smaller value.

For the next iteration, we use the formula

$$ \left(\begin{array}{c} x_{2}\\ y_{2} \end{array}\right)=\left(\begin{array}{c} x_{1}\\ y_{1} \end{array}\right)-h\nabla f\left(x_{1},y_{1}\right)\qquad (2) $$

We need to compute $h$ again so that

$$ g(h)=f\left(x_{1}-h\,f_{x}(x_1,y_1),y_{1}-h\,f_{y}(x_1,y_1)\right) $$

is a minimum, and then we substitute this $h$ into (2).
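
The analytic computation of the first iteration can be checked symbolically. A minimal sketch, assuming the `sympy` library is available:

```python
import sympy as sp

h = sp.symbols('h', real=True)
x0, y0 = 2, 1
fx, fy = 2 * x0, 6 * y0                          # gradient of f at (2, 1): (4, 6)

g = (x0 - h * fx) ** 2 + 3 * (y0 - h * fy) ** 2  # g(h) = f(2 - 4h, 1 - 6h)
h_opt = sp.solve(sp.diff(g, h), h)[0]            # root of g'(h) = 0

x1, y1 = x0 - h_opt * fx, y0 - h_opt * fy        # formula (1)
print(float(h_opt))                  # 0.2097... ≈ 0.21
print(float(x1), float(y1))          # ≈ 1.16, -0.26
print(float(x1 ** 2 + 3 * y1 ** 2))  # ≈ 1.55 < f(2, 1) = 7
```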

With learning rate 0.1

In this case we do not optimize the step: we always use the same $h$. This parameter is called $\eta$, or learning rate, and we must fix it in advance. If it is too small, the algorithm will converge too slowly; if it is too large, it may not converge.

This method has the advantage of simplicity and the drawback that a suitable $\eta$ has to be found.

At each step, we use the formula

$$ \left(\begin{array}{c} x_{1}\\ y_{1} \end{array}\right)=\left(\begin{array}{c} x_{0}\\ y_{0} \end{array}\right)-\eta\,\nabla f\left(x_{0},y_{0}\right) $$

As

$$ \nabla f=\left(f_{x},f_{y}\right)=\left(2x,6y\right) $$

we have

$$ \left(\begin{array}{c} x_{1}\\ y_{1} \end{array}\right)=\left(\begin{array}{c} x_{0}\\ y_{0} \end{array}\right)-\eta \left(\begin{array}{c} 2x_{0}\\ 6y_{0} \end{array}\right) \quad \quad \left(\begin{array}{c} x_{1}\\ y_{1} \end{array}\right)=\left(\begin{array}{c} 2\\ 1 \end{array}\right)-0.1 \left(\begin{array}{c} 2(2)\\ 6(1) \end{array}\right)= \left(\begin{array}{c} 1.6\\ 0.4 \end{array}\right) $$

$$ f(1.6,0.4)= 3.04 $$

$$ \left(\begin{array}{c} x_{2}\\ y_{2} \end{array}\right)=\left(\begin{array}{c} x_{1}\\ y_{1} \end{array}\right)-\eta \left(\begin{array}{c} 2x_{1}\\ 6y_{1} \end{array}\right) \quad \quad \left(\begin{array}{c} x_{2}\\ y_{2} \end{array}\right)=\left(\begin{array}{c} 1.6\\ 0.4 \end{array}\right)-0.1 \left(\begin{array}{c} 2(1.6)\\ 6(0.4) \end{array}\right)= \left(\begin{array}{c} 1.28\\ 0.16 \end{array}\right) $$

$$ f(1.28, 0.16)= 1.7152 $$

$$ \left(\begin{array}{c} x_{3}\\ y_{3} \end{array}\right)=\left(\begin{array}{c} x_{2}\\ y_{2} \end{array}\right)-\eta \left(\begin{array}{c} 2x_{2}\\ 6y_{2} \end{array}\right) \quad \quad \left(\begin{array}{c} x_{3}\\ y_{3} \end{array}\right)=\left(\begin{array}{c} 1.28\\ 0.16 \end{array}\right)-0.1 \left(\begin{array}{c} 2(1.28)\\ 6(0.16) \end{array}\right)= \left(\begin{array}{c} 1.024\\ 0.064 \end{array}\right) $$

$$ f(1.024, 0.064)= 1.060864 $$
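
These three iterations can be reproduced with a short Python loop. A minimal sketch, with the learning rate, the gradient of this particular $f$ and the number of iterations written in directly:

```python
import numpy as np

f = lambda p: p[0] ** 2 + 3 * p[1] ** 2
grad = lambda p: np.array([2 * p[0], 6 * p[1]])

eta = 0.1                           # learning rate
p = np.array([2.0, 1.0])            # initial guess (x0, y0)

for k in range(1, 4):
    p = p - eta * grad(p)           # (x_k, y_k) = (x_{k-1}, y_{k-1}) - eta * grad f(x_{k-1}, y_{k-1})
    print(k, p.round(3), round(f(p), 6))
# 1 [1.6 0.4] 3.04
# 2 [1.28 0.16] 1.7152
# 3 [1.024 0.064] 1.060864
```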

Newton's method

Introduction

Newton's method for finding a root of the nonlinear equation $f(x)=0$, starting from an initial value $x_0$, is

$$ x_{k+1} = x_k-\frac{f(x_k)}{f'(x_k)} $$

and we obtain

$$ x_{1} = x_0-\frac{f(x_0)}{f'(x_0)},\qquad x_{2} = x_1-\frac{f(x_1)}{f'(x_1)}, \qquad x_{3} = x_2-\frac{f(x_2)}{f'(x_2)},\quad \ldots \quad \rightarrow \quad \alpha $$

If we want to find a maximum or a minimum of the function $ f $, for $f$ smooth, a necessary condition is $$ f'(x)=0 $$

That is, we look for a zero of this equation, which we can compute with Newton's method: taking an initial value $x_0$, we iterate with the formula

$$ x_{k+1} = x_k-\frac{f'(x_k)}{f''(x_k)} $$
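
As a small illustration (with a made-up function that is not part of the exercise), this sketch applies the iteration above to $f(x)=x^{4}-3x^{2}+x$, whose first and second derivatives are written in by hand:

```python
def newton_min_1d(df, d2f, x0, n_iter=10):
    """Iterate x_{k+1} = x_k - f'(x_k) / f''(x_k) to approximate a critical point."""
    x = x0
    for _ in range(n_iter):
        x = x - df(x) / d2f(x)
    return x

# Hypothetical example: f(x) = x^4 - 3x^2 + x
df = lambda x: 4 * x ** 3 - 6 * x + 1   # f'(x)
d2f = lambda x: 12 * x ** 2 - 6         # f''(x)
print(newton_min_1d(df, d2f, x0=1.5))   # ≈ 1.1309, a local minimum of f (f'' > 0 there)
```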

Similarly, for functions of $ n $ variables $f:\Omega\subset \mathbb{R}^n\to \mathbb{R}$, taking into account that the components of the gradient vector $\nabla f(\mathbf{x})$ are the first partial derivatives and the components of the Hessian matrix $H(\mathbf{x})$ are the second partial derivatives, we can formulate Newton's method, starting from an initial value $\mathbf{x}_0$, as

$$ \mathbf{x}_{k+1} = \mathbf{x}_{k}-H^{-1}(\mathbf{x}_{k}) \cdot \nabla f(\mathbf{x}_{k}) $$

This is not how the method is actually constructed: in both the one-dimensional and the multidimensional case, Newton's method can be derived from the corresponding Taylor expansion.

Since our function is a two-variable function, we will iterate with Newton's method using the formula

$$ \left(\begin{array}{c} x_{1}\\ y_{1} \end{array}\right)=\left(\begin{array}{c} x_{0}\\ y_{0} \end{array}\right)-H^{-1}\left(x_{0},y_{0}\right)\cdot\nabla f\left(x_{0},y_{0}\right) $$

If we take into account that

$$ H\left(x_{0},y_{0}\right)\left(\begin{array}{c} c_{1}\\ c_{2} \end{array}\right)=\nabla f\left(x_{0},y_{0}\right)\qquad (1) $$

then

$$ \left(\begin{array}{c} c_{1}\\ c_{2} \end{array}\right)=H^{-1}\left(x_{0},y_{0}\right)\cdot\nabla f\left(x_{0},y_{0}\right) $$

and we can write

$$ \left(\begin{array}{c} x_{1}\\ y_{1} \end{array}\right)=\left(\begin{array}{c} x_{0}\\ y_{0} \end{array}\right)-\left(\begin{array}{c} c_{1}\\ c_{2} \end{array}\right)\qquad (2) $$

where $\left(c_{1},c_{2}\right)^{T}$ is the solution of system (1). In general, it makes more sense to solve system (1) than to compute the inverse matrix $H^{-1}$ and then multiply it by the gradient: computing the inverse of a matrix is equivalent to solving $ n $ linear systems (although with the same coefficient matrix), whereas in this way we solve only one system.

Exercise

The function to minimize is $f\left(x,y\right)=x^{2}+3y^{2}$ and the initial guess is $(2,1)$

The gradient is

$$ \nabla f=\left(\frac{\partial f(x,y)}{\partial x},\frac{\partial f(x,y)}{\partial y}\right)=\left(2x,6y\right) $$

and

$$ \nabla f\left(x_{0},y_{0}\right)=\nabla f\left(2,1\right)=\left(4,6\right). $$

Also

$$ H=\left(\begin{array}{cc} \dfrac{\partial^{2}f(x,y)}{\partial x\,\partial x} & \dfrac{\partial^{2}f(x,y)}{\partial x\,\partial y}\\ \dfrac{\partial^{2}f(x,y)}{\partial y\,\partial x} & \dfrac{\partial^{2}f(x,y)}{\partial y\,\partial y} \end{array}\right)=\left(\begin{array}{cc} 2 & 0\\ 0 & 6 \end{array}\right) $$

and

$$ H\left(x_{0},y_{0}\right)=H\left(2,1\right)=\left(\begin{array}{cc} 2 & 0\\ 0 & 6 \end{array}\right) $$

The system (1) is

$$ \left(\begin{array}{cc} 2 & 0\\ 0 & 6 \end{array}\right)\left(\begin{array}{c} c_{1}\\ c_{2} \end{array}\right)=\left(\begin{array}{c} 4\\ 6 \end{array}\right) $$

and solving the system we obtain

$$ \left(\begin{array}{c} c_{1}\\ c_{2} \end{array}\right)=\left(\begin{array}{c} 2\\ 1 \end{array}\right). $$

Therefore (2) is

$$ \left(\begin{array}{c} x_{1}\\ y_{1} \end{array}\right)=\left(\begin{array}{c} 2\\ 1 \end{array}\right)-\left(\begin{array}{c} 2\\ 1 \end{array}\right)=\left(\begin{array}{c} 0\\ 0 \end{array}\right). $$

There has been an improvement because $f\left(x_{0},y_{0}\right)=f(2,1)=7$ and $f\left(x_{1},y_{1}\right)=f(0,0)=0$, which is smaller.

In this case, we have arrived at the minimum in only one iteration.
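
The same computation can be done in Python, solving system (1) with `numpy.linalg.solve` rather than inverting the Hessian, as discussed above. A minimal sketch with the gradient and Hessian of this exercise written in by hand:

```python
import numpy as np

grad = lambda p: np.array([2 * p[0], 6 * p[1]])   # gradient of f(x, y) = x^2 + 3y^2
hess = lambda p: np.array([[2.0, 0.0],
                           [0.0, 6.0]])           # Hessian (constant for this f)

p0 = np.array([2.0, 1.0])                         # initial guess
c = np.linalg.solve(hess(p0), grad(p0))           # solve system (1): H c = grad f
p1 = p0 - c                                       # formula (2)
print(p1)                                         # [0. 0.], the minimum, in one iteration
```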

The golden section search

We will optimize the single-variable function $g$. The golden section method is an interval method: if the function has a single minimum in the interval, it converges to that minimum.

If the function has a single minimum in $[a,b]$ and we take two inner points $x_1$ and $x_2$ with $a \lt x_1 \lt x_2 \lt b$, then, comparing the values of $g$ at $x_1$ and $x_2$, we can discard either $[a,x_1)$ or $(x_2,b]$ while keeping the minimum within the remaining interval.

We choose the points at fractions $\phi$ and $1-\phi$ from one of the endpoints, with $\phi\in(0.5,1]$, and with the condition that, if we discard $(x_2,b]$, the new $x_2$ is the old $x_1$, and if we discard $[a,x_1)$, the new $x_1$ is the old $x_2.$ Thus, at each iteration we only have to evaluate the function at one new point.

Without loss of generality, we can assume that the initial interval $[a,b]$ is $[0,1]$ and that in the first iteration we discard the interval on the right.

  • In the first iteration, and taking into account that $\phi\in(0.5,1]$, we see the points $a,$ $x_1$, $x_2$ and $b$ and their values.

  • If we discard $(x_2,b]$, in the second iteration, the points $x_1$ and $x_2$ are $1-\phi$ and $\phi$ multiplied by $\phi$ because the interval no longer has length $1$ but $\phi.$

  • To make the new $x_2$ equal to the old $x_1$, $$\phi^2=1-\phi\qquad\Longrightarrow \qquad \phi = \frac{\sqrt{5}-1}{2}\approx 0.618034$$
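
Indeed, rewriting this condition as a quadratic equation and keeping its positive root:

$$ \phi^{2}+\phi-1=0\qquad\Longrightarrow\qquad \phi=\frac{-1\pm\sqrt{5}}{2}\qquad\Longrightarrow\qquad \phi=\frac{\sqrt{5}-1}{2}\approx 0.618034 $$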

Algorithm

  • Given $a_1=a$, $b_1=b$ and $\phi = \dfrac{\sqrt{5}-1}{2}$
  • For $k=1,2,\ldots,\mathrm{MaxNumIter}$
    • Compute the points
      • $x_1=a_k+(1-\phi)(b_k-a_k)$
      • $x_2=a_k+\phi(b_k-a_k)$
    • if $g(x_1) \gt g(x_2)$ then:

      $ a_{k+1}=x_1,$ $ b_{k+1}=b_k,$

    • else:

      $ a_{k+1}=a_{k},$ $ b_{k+1}=x_2.$

    • If the stopping criterion is met, stop.

The idea is that the minimum always remains within the current interval.
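
A minimal Python sketch of this algorithm (the stopping criterion here is simply a fixed number of iterations, and, for clarity, $g$ is re-evaluated at both inner points in every iteration instead of reusing the previous value):

```python
import math

def golden_section(g, a, b, n_iter=10):
    """Golden section search for the minimum of g on [a, b]."""
    phi = (math.sqrt(5) - 1) / 2
    for k in range(1, n_iter + 1):
        x1 = a + (1 - phi) * (b - a)
        x2 = a + phi * (b - a)
        print(k, round(a, 4), round(x1, 4), round(x2, 4), round(b, 4),
              round(g(x1), 4), round(g(x2), 4))
        if g(x1) > g(x2):
            a = x1              # minimum on the right: discard [a, x1)
        else:
            b = x2              # minimum on the left: discard (x2, b]
    # of the last two inner points, return the one with the lower value of g
    return x1 if g(x1) < g(x2) else x2

# Applied to the g(h) of this exercise, it reproduces the table worked out below
g = lambda h: (2 - 4 * h) ** 2 + 3 * (1 - 6 * h) ** 2
print(golden_section(g, 0.0, 1.0))   # ≈ 0.2098
```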

Let's apply it to calculate the minimum of the function

$$ g(h)=\left(2-4h\right)^{2}+3\left(1-6h\right)^{2} $$
The first 10 iterations are shown in the following table, where, from the second iteration on, the boxed value marks the point computed anew at that iteration (in the first iteration both inner points are new):

$$ \begin{array}{|c|cccc|cc|} \hline k & a & x_1 & x_2 & b & g(x_1) & g(x_2)\\ \hline 1& 0.0000& 0.3820& 0.6180& 1.0000& 5.2291&22.2260\\ 2& 0.0000& \fbox{0.2361}& 0.3820& 0.6180& 1.6347& 5.2291\\ 3& 0.0000& \fbox{0.1459}& 0.2361& 0.3820& 2.0528& 1.6347\\ 4& 0.1459& 0.2361& \fbox{0.2918}& 0.3820& 1.6347& 2.3846\\ 5& 0.1459& \fbox{0.2016}& 0.2361& 0.2918& 1.5564& 1.6347\\ 6& 0.1459& \fbox{0.1803}& 0.2016& 0.2361& 1.6551& 1.5564\\ 7& 0.1803& 0.2016& \fbox{0.2148}& 0.2361& 1.5564& 1.5516\\ 8& 0.2016& 0.2148& \fbox{0.2229}& 0.2361& 1.5516& 1.5701\\ 9& 0.2016& \fbox{0.2098}& 0.2148& 0.2229& 1.5484& 1.5516\\ 10& 0.2016& \fbox{0.2067}& 0.2098& 0.2148& 1.5495& 1.5484\\ \hline \end{array} $$

Iteration 1

  • We calculate $x_1$ and $x_2.$
  • Since $g(x_1)\lt g(x_2)$, the minimum is on the left:
    • We get rid of the interval on the right.
    • The $x_1$ becomes the new $x_2.$
    • We calculate the new $x_1.$

Iteration 2

  • Since $g(x_1)\lt g(x_2)$, the minimum is on the left:
    • We get rid of the interval on the right.
    • The $x_1$ becomes the new $x_2.$
    • We calculate the new $x_1.$

Iteration 3

  • Since $g(x_1)\gt g(x_2)$, the minimum is on the right:
    • We get rid of the interval on the left.
    • The $x_2$ becomes the new $x_1.$
    • We calculate the new $x_2.$

And so on.

Of the last two points $x_1$ and $x_2$, we take as the approximation the one at which the function has the lower value, that is $$\fbox{0.2098}$$ which agrees with the value $h\approx 0.21$ obtained analytically above.