Gradient Methods

Gradient descent is one of the most popular optimization algorithms in machine learning. It can be used for both convex and non-convex optimization problems. In this post, we will learn about the key ideas behind gradient descent and how it can be used to solve optimization problems.

This post is based on a course from the University of California, Berkeley: Convex Optimization and Approximation. I summarize the key ideas from the course and provide some examples to help you understand the concepts better.

Gradient descent is, at its core, a form of local search. It starts with an initial guess and iteratively improves the guess by moving in the direction of the negative gradient. The size of the step is controlled by a parameter called the learning rate. The algorithm stops when the gradient is zero or when the change in the guess falls below a certain threshold. The algorithm is visualized below.

For the broad class of convex functions, gradient descent is guaranteed to converge to the global minimum. For non-convex functions, the algorithm may only converge to a local minimum (or another stationary point). In practice, it is often used to find a good local minimum.

Convexity

A set is convex if for any two points in the set, the line segment connecting them also lies in the set. A function is convex if for any two points on its graph, the line segment connecting them lies on or above the graph.

For a convex set $S \subseteq \mathbb{R}^n$, for all $x, y \in S$ and every scalar $\alpha \in [0, 1]$, we have:

$$ \alpha x + (1 - \alpha)y \in S \tag{1} $$

A function $f: \Omega \to \mathbb{R}$ is convex if for all $x, y \in \Omega$ and every scalar $\alpha \in [0, 1]$, we have:

$$ f(\alpha x + (1 - \alpha)y) \leq \alpha f(x) + (1 - \alpha)f(y) \tag{2} $$

The epigraph of a function $f$ is the set of all points $(x, t)$ such that $t \geq f(x)$, which means

$$ \mathrm{epi}(f) = \{(x, t) \mid f(x) \leq t\} \tag{3} $$

A function is convex if and only if its epigraph is a convex set.

Figure 1. The illustration of the epigraph of a convex function (figure taken from Wikipedia).

We define the gradient of a function $f: \Omega \to \mathbb{R}$ as the vector of partial derivatives:

$$ \nabla f(x) = \left(\frac{\partial f}{\partial x_i}\right)_{i=1}^n \tag{4} $$

By the definition of the directional derivative, for any direction $y$ we have

$$ \nabla f(x)^T y = \left. \frac{\partial f(x + \delta y)}{\partial \delta} \right|_{\delta = 0} \tag{5} $$

According to Taylor’s theorem, for a small displacement $\delta \in \mathbb{R}^n$ we have

$$ f(x + \delta) = f(x) + \nabla f(x)^T \delta + \frac{1}{2} \delta^T \nabla^2 f(x)\, \delta + \mathcal{O}(\|\delta\|^3) \tag{6} $$

We define the Hessian of a function $f: \Omega \to \mathbb{R}$ as the matrix of second-order partial derivatives:

$$ \nabla^2 f(x) = \left(\frac{\partial^2 f}{\partial x_i \partial x_j}\right)_{i, j = 1}^n \tag{7} $$

Schwarz’s theorem implies that the Hessian at a point $x$ is symmetric whenever $f$ has continuous second partial derivatives at $x$. If, in addition, $f$ is convex, the Hessian is positive semidefinite.
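To make these definitions concrete, here is a small numerical sanity check (a sketch using NumPy; the quadratic test function, the point $x$, and the direction $y$ are arbitrary illustrative choices) comparing the analytic gradient and Hessian against equations (5) and (6):

```python
import numpy as np

# Illustrative test function: f(x) = 0.5 * x^T Q x + c^T x (a simple quadratic)
Q = np.array([[3.0, 1.0], [1.0, 2.0]])
c = np.array([1.0, -1.0])

f = lambda x: 0.5 * x @ Q @ x + c @ x
grad = lambda x: Q @ x + c          # analytic gradient
hess = lambda x: Q                  # analytic Hessian (constant for a quadratic)

x = np.array([0.5, -0.3])
y = np.array([1.0, 2.0])
delta = 1e-5

# Directional derivative, cf. equation (5): grad(x)^T y ~ (f(x + delta*y) - f(x)) / delta
fd_directional = (f(x + delta * y) - f(x)) / delta
print(grad(x) @ y, fd_directional)

# Second-order Taylor expansion, cf. equation (6)
d = delta * y
taylor = f(x) + grad(x) @ d + 0.5 * d @ hess(x) @ d
print(f(x + d), taylor)
```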

Much of this post will be about different ways of minimizing a convex function $f: \Omega \to \mathbb{R}$, where $\Omega$ is a convex set:

$$ \min_{x \in \Omega} f(x) \tag{8} $$

Remark: convex optimization is not necessarily easy to solve (even though the mathematical concepts are very intuitive). For starters, convex sets do not necessarily have a nice representation or enjoy compact descriptions.

For example, a convex set may only be given implicitly, say as the intersection of a huge number of constraints, and may not admit any compact explicit description that we can store in a computer. Secondly, minimizing a convex function is not automatically easy either: minimizing an unconstrained convex quadratic amounts to solving a linear system, but seemingly mild changes to the constraint set or the objective can make the problem substantially harder.

When solving computational problems involving convex sets, we need to worry about how to represent the convex set we’re dealing with. Rather than asking for an explicit description of the set, we can instead require a computational abstraction that highlights essential operations that we can carry out. The Separation Theorem motivates an important computational abstraction called separation oracle.

A separation oracle for a convex set $K$ is a procedure that takes a point $x$ as input and either certifies that $x \in K$ or returns a hyperplane separating $x$ from $K$. The separation oracle is a computational abstraction that allows us to solve convex optimization problems without having to explicitly represent the convex set $K$.
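As a toy illustration (my own example, not from the course), here is what a separation oracle could look like for the Euclidean ball $K = \{z : \|z\| \leq r\}$: for a point outside the ball, the hyperplane with normal $x/\|x\|$ and offset $r$ separates $x$ from $K$.

```python
import numpy as np

def ball_separation_oracle(x, r=1.0):
    """Separation oracle for K = {z : ||z|| <= r}.

    Returns None if x is in K; otherwise returns (a, b) describing the
    separating hyperplane {z : a^T z = b} with a^T x > b >= a^T z for all z in K.
    """
    norm_x = np.linalg.norm(x)
    if norm_x <= r:
        return None              # x is a member of K
    a = x / norm_x               # outward normal direction
    b = r                        # a^T z <= r for every z in the ball
    return a, b

print(ball_separation_oracle(np.array([2.0, 0.0])))  # separating hyperplane
print(ball_separation_oracle(np.array([0.3, 0.4])))  # None: the point is inside
```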

Figure 2. The illustration of a separation oracle (you can understand the separation oracle as a way of exhibiting the boundary of a convex set).

Algorithmic efficiency: our goal when minimizing a convex function $f$ is to find, for a given accuracy $\epsilon > 0$, a point $x$ such that

$$ f(x) \leq \min_{z \in \Omega} f(z) + \epsilon $$

The cost of the algorithm depends on the tolerance $\epsilon$. Highly practical algorithms often have running times that scale polynomially in $1/\epsilon$, such as $\mathcal{O}(1/\epsilon)$ or $\mathcal{O}(1/\epsilon^2)$.

In this series of posts, we will make an attempt to highlight both the theoretical performance and practical appeal of an algorithm. Moreover, we will discuss other performance criteria such as robustness to noise. How well an algorithm performs is rarely decided by a single criterion, and usually depends on the application at hand.

Gradient descent

For a differentiable function $f: \Omega \to \mathbb{R}$, the basic gradient descent method, starting from an initial point $x_0$, is defined as

$$ x_{t+1} = x_t - \eta_t \nabla f(x_t), \quad t = 0, 1, 2, \cdots \tag{9} $$

where the scalar $\eta_t$ is the so-called learning rate (or step size). The learning rate is a hyperparameter that controls the size of the step we take at each iteration. It may vary from iteration to iteration, or it may be fixed.

There are numerous ways of choosing step sizes that have a significant effect on the performance of gradient descent. What we will see in this lecture are several choices of step sizes that ensure the convergence of gradient descent by virtue of a theorem. These step sizes are not necessarily ideal for practical applications.

In equation (9), $x_{t+1}$ might not be in $\Omega$ (watch the video at the beginning of the post; we cannot guarantee that all iterates stay within the domain). In this case, we can map $x_{t+1}$ back to the closest point of $\Omega$. This operation is called projection.

For a point $\tilde{x}$ that is not in $\Omega$, the projection of $\tilde{x}$ onto $\Omega$ is defined as

$$ \mathrm{proj}_{\Omega}(\tilde{x}) = \arg \min_{x \in \Omega} \|x - \tilde{x}\|_2^2 \tag{10} $$

Figure 4. The illustration of projection (figure taken from Patrick Rebeschini).

Now we are guaranteed that $x_{t+1}$ is in $\Omega$ for all iterations. Note that computing the projection may be computationally the hardest part of the problem. However, there are convex sets for which we know explicitly how to compute the projection.

Here is our new, projected gradient descent algorithm: at each iteration, take the gradient step from equation (9) and then project the result back onto $\Omega$ using equation (10).
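A minimal sketch of projected gradient descent in Python (assuming, for illustration, that $\Omega$ is a Euclidean ball so the projection (10) has a simple closed form; the objective, step size, and iteration count are arbitrary choices):

```python
import numpy as np

def project_onto_ball(x, radius=1.0):
    """Euclidean projection onto {x : ||x|| <= radius}, cf. equation (10)."""
    norm_x = np.linalg.norm(x)
    return x if norm_x <= radius else radius * x / norm_x

def projected_gradient_descent(grad, x0, eta=0.1, num_iters=100, radius=1.0):
    """Projected gradient descent: take a gradient step, then project back onto Omega."""
    x = x0
    for _ in range(num_iters):
        x = project_onto_ball(x - eta * grad(x), radius)
    return x

# Illustrative example: minimize f(x) = ||x - c||^2 over the unit ball.
c = np.array([2.0, 1.0])
grad = lambda x: 2.0 * (x - c)
x_star = projected_gradient_descent(grad, x0=np.zeros(2), eta=0.1, num_iters=200)
print(x_star)  # approximately c / ||c||, the closest feasible point to c
```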

When we implement the above algorithm, there are many practical issues we have to deal with. One of the most important is making sure that the algorithm converges. If the gradients of the objective function are too large, the algorithm may diverge; if the gradients are too small, the algorithm may take too many iterations to converge (see the following video to get a feeling for the convergence behavior of gradient descent).
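To get a feel for this, consider gradient descent on the one-dimensional toy function $f(x) = x^2$ (my own example, not from the original post): since $\nabla f(x) = 2x$, the update is $x \leftarrow (1 - 2\eta)x$, which converges when $|1 - 2\eta| < 1$ and diverges otherwise.

```python
def gd_1d(eta, x0=1.0, num_iters=20):
    """Gradient descent on f(x) = x^2, whose gradient is 2x."""
    x = x0
    for _ in range(num_iters):
        x = x - eta * 2 * x
    return x

print(gd_1d(eta=0.1))   # converges towards 0
print(gd_1d(eta=1.1))   # |1 - 2*eta| > 1: the iterates blow up
```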

To help us quantify the convergence of gradient descent, we need to introduce the notion of a Lipschitz constant.

A function $f: \Omega \to \mathbb{R}$ is Lipschitz continuous with constant $L$ if for all $x, y \in \Omega$, we have

$$ |f(x) - f(y)| \leq L \|x - y\| \tag{11} $$

If the function $f$ is $L$-Lipschitz continuous (and differentiable), then

$$ \|\nabla f(x)\| \leq L \tag{12} $$

which gives a bound on the norm of the gradient of $f$. This property is what drives the convergence guarantee in the following theorem.

Theorem 1. Let $f: \Omega \to \mathbb{R}$ be a convex function that is $L$-Lipschitz continuous. Let $R$ be an upper bound on the distance $\|x_0 - x^*\|$ between the initial point $x_0$ and an optimal point $x^*$. Let $x_1, \cdots, x_t$ be the sequence generated by gradient descent with constant learning rate $\eta = \frac{R}{L\sqrt{t}}$. Then,

$$ f\left( \frac{1}{t} \sum_{s=1}^t x_s \right) - f(x^*) \leq \frac{RL}{\sqrt{t}} \tag{13} $$

This means that the gap between the function value at the average iterate and the optimal value is bounded above by a quantity proportional to $1/\sqrt{t}$. This is a useful result because it gives an explicit convergence rate for gradient descent, namely $\mathcal{O}(1/\sqrt{t})$.

The proof of Theorem 1 is not difficult, but it is a bit tedious. We will skip the proof here. If you are interested, you can find the proof in the book Convex Optimization by Boyd and Vandenberghe.

After learning the bound on the convergence rate, we now discuss the smoothness of the objective function.

A continuously differentiable function $f: \mathbb{R}^n \to \mathbb{R}$ is $\beta$-smooth if the gradient map $\nabla f$ is $\beta$-Lipschitz continuous, meaning

$$ \| \nabla f(x) - \nabla f(y) \| \leq \beta \|x - y\| \tag{14} $$

Why do we care about the smoothness of the objective function? In the gradient descent algorithm, we need to choose the step size $\eta$. If the gradient changes quickly, we have to keep the step size correspondingly small. The smoothness of the objective function tells us how quickly the gradient changes. In summary, for the objective function we care about two things:

- the Lipschitz constant $L$, which bounds the magnitude of the gradient;
- the smoothness constant $\beta$, which bounds how quickly the gradient changes.

The first one gives the range of gradient values, and the second one tells us how quickly the gradient changes.

Lemma 1. Let $f$ be a $\beta$-smooth function. Then, for any $x, y \in \mathbb{R}^n$, we have

$$ \left| f(y) - f(x) - \nabla f(x)^T (y - x) \right| \leq \frac{\beta}{2} \|y - x\|^2 \tag{15} $$

The significance of this lemma is that (15) provides a quadratic upper bound on $f(y)$ around $x$; choosing $y = x - \frac{1}{\beta} \nabla f(x)$ to minimize that upper bound gives us:

$$ f(y) - f(x) \leq - \frac{1}{2\beta} \| \nabla f(x) \|^2 \tag{16} $$

This means that the gradient update decreases the function value by an amount proportional to the squared norm of the gradient. This is a very useful result, because it tells us that gradient descent with step size $1/\beta$ is guaranteed to decrease the function value at each iteration (as long as the gradient is nonzero).
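To see where (16) comes from, substitute $y = x - \frac{1}{\beta}\nabla f(x)$ into the upper bound implied by Lemma 1:

$$
\begin{aligned}
f(y) &\leq f(x) + \nabla f(x)^T (y - x) + \frac{\beta}{2}\|y - x\|^2 \\
&= f(x) - \frac{1}{\beta}\|\nabla f(x)\|^2 + \frac{1}{2\beta}\|\nabla f(x)\|^2 \\
&= f(x) - \frac{1}{2\beta}\|\nabla f(x)\|^2 .
\end{aligned}
$$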

Lemma 2. Let $f$ be a $\beta$-smooth function. Then, for any $x, y \in \mathbb{R}^n$, we have

$$ f(x) - f(y) \leq \nabla f(x)^T (x - y) + \frac{\beta}{2} \| x - y \|^2 \tag{17} $$

We will show that gradient descent with the update rule

$$ x_{t+1} = x_t - \eta_t \nabla f(x_t) $$

attains a faster convergence rate under the smoothness condition.

Theorem 2. Let $f$ be a convex, $\beta$-smooth function on $\mathbb{R}^n$. Then gradient descent with the constant step size $\eta_t = \frac{1}{\beta}$ satisfies

$$ f(x_t) - f(x^*) \leq \frac{2 \beta \| x_1 - x^* \|^2}{t-1} \tag{18} $$

which means that the convergence rate of gradient descent is $\mathcal{O}(1/t)$.

Now we will introduce the notion of strong convexity and combine it with smoothness to develop the concept of the condition number. While smoothness gave us an upper bound on the second-order term in Taylor’s approximation, strong convexity gives us a lower bound. Taken together, these two assumptions are very powerful, as they lead to a much faster convergence rate of the form $\exp(-\alpha t)$, where $\alpha$ is a constant.

A function $f: \Omega \to \mathbb{R}$ is strongly convex with constant $\alpha > 0$ if for all $x, y \in \Omega$, we have

$$ f(y) \geq f(x) + \nabla f(x)^T (y - x) + \frac{\alpha}{2} \|y - x\|^2 \tag{19} $$

or, equivalently,

$$ f(y) - f(x) \geq \nabla f(x)^T (y - x) + \frac{\alpha}{2} \|x - y\|^2 \tag{20} $$

Figure 5. The illustration of $\beta$-smoothness and $\alpha$-strong convexity (figure taken from Francis Bach).

In Figure 5, the blue curve is a function $f$ that is $\beta$-smooth and $\alpha$-strongly convex; at every point $x \in \mathbb{R}^n$ we can construct global upper and lower quadratic bounds with curvatures $\beta$ and $\alpha$, respectively.

(Condition number) Let $f$ be a $\beta$-smooth and $\alpha$-strongly convex function on $\mathbb{R}^n$. Then, the condition number of $f$ is defined as

$$ \kappa(f) = \frac{\beta}{\alpha} \tag{21} $$

For a positive-definite quadratic function $f$, this definition of the condition number coincides with the perhaps more familiar condition number of the matrix defining the quadratic.
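As a concrete check (a standard example, not from the original post), take $f(x) = \frac{1}{2} x^T Q x - b^T x$ with $Q$ symmetric positive definite. Then $\nabla^2 f(x) = Q$ everywhere, so

$$ \beta = \lambda_{\max}(Q), \qquad \alpha = \lambda_{\min}(Q), \qquad \kappa(f) = \frac{\lambda_{\max}(Q)}{\lambda_{\min}(Q)}, $$

which is exactly the usual condition number of the matrix $Q$.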

Now, let’s summarize what we have learned into a table.

|  | Convex | Strongly convex |
| --- | --- | --- |
| Lipschitz | $\epsilon \leq \mathcal{O}(1 / t^{1/2})$ | $\epsilon \leq \mathcal{O}(1/t)$ |
| Smooth | $\epsilon \leq \mathcal{O}(1/t)$ | $\epsilon \leq e^{-\alpha t}$ |

Before we move on, let’s elaborate on the upper and lower bounds from Lemma 1 and inequality (19).

The goal of gradient descent is to find a minimum of a function $f$. Now, let’s assume the optimal point is $x^*$, which means $f(x^*) = \min f(x)$. When we iterate the gradient descent algorithm, we will eventually reach a point $x_t$ that is close to $x^*$. Let’s instantiate the smoothness upper bound (Lemma 1) at $x = x^*$ and $y = x_t$:

$$ f(x_t) \leq f(x^*) + \nabla f(x^*)^T (x_t - x^*) + \frac{\beta}{2} \|x_t - x^*\|^2 \tag{22} $$

and the strong-convexity bound (19), with $x = x^*$ and $y = x_t$, as

$$ f(x_t) \geq f(x^*) + \nabla f(x^*)^T (x_t - x^*) + \frac{\alpha}{2} \| x_t - x^* \|^2 \tag{23} $$

Figure 5. The illustration of $\beta$-smoothness and $\alpha$-strong convexity again (figure taken from zhengwenjie).

The idea is that we use quadratic approximations of $f$ around $x^*$ to bound the value of $f$ at $x_t$. In practice, having lower and upper bounds on the value of $f$ at $x_t$ is very useful.

To gain a deeper understanding of the convergence rate of gradient descent, we need to practice with concrete examples. In the next section, we will discuss some applications of gradient descent and implement them in Python.

Some applications of gradient descent

In this section, we will discuss some applications of gradient descent and implement them in Python.

One of the most fundamental data analysis tools is linear least squares. Given an $m \times n$ matrix $A$ and an $m$-dimensional vector $b$, the goal is to find the $n$-dimensional vector $x$ that minimizes the squared error

$$ f(x) = \frac{1}{2m} \|Ax - b\|^2 \tag{24} $$

We can verify

$$ \nabla f(x) = \frac{1}{m} A^T (Ax - b); \quad \nabla^2 f(x) = \frac{1}{m} A^T A \tag{25} $$

This means that $f$ is $\beta$-smooth and $\alpha$-strongly convex with

$$ \beta = \frac{1}{m}\lambda_{\max}(A^T A) \quad \text{and} \quad \alpha = \frac{1}{m}\lambda_{\min}(A^T A) \tag{26} $$

Now, we will simulate a dataset with $m = 100$ and $n = 3$ and solve the least squares problem using gradient descent. The following table gives the initial guess $x_0$, the true coefficients $x$, and the final solution $x_t$.

| Initial guess ($x_0$) | True coefficient ($x$) | Estimated coefficient ($x_t$) |
| --- | --- | --- |
| -0.831995 | -1.94281 | -1.93834 |
| -0.0319888 | 1.51628 | 1.53504 |
| -0.187633 | -0.737795 | -0.736319 |
Figure 6. The plot of the loss function and its convergence process.

From Figure 6, we can see that the loss converges to its minimum after about 20 iterations.
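Here is a minimal sketch of such an experiment (assuming NumPy; the random seed, noise level, and iteration count are illustrative choices and will not reproduce the exact numbers in the table above):

```python
import numpy as np

rng = np.random.default_rng(0)
m, n = 100, 3
A = rng.normal(size=(m, n))
x_true = rng.normal(size=n)
b = A @ x_true + 0.1 * rng.normal(size=m)       # noisy observations

def loss(x):
    return 0.5 / m * np.sum((A @ x - b) ** 2)   # equation (24)

def grad(x):
    return A.T @ (A @ x - b) / m                # equation (25), with the 1/m factor

beta = np.linalg.eigvalsh(A.T @ A / m).max()    # smoothness constant, equation (26)
x = rng.normal(size=n)                          # initial guess x0
for t in range(100):
    x = x - (1.0 / beta) * grad(x)              # gradient step with eta = 1/beta

print(x_true)   # true coefficients
print(x)        # estimated coefficients after gradient descent
```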

The above simulation assumes $m \gg n$. In practice, we often have $m \ll n$, and the least squares problem is then ill-posed. Since $A^T A$ is no longer full rank, the objective function is no longer strongly convex because

$$ \lambda_{\min}(A^T A) = 0 \tag{27} $$

The following table gives the first five components of the initial guess $x_0$, the true coefficients $x$, and the final solution $x_t$.

| Initial guess ($x_0$) | True coefficient ($x$) | Estimated coefficient ($x_t$) |
| --- | --- | --- |
| -2.38525 | 0.00156901 | -2.3963 |
| -0.877133 | -0.00940305 | -0.76494 |
| -0.166967 | -0.00913699 | -0.1627 |
| -0.797044 | -0.00410237 | -0.264272 |
| -0.611308 | -0.0141191 | -0.868437 |
Figure 7. The plot of the loss function and its convergence process for the underdetermined least squares problem.

As shown in Figure 7, the loss function does not converge to a minimum. This is because the objective function is no longer strongly convex. In this case, we can add a regularization term that penalizes the magnitude of the coefficients and use the proximal gradient method, a variant of gradient descent that handles such a (possibly non-smooth) regularizer $g$ through its proximal operator. The proximal operator of $g$ is defined as

$$ \mathrm{prox}_{g}(x) = \text{argmin}_{y} \left\{ \frac{1}{2} \|y - x\|^2 + g(y) \right\} \tag{28} $$

Very often, we will use the $\ell_1$-norm or the (squared) $\ell_2$-norm as the regularization term.
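Both choices lead to well-known closed-form proximal operators (standard results, stated here for reference): for $g(y) = \lambda \|y\|_1$ the operator is coordinate-wise soft-thresholding, and for $g(y) = \frac{\lambda}{2}\|y\|^2$ it is a simple shrinkage,

$$ \mathrm{prox}_{\lambda \|\cdot\|_1}(x)_i = \mathrm{sign}(x_i)\, \max(|x_i| - \lambda, 0), \qquad \mathrm{prox}_{\frac{\lambda}{2}\|\cdot\|^2}(x) = \frac{x}{1 + \lambda}. $$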

Now, for the least squares problem, we can add the ridge penalty term to the objective function. The new objective function is

$$ f(x) = \frac{1}{2m} \|Ax - b\|^2 + \frac{\lambda}{2} \|x\|^2 \tag{29} $$

| Initial guess ($x_0$) | True coefficient ($x$) | Estimated coefficient ($x_t$) |
| --- | --- | --- |
| 1.32252 | 6.40826e-05 | 0.00770425 |
| -0.980389 | -0.00379448 | -0.0115582 |
| -0.011439 | 0.00818096 | 0.00947045 |
| -0.0894514 | 0.0064178 | 0.00469251 |
| -0.131615 | -0.00768739 | -0.00732296 |
Figure 8. The plot of the loss function and its convergence process for the underdetermined least squares problem with the ridge penalty.

You can see that the error does not decrease below a certain level due to the regularization term. This is not a bad thing: the regularization term gives us strong convexity, which restores convergence of the iterates.
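To see why strong convexity is restored, note (a one-line calculation under the same $1/(2m)$ scaling as equation (29)) that the Hessian of the regularized objective is

$$ \nabla^2 f(x) = \frac{1}{m} A^T A + \lambda I \succeq \lambda I, $$

so the regularized objective is $\alpha$-strongly convex with $\alpha \geq \lambda > 0$, even though $\lambda_{\min}(A^T A) = 0$.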

Conditional gradient method

Before we move on to the next topic, let’s talk about “projection” in general, especially in the context of linear algebra. The dot product of two vectors measures the projection of one vector onto the other. If you look at the equations in the sections above, we have computed many dot products. For example, consider this one:

$$ f(y) - f(x) \geq \nabla f(x)^T (y - x) + \frac{\alpha}{2} \|y - x\|^2 $$

The dot product $\nabla f(x)^T (y - x)$ measures the component of the gradient of $f$ along the direction $y - x$. This is a very important concept in optimization.

The gradient of $f$ not only tells us the direction of steepest descent, but its dot product with a direction $y - x$ also tells us how quickly $f$ changes along that direction.

Again, the optimization problem is of the form

$$ \min_{x \in \Omega} f(x) $$

Therefore, the optimization problem can be summarized as: find the minimum of a function by searching for $x$ in the domain of $f$. But how do we search for $x$ in the domain of $f$? Any search problem is, at heart, a problem of choosing directions.

Let’s state the conditional gradient method first, and then explain it in detail.

Again, we start from an initial guess $x_0$ and set

$$ x_{t+1} = x_t + \eta_t (\tilde{x}_t - x_t) \tag{30} $$

where

$$ \tilde{x}_t = \text{argmin}_{x \in \Omega} \left\{ f(x_t) + \nabla f(x_t)^T(x - x_t) \right\} \tag{31} $$

Since $f(x_t)$ and $\nabla f(x_t)^T x_t$ do not depend on the optimization variable $x$, we can simplify the above to

$$ \tilde{x}_t = \text{argmin}_{x \in \Omega} \nabla f(x_t)^T x \tag{32} $$

Compare the above equation with the projection step defined in equation (10).

Projected gradient descent searches for $x^*$ by following the direction and magnitude of the gradient $\nabla f(x_t)$ and then projecting back onto $\Omega$, whereas the conditional gradient method in equations (30) and (32) searches for $x^*$ by moving towards the feasible point that correlates most negatively with the gradient, i.e., the minimizer of the linear approximation of $f$ over the whole domain.

The appeal of the conditional gradient method is that it replaces the projection step with a linear minimization over $\Omega$. For many constraint sets, such as $\ell_1$ balls or nuclear-norm balls, this linear subproblem can be much cheaper to solve than a projection, which makes the method attractive in higher dimensions.

Here is the algorithm: starting from $x_0 \in \Omega \subseteq \mathbb{R}^n$, repeat the updates (31) and (30) until convergence.
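Here is a hedged sketch of the conditional gradient (Frank–Wolfe) method in Python for the specific case $\Omega = \{x : \|x\|_1 \leq r\}$ (my own illustrative choice), where the linear minimization (32) is attained at a signed vertex of the $\ell_1$ ball:

```python
import numpy as np

def frank_wolfe(grad, x0, radius=1.0, num_iters=100):
    """Conditional gradient (Frank-Wolfe) over the l1 ball {x : ||x||_1 <= radius}."""
    x = x0
    for t in range(num_iters):
        g = grad(x)
        # Linear minimization oracle, cf. equation (32):
        # argmin_{||s||_1 <= radius} g^T s is a signed vertex of the l1 ball.
        i = np.argmax(np.abs(g))
        s = np.zeros_like(x)
        s[i] = -radius * np.sign(g[i])
        eta = 2.0 / (t + 2.0)                 # standard step-size schedule
        x = x + eta * (s - x)                 # update rule from equation (30)
    return x

# Illustrative example: least squares restricted to the l1 ball (a Lasso-like problem).
rng = np.random.default_rng(0)
A = rng.normal(size=(50, 10))
b = rng.normal(size=50)
grad = lambda x: A.T @ (A @ x - b)
print(frank_wolfe(grad, x0=np.zeros(10), radius=1.0, num_iters=200))
```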

As it turns out, conditional gradient enjoys a convergence guarantee similar to the one we saw for projected gradient descent.

Remark: recall that when two vectors are positively correlated, their dot product is positive; when they are negatively correlated, it is negative; and when they are orthogonal, it is zero. The conditional gradient method therefore moves towards the feasible point whose direction correlates most negatively with the gradient, i.e., the direction within $\Omega$ best aligned with descent. Be careful that this is not the same as the steepest descent method, which simply moves along the negative gradient $-\nabla f(x_t)$ without consulting the constraint set.

It is very important to realize that it makes no sense to use the conditional gradient method for an unconstrained optimization problem. The reason is that the linear subproblem

$$ \tilde{x}_t = \text{argmin}_{x \in \Omega} \nabla f(x_t)^T x $$

has no finite solution when $\Omega$ is unbounded (for example, $\Omega = \mathbb{R}^n$): a non-zero linear function is unbounded below on an unbounded domain. The method therefore only makes sense over a compact constraint set.

Fabian Pedregosa has a very nice blog post on the conditional gradient method. Please read it if you want to know how to implement the conditional gradient method for constrained optimization problems such as the Lasso.

One of the most important applications of the conditional gradient method is low-rank matrix completion, which we will not cover in this post. But be aware of it, as you might encounter this problem in the future, for example in the Netflix problem.