Gradient Descent Derivation
Andrew Ng’s course on Machine Learning at Coursera provides an excellent explanation of gradient descent for linear regression. To really get a strong grasp on it, I decided to work through some of the derivations and some simple examples here.
This material assumes some familiarity with linear regression, and is primarily intended to provide additional insight into the gradient descent technique, not linear regression in general.
I am making use of the same notation as the Coursera course, so it will be most helpful for students of that course.
For linear regression, we have a linear hypothesis function, h(x)=θ0+θ1xh(x)=θ0+θ1x. We want to find the values of θ0θ0 and θ1θ1 which provide the best fit of our hypothesis to a training set. The training set examples are labeled x,yx,y, where xx is the input value and yy is the output. The ith training example is labeled as x(i),y(i)x(i),y(i). Do not confuse this as an exponent! It just means that this is the ith training example.
MSE Cost Function
The cost function JJ for a particular choice of parameters θθ is the mean squared error (MSE):
Where the variables used are:
The MSE measures the average amount that the model’s predictions vary from the correct values, so you can think of it as a measure of the model’s performance on the training set. The cost is higher when the model is performing poorly on the training set. The objective of the learning algorithm, then, is to find the parameters θθ which give the minimum possible cost JJ.
This minimization objective is expressed using the following notation, which simply states that we want to find the θθ which minimizes the cost J(θ)J(θ).
Gradient Descent Minimization - Single Variable Example
We’re going to be using gradient descent to find θθ that minimizes the cost. But let’s forget the MSE cost function for a moment and look at gradient descent as a minimization technique in general.
Let’s take the much simpler function J(θ)=θ2J(θ)=θ2, and let’s say we want to find the value of θθ which minimizes J(θ)J(θ).
Gradient descent starts with a random value of θθ, typically θ=0θ=0, but since θ=0θ=0 is already the minimum of our function θ2θ2, let’s start with θ=3θ=3.
Gradient descent is an iterative algorithm which we will run many times. On each iteration, we apply the following “update rule” (the := symbol means replace theta with the value computed on the right):
Alpha is a parameter called the learning rate which we’ll come back to, but for now we’re going to set it to 0.1. The derivative of J(θ)J(θ) is simply 2θ2θ.
Below is a plot of our function, J(θ)J(θ), and the value of θθ over ten iterations of gradient descent.
Below is a table showing the value of theta prior to each iteration, and the update amounts.
Cost Function Derivative
Why does gradient descent use the derivative of the cost function? Finding the slope of the cost function at our current θθ value tells us two things.
The first is the direction to move theta in. When you look at the plot of a function, a positive slope means the function goes upward as you move right, so we want to move left in order to find the minimum. Similarly, a negative slope means the function goes downard towards the right, so we want to move right to find the minimum.
The second is how big of a step to take. If the slope is large we want to take a large step because we’re far from the minimum. If the slope is small we want to take a smaller step. Note in the example above how gradient descent takes increasingly smaller steps towards the minimum with each iteration.
The Learning Rate - Alpha
The learning rate gives the engineer some additional control over how large of steps we make.
Selecting the right learning rate is critical. If the learning rate is too large, you can overstep the minimum and even diverge. For example, think through what would happen in the above example if we used a learning rate of 2. Each iteration would take us farther away from the minimum!
The only concern with using too small of a learning rate is that you will need to run more iterations of gradient descent, increasing your training time.
Convergence / Stopping Gradient Descent
Note in the above example that gradient descent will never actually converge on the minimum, θ=0θ=0. Methods for deciding when to stop gradient descent are beyond my level of expertise, but I can tell you that when gradient descent is used in the assignments in the Coursera course, gradient descent is run for a large, fixed number of iterations (for example, 100 iterations), with no test for convergence.
Gradient Descent - Multiple Variables Example
The MSE cost function includes multiple variables, so let’s look at one more simple minimization example before going back to the cost function.
Let’s take the function:
J(θ)=θ12+θ22J(θ)=θ12+θ22When there are multiple variables in the minimization objective, gradient descent defines a separate update rule for each variable. The update rule for θ1θ1 uses the partial derivative of JJ with respect to θ1θ1. A partial derivative just means that we hold all of the other variables constant–to take the partial derivative with respect to θ1θ1, we just treat θ2θ2 as a constant. The update rules are in the table below, as well as the math for calculating the partial derivatives. Make sure you work through those; I wrote out the derivation to make it easy to follow.
Note that when implementing the update rule in software, θ1θ1 and θ2θ2 should not be updated until after you have computed the new values for both of them. Specifically, you don’t want to use the new value of θ1θ1 to calculate the new value of θ2θ2.
Gradient Descent of MSE
Now that we know how to perform gradient descent on an equation with multiple variables, we can return to looking at gradient descent on our MSE cost function.
The MSE cost function is labeled as equation [1.0] below. Taking the derivative of this equation is a little more tricky. The key thing to remember is that x and y are not variables for the sake of the derivative. Rather, they represent a large set of constants (your training set). So when taking the derivative of the cost function, we’ll treat x and y like we would any other constant.
Once again, our hypothesis function for linear regression is the following:
h(x)=θ0+θ1xh(x)=θ0+θ1xI’ve written out the derivation below, and I explain each step in detail further down.
To move from equation [1.1] to [1.2], we need to apply two basic derivative rules:
Moving from [1.2] to [1.3], we apply both the power rule and the chain rule:
Finally, to go from [1.3] to [1.4], we must evaluate the partial derivative as follows. Recall again that when taking this partial derivative all letters except θ0θ0 are treated as constants ( θ1θ1, xx, and yy).
Equation [1.4] gives us the partial derivative of the MSE cost function with respect to one of the variables, θ0θ0. Now we must also take the partial derivative of the MSE function with respect to θ1θ1. The only difference is in the final step, where we take the partial derivative of the error:
One Half Mean Squared Error
In Andrew Ng’s Machine Learning course, there is one small modification to this derivation. We multiply our MSE cost function by 1/2 so that when we take the derivative, the 2s cancel out. Multiplying the cost function by a scalar does not affect the location of its minimum, so we can get away with this.
Alternatively, you could think of this as folding the 2 into the learning rate. It makes sense to leave the 1/m term, though, because we want the same learning rate (alpha) to work for different training set sizes (m).
Final Update Rules
Altogether, we have the following definition for gradient descent over our cost function.
Training Set Statistics
Note that each update of the theta variables is averaged over the training set. Every training example suggests its own modification to the theta values, and then we take the average of these suggestions to make our actual update.
This means that the statistics of your training set are being taken into account during the learning process. An outlier training example (or even a mislabeled / corrupted example) is going to have less influence over the final weights because it is one voice versus many.