Batch Gradient Descent
To implement Gradient Descent, we need to compute the gradient of the cost function with regard to each model parameter θj. In other words, we need to calculate the partial derivatives. For the MSE cost function of a Linear Regression model, the partial derivative with regard to θj can be written as

\[
\frac{\partial}{\partial \theta_j}\,\mathrm{MSE}(\boldsymbol{\theta}) = \frac{2}{m}\sum_{i=1}^{m}\bigl(\boldsymbol{\theta}^{\mathsf{T}}\mathbf{x}^{(i)} - y^{(i)}\bigr)\,x_j^{(i)}
\]

The gradient vector of the cost function, which contains all of these partial derivatives (one per model parameter), can be written as

\[
\nabla_{\boldsymbol{\theta}}\,\mathrm{MSE}(\boldsymbol{\theta}) =
\begin{pmatrix}
\dfrac{\partial}{\partial \theta_0}\mathrm{MSE}(\boldsymbol{\theta})\\[4pt]
\dfrac{\partial}{\partial \theta_1}\mathrm{MSE}(\boldsymbol{\theta})\\[2pt]
\vdots\\[2pt]
\dfrac{\partial}{\partial \theta_n}\mathrm{MSE}(\boldsymbol{\theta})
\end{pmatrix}
= \frac{2}{m}\,\mathbf{X}^{\mathsf{T}}\bigl(\mathbf{X}\boldsymbol{\theta} - \mathbf{y}\bigr)
\]
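As a quick illustration, the gradient vector can be computed in a single line of NumPy. The synthetic data and variable names below (X_b for the feature matrix with a bias column, y, theta) are only an assumed setup for this sketch, not part of the text above:

```python
import numpy as np

# Assumed synthetic setup: 100 instances of a noisy linear relationship y = 4 + 3x
np.random.seed(42)
m = 100
X = 2 * np.random.rand(m, 1)
y = 4 + 3 * X + np.random.randn(m, 1)
X_b = np.c_[np.ones((m, 1)), X]        # add x0 = 1 (bias term) to every instance

theta = np.random.randn(2, 1)          # random initialization of the parameters

# Gradient of the MSE cost: (2/m) * X^T (X theta - y)
gradients = 2 / m * X_b.T @ (X_b @ theta - y)
print(gradients.shape)                 # (2, 1): one partial derivative per parameter
```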
This formula involves calculations over the full training set X at each Gradient Descent step, which is why the algorithm is called Batch Gradient Descent (or Full Gradient Descent): it uses the whole batch of training data at every step. On the other hand, Gradient Descent scales well with the number of features, so training a Linear Regression model with hundreds of thousands of features is much faster using Gradient Descent than using the Normal Equation or SVD decomposition.
Once we have the gradient vector, which points uphill, we simply go in the opposite direction to go downhill. The Gradient Descent step is

\[
\boldsymbol{\theta}^{(\text{next step})} = \boldsymbol{\theta} - \eta\,\nabla_{\boldsymbol{\theta}}\,\mathrm{MSE}(\boldsymbol{\theta})
\]
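Putting the gradient computation and the update step together gives a complete Batch Gradient Descent loop. The sketch below continues from the previous snippet (it reuses m, X_b, and y) and assumes a fixed learning rate and iteration count:

```python
eta = 0.1                              # learning rate (assumed value)
n_iterations = 1000

theta = np.random.randn(2, 1)          # random initialization

for iteration in range(n_iterations):
    gradients = 2 / m * X_b.T @ (X_b @ theta - y)   # gradient over the full batch
    theta = theta - eta * gradients                 # step downhill, against the gradient

print(theta)   # close to [[4.], [3.]] for the synthetic data generated above
```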
where η is the learning rate. When the learning rate is too low, the algorithm will eventually reach the solution, but it will take a long time. On the other hand, when the learning rate is too high, the algorithm diverges, jumping all over the place and getting further and further away from the solution at every step. We can use grid search to find a good learning rate; however, we may want to limit the number of iterations so that grid search can eliminate models that take too long to converge. Setting an appropriate number of iterations is also important. A simple solution is to set a very large number of iterations but interrupt the algorithm when the norm of the gradient vector becomes smaller than a tiny number ε, called the tolerance.
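One way to implement this tolerance-based stopping rule is sketched below, again continuing from the assumed synthetic setup above (this is an illustration, not a prescribed implementation): cap the iterations at a deliberately large number and break out as soon as the gradient norm drops below ε.

```python
eta = 0.1
max_iterations = 1_000_000             # deliberately huge upper bound on iterations
epsilon = 1e-6                         # tolerance on the gradient norm

theta = np.random.randn(2, 1)
for iteration in range(max_iterations):
    gradients = 2 / m * X_b.T @ (X_b @ theta - y)
    if np.linalg.norm(gradients) < epsilon:   # gradient is tiny: (almost) at the minimum
        break
    theta = theta - eta * gradients
```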
When the cost function is convex and its slope does not change abruptly, Batch Gradient Descent with a fixed learning rate will eventually converge to the optimal solution. However, it can take O(1/ε) iterations to reach the optimum within a range of ε, depending on the shape of the cost function.