Gradient Descent in Machine Learning

Kinder Chen
2 min read · Sep 8, 2021

Gradient Descent is a generic optimization algorithm capable of finding optimal solutions to a wide range of problems. The general idea of Gradient Descent is to tweak parameters iteratively in order to minimize a cost function.

Gradient Descent gets to the bottom of the valley quickly by going downhill in the direction of the steepest slope. It measures the local gradient of the error function with respect to the parameter vector θ, and it goes in the direction of the descending gradient. You start by filling θ with random values, which is called random initialization. Then you improve it gradually, taking one baby step at a time, each step attempting to decrease the cost function, until the algorithm converges to a minimum.
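
Here is a minimal sketch of that loop. The toy function f(θ) = (θ − 3)² and the step size are made up purely for illustration:

```python
import numpy as np

# Minimize the toy convex function f(theta) = (theta - 3)**2,
# whose gradient is f'(theta) = 2 * (theta - 3).
def gradient(theta):
    return 2 * (theta - 3)

eta = 0.1                    # learning rate (step size)
n_iterations = 100
theta = np.random.randn()    # random initialization

for _ in range(n_iterations):
    theta -= eta * gradient(theta)   # one baby step down the slope

print(theta)  # ends up very close to the minimum at theta = 3
```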

The size of the steps in Gradient Descent is determined by the learning rate hyperparameter. If the learning rate is too small, then the algorithm will have to go through many iterations to converge, which will take a long time. On the other hand, if the learning rate is too high, you might jump across the valley and end up on the other side, possibly even higher up than you were before. This might make the algorithm diverge, with larger and larger values, failing to find a good solution.
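
To make that concrete, here is a rough illustration on the same toy function as above; the specific learning-rate values are arbitrary picks, only meant to show the three regimes:

```python
# f(theta) = (theta - 3)**2, gradient 2 * (theta - 3)
def run(eta, n_iterations=50, theta0=10.0):
    theta = theta0
    for _ in range(n_iterations):
        theta -= eta * 2 * (theta - 3)
    return theta

print(run(eta=0.001))  # too small: after 50 steps theta is still far from 3
print(run(eta=0.1))    # reasonable: theta is essentially 3
print(run(eta=1.1))    # too high: each step overshoots and theta blows up
```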

The MSE cost function for a Linear Regression model happens to be a convex function, which means that if you pick any two points on the curve, the line segment joining them never crosses the curve. This implies that there are no local minima, just one global minimum. It is also a continuous function with a slope that never changes abruptly. These two facts have a great consequence: Gradient Descent is guaranteed to approach arbitrarily close to the global minimum, provided the learning rate is not too high and you train long enough. When using Gradient Descent, you should ensure that all features have a similar scale, or else it will take much longer to converge.
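
A sketch of Batch Gradient Descent for Linear Regression with the MSE cost, using the standard gradient formula ∇θ MSE(θ) = (2/m) Xᵀ(Xθ − y); the synthetic data and hyperparameter values below are made up for illustration:

```python
import numpy as np

np.random.seed(42)
m = 100
X = 2 * np.random.rand(m, 1)              # one feature
y = 4 + 3 * X + np.random.randn(m, 1)     # linear target plus noise

X_b = np.c_[np.ones((m, 1)), X]           # add x0 = 1 for the bias term

eta = 0.1                                 # learning rate
n_iterations = 1000
theta = np.random.randn(2, 1)             # random initialization

for _ in range(n_iterations):
    gradients = 2 / m * X_b.T @ (X_b @ theta - y)   # gradient of the MSE
    theta -= eta * gradients

print(theta)  # should land close to the true parameters [4, 3]
```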

The cost function has the shape of a bowl, but it can be an elongated bowl if the features have very different scales. Training a model means searching for a combination of model parameters that minimizes a cost function over the training set. It is a search in the model’s parameter space: the more parameters a model has, the more dimensions this space has, and the harder the search is.
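
One common way to give the features a similar scale before running Gradient Descent is standardization; this snippet assumes Scikit-Learn's StandardScaler and uses made-up feature ranges purely to show the effect:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Two features on very different scales -> an elongated cost bowl.
X = np.c_[np.random.uniform(0, 1, 100),         # small-scale feature
          np.random.uniform(0, 10_000, 100)]    # large-scale feature

X_scaled = StandardScaler().fit_transform(X)    # zero mean, unit variance
print(X_scaled.mean(axis=0), X_scaled.std(axis=0))   # roughly 0 and 1
```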
