Gradient Descent in Simple Terms

Machine learning has revolutionized the way we process and analyze data, affecting industries from finance to healthcare. With its ability to spot patterns that would otherwise go unnoticed, it has become a cornerstone of modern technology. But as this field continues to grow and expand, so does the need for a deep understanding of its capabilities and limitations.

This is my first article, and I want to devote it to the basics of machine learning, in particular to gradient descent.

There is plenty of information about gradient descent on the Internet, but, in my opinion, most of it is written in very complex language, littered with formulas and terminology. When you are just starting to learn, it is hard to understand all of it (unless, of course, you are a mathematician), so I will try to describe gradient descent in simple words.

Gradient descent is an optimization algorithm used to minimize errors in a machine learning model. It works by iteratively adjusting the model parameters in the direction of the negative gradient of the loss function (which represents the error) to reduce the error and find the optimal parameters that give the best prediction results. The algorithm continues this process until it reaches a minimum or a predefined stopping criterion is met.
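As a minimal sketch of the loop described above (not tied to any particular library), the following Python snippet minimizes a function of a single variable. The function name `gradient_descent`, the example loss f(x) = (x - 3)^2, the learning rate, and the tolerance are all assumptions chosen for illustration.

```python
def gradient_descent(grad, start, learning_rate=0.1, max_steps=1000, tolerance=1e-8):
    """Minimize a one-variable function, given its derivative `grad`."""
    x = start
    for _ in range(max_steps):
        step = learning_rate * grad(x)
        x -= step                   # move in the direction of the negative gradient
        if abs(step) < tolerance:   # stopping criterion: the steps have become tiny
            break
    return x

# Illustrative loss: f(x) = (x - 3)^2, whose derivative is 2 * (x - 3).
# Its minimum is at x = 3.
minimum = gradient_descent(lambda x: 2 * (x - 3), start=0.0)
print(minimum)
```

The printed value should be very close to 3. Real models have many parameters instead of one `x`, but the update step has the same shape.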

Or, put as simply as possible:

Gradient descent is a way to train and improve a machine learning model. It does this by constantly trying to predict the correct answer better, adjusting its “thinking” each time. To do this, a mathematical formula is used to determine which direction to move in to get closer to the correct answer. The process is repeated many times until the algorithm predicts the answer as well as it can.

Each red dot is 1 repetition of the algorithm


Let’s say you want to teach a computer program to predict a person’s height based on their weight. You have a dataset with the weight and height of many people. The program starts with a random assumption about the relationship between weight and height (for example, it might assume that for every 1 kilogram increase in weight, height increases by 0.5 meters).

Using gradient descent, the program will then calculate the error between its prediction and the actual height for each person in the dataset. This error will then be used to determine the gradient or direction of change.

The program will then correct its assumption about the relationship between weight and height (for example, it may reduce its assumption of an increase in height to 0.49 meters for a 1 kilogram increase in weight) and repeat the process of calculating the error and determining the gradient. It will keep repeating this process, updating its guess each time, until the error is minimized to a predetermined level or the stopping criterion is met.

At this point, the program has found an approximately optimal relationship between weight and height, which it can use to make predictions on new data.
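The weight and height example can be sketched in a few lines of Python. This is only an illustration under simplifying assumptions: the data points are invented, the model has a single parameter (height ≈ slope × weight, with no intercept), the error is measured as the mean squared error, and the learning rate and number of steps were picked by hand.

```python
# Hypothetical data, invented for illustration: weight in kg, height in m
weights = [55, 62, 70, 78, 85, 93]
heights = [1.60, 1.68, 1.73, 1.80, 1.84, 1.90]

def mean_squared_error(slope):
    """Average squared difference between predicted and actual heights."""
    return sum((slope * w - h) ** 2 for w, h in zip(weights, heights)) / len(weights)

def gradient(slope):
    """Derivative of the mean squared error with respect to the slope."""
    return 2 * sum((slope * w - h) * w for w, h in zip(weights, heights)) / len(weights)

slope = 0.5            # initial guess from the text: 0.5 m of height per kg
learning_rate = 5e-5   # assumed value; small because the weights are large numbers

for step in range(200):
    # Correct the guess a little, in the direction that reduces the error
    slope -= learning_rate * gradient(slope)

print(f"learned slope: {slope:.4f} m per kg")
print(f"remaining error: {mean_squared_error(slope):.4f}")
```

The first guess of 0.5 m per kg is far too large, so the early corrections are big. As the guess approaches the best value, the gradient shrinks and the corrections become smaller.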

There are also variants of gradient descent, as well as alternative optimization algorithms:

  • Stochastic Gradient Descent (SGD): A variant of gradient descent that updates model parameters using only one randomly selected training example per iteration.

  • Mini-batch gradient descent: A variant of gradient descent that updates model parameters using a small, randomly selected batch of training examples per iteration.

  • Conjugate gradient: An optimization algorithm that iteratively improves its estimate of where the minimum is, choosing each new search direction so that it does not undo the progress made along earlier ones. (Put simply, the estimate is the algorithm's current best guess at the point where the function is smallest.)

  • BFGS (Broyden–Fletcher–Goldfarb–Shanno): A quasi-Newton method that approximates the Hessian matrix (the second derivatives of the loss function) to converge quickly to the minimum of the loss function.

  • Adam (Adaptive Moment Estimation): A gradient-based optimization algorithm that uses moving averages of the gradient and of the squared gradient to adapt the learning rate for each parameter.
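The first two items in the list differ from plain gradient descent only in how many training examples feed each update. The sketch below tries to show this with a single hypothetical function, `minibatch_gradient_descent`, whose `batch_size` argument switches between the variants. The data, learning rate, and number of epochs are assumptions, and the examples are shuffled once per pass, which is one common way of picking them at random.

```python
import random

def minibatch_gradient_descent(xs, ys, batch_size, learning_rate=0.05, epochs=200, seed=0):
    """Fit y ≈ slope * x; batch_size sets how many examples feed each update."""
    rng = random.Random(seed)
    indices = list(range(len(xs)))
    slope = 0.0
    for _ in range(epochs):
        rng.shuffle(indices)  # visit the examples in a new random order
        for start in range(0, len(indices), batch_size):
            batch = indices[start:start + batch_size]
            # Gradient of the mean squared error over this batch only
            grad = 2 * sum((slope * xs[i] - ys[i]) * xs[i] for i in batch) / len(batch)
            slope -= learning_rate * grad
    return slope

# Hypothetical noise-free data generated from y = 2x
xs = [0.5, 1.0, 1.5, 2.0, 2.5, 3.0]
ys = [2 * x for x in xs]

sgd_slope = minibatch_gradient_descent(xs, ys, batch_size=1)         # stochastic
mini_slope = minibatch_gradient_descent(xs, ys, batch_size=2)        # mini-batch
full_slope = minibatch_gradient_descent(xs, ys, batch_size=len(xs))  # full batch
print(sgd_slope, mini_slope, full_slope)
```

On this tiny, noise-free dataset all three settings should arrive at a slope close to 2. On real, noisy data the smaller batches give cheaper but noisier updates.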

And now, a few facts:

  • Gradient descent is widely used in machine learning:

    It is used to train many machine learning models, including linear regression, logistic regression, and neural networks.

  • Learning rate selection:

    The learning rate, also known as the step size, is an important hyperparameter in gradient descent. A learning rate that is too high can cause the algorithm to overshoot the minimum or even diverge, while one that is too low leads to slow convergence. Careful selection of the learning rate is important for getting good results with gradient descent.

  • Convergence and overfitting:

    Gradient descent can converge to the minimum of the loss function, but on non-convex loss functions it can also get stuck at a local minimum and miss a better solution. A separate problem is overfitting: a model that fits the training data too closely may predict poorly on new data. To reduce overfitting, regularization methods such as L1 or L2 regularization can be used. (These are ways to prevent excessive model complexity by penalizing large parameter values, which reduces the impact of certain features in the model.)

  • Gradient descent can be computationally expensive:

    Calculating the gradient of the loss function and updating the parameters at each gradient descent iteration can be computationally expensive, especially for large datasets or complex models. Techniques such as parallel computing and vectorization can be used to reduce computation time.
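To make the point about learning rate selection more concrete, here is a small sketch that runs the same number of steps with three different learning rates. The loss f(x) = x^2, the starting point, and the three rates are assumptions picked only to show the three typical behaviours.

```python
def run(learning_rate, start=5.0, steps=20):
    """Run gradient descent on f(x) = x^2, whose derivative is 2 * x."""
    x = start
    for _ in range(steps):
        x -= learning_rate * 2 * x
    return x

too_small = run(0.01)   # moves towards 0, but very slowly
reasonable = run(0.4)   # gets very close to the minimum at 0
too_large = run(1.1)    # overshoots further on every step and diverges

print(too_small, reasonable, too_large)
```

With the smallest rate the result is still far from the minimum after 20 steps, with the middle one it is practically at 0, and with the largest one each step lands further from the minimum than the one before.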

“Machine learning is the science of getting computers to act without being explicitly programmed.” – Andrew Ng, Co-Founder of Google Brain and former VP and Chief Scientist at Baidu.

In conclusion, I want to say that many things here have been simplified for ease of understanding. This is just a small overview of what gradient descent is, and of course there is still a lot more to learn.

Hope it will be useful for someone!
