Feature scaling is an important transformations to engineer data. With few exceptions, Machine Learning algorithms don’t perform well when the input numerical attributes have very different scales. On the other hand, scaling the target values is generally not required.

There are two common ways to get all attributes to have the same scale: min- max scaling and standardization. For min-max scaling or normalization, values are shifted and rescaled so that they end up ranging from 0 to 1. We do this by subtracting the min value and dividing by the max minus the min. For some reason, you also can…

Mini-batch Gradient Descent computes the gradients on small random sets of instances called mini-batches. Mini-batch gradient descent seeks to find a balance between the robustness of stochastic gradient descent and the efficiency of batch gradient descent. It is the most common implementation of gradient descent used in the field of deep learning.

Mini-batch sizes or batch sizes, are often tuned to an aspect of the computational architecture on which the implementation is being executed. The main advantage of Mini-batch GD is that we can get a performance boost from hardware optimization of matrix operations, especially when using GPUs.

The algorithm’s…

Stochastic Gradient Descent picks a random instance in the training set at every step and computes the gradients based only on that single instance. Working on a single instance at a time makes the algorithm much faster because it has very little data to manipulate at every iteration. It also makes it possible to train on huge training sets, since only one instance needs to be in memory at each iteration.

Due to its stochastic or random nature, this algorithm is much less regular. Instead of gently decreasing until it reaches the minimum, the cost function will bounce up and…

To implement Gradient Descent, we need to compute the gradient of the cost function with regard to each model parameter θj. In other words, partial derivative need to be calculated. Therefore, the Partial derivatives of the cost function can be presented as

The gradient vector of the cost function, contains all the partial derivatives of the cost function, can be described as

A linear model makes a prediction by simply computing a weighted sum of the input features, plus a constant called the bias or intercept term.

To train a Linear Regression model, we need to find the value of θ that minimizes the mean squared error (MSE) or RMSE. MSE cost function for a Linear Regression model is

Gradient Descent is a generic optimization algorithm capable of finding optimal solutions to a wide range of problems. The general idea of Gradient Descent is to tweak parameters iteratively in order to minimize a cost function.

Gradient Descent gets to the bottom of the valley quickly is to go downhill in the direction of the steepest slope. It measures the local gradient of the error function with regard to the parameter vector θ, and it goes in the direction of descending gradient. You start by filling θ with random values, which is called random initialization. …

A typical performance measure for regression problems is the Root Mean Square Error (RMSE). It gives an idea of how much error the system typically makes in its predictions, with a higher weight for large errors.

m is the number of instances in the dataset you are measuring the RMSE on. x(i) is a vector of all the feature values (excluding the label) of the ith instance in the dataset, and y(i) is its label. X is a matrix containing all the feature values (excluding labels) of all instances in the dataset. There is one row per instance, and the…

How well a model will generalize to new cases is to actually try it out on new cases. A standard option is to split the data into two sets: the training set and the test set. You train your model using the training set, and you test it using the test set. The error rate on new cases is called the generalization error or out-of-sample error, and by evaluating your model on the test set, you get an estimate of this error. This value tells you how well your model will perform on instances it has never seen before. It…

The system will only be capable of learning if the training data contains enough relevant features and not too many irrelevant ones. A critical part of the success of a Machine Learning project is coming up with a good set of features to train on. This process is called feature engineering, which involves Feature selection (selecting the most useful features to train on among existing features) and Feature extraction (combining existing features to produce a more useful one).


Overfitting means that the model performs well on the training data, but it does not generalize well. Complex models can detect subtle…

One way to categorize Machine Learning systems is by how they generalize. There are two main approaches to generalization: instance-based learning and model-based learning.

Instance-Based Learning

For instance-based learning, the system learns the examples by heart, then generalizes to new cases by using a similarity measure to compare them to the learned examples or a subset of them.

Model-Based Learning

Another way to generalize from a set of examples is to build a model of these examples and then use that model to make predictions. This is called model-based learning.

For model selection, you can either define a utility function or fitness function that…

Ning Chen

What happened couldn’t have happened any other way…

Get the Medium app

A button that says 'Download on the App Store', and if clicked it will lead you to the iOS App store
A button that says 'Get it on, Google Play', and if clicked it will lead you to the Google Play store