Least Squares in Linear Algebra

Linear regression is an important predictive tool in the data scientist’s toolbox. In this blog, we implement least squares to approximate solutions of over-determined systems of linear equations by minimizing the sum of the squared errors. Along the way, we show how linear algebra underpins regression problems in machine learning and predictive analytics.

View by Geometry

A linear system generally has no exact solution when the matrix has more rows than columns, i.e., more equations than unknowns, so we cannot always drive the error down to zero. An exact solution exists only when the error is zero. A least squares solution, however, can always be found by minimizing the error.

A toy example of simple linear regression that approximates the linear mapping is presented.

The column space of A is spanned by combinations of the vectors a1 and a2. Because b lies outside that column space, the equation cannot be solved exactly for the vector x. For any vector Ax, we have the error Ax-b:

The error Ax-b is minimized when the component Ax-p is reduced to zero, which happens exactly when Ax-b is perpendicular to the column space. We then have the smallest possible error e=b-p, where p is the projection of b onto the column space of A. The ordinary least squares solution is
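In standard notation, with $p = A\hat{x}$, the orthogonality condition $A^\top(b - A\hat{x}) = 0$ gives

$$A^\top A\,\hat{x} = A^\top b \quad\Longrightarrow\quad \hat{x} = (A^\top A)^{-1} A^\top b$$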

View by Linear Algebra

Now let’s rename the variables and solve the problem in a more general setting. The model consists of paired observations of the independent variable X and the dependent variable Y with error:
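In standard notation, one common way to write this model is

$$y_i = \beta_0 + \beta_1 x_i + \varepsilon_i, \qquad i = 1, \dots, n$$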

The matrix format is
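Following the standard OLS convention, stacking the $n$ observations gives

$$\mathbf{y} = X\boldsymbol{\beta} + \boldsymbol{\varepsilon}, \qquad X = \begin{bmatrix} 1 & x_1 \\ \vdots & \vdots \\ 1 & x_n \end{bmatrix}, \quad \boldsymbol{\beta} = \begin{bmatrix} \beta_0 \\ \beta_1 \end{bmatrix}$$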

By rearranging the equation, the error vector with respect to 𝛽 is
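In the same notation,

$$\mathbf{e}(\boldsymbol{\beta}) = \mathbf{y} - X\boldsymbol{\beta}$$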

Furthermore, the mean squared error (MSE) can be calculated and represented in matrix format:
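In standard notation,

$$\mathrm{MSE}(\boldsymbol{\beta}) = \frac{1}{n}\,\mathbf{e}^\top \mathbf{e} = \frac{1}{n}(\mathbf{y} - X\boldsymbol{\beta})^\top(\mathbf{y} - X\boldsymbol{\beta})$$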

Therefore, the gradient of the MSE with respect to 𝛽 is
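Differentiating the quadratic form above with respect to $\boldsymbol{\beta}$ gives

$$\nabla_{\boldsymbol{\beta}}\,\mathrm{MSE} = -\frac{2}{n}\,X^\top(\mathbf{y} - X\boldsymbol{\beta})$$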

Transposing the equation and setting it to zero at the optimum of 𝛽, we have
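Setting the gradient to zero yields the normal equations:

$$X^\top X\,\hat{\boldsymbol{\beta}} = X^\top \mathbf{y} \quad\Longrightarrow\quad \hat{\boldsymbol{\beta}} = (X^\top X)^{-1} X^\top \mathbf{y}$$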

The result is consistent with the conclusion derived above. Now let’s expand the matrix products and estimate the coefficients.
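For the simple-regression design matrix $X = [\mathbf{1}\;\;\mathbf{x}]$, expanding $X^\top X$ and $X^\top \mathbf{y}$ in the normal equations yields, in standard notation,

$$\hat{\beta}_1 = \frac{\sum_i (x_i - \bar{x})(y_i - \bar{y})}{\sum_i (x_i - \bar{x})^2}, \qquad \hat{\beta}_0 = \bar{y} - \hat{\beta}_1\,\bar{x}$$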

which is the same as the results from the least squares of simple linear regression model.
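A quick numerical sketch (the data values here are made up purely for illustration) confirms that the matrix solution and the textbook simple-regression formulas agree:

```python
import numpy as np

# Hypothetical toy data: y is roughly linear in x.
x = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
y = np.array([1.1, 2.9, 5.2, 7.1, 8.8])

# Design matrix with an intercept column of ones.
X = np.column_stack([np.ones_like(x), x])

# Normal equations: beta_hat = (X^T X)^{-1} X^T y,
# solved without forming the inverse explicitly.
beta_hat = np.linalg.solve(X.T @ X, X.T @ y)

# Classic simple-regression formulas for comparison:
# slope = cov(x, y) / var(x), intercept = y_bar - slope * x_bar.
slope = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
intercept = y.mean() - slope * x.mean()

print(beta_hat)          # [intercept, slope]: intercept ≈ 1.1, slope ≈ 1.96
print(intercept, slope)  # identical values from the scalar formulas
```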

The model can be extended to higher order as well,

and the same optimum of 𝛽 can be calculated as follows.
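As a sketch, the same normal-equation optimum recovers the coefficients of a quadratic model (the data here are synthetic and noise-free, purely for illustration):

```python
import numpy as np

# Synthetic data from a known quadratic trend (an assumption for illustration).
x = np.array([-2.0, -1.0, 0.0, 1.0, 2.0])
y = 1.0 + 2.0 * x + 3.0 * x ** 2  # exact quadratic, no noise

# Higher-order design matrix: columns 1, x, x^2.
X = np.vander(x, N=3, increasing=True)

# Same normal-equation optimum as in the linear case.
beta_hat = np.linalg.solve(X.T @ X, X.T @ y)
print(beta_hat)  # recovers the coefficients [1, 2, 3]
```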

When the columns are orthogonalized via the Gram-Schmidt process, XᵀX becomes a diagonal matrix, which significantly simplifies the computation.
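A minimal classical Gram-Schmidt sketch (the helper `gram_schmidt` is our own illustration, not a library routine) shows that QᵀQ indeed becomes diagonal:

```python
import numpy as np

def gram_schmidt(X):
    """Classical Gram-Schmidt: return Q with mutually orthogonal columns
    spanning the same column space as X (assumes X has full column rank)."""
    Q = np.zeros_like(X, dtype=float)
    for j in range(X.shape[1]):
        v = X[:, j].astype(float)
        for i in range(j):
            # Subtract the projection of column j onto each earlier column.
            v -= (Q[:, i] @ X[:, j]) / (Q[:, i] @ Q[:, i]) * Q[:, i]
        Q[:, j] = v
    return Q

x = np.array([0.0, 1.0, 2.0, 3.0])
X = np.column_stack([np.ones_like(x), x])
Q = gram_schmidt(X)

# Q^T Q is diagonal, so solving for the coefficients reduces to
# dividing by the diagonal entries -- no matrix inversion needed.
print(np.round(Q.T @ Q, 10))
```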

This simple and elegant formula works perfectly well for simple linear regression, where the number of dimensions is small. With big data sets, however, ordinary least squares becomes computationally inefficient and expensive, as it can involve a huge number of complex mathematical operations. So in practice, gradient descent is performed to solve the linear regression optimization problem.
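A bare-bones gradient descent sketch for the same MSE objective (the data, learning rate, and iteration count are arbitrary, untuned choices for illustration):

```python
import numpy as np

# Hypothetical data; same setup as the closed-form example above.
x = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
y = np.array([1.1, 2.9, 5.2, 7.1, 8.8])
X = np.column_stack([np.ones_like(x), x])
n = len(y)

beta = np.zeros(2)  # start from the origin
lr = 0.05           # learning rate (an assumed, untuned value)
for _ in range(5000):
    grad = -2.0 / n * X.T @ (y - X @ beta)  # gradient of the MSE
    beta -= lr * grad

# The iterates converge toward the normal-equation solution.
closed_form = np.linalg.solve(X.T @ X, X.T @ y)
print(beta, closed_form)
```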

What happened couldn’t have happened any other way…