Implementing SGD (Stochastic Gradient Descent) for Linear Regression

kalikiri harichandhana
5 min read · Nov 2, 2020

Before implementing stochastic gradient descent, let's talk about what gradient descent is.

Gradient descent is an optimization algorithm used to find the values of the parameters (coefficients) of a function (f) that minimize a cost function (cost). It is best used when the parameters cannot be calculated analytically (e.g. using linear algebra) and must be searched for by an optimization algorithm. It can be slow on very large datasets: one iteration of gradient descent requires a prediction for every instance in the training dataset, so it can take a long time when you have many millions of instances. When you have large amounts of data, you can use a variation of gradient descent called stochastic gradient descent (SGD).

In SGD, the derivative is calculated from a single training instance (or a small batch of instances) at a time, and the parameters are updated immediately rather than after a full pass over the dataset.

To obtain accurate results with SGD, the training instances should be visited in random order, which is why we shuffle the training set at every epoch.
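As a rough illustration of the per-instance update and the per-epoch shuffle, here is a minimal sketch for a single feature. It assumes a squared-error loss for each instance; the function name `sgd_line_fit`, the toy data, and the default learning rate and epoch count are made up for this example and are not taken from the original implementation.

```python
import random

def sgd_line_fit(xs, ys, learning_rate=0.01, epochs=500, seed=0):
    """Fit y = m*x + c with one parameter update per training instance."""
    rng = random.Random(seed)
    m, c = 0.0, 0.0
    indices = list(range(len(xs)))
    for _ in range(epochs):
        rng.shuffle(indices)  # visit the instances in a new random order each epoch
        for i in indices:
            error = (m * xs[i] + c) - ys[i]  # prediction error for this one instance
            # Gradient of error**2 for a single instance, applied immediately
            m -= learning_rate * 2 * error * xs[i]
            c -= learning_rate * 2 * error
    return m, c

# Toy data generated from y = 2x + 1 (no noise), purely for illustration
xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [1.0, 3.0, 5.0, 7.0, 9.0]
m, c = sgd_line_fit(xs, ys)
print("m =", round(m, 3), "c =", round(c, 3))
```

On noise-free data like this, the estimates should end up close to the slope and intercept used to generate it; on real data they fluctuate around the best fit.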

METRIC:

The R2 score usually lies between 0 and 100%, although it can be negative when a model fits worse than simply predicting the mean of the target. It is closely related to the MSE (see below), but not the same. Wikipedia defines R2 as

“…the proportion of the variance in the dependent variable that is predictable from the independent variable(s).”

Another definition is “(total variance explained by model) / total variance.”

R-squared = Explained variation / Total variation

If it is 100%, the model explains all of the variation in the target, i.e., there is no unexplained variance. A low value means the model explains little of the variation, which often, but not in all cases, indicates a regression model that is not valid. We use the R2 score as the metric for evaluating our model.
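To make the definition concrete, here is a small sketch that computes R2 as 1 minus (residual sum of squares / total sum of squares), which is the form commonly used in practice. The function name `r2_score_manual` and the sample values are illustrative assumptions.

```python
def r2_score_manual(y_true, y_pred):
    """R^2 = 1 - SS_res / SS_tot."""
    mean_y = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))  # unexplained variation
    ss_tot = sum((t - mean_y) ** 2 for t in y_true)             # total variation
    return 1.0 - ss_res / ss_tot

y_true = [3.0, 5.0, 7.0, 9.0]

perfect = r2_score_manual(y_true, y_true)                  # predictions equal the targets
mean_only = r2_score_manual(y_true, [6.0, 6.0, 6.0, 6.0])  # always predict the mean
worse = r2_score_manual(y_true, [9.0, 7.0, 5.0, 3.0])      # worse than predicting the mean

print(perfect, mean_only, worse)
```

The three cases show a perfect fit, a model no better than the mean, and a model worse than the mean, which is where a negative score comes from.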

LOSS FUNCTION:


For a given x, we predict the value of y as y = mx + c.

The loss measures the error in the y values predicted with the current m and c. Our goal is to minimize this error to obtain the most accurate values of m and c.
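Assuming the loss is the mean squared error (MSE), the usual choice for linear regression, it can be written for n training points (x_i, y_i) as shown in the first line of the block. Differentiating with respect to m and c gives the two partial derivatives that gradient descent uses. The symbols E, n, and the index i are notation introduced here for illustration.

```latex
E(m, c) = \frac{1}{n} \sum_{i=1}^{n} \bigl( y_i - (m x_i + c) \bigr)^2

\frac{\partial E}{\partial m} = -\frac{2}{n} \sum_{i=1}^{n} x_i \bigl( y_i - (m x_i + c) \bigr)

\frac{\partial E}{\partial c} = -\frac{2}{n} \sum_{i=1}^{n} \bigl( y_i - (m x_i + c) \bigr)
```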

Now that we have defined the loss function, let's get into the interesting part: minimizing it and finding m and c.

  1. Initially assume m = 0 and c = 0. Let L be our learning rate (a tuning parameter that determines the step size at each iteration while moving toward a minimum of the loss function). L could be a small value like 0.0001 for good accuracy.
  2. Calculate the partial derivatives of the loss function with respect to m and with respect to c, and plug in the current values of x, y, m and c to obtain the derivative values D_m and D_c.
  3. Update the current values of m and c using the equations m = m - L × D_m and c = c - L × D_c.
  4. Repeat this process until the loss function is a very small value (ideally close to 0 error). The values of m and c that we are left with are then the optimal values.
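The steps above can be sketched as plain (full-batch) gradient descent. This is a minimal sketch assuming a mean squared error loss. The names `mse` and `gradient_descent_line_fit`, the stopping tolerance, and the toy data are assumptions for illustration, and the learning rate is larger than 0.0001 only so that this tiny example converges quickly.

```python
def mse(xs, ys, m, c):
    """Mean squared error of the line y = m*x + c on the data."""
    return sum((y - (m * x + c)) ** 2 for x, y in zip(xs, ys)) / len(xs)

def gradient_descent_line_fit(xs, ys, L=0.05, tolerance=1e-12, max_iterations=10000):
    n = len(xs)
    m, c = 0.0, 0.0  # step 1: start from m = 0 and c = 0
    for _ in range(max_iterations):
        if mse(xs, ys, m, c) < tolerance:  # step 4: stop once the loss is very small
            break
        residuals = [y - (m * x + c) for x, y in zip(xs, ys)]
        # step 2: partial derivatives of the MSE with respect to m and c
        D_m = (-2.0 / n) * sum(x * r for x, r in zip(xs, residuals))
        D_c = (-2.0 / n) * sum(residuals)
        # step 3: move against the gradient, scaled by the learning rate
        m -= L * D_m
        c -= L * D_c
    return m, c

# Toy data generated from y = 2x + 1 (no noise), purely for illustration
xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [1.0, 3.0, 5.0, 7.0, 9.0]
m, c = gradient_descent_line_fit(xs, ys)
print("m =", round(m, 4), "c =", round(c, 4))
```

Unlike SGD, every update here uses all training points, which is exactly what becomes expensive on very large datasets.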

IMPLEMENTATION OF SGD:

Import the required modules, split the data into train and test sets, and scale it.

  1. Stochastic Gradient Descent is sensitive to feature scaling, so it is highly recommended to scale your data.
  2. For example, scale each attribute on the input vector X to [0,1] or [-1,+1], or standardize it to have mean 0 and variance 1.
  3. Note that the same scaling must be applied to the test vector to obtain meaningful results. This can be easily done using StandardScaler.
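To illustrate the last point, here is a dependency-free sketch of standardization in which the mean and standard deviation are learned from the training rows only and then reused for the test rows. It mirrors the idea behind StandardScaler but is not its implementation. The helper names `fit_standardizer` and `standardize` and the sample rows are hypothetical.

```python
import statistics

def fit_standardizer(rows):
    """Learn per-feature mean and (population) standard deviation from training rows."""
    columns = list(zip(*rows))
    means = [statistics.fmean(col) for col in columns]
    # Fall back to 1.0 for constant features to avoid dividing by zero
    stds = [statistics.pstdev(col) or 1.0 for col in columns]
    return means, stds

def standardize(rows, means, stds):
    """Apply previously learned means and standard deviations to any set of rows."""
    return [[(v - mu) / sd for v, mu, sd in zip(row, means, stds)] for row in rows]

X_train = [[1.0, 100.0], [2.0, 200.0], [3.0, 300.0]]
X_test = [[2.0, 400.0]]

means, stds = fit_standardizer(X_train)               # statistics come from the train set only
X_train_scaled = standardize(X_train, means, stds)
X_test_scaled = standardize(X_test, means, stds)      # same scaling reused for the test set

print(X_train_scaled)
print(X_test_scaled)
```

Fitting the statistics on the test set as well would leak information about it into training, which is why the same train-derived values are applied to both.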

Standardization is useful when the features have varying scales and the algorithm we are using makes assumptions about the data having a Gaussian distribution. Linear regression is commonly treated as one such algorithm.

We implement linear regression with SGD ourselves and use this implementation to train on our data.

Hyper-parameter tuning is done to obtain the optimal learning rate and batch size.

Using our implementation of linear regression with SGD, we train our model on the train data and use the trained model to predict on the test data.
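The original code is not reproduced in this text, so the following is only a sketch of what a mini-batch SGD linear regression can look like, written with the standard library. It is not the author's implementation. The function names, the default learning rate, batch size and epoch count, and the toy data are all assumptions.

```python
import random

def predict(rows, weights, bias):
    """Linear prediction w . x + b for every row."""
    return [sum(w * v for w, v in zip(weights, row)) + bias for row in rows]

def sgd_linear_regression(X, y, learning_rate=0.05, batch_size=2, epochs=500, seed=0):
    rng = random.Random(seed)
    n_features = len(X[0])
    weights, bias = [0.0] * n_features, 0.0
    indices = list(range(len(X)))
    for _ in range(epochs):
        rng.shuffle(indices)  # reshuffle the training set every epoch
        for start in range(0, len(indices), batch_size):
            batch = indices[start:start + batch_size]
            grad_w = [0.0] * n_features
            grad_b = 0.0
            for i in batch:
                error = sum(w * v for w, v in zip(weights, X[i])) + bias - y[i]
                # Accumulate the gradient of the batch's mean squared error
                for j in range(n_features):
                    grad_w[j] += 2 * error * X[i][j] / len(batch)
                grad_b += 2 * error / len(batch)
            # Update immediately after each mini-batch
            weights = [w - learning_rate * g for w, g in zip(weights, grad_w)]
            bias -= learning_rate * grad_b
    return weights, bias

# Toy data generated from y = 3a - 2b + 1 (no noise), purely for illustration
X = [[0.0, 1.0], [1.0, 0.0], [1.0, 1.0], [2.0, 1.0], [0.0, 2.0], [2.0, 2.0]]
y = [3.0 * a - 2.0 * b + 1.0 for a, b in X]

weights, bias = sgd_linear_regression(X, y)
pred = predict(X, weights, bias)
print([round(w, 3) for w in weights], round(bias, 3))
```

The learning rate and batch size are exposed as parameters because they are the two values tuned in this article. A `batch_size` of 1 gives per-instance SGD, and a `batch_size` equal to the dataset size gives ordinary gradient descent.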

Plotting the predicted prices against the actual prices, we can see that the points fall close to a straight line, indicating that our model's predictions are close to the actual values.

print("R^2 score for our model :", r2_score(Y_test, pred_test))
Out[4]: R^2 score for our model : 0.7226745354368981

We have got a decent R2 value (> 0.7), indicating that our model is performing reasonably well.

Thanks for reading:)
