SGDRegressor and SGDClassifier

SGDRegressor and SGDClassifier use stochastic gradient descent (SGD) as the optimization algorithm. This makes both models very efficient for large datasets because the model parameters are updated after each training sample instead of after a pass over the entire dataset. Moreover, both algorithms offer a lot of flexibility regarding loss functions, methods to prevent overfitting, and general optimizations.

The SGDClassifier for classification tasks is based on the SGDRegressor and adapts two key elements:
1) it uses a classification loss such as log loss or hinge loss, and
2) an output activation function such as the logistic or softmax function.
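
A minimal sketch of how the loss, and thereby the type of model, is selected in scikit-learn. The chosen losses are just examples; loss="log_loss" is the name used in recent scikit-learn versions:

from sklearn.linear_model import SGDClassifier, SGDRegressor

# regression with squared error loss (linear regression fitted by SGD)
reg = SGDRegressor(loss="squared_error")

# classification with log loss (logistic regression fitted by SGD)
clf = SGDClassifier(loss="log_loss")

# classification with hinge loss (linear SVM fitted by SGD)
svm = SGDClassifier(loss="hinge")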

Stochastic Gradient Descent (SGD) vs. Batch Gradient Descent

The difference between stochastic gradient descent and batch gradient descent is how often the parameters of the model are updated.

  • Stochastic gradient descent: model parameters are updated after each training sample.
  • Batch gradient descent: model parameters are updated after each training epoch (one pass over the entire dataset).

Therefore, SGDRegressor and SGDClassifier can make more frequent updates to the model parameters and adapt more quickly to the data, especially for large datasets.
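
To make the difference concrete, here is a minimal NumPy sketch of the two update schemes for a linear model with squared error loss. The toy data, learning rate, and number of epochs are illustrative assumptions, not values used by scikit-learn:

import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 3))                       # toy feature matrix
y = X @ np.array([1.5, -2.0, 0.5]) + rng.normal(scale=0.1, size=1000)
eta = 0.01                                           # learning rate

# Batch gradient descent: one parameter update per pass over the full dataset
w_batch = np.zeros(3)
for epoch in range(100):
    grad = 2 / len(X) * X.T @ (X @ w_batch - y)
    w_batch -= eta * grad

# Stochastic gradient descent: one parameter update per training sample
w_sgd = np.zeros(3)
for epoch in range(5):
    for i in rng.permutation(len(X)):
        grad = 2 * X[i] * (X[i] @ w_sgd - y[i])
        w_sgd -= eta * grad

print(w_batch, w_sgd)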

Importance of Feature Scaling for SGD

Because SGDRegressor and SGDClassifier are based on stochastic gradient descent, both algorithms are sensitive to feature scaling.

  • Learning rate: The learning rate is the hyperparameter that determines how much to adjust the model parameters at each step of the gradient descent algorithm. If the input features have different scales, the learning rate may be too small for some features and too large for others, which can lead to slow convergence or oscillations during training.
  • Regularization: If the input features have vastly different scales, the regularization term (L1 or L2) that is added to the loss function may not be applied uniformly across all features, which can lead to biased or suboptimal solutions.

Therefore, before the algorithm is trained, you have to scale the input features, for example with normalization or standardization:

Normalization

When we normalize the features of a dataset, we rescale each feature to a value range between 0 and 1.

from sklearn.preprocessing import MinMaxScaler
scaler = MinMaxScaler()
# rescale every feature to the range [0, 1]; fit_transform returns a NumPy array
df = scaler.fit_transform(df)

Standardization

In the case of standardization, the features are transformed so that after the transformation each feature has a mean of 0 and a standard deviation of 1.

from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
# transform every feature to zero mean and unit standard deviation
df = scaler.fit_transform(df)
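
To make sure the scaler is fitted only on the training data and then re-applied to validation and test data, the scaler and the SGD model are usually combined in a pipeline. A minimal sketch (the chosen estimator and its default settings are just an example):

from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import SGDRegressor

# the pipeline scales the features and then fits the SGDRegressor;
# during cross-validation the scaler is fitted on the training folds only
model = make_pipeline(StandardScaler(), SGDRegressor())
# model.fit(X_train, y_train)
# model.predict(X_test)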

Apply Early Stopping for Stochastic Gradient Descent

Because SGDRegressor and SGDClassifier are based on gradient descent, you can use early stopping to prevent overfitting and to find the smallest number of iterations that is sufficient to build a model that generalizes well. In the default parameter setting, early stopping is deactivated. You enable it by setting the parameter early_stopping=True.
When early stopping is activated, a fraction (see parameter validation_fraction) of the training data is not used to train the algorithm but is used to compute the validation score. The optimization continues until the validation score has not improved by at least tol for n_iter_no_change consecutive iterations.
The actual number of iterations is saved during training and is available in the attribute n_iter_.

The following end-to-end example shows the influence of early stopping for the SGDClassifier. You can run the example via Google Colab in your own account.

We train the SGDClassifier with and without early stopping on the breast cancer dataset and visualize the number of iterations that were computed to reach convergence and also the training/validation accuracy. At the end of the Jupyter notebook, you see the plots that show the results of the example.
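
The following condensed sketch mirrors that comparison without the plots; the concrete parameter values are assumptions and may differ from the notebook:

from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import SGDClassifier
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)
scaler = StandardScaler().fit(X_train)
X_train, X_test = scaler.transform(X_train), scaler.transform(X_test)

# without early stopping: training runs until the training loss stops improving
clf_plain = SGDClassifier(loss="log_loss", random_state=42).fit(X_train, y_train)

# with early stopping: 20% of the training data is held out as validation set
clf_early = SGDClassifier(loss="log_loss", early_stopping=True,
                          validation_fraction=0.2, n_iter_no_change=5,
                          random_state=42).fit(X_train, y_train)

for name, clf in [("no early stopping", clf_plain), ("early stopping", clf_early)]:
    print(name, "n_iter_:", clf.n_iter_, "test accuracy:", clf.score(X_test, y_test))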

We see that with early stopping, the SGDClassifier only needs 10 iterations instead of 23 to reach an accuracy of 0.97891. This is only slightly lower (about 0.4 percentage points) than the accuracy of 0.982425 reached by the classifier without early stopping.
Because a fraction validation_fraction (in our example 20%) of the training data is held out for the validation stopping criterion, the accuracy of the SGDClassifier trained with early stopping is lower, simply because part of the data is not used to train the algorithm.

Advantages & Disadvantages of SGDRegressor and SGDClassifier

Advantages

  • Scalability: SGD can handle large datasets and high-dimensional input features efficiently because it updates the model parameters for a subset of the training data rather than processing the entire dataset at once.
  • Flexibility: SGD allows for a wide range of loss functions and regularization techniques.
  • Convergence speed: SGD can converge faster than batch gradient descent, especially for large datasets or when the model has a large number of parameters because the model parameters are updated more frequently.
  • Memory efficiency: SGD requires minimal memory usage, as it updates the model parameters for each training example and discards the data afterward.

Disadvantages

  • Sensitivity to hyperparameters: The algorithm requires careful tuning of hyperparameters such as the learning rate and the regularization strength, which can be time-consuming and challenging.
  • Sensitive to feature scaling: SGD is sensitive to the scaling of input features.
  • Noisy gradients: SGD estimates the gradient of the loss function using a single training example, which can result in noisy and unstable updates to the model parameters.

SGDRegressor and SGDClassifier Parameter Overview

SGDRegressor and SGDClassifier have the same parameters in the scikit-learn library. Only the available loss functions differ, depending on whether you solve a regression or a classification problem. The following overview lists the parameters of SGDRegressor and SGDClassifier with a short description, how to tune each parameter in case of overfitting, and a note on how to use the parameter best. A sketch of a hyperparameter search grid follows the list.

  • loss: Loss function that is used by Stochastic Gradient Descent
    • The loss function depends on the actual problem and should therefore be adjusted during the optimization.
      • Regression (SGDRegressor)
        • squared_error: Mean Square Error
        • huber: based on ordinary least squares but focuses less on getting outliers correct by switching from MSE to MAE past a distance of epsilon. The error of outliers is not squared in the calculation of the loss and therefore the influence of outliers on the total error is limited.
        • epsilon_insensitive: ignores errors less than epsilon and is linear past that; analogous to Support Vector Regression (SVR)
        • squared_epsilon_insensitive: ignores errors less than epsilon but becomes squared loss past a tolerance of epsilon
      • Classification (SGDClassifier)
        • hinge: analogous to Support Vector Classification
        • log_loss: analogous to Logistic Regression
        • perceptron: linear loss used by the perceptron algorithm
        • modified_huber: more tolerant to outliers
  • penalty: regularization term
    • Overfitting: when the model is overfitting use L1 or elasticnet instead of L2 regularization.
    • Note: keep in mind that when using L1 or elasticnet, the algorithm might set the weights of the least important features near/completely to zero (performs feature selection)
  • alpha: controls the regularization strength as a constant that multiplies the regularization term. For SGDRegressor and SGDClassifier, alpha also influences the learning rate when the learning_rate parameter is set to ‘optimal’.
    • Initial search space: float 0.00001…100 -> np.logspace(-5,2,15)
    • Overfitting: increase values of alpha to increase regularization.
    • Note: when using L1 (Lasso) as a regularization term, higher alpha values increase computation time.
  • l1_ratio: controls the ratio between the L1 and L2 regularization; only used with elasticnet regularization.
    • Initial search space: float 0…1
    • Note: l1_ratio=0: only L2 penalty; l1_ratio=1: only L1 penalty
  • max_iter: number of epochs (passes over training data) during the training of the machine learning algorithm
    • Initial search space: 100…10000
    • Overfitting: to reduce overfitting, reduce the max_iter
  • tol: stopping tolerance. If the score does not improve by at least tol (loss > best_loss - tol) for n_iter_no_change consecutive iterations, training is stopped. When early_stopping=True, the validation score is used instead of the training loss.
    • Initial search space: 0.0001…0.01
    • Overfitting: increase the tolerance so that training stops earlier once the error function starts to level off or fluctuate.
    • Note: to optimize the tol value, look at the curve of the error function. The error function should start high and gradually decrease as the model learns to better fit the training data. If the error function starts to level off or fluctuate (adapts too much to the training data), the model is approaching its optimal performance on the training data and may not improve much further.
  • epsilon: parameter for ‘huber’, ‘epsilon_insensitive’, or ‘squared_epsilon_insensitive’ loss functions.
    • Initial search space: 0.05…0.2
    • Overfitting: for loss function = ‘huber’: decrease epsilon and for loss function = ‘epsilon_insensitive’ or ‘squared_epsilon_insensitive’ increase epsilon
    • Note: for more details see the loss function article.
  • learning_rate: learning rate schedule (‘constant’, ‘optimal’, ‘invscaling’, ‘adaptive’)
    • Note: see the article about learning rate schedules
  • eta0: initial learning rate for the ‘constant’, ‘invscaling’, and ‘adaptive’ schedules
    • Initial search space: 0.001…0.1
    • Overfitting: when the model is overfitting, try to increase the learning rate
  • early_stopping: Whether to use early stopping to terminate training when the validation score is not improving.
    • Initial search space: bool [True, False]
    • Overfitting: use early stopping (True) to prevent the model from overfitting the training data.
    • Note: a fraction (validation_fraction) of the training data is not used for training but to check whether the validation score is still improving.
  • validation_fraction: proportion of training data to set aside as validation set for early stopping
    • Initial search space: 0.1 (default value)
  • n_iter_no_change: Number of iterations with no improvement to wait before the training is stopped
    • Initial search space: int 3…10
    • Overfitting: reduce n_iter_no_change in case of overfitting
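
The parameter ranges above can be translated into a first hyperparameter search grid. The following sketch uses the SGDRegressor as an example; the concrete ranges are illustrative starting points, not definitive recommendations, and a full grid of this size can take a while to evaluate:

import numpy as np
from sklearn.linear_model import SGDRegressor
from sklearn.model_selection import GridSearchCV

# illustrative search grid derived from the parameter overview above
param_grid = {
    "loss": ["squared_error", "huber"],
    "penalty": ["l2", "l1", "elasticnet"],
    "alpha": np.logspace(-5, 2, 8),
    "l1_ratio": [0.15, 0.5, 0.85],
    "learning_rate": ["invscaling", "adaptive"],
    "eta0": [0.001, 0.01, 0.1],
}

search = GridSearchCV(SGDRegressor(max_iter=1000, random_state=42),
                      param_grid, scoring="neg_mean_squared_error", cv=5)
# search.fit(X_scaled, y)   # X_scaled: the scaled training data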

SGDRegressor End-to-End Example

Because we already used the SGDClassifier to show the influence of early stopping on the stochastic gradient descent algorithm, I also created an end-to-end example for the SGDRegressor, which you can also find on Google Colab.

In the example, we use the sklearn diabetes dataset to train and optimize the SGDRegressor. Because we have to scale the dataset, we build a pipeline with the standard scaler and the SGDRegressor. The whole pipeline is then optimized and cross-validated in the GridSearchCV function.

After the training, we print the neg_mean_squared_error on the training data and the best parameter combination from the GridSearchCV. The last step of the end-to-end example is to visualize the results with a scatter plot. On the x-axis, you see the predicted values, and on the y-axis the actual values. A perfect regression algorithm would have zero residuals, and all samples would lie on the red line (predicted = actual).
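
A condensed sketch of that example; the parameter grid and plot details are assumptions and may differ from the notebook:

from sklearn.datasets import load_diabetes
from sklearn.linear_model import SGDRegressor
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
import matplotlib.pyplot as plt

X, y = load_diabetes(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

# pipeline: scale the features, then fit the SGDRegressor
pipe = Pipeline([("scaler", StandardScaler()),
                 ("sgd", SGDRegressor(max_iter=5000, random_state=42))])

param_grid = {"sgd__alpha": [1e-4, 1e-3, 1e-2, 1e-1],
              "sgd__penalty": ["l2", "l1", "elasticnet"]}

search = GridSearchCV(pipe, param_grid, scoring="neg_mean_squared_error", cv=5)
search.fit(X_train, y_train)
print(search.best_score_, search.best_params_)

# scatter plot of predicted vs. actual values on the test set
y_pred = search.predict(X_test)
plt.scatter(y_pred, y_test)
plt.plot([y_test.min(), y_test.max()], [y_test.min(), y_test.max()], color="red")
plt.xlabel("predicted value")
plt.ylabel("actual value")
plt.show()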
