Dev Docs

Training Process: Understanding Gradient Descent for Optimizing Weights and Bias

Gradient descent is a fundamental optimization algorithm used to train many machine learning models by iteratively adjusting the model parameters (weights and bias) to minimize the error between predictions and actual labels. Here’s an explanation in depth:

The Goal of Training

The aim is to find the optimal weights ($w_1, w_2, \dots, w_n$) and bias ($b$) that minimize a chosen loss function. The loss function measures how far off the model’s predictions are from the true labels. For example:

In regression: Mean Squared Error (MSE).
In classification: Cross-Entropy Loss.

What is Gradient Descent?

Gradient descent is an iterative optimization algorithm that works by taking small steps in the direction opposite to the gradient (slope) of the loss function with respect to the model parameters.

Mathematically:

The gradient is the vector of partial derivatives of the loss function (( L )) with respect to each parameter: $$ \nabla L = \left[ \frac{\partial L}{\partial w_1}, \frac{\partial L}{\partial w_2}, \dots, \frac{\partial L}{\partial w_n}, \frac{\partial L}{\partial b} \right] $$

Steps in Gradient Descent

Initialize Parameters:
- Start with random or zero values for weights (( w_i )) and bias (( b )).
Compute Predictions:
- Use the current weights and bias to calculate the predicted output (( \hat{y} )) for all training examples.
For a linear model: $$ \hat{y} = w_1x_1 + w_2x_2 + \dots + w_nx_n + b $$
Calculate the Loss:
- Use the loss function to measure the error between predictions (( \hat{y} )) and actual labels (( y )).
Example: For Mean Squared Error (MSE), $$ L = \frac{1}{N} \sum_{i=1}^N (\hat{y}_i - y_i)^2 $$
Compute the Gradient:
- Calculate the gradient of the loss function with respect to each parameter. For example:
  - For weight ( w_j ): $$ \frac{\partial L}{\partial w_j} = \frac{1}{N} \sum_{i=1}^N 2 (\hat{y}i - y_i) \cdot x{i,j} $$
  - For bias ( b ): $$ \frac{\partial L}{\partial b} = \frac{1}{N} \sum_{i=1}^N 2 (\hat{y}_i - y_i) $$
Update Parameters:
- Adjust the weights and bias in the direction opposite to the gradient to reduce the loss: $$ w_j = w_j - \eta \cdot \frac{\partial L}{\partial w_j} $$ $$ b = b - \eta \cdot \frac{\partial L}{\partial b} $$ Here, ( \eta ) is the learning rate, a hyperparameter that determines the step size.
Repeat:
- Iterate over steps 2–5 for multiple epochs (passes over the training data) until convergence or until the loss stops decreasing significantly.

Learning Rate (( \eta )): Importance and Challenges

Too Small: Leads to very slow convergence.
Too Large: Can cause the model to overshoot the minimum or diverge.

Variants of Gradient Descent

Batch Gradient Descent:
- Uses the entire training dataset to compute the gradient in each iteration.
- Stable but computationally expensive for large datasets.
Stochastic Gradient Descent (SGD):
- Updates parameters for each training example individually.
- Faster but introduces more noise into updates.
Mini-batch Gradient Descent:
- Computes the gradient for small batches of data, balancing efficiency and stability.

Key Concepts in Gradient Descent

Convergence:
- The algorithm stops when the loss reaches a minimum or after a fixed number of iterations.
Local vs Global Minimum:
- For non-convex loss functions, gradient descent may converge to a local minimum rather than the global minimum.
Regularization:
- Adding terms like ( \lambda ||w||^2 ) to the loss function prevents overfitting by penalizing large weights.

Gradient descent is at the heart of machine learning training and plays a pivotal role in adjusting model parameters to achieve optimal performance. Let me know if you’d like a practical example or deeper insights into specific gradient descent variants!