Loss functions measure how far a model's predictions are from the true labels.
In PyTorch, loss functions are typically subclasses of nn.Module.
Mean Squared Error: nn.MSELoss
Used for continuous output:
loss_fn = nn.MSELoss()
loss = loss_fn(pred, target)
Mean Absolute Error: nn.L1Loss
Less sensitive to outliers:
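A minimal usage sketch, mirroring the MSELoss example above (pred and target are assumed to be float tensors of the same shape):
loss_fn = nn.L1Loss()
loss = loss_fn(pred, target)   # mean absolute error over all elements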
nn.CrossEntropyLoss
Used for multi-class classification (a single label per example).
This combines LogSoftmax + NLLLoss (negative log likelihood) into one efficient function.
Input: logits of shape (batch_size, num_classes)
Target: class indices of shape (batch_size)
loss_fn = nn.CrossEntropyLoss()
loss = loss_fn(logits, labels)
nn.BCEWithLogitsLoss
Binary classification; applies the sigmoid and binary cross-entropy in one step, giving better numerical stability than a separate nn.Sigmoid + nn.BCELoss.
loss_fn = nn.BCEWithLogitsLoss()
Use when:
output shape is (batch_size, 1) or (batch_size,)
targets in {0, 1}
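A small sketch of the shapes and dtypes involved, assuming the model outputs raw logits of shape (batch_size, 1) and targets holds 0/1 labels (variable names are illustrative):
logits = model(x)                                    # (batch_size, 1), raw scores, no sigmoid
loss_fn = nn.BCEWithLogitsLoss()
loss = loss_fn(logits.squeeze(1), targets.float())   # both (batch_size,), float targets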
nn.SmoothL1Loss — for robust regression
nn.HuberLoss
nn.KLDivLoss — for distributions
nn.MarginRankingLoss
nn.TripletMarginLoss
nn.CTCLoss — sequence alignment (speech, OCR)
Example:
import torch
import torch.nn as nn

class MyLoss(nn.Module):
    def __init__(self):
        super().__init__()

    def forward(self, pred, target):
        # equivalent to nn.MSELoss() with the default reduction='mean'
        return torch.mean((pred - target) ** 2)
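A custom loss is then used exactly like the built-in ones (pred and target are placeholder tensors):
loss_fn = MyLoss()
loss = loss_fn(pred, target)   # behaves like nn.MSELoss()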
Optimizers update model parameters using gradients computed in backward().
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
criterion = torch.nn.CrossEntropyLoss()

for data, target in dataloader:
    optimizer.zero_grad()                # clear old gradients
    output = model(data)                 # forward pass
    loss = criterion(output, target)     # compute loss
    loss.backward()                      # backward pass (computes gradients: dL/dW)
    optimizer.step()                     # update parameters
optimizer.zero_grad() must be called each iteration, because by default PyTorch accumulates gradients in .grad rather than overwriting them.
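A small sketch of that accumulation behaviour on a standalone tensor (not part of the training loop above):
w = torch.ones(3, requires_grad=True)
w.sum().backward()
print(w.grad)        # tensor([1., 1., 1.])
w.sum().backward()
print(w.grad)        # tensor([2., 2., 2.])  gradients were added, not replaced
w.grad.zero_()       # what optimizer.zero_grad() does for every parameter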
Forward computation (e.g., linear layers, convolutions, activations)
Autograd builds a computational graph that tracks these operations, but only for tensors with requires_grad=True.
Each operation becomes a node in the graph, enabling gradient computation later.
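A minimal sketch of such a graph with a single scalar (purely illustrative):
x = torch.tensor(2.0, requires_grad=True)
y = x ** 2 + 3 * x    # each operation is recorded as a node in the graph
y.backward()          # reverse-mode autodiff over the recorded graph
print(x.grad)         # tensor(7.)  since dy/dx = 2*x + 3 = 7 at x = 2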
criterion is the loss function (e.g., nn.CrossEntropyLoss).
The result loss is a scalar tensor.
Because output participates in the graph, loss also becomes part of the graph.
Autograd traverses the computational graph in reverse (reverse-mode automatic differentiation).
It computes gradients of the loss w.r.t. each parameter: ∂L/∂W
After this call, every trainable parameter has its gradient stored in:
param.grad # e.g., model.linear.weight.grad
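One way to inspect them after loss.backward(), assuming model is any nn.Module:
for name, param in model.named_parameters():
    if param.grad is not None:
        print(name, param.grad.norm().item())   # gradient magnitude per parameter tensor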
optimizer.step() reads gradients from .grad and updates the parameters according to the optimizer's update rule.
Examples:
SGD:
param = param - lr * param.grad
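Written out by hand, this is roughly what optimizer.step() does for plain SGD (a sketch ignoring momentum and weight decay; model and lr are assumed from the surrounding context):
with torch.no_grad():                    # parameter updates must not be tracked by autograd
    for param in model.parameters():
        if param.grad is not None:
            param -= lr * param.grad     # in-place: param = param - lr * grad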
Adam: maintains additional internal state for each parameter:
Momentum (first moment): exponential moving average of gradients.
Adaptive learning rate (second moment): exponential moving average of squared gradients.
Bias correction: compensates for initialization at zero.
Adam update (conceptual):
m = β1 * m + (1 - β1) * grad # momentum
v = β2 * v + (1 - β2) * grad^2 # adaptive rate
m_hat = m / (1 - β1^t) # bias correction
v_hat = v / (1 - β2^t)
param = param - lr * m_hat / (sqrt(v_hat) + ε)
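The same update as runnable code for a single parameter tensor, with the common defaults β1=0.9, β2=0.999, ε=1e-8 (a conceptual sketch: param and num_steps are placeholders, and in a real loop .grad would be refreshed by loss.backward() every step):
lr, beta1, beta2, eps = 1e-3, 0.9, 0.999, 1e-8
m = torch.zeros_like(param)              # first moment (momentum)
v = torch.zeros_like(param)              # second moment (adaptive rate)

for t in range(1, num_steps + 1):
    grad = param.grad                    # assumed filled by loss.backward()
    m = beta1 * m + (1 - beta1) * grad
    v = beta2 * v + (1 - beta2) * grad ** 2
    m_hat = m / (1 - beta1 ** t)         # bias correction
    v_hat = v / (1 - beta2 ** t)
    with torch.no_grad():
        param -= lr * m_hat / (v_hat.sqrt() + eps)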
optimizer = optim.SGD(model.parameters(), lr=0.01, momentum=0.9, weight_decay=1e-4)
optimizer = optim.Adam(model.parameters(), lr=1e-3)
Tracks a running average of squared gradients (RMS); momentum is optional.
Often used for RNNs and reinforcement-learning tasks.
optimizer = optim.RMSprop(model.parameters(), lr=0.001)
Adam with decoupled weight decay (better for transformers).
optimizer = optim.AdamW(model.parameters(), lr=1e-4)
optim.Adagrad
optim.Adadelta
optim.ASGD
optim.LBFGS (for small networks, second-order method)
One of the most important hyperparameters in training neural networks.
It controls how big a step the optimizer takes when updating model parameters.
Gradient descent update: θ ← θ − η ∇θ L(θ)
η = learning rate
∇θ L = gradient of the loss w.r.t. the parameters θ
If the learning rate is too high:
Training becomes unstable
Loss may fluctuate wildly
Model may fail to converge
If the learning rate is too low:
Training becomes very slow
Model could get stuck in local minima
Requires more epochs to reach good accuracy
With a well-chosen learning rate, training (see the small numeric sketch below):
Converges fast
Remains stable
Finds a good solution
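A tiny numeric sketch of these three regimes, using plain gradient descent on f(w) = w² starting from w = 1 (purely illustrative, not PyTorch-specific):
def descend(lr, steps=5, w=1.0):
    for _ in range(steps):
        w = w - lr * 2 * w       # gradient of w^2 is 2w
    return w

print(descend(lr=1.2))    # |w| grows each step: diverges (LR too high)
print(descend(lr=0.01))   # w shrinks very slowly (LR too low)
print(descend(lr=0.3))    # w quickly approaches the minimum at 0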
Changes the learning rate over time during training.
Why? A constant LR often performs poorly. A good training strategy is to:
Start with a higher LR for fast learning
Gradually reduce LR for fine-tuning
PyTorch provides many schedulers under: torch.optim.lr_scheduler
Lowers LR by a factor every fixed number of epochs.
from torch.optim.lr_scheduler import StepLR
scheduler = StepLR(optimizer, step_size=10, gamma=0.1)
# Each epoch
scheduler.step()
Every 10 epochs, LR = LR × 0.1
Simple and widely used
Drops LR at specific epochs.
scheduler = MultiStepLR(optimizer, milestones=[30, 80], gamma=0.1)
LR drops at epochs 30 and 80
Useful when you know the training schedule
Lowers LR exponentially each epoch.
scheduler = ExponentialLR(optimizer, gamma=0.95)
LR decreases by 5% each step
Uses cosine decay: over T_max epochs the LR follows a cosine curve from its initial value down to eta_min (the restart behaviour, where the LR jumps back up, is provided by CosineAnnealingWarmRestarts).
scheduler = CosineAnnealingLR(optimizer, T_max=50)
Very popular in modern deep learning
Good for training stability
Monitors a metric (typically validation loss) and decreases LR when it stops improving.
scheduler = ReduceLROnPlateau(optimizer, factor=0.1, patience=5)
Great for adaptive control
Used in many production models
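Unlike the other schedulers, ReduceLROnPlateau must be given the monitored value in step(). A sketch assuming a validate() helper that returns the validation loss, plus train_one_epoch() and num_epochs as placeholders:
scheduler = ReduceLROnPlateau(optimizer, factor=0.1, patience=5)

for epoch in range(num_epochs):
    train_one_epoch()                # placeholder for the usual training loop
    val_loss = validate()            # placeholder returning validation loss
    scheduler.step(val_loss)         # LR is multiplied by factor after `patience` epochs without improvement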
LR oscillates between a lower and upper bound.
scheduler = CyclicLR(optimizer, base_lr=0.001, max_lr=0.01)
Helps escape local minima
Works well for small datasets
LR increases then decreases (one cycle).
Used with large-scale models.
scheduler = OneCycleLR(optimizer, max_lr=0.01, steps_per_epoch=100, epochs=10)
Provides good generalization
A favorite for fast training
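One practical detail worth noting: OneCycleLR is designed to be stepped after every batch, not once per epoch, so the total number of scheduler steps matches steps_per_epoch × epochs. A sketch assuming train_loader yields 100 batches per epoch:
scheduler = OneCycleLR(optimizer, max_lr=0.01, steps_per_epoch=100, epochs=10)

for epoch in range(10):
    for data, target in train_loader:
        optimizer.zero_grad()
        loss = criterion(model(data), target)
        loss.backward()
        optimizer.step()
        scheduler.step()             # once per batch for OneCycleLR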
from torch.optim.lr_scheduler import StepLR

optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
scheduler = StepLR(optimizer, step_size=10, gamma=0.1)
criterion = torch.nn.CrossEntropyLoss()

for epoch in range(100):
    for data, target in train_loader:
        optimizer.zero_grad()                # reset gradients
        output = model(data)                 # forward pass
        loss = criterion(output, target)     # compute loss
        loss.backward()                      # backward pass
        optimizer.step()                     # update parameters

    validate(...)                            # evaluate on validation set
    scheduler.step()                         # update learning rate once per epoch