Loss functions measure how far a model's predictions are from the true labels.
In PyTorch, loss functions are typically subclasses of nn.Module.
Mean Squared Error: nn.MSELoss
Used for continuous output:
loss_fn = nn.MSELoss()
loss = loss_fn(pred, target)
Mean Absolute Error: nn.L1Loss
Less sensitive to outliers:
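A minimal usage sketch, mirroring the MSELoss example above (pred and target are assumed to be float tensors of the same shape):
loss_fn = nn.L1Loss()
loss = loss_fn(pred, target)   # mean absolute error over all elements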
nn.CrossEntropyLoss
Used for multi-class classification (a single label per example).
This combines LogSoftmax + NLLLoss (negative log likelihood) into one efficient function.
Input: logits of shape (batch_size, num_classes)
Target: class indices of shape (batch_size)
loss_fn = nn.CrossEntropyLoss()
loss = loss_fn(logits, labels)
nn.BCEWithLogitsLoss
Binary classification; applies the sigmoid and binary cross-entropy in one step, giving better numerical stability than a separate nn.Sigmoid + nn.BCELoss.
loss_fn = nn.BCEWithLogitsLoss()
Use when:
output shape is (batch_size, 1) or (batch_size,)
targets in {0, 1}
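A small sketch of the shapes and dtypes involved, assuming the model outputs raw logits of shape (batch_size, 1) and targets holds 0/1 labels (variable names are illustrative):
logits = model(x)                                    # (batch_size, 1), raw scores, no sigmoid
loss_fn = nn.BCEWithLogitsLoss()
loss = loss_fn(logits.squeeze(1), targets.float())   # both (batch_size,), float targets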
nn.SmoothL1Loss — for robust regression
nn.HuberLoss
nn.KLDivLoss — for distributions
nn.MarginRankingLoss
nn.TripletMarginLoss
nn.CTCLoss — sequence alignment (speech, OCR)
Example:
import torch
import torch.nn as nn

class MyLoss(nn.Module):
    def __init__(self):
        super().__init__()

    def forward(self, pred, target):
        # equivalent to nn.MSELoss() with the default reduction='mean'
        return torch.mean((pred - target) ** 2)
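A custom loss is then used exactly like the built-in ones (pred and target are placeholder tensors):
loss_fn = MyLoss()
loss = loss_fn(pred, target)   # behaves like nn.MSELoss()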
Optimizers update model parameters using gradients computed in backward().
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
criterion = torch.nn.CrossEntropyLoss()

for data, target in dataloader:
    optimizer.zero_grad()                # clear old gradients
    output = model(data)                 # forward pass
    loss = criterion(output, target)     # compute loss
    loss.backward()                      # backward pass (computes gradients: dL/dW)
    optimizer.step()                     # update parameters
optimizer.zero_grad() must be called each iteration, because by default PyTorch accumulates gradients in .grad rather than overwriting them.
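A small sketch of that accumulation behaviour on a standalone tensor (not part of the training loop above):
w = torch.ones(3, requires_grad=True)
w.sum().backward()
print(w.grad)        # tensor([1., 1., 1.])
w.sum().backward()
print(w.grad)        # tensor([2., 2., 2.])  gradients were added, not replaced
w.grad.zero_()       # what optimizer.zero_grad() does for every parameter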
Forward computation (e.g., linear layers, convolutions, activations)
Autograd builds a computational graph that tracks these operations, but only for tensors with requires_grad=True.
Each operation becomes a node in the graph, enabling gradient computation later.
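A minimal sketch of such a graph with a single scalar (purely illustrative):
x = torch.tensor(2.0, requires_grad=True)
y = x ** 2 + 3 * x    # each operation is recorded as a node in the graph
y.backward()          # reverse-mode autodiff over the recorded graph
print(x.grad)         # tensor(7.)  since dy/dx = 2*x + 3 = 7 at x = 2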
criterion is the loss function (e.g., nn.CrossEntropyLoss).
The result loss is a scalar tensor.
Because output participates in the graph, loss also becomes part of the graph.
Autograd traverses the computational graph in reverse (reverse-mode automatic differentiation).
It computes gradients of the loss w.r.t. each parameter: ∂L/∂W
After this call, every trainable parameter has its gradient stored in:
param.grad # e.g., model.linear.weight.grad
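One way to inspect them after loss.backward(), assuming model is any nn.Module:
for name, param in model.named_parameters():
    if param.grad is not None:
        print(name, param.grad.norm().item())   # gradient magnitude per parameter tensor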
optimizer.step() reads gradients from .grad and updates the parameters according to the optimizer's update rule.
Examples:
SGD:
param = param - lr * param.grad
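Written out by hand, this is roughly what optimizer.step() does for plain SGD (a sketch ignoring momentum and weight decay; model and lr are assumed from the surrounding context):
with torch.no_grad():                    # parameter updates must not be tracked by autograd
    for param in model.parameters():
        if param.grad is not None:
            param -= lr * param.grad     # in-place: param = param - lr * grad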
Adam: maintains additional internal state for each parameter:
Momentum (first moment): exponential moving average of gradients.
Adaptive learning rate (second moment): exponential moving average of squared gradients.
Bias correction: compensates for initialization at zero.
Adam update (conceptual):
m = β1 * m + (1 - β1) * grad # momentum
v = β2 * v + (1 - β2) * grad^2 # adaptive rate
m_hat = m / (1 - β1^t) # bias correction
v_hat = v / (1 - β2^t)
param = param - lr * m_hat / (sqrt(v_hat) + ε)
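The same update as runnable code for a single parameter tensor, with the common defaults β1=0.9, β2=0.999, ε=1e-8 (a conceptual sketch: param and num_steps are placeholders, and in a real loop .grad would be refreshed by loss.backward() every step):
lr, beta1, beta2, eps = 1e-3, 0.9, 0.999, 1e-8
m = torch.zeros_like(param)              # first moment (momentum)
v = torch.zeros_like(param)              # second moment (adaptive rate)

for t in range(1, num_steps + 1):
    grad = param.grad                    # assumed filled by loss.backward()
    m = beta1 * m + (1 - beta1) * grad
    v = beta2 * v + (1 - beta2) * grad ** 2
    m_hat = m / (1 - beta1 ** t)         # bias correction
    v_hat = v / (1 - beta2 ** t)
    with torch.no_grad():
        param -= lr * m_hat / (v_hat.sqrt() + eps)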
optimizer = optim.SGD(model.parameters(), lr=0.01, momentum=0.9, weight_decay=1e-4)
optimizer = optim.Adam(model.parameters(), lr=1e-3)
Tracks a running average of squared gradients (RMS); momentum is optional.
Often used for RNNs and reinforcement-learning tasks.
optimizer = optim.RMSprop(model.parameters(), lr=0.001)
Adam with decoupled weight decay (better for transformers).
optimizer = optim.AdamW(model.parameters(), lr=1e-4)
optim.Adagrad
optim.Adadelta
optim.ASGD
optim.LBFGS (for small networks, second-order method)
One of the most important hyperparameters in training neural networks.
It controls how big a step the optimizer takes when updating model parameters.
Gradient descent update: θ ← θ − η ∇θ L(θ)
η = learning rate
∇θ L = gradient of the loss w.r.t. the parameters θ
If the learning rate is too high:
Training becomes unstable
Loss may fluctuate wildly
Model may fail to converge
If the learning rate is too low:
Training becomes very slow
Model could get stuck in local minima
Requires more epochs to reach good accuracy
With a well-chosen learning rate, training (see the small numeric sketch below):
Converges fast
Remains stable
Finds a good solution
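A tiny numeric sketch of these three regimes, using plain gradient descent on f(w) = w² starting from w = 1 (purely illustrative, not PyTorch-specific):
def descend(lr, steps=5, w=1.0):
    for _ in range(steps):
        w = w - lr * 2 * w       # gradient of w^2 is 2w
    return w

print(descend(lr=1.2))    # |w| grows each step: diverges (LR too high)
print(descend(lr=0.01))   # w shrinks very slowly (LR too low)
print(descend(lr=0.3))    # w quickly approaches the minimum at 0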
Changes the learning rate over time during training.
Why? A constant LR often performs poorly. A good training strategy is to:
Start with a higher LR for fast learning
Gradually reduce LR for fine-tuning
PyTorch provides many schedulers under: torch.optim.lr_scheduler
Lowers LR by a factor every fixed number of epochs.
from torch.optim.lr_scheduler import StepLR
scheduler = StepLR(optimizer, step_size=10, gamma=0.1)
# Each epoch
scheduler.step()
Every 10 epochs, LR = LR × 0.1
Simple and widely used
Drops LR at specific epochs.
scheduler = MultiStepLR(optimizer, milestones=[30, 80], gamma=0.1)
LR drops at epochs 30 and 80
Useful when you know the training schedule
Lowers LR exponentially each epoch.
scheduler = ExponentialLR(optimizer, gamma=0.95)
LR decreases by 5% each step
Uses cosine decay: over T_max epochs the LR follows a cosine curve from its initial value down to eta_min (the restart behaviour, where the LR jumps back up, is provided by CosineAnnealingWarmRestarts).
scheduler = CosineAnnealingLR(optimizer, T_max=50)
Very popular in modern deep learning
Good for training stability
Monitors a metric (typically validation loss) and decreases LR when it stops improving.
scheduler = ReduceLROnPlateau(optimizer, factor=0.1, patience=5)
Great for adaptive control
Used in many production models
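Unlike the other schedulers, ReduceLROnPlateau must be given the monitored value in step(). A sketch assuming a validate() helper that returns the validation loss, plus train_one_epoch() and num_epochs as placeholders:
scheduler = ReduceLROnPlateau(optimizer, factor=0.1, patience=5)

for epoch in range(num_epochs):
    train_one_epoch()                # placeholder for the usual training loop
    val_loss = validate()            # placeholder returning validation loss
    scheduler.step(val_loss)         # LR is multiplied by factor after `patience` epochs without improvement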
LR oscillates between a lower and upper bound.
scheduler = CyclicLR(optimizer, base_lr=0.001, max_lr=0.01)
Helps escape local minima
Works well for small datasets
LR increases then decreases (one cycle).
Used with large-scale models.
scheduler = OneCycleLR(optimizer, max_lr=0.01, steps_per_epoch=100, epochs=10)
Provides good generalization
A favorite for fast training
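One practical detail worth noting: OneCycleLR is designed to be stepped after every batch, not once per epoch, so the total number of scheduler steps matches steps_per_epoch × epochs. A sketch assuming train_loader yields 100 batches per epoch:
scheduler = OneCycleLR(optimizer, max_lr=0.01, steps_per_epoch=100, epochs=10)

for epoch in range(10):
    for data, target in train_loader:
        optimizer.zero_grad()
        loss = criterion(model(data), target)
        loss.backward()
        optimizer.step()
        scheduler.step()             # once per batch for OneCycleLR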
from torch.optim.lr_scheduler import StepLR

optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
scheduler = StepLR(optimizer, step_size=10, gamma=0.1)
criterion = torch.nn.CrossEntropyLoss()

for epoch in range(100):
    for data, target in train_loader:
        optimizer.zero_grad()                # reset gradients
        output = model(data)                 # forward pass
        loss = criterion(output, target)     # compute loss
        loss.backward()                      # backward pass
        optimizer.step()                     # update parameters

    validate(...)                            # evaluate on validation set
    scheduler.step()                         # update learning rate once per epoch