Layers, activation modules, containers, and models all inherit from nn.Module:
nn.Module
│
├── Layer (e.g., Linear, Conv2d)
│
├── Activation modules (e.g., ReLU, Softmax)
│
├── Containers (e.g., Sequential, ModuleList)
│
└── Model (your custom class inheriting from nn.Module)
So, PyTorch recursively walks through all submodules:
model.parameters() returns parameters from all nested modules
model.to(device) moves everything to GPU/CPU
model.state_dict() saves all parameters and buffers
model.train() / model.eval() toggle behavior (Dropout, BatchNorm)
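For illustration, a minimal sketch (the class name Outer is hypothetical) showing that parameters and state of nested submodules are discovered automatically:
import torch.nn as nn

class Outer(nn.Module):
    def __init__(self):
        super().__init__()
        # Nested submodule: its parameters are registered through the parent
        self.block = nn.Sequential(nn.Linear(4, 8), nn.ReLU(), nn.Linear(8, 2))

    def forward(self, x):
        return self.block(x)

model = Outer()
print(len(list(model.parameters())))    # 4 tensors: two weights + two biases
print(list(model.state_dict().keys()))  # ['block.0.weight', 'block.0.bias', 'block.2.weight', 'block.2.bias']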
Model (nn.Module)
│
└── Containers (optional, but common)
│
├── Layers (e.g., Linear, Conv2d)
└── Activations (e.g., ReLU, Softmax)
import torch
import torch.nn as nn

class SimpleModel(nn.Module):
    def __init__(self):
        super().__init__()
        # Define network components/layers here
        self.linear = nn.Linear(10, 5)
        self.activation = nn.ReLU()

    def forward(self, x):
        # Define the forward pass
        x = self.linear(x)
        x = self.activation(x)
        return x
Most flexible. Gives full control of the forward pass.
The architecture is not strictly sequential (e.g., skip connections, branching, merging).
The model has conditional logic or multiple inputs.
Supports complex models (ResNet, Transformer, etc.).
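As a sketch of the kind of non-sequential logic this enables, here is a hypothetical model with a skip connection (the class name SkipModel and the layer sizes are illustrative):
import torch
import torch.nn as nn

class SkipModel(nn.Module):
    def __init__(self):
        super().__init__()
        self.linear1 = nn.Linear(10, 10)
        self.linear2 = nn.Linear(10, 10)
        self.activation = nn.ReLU()

    def forward(self, x):
        # Skip connection: add the original input back to the transformed output
        out = self.activation(self.linear1(x))
        out = self.linear2(out)
        return out + x   # not expressible with nn.Sequential alone

out = SkipModel()(torch.randn(2, 10))   # -> (2, 10)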
import torch
import torch.nn as nn

class SequentialModel(nn.Module):
    def __init__(self):
        super().__init__()
        self.layers = nn.Sequential(
            nn.Linear(10, 20),
            nn.ReLU(),
            nn.Linear(20, 5)
        )

    def forward(self, x):
        # Simple forward pass
        return self.layers(x)
Architecture is purely sequential (cannot express anything non-sequential, such as skip connections or loops)
Clean, minimal code:
No need to reference layers individually.
No special logic needed in forward().
The model is strictly feed-forward and linear in structure.
Call the model: out = model(x)
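For example, with the SequentialModel defined above:
model = SequentialModel()
x = torch.randn(2, 10)   # batch of 2 samples, 10 features each
out = model(x)           # calls forward() under the hood
print(out.shape)         # torch.Size([2, 5])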
Layers:
self.layers = nn.ModuleList([
    nn.Linear(10, 20),
    nn.Linear(20, 30),
    nn.Linear(30, 40)
])
Forward:
def forward(self, x):
    for layer in self.layers:
        x = torch.relu(layer(x))
    return x
The number of layers can be dynamic (e.g., variable depth networks).
Some parts of the model can be generated programmatically.
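A minimal sketch of building a variable-depth MLP programmatically (the class name DynamicMLP and the depth/width values are illustrative choices):
import torch
import torch.nn as nn

class DynamicMLP(nn.Module):
    def __init__(self, depth=3, width=20):
        super().__init__()
        # Layers generated in a loop; ModuleList registers each one
        self.layers = nn.ModuleList(
            [nn.Linear(10 if i == 0 else width, width) for i in range(depth)]
        )

    def forward(self, x):
        for layer in self.layers:
            x = torch.relu(layer(x))
        return x

model = DynamicMLP(depth=4)
out = model(torch.randn(2, 10))   # -> shape (2, 20)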
Layers:
self.layers = nn.ModuleDict({
    "encoder": nn.Linear(10, 20),
    "decoder": nn.Linear(20, 10)
})
Forward:
def forward(self, x):
    x = self.layers["encoder"](x)
    x = torch.relu(x)
    x = self.layers["decoder"](x)
    return x
Architecture can make runtime choices:
conditional branches
algorithmic routing
attention heads stored by name
Good for encoder-decoder structures, transformers, multi-task models.
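Putting the two fragments above together into a runnable class (a sketch; the class name EncoderDecoder is hypothetical):
import torch
import torch.nn as nn

class EncoderDecoder(nn.Module):
    def __init__(self):
        super().__init__()
        self.layers = nn.ModuleDict({
            "encoder": nn.Linear(10, 20),
            "decoder": nn.Linear(20, 10),
        })

    def forward(self, x):
        x = self.layers["encoder"](x)
        x = torch.relu(x)
        return self.layers["decoder"](x)

model = EncoderDecoder()
out = model(torch.randn(4, 10))   # -> shape (4, 10)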
A layer typically implements a single transformation. Layers:
Have parameters stored in weight and, possibly, bias
Perform a computation in forward()
nn.Linear(in_features, out_features, bias=True)
Applies: y = xWᵀ + b
Parameters: weight, optional bias.
# in_features = 4, out_features = 2
layer = nn.Linear(4, 2)
# Input shape: (batch_size=1, in_features=4)
x = torch.randn(1, 4)
# Output shape: (batch_size=1, out_features=2)
y = layer(x)
Note: PyTorch expects the first dimension to be the batch dimension: (batch, in_features) → (batch, out_features)
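The parameters of the layer above can be inspected directly:
print(layer.weight.shape)  # torch.Size([2, 4]), i.e. (out_features, in_features)
print(layer.bias.shape)    # torch.Size([2])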
2D Convolution: nn.Conv2d(in_channels, out_channels, kernel_size, stride=1, padding=0)
Learns filters to detect spatial features in images.
conv = nn.Conv2d(3, 16, kernel_size=3, padding=1)
x = torch.randn(1, 3, 64, 64)
y = conv(x) # (1, 16, 64, 64)
1D and 3D variants:
nn.Conv1d → 1D: audio, time-series
nn.Conv3d → 3D: volumetric data, medical scans
nn.ConvTranspose2d
Used for upsampling (e.g., autoencoders, GANs).
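A quick shape sketch (kernel size and stride here are illustrative choices):
upconv = nn.ConvTranspose2d(16, 3, kernel_size=2, stride=2)
x = torch.randn(1, 16, 32, 32)
y = upconv(x)   # -> (1, 3, 64, 64): spatial size doubled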
Max Pooling: nn.MaxPool2d(kernel_size, stride)
Reduces spatial size
Picks the max value in a region
pool = nn.MaxPool2d(2)
y = pool(torch.randn(1, 3, 32, 32)) # -> (1, 3, 16, 16)
Average Pooling: nn.AvgPool2d, nn.AdaptiveAvgPool2d
Adaptive = you specify the fixed output size and PyTorch adapts the pooling window to the input (important in ResNet, MobileNet)
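For example, adaptive pooling to a fixed 1×1 output (global average pooling):
gap = nn.AdaptiveAvgPool2d(1)
y = gap(torch.randn(1, 16, 37, 53))   # -> (1, 16, 1, 1), regardless of the input size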
Batch Normalization
nn.BatchNorm1d (MLP, sequence features)
nn.BatchNorm2d (CNNs)
nn.BatchNorm3d
Normalizes each feature/channel across the batch dimension.
bn = nn.BatchNorm2d(16)
Layer Normalization: nn.LayerNorm(normalized_shape)
Used often in transformers.
InstanceNorm, GroupNorm, LocalResponseNorm
Used in style transfer, segmentation, etc.
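Short sketches of LayerNorm and GroupNorm (the sizes are illustrative):
ln = nn.LayerNorm(512)          # normalized_shape = feature size
x = torch.randn(3, 10, 512)     # (batch, seq_len, d_model)
y = ln(x)                       # normalized over the last dimension

gn = nn.GroupNorm(num_groups=4, num_channels=16)
y2 = gn(torch.randn(2, 16, 32, 32))   # normalizes within groups of channels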
nn.Dropout(p=0.5)
Randomly zeroes activations during training to prevent overfitting.
drop = nn.Dropout(0.3)
y = drop(torch.randn(5, 10))
Other variants:
nn.Dropout1d
nn.Dropout2d
nn.Dropout3d
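For example, nn.Dropout2d drops whole feature maps (channels) rather than individual values:
drop2d = nn.Dropout2d(0.5)
y = drop2d(torch.randn(1, 8, 16, 16))   # zeroes entire channels at random (in training mode)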
RNN: nn.RNN(input_size, hidden_size, num_layers=1)
LSTM: nn.LSTM(input_size, hidden_size, ...)
Captures long-term dependencies
rnn = nn.LSTM(10, 20, num_layers=2)
x = torch.randn(5, 3, 10) # seq_len=5, batch=3, features=10
output, (h, c) = rnn(x)
GRU: nn.GRU(input_size, hidden_size, ...)
Faster alternative to LSTM.
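A quick GRU sketch mirroring the LSTM example above (same sequence-first input layout):
gru = nn.GRU(10, 20, num_layers=2)
x = torch.randn(5, 3, 10)        # seq_len=5, batch=3, features=10
output, h = gru(x)               # output: (5, 3, 20), h: (2, 3, 20)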
nn.Embedding(num_embeddings, embedding_dim)
Maps word indices → dense vectors.
embed = nn.Embedding(1000, 64)
x = torch.tensor([1, 5, 9])   # three token indices
embed(x)                      # -> shape (3, 64)
Used in NLP and transformers.
nn.TransformerEncoderLayer
Contains:
Multi-head attention: nn.MultiheadAttention(embed_dim, num_heads)
Feedforward layers
LayerNorm
Dropout
layer = nn.TransformerEncoderLayer(d_model=512, nhead=8)
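A sketch of stacking such layers with nn.TransformerEncoder (the sizes are illustrative):
layer = nn.TransformerEncoderLayer(d_model=512, nhead=8)
encoder = nn.TransformerEncoder(layer, num_layers=6)
x = torch.randn(10, 32, 512)   # (seq_len, batch, d_model) with the default layout
out = encoder(x)               # -> (10, 32, 512)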
nn.Upsample(scale_factor=2, mode='nearest')
nn.PixelShuffle(upscale_factor)
Used in SRGAN, super-resolution.
nn.Flatten(start_dim=1)
flat = nn.Flatten()
y = flat(torch.randn(1, 3, 28, 28)) # -> (1, 2352)
nn.Unflatten
nn.ZeroPad2d, nn.ReflectionPad2d, nn.ConstantPad2d
nn.CosineSimilarity
nn.PairwiseDistance
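For example:
cos = nn.CosineSimilarity(dim=1)
dist = nn.PairwiseDistance(p=2)
a, b = torch.randn(4, 8), torch.randn(4, 8)
print(cos(a, b).shape)    # torch.Size([4]): one similarity per row pair
print(dist(a, b).shape)   # torch.Size([4]): one Euclidean distance per row pair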
Module version (nn.ReLU()), usable as a layer inside containers like nn.Sequential
Functional version (F.relu(x)), called directly in forward(); both compute the same result (see the sketch after the list below)
nn.ReLU
act = nn.ReLU()
act(x)
nn.LeakyReLU(negative_slope=0.01)
nn.PReLU
nn.Sigmoid
nn.Tanh
nn.Softmax(dim=1)
nn.GELU (used in transformers)
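A small sketch showing that the module and functional forms give the same result:
import torch
import torch.nn as nn
import torch.nn.functional as F

x = torch.randn(2, 5)
act = nn.ReLU()
assert torch.equal(act(x), F.relu(x))   # identical results; the module form fits nicely into nn.Sequential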
nn.CrossEntropyLoss
criterion = nn.CrossEntropyLoss()
loss = criterion(pred, target)
nn.MSELoss
nn.BCELoss
nn.NLLLoss
nn.SmoothL1Loss
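A shape sketch for nn.CrossEntropyLoss, which expects raw logits and integer class targets:
criterion = nn.CrossEntropyLoss()
pred = torch.randn(4, 3)             # (batch, num_classes): raw logits, no softmax needed
target = torch.tensor([0, 2, 1, 2])  # class indices, shape (batch,)
loss = criterion(pred, target)       # scalar tensor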
nn.Module provides:
a consistent interface (__call__, .to(device), .eval(), .train())
automatic registration of parameters and buffers. Note: Loss functions don't usually have parameters, but some do!
correct dtype casting with .float(), .half(), .bfloat16()
PyTorch models are typically subclasses of nn.Module.
Inside these modules, trainable weights and biases are stored as parameters, which PyTorch manages automatically.
These parameters are instances of torch.nn.Parameter, which are essentially tensors that are automatically registered by the module and have requires_grad=True by default.
for param in model.parameters():
    print(param.shape)

for name, param in model.named_parameters():
    print(f"{name}: {param.shape}")
total_params = sum(p.numel() for p in model.parameters())
p.numel() gives the number of elements in the tensor (the product of its dimensions).
Sums across all parameters.
To count only trainable parameters:
trainable_params = sum(p.numel() for p in model.parameters() if p.requires_grad)
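For example, with the SimpleModel defined earlier:
model = SimpleModel()
total_params = sum(p.numel() for p in model.parameters())
print(total_params)   # 55: 10*5 weights + 5 biases from nn.Linear(10, 5); ReLU has no parameters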
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model.to(device)
Training mode (model.train()) affects certain layers:
Dropout
Active during training.
Randomly zeroes out some activations with probability p.
Helps prevent overfitting.
BatchNorm
Uses mini-batch mean and variance.
Updates the layer's running mean/variance (used later during evaluation).
All other layers like Linear, Conv2d, ReLU, etc., behave the same in both train & eval modes.
Evaluation mode (model.eval()) changes:
Dropout
Disabled — no dropout mask, outputs are deterministic.
BatchNorm
Uses the running mean/variance estimated during training.
Does not update running statistics.
Typically, wrap inference in torch.no_grad() to prevent gradient tracking:
model.eval()
with torch.no_grad():
    output = model(x)
model = MyModel()

# --- Training ---
model.train()
for x, y in train_loader:
    optimizer.zero_grad()
    pred = model(x)
    loss = criterion(pred, y)
    loss.backward()
    optimizer.step()

# --- Evaluation ---
model.eval()
with torch.no_grad():
    for x, y in test_loader:
        pred = model(x)
model.state_dict() returns a Python dictionary mapping each layer to its parameter tensors.
torch.save() serializes that dictionary and writes it to 'model.pth'.
This approach is lightweight (just the parameters), and portable across machines.
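In code:
torch.save(model.state_dict(), 'model.pth')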
torch.load('model.pth') reads the saved parameter dictionary from disk.
model.load_state_dict(...) inserts those parameters into the model architecture.
👉 Important: To load parameters, the same model architecture must be recreated first:
model = MyModel() # same class/structure as when saving
model.load_state_dict(torch.load('model.pth'))
model.eval() # if using for inference
👉 Important: If the model was saved on GPU but loaded on CPU:
model.load_state_dict(torch.load('model.pth', map_location='cpu'))