Layers, activation modules, containers, and models all inherit from nn.Module:
nn.Module
│
├── Layer (e.g., Linear, Conv2d)
│
├── Activation modules (e.g., ReLU, Softmax)
│
├── Containers (e.g., Sequential, ModuleList)
│
└── Model (your custom class inheriting from nn.Module)
So, PyTorch recursively walks through all submodules:
model.parameters() returns parameters from all nested modules
model.to(device) moves everything to GPU/CPU
model.state_dict() saves all parameters and buffers
model.train() / model.eval() toggle behavior (Dropout, BatchNorm)
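For illustration, a minimal sketch (the class name Outer is hypothetical) showing that parameters and state of nested submodules are discovered automatically:
import torch.nn as nn

class Outer(nn.Module):
    def __init__(self):
        super().__init__()
        # Nested submodule: its parameters are registered through the parent
        self.block = nn.Sequential(nn.Linear(4, 8), nn.ReLU(), nn.Linear(8, 2))

    def forward(self, x):
        return self.block(x)

model = Outer()
print(len(list(model.parameters())))    # 4 tensors: two weights + two biases
print(list(model.state_dict().keys()))  # ['block.0.weight', 'block.0.bias', 'block.2.weight', 'block.2.bias']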
Model (nn.Module)
│
└── Containers (optional, but common)
│
├── Layers (e.g., Linear, Conv2d)
└── Activations (e.g., ReLU, Softmax)
import torch
import torch.nn as nn

class SimpleModel(nn.Module):
    def __init__(self):
        super().__init__()
        # Define network components/layers here
        self.linear = nn.Linear(10, 5)
        self.activation = nn.ReLU()

    def forward(self, x):
        # Define the forward pass
        x = self.linear(x)
        x = self.activation(x)
        return x
Most flexible. Gives full control of the forward pass.
The architecture is not strictly sequential (e.g., skip connections, branching, merging).
The model has conditional logic or multiple inputs.
Supports complex models (ResNet, Transformer, etc.).
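As a sketch of the kind of non-sequential logic this enables, here is a hypothetical model with a skip connection (the class name SkipModel and the layer sizes are illustrative):
import torch
import torch.nn as nn

class SkipModel(nn.Module):
    def __init__(self):
        super().__init__()
        self.linear1 = nn.Linear(10, 10)
        self.linear2 = nn.Linear(10, 10)
        self.activation = nn.ReLU()

    def forward(self, x):
        # Skip connection: add the original input back to the transformed output
        out = self.activation(self.linear1(x))
        out = self.linear2(out)
        return out + x   # not expressible with nn.Sequential alone

out = SkipModel()(torch.randn(2, 10))   # -> (2, 10)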
import torch
import torch.nn as nn

class SequentialModel(nn.Module):
    def __init__(self):
        super().__init__()
        self.layers = nn.Sequential(
            nn.Linear(10, 20),
            nn.ReLU(),
            nn.Linear(20, 5)
        )

    def forward(self, x):
        # Simple forward pass
        return self.layers(x)
Architecture is purely sequential (cannot express anything non-sequential, such as skip connections or loops)
Clean, minimal code:
No need to reference layers individually.
No special logic needed in forward().
The model is strictly feed-forward and linear in structure.
Call the model: out = model(x)
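For example, with the SequentialModel defined above:
model = SequentialModel()
x = torch.randn(2, 10)   # batch of 2 samples, 10 features each
out = model(x)           # calls forward() under the hood
print(out.shape)         # torch.Size([2, 5])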
Layers:
self.layers = nn.ModuleList([
    nn.Linear(10, 20),
    nn.Linear(20, 30),
    nn.Linear(30, 40)
])
Forward:
def forward(self, x):
    for layer in self.layers:
        x = torch.relu(layer(x))
    return x
The number of layers can be dynamic (e.g., variable depth networks).
Some parts of the model can be generated programmatically.
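A minimal sketch of building a variable-depth MLP programmatically (the class name DynamicMLP and the depth/width values are illustrative choices):
import torch
import torch.nn as nn

class DynamicMLP(nn.Module):
    def __init__(self, depth=3, width=20):
        super().__init__()
        # Layers generated in a loop; ModuleList registers each one
        self.layers = nn.ModuleList(
            [nn.Linear(10 if i == 0 else width, width) for i in range(depth)]
        )

    def forward(self, x):
        for layer in self.layers:
            x = torch.relu(layer(x))
        return x

model = DynamicMLP(depth=4)
out = model(torch.randn(2, 10))   # -> shape (2, 20)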
Layers:
self.layers = nn.ModuleDict({
    "encoder": nn.Linear(10, 20),
    "decoder": nn.Linear(20, 10)
})
Forward:
def forward(self, x):
    x = self.layers["encoder"](x)
    x = torch.relu(x)
    x = self.layers["decoder"](x)
    return x
Architecture can make runtime choices:
conditional branches
algorithmic routing
attention heads stored by name
Good for encoder-decoder structures, transformers, multi-task models.
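Putting the two fragments above together into a runnable class (a sketch; the class name EncoderDecoder is hypothetical):
import torch
import torch.nn as nn

class EncoderDecoder(nn.Module):
    def __init__(self):
        super().__init__()
        self.layers = nn.ModuleDict({
            "encoder": nn.Linear(10, 20),
            "decoder": nn.Linear(20, 10),
        })

    def forward(self, x):
        x = self.layers["encoder"](x)
        x = torch.relu(x)
        return self.layers["decoder"](x)

model = EncoderDecoder()
out = model(torch.randn(4, 10))   # -> shape (4, 10)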
A layer typically implements a single transformation. Layers:
Have parameters stored in weight and, possibly, bias
Perform a computation in forward()
nn.Linear(in_features, out_features, bias=True)
Applies: y = xWᵀ + b
Parameters: weight, optional bias.
# in_features = 4, out_features = 2
layer = nn.Linear(4, 2)
# Input shape: (batch_size=1, in_features=4)
x = torch.randn(1, 4)
# Output shape: (batch_size=1, out_features=2)
y = layer(x)
Note: PyTorch expects the first dimension to be the batch dimension: (batch, in_features) → (batch, out_features)
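The parameters of the layer above can be inspected directly:
print(layer.weight.shape)  # torch.Size([2, 4]), i.e. (out_features, in_features)
print(layer.bias.shape)    # torch.Size([2])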
2D Convolution: nn.Conv2d(in_channels, out_channels, kernel_size, stride=1, padding=0)
Learns filters to detect spatial features in images.
conv = nn.Conv2d(3, 16, kernel_size=3, padding=1)
x = torch.randn(1, 3, 64, 64)
y = conv(x) # (1, 16, 64, 64)
1D and 3D variants:
nn.Conv1d → 1D: audio, time-series
nn.Conv3d → 3D: volumetric data, medical scans
nn.ConvTranspose2d
Used for upsampling (e.g., autoencoders, GANs).
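A quick shape sketch (kernel size and stride here are illustrative choices):
upconv = nn.ConvTranspose2d(16, 3, kernel_size=2, stride=2)
x = torch.randn(1, 16, 32, 32)
y = upconv(x)   # -> (1, 3, 64, 64): spatial size doubled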
Max Pooling: nn.MaxPool2d(kernel_size, stride)
Reduces spatial size
Picks the max value in a region
pool = nn.MaxPool2d(2)
y = pool(torch.randn(1, 3, 32, 32)) # -> (1, 3, 16, 16)
Average Pooling: nn.AvgPool2d, nn.AdaptiveAvgPool2d
Adaptive = you specify the fixed output size and PyTorch adapts the pooling window to the input (important in ResNet, MobileNet)
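For example, adaptive pooling to a fixed 1×1 output (global average pooling):
gap = nn.AdaptiveAvgPool2d(1)
y = gap(torch.randn(1, 16, 37, 53))   # -> (1, 16, 1, 1), regardless of the input size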
Batch Normalization
nn.BatchNorm1d (MLP, sequence features)
nn.BatchNorm2d (CNNs)
nn.BatchNorm3d
Normalizes each feature/channel across the batch dimension.
bn = nn.BatchNorm2d(16)
Layer Normalization: nn.LayerNorm(normalized_shape)
Used often in transformers.
InstanceNorm, GroupNorm, LocalResponseNorm
Used in style transfer, segmentation, etc.
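Short sketches of LayerNorm and GroupNorm (the sizes are illustrative):
ln = nn.LayerNorm(512)          # normalized_shape = feature size
x = torch.randn(3, 10, 512)     # (batch, seq_len, d_model)
y = ln(x)                       # normalized over the last dimension

gn = nn.GroupNorm(num_groups=4, num_channels=16)
y2 = gn(torch.randn(2, 16, 32, 32))   # normalizes within groups of channels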
nn.Dropout(p=0.5)
Randomly zeroes activations during training to prevent overfitting.
drop = nn.Dropout(0.3)
y = drop(torch.randn(5, 10))
Other variants:
nn.Dropout1d
nn.Dropout2d
nn.Dropout3d
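For example, nn.Dropout2d drops whole feature maps (channels) rather than individual values:
drop2d = nn.Dropout2d(0.5)
y = drop2d(torch.randn(1, 8, 16, 16))   # zeroes entire channels at random (in training mode)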
RNN: nn.RNN(input_size, hidden_size, num_layers=1)
LSTM: nn.LSTM(input_size, hidden_size, ...)
Captures long-term dependencies
rnn = nn.LSTM(10, 20, num_layers=2)
x = torch.randn(5, 3, 10) # seq_len=5, batch=3, features=10
output, (h, c) = rnn(x)
GRU: nn.GRU(input_size, hidden_size, ...)
Faster alternative to LSTM.
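A quick GRU sketch mirroring the LSTM example above (same sequence-first input layout):
gru = nn.GRU(10, 20, num_layers=2)
x = torch.randn(5, 3, 10)        # seq_len=5, batch=3, features=10
output, h = gru(x)               # output: (5, 3, 20), h: (2, 3, 20)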
nn.Embedding(num_embeddings, embedding_dim)
Maps word indices → dense vectors.
embed = nn.Embedding(1000, 64)
x = torch.tensor([1, 5, 9])   # three token indices
embed(x)                      # -> shape (3, 64)
Used in NLP and transformers.
nn.TransformerEncoderLayer
Contains:
Multi-head attention: nn.MultiheadAttention(embed_dim, num_heads)
Feedforward layers
LayerNorm
Dropout
layer = nn.TransformerEncoderLayer(d_model=512, nhead=8)
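A sketch of stacking such layers with nn.TransformerEncoder (the sizes are illustrative):
layer = nn.TransformerEncoderLayer(d_model=512, nhead=8)
encoder = nn.TransformerEncoder(layer, num_layers=6)
x = torch.randn(10, 32, 512)   # (seq_len, batch, d_model) with the default layout
out = encoder(x)               # -> (10, 32, 512)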
nn.Upsample(scale_factor=2, mode='nearest')
nn.PixelShuffle(upscale_factor)
Used in SRGAN, super-resolution.
nn.Flatten(start_dim=1)
flat = nn.Flatten()
y = flat(torch.randn(1, 3, 28, 28)) # -> (1, 2352)
nn.Unflatten
nn.ZeroPad2d, nn.ReflectionPad2d, nn.ConstantPad2d
nn.CosineSimilarity
nn.PairwiseDistance
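For example:
cos = nn.CosineSimilarity(dim=1)
dist = nn.PairwiseDistance(p=2)
a, b = torch.randn(4, 8), torch.randn(4, 8)
print(cos(a, b).shape)    # torch.Size([4]): one similarity per row pair
print(dist(a, b).shape)   # torch.Size([4]): one Euclidean distance per row pair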
Module version (nn.ReLU()), usable as a layer inside containers like nn.Sequential
Functional version (F.relu(x)), called directly in forward(); both compute the same result (see the sketch after the list below)
nn.ReLU
act = nn.ReLU()
act(x)
nn.LeakyReLU(negative_slope=0.01)
nn.PReLU
nn.Sigmoid
nn.Tanh
nn.Softmax(dim=1)
nn.GELU (used in transformers)
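A small sketch showing that the module and functional forms give the same result:
import torch
import torch.nn as nn
import torch.nn.functional as F

x = torch.randn(2, 5)
act = nn.ReLU()
assert torch.equal(act(x), F.relu(x))   # identical results; the module form fits nicely into nn.Sequential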
nn.CrossEntropyLoss
criterion = nn.CrossEntropyLoss()
loss = criterion(pred, target)
nn.MSELoss
nn.BCELoss
nn.NLLLoss
nn.SmoothL1Loss
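A shape sketch for nn.CrossEntropyLoss, which expects raw logits and integer class targets:
criterion = nn.CrossEntropyLoss()
pred = torch.randn(4, 3)             # (batch, num_classes): raw logits, no softmax needed
target = torch.tensor([0, 2, 1, 2])  # class indices, shape (batch,)
loss = criterion(pred, target)       # scalar tensor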
nn.Module provides:
a consistent interface (__call__, .to(device), .eval(), .train())
automatic registration of parameters and buffers. Note: Loss functions don't usually have parameters, but some do!
correct dtype casting with .float(), .half(), .bfloat16()
PyTorch models are typically subclasses of nn.Module.
Inside these modules, trainable weights and biases are stored as parameters, which PyTorch manages automatically.
These parameters are instances of torch.nn.Parameter, which are essentially tensors that are automatically registered by the module and have requires_grad=True by default.
for param in model.parameters():
    print(param.shape)

for name, param in model.named_parameters():
    print(f"{name}: {param.shape}")
total_params = sum(p.numel() for p in model.parameters())
p.numel() gives the number of elements in the tensor (the product of its dimensions).
Sums across all parameters.
To count only trainable parameters:
trainable_params = sum(p.numel() for p in model.parameters() if p.requires_grad)
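For example, with the SimpleModel defined earlier:
model = SimpleModel()
total_params = sum(p.numel() for p in model.parameters())
print(total_params)   # 55: 10*5 weights + 5 biases from nn.Linear(10, 5); ReLU has no parameters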
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model.to(device)
Training mode (model.train()) affects certain layers:
Dropout
Active during training.
Randomly zeroes out some activations with probability p.
Helps prevent overfitting.
BatchNorm
Uses mini-batch mean and variance.
Updates the layer's running mean/variance (used later during evaluation).
All other layers like Linear, Conv2d, ReLU, etc., behave the same in both train & eval modes.
Evaluation mode (model.eval()) changes:
Dropout
Disabled — no dropout mask, outputs are deterministic.
BatchNorm
Uses the running mean/variance estimated during training.
Does not update running statistics.
Typically, wrap inference in torch.no_grad() to prevent gradient tracking:
model.eval()
with torch.no_grad():
    output = model(x)
model = MyModel()

# --- Training ---
model.train()
for x, y in train_loader:
    optimizer.zero_grad()
    pred = model(x)
    loss = criterion(pred, y)
    loss.backward()
    optimizer.step()

# --- Evaluation ---
model.eval()
with torch.no_grad():
    for x, y in test_loader:
        pred = model(x)
model.state_dict() returns a Python dictionary mapping each layer to its parameter tensors.
torch.save() serializes that dictionary and writes it to 'model.pth'.
This approach is lightweight (just the parameters), and portable across machines.
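In code:
torch.save(model.state_dict(), 'model.pth')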
torch.load('model.pth') reads the saved parameter dictionary from disk.
model.load_state_dict(...) inserts those parameters into the model architecture.
👉 Important: To load parameters, the same model architecture must be recreated first:
model = MyModel() # same class/structure as when saving
model.load_state_dict(torch.load('model.pth'))
model.eval() # if using for inference
👉 Important: If the model was saved on GPU but loaded on CPU:
model.load_state_dict(torch.load('model.pth', map_location='cpu'))