Instead of feeding one sample at a time, we feed N samples together, so x has shape (batch_size, features).
PyTorch modules expect the first dimension to be the batch dimension: x = torch.randn(1, 4)  # 1 sample, 4 features
Why batch?
Efficient GPU/CPU computation
More stable gradient estimates
Parallelism
Feeding one sample at a time → Slow, inefficient, unstable gradient estimates
Feeding all data at once → Resource bottleneck (especially memory usage)
Training with mini-batches → The right balance:
dataset → split into mini-batches → feed into model
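A minimal sketch of the batched shape convention above (the nn.Linear layer and the sizes 32 and 4 are illustrative assumptions, not from the text):

import torch
import torch.nn as nn

model = nn.Linear(4, 2)        # hypothetical model: 4 input features -> 2 outputs
x = torch.randn(32, 4)         # a mini-batch: 32 samples, 4 features each
y = model(x)                   # all 32 samples are processed in parallel
print(y.shape)                 # torch.Size([32, 2]) -- batch dimension preserved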
DataLoader with batch size = 32:
from torch.utils.data import DataLoader

loader = DataLoader(dataset, batch_size=32, shuffle=True)
for x_batch, y_batch in loader:
    preds = model(x_batch)   # forward pass on the whole batch at once
Data pipeline:
Dataset → represents the data
DataLoader → batch loading + shuffling + prefetching
Every dataset you create (custom or built-in) inherits from torch.utils.data.Dataset.
It provides structured access to the samples and labels in your dataset.
Encapsulates your data (images, text, tabular data, etc.).
Provides a way to get an individual sample.
Can be easily integrated with DataLoader for batching and shuffling.
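For plain tensor data you do not even need a custom class: torch.utils.data.TensorDataset is a built-in Dataset that wraps tensors directly. A minimal sketch (the sizes mirror the example below and are just illustrative):

import torch
from torch.utils.data import TensorDataset

features = torch.randn(100, 3)             # 100 samples, 3 features
targets = torch.randint(0, 2, (100,))      # 100 binary labels
tensor_dataset = TensorDataset(features, targets)

x0, y0 = tensor_dataset[0]                 # indexing returns a (features, label) pair
print(len(tensor_dataset), x0.shape)       # 100 torch.Size([3])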
To create a custom dataset, you need to implement two methods:
__len__: Returns the total number of samples in the dataset.
__getitem__: Returns a single data sample (and label) by index.
import torch
from torch.utils.data import Dataset

class MyDataset(Dataset):
    def __init__(self, data, labels):
        self.data = data
        self.labels = labels

    def __len__(self):
        return len(self.data)

    def __getitem__(self, idx):
        x = self.data[idx]
        y = self.labels[idx]
        return x, y

# Example usage
data = torch.randn(100, 3)            # 100 samples, 3 features
labels = torch.randint(0, 2, (100,))  # 100 labels (binary classification)

dataset = MyDataset(data, labels)
print(len(dataset))  # 100
print(dataset[0])    # (tensor of features, label)
DataLoader provides:
Batching: Return multiple samples at a time.
Shuffling: Randomize the order of samples.
Parallel loading: Use multiple worker processes for faster data loading.
Custom collate functions: Handle complex data batching.
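A minimal custom collate_fn sketch, assuming variable-length sequence samples (pad_collate and the toy data below are hypothetical, not part of the examples above):

import torch
from torch.nn.utils.rnn import pad_sequence
from torch.utils.data import DataLoader

def pad_collate(batch):
    # batch is a list of (sequence, label) pairs produced by the Dataset
    seqs, labels = zip(*batch)
    padded = pad_sequence(seqs, batch_first=True)       # (batch_size, max_len), zero-padded
    lengths = torch.tensor([len(s) for s in seqs])
    return padded, torch.stack(labels), lengths

toy_data = [(torch.randn(n), torch.tensor(0)) for n in (3, 5, 2)]   # a plain list works as a Dataset
loader = DataLoader(toy_data, batch_size=3, collate_fn=pad_collate)

padded, labels, lengths = next(iter(loader))
print(padded.shape)   # torch.Size([3, 5])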
Key DataLoader parameters:
dataset: The dataset object.
batch_size: Number of samples per batch.
shuffle: Whether to shuffle the data at every epoch.
num_workers: Number of subprocesses for data loading (0 means main process).
drop_last: Whether to drop the last incomplete batch if the dataset size isn't divisible by batch_size.
from torch.utils.data import DataLoader

# Create DataLoader
dataloader = DataLoader(dataset, batch_size=10, shuffle=True, num_workers=0)

# Option 1: Iterate over DataLoader with enumeration
for batch_idx, (x_batch, y_batch) in enumerate(dataloader):
    # batch_idx = batch index
    # x_batch   = batch of data
    # y_batch   = batch of labels/targets
    print(batch_idx, x_batch.shape, y_batch.shape)

# Option 2: Iterate over DataLoader without enumeration (no batch index)
for x_batch, y_batch in dataloader:
    # x_batch = batch of data
    # y_batch = batch of labels/targets
    print(x_batch.shape, y_batch.shape)
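To see how batch_size and drop_last interact with the dataset size (100 samples in the examples above), you can count the batches; a small sketch:

from torch.utils.data import DataLoader

print(len(DataLoader(dataset, batch_size=10)))                   # 10 full batches
print(len(DataLoader(dataset, batch_size=32)))                   # 4 (three batches of 32 + one of 4)
print(len(DataLoader(dataset, batch_size=32, drop_last=True)))   # 3 (the last 4 samples are dropped)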
An epoch is one complete pass of the entire training dataset through the model.
That means:
If your dataset has 10,000 samples, one epoch means the model sees all 10,000 samples once.
During an epoch, the model updates its parameters (weights) many times—once per batch.
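For example, with 10,000 samples and a batch size of 32 (the batch size is assumed here, just for the arithmetic):

import math

num_samples = 10_000
batch_size = 32            # assumed for illustration

updates_per_epoch = math.ceil(num_samples / batch_size)
print(updates_per_epoch)   # 313 (312 full batches of 32 plus one final batch of 16)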
Adjust parameters gradually → Reduce loss / improve accuracy
Generalize better
Too many epochs, however, can lead to overfitting (the model memorizes the training data instead of learning patterns)
# Epoch loop
for epoch in range(num_epochs):
    # Batch loop within each epoch
    for batch_idx, (x, y) in enumerate(dataloader):
        # forward pass
        out = model(x)
        loss = criterion(out, y)
        # backward pass
        optimizer.zero_grad()
        loss.backward()
        # optimizer step
        optimizer.step()
Typically, at the end of each epoch (after optimizing the model on the training dataset), you run validation on the validation dataset. During validation:
No backpropagation occurs
No weights are updated
Only metrics (loss, accuracy, etc.) are computed.
What validation tells you:
Generalization (performance on unseen data)
Underfitting or overfitting
Early stopping decision
Hyperparameter tuning
For example:
If training accuracy increases but validation accuracy drops → overfitting.
If both are low → underfitting.
import torch
from torch.utils.data import Dataset, DataLoader, random_split

# Randomly split the dataset into train & validation sets (often 80/20)
train_size = int(0.8 * len(dataset))
val_size = len(dataset) - train_size
train_dataset, val_dataset = random_split(dataset, [train_size, val_size])

# Make DataLoaders
train_loader = DataLoader(train_dataset, batch_size=32, shuffle=True)
val_loader = DataLoader(val_dataset, batch_size=32, shuffle=False)

# Epoch loop
for epoch in range(num_epochs):
    # ---- Training Epoch ----
    model.train()
    for x_train, y_train in train_loader:
        optimizer.zero_grad()
        out = model(x_train)
        loss = criterion(out, y_train)
        loss.backward()
        optimizer.step()

    # ---- Validation Epoch ----
    model.eval()
    val_loss = 0
    correct = 0
    with torch.no_grad():
        for x_val, y_val in val_loader:
            out = model(x_val)
            val_loss += criterion(out, y_val).item()
            correct += (out.argmax(1) == y_val).sum().item()

    val_loss /= len(val_loader)
    val_accuracy = correct / len(val_dataset)
    print(f"Epoch {epoch+1}: val_loss={val_loss:.4f}, val_acc={val_accuracy:.4f}")