Instead of feeding one sample at a time, we feed N samples together, so x has shape (batch_size, features).
PyTorch modules expect the first dimension to be the batch dimension: x = torch.randn(1, 4)  # 1 sample, 4 features
Why batch?
Efficient GPU/CPU computation
More stable gradient estimates
Parallelism
Feeding one sample at a time → Slow, inefficient, unstable gradient estimates
Feeding all data at once → Resource bottleneck (especially memory usage)
Training with mini-batches → The right balance:
dataset → split into mini-batches → feed into model
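A minimal sketch of the batched shape convention above (the nn.Linear layer and the sizes 32 and 4 are illustrative assumptions, not from the text):

import torch
import torch.nn as nn

model = nn.Linear(4, 2)        # hypothetical model: 4 input features -> 2 outputs
x = torch.randn(32, 4)         # a mini-batch: 32 samples, 4 features each
y = model(x)                   # all 32 samples are processed in parallel
print(y.shape)                 # torch.Size([32, 2]) -- batch dimension preserved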
DataLoader with batch size = 32:
from torch.utils.data import DataLoader

loader = DataLoader(dataset, batch_size=32, shuffle=True)
for x_batch, y_batch in loader:
    preds = model(x_batch)   # forward pass on the whole batch at once
Data pipeline:
Dataset → represents the data
DataLoader → batch loading + shuffling + prefetching
Every dataset you create (custom or built-in) inherits from torch.utils.data.Dataset.
It provides structured access to the samples and labels in your dataset.
Encapsulates your data (images, text, tabular data, etc.).
Provides a way to get an individual sample.
Can be easily integrated with DataLoader for batching and shuffling.
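For plain tensor data you do not even need a custom class: torch.utils.data.TensorDataset is a built-in Dataset that wraps tensors directly. A minimal sketch (the sizes mirror the example below and are just illustrative):

import torch
from torch.utils.data import TensorDataset

features = torch.randn(100, 3)             # 100 samples, 3 features
targets = torch.randint(0, 2, (100,))      # 100 binary labels
tensor_dataset = TensorDataset(features, targets)

x0, y0 = tensor_dataset[0]                 # indexing returns a (features, label) pair
print(len(tensor_dataset), x0.shape)       # 100 torch.Size([3])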
To create a custom dataset, you need to implement two methods:
__len__: Returns the total number of samples in the dataset.
__getitem__: Returns a single data sample (and label) by index.
import torch
from torch.utils.data import Dataset

class MyDataset(Dataset):
    def __init__(self, data, labels):
        self.data = data
        self.labels = labels

    def __len__(self):
        return len(self.data)

    def __getitem__(self, idx):
        x = self.data[idx]
        y = self.labels[idx]
        return x, y

# Example usage
data = torch.randn(100, 3)            # 100 samples, 3 features
labels = torch.randint(0, 2, (100,))  # 100 labels (binary classification)

dataset = MyDataset(data, labels)
print(len(dataset))  # 100
print(dataset[0])    # (tensor of features, label)
DataLoader provides:
Batching: Return multiple samples at a time.
Shuffling: Randomize the order of samples.
Parallel loading: Use multiple worker processes for faster data loading.
Custom collate functions: Handle complex data batching.
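A minimal custom collate_fn sketch, assuming variable-length sequence samples (pad_collate and the toy data below are hypothetical, not part of the examples above):

import torch
from torch.nn.utils.rnn import pad_sequence
from torch.utils.data import DataLoader

def pad_collate(batch):
    # batch is a list of (sequence, label) pairs produced by the Dataset
    seqs, labels = zip(*batch)
    padded = pad_sequence(seqs, batch_first=True)       # (batch_size, max_len), zero-padded
    lengths = torch.tensor([len(s) for s in seqs])
    return padded, torch.stack(labels), lengths

toy_data = [(torch.randn(n), torch.tensor(0)) for n in (3, 5, 2)]   # a plain list works as a Dataset
loader = DataLoader(toy_data, batch_size=3, collate_fn=pad_collate)

padded, labels, lengths = next(iter(loader))
print(padded.shape)   # torch.Size([3, 5])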
Key DataLoader parameters:
dataset: The dataset object.
batch_size: Number of samples per batch.
shuffle: Whether to shuffle the data at every epoch.
num_workers: Number of subprocesses for data loading (0 means main process).
drop_last: Whether to drop the last incomplete batch if the dataset size isn't divisible by batch_size.
from torch.utils.data import DataLoader

# Create DataLoader
dataloader = DataLoader(dataset, batch_size=10, shuffle=True, num_workers=0)

# Option 1: Iterate over DataLoader with enumeration
for batch_idx, (x_batch, y_batch) in enumerate(dataloader):
    # batch_idx = batch index
    # x_batch   = batch of data
    # y_batch   = batch of labels/targets
    print(batch_idx, x_batch.shape, y_batch.shape)

# Option 2: Iterate over DataLoader without enumeration (no batch index)
for x_batch, y_batch in dataloader:
    # x_batch = batch of data
    # y_batch = batch of labels/targets
    print(x_batch.shape, y_batch.shape)
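To see how batch_size and drop_last interact with the dataset size (100 samples in the examples above), you can count the batches; a small sketch:

from torch.utils.data import DataLoader

print(len(DataLoader(dataset, batch_size=10)))                   # 10 full batches
print(len(DataLoader(dataset, batch_size=32)))                   # 4 (three batches of 32 + one of 4)
print(len(DataLoader(dataset, batch_size=32, drop_last=True)))   # 3 (the last 4 samples are dropped)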
An epoch is one complete pass of the entire training dataset through the model.
That means:
If your dataset has 10,000 samples, one epoch means the model sees all 10,000 samples once.
During an epoch, the model updates its parameters (weights) many times—once per batch.
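For example, with 10,000 samples and a batch size of 32 (the batch size is assumed here, just for the arithmetic):

import math

num_samples = 10_000
batch_size = 32            # assumed for illustration

updates_per_epoch = math.ceil(num_samples / batch_size)
print(updates_per_epoch)   # 313 (312 full batches of 32 plus one final batch of 16)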
Adjust parameters gradually → Reduce loss / improve accuracy
Generalize better
Too many epochs, however, can lead to overfitting (the model memorizes the training data instead of learning patterns)
# Epoch loop
for epoch in range(num_epochs):
    # Batch loop within each epoch
    for batch_idx, (x, y) in enumerate(dataloader):
        # forward pass
        out = model(x)
        loss = criterion(out, y)
        # backward pass
        optimizer.zero_grad()
        loss.backward()
        # optimizer step
        optimizer.step()
Typically, at the end of each epoch (after optimizing the model on the training dataset), you run validation on the validation dataset. During validation:
No backpropagation occurs
No weights are updated
Only metrics (loss, accuracy, etc.) are computed.
What validation tells you:
Generalization (performance on unseen data)
Underfitting or overfitting
Early stopping decision
Hyperparameter tuning
For example:
If training accuracy increases but validation accuracy drops → overfitting.
If both are low → underfitting.
import torch
from torch.utils.data import Dataset, DataLoader, random_split

# Randomly split the dataset into train & validation sets (often 80/20)
train_size = int(0.8 * len(dataset))
val_size = len(dataset) - train_size
train_dataset, val_dataset = random_split(dataset, [train_size, val_size])

# Make DataLoaders
train_loader = DataLoader(train_dataset, batch_size=32, shuffle=True)
val_loader = DataLoader(val_dataset, batch_size=32, shuffle=False)

# Epoch loop
for epoch in range(num_epochs):
    # ---- Training Epoch ----
    model.train()
    for x_train, y_train in train_loader:
        optimizer.zero_grad()
        out = model(x_train)
        loss = criterion(out, y_train)
        loss.backward()
        optimizer.step()

    # ---- Validation Epoch ----
    model.eval()
    val_loss = 0
    correct = 0
    with torch.no_grad():
        for x_val, y_val in val_loader:
            out = model(x_val)
            val_loss += criterion(out, y_val).item()
            correct += (out.argmax(1) == y_val).sum().item()

    val_loss /= len(val_loader)
    val_accuracy = correct / len(val_dataset)
    print(f"Epoch {epoch+1}: val_loss={val_loss:.4f}, val_acc={val_accuracy:.4f}")