torch.Tensor
│
├── Members:
│   ├── data : underlying storage (values)
│   ├── dtype : data type (float32, int64, etc.)
│   ├── shape : size of each dimension
│   ├── device : cpu / cuda
│   └── requires_grad : flag for autograd
│
└── Methods:
    ├── backward() : compute gradients
    ├── detach() : return tensor without grad
    ├── to(device) : move to CPU/GPU
    ├── view(shape) : reshape
    ├── permute(dims) : reorder dimensions
    ├── item() : get single Python value
    └── clone() : deep copy
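
A quick sketch of these members and methods in use (shapes and values here are arbitrary):

import torch

x = torch.randn(2, 3, dtype=torch.float32, requires_grad=True)
print(x.dtype, x.shape, x.device, x.requires_grad)  # float32, (2, 3), cpu, True

d = x.detach()            # same data, no longer tracked by autograd
v = x.view(3, 2)          # reshape (needs contiguous memory)
p = x.permute(1, 0)       # reorder dimensions -> shape (3, 2)
s = x.sum().item()        # extract a single Python float
c = x.clone()             # copy of the data (still part of the autograd graph)
if torch.cuda.is_available():
    x = x.to("cuda")      # move storage to the GPU
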
nn.Module (Model, Layers, Activations)
│
├── Members:
│   ├── _modules : dict of child layers
│   ├── _parameters : dict of learnable tensors
│   ├── training : bool flag for train/eval mode
│   └── buffers : running stats (e.g., batchnorm)
│
└── Methods:
    ├── forward(x) : define computation
    ├── parameters() : iterate learnable params
    ├── to(device) : move model to CPU/GPU
    ├── train() : set training mode
    └── eval() : set eval mode
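
A minimal nn.Module subclass showing these pieces together (layer sizes are arbitrary):

import torch
from torch import nn

class TinyNet(nn.Module):
    def __init__(self):
        super().__init__()
        self.fc1 = nn.Linear(4, 8)   # registered as a child module in _modules
        self.fc2 = nn.Linear(8, 2)

    def forward(self, x):            # define the computation
        return self.fc2(torch.relu(self.fc1(x)))

model = TinyNet().to("cpu")          # or "cuda" if available
n_params = sum(p.numel() for p in model.parameters())
model.train()                        # model.training == True
model.eval()                         # model.training == False
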
Layer (Conceptual base structure)
│
├── Members:
│   ├── weights : learnable parameters (if applicable)
│   ├── bias : optional learnable bias
│   ├── hyperparams : e.g., kernel_size, stride, in/out features
│   └── buffers : e.g., running_mean in BatchNorm
│
└── Important Methods:
    ├── forward(x) : compute layer output
    ├── reset_parameters() : initialize weights
    └── __call__() : wrapper that runs hooks + forward
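
As a sketch of that base structure, a hand-rolled linear layer (the class name and init scheme are illustrative; nn.Linear itself follows the same pattern):

import math
import torch
from torch import nn

class MyLinear(nn.Module):
    def __init__(self, in_features, out_features):
        super().__init__()
        self.in_features = in_features                 # hyperparameters
        self.out_features = out_features
        self.weight = nn.Parameter(torch.empty(out_features, in_features))
        self.bias = nn.Parameter(torch.empty(out_features))
        self.reset_parameters()

    def reset_parameters(self):                        # initialize weights
        nn.init.kaiming_uniform_(self.weight, a=math.sqrt(5))
        nn.init.zeros_(self.bias)

    def forward(self, x):                              # compute layer output
        return x @ self.weight.t() + self.bias

layer = MyLinear(4, 3)
out = layer(torch.randn(2, 4))   # __call__ runs hooks, then forward
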
nn.Linear
├── weight : (out_features, in_features)
└── bias : (out_features)

nn.Conv2d
├── weight : (out_channels, in_channels, kH, kW)
└── bias : (out_channels)
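
These shapes can be checked directly (feature/channel counts below are arbitrary):

from torch import nn

fc = nn.Linear(in_features=10, out_features=5)
print(fc.weight.shape, fc.bias.shape)      # torch.Size([5, 10]) torch.Size([5])

conv = nn.Conv2d(in_channels=3, out_channels=16, kernel_size=3)
print(conv.weight.shape, conv.bias.shape)  # torch.Size([16, 3, 3, 3]) torch.Size([16])
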
Activation (Conceptual base structure)
│
├── Members:
│   └── inplace : whether to modify in-place (for some activations)
│
└── Methods:
    └── forward(x) : apply activation
nn.ReLU
├── inplace : bool
└── forward(x) : max(0, x)

nn.Sigmoid
└── forward(x) : 1 / (1 + exp(-x))

nn.Softmax
└── forward(x) : exp(x) / sum(exp(x)) along dim
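
Applied to the same small input (values chosen only to make the outputs easy to read):

import torch
from torch import nn

x = torch.tensor([-1.0, 0.0, 2.0])
print(nn.ReLU()(x))            # tensor([0., 0., 2.])
print(nn.Sigmoid()(x))         # elementwise 1 / (1 + exp(-x))
print(nn.Softmax(dim=0)(x))    # exp(x) / sum(exp(x)); sums to 1
nn.ReLU(inplace=True)(x)       # overwrites x with max(0, x)
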
Loss (Conceptual base structure)
│
├── Members:
│   ├── reduction : 'mean' | 'sum' | 'none'
│   └── weight : optional class/element weights
│
└── Methods:
    └── forward(pred, target) : return loss value
nn.CrossEntropyLoss
├── weight : class weights
└── reduction

nn.MSELoss
└── reduction
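
Both losses in use (batch size and class count are arbitrary):

import torch
from torch import nn

logits = torch.randn(4, 3)             # raw scores for 4 samples, 3 classes
targets = torch.tensor([0, 2, 1, 1])   # class indices
ce = nn.CrossEntropyLoss(reduction="mean")
print(ce(logits, targets))             # scalar loss

pred, true = torch.randn(4, 1), torch.randn(4, 1)
mse = nn.MSELoss(reduction="sum")
print(mse(pred, true))                 # summed squared error
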
torch.optim.Optimizer (base class)
│
├── Members:
│   ├── param_groups : list of param sets + hyperparameters
│   ├── state : per-parameter state (e.g., moments in Adam)
│   └── defaults : default hyperparameters (lr, momentum, etc.)
│
└── Methods:
    ├── step() : apply gradient update
    ├── zero_grad() : clear accumulated gradients
    └── add_param_group() : add parameters post-init
optim.SGD
├── lr : learning rate
├── momentum : momentum factor
└── weight_decay : L2 penalty

optim.Adam
├── lr
├── betas : exponential decay rates for the moment estimates
├── eps : numerical stability term
└── weight_decay
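
A typical update step showing where these members and methods appear (the model and data are placeholders):

import torch
from torch import nn

model = nn.Linear(10, 2)
opt = torch.optim.SGD(model.parameters(), lr=0.01, momentum=0.9, weight_decay=1e-4)
# or: torch.optim.Adam(model.parameters(), lr=1e-3, betas=(0.9, 0.999), eps=1e-8)

x, y = torch.randn(8, 10), torch.randn(8, 2)
loss = nn.MSELoss()(model(x), y)
opt.zero_grad()                    # clear accumulated gradients
loss.backward()                    # populate .grad on every parameter
opt.step()                         # apply the gradient update
print(opt.param_groups[0]["lr"])   # hyperparameters live in param_groups
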
PyTorch builds the computation graph on the fly as operations execute, in contrast to older frameworks such as TensorFlow 1.x, which used static graphs.
This makes debugging and model development more intuitive and "Pythonic."
It also lets ordinary Python loops and conditionals be part of the model (see the sketch after the example below).
PyTorch's autograd system automatically computes the gradients needed for backpropagation:
import torch
x = torch.ones(3, requires_grad=True)  # leaf tensor tracked by autograd
y = x.sum()
y.backward()   # computes dy/dx
print(x.grad)  # tensor([1., 1., 1.])
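
Because the graph is rebuilt on every forward pass, ordinary Python control flow can depend on the data; the module below is purely illustrative:

import torch
from torch import nn

class LoopyNet(nn.Module):
    def __init__(self):
        super().__init__()
        self.fc = nn.Linear(8, 8)

    def forward(self, x):
        # number of iterations depends on the input itself
        steps = int(x.abs().mean().item() * 3) + 1
        for _ in range(steps):
            x = torch.relu(self.fc(x))
        return x.sum()

out = LoopyNet()(torch.randn(2, 8))
out.backward()   # autograd differentiates whichever path actually ran
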
Commonly used ecosystem libraries:
TorchVision — Image tasks (datasets, transforms, pretrained models)
TorchText — NLP tasks
TorchAudio — Speech/audio processing
PyTorch Lightning — High-level training framework
Hugging Face Transformers — Large pre-trained language models
TorchServe — Serve models in production
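
As one ecosystem example, a minimal TorchVision inference sketch (assumes torchvision ≥ 0.13 for the weights API; the random tensor stands in for a real image):

import torch
from torchvision import models

weights = models.ResNet18_Weights.DEFAULT
model = models.resnet18(weights=weights).eval()
preprocess = weights.transforms()            # matching resize/crop/normalize

img = torch.rand(3, 256, 256)                # placeholder for a real image
batch = preprocess(img).unsqueeze(0)
with torch.no_grad():
    probs = model(batch).softmax(dim=1)
print(probs.argmax(dim=1))                   # predicted ImageNet class index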