In traditional fine-tuning you would adjust all of the LLM's weights → extremely expensive + time-consuming.
In LoRA, you only train ~0.1–1% of the model's parameters.
Freeze the original model
Insert small "adapter matrices" (a tiny "side path" of trainable weights) that learn new behavior
Only train the small adapters
Cheap: can fine-tune a huge model on a single GPU.
Fast: hours instead of days/weeks.
Multiple LoRAs can be stacked or swapped.
Empirical research shows that fine-tuning updates tend to live in a low-rank space: when you fine-tune a giant model, the actual change needed is small and structured, like a tiny "directional correction."
LoRA captures this efficiently.
Teach an LLM to follow new instructions.
Train a customer-support personality.
Adapt a model to financial, legal, or medical text.
Style-specialization: "speak like my company brand".
Add "adapter matrices" in certain linear layers (usually attention layers):
A weight matrix W is replaced with: W′ = W + ΔW
where ΔW = BA
A is a down-projection (reduces dimension)
B is an up-projection (restores dimension)
These are low-rank matrices (rank = r, usually 4–64)
W (the original giant matrix) is frozen; only A and B are trainable (they are tiny)
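A minimal shape check of this decomposition in plain PyTorch (a throwaway sketch; the width 4096 and r = 16 are just the illustrative numbers used below):

import torch

d, r = 4096, 16               # illustrative width and rank
W = torch.randn(d, d)         # frozen pretrained weight
A = torch.randn(r, d) * 0.01  # trainable down-projection
B = torch.zeros(d, r)         # trainable up-projection, zero-init so ΔW starts at 0

delta_W = B @ A               # (d×r) @ (r×d) → d×d, same shape as W
W_prime = W + delta_W         # the effective weight the layer applies
assert W_prime.shape == W.shape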
A typical attention/FFN weight matrix is something like: 4096 × 4096 ≈ 16.8 million parameters.
LoRA inserts:
A: r × 4096 (the down-projection)
B: 4096 × r (the up-projection)
If r = 16, then: 4096×16 + 16×4096 = 131,072 ≈ 131k trainable parameters (~0.8% of the original)
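A quick sanity check of this arithmetic across typical ranks (a throwaway snippet, not part of any library):

d = 4096
for r in (4, 16, 64):
    params = r * d + d * r  # A (r×d) plus B (d×r)
    print(f"r={r}: {params:,} trainable params ({params / d**2:.2%} of W)")
# r=4: 32,768 (0.20%) · r=16: 131,072 (0.78%) · r=64: 524,288 (3.12%)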
Attention projection layers:
Query projection: W𝑞
Key projection: W𝑘
Value projection: W𝑣
Output projection: W𝑜
Sometimes also: Feed-forward layers (MLP up/down projections)
Q/K/V matrices (attention projection layers) directly modify attention patterns, which allows the model to adapt much faster.
Most of the "task-specific" behavior is encoded in Q/V.
LoRA on Q and V gives most of the benefit at minimal cost; this is the most common configuration.
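In practice you would rarely wire this up by hand. As a sketch using the Hugging Face PEFT library (the model name is a placeholder, and module names like "q_proj"/"v_proj" assume a LLaMA-style architecture; they differ between model families):

from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model

model = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-2-7b-hf")  # placeholder base model
config = LoraConfig(
    r=16,                                 # LoRA rank
    lora_alpha=32,                        # scaling factor for the update
    target_modules=["q_proj", "v_proj"],  # adapt only the Q and V projections
    lora_dropout=0.05,
)
model = get_peft_model(model, config)
model.print_trainable_parameters()  # reports trainable vs. total parameter counts

Under the hood, each targeted layer is replaced by something like the minimal LoRALinear below: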
import torch.nn as nn

class LoRALinear(nn.Linear):
    def __init__(self, in_dim, out_dim, r=16, **kwargs):
        super().__init__(in_dim, out_dim, **kwargs)
        self.lora_A = nn.Linear(in_dim, r, bias=False)   # down-projection A
        self.lora_B = nn.Linear(r, out_dim, bias=False)  # up-projection B
        nn.init.zeros_(self.lora_B.weight)  # ΔW = BA starts at zero
        # freeze original weight (and bias, if present)
        self.weight.requires_grad = False
        if self.bias is not None:
            self.bias.requires_grad = False

    def forward(self, x):
        # frozen base layer + low-rank update B(A(x))
        return super().forward(x) + self.lora_B(self.lora_A(x))
Architecture is fundamentally the same, just augmented.
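For example, swapping a plain nn.Linear for this class changes nothing about the math at initialization (ΔW = 0) while shrinking the trainable set (illustrative numbers matching the earlier arithmetic):

layer = LoRALinear(4096, 4096, r=16)
trainable = sum(p.numel() for p in layer.parameters() if p.requires_grad)
total = sum(p.numel() for p in layer.parameters())
print(f"{trainable:,} trainable of {total:,}")  # 131,072 of ~16.9M: only A and B get gradients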
Going further, an agentic LoRA system could:
Automatically determine which LoRA adapters are needed.
Combine multiple specialized LoRAs at runtime.
Auto-train new adapters when data changes.
Automatically manage/optimize adapter sizes and hyperparameters, including:
Choosing rank values
Selecting which layers to adapt
Deciding training configurations
Managing multiple LoRAs
You upload your company’s data →
The system automatically:
Detects the domain
Fine-tunes the necessary adapters
Connects them to the base model
Uses them only when relevant
This creates a dynamic, modular fine-tuned system.
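One building block for such a system already exists: PEFT can load several adapters onto one frozen base model and switch between them at runtime. A sketch (the model name, adapter names, and paths below are hypothetical):

from transformers import AutoModelForCausalLM
from peft import PeftModel

base = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-2-7b-hf")          # placeholder base model
model = PeftModel.from_pretrained(base, "adapters/legal", adapter_name="legal")  # hypothetical path
model.load_adapter("adapters/medical", adapter_name="medical")                   # hypothetical path

model.set_adapter("legal")    # route legal queries through the legal LoRA
# ... generate ...
model.set_adapter("medical")  # swap specializations without touching the frozen base

The "agentic" part is the policy on top: deciding, per request, which adapter(s) to activate.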