Transformer-based LLMs (GPT, Grok, Llama, etc.) accept variable-length input, limited by a maximum context window.
Internally, when requests are batched, shorter inputs are padded to the longest sequence in the batch (with special padding tokens that the attention mask ignores), but this is hidden from the user.
User input (any length ≤ max)
↓
Tokenizer → converts text → token IDs
↓
Positional embeddings + token embeddings
↓
Transformer layers (attention works over entire current length)
↓
Output logits for next token
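A quick way to see the tokenizer step and the context limit in action is to count tokens yourself. The minimal sketch below uses OpenAI's tiktoken library; the cl100k_base encoding and the 128k budget are illustrative assumptions, not any particular model's exact settings.

```python
# Minimal sketch: tokenize text and check it against an assumed context window.
import tiktoken

MAX_CONTEXT = 128_000                       # assumed window, e.g. a 128k-token model
enc = tiktoken.get_encoding("cl100k_base")  # tokenizer used by several OpenAI models

def fits_in_context(text: str, reserved_for_output: int = 1_000) -> bool:
    """True if the prompt plus a reserved output budget fits inside the window."""
    n_tokens = len(enc.encode(text))
    return n_tokens + reserved_for_output <= MAX_CONTEXT

print(len(enc.encode("How do I cancel an order?")))  # a handful of token IDs
print(fits_in_context("How do I cancel an order?"))  # True
```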
Typical maximum context windows (in tokens):
GPT-3.5 Turbo: 4,096 or 16,384
GPT-4 Turbo / GPT-4o: 128,000
OpenAI o1 series: 128,000 input + ~32,000 output
Claude 3.5 Sonnet: 200,000
Gemini 1.5 Pro: up to 1 million (2 million in some cases)
Grok-2: ~128,000
Llama 3.1 (405B): 128,000
Most of these models use dense Transformer architectures, whose attention cost scales poorly beyond ~128k–200k tokens.
Gemini 1.5 Pro is built from the ground up as a sparse Mixture-of-Experts (MoE) model optimized for ultra-long contexts, backed by TPU-scale serving infrastructure. That's why Google can offer 1 million tokens (and experimentally 2 million) while keeping latency reasonable on its own cloud.
Every time you send a message, the system builds the full context from scratch.
System prompt + Q1 (Your first question)
System prompt + Q1 + A1 (LLM's answer for Q1) + Your Q2
System prompt + Q1 + A1 + Q2 + A2 + Your Q3
... and so on. The context keeps growing until it hits the max context window, after which older messages are gradually dropped.
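A rough sketch of that rebuild-and-truncate loop, with word count standing in for a real tokenizer and an arbitrary 8k budget (both are assumptions for illustration):

```python
# Sketch: the client re-sends system prompt + history + new question each turn,
# dropping the oldest turns once the (approximate) token budget is exceeded.
SYSTEM_PROMPT = {"role": "system", "content": "You are a helpful assistant."}
MAX_TOKENS = 8_000  # illustrative budget

def count_tokens(message: dict) -> int:
    return len(message["content"].split())  # rough stand-in for a real tokenizer

def build_context(history: list[dict], new_question: str) -> list[dict]:
    """Rebuild the full prompt from scratch for this turn."""
    messages = history + [{"role": "user", "content": new_question}]
    # Drop the oldest user/assistant messages until everything fits.
    while sum(count_tokens(m) for m in [SYSTEM_PROMPT] + messages) > MAX_TOKENS:
        messages.pop(0)
    return [SYSTEM_PROMPT] + messages

# Turn 3 of a conversation: Q1, A1, Q2, A2 are re-sent along with Q3.
history = [
    {"role": "user", "content": "Q1"}, {"role": "assistant", "content": "A1"},
    {"role": "user", "content": "Q2"}, {"role": "assistant", "content": "A2"},
]
print(build_context(history, "Q3"))
```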
Attention formula:
Attention(Q, K, V) = softmax(Q · Kᵀ / √d_k) · V
Note: at decoding step t, the new query q_t attends over K_{1:t} and V_{1:t} (keys and values from positions 1 to t), so caching the already-computed K and V (the KV cache) avoids redoing that work for every new token.
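A minimal NumPy sketch of that idea: on each decoding step only the new key/value pair is appended, and the new query attends over the whole cache (shapes and random vectors are purely illustrative):

```python
# Sketch of causal attention with a KV cache: at step t only the new query q_t is used,
# while K and V for positions 1..t-1 are reused from the cache instead of recomputed.
import numpy as np

d_k = 64

def attend(q_t: np.ndarray, K_cache: np.ndarray, V_cache: np.ndarray) -> np.ndarray:
    """q_t: (d_k,), K_cache/V_cache: (t, d_k). Returns the attention output for step t."""
    scores = K_cache @ q_t / np.sqrt(d_k)   # (t,) dot products with every cached key
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()                # softmax over positions 1..t
    return weights @ V_cache                # weighted sum of cached values

# Decoding loop: append the new key/value once, reuse them on every later step.
K_cache = np.empty((0, d_k)); V_cache = np.empty((0, d_k))
for step in range(3):
    k_t, v_t, q_t = (np.random.randn(d_k) for _ in range(3))
    K_cache = np.vstack([K_cache, k_t])
    V_cache = np.vstack([V_cache, v_t])
    out_t = attend(q_t, K_cache, V_cache)
```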
Attached files (their content) are re-sent along with the chat history every time you ask a question, until the context window is full.
Roughly how many pages of text fit in the context window:
ChatGPT (GPT-4o): ~300 pages
Claude 3.5: ~500–600 pages
Grok: ~100–150 pages
Gemini 1.5 Pro: Up to ~2,500 pages
Your query →
System searches documents / database / internet →
Retrieved info is passed into the LLM →
LLM produces an answer using that info
Document ingestion: chunking, cleaning
Embedding model: turns text → vectors
VectorDB / Retriever: finds relevant chunks
LLM: uses retrieved chunks to generate an answer
Optional re-ranking / filtering
User uploads file
↓
Python code splits text into chunks
↓
clean (remove headers, HTML, etc.)
↓
sent to the embedding model
Good chunking is not fully automatic; it does require:
heuristics
awareness of document structure
possibly NLP tricks
sometimes domain knowledge
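A tiny cleaning pass might look like the sketch below; the regex patterns (HTML tags, "Page X of Y" footers) are illustrative stand-ins for whatever boilerplate your documents actually contain.

```python
# Minimal cleaning pass before chunking: strip HTML tags, drop repeated
# page headers/footers, and collapse whitespace.
import re

def clean_text(raw: str) -> str:
    text = re.sub(r"<[^>]+>", " ", raw)                 # drop HTML tags
    text = re.sub(r"(?m)^Page \d+ of \d+$", "", text)   # drop page headers/footers
    text = re.sub(r"\s+", " ", text)                    # collapse whitespace
    return text.strip()

print(clean_text("<p>Refund requests must be filed within 30 days.</p>"))
```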
Fixed-size chunks (simplest)
Example:
every 500 characters
every 300 tokens
Cons: might cut meaning mid-sentence
Split by sentences
Use an NLP sentence splitter
chunk roughly 3–5 sentences
Paragraph-based
Split on blank lines or <p>
Sliding window (VERY common)
Like:
chunk size = 400 tokens
overlap = 100
So the chunks cover token ranges like:
chunk 1: 0–400
chunk 2: 300–700
chunk 3: 600–1000
Overlap preserves context across chunk boundaries (a standard best practice for RAG); see the sliding-window sketch after this list.
Hybrid strategies
Paragraph splits, capped at a maximum token size, with overlaps and semantic boundary detection.
Tools like LangChain and LlamaIndex do this automatically.
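Here is a minimal sliding-window splitter matching the numbers above (400-token chunks, 100-token overlap), with whitespace splitting standing in for a real tokenizer:

```python
# Sliding-window chunking sketch: fixed-size windows with overlap.
def sliding_window_chunks(text: str, chunk_size: int = 400, overlap: int = 100) -> list[str]:
    tokens = text.split()                       # stand-in for real tokenization
    step = chunk_size - overlap                 # 400 - 100 = 300: windows 0-400, 300-700, 600-1000, ...
    chunks = []
    for start in range(0, len(tokens), step):
        window = tokens[start:start + chunk_size]
        chunks.append(" ".join(window))
        if start + chunk_size >= len(tokens):   # last window reached the end of the text
            break
    return chunks

print(len(sliding_window_chunks("word " * 1000)))
```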
→ take text
→ produce numerical vector
→ store it in Vector DB
Example: "How to cancel order?" → [0.21, -0.14, ... 1536 numbers]
Examples:
OpenAI text-embedding-3-small
BERT
Instructor XL
sentence-transformers
Example:
"cancel order" → [0.2, -0.1, 0.55, .... ]
"refund request" → [0.19, -0.09, 0.53, .... ]
These two vectors will be close together in vector space because they mean similar things.
Embedding model captures semantic similarity
A typical embedding model:
tokenizes text
passes tokens through a transformer encoder
produces contextualized hidden states
pools them into one vector representation
Example architecture:
Input Text
→ Tokenizer
→ Transformer Encoder
→ Pooling (mean or CLS token)
→ Dense Layer (optional)
→ Output Embedding Vector
OpenAI text-embedding-3, sentence-transformers models, Instructor XL, etc. are variations of this.
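For example, with the sentence-transformers library (the model name all-MiniLM-L6-v2 and its 384-dimensional output are just one illustrative choice):

```python
# Sketch: turn a batch of texts into embedding vectors.
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")
vectors = model.encode(["cancel order", "refund request", "banana nutrition"])
print(vectors.shape)  # (3, 384): one vector per input text
```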
Embedding model is trained to:
push related texts closer
push unrelated texts farther apart
This is called contrastive learning.
Training uses pairs like:
("How to cancel?", "request refund") → pull closer
("refund request", "banana nutrition") → push apart
This is the core trick.
Contrastive loss (like InfoNCE):
Minimize: distance(similar_pairs)
Maximize: distance(negative_pairs)
Scale: trained over millions of such example pairs.
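A rough PyTorch sketch of an InfoNCE-style loss with in-batch negatives (the batch size, embedding dimension, and temperature below are arbitrary illustrative values):

```python
# Sketch: each query is pulled toward its paired positive text and pushed away
# from every other text in the batch (in-batch negatives).
import torch
import torch.nn.functional as F

def info_nce(query_emb: torch.Tensor, pos_emb: torch.Tensor, temperature: float = 0.05):
    """query_emb, pos_emb: (batch, dim); row i of pos_emb is the positive for row i of query_emb."""
    q = F.normalize(query_emb, dim=-1)
    p = F.normalize(pos_emb, dim=-1)
    logits = q @ p.T / temperature          # (batch, batch) matrix of cosine similarities
    labels = torch.arange(q.size(0))        # the diagonal entries are the true pairs
    return F.cross_entropy(logits, labels)  # low when each query is closest to its own positive

loss = info_nce(torch.randn(8, 384), torch.randn(8, 384))  # random stand-ins for real embeddings
```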
A vector database stores embeddings (high-dimensional vectors) and supports similarity search.
For each chunk it typically stores:
the embedding
metadata
the original text
Example:
| id | chunk text          | embedding        |
| -- | ------------------- | ---------------- |
| 1  | "Refund request..." | [0.1, 0.2, ...]  |
| 2  | "Cancel orders..."  | [0.5, -0.1, ...] |
Faiss
Milvus
Pinecone
Weaviate
Chroma
Qdrant
Vector storage
Fast nearest-neighbor search
Metadata filters
Indexing (HNSW, IVF, PQ, etc.)
Scalability + sharding
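A small sketch with Faiss (one of the libraries listed above): IndexFlatIP does exact inner-product search, which equals cosine similarity once the vectors are L2-normalized; the random vectors stand in for real chunk embeddings.

```python
# Sketch: store chunk embeddings in a Faiss index and run a top-k similarity search.
import faiss
import numpy as np

dim = 384                                    # must match the embedding model's output size
index = faiss.IndexFlatIP(dim)               # exact inner-product index

chunk_vectors = np.random.rand(1000, dim).astype("float32")  # stand-in for real chunk embeddings
faiss.normalize_L2(chunk_vectors)            # normalized vectors → inner product = cosine similarity
index.add(chunk_vectors)                     # store all chunk embeddings

query = np.random.rand(1, dim).astype("float32")
faiss.normalize_L2(query)
scores, ids = index.search(query, 5)         # 5 nearest chunks
print(ids[0])                                # row indices that map back to chunk text + metadata
```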
extract text
split into chunks
clean text
normalize
remove headers
Example chunks:
chunk1: "Refund requests must ..."
chunk2: "Users can cancel orders ..."
chunk3: "Shipping delays ..."
Every chunk becomes a vector:
chunk → embedding vector
Example (vector shape depends on model):
[0.12, -0.33, ... 1536 dims]
Store the original text and embedding (numerical vector):
| id | chunk text          | embedding        |
| -- | ------------------- | ---------------- |
| 1  | "Refund request..." | [0.1, 0.2, ...]  |
| 2  | "Cancel orders..."  | [0.5, -0.1, ...] |
The question is converted into an embedding vector using the same embedding model.
Example: "How do I cancel an order?" → query vector [0.51, -0.11 ...]
The system searches the stored vectors for the embeddings closest to the query embedding.
Usually using:
cosine similarity
dot product
HNSW indexing
High similarity = closer meaning.
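The metric itself is simple. In plain NumPy, a brute-force cosine-similarity search over the stored chunk vectors looks like the sketch below; a vector DB does the same thing approximately and at scale.

```python
# Sketch: brute-force top-k retrieval by cosine similarity.
import numpy as np

def top_k_cosine(query_vec: np.ndarray, chunk_vecs: np.ndarray, k: int = 3) -> np.ndarray:
    q = query_vec / np.linalg.norm(query_vec)
    c = chunk_vecs / np.linalg.norm(chunk_vecs, axis=1, keepdims=True)
    sims = c @ q                           # cosine similarity with every stored chunk
    return np.argsort(sims)[::-1][:k]      # indices of the k most similar chunks

chunks = np.random.rand(100, 384)          # stand-in for stored chunk embeddings
query = np.random.rand(384)                # stand-in for the query embedding
print(top_k_cosine(query, chunks))
```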
Construct a prompt:
User asked:
"How do I cancel an order?"
Relevant info:
(1) Users can cancel orders within 30 days...
(2) Refund and cancellation policy...
Answer:
The LLM then generates the final answer based ONLY on the retrieved chunks.
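A sketch of that prompt-assembly step; the template wording is an illustrative choice, not a fixed format:

```python
# Sketch: build the augmented prompt from the retrieved chunks before calling the LLM.
def build_rag_prompt(question: str, retrieved_chunks: list[str]) -> str:
    context = "\n".join(f"({i}) {chunk}" for i, chunk in enumerate(retrieved_chunks, start=1))
    return (
        "Answer the question using ONLY the information below.\n\n"
        f"Relevant info:\n{context}\n\n"
        f"User asked: {question}\nAnswer:"
    )

print(build_rag_prompt(
    "How do I cancel an order?",
    ["Users can cancel orders within 30 days...", "Refund and cancellation policy..."],
))
```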
Write-time (ingestion): chunk → embed → store in the vector DB
Read-time (query): question → embed → similarity search → top results → LLM answer
Treat memory as its own module, external to the model, rather than something baked into its weights.
LLM reasons about what information it needs, then requests it.
LLM → “Search the memory for contract clauses about refunds.”
System → retrieves only those.
LLM → generates final answer.
This is a step closer to human-like memory, but still relies on external DB.
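A hand-wavy sketch of that loop; call_llm and search_memory are hypothetical stand-ins for a real model API and memory store:

```python
# Sketch: the LLM decides what to look up, the system retrieves only that, the LLM answers.
def call_llm(prompt: str) -> str:
    """Hypothetical stand-in for a real LLM API call."""
    return f"<LLM response to: {prompt[:40]}...>"

def search_memory(query: str) -> list[str]:
    """Hypothetical stand-in for a structured memory / vector store lookup."""
    return ["Refunds are covered in contract clause 4.2 ..."]

def answer_with_memory(user_question: str) -> str:
    # Step 1: the LLM reasons about what it needs from memory.
    search_query = call_llm(f"What should be searched in memory to answer: {user_question}")
    # Step 2: the system retrieves only that.
    retrieved = search_memory(search_query)
    # Step 3: the LLM generates the final answer from the retrieved snippets.
    return call_llm(f"Using only this context: {retrieved}\nAnswer the question: {user_question}")

print(answer_with_memory("What do our contracts say about refunds?"))
```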
These systems let the LLM:
Read a document
Generate embeddings and summaries
Store them in a structured "memory"
Retrieve based on long-term semantics, not just vector similarity
Examples:
DeepMind’s RETRO (Retrieval-Enhanced Transformer)
Meta’s RRL: Retrieval Reinforced LLMs
Microsoft’s Semantic Memory for LLMs
Anthropic’s constitutional memory agents