Transformer-based LLMs (GPT, Grok, Llama, etc.) accept variable-length input, limited by a maximum context window.
Internally, when requests are batched, shorter inputs are padded to the longest sequence in the batch (with special padding tokens that the attention mask ignores), but this is hidden from the user.
User input (any length ≤ max)
↓
Tokenizer → converts text → token IDs
↓
Positional embeddings + token embeddings
↓
Transformer layers (attention works over entire current length)
↓
Output logits for next token
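A quick way to see the tokenizer step and the context limit in action is to count tokens yourself. The minimal sketch below uses OpenAI's tiktoken library; the cl100k_base encoding and the 128k budget are illustrative assumptions, not any particular model's exact settings.

```python
# Minimal sketch: tokenize text and check it against an assumed context window.
import tiktoken

MAX_CONTEXT = 128_000                       # assumed window, e.g. a 128k-token model
enc = tiktoken.get_encoding("cl100k_base")  # tokenizer used by several OpenAI models

def fits_in_context(text: str, reserved_for_output: int = 1_000) -> bool:
    """True if the prompt plus a reserved output budget fits inside the window."""
    n_tokens = len(enc.encode(text))
    return n_tokens + reserved_for_output <= MAX_CONTEXT

print(len(enc.encode("How do I cancel an order?")))  # a handful of token IDs
print(fits_in_context("How do I cancel an order?"))  # True
```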
Typical maximum context windows (in tokens):
GPT-3.5 Turbo: 4,096 or 16,384
GPT-4 Turbo / GPT-4o: 128,000
OpenAI o1 series: 128,000 input + ~32,000 output
Claude 3.5 Sonnet: 200,000
Gemini 1.5 Pro: up to 1 million (2 million in some cases)
Grok-2: ~128,000
Llama 3.1 (405B): 128,000
Most of these models use dense Transformer architectures, whose attention cost scales poorly beyond ~128k–200k tokens.
Gemini 1.5 Pro is built from the ground up as a sparse Mixture-of-Experts (MoE) model optimized for ultra-long contexts, backed by TPU-scale serving infrastructure. That's why Google can offer 1 million tokens (and experimentally 2 million) while keeping latency reasonable on its own cloud.
Every time you send a message, the system builds the full context from scratch.
System prompt + Q1 (Your first question)
System prompt + Q1 + A1 (LLM's answer for Q1) + Your Q2
System prompt + Q1 + A1 + Q2 + A2 + Your Q3
... and so on. The context keeps growing until it hits the max context window, after which older messages are gradually dropped.
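A rough sketch of that rebuild-and-truncate loop, with word count standing in for a real tokenizer and an arbitrary 8k budget (both are assumptions for illustration):

```python
# Sketch: the client re-sends system prompt + history + new question each turn,
# dropping the oldest turns once the (approximate) token budget is exceeded.
SYSTEM_PROMPT = {"role": "system", "content": "You are a helpful assistant."}
MAX_TOKENS = 8_000  # illustrative budget

def count_tokens(message: dict) -> int:
    return len(message["content"].split())  # rough stand-in for a real tokenizer

def build_context(history: list[dict], new_question: str) -> list[dict]:
    """Rebuild the full prompt from scratch for this turn."""
    messages = history + [{"role": "user", "content": new_question}]
    # Drop the oldest user/assistant messages until everything fits.
    while sum(count_tokens(m) for m in [SYSTEM_PROMPT] + messages) > MAX_TOKENS:
        messages.pop(0)
    return [SYSTEM_PROMPT] + messages

# Turn 3 of a conversation: Q1, A1, Q2, A2 are re-sent along with Q3.
history = [
    {"role": "user", "content": "Q1"}, {"role": "assistant", "content": "A1"},
    {"role": "user", "content": "Q2"}, {"role": "assistant", "content": "A2"},
]
print(build_context(history, "Q3"))
```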
Attention formula:
Attention(Q, K, V) = softmax(Q · Kᵀ / √d_k) · V
Note: at decoding step t, the new query q_t attends over K_{1:t} and V_{1:t} (keys and values from positions 1 to t), so caching the already-computed K and V (the KV cache) avoids redoing that work for every new token.
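A minimal NumPy sketch of that idea: on each decoding step only the new key/value pair is appended, and the new query attends over the whole cache (shapes and random vectors are purely illustrative):

```python
# Sketch of causal attention with a KV cache: at step t only the new query q_t is used,
# while K and V for positions 1..t-1 are reused from the cache instead of recomputed.
import numpy as np

d_k = 64

def attend(q_t: np.ndarray, K_cache: np.ndarray, V_cache: np.ndarray) -> np.ndarray:
    """q_t: (d_k,), K_cache/V_cache: (t, d_k). Returns the attention output for step t."""
    scores = K_cache @ q_t / np.sqrt(d_k)   # (t,) dot products with every cached key
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()                # softmax over positions 1..t
    return weights @ V_cache                # weighted sum of cached values

# Decoding loop: append the new key/value once, reuse them on every later step.
K_cache = np.empty((0, d_k)); V_cache = np.empty((0, d_k))
for step in range(3):
    k_t, v_t, q_t = (np.random.randn(d_k) for _ in range(3))
    K_cache = np.vstack([K_cache, k_t])
    V_cache = np.vstack([V_cache, v_t])
    out_t = attend(q_t, K_cache, V_cache)
```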
Attached files (their content) are re-sent along with the chat history every time you ask a question, until the context window is full.
Roughly how many pages of text fit in the context window:
ChatGPT (GPT-4o): ~300 pages
Claude 3.5: ~500–600 pages
Grok: ~100–150 pages
Gemini 1.5 Pro: Up to ~2,500 pages
Your query →
System searches documents / database / internet →
Retrieved info is passed into the LLM →
LLM produces an answer using that info
Document ingestion: chunking, cleaning
Embedding model: turns text → vectors
VectorDB / Retriever: finds relevant chunks
LLM: uses retrieved chunks to generate an answer
Optional re-ranking / filtering
User uploads file
↓
Python code splits text into chunks
↓
clean (remove headers, HTML, etc.)
↓
sent to the embedding model
Good chunking is not fully automatic; it does require:
heuristics
awareness of document structure
possibly NLP tricks
sometimes domain knowledge
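A tiny cleaning pass might look like the sketch below; the regex patterns (HTML tags, "Page X of Y" footers) are illustrative stand-ins for whatever boilerplate your documents actually contain.

```python
# Minimal cleaning pass before chunking: strip HTML tags, drop repeated
# page headers/footers, and collapse whitespace.
import re

def clean_text(raw: str) -> str:
    text = re.sub(r"<[^>]+>", " ", raw)                 # drop HTML tags
    text = re.sub(r"(?m)^Page \d+ of \d+$", "", text)   # drop page headers/footers
    text = re.sub(r"\s+", " ", text)                    # collapse whitespace
    return text.strip()

print(clean_text("<p>Refund requests must be filed within 30 days.</p>"))
```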
Fixed-size chunks (simplest)
Example:
every 500 characters
every 300 tokens
Cons: might cut meaning mid-sentence
Split by sentences
Use an NLP sentence splitter
chunk roughly 3–5 sentences
Paragraph-based
Split on blank lines or <p>
Sliding window (VERY common)
Like:
chunk size = 400 tokens
overlap = 100
So the chunks cover token ranges like:
chunk 1: 0–400
chunk 2: 300–700
chunk 3: 600–1000
Overlap preserves context across chunk boundaries (a standard best practice for RAG); see the sliding-window sketch after this list.
Hybrid strategies
Paragraph splits, capped at a maximum token size, with overlaps and semantic boundary detection.
Tools like LangChain and LlamaIndex do this automatically.
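Here is a minimal sliding-window splitter matching the numbers above (400-token chunks, 100-token overlap), with whitespace splitting standing in for a real tokenizer:

```python
# Sliding-window chunking sketch: fixed-size windows with overlap.
def sliding_window_chunks(text: str, chunk_size: int = 400, overlap: int = 100) -> list[str]:
    tokens = text.split()                       # stand-in for real tokenization
    step = chunk_size - overlap                 # 400 - 100 = 300: windows 0-400, 300-700, 600-1000, ...
    chunks = []
    for start in range(0, len(tokens), step):
        window = tokens[start:start + chunk_size]
        chunks.append(" ".join(window))
        if start + chunk_size >= len(tokens):   # last window reached the end of the text
            break
    return chunks

print(len(sliding_window_chunks("word " * 1000)))
```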
→ take text
→ produce numerical vector
→ store it in Vector DB
Example: "How to cancel order?" → [0.21, -0.14, ... 1536 numbers]
Examples:
OpenAI text-embedding-3-small
BERT
Instructor XL
sentence-transformers
Example:
"cancel order" → [0.2, -0.1, 0.55, .... ]
"refund request" → [0.19, -0.09, 0.53, .... ]
These two vectors will be close together in vector space because they mean similar things.
Embedding model captures semantic similarity
A typical embedding model:
tokenizes text
passes tokens through a transformer encoder
produces contextualized hidden states
pools them into one vector representation
Example architecture:
Input Text
→ Tokenizer
→ Transformer Encoder
→ Pooling (mean or CLS token)
→ Dense Layer (optional)
→ Output Embedding Vector
OpenAI text-embedding-3, sentence-transformers models, Instructor XL, etc. are variations of this.
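For example, with the sentence-transformers library (the model name all-MiniLM-L6-v2 and its 384-dimensional output are just one illustrative choice):

```python
# Sketch: turn a batch of texts into embedding vectors.
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")
vectors = model.encode(["cancel order", "refund request", "banana nutrition"])
print(vectors.shape)  # (3, 384): one vector per input text
```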
Embedding model is trained to:
push related texts closer
push unrelated texts farther apart
This is called contrastive learning.
Training uses pairs like:
("How to cancel?", "request refund") → pull closer
("refund request", "banana nutrition") → push apart
This is the core trick.
Contrastive loss (like InfoNCE):
Minimize: distance(similar_pairs)
Maximize: distance(negative_pairs)
Scale: trained over millions of such example pairs.
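A rough PyTorch sketch of an InfoNCE-style loss with in-batch negatives (the batch size, embedding dimension, and temperature below are arbitrary illustrative values):

```python
# Sketch: each query is pulled toward its paired positive text and pushed away
# from every other text in the batch (in-batch negatives).
import torch
import torch.nn.functional as F

def info_nce(query_emb: torch.Tensor, pos_emb: torch.Tensor, temperature: float = 0.05):
    """query_emb, pos_emb: (batch, dim); row i of pos_emb is the positive for row i of query_emb."""
    q = F.normalize(query_emb, dim=-1)
    p = F.normalize(pos_emb, dim=-1)
    logits = q @ p.T / temperature          # (batch, batch) matrix of cosine similarities
    labels = torch.arange(q.size(0))        # the diagonal entries are the true pairs
    return F.cross_entropy(logits, labels)  # low when each query is closest to its own positive

loss = info_nce(torch.randn(8, 384), torch.randn(8, 384))  # random stand-ins for real embeddings
```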
A vector database stores embeddings (high-dimensional vectors) and supports similarity search.
For each chunk it typically stores:
the embedding
metadata
the original text
Example:
| id | chunk text          | embedding        |
| -- | ------------------- | ---------------- |
| 1  | "Refund request..." | [0.1, 0.2, ...]  |
| 2  | "Cancel orders..."  | [0.5, -0.1, ...] |
Faiss
Milvus
Pinecone
Weaviate
Chroma
Qdrant
Vector storage
Fast nearest-neighbor search
Metadata filters
Indexing (HNSW, IVF, PQ, etc.)
Scalability + sharding
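A small sketch with Faiss (one of the libraries listed above): IndexFlatIP does exact inner-product search, which equals cosine similarity once the vectors are L2-normalized; the random vectors stand in for real chunk embeddings.

```python
# Sketch: store chunk embeddings in a Faiss index and run a top-k similarity search.
import faiss
import numpy as np

dim = 384                                    # must match the embedding model's output size
index = faiss.IndexFlatIP(dim)               # exact inner-product index

chunk_vectors = np.random.rand(1000, dim).astype("float32")  # stand-in for real chunk embeddings
faiss.normalize_L2(chunk_vectors)            # normalized vectors → inner product = cosine similarity
index.add(chunk_vectors)                     # store all chunk embeddings

query = np.random.rand(1, dim).astype("float32")
faiss.normalize_L2(query)
scores, ids = index.search(query, 5)         # 5 nearest chunks
print(ids[0])                                # row indices that map back to chunk text + metadata
```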
extract text
split into chunks
clean text
normalize
remove headers
Example chunks:
chunk1: "Refund requests must ..."
chunk2: "Users can cancel orders ..."
chunk3: "Shipping delays ..."
Every chunk becomes a vector:
chunk → embedding vector
Example (vector shape depends on model):
[0.12, -0.33, ... 1536 dims]
Store the original text and embedding (numerical vector):
| id | chunk text          | embedding        |
| -- | ------------------- | ---------------- |
| 1  | "Refund request..." | [0.1, 0.2, ...]  |
| 2  | "Cancel orders..."  | [0.5, -0.1, ...] |
The question is converted into an embedding vector using the same embedding model.
Example: "How do I cancel an order?" → query vector [0.51, -0.11 ...]
The system searches the stored vectors for the embeddings closest to the query embedding.
Usually using:
cosine similarity
dot product
HNSW indexing
High similarity = closer meaning.
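The metric itself is simple. In plain NumPy, a brute-force cosine-similarity search over the stored chunk vectors looks like the sketch below; a vector DB does the same thing approximately and at scale.

```python
# Sketch: brute-force top-k retrieval by cosine similarity.
import numpy as np

def top_k_cosine(query_vec: np.ndarray, chunk_vecs: np.ndarray, k: int = 3) -> np.ndarray:
    q = query_vec / np.linalg.norm(query_vec)
    c = chunk_vecs / np.linalg.norm(chunk_vecs, axis=1, keepdims=True)
    sims = c @ q                           # cosine similarity with every stored chunk
    return np.argsort(sims)[::-1][:k]      # indices of the k most similar chunks

chunks = np.random.rand(100, 384)          # stand-in for stored chunk embeddings
query = np.random.rand(384)                # stand-in for the query embedding
print(top_k_cosine(query, chunks))
```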
Construct a prompt:
User asked:
"How do I cancel an order?"
Relevant info:
(1) Users can cancel orders within 30 days...
(2) Refund and cancellation policy...
Answer:
The LLM then generates the final answer based ONLY on the retrieved chunks.
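A sketch of that prompt-assembly step; the template wording is an illustrative choice, not a fixed format:

```python
# Sketch: build the augmented prompt from the retrieved chunks before calling the LLM.
def build_rag_prompt(question: str, retrieved_chunks: list[str]) -> str:
    context = "\n".join(f"({i}) {chunk}" for i, chunk in enumerate(retrieved_chunks, start=1))
    return (
        "Answer the question using ONLY the information below.\n\n"
        f"Relevant info:\n{context}\n\n"
        f"User asked: {question}\nAnswer:"
    )

print(build_rag_prompt(
    "How do I cancel an order?",
    ["Users can cancel orders within 30 days...", "Refund and cancellation policy..."],
))
```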
Write-time (ingestion): chunk → embed → store in the vector DB
Read-time (query): question → embed → similarity search → top results → LLM answer
Treat memory as its own module, external to the model, rather than something baked into its weights.
LLM reasons about what information it needs, then requests it.
LLM → “Search the memory for contract clauses about refunds.”
System → retrieves only those.
LLM → generates final answer.
This is a step closer to human-like memory, but still relies on external DB.
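A hand-wavy sketch of that loop; call_llm and search_memory are hypothetical stand-ins for a real model API and memory store:

```python
# Sketch: the LLM decides what to look up, the system retrieves only that, the LLM answers.
def call_llm(prompt: str) -> str:
    """Hypothetical stand-in for a real LLM API call."""
    return f"<LLM response to: {prompt[:40]}...>"

def search_memory(query: str) -> list[str]:
    """Hypothetical stand-in for a structured memory / vector store lookup."""
    return ["Refunds are covered in contract clause 4.2 ..."]

def answer_with_memory(user_question: str) -> str:
    # Step 1: the LLM reasons about what it needs from memory.
    search_query = call_llm(f"What should be searched in memory to answer: {user_question}")
    # Step 2: the system retrieves only that.
    retrieved = search_memory(search_query)
    # Step 3: the LLM generates the final answer from the retrieved snippets.
    return call_llm(f"Using only this context: {retrieved}\nAnswer the question: {user_question}")

print(answer_with_memory("What do our contracts say about refunds?"))
```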
These systems let the LLM:
Read a document
Generate embeddings and summaries
Store them in a structured "memory"
Retrieve based on long-term semantics, not just vector similarity
Examples:
DeepMind’s RETRO (Retrieval-Enhanced Transformer)
Meta’s RRL: Retrieval Reinforced LLMs
Microsoft’s Semantic Memory for LLMs
Anthropic’s constitutional memory agents