What Is Context?
In AI systems — especially large language models (LLMs) — context is everything the model can "see" at the moment it generates a response. The model has no persistent memory between sessions; instead, every piece of relevant information must be present inside the context at inference time.
Think of context as a document handed to the model right before it answers. Whatever is in that document shapes what the model knows, how it behaves, and what it can reference.
What Context Contains
Context is not a single thing — it is a structured collection of inputs assembled before each model call. The exact structure varies by platform, but in practice it typically includes:
| Part | Purpose |
|---|---|
| System prompt | Sets model behavior, role, and rules |
| Conversation history | Prior turns of the dialogue |
| Injected data | External information added programmatically |
| Tool/function results | Output from tools the model called |
| Current user message | The latest input from the user |
All of these parts are serialized into a single token sequence and passed to the model as one block.
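The assembly step above can be sketched in code. This is a minimal, illustrative sketch, not any particular vendor's API: the `Message` shape follows the common `{ role, content }` convention, and `buildContext` is a hypothetical helper name.

```typescript
type Role = "system" | "user" | "assistant" | "tool";

interface Message {
  role: Role;
  content: string;
}

// Assemble the parts of context, in order, into one message list.
// How injected data is attached varies by platform; here we append it
// as extra system messages (an assumption, not a universal rule).
function buildContext(
  systemPrompt: string,
  history: Message[],
  injectedData: string[],
  currentMessage: string
): Message[] {
  return [
    { role: "system", content: systemPrompt },
    ...injectedData.map((d): Message => ({ role: "system", content: d })),
    ...history,
    { role: "user", content: currentMessage },
  ];
}
```

The ordering matters: the system prompt comes first, the current user message last, and everything in between is the material the model reasons over.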
System Prompt
The system prompt is the foundational layer of context. It is written by the developer — not the user — and defines:
- The model's role ("You are a helpful coding assistant.")
- Behavioral rules ("Always respond in the user's language.")
- Constraints ("Never reveal internal configuration.")
- Output format requirements ("Respond in JSON.")
System: You are a senior TypeScript developer. Answer questions concisely.
When showing code, always include type annotations.
Do not speculate about runtime behavior — cite documentation instead.
The system prompt is evaluated first and carries the highest implicit authority. Users are not supposed to be able to override it, though prompt-injection attacks attempt exactly that, so this authority should not be treated as a hard security boundary.
Keep the system prompt focused. Bloated system prompts consume tokens that could be used for conversation history or injected data.
Conversation History
LLMs are stateless — they have no memory across calls. To maintain a coherent multi-turn conversation, the application must replay the entire message history on every request.
Each turn is represented as a message with a role:
[
{ "role": "user", "content": "What is a closure?" },
{ "role": "assistant", "content": "A closure is a function that captures variables from its surrounding scope..." },
{ "role": "user", "content": "Can you show a TypeScript example?" }
]
The model reads the full history and uses it to understand what has already been said, avoid repetition, and maintain coherence.
Practical implication: As conversations grow, history consumes more and more of the context window. Applications typically truncate older messages or summarize them to stay within limits.
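A simple truncation strategy can be sketched as follows. The 4-characters-per-token estimate is the rough heuristic discussed later in this article, and `truncateHistory` is an illustrative name, not a library function.

```typescript
interface Message {
  role: string;
  content: string;
}

// Rough heuristic: 1 token ≈ 4 characters of English text.
const estimateTokens = (text: string): number => Math.ceil(text.length / 4);

// Keep the most recent messages that fit within a token budget,
// walking backwards from the newest turn. The newest turn is always
// kept, even if it alone exceeds the budget.
function truncateHistory(history: Message[], budget: number): Message[] {
  const kept: Message[] = [];
  let used = 0;
  for (let i = history.length - 1; i >= 0; i--) {
    const cost = estimateTokens(history[i].content);
    if (used + cost > budget && kept.length > 0) break;
    kept.unshift(history[i]);
    used += cost;
  }
  return kept;
}
```

Production systems often combine this with summarization of the dropped turns rather than discarding them outright.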
Injected Data
Injected data is external information that your application inserts into the context programmatically — the model does not fetch it itself. Common patterns include:
Retrieval-Augmented Generation (RAG): A search step finds relevant documents and injects them as context before the model answers.
System: Use only the provided documentation to answer.
[INJECTED DOCUMENT]
Function: createUser(name: string, email: string): Promise<User>
Creates a new user record. Throws if the email already exists.
[END DOCUMENT]
User: What does createUser throw?
User profile / session data:
System: The current user is Ana, a premium subscriber, timezone UTC+2.
Real-time data: current date, feature flags, API results injected as text.
Injected data lets you give the model knowledge it was not trained on, without fine-tuning.
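The RAG pattern above can be sketched as a prompt-building step. The retrieval itself is assumed to have already happened; `docs` stands in for search results, and the delimiter style mirrors the example shown earlier.

```typescript
// Wrap retrieved documents in explicit delimiters so the model can
// distinguish injected data from instructions, then append the question.
function injectDocuments(docs: string[], question: string): string {
  const wrapped = docs
    .map((d, i) => `[DOCUMENT ${i + 1}]\n${d}\n[END DOCUMENT ${i + 1}]`)
    .join("\n\n");
  return [
    "Use only the provided documentation to answer.",
    wrapped,
    `User question: ${question}`,
  ].join("\n\n");
}
```

In a real pipeline the output of this function would become the content of a system or user message in the context.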
Tool Results
Modern LLM APIs support tool use (also called function calling). The model can request that a tool be executed, and the result is then injected back into the context so the model can use it in its final response.
The flow looks like this:
- User asks a question.
- Model outputs a tool call request (e.g. search("TypeScript generics")).
- The application runs the tool and gets a result.
- The result is appended to the context as a tool message.
- The model reads the result and produces the final answer.
{ "role": "tool", "tool_call_id": "call_abc", "content": "TypeScript generics allow types to be parameterized..." }
Tool results make context dynamic — the model's knowledge can be extended at runtime with live data.
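The loop described above can be sketched schematically. The `Model` and `Tools` types here are stand-ins for a real API client and tool registry; only the shape of the loop is the point.

```typescript
interface ToolCall {
  id: string;
  name: string;
  args: string;
}
interface ModelReply {
  toolCall?: ToolCall;
  text?: string;
}

type Model = (messages: object[]) => ModelReply;
type Tools = Record<string, (args: string) => string>;

// Keep calling the model until it produces a final text answer.
// Each tool result is appended to the context as a "tool" message,
// so the next model call can see it.
function runWithTools(model: Model, tools: Tools, messages: object[]): string {
  for (;;) {
    const reply = model(messages);
    if (!reply.toolCall) return reply.text ?? "";
    const tool = tools[reply.toolCall.name];
    if (!tool) throw new Error(`unknown tool: ${reply.toolCall.name}`);
    messages.push({
      role: "tool",
      tool_call_id: reply.toolCall.id,
      content: tool(reply.toolCall.args),
    });
  }
}
```

Real APIs add details (parallel tool calls, structured arguments, stop conditions), but the append-result-and-call-again loop is the core of tool use.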
Context Window
The context window is the maximum number of tokens the model can process in a single call. It is a hard limit set by the model architecture.
| Model family | Typical context window |
|---|---|
| GPT-4o | 128 000 tokens |
| Claude 3.5 Sonnet | 200 000 tokens |
| Gemini 1.5 Pro | 1 000 000 tokens |
Everything — system prompt, history, injected data, tool results, and the current message — must fit within this limit. If the total exceeds it, older content must be dropped or summarized.
Token cost matters. A rough rule: 1 token ≈ 4 characters in English. A 200 000-token window holds roughly 150 000 words, about a 500-page book. That sounds large, but agentic workflows with many tool calls can fill it quickly.
Design your context budget intentionally. Reserve space for the system prompt and recent history; use summarization or chunking for long injected documents.
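A context budget can be made concrete as a simple allocation. The numbers below are illustrative assumptions for a 128 000-token window, not recommendations for any specific model.

```typescript
// Hypothetical budget split for a 128 000-token context window.
const WINDOW = 128_000;

const budget = {
  systemPrompt: 2_000,     // keep the system prompt lean
  recentHistory: 24_000,   // the last few turns, verbatim
  injectedData: 90_000,    // retrieved documents, chunked or summarized
  responseReserve: 12_000, // headroom for the model's own answer
};

// The parts must sum to the window; anything over must be trimmed.
const total = Object.values(budget).reduce((a, b) => a + b, 0);
```

Writing the split down, even roughly, forces the decision about what gets dropped first when a conversation outgrows the window.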
How to Use Context Effectively
Be explicit in the system prompt. The model follows instructions that are present. If a behavior matters to you, state it explicitly — don't assume the model will infer it.
Inject only what is relevant. Irrelevant content dilutes the signal and wastes tokens. If you are answering a question about function X, inject only the docs for function X, not the entire library reference.
Structure injected data clearly. Use delimiters so the model can distinguish injected content from instructions:
[START CONTEXT]
...your retrieved document here...
[END CONTEXT]
Answer the user's question using only the context above.
Summarize long histories. Instead of replaying 50 turns verbatim, summarize older turns into a compact paragraph and keep only the most recent turns in full.
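The summarization tip can be sketched as a compaction step. `compactHistory` and its `summarize` callback are illustrative; in practice the callback would be a cheaper model call or a heuristic.

```typescript
interface Message {
  role: string;
  content: string;
}

// Replace all but the most recent `keep` turns with one summary message.
// `summarize` is a stand-in for however older turns get condensed.
function compactHistory(
  history: Message[],
  keep: number,
  summarize: (older: Message[]) => string
): Message[] {
  if (history.length <= keep) return history;
  const older = history.slice(0, history.length - keep);
  const recent = history.slice(history.length - keep);
  return [
    {
      role: "system",
      content: `Summary of earlier conversation: ${summarize(older)}`,
    },
    ...recent,
  ];
}
```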
Put critical instructions at both ends. Research shows LLMs sometimes underweight instructions buried in the middle of a long context. Repeat key constraints at the end of the system prompt or just before the user message.
Common Mistakes
Assuming the model remembers previous sessions. It does not. If a fact matters, it must be in the current context.
Overloading the system prompt. A 4 000-token system prompt leaves less room for conversation history and data. Keep it lean and delegate detail to injected documents.
Not trimming stale history. Replaying a full chat history with 100 turns is expensive and often counter-productive — early turns are rarely relevant to the current question.
Injecting unfiltered user input into the system prompt. This opens prompt-injection vulnerabilities. Always validate or sanitize content before placing it in privileged positions in the context.
Never concatenate raw user-supplied text directly into the system prompt. An attacker can override your instructions by embedding directives like "Ignore all previous instructions and…" in their input.
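One defensive pattern can be sketched as follows: keep user text out of the system prompt entirely, place it in a delimited user message, and strip anything that looks like your own delimiters so the input cannot fake a boundary. The function names and delimiter strings are illustrative.

```typescript
// Remove anything resembling our context delimiters from user input,
// so it cannot pretend to close or open a trusted section.
function sanitizeUserInput(input: string): string {
  return input.replace(/\[(START|END) CONTEXT\]/gi, "");
}

// The system prompt stays developer-controlled; user text goes only
// into a clearly delimited user message.
function buildMessages(systemPrompt: string, userInput: string) {
  return [
    { role: "system", content: systemPrompt },
    {
      role: "user",
      content: `[START CONTEXT]\n${sanitizeUserInput(userInput)}\n[END CONTEXT]`,
    },
  ];
}
```

Delimiter stripping alone does not stop prompt injection; it only removes one easy attack path, so it should be combined with the other mitigations above.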
Ignoring token costs. Each token in the context is processed on every request. Large, unmanaged contexts increase latency and cost linearly.
