Devii · AI & ML · 2026-03-05 · 7 min read
LLM Inference Basics: Context Windows, Tokens, And Temperature
Factual terminology for transformer inference without model rankings or benchmark claims.
Large language models consume text as **tokens** (subword units). The **context window** is the maximum number of tokens a model can attend to at once, counting the prompt and the generated completion together. When a request exceeds it, the API either truncates the input or returns an error, depending on the provider.
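A minimal sketch of the budget arithmetic behind a context window, assuming the prompt has already been tokenized into a list of integer IDs. The function names `fits_context` and `truncate_left` are hypothetical, and keeping the most recent tokens is only one possible truncation policy; real APIs may truncate differently or reject the request outright.

```python
def fits_context(prompt_tokens: int, max_new_tokens: int, context_window: int) -> bool:
    """Return True if prompt plus requested completion fits in the window."""
    return prompt_tokens + max_new_tokens <= context_window


def truncate_left(token_ids: list[int], max_new_tokens: int, context_window: int) -> list[int]:
    """Keep only the most recent prompt tokens so the completion still fits.

    Assumption: dropping the oldest tokens is acceptable for the use case.
    """
    budget = context_window - max_new_tokens
    if budget <= 0:
        raise ValueError("max_new_tokens leaves no room for the prompt")
    # A negative slice start keeps the last `budget` tokens (or all of them
    # if the prompt is already shorter than the budget).
    return token_ids[-budget:]


if __name__ == "__main__":
    prompt = list(range(10))  # stand-in for real token IDs
    print(fits_context(len(prompt), max_new_tokens=4, context_window=8))
    print(truncate_left(prompt, max_new_tokens=4, context_window=8))
```

Reserving room for the completion before truncating the prompt avoids the case where the prompt alone fills the window and generation stops immediately.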
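**Temperature** controls randomness in sampling by dividing the model's logits before the softmax: values below 1 sharpen the distribution toward the most likely tokens, while values above 1 flatten it and diversify output. **Top-p** (nucleus) sampling restricts the choice to the smallest set of tokens whose cumulative probability reaches p and discards the rest.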
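The two sampling controls can be sketched in plain Python. This is an illustration under stated assumptions, not any particular library's implementation: temperature divides the logits before a softmax, and top-p keeps the smallest set of highest-probability tokens whose cumulative probability reaches `top_p`, then renormalizes. The function names and the example logits are hypothetical.

```python
import math
import random


def softmax_with_temperature(logits: list[float], temperature: float) -> list[float]:
    """Convert logits to probabilities after dividing by the temperature."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def top_p_filter(probs: list[float], top_p: float) -> dict[int, float]:
    """Keep the smallest high-probability set reaching top_p, renormalized."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, cumulative = [], 0.0
    for i in order:
        kept.append(i)
        cumulative += probs[i]
        if cumulative >= top_p:
            break
    mass = sum(probs[i] for i in kept)
    return {i: probs[i] / mass for i in kept}


def sample_token(logits, temperature=1.0, top_p=1.0, rng=None) -> int:
    """Sample one token index using temperature scaling, then top-p."""
    rng = rng or random.Random()
    nucleus = top_p_filter(softmax_with_temperature(logits, temperature), top_p)
    ids = list(nucleus)
    weights = [nucleus[i] for i in ids]
    return rng.choices(ids, weights=weights, k=1)[0]


if __name__ == "__main__":
    logits = [2.0, 1.0, 0.0]  # hypothetical scores for three tokens
    for t in (0.5, 1.0, 2.0):
        print(t, [round(p, 3) for p in softmax_with_temperature(logits, t)])
    print(sample_token(logits, temperature=0.7, top_p=0.9))
```

Running the loop shows the top token's probability shrinking as the temperature rises, which is the "diversify output" effect described above.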
Inference can run on GPUs, TPUs, or other specialized hardware; latency depends on model size, batching, and quantization (for example, INT8 or INT4 weights). Hosted providers document their rate limits and typically price usage per million tokens.
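Two pieces of arithmetic follow from this paragraph and can be sketched directly. The first estimates weight memory as parameters times bits per weight, which shows why INT8 or INT4 quantization shrinks a model; it ignores activations, the KV cache, and any quantization metadata. The second computes a request cost from per-million-token prices. Separate input and output prices are an assumption here, and all numbers in the example are made up rather than taken from any provider.

```python
def weight_memory_gib(num_params: int, bits_per_weight: int) -> float:
    """Approximate memory for the weights alone, in GiB."""
    return num_params * bits_per_weight / 8 / 2**30


def request_cost(
    prompt_tokens: int,
    completion_tokens: int,
    input_price_per_million: float,
    output_price_per_million: float,
) -> float:
    """Cost of one request given hypothetical per-million-token prices."""
    total = (
        prompt_tokens * input_price_per_million
        + completion_tokens * output_price_per_million
    )
    return total / 1_000_000


if __name__ == "__main__":
    params = 7_000_000_000  # hypothetical 7B-parameter model
    for bits in (16, 8, 4):
        print(f"{bits}-bit weights: {weight_memory_gib(params, bits):.1f} GiB")
    # Hypothetical prices, in currency units per million tokens.
    print(request_cost(1_200, 300, input_price_per_million=2.0, output_price_per_million=4.0))
```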
For production: log prompts and outputs under your privacy policy, version models explicitly, and evaluate changes on held-out tasks. This article defines terms; it does not compare vendor model quality.
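As a rough sketch of the production advice, the snippet below builds a log record that pins the model version next to the prompt, output, and sampling settings, and scores predictions against held-out references with exact match. The record fields, the version string, and the choice of exact match as a metric are all assumptions for illustration; what you may store is governed by your own privacy policy.

```python
import json
from dataclasses import asdict, dataclass
from datetime import datetime, timezone


@dataclass
class InferenceRecord:
    model_version: str  # explicit version, never an alias like "latest"
    prompt: str
    output: str
    temperature: float
    top_p: float
    timestamp: str


def make_record(model_version, prompt, output, temperature, top_p) -> InferenceRecord:
    """Build a record stamped with the current UTC time."""
    now = datetime.now(timezone.utc).isoformat()
    return InferenceRecord(model_version, prompt, output, temperature, top_p, now)


def to_log_line(record: InferenceRecord) -> str:
    """Serialize a record as one JSON line with stable key order."""
    return json.dumps(asdict(record), sort_keys=True)


def exact_match_accuracy(predictions: list[str], references: list[str]) -> float:
    """Fraction of predictions equal to their reference, ignoring outer whitespace."""
    if not references or len(predictions) != len(references):
        raise ValueError("need equal-length, non-empty prediction and reference lists")
    hits = sum(p.strip() == r.strip() for p, r in zip(predictions, references))
    return hits / len(references)


if __name__ == "__main__":
    record = make_record("example-model-2026-03-01", "2+2=", "4", 0.2, 0.9)
    print(to_log_line(record))
    print(exact_match_accuracy(["4", "Paris "], ["4", "Rome"]))
```

Comparing the same held-out set across two model versions, with the version recorded in every log line, makes it possible to attribute a change in accuracy to a specific upgrade.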