End-to-End LLM Stack Trace: From Keystroke to Streamed Token

A software engineer has published a detailed technical document that traces what happens at every layer of the stack when you send a prompt to an LLM like Claude or ChatGPT. Inspired by the classic "what-happens-when" repository for browser navigation, the document takes a production-systems perspective on LLM chat interactions.
What the Document Covers
The document follows the full journey in production order:
- Client-side: Live token counting via WASM tokenizers, IME composition events, optimistic UI rendering
- Network: Why SSE wins over WebSockets for chat, the UTF-8 boundary problem in streaming
- API Gateway: Edge TLS termination, multi-dimensional rate limiting (RPM vs ITPM vs OTPM)
- Safety classifiers: What runs before and after the model, why prompt injection is structurally unsolved
- Context assembly: What actually goes into the context window (it's not just your messages)
- Tokenization: Why models can't count letters, why leading spaces matter, how special tokens consume budget
- KV cache and prefix caching: GQA vs MHA memory math, PagedAttention, cache hit rate as a cost lever
- Prefill vs decode: Why they're bottlenecked differently (compute vs memory bandwidth)
- Sampling pipeline: The full logit pipeline in order — repetition penalty, temperature, top-k, top-p, softmax, sample
- Streaming: TTFT breakdown, SSE event parsing, incremental markdown rendering
- Tool use and agentic loops: Parallel tool calls, prompt injection resurfacing in tool results
- Billing and observability: TTFT vs TPOT, cache pricing math, what to instrument
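The sampling order listed above can be sketched in a few lines of plain Python. This is a minimal illustration, not the implementation from the source document: the function name `sample_next_token`, its default parameter values, and the CTRL-style repetition penalty rule are all assumptions chosen for the example. Note that top-p needs probabilities to find its cutoff, so the sketch computes a softmax inside that step and then a final softmax over the survivors.

```python
import math
import random


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def sample_next_token(logits, generated, *, repetition_penalty=1.1,
                      temperature=0.8, top_k=50, top_p=0.95, rng=random):
    """Hypothetical helper: pick one token id from raw logits.

    `logits` is indexed by token id; `generated` holds ids already emitted.
    `temperature` must be greater than zero.
    """
    logits = list(logits)

    # 1. Repetition penalty (assumed CTRL-style rule): shrink positive
    #    logits and push negative logits further down for seen tokens.
    for tok in set(generated):
        if logits[tok] > 0:
            logits[tok] /= repetition_penalty
        else:
            logits[tok] *= repetition_penalty

    # 2. Temperature: values below 1 sharpen, values above 1 flatten.
    logits = [x / temperature for x in logits]

    # 3. Top-k: keep the k highest-scoring token ids, best first.
    order = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)
    keep = order[:top_k]

    # 4. Top-p: keep the smallest prefix whose cumulative probability
    #    reaches top_p.
    probs = softmax([logits[i] for i in keep])
    nucleus, cumulative = [], 0.0
    for i, p in zip(keep, probs):
        nucleus.append(i)
        cumulative += p
        if cumulative >= top_p:
            break

    # 5. Softmax over the surviving tokens only.
    final = softmax([logits[i] for i in nucleus])

    # 6. Sample one token id according to the final distribution.
    return rng.choices(nucleus, weights=final, k=1)[0]
```

Setting `top_k=1` makes the function behave like greedy decoding, which is a convenient way to check that the penalty and temperature steps do what you expect.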
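The "GQA vs MHA memory math" item above comes down to one multiplication: the KV cache stores one key and one value vector per layer, per KV head, per token. The sketch below works through it for a hypothetical model; the layer count, head counts, head dimension, and context length are made-up illustration values, not figures from the source document.

```python
def kv_cache_bytes(num_layers, num_kv_heads, head_dim, seq_len,
                   bytes_per_elem=2, batch_size=1):
    """Size of the KV cache in bytes.

    The leading 2 accounts for storing both keys and values.
    bytes_per_elem=2 assumes fp16 or bf16 storage.
    """
    return (2 * num_layers * num_kv_heads * head_dim
            * seq_len * bytes_per_elem * batch_size)


# Hypothetical model: 32 layers, 32 query heads, head_dim 128, 8192 tokens.
# With MHA every query head has its own KV head; with GQA, groups of
# query heads share one (here 8 KV heads in total).
mha = kv_cache_bytes(num_layers=32, num_kv_heads=32, head_dim=128, seq_len=8192)
gqa = kv_cache_bytes(num_layers=32, num_kv_heads=8, head_dim=128, seq_len=8192)

print(mha / 2**30, "GiB with MHA")   # 4.0 GiB with MHA
print(gqa / 2**30, "GiB with GQA")   # 1.0 GiB with GQA
```

Under these assumptions, cutting the KV heads from 32 to 8 cuts per-sequence cache memory by the same factor of four, which leaves room to batch more concurrent sequences on the same GPU.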
Document Details
The document is aimed at engineers who already understand transformers and want to see how production systems actually work. It is released under the CC0 license, and contributions are welcome. At the bottom, the author lists several subsystems the document does not yet cover, including speculative decoding, multimodal systems, and multi-agent coordination.
The repository was created to address the gap between high-level "transformers are magic" explanations and academic papers that don't connect concepts to production system behavior.
📖 Read the full source: r/LocalLLaMA