LLM Stack Trace: 8 Layers From Keystroke to Streamed Token

A software engineer has published a detailed technical document that traces what happens at each layer of the stack when you send a prompt to an LLM such as Claude or ChatGPT. Inspired by the classic "what-happens-when" repository for browser navigation, the document takes a production-systems view of LLM chat interactions.

What the Document Covers

The document follows the full journey in production order:

Client-side: Live token counting via WASM tokenizers, IME composition events, optimistic UI rendering
Network: Why SSE wins over WebSockets for chat, UTF-8 boundary problem in streaming
API Gateway: Edge TLS termination, multi-dimensional rate limiting (RPM vs ITPM vs OTPM)
Safety classifiers: What runs before and after the model, why prompt injection is structurally unsolved
Context assembly: What actually goes into the context window (it's not just your messages)
Tokenization: Why models can't count letters, why leading spaces matter, how special tokens consume budget
KV cache and prefix caching: GQA vs MHA memory math, PagedAttention, cache hit rate as a cost lever
Prefill vs decode: Why they're bottlenecked differently (compute vs memory bandwidth)
Sampling pipeline: The full logit pipeline in order: repetition penalty, temperature, top-k, top-p, softmax, sample
Streaming: TTFT breakdown, SSE event parsing, incremental markdown rendering
Tool use and agentic loops: Parallel tool calls, prompt injection resurfacing in tool results
Billing and observability: TTFT vs TPOT, cache pricing math, what to instrument
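
To make the sampling pipeline entry concrete, here is a minimal sketch of that ordering in plain Python. It is an illustration under stated assumptions, not code taken from the document: the function name, parameter names, and default values are hypothetical, the repetition penalty follows the common divide-or-multiply convention, and top-p is applied by computing a softmax over the top-k survivors so that cumulative probability can be measured before the final renormalize-and-sample step.

```python
import math
import random


def sample_next_token(logits, generated, *, repetition_penalty=1.1,
                      temperature=0.8, top_k=50, top_p=0.95, rng=None):
    """Pick one token id from raw logits (hypothetical helper, not a real API)."""
    if rng is None:
        rng = random.Random()
    logits = list(logits)  # work on a copy

    # 1. Repetition penalty: push down tokens that were already generated.
    #    Assumed convention: divide positive logits, multiply negative ones.
    for tok in set(generated):
        if logits[tok] > 0:
            logits[tok] /= repetition_penalty
        else:
            logits[tok] *= repetition_penalty

    # 2. Temperature: treat 0 as greedy decoding, otherwise rescale the logits.
    if temperature <= 0:
        return max(range(len(logits)), key=lambda i: logits[i])
    logits = [x / temperature for x in logits]

    # 3. Top-k: keep only the k highest-scoring token ids (0 disables it).
    order = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)
    if top_k > 0:
        order = order[:top_k]

    # 4. Top-p: keep the smallest prefix whose cumulative probability
    #    reaches top_p. Subtracting the max keeps exp() numerically stable.
    peak = logits[order[0]]
    weights = [math.exp(logits[i] - peak) for i in order]
    total = sum(weights)
    kept, cumulative = [], 0.0
    for i, w in zip(order, weights):
        kept.append(i)
        cumulative += w / total
        if cumulative >= top_p:
            break

    # 5. Softmax over the final candidates (random.choices normalizes the
    #    unnormalized weights), then 6. sample one token id.
    final_weights = [math.exp(logits[i] - peak) for i in kept]
    return rng.choices(kept, weights=final_weights, k=1)[0]
```

Calling sample_next_token(logits, generated_ids, rng=random.Random(0)) gives a reproducible draw. Real inference servers run these steps as batched tensor operations on the GPU, and some apply the filters in a different order.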

Document Details

The document is aimed at engineers who already understand transformers and want to see how production systems actually work. It is released under the CC0 license, and contributions are welcome. At the bottom, the author lists several subsystems that are not yet covered, including speculative decoding, multimodal systems, and multi-agent coordination.

The repository was created to address the gap between high-level "transformers are magic" explanations and academic papers that don't connect concepts to production system behavior.

📖 Read the full source: r/LocalLLaMA
