Arena AI Model ELO History Tracks LLM Performance Decay Over Time

Erwin Mayer's Arena AI Model ELO History (live tracker) plots historical ELO ratings from the LMSYS Arena leaderboard to expose performance trends of flagship AI models. The core insight: models that feel great at launch often degrade weeks later due to silent updates, quantization, or safety wrapper changes.
Key Features
- One curve per lab: Instead of a spaghetti chart of every variant, each major AI lab gets a single continuous line representing their highest-rated flagship model at any point in time.
- Flagship tracking logic: The curve sticks to the top-tier model (e.g., Opus stays active until a new higher-scoring model appears). Mid-tier releases like Sonnet don't cause a jump while Opus leads.
- Inference modes merged: Suffixes like -thinking, -reasoning, and -high are collapsed under the base model to avoid flip-flopping.
- New release markers: Releases are shown as labeled points, typically accompanied by score jumps.
- Degradation visible: Downward trends within a model's lifecycle between releases are clearly plotted.
- Mobile-friendly + dark mode included.
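The flagship and suffix-merging rules above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the tracker's actual code: the row layout `(date, lab, model, rating)`, the function names, and the rule "switch only when another model strictly outscores the current flagship" are all hypothetical, and the suffix list contains just the three examples named in the text.

```python
from collections import defaultdict

# Assumption: only the three suffixes mentioned in the article are merged.
MODE_SUFFIXES = ("-thinking", "-reasoning", "-high")


def base_model(name: str) -> str:
    """Strip trailing inference-mode suffixes so variants share one base name."""
    changed = True
    while changed:
        changed = False
        for suffix in MODE_SUFFIXES:
            if name.endswith(suffix):
                name = name[: -len(suffix)]
                changed = True
    return name


def flagship_curve(rows):
    """Build one curve per lab from (date, lab, model, rating) rows.

    Dates are ISO strings, so plain string sorting is chronological.
    Returns {lab: [(date, flagship_base_model, rating), ...]}.
    """
    # Best rating per base model for each (lab, date), after merging variants.
    best = defaultdict(dict)
    for date, lab, model, rating in rows:
        day = best[(lab, date)]
        base = base_model(model)
        day[base] = max(rating, day.get(base, float("-inf")))

    curves = defaultdict(list)
    current = {}  # lab -> base name of the current flagship
    for lab, date in sorted(best):
        day = best[(lab, date)]
        top = max(day, key=day.get)
        flagship = current.get(lab)
        # Keep the current flagship unless it disappeared or was outscored.
        if flagship is None or flagship not in day or day[top] > day[flagship]:
            flagship = top
            current[lab] = flagship
        curves[lab].append((date, flagship, day[flagship]))
    return dict(curves)


def lifecycle_drops(curve):
    """Rating change per flagship: first rating minus last (positive = decline)."""
    first, last = {}, {}
    for _date, model, rating in curve:
        first.setdefault(model, rating)
        last[model] = rating
    return {model: first[model] - last[model] for model in first}
```

With this rule, a lower-rated mid-tier release never moves the line, and a positive value from `lifecycle_drops` corresponds to the downward trend between releases that the chart makes visible.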
Data Source
Data is fetched automatically every day from the official LMSYS Arena dataset on Hugging Face. Arena ratings come from thousands of blind, crowdsourced human evaluations of models served through API endpoints, not consumer web UIs.
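For readers unfamiliar with how a blind head-to-head vote turns into a rating, here is a minimal sketch of the classic online Elo update. The function names, the K-factor of 32, and the 400-point scale are the conventional chess defaults, used here as assumptions; the text does not describe the leaderboard's actual fitting procedure, which may differ.

```python
def expected_score(r_a: float, r_b: float) -> float:
    """Probability that model A beats model B under the Elo model."""
    return 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400.0))


def elo_update(r_a: float, r_b: float, score_a: float, k: float = 32.0):
    """Return updated (r_a, r_b) after one vote.

    score_a is 1.0 if A was preferred, 0.0 if B was preferred, 0.5 for a tie.
    k is an assumed step size, not a value taken from the leaderboard.
    """
    delta = k * (score_a - expected_score(r_a, r_b))
    # Whatever A gains, B loses, so the total rating is conserved.
    return r_a + delta, r_b - delta
```

An upset (a lower-rated model winning) moves both ratings further than an expected result does, which is why a sustained run of lost votes shows up as a falling curve.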
Critical Blindspot: Web UI vs. API
The author acknowledges a key limitation: LMSYS tests raw API models. Consumer interfaces (chatgpt.com, gemini.google.com) add heavy system prompts and safety wrappers, and may silently switch to quantized models under load. The project is looking for historical ELO or evaluation datasets gathered from the actual web UIs, so it can capture the "nerfing" that users experience. PRs with such datasets are welcome (repo link in footer).
Who It’s For
Developers and researchers tracking LLM quality over time, especially those deploying AI agents that rely on consistent model behavior.
📖 Read the full source: HN LLM Tools