Comparison of 8 AI Coding Models on Real-World TypeScript Feature Implementation

✍️ OpenClawRadar | 📅 Published: March 15, 2026 | 🔗 Source

Real-World AI Coding Model Comparison

A developer conducted a practical comparison of 8 AI coding models by having them implement the same real-world feature in an existing TypeScript project. The goal was to move beyond synthetic benchmarks and see how models perform when working with actual codebases.

The Test Setup

The project used was OpenCode Telegram Bot, an open-source TypeScript bot built with the grammY framework that provides a Telegram interface to OpenCode capabilities. The bot has i18n support and existing test coverage.

The task was implementing a /rename command that renames the current working session. This feature touches all application layers and requires handling multiple edge cases. The original implementation had been reverted, providing a clean baseline for evaluation.
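To give a sense of the edge cases such a command has to cover, here is a minimal sketch of the decision logic behind /rename. It is not the project's actual implementation: the function name, the message keys, and the length limit are all hypothetical, and it assumes the new title is passed inline after the command.

```typescript
// Hypothetical message keys and limit; the article does not give the project's real names.
type RenameOutcome =
  | { kind: "rename"; sessionId: string; title: string }
  | { kind: "reply"; messageKey: "rename.no_session" | "rename.usage" | "rename.too_long" };

const MAX_TITLE_LENGTH = 100; // assumed limit, for illustration only

// Decides what to do with a message that was already routed as a /rename command.
function planRename(currentSessionId: string | undefined, messageText: string): RenameOutcome {
  // Edge case: there is no active session to rename.
  if (currentSessionId === undefined) {
    return { kind: "reply", messageKey: "rename.no_session" };
  }

  // Strip the command itself ("/rename" or "/rename@BotName") and keep the rest as the title.
  const title = messageText.replace(/^\/rename(?:@\w+)?\s*/, "").trim();

  // Edge case: the command was sent without a title.
  if (title.length === 0) {
    return { kind: "reply", messageKey: "rename.usage" };
  }

  // Edge case: the title exceeds the assumed limit.
  if (title.length > MAX_TITLE_LENGTH) {
    return { kind: "reply", messageKey: "rename.too_long" };
  }

  return { kind: "rename", sessionId: currentSessionId, title };
}
```

Keeping this logic in a pure function separates validation from the Telegram and backend calls, which makes each edge case straightforward to unit test. The "reply" branch returns a message key instead of text so that the bot's i18n layer can pick the wording.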

Each model received the same prompt in two phases: first in planning mode (studying the codebase and forming an implementation plan), then in coding mode. All testing was done using OpenCode with "thinking" mode and reasoning enabled.

Models Tested

  • Claude 4.6 Sonnet ($3.00 input/$15.00 output per 1M tokens)
  • Claude 4.6 Opus ($5.00/$25.00)
  • GLM 5 ($1.00/$3.20)
  • Kimi K2.5 ($0.60/$3.00)
  • MiniMax M2.5 ($0.30/$1.20)
  • GPT 5.3 Codex (high) ($1.75/$14.00)
  • GPT 5.4 (high) ($2.50/$15.00)
  • Gemini 3.1 Pro (high) ($2.00/$12.00)
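The per-token prices above translate into a task cost through simple arithmetic: tokens divided by one million, multiplied by the corresponding rate. The sketch below shows that calculation with the listed prices. The function name and the token counts in the usage note are hypothetical, and the calculation ignores any caching discounts or other pricing tiers a provider may apply.

```typescript
interface Pricing {
  inputPerMTok: number;  // USD per 1M input tokens
  outputPerMTok: number; // USD per 1M output tokens
}

// Prices as listed in the article.
const PRICING: Record<string, Pricing> = {
  "Claude 4.6 Sonnet": { inputPerMTok: 3.0, outputPerMTok: 15.0 },
  "Claude 4.6 Opus": { inputPerMTok: 5.0, outputPerMTok: 25.0 },
  "GLM 5": { inputPerMTok: 1.0, outputPerMTok: 3.2 },
  "Kimi K2.5": { inputPerMTok: 0.6, outputPerMTok: 3.0 },
  "MiniMax M2.5": { inputPerMTok: 0.3, outputPerMTok: 1.2 },
  "GPT 5.3 Codex (high)": { inputPerMTok: 1.75, outputPerMTok: 14.0 },
  "GPT 5.4 (high)": { inputPerMTok: 2.5, outputPerMTok: 15.0 },
  "Gemini 3.1 Pro (high)": { inputPerMTok: 2.0, outputPerMTok: 12.0 },
};

// Estimates the USD cost of a run from its total input and output token counts.
function estimateCost(model: string, inputTokens: number, outputTokens: number): number {
  if (!Object.prototype.hasOwnProperty.call(PRICING, model)) {
    throw new Error("Unknown model: " + model);
  }
  const p = PRICING[model];
  return (inputTokens / 1_000_000) * p.inputPerMTok
       + (outputTokens / 1_000_000) * p.outputPerMTok;
}
```

For a hypothetical run that consumes 2,000,000 input tokens and 100,000 output tokens, this gives $7.50 for Claude 4.6 Sonnet and $0.72 for MiniMax M2.5.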

Coding Index and Agentic Index data came from Artificial Analysis. All models were accessed through OpenCode Zen, a provider from the OpenCode team that tests models for compatibility with their tool.

Evaluation Methodology

Four metrics were used:

  • API cost ($) - Total cost of all API calls during the task, including sub-agents
  • Execution time (mm:ss) - Total model working time
  • Implementation correctness (0-10) - How well the behavior matches requirements and edge cases
  • Technical quality (0-10) - Engineering quality of the solution
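One way to hold these four metrics for each model is a small record type with a helper that renders the execution time in the mm:ss form used above. This is only an illustrative sketch; the field and function names are assumptions and do not come from the original experiment.

```typescript
// One row of results per model; field names are hypothetical.
interface RunResult {
  model: string;
  costUsd: number;         // total API cost, including sub-agents
  durationSeconds: number; // total model working time
  correctness: number;     // 0-10
  quality: number;         // 0-10
}

// Formats a duration in seconds as mm:ss (minutes are not capped at 59).
function formatDuration(totalSeconds: number): string {
  const s = Math.max(0, Math.round(totalSeconds));
  const minutes = Math.floor(s / 60);
  const seconds = s % 60;
  const pad = (n: number): string => (n < 10 ? "0" : "") + n;
  return pad(minutes) + ":" + pad(seconds);
}

// Checks that both rubric scores fall inside the 0-10 range.
function hasValidScores(r: RunResult): boolean {
  const inRange = (v: number): boolean => v >= 0 && v <= 10;
  return inRange(r.correctness) && inRange(r.quality);
}
```

For example, `formatDuration(754)` returns `"12:34"`.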

For correctness and quality scores, the existing /rename implementation was used to derive detailed evaluation criteria covering command integration, main flow, error handling, cancellation, i18n, documentation, architecture, state management, tests, and tech debt. Evaluation was performed by GPT 5.3 Codex against a structured rubric, and scores varied by no more than ±0.5 points across repeated runs.
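A stability check of this kind can be sketched as follows: collect the score from each evaluation run, take the mean, and confirm that no run deviates from it by more than 0.5 points. Reading "±0.5" as a maximum deviation from the mean is an assumption, as are the function and field names.

```typescript
interface ScoreSummary {
  mean: number;
  min: number;
  max: number;
  withinTolerance: boolean; // true if every run is within `tolerance` of the mean
}

// Summarizes repeated evaluator scores (0-10 scale) for a single model and metric.
function summarizeRuns(scores: number[], tolerance: number = 0.5): ScoreSummary {
  if (scores.length === 0) {
    throw new Error("at least one run is required");
  }
  let sum = 0;
  let min = scores[0];
  let max = scores[0];
  for (let i = 0; i < scores.length; i++) {
    sum += scores[i];
    if (scores[i] < min) min = scores[i];
    if (scores[i] > max) max = scores[i];
  }
  const mean = sum / scores.length;
  // The extremes bound every other run, so checking min and max is sufficient.
  const withinTolerance = max - mean <= tolerance && mean - min <= tolerance;
  return { mean, min, max, withinTolerance };
}
```

With hypothetical scores of 8.0, 8.5, and 8.0 the check passes, while scores of 6 and 8 fail it because each sits a full point from the mean of 7.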

Key Findings

GPT 5.4 (high) achieved the highest implementation correctness and is listed at 57 on the Agentic Index (given as "57 out of 69"). GLM 5 showed a strong cost-performance ratio at $1.00/$3.20 per 1M tokens, with a Coding Index of 53. The experiment suggests that inexpensive open-source models from China are approaching proprietary ones in practical coding tasks, though benchmark indices alone don't tell the full story.

📖 Read the full source: r/LocalLLaMA
