AI Coding Metrics Flawed: Codestrap Founders Warn of 3.7x Larger Codebase

Dorian Smiley and Connor Deeks, founders of AI advisory service Codestrap, argue that enterprise organizations are struggling to implement AI effectively because there's no established playbook for reference architectures or use cases. They contend that many companies are pretending to have AI strategies while lacking proper feedback loops to measure actual impact.

Problematic metrics and flawed outcomes

Smiley states that current AI coding evaluation focuses on the wrong metrics: "Lines of code, number of [pull requests], these are liabilities. These are not measures of engineering excellence." He identifies proper engineering metrics as deployment frequency, lead time to production, change failure rate, mean time to restore, and incident severity.
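As a rough illustration of how four of the metrics Smiley names could be computed, here is a minimal Python sketch. The `Deployment` record, its field names, and the `delivery_metrics` helper are hypothetical and not taken from Codestrap; incident severity is left out because it needs a severity scale the source does not define.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from statistics import mean
from typing import Optional


@dataclass
class Deployment:
    # Hypothetical record shape; field names are assumptions for illustration.
    committed_at: datetime                   # when the change was first committed
    deployed_at: datetime                    # when it reached production
    failed: bool = False                     # did it cause a production failure?
    restored_at: Optional[datetime] = None   # when service was restored, if it failed


def delivery_metrics(deployments: list[Deployment], window_days: int) -> dict[str, float]:
    """Summarize delivery metrics over an observation window of window_days days."""
    if not deployments or window_days <= 0:
        raise ValueError("need at least one deployment and a positive window")
    failures = [d for d in deployments if d.failed]
    restored = [d for d in failures if d.restored_at is not None]
    # Lead time: commit to production, in hours.
    lead_hours = [(d.deployed_at - d.committed_at).total_seconds() / 3600 for d in deployments]
    # Time to restore: failed deployment to recovery, in hours.
    restore_hours = [(d.restored_at - d.deployed_at).total_seconds() / 3600 for d in restored]
    return {
        "deploys_per_day": len(deployments) / window_days,
        "mean_lead_time_hours": mean(lead_hours),
        "change_failure_rate": len(failures) / len(deployments),
        "mean_time_to_restore_hours": mean(restore_hours) if restore_hours else 0.0,
    }


# Made-up sample data: four deployments over four days, one of which failed.
t0 = datetime(2025, 1, 1, 9, 0)
sample = [
    Deployment(t0, t0 + timedelta(hours=4)),
    Deployment(t0 + timedelta(days=1), t0 + timedelta(days=1, hours=8),
               failed=True, restored_at=t0 + timedelta(days=1, hours=10)),
    Deployment(t0 + timedelta(days=2), t0 + timedelta(days=2, hours=6)),
    Deployment(t0 + timedelta(days=3), t0 + timedelta(days=3, hours=6)),
]
metrics = delivery_metrics(sample, window_days=4)
print(metrics)
```

None of these figures depend on how many lines were written or how many pull requests were opened, which is the distinction Smiley draws.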

To illustrate the consequences of poor measurement, Smiley cites a recent attempt to rewrite SQLite in Rust using AI: "It passed all the unit tests, the shape of the code looks right. It's 3.7x more lines of code that performs 2,000 times worse than the actual SQLite. Two thousand times worse for a database is a non-viable product."

Foundational LLM limitations

Deeks points to fundamental problems with current LLM technology: "It's hard to teach them new facts. It's hard to reliably retrieve facts. The forward pass through the neural nets is non-deterministic, especially when you have reasoning models that engage an internal monologue to increase the efficiency of next token prediction, meaning you're going to get a different answer every time."

Smiley adds: "And they have no inductive reasoning capabilities. A model cannot check its own work. It doesn't know if the answer it gave you is right. Those are foundational problems no one has solved in LLM technology."

Proposed new measurement approach

The founders argue for developing new metrics specifically for AI-assisted engineering. Smiley suggests one potential metric: "measuring tokens burned to get to an approved pull request – a formally accepted change in software." He emphasizes that organizations need to experiment and iterate in feedback loops because "AI still doesn't work very well" even within coding contexts.
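One way the "tokens burned" idea might be made concrete is sketched below. The source does not define the metric precisely, so the `PullRequest` record, the `tokens_per_approved_pr` helper, and the choice to count tokens spent on rejected pull requests in the total are all assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass
class PullRequest:
    # Hypothetical record; the fields are illustrative assumptions.
    pr_id: str
    tokens_used: int   # total model tokens spent producing this change
    approved: bool     # whether the change was formally accepted


def tokens_per_approved_pr(prs: list[PullRequest]) -> float:
    """Total tokens spent (including on rejected work) divided by the approved PR count."""
    approved = sum(1 for pr in prs if pr.approved)
    if approved == 0:
        raise ValueError("no approved pull requests in this sample")
    total = sum(pr.tokens_used for pr in prs)
    return total / approved


# Made-up sample: two approved pull requests and one rejected attempt.
prs = [
    PullRequest("pr-1", tokens_used=120_000, approved=True),
    PullRequest("pr-2", tokens_used=300_000, approved=False),
    PullRequest("pr-3", tokens_used=80_000, approved=True),
]
cost = tokens_per_approved_pr(prs)
print(cost)
```

Charging rejected attempts to the total means wasted generation raises the figure, so the number tracks cost per accepted change and not raw output volume. A team could just as reasonably count only the tokens attached to approved pull requests.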

Deeks references recent Amazon and AWS outages as indicators of potential future problems, though Amazon has stated these incidents were unrelated to AI.

📖 Read the full source: HN AI Agents
