Codestrap founders critique AI coding metrics and warn of quality issues

Dorian Smiley and Connor Deeks, founders of AI advisory service Codestrap, argue that enterprises are struggling to implement AI effectively because no established playbook of reference architectures or use cases exists. They contend that many companies are pretending to have AI strategies while lacking the feedback loops needed to measure actual impact.
Problematic metrics and flawed outcomes
Smiley states that current AI coding evaluation focuses on the wrong metrics: "Lines of code, number of [pull requests], these are liabilities. These are not measures of engineering excellence." He identifies proper engineering metrics as deployment frequency, lead time to production, change failure rate, mean time to restore, and incident severity.
To illustrate the consequences of poor measurement, Smiley cites a recent attempt to rewrite SQLite in Rust using AI: "It passed all the unit tests, the shape of the code looks right. It's 3.7x more lines of code that performs 2,000 times worse than the actual SQLite. Two thousand times worse for a database is a non-viable product."
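As a rough illustration, four of the delivery metrics Smiley lists can be derived from a simple log of deployments. The sketch below is not from Codestrap; the `Deployment` record, its field names, and the `delivery_metrics` function are hypothetical, and incident severity is left out because it needs a separate incident record.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Optional


@dataclass
class Deployment:
    """Hypothetical record of one production deployment."""
    committed_at: datetime                  # when the change was committed
    deployed_at: datetime                   # when it reached production
    failed: bool = False                    # did it cause a failure in production?
    restored_at: Optional[datetime] = None  # when service was restored, if it failed


def delivery_metrics(deployments: list[Deployment], window_days: float) -> dict:
    """Summarize delivery metrics over an observation window of window_days."""
    if not deployments:
        raise ValueError("need at least one deployment")
    n = len(deployments)

    # Lead time to production: commit to deploy, in seconds.
    lead_seconds = [
        (d.deployed_at - d.committed_at).total_seconds() for d in deployments
    ]

    # Time to restore: deploy to restoration, for failed deployments only.
    failures = [d for d in deployments if d.failed]
    restore_seconds = [
        (d.restored_at - d.deployed_at).total_seconds()
        for d in failures
        if d.restored_at is not None
    ]

    return {
        "deployments_per_day": n / window_days,
        "mean_lead_time_hours": sum(lead_seconds) / n / 3600,
        "change_failure_rate": len(failures) / n,
        "mean_time_to_restore_hours": (
            sum(restore_seconds) / len(restore_seconds) / 3600
            if restore_seconds
            else None
        ),
    }
```

None of these figures depend on how many lines of code or pull requests were produced, which is the distinction Smiley draws.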
Foundational LLM limitations
Deeks points to fundamental problems with current LLM technology: "It's hard to teach them new facts. It's hard to reliably retrieve facts. The forward pass through the neural nets is non-deterministic, especially when you have reasoning models that engage an internal monologue to increase the efficiency of next token prediction, meaning you're going to get a different answer every time."
Smiley adds: "And they have no inductive reasoning capabilities. A model cannot check its own work. It doesn't know if the answer it gave you is right. Those are foundational problems no one has solved in LLM technology."
Proposed new measurement approach
The founders argue for developing new metrics specifically for AI-assisted engineering. Smiley suggests one potential metric: "measuring tokens burned to get to an approved pull request – a formally accepted change in software." He emphasizes that organizations need to experiment and iterate in feedback loops because "AI still doesn't work very well" even within coding contexts.
Deeks references recent Amazon and AWS outages as indicators of potential future problems, though Amazon has stated these incidents were unrelated to AI.
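The "tokens burned to get to an approved pull request" metric Smiley suggests could be computed along the following lines. This is a minimal sketch under stated assumptions: the `PullRequest` record and its fields are hypothetical, and the choice to charge tokens spent on rejected pull requests against the approved ones is an interpretation, not something the founders specify.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class PullRequest:
    """Hypothetical record of one AI-assisted pull request."""
    pr_id: str
    tokens_used: int  # total model tokens consumed while producing this PR
    approved: bool    # was the change formally accepted?


def tokens_per_approved_pr(prs: list[PullRequest]) -> Optional[float]:
    """Return total tokens spent divided by the number of approved PRs.

    Assumption: tokens spent on rejected PRs still count as cost, since
    they were burned on the way to the changes that were accepted.
    Returns None when nothing was approved.
    """
    total_tokens = sum(pr.tokens_used for pr in prs)
    approved_count = sum(1 for pr in prs if pr.approved)
    if approved_count == 0:
        return None
    return total_tokens / approved_count
```

Tracked over time, a falling value would suggest that less model effort is needed per accepted change, which is the kind of feedback loop the founders argue organizations currently lack.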
📖 Read the full source: HN AI Agents