Benchmark shows smaller 4B model outperforms larger LLMs for phone-to-home chat applications

✍️ OpenClawRadar · 📅 Published: April 20, 2026 · 🔗 Source

Phone-to-home chat benchmark results

A recent benchmark evaluated 8 local LLMs for phone-to-home chat applications, in which inference runs on a home computer. The test comprised 640 evaluations (8 models × 8 datasets × 10 samples) on a Mac mini M4 Pro with 24 GB of memory.

Fitness formula and weighting

The composite fitness formula weighted three factors: 50% chat UX, 30% speed, and 20% shortform quality. This weighting prioritizes user experience for mobile applications where latency matters most.
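The weighting above can be sketched as a plain weighted sum. This is a minimal illustration, not the benchmark author's code: the article gives only the three weights, so the 0-100 scale for each sub-score, the key names, and the example values are all assumptions.

```python
# Weights as reported in the article: 50% chat UX, 30% speed, 20% shortform quality.
WEIGHTS = {"chat_ux": 0.5, "speed": 0.3, "shortform_quality": 0.2}


def composite_fitness(scores: dict[str, float]) -> float:
    """Weighted sum of sub-scores, each ASSUMED to be normalized to 0-100."""
    missing = WEIGHTS.keys() - scores.keys()
    if missing:
        raise KeyError(f"missing sub-scores: {sorted(missing)}")
    return sum(WEIGHTS[name] * scores[name] for name in WEIGHTS)


# Hypothetical sub-scores, for illustration only (not benchmark data).
example = {"chat_ux": 80.0, "speed": 90.0, "shortform_quality": 70.0}
print(round(composite_fitness(example), 1))
```

With these made-up inputs the sum is 0.5 × 80 + 0.3 × 90 + 0.2 × 70 = 81. How the benchmark actually derives each sub-score from raw measurements such as TTFT or throughput is not stated in the article.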

Key findings

Gemma3:4B won with a composite fitness score of 88.7 despite being the smallest model tested
It achieved the lowest TTFT (11.2s), highest throughput (89.3 tok/s), and coolest thermals (45°C)
Larger models like GPT-OSS:20B passed 70% of tasks but ranked 6th due to 25.4s mean TTFT
Thermal performance varied significantly: Qwen3:14B peaked at 83°C, DeepSeek-R1:14B at 81°C
Magistral:24B was excluded from final ranking after triggering timeout loops and reaching 97°C GPU temperature

Why smaller models performed better

The benchmark suggests that, for phone chat applications, a faster first-token response (TTFT) and lower thermal load matter more than raw accuracy. Under this weighting, a model that scores 77.5% accuracy but takes 25 s to produce its first token loses to one that scores 72.5% but responds in 11 s. The thermal gap also matters for the reliability and longevity of personal hardware.
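To make the trade-off concrete, here is a rough sketch of how a 5-point accuracy lead can be outweighed by a 14-second TTFT penalty. Only the accuracy and TTFT figures come from the article. The linear TTFT-to-score mapping, the 30 s cutoff, and the simplified 70/30 split between accuracy and speed are hypothetical choices made for this example, not the benchmark's actual calculation.

```python
def speed_score_from_ttft(ttft_s: float, worst_s: float = 30.0) -> float:
    """HYPOTHETICAL mapping: 100 at 0 s, falling linearly to 0 at `worst_s`."""
    fraction = 1.0 - ttft_s / worst_s
    return 100.0 * max(0.0, min(1.0, fraction))


def simplified_score(accuracy_pct: float, ttft_s: float) -> float:
    """ASSUMED two-term score: 70% accuracy, 30% speed (illustrative only)."""
    return 0.7 * accuracy_pct + 0.3 * speed_score_from_ttft(ttft_s)


# Accuracy and TTFT values are the ones quoted in the article.
slow_but_accurate = simplified_score(accuracy_pct=77.5, ttft_s=25.0)
fast_but_weaker = simplified_score(accuracy_pct=72.5, ttft_s=11.0)

print(f"77.5% accuracy, 25 s TTFT: {slow_but_accurate:.2f}")
print(f"72.5% accuracy, 11 s TTFT: {fast_but_weaker:.2f}")
```

Under these assumptions the faster model comes out ahead, because its speed term gains more than the 3.5 points it gives up on accuracy. A different cutoff or a non-linear latency penalty would change the margin, and could change the winner.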

Independent analysis

An independent analysis using Claude on the same 640-evaluation dataset weighted reliability and TTFT more aggressively and reached a slightly different top-4 order, confirming that KPI weighting is a choice rather than ground truth.

Use case considerations

The author notes that for different use cases like coding or long-form writing, the weighting formula would flip entirely, prioritizing quality over speed and chat UX.
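The effect of changing the weights can be sketched by ranking the same sub-scores under two profiles. In this illustration only the chat profile (50/30/20) comes from the article. The quality-first profile, the two model names, and all of their sub-scores are hypothetical.

```python
# Chat profile from the article; the quality-first profile is a HYPOTHETICAL
# example of weights that favor output quality over speed and chat UX.
CHAT_PROFILE = {"chat_ux": 0.5, "speed": 0.3, "shortform_quality": 0.2}
QUALITY_FIRST_PROFILE = {"chat_ux": 0.1, "speed": 0.2, "shortform_quality": 0.7}

# Made-up sub-scores on an assumed 0-100 scale (not benchmark data).
MODELS = {
    "small_fast": {"chat_ux": 90.0, "speed": 95.0, "shortform_quality": 65.0},
    "large_slow": {"chat_ux": 70.0, "speed": 40.0, "shortform_quality": 92.0},
}


def weighted_score(scores: dict[str, float], weights: dict[str, float]) -> float:
    """Weighted sum of the sub-scores named in `weights`."""
    return sum(weights[name] * scores[name] for name in weights)


def rank(models: dict[str, dict[str, float]], weights: dict[str, float]) -> list[str]:
    """Return model names ordered from highest to lowest weighted score."""
    return sorted(models, key=lambda m: weighted_score(models[m], weights), reverse=True)


print("chat profile:         ", rank(MODELS, CHAT_PROFILE))
print("quality-first profile:", rank(MODELS, QUALITY_FIRST_PROFILE))
```

With these invented numbers the small model leads under the chat profile and the large model leads under the quality-first profile, which is the kind of reversal the author describes.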

📖 Read the full source: r/LocalLLaMA
