Benchmark shows smaller 4B model outperforms larger LLMs for phone-to-home chat applications

Phone-to-home chat benchmark results
A recent benchmark evaluated 8 local LLMs for phone-to-home chat, a setup in which inference runs on a home computer. The test comprised 640 evaluations (8 models × 8 datasets × 10 samples) on a Mac mini M4 Pro with 24 GB of memory.
Fitness formula and weighting
The composite fitness formula weighted three factors: 50% chat UX, 30% speed, and 20% shortform quality. This weighting prioritizes user experience for mobile applications where latency matters most.
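The weighting above can be sketched as a simple weighted sum. This is a minimal illustration, not the author's actual scoring code: the three weights come from the article, while the factor names, the 0-100 scale for each sub-score, and the example values are assumptions made for demonstration.

```python
# Weights as reported in the article. The key names are hypothetical labels.
WEIGHTS = {"chat_ux": 0.50, "speed": 0.30, "shortform_quality": 0.20}


def composite_fitness(scores, weights=WEIGHTS):
    """Return the weighted sum of per-factor scores.

    Assumes every sub-score is already normalized to a 0-100 scale;
    the article does not say how the benchmark normalizes them.
    """
    if abs(sum(weights.values()) - 1.0) > 1e-9:
        raise ValueError("weights must sum to 1")
    missing = set(weights) - set(scores)
    if missing:
        raise KeyError(f"missing sub-scores: {sorted(missing)}")
    return sum(weights[name] * scores[name] for name in weights)


# Hypothetical sub-scores for illustration only (not benchmark data).
example = {"chat_ux": 90.0, "speed": 80.0, "shortform_quality": 70.0}
print(composite_fitness(example))
```

Because chat UX carries half of the total weight, a ten-point change in that factor moves the composite by five points, while the same change in shortform quality moves it by only two.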
Key findings
- Gemma3:4B won with a composite fitness score of 88.7 despite being the smallest model tested
- It achieved the lowest TTFT (11.2s), highest throughput (89.3 tok/s), and coolest thermals (45°C)
- Larger models like GPT-OSS:20B passed 70% of tasks but ranked 6th due to 25.4s mean TTFT
- Thermal performance varied significantly: Qwen3:14B peaked at 83°C, DeepSeek-R1:14B at 81°C
- Magistral:24B was excluded from final ranking after triggering timeout loops and reaching 97°C GPU temperature
Why smaller models performed better
Under this weighting, a short time to first token (TTFT) and a low thermal load count for more than raw accuracy in phone chat applications. A model that scores 77.5% on accuracy but takes 25s to produce its first token loses to one that scores 72.5% but responds in 11s. The thermal gap also matters for the reliability and longevity of personal hardware.
Independent analysis
An independent analysis using Claude on the same 640-evaluation dataset weighted reliability and TTFT more heavily and reached a slightly different top-4 order, which illustrates that KPI weighting is a choice rather than ground truth.
Use case considerations
The author notes that for different use cases like coding or long-form writing, the weighting formula would flip entirely, prioritizing quality over speed and chat UX.
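The sensitivity of a ranking to its weights can be shown with a small sketch. Everything here is hypothetical: the two model names, their sub-scores, and the long-form weighting are invented for illustration, and only the 50/30/20 chat weighting comes from the article.

```python
def rank(models, weights):
    """Rank model names by weighted score, best first.

    Weights are normalized by their sum, so they need not add up to 1.
    """
    total = sum(weights.values())
    scored = {
        name: sum(weights[k] * subscores[k] for k in weights) / total
        for name, subscores in models.items()
    }
    return sorted(scored, key=scored.get, reverse=True)


# Hypothetical sub-scores on an assumed 0-100 scale (not benchmark data).
models = {
    "small-fast": {"chat_ux": 85, "speed": 95, "quality": 70},
    "large-slow": {"chat_ux": 80, "speed": 50, "quality": 92},
}

chat_weights = {"chat_ux": 0.5, "speed": 0.3, "quality": 0.2}     # from the article
writing_weights = {"chat_ux": 0.2, "speed": 0.1, "quality": 0.7}  # assumed for illustration

print(rank(models, chat_weights))     # small-fast ranks first
print(rank(models, writing_weights))  # large-slow ranks first
```

The same measurements produce opposite orderings, so any published ranking should be read together with the weights behind it.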
📖 Read the full source: r/LocalLLaMA