TranslateGemma-12b: Human Review Flags 71% of Translations; Automated Metrics Flag 1.2%

A follow-up audit of TranslateGemma-12b subtitle translations suggests that automated metrics substantially undercount real-world errors. The original benchmark showed the model beating frontier general models (Claude Sonnet, GPT-5.4, DeepSeek, Gemini Flash Lite) across 6 languages. To check that result, the team added human review.

Setup
- 21 English subtitle segments from one tutorial video
- TranslateGemma-12b translated into 4 languages: ES, JA, TH, ZH-CN (Korean and Traditional Chinese dropped)
- 84 translations total, preselected as scoring well on automated metrics
- Every translation sent to human MQM review

Results
Under the dashboard's own red-flag threshold (MX ≥ 5 OR CK < 0.70):
- Auto-flagged: 1/84 (1.2%)
- Human-flagged (any): 60/84 (71%)
- Human-flagged (Major): 13/84 (15%)
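A minimal sketch of how a red-flag rule of this shape could be applied per segment. Only the two thresholds come from the write-up. The `SegmentScores` type, its field names, and the example scores are hypothetical. CK presumably abbreviates COMETKiwi, which is cited in the per-language notes, while the write-up does not spell out MX.

```python
from dataclasses import dataclass

# Thresholds quoted in the write-up: MX >= 5 OR CK < 0.70.
MX_RED_FLAG = 5.0
CK_RED_FLAG = 0.70


@dataclass
class SegmentScores:
    """Automated scores for one translated segment (field names are hypothetical)."""
    mx: float  # "MX" score; higher is worse under this rule
    ck: float  # "CK" score; lower is worse under this rule


def is_auto_flagged(scores: SegmentScores) -> bool:
    """Return True when either metric crosses its red-flag threshold."""
    return scores.mx >= MX_RED_FLAG or scores.ck < CK_RED_FLAG


# Made-up scores for illustration only; these are not values from the audit.
examples = [
    SegmentScores(mx=1.2, ck=0.86),  # passes both checks
    SegmentScores(mx=5.0, ck=0.90),  # MX exactly at the threshold, so flagged
    SegmentScores(mx=2.0, ck=0.69),  # CK just below the threshold, so flagged
]
flags = [is_auto_flagged(s) for s in examples]
print(flags)
```

Because the rule is an OR, a segment escapes the flag only when both scores look healthy, which is the situation the audit describes for 83 of the 84 translations.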
Per language:
- ES: 0/21 auto, 11/21 human-flagged, 2/21 Major. Mostly tone inconsistencies (formal/informal switches); the easiest of the four languages.
- JA: 0/21 auto, 17/21 human-flagged, 3/21 Major. A “fluent but wrong meaning” pattern: JA accounts for 10 of the 15 mistranslations in the dataset. A high COMETKiwi mean (0.86) masked the errors. The same failure mode was seen in Claude Sonnet 4.6 on JA.
- TH: 0/21 auto, 17/21 human-flagged, 5/21 Major. Over-production: 5 Accuracy/Addition errors (content inserted that is not in the source), plus punctuation errors from English-style periods.
- ZH-CN: 1/21 auto (a Style error), 15/21 human-flagged, 3/21 Major. Issues include an omission of “store” that changes the meaning and inconsistent translation of “ticket” across segments.
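The per-language counts above can be cross-checked against the headline totals with a few lines of arithmetic. This is only a sanity-check sketch; the counts are copied from the list above, and the variable names are illustrative.

```python
# Counts per language as reported above:
# (auto-flagged, human-flagged, human-flagged Major), each out of 21 segments.
SEGMENTS_PER_LANGUAGE = 21
counts = {
    "ES": (0, 11, 2),
    "JA": (0, 17, 3),
    "TH": (0, 17, 5),
    "ZH-CN": (1, 15, 3),
}


def pct(numerator: int, denominator: int) -> float:
    """Percentage of numerator over denominator."""
    return 100.0 * numerator / denominator


total = SEGMENTS_PER_LANGUAGE * len(counts)
auto_total = sum(c[0] for c in counts.values())
human_total = sum(c[1] for c in counts.values())
major_total = sum(c[2] for c in counts.values())

print(f"translations:  {total}")
print(f"auto-flagged:  {auto_total}/{total} ({pct(auto_total, total):.1f}%)")
print(f"human-flagged: {human_total}/{total} ({pct(human_total, total):.1f}%)")
print(f"human Major:   {major_total}/{total} ({pct(major_total, total):.1f}%)")
```

The sums come out to 1, 60, and 13 out of 84, matching the totals reported under Results.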
Of the 25 Accuracy-class errors (mistranslation, omission, addition, untranslated), all fell in the metric-blind quadrant, meaning human reviewers flagged them but the automated threshold did not. The metrics caught zero accuracy errors.
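One way to picture the quadrants is as a 2x2 split on the automated flag and the human flag. The sketch below assumes that "metric-blind" means human-flagged but not auto-flagged; the other three labels and the sample records are hypothetical and are not taken from the audit.

```python
from collections import Counter


def quadrant(auto_flagged: bool, human_flagged: bool) -> str:
    """Place one translation in a 2x2 grid of automated flag vs. human flag."""
    if human_flagged and not auto_flagged:
        return "metric-blind"   # humans saw a problem, the metrics did not
    if human_flagged and auto_flagged:
        return "both-flagged"   # hypothetical label
    if auto_flagged:
        return "metric-only"    # hypothetical label
    return "clean"              # hypothetical label


# Made-up (auto_flagged, human_flagged) pairs for illustration only.
records = [(False, True), (False, True), (True, True), (False, False)]
tally = Counter(quadrant(auto, human) for auto, human in records)
print(dict(tally))
```

Under this reading, the share of human-flagged items that land in "metric-blind" is the miss rate of the automated threshold, which the audit reports as 25 out of 25 for Accuracy-class errors.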

Takeaway
This is a small audit of one model on one content set, so the numbers are directional. Still, the pattern is clear: automated metrics alone miss the majority of real translation issues, especially accuracy errors. For production subtitle work, human review remains essential.
📖 Read the full source: r/LocalLLaMA