TranslateGemma-12b: Human Review Flags 71% of Translations, Automated Metrics Flag 1.2%

A follow-up audit of TranslateGemma-12b subtitle translations suggests that automated metrics substantially undercount real-world errors. The original benchmark showed the model beating frontier general-purpose models (Claude Sonnet, GPT-5.4, DeepSeek, Gemini Flash Lite) across 6 languages. To check that result, the team added human review.

Setup

21 English subtitle segments from one tutorial video
Translated by TranslateGemma-12b into 4 languages: ES, JA, TH, ZH-CN (Korean and Traditional Chinese were dropped)
84 translations total (21 segments × 4 languages), preselected as scoring well on automated metrics
Every translation sent to human MQM (Multidimensional Quality Metrics) review

Results

Under the dashboard's own red-flag threshold (MX ≥ 5 OR CK < 0.70):

Auto-flagged: 1/84 (1.2%)
Human-flagged (any): 60/84 (71%)
Human-flagged (Major): 13/84 (15%)
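
The threshold rule and the flag rates above can be sketched in a few lines of Python. This is a minimal illustration, not the dashboard's actual code: the function and parameter names are hypothetical, and it assumes MX is an error-style score where higher is worse and CK is the COMETKiwi quality estimate where higher is better.

```python
def is_auto_flagged(mx: float, ck: float,
                    mx_max: float = 5.0, ck_min: float = 0.70) -> bool:
    """Red-flag rule as quoted in the article: MX >= 5 OR CK < 0.70.

    Assumption: mx is higher-is-worse, ck is higher-is-better.
    """
    return mx >= mx_max or ck < ck_min


def flag_rate(flagged: int, total: int) -> float:
    """Share of flagged translations, as a percentage."""
    return 100.0 * flagged / total


# Counts reported in the article (84 translations in total).
print(f"auto-flagged:  {flag_rate(1, 84):.1f}%")
print(f"human-flagged: {flag_rate(60, 84):.1f}%")
print(f"human Major:   {flag_rate(13, 84):.1f}%")
```

Because the rule is an OR, a translation escapes the flag only when both scores look healthy, which is the situation for 83 of the 84 translations here.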

Per language:

ES: 0/21 auto, 11/21 human-flagged, 2/21 Major. Mostly tone inconsistencies (formal/informal switches); the least problematic of the four languages.
JA: 0/21 auto, 17/21 human-flagged, 3/21 Major. Shows a “fluent but wrong meaning” pattern and accounts for 10 of the 15 mistranslations in the dataset. A high COMETKiwi score (0.86 mean) masked these errors. The same failure mode was seen in Claude Sonnet 4.6 on JA.
TH: 0/21 auto, 17/21 human-flagged, 5/21 Major. The pattern is over-production: 5 Accuracy/Addition errors (content inserted that is not in the source), plus punctuation errors from English-style periods.
ZH-CN: 1/21 auto (a Style error), 15/21 human-flagged, 3/21 Major. Issues include an omission of “store” that changes the meaning and inconsistent translation of “ticket” across segments.
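
As a quick consistency check, the per-language counts can be tallied and compared against the headline totals. The sketch below uses only the numbers reported in this article; the table layout and helper names are illustrative assumptions.

```python
SEGMENTS_PER_LANGUAGE = 21

# Per-language counts as reported above.
PER_LANGUAGE = {
    "ES":    {"auto": 0, "human": 11, "major": 2},
    "JA":    {"auto": 0, "human": 17, "major": 3},
    "TH":    {"auto": 0, "human": 17, "major": 5},
    "ZH-CN": {"auto": 1, "human": 15, "major": 3},
}


def totals(table: dict) -> dict:
    """Sum each count column across all languages."""
    return {
        key: sum(row[key] for row in table.values())
        for key in ("auto", "human", "major")
    }


def human_flag_rates(table: dict, n: int = SEGMENTS_PER_LANGUAGE) -> dict:
    """Human-flagged share per language, as a whole-number percentage."""
    return {lang: round(100 * row["human"] / n) for lang, row in table.items()}


print(totals(PER_LANGUAGE))
print(human_flag_rates(PER_LANGUAGE))
```

The sums come out to 1 auto-flagged, 60 human-flagged, and 13 Major across the four languages, matching the overall figures.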

All 25 Accuracy-class errors (mistranslation, omission, addition, untranslated) fell in the metric-blind quadrant: human reviewers flagged them, and the automated threshold did not. The metrics caught zero accuracy errors.

Takeaway

This is a small audit of one model on one content set, so the numbers are directional. But the pattern is clear: automated metrics alone miss the majority of real translation issues, especially accuracy errors. For production subtitle work, human review remains essential.

📖 Read the full source: r/LocalLLaMA
