Local Translation Model Recommendations for 32GB VRAM GPUs

A developer with a 32GB VRAM GPU setup (specifically mentioning a 5090) shared practical findings on local translation models optimized for real-time subtitle and word/phrase translation. Their primary language pairs are Swedish-English and Korean-English.
Recommended Models
Based on testing for quality and speed:
- For overall languages: Unsloth Gemma3 27b Instruct UD, Q6_K_XL
- For European languages + 11 included (Korean among others): Bartowski Utter Project EuroLLM 22B Instruct 2512, Q8_0
The developer noted these outperformed previous go-to models: Magistral Small 2509 Q8, Gemma 3 27b Q4, Mistral Small 3.2 Q6_K, and GPT_OSS 20b (in that order).
Performance Notes
With these models, they achieved:
- Subtitle translations with little to no buffering
- Word-lookup translations within 0-2 seconds
Models That Were Too Slow
- Qwen3.5 27b Q6
- HyperCLOVAX SEED Think 32B Q6 (for Korean)
- Qwen3 32b Q6 (among other Qwen3-3.5 variants)
- Viking 33b I1 Q4_K_S
Other Observations
The developer mentioned TranslateGemma models, which they report are "significantly better according to Google than Gemma3 27b at translation," but noted these use user-user prompts rather than system-user format. They haven't tried them firsthand due to this format difference.
For Swedish translation specifically, GPT SW3 20b was noted as "good when it works, which is rarely (refuses to accept my system prompt)."
The developer also mentioned switching to trial Gemini 2.5 Flash and Gemini 2.5 Flash-lite not because local translations were bad, but because they were "still noticing some mistakes." They're debating between Deepseek, OpenAI, Gemini, z.AI, and Claude for cheap translations, with ChatGPT Thinking as their quality bar.
They noted some free API key options via: NVIDIA NIM, Routeway, Kilo, OpenCode, and Puter.js, though they haven't tried them. They did test GLM-4.7-Flash API directly from z.ai, finding it "pretty good, around Gemma 3 27b level or even better," but hit rate limits when doing word lookups on top of subtitle translations.
📖 Read the full source: r/LocalLLaMA
👀 See Also

DeepSeek-V4-Flash W4A16+FP8 with MTP Self-Speculation: 85 tok/s on 2x RTX PRO 6000 Max-Q
DeepSeek-V4-Flash quantized to W4A16+FP8 achieves 85.52 tok/s at 524k context on 2× RTX PRO 6000 Max-Q using a patched vLLM with retrofitted MTP head, up from 52.85 tok/s baseline.

Using Claude to analyze writing patterns for better custom instructions
A Reddit user describes a method for creating more effective custom instructions by having Claude analyze 10 writing samples to identify concrete patterns like punctuation avoidance and analogy sources, rather than relying on subjective tone descriptions.

Components of a Coding Agent: How Tools, Memory, and Context Extend LLMs
Sebastian Raschka breaks down the six building blocks of coding agents like Claude Code and Codex CLI, explaining how agent harnesses combine models with tools, memory, and repository context to make LLMs more effective for software work.

5 Core OpenClaw Capabilities Available Without Installing Skills
OpenClaw's base installation can handle file operations, shell commands, web fetching, scheduled tasks, and multi-step workflows without additional skills, reducing token costs and setup complexity.