Meeting Summarization on a 6GB GPU: qwen3.5:0.8B Works at 57s, Granite 4 350M Hallucinates

VoiceFlow is an open-source (MIT) dictation and transcription tool that runs completely locally — the only network call is an optional LLM summary endpoint (Ollama, llama.cpp, Groq, OpenAI). v1.6.0, released today, adds a meeting recorder: mic + system audio mixed into a stereo file, transcribed by faster-whisper, then summarized by any endpoint you configure.
Benchmark: Sub-1B Models on Real Meeting Transcripts
On an RTX 3060 Laptop 6GB (~4.3GB free after Whisper loads; Ollama 0.23, Arch Linux), with a real 4-minute meeting transcript (~2900 chars):
- qwen3.5:0.8B (873M, Q8_0): the default num_ctx (4096) got eaten by thinking tokens. Fix (Modelfile):
    FROM qwen3.5:0.8b
    PARAMETER num_ctx 16384
  After the fix: a 1562-char structured summary (TL;DR, decisions, action items, open questions) in 57 seconds, using 2.2GB VRAM. Works.
- Granite 4.0 350M: faster (0.6–2.8s per summary), properly structured output, but hallucinated badly. On a transcript about Anthropic acquiring Bun, it returned “Anthropic's acquisition by Anthropic” and invented Binance. On another meeting, it produced a Star Trek bridge log (“Starship Cassiopeia”). Keywords were present but relationships scrambled.
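The larger context window can also be requested per call instead of baking it into a Modelfile, since Ollama's /api/generate endpoint accepts an options object that includes num_ctx. The following is a minimal sketch, not VoiceFlow's actual code: the prompt wording and the function names are hypothetical, and the URL assumes Ollama is listening on its default local port.

```python
import json
import urllib.request

# Assumption: Ollama is running locally on its default port.
OLLAMA_URL = "http://localhost:11434/api/generate"

# Hypothetical prompt; the post does not show the prompt VoiceFlow uses.
PROMPT_TEMPLATE = (
    "Summarize this meeting transcript with sections for TL;DR, decisions, "
    "action items, and open questions.\n\n{transcript}"
)


def build_payload(transcript, model="qwen3.5:0.8b", num_ctx=16384):
    """Build a non-streaming /api/generate request body with a wider context."""
    return {
        "model": model,
        "prompt": PROMPT_TEMPLATE.format(transcript=transcript),
        "stream": False,                  # return one JSON object, not a stream
        "options": {"num_ctx": num_ctx},  # leaves room for thinking tokens
    }


def summarize(transcript, **kwargs):
    """Send the transcript to Ollama and return the generated summary text."""
    body = json.dumps(build_payload(transcript, **kwargs)).encode("utf-8")
    request = urllib.request.Request(
        OLLAMA_URL,
        data=body,
        headers={"Content-Type": "application/json"},
    )
    # Generous timeout: the post reports 57 seconds for a 4-minute meeting.
    with urllib.request.urlopen(request, timeout=300) as response:
        return json.loads(response.read())["response"]
```

Calling summarize(transcript_text) needs a running Ollama instance with the model pulled. A larger num_ctx costs more VRAM, so the value is worth tuning against the memory left after Whisper loads.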
Conclusion: qwen3.5:0.8B is the working floor for local meeting summarization; nothing sub-500M has produced coherent output on real conversational data yet.
Free Cloud Option: Groq's llama-3.3-70B
Groq's free tier on llama-3.3-70B returns summaries in about 2 seconds, and its output is “tighter” than the local 0.8B model's. The only failure was a 4-hour transcript that exceeded their context window. For most meeting lengths, it's a solid free alternative.
The Open Question: Long-Context Summarization on Low VRAM
The author asks the community: for 1-2 hour transcripts (~30K–60K tokens) on a 6-8GB GPU, what works? Options: wider context (eating VRAM), chunked map-reduce, or a different small model that holds structure on long inputs — without needing 24GB.
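Of the options listed, chunked map-reduce is the easiest to sketch. The code below is one possible shape, not something the author reports having tested: it splits the transcript into overlapping chunks, summarizes each one, then summarizes the joined partial summaries. The summarize argument is any callable that turns text into a summary. The chunk size and overlap are illustrative guesses measured in characters, not tokens.

```python
def split_into_chunks(text, max_chars=8000, overlap=400):
    """Split text into overlapping chunks, breaking on newlines where possible."""
    if not 0 <= overlap < max_chars:
        raise ValueError("overlap must be non-negative and smaller than max_chars")
    chunks = []
    start = 0
    while start < len(text):
        end = min(start + max_chars, len(text))
        if end < len(text):
            # Prefer a newline boundary so speaker turns stay intact.
            # Searching past start + overlap guarantees forward progress.
            newline = text.rfind("\n", start + overlap + 1, end)
            if newline != -1:
                end = newline
        chunks.append(text[start:end])
        if end == len(text):
            break
        # Step back so context at the boundary appears in both chunks.
        start = end - overlap
    return chunks


def map_reduce_summarize(text, summarize, max_chars=8000, overlap=400):
    """Summarize each chunk (map), then summarize the joined results (reduce)."""
    chunks = split_into_chunks(text, max_chars, overlap)
    if not chunks:
        return ""
    if len(chunks) == 1:
        return summarize(chunks[0])
    partials = [summarize(chunk) for chunk in chunks]
    # Single reduce pass; assumes the joined partials fit in the context window.
    return summarize("\n\n".join(partials))
```

Each call sees only one chunk, so VRAM use stays at the single-chunk level. The trade-off is that decisions or action items spanning a chunk boundary can be split or duplicated, which the overlap only partly mitigates.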
VoiceFlow ships as a single .exe (Windows) or .AppImage (Linux), built with Pyloid + React + faster-whisper + SQLite. CUDA auto-detect with CPU fallback. Onboarding (model, mic, hotkey) takes ~1 minute.
📖 Read the full source: r/LocalLLaMA