6GB GPU Meeting Summarization: Qwen3.5 0.8B vs Granite 4 350M

VoiceFlow is an open-source (MIT) dictation and transcription tool that runs completely locally — the only network call is an optional LLM summary endpoint (Ollama, llama.cpp, Groq, OpenAI). v1.6.0, released today, adds a meeting recorder: mic + system audio mixed into a stereo file, transcribed by faster-whisper, then summarized by any endpoint you configure.

Benchmark: Sub-1B Models on Real Meeting Transcripts

On a RTX 3060 Laptop 6GB (~4.3GB free after Whisper loads, Ollama 0.23, Arch Linux), with a real 4-minute meeting transcript (~2900 chars):

qwen3.5:0.8B (873M, Q8_0) — default num_ctx (4096) got eaten by thinking tokens. Fix:
```
FROM qwen3.5:0.8b
PARAMETER num_ctx 16384
```
After fix: 1562-char structured summary (TL;DR, decisions, action items, open questions) in 57 seconds, using 2.2GB VRAM. Works.
Granite 4.0 350M — faster (0.6–2.8s per summary), properly structured output, but hallucinated badly: on a transcript about Anthropic acquiring Bun, it returned “Anthropic's acquisition by Anthropic” and invented Binance. On another meeting, it produced a Star Trek bridge log (“Starship Cassiopeia”). Keywords were present but relationships scrambled.

Conclusion: qwen3.5:0.8B is the working floor for local meeting summarization; nothing sub-500M has produced coherent output on real conversational data yet.

Free Cloud Option: Groq's llama-3.3-70B

Groq's free tier on llama-3.3-70B gives ~2 second summaries, output “tighter” than the local 0.8B. Only failure was a 4-hour transcript exceeding their context window. For most meeting lengths, it's a solid free alternative.

The Open Question: Long-Context Summarization on Low VRAM

The author asks the community: for 1-2 hour transcripts (~30K–60K tokens) on a 6-8GB GPU, what works? Options: wider context (eating VRAM), chunked map-reduce, or a different small model that holds structure on long inputs — without needing 24GB.

VoiceFlow ships as a single .exe (Windows) or .AppImage (Linux), built with Pyloid + React + faster-whisper + SQLite. CUDA auto-detect with CPU fallback. Onboarding (model, mic, hotkey) takes ~1 minute.

📖 Read the full source: r/LocalLLaMA

Meeting Summarization on a 6GB GPU: qwen3.5:0.8B Works at 57s, Granite 4 350M Hallucinates

Benchmark: Sub-1B Models on Real Meeting Transcripts

Free Cloud Option: Groq's llama-3.3-70B

The Open Question: Long-Context Summarization on Low VRAM

👀 See Also

Claude Code Used to Simulate 4,000+ Blind Werewolf Games with LLMs

Claude Code Plugin 'nice-figures' Creates Research-Blog Style Matplotlib Plots

LLM Agent Builds Complete Godot 4 Dungeon Crawler Using Visual Feedback

Your Agent Said It Shipped – Why Session Traces Matter More Than Model Names