Fine-tuned Qwen3.5-2B with RAG-Engram architecture improves grounded answer accuracy from 50% to 93% at 8K context

Fine-tuning approach for improved RAG performance
A developer has fine-tuned Qwen3.5-2B to address two problems that small language models show when the context window is saturated with roughly 8K tokens of retrieved data: the 'lost in the middle' phenomenon and hallucination. Paired with a custom architecture called RAG-Engram, the model's share of correct answers at 8K tokens rose from 50% to 93% across 14 real-world queries.
Architecture details
RAG-Engram is a two-level system built around Qwen3.5-2B's hybrid Gated DeltaNet architecture:
- Level 1 (Static Engram Table): 135K pre-computed entity embeddings (Indian proper nouns, government schemes, Hindi phrases, financial terms) stored in CPU RAM. This relieves the model's attention of having to reconstruct known entities.
- Level 2 (Dynamic Chunk Navigation): At inference time, a lightweight spaCy extractor (~15MB) scans the retrieved chunks, builds a pointer map of where key entities appear, and generates an attention bias matrix. The bias is added to the Q·K^T scores before softmax at layers 3 and 15, which are the full-attention layers in the hybrid architecture; the other 18 layers are Gated DeltaNet layers, which do not use softmax attention.
The approach tells attention heads where to look instead of having the model blindly scan 8,000 tokens hoping to find answers.
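The Level 2 idea can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions, not the developer's implementation: simple token matching stands in for the spaCy extractor, the bias is a single per-key row shared by every query rather than a full matrix, and the names `build_pointer_map`, `make_bias`, and the `boost` value are all hypothetical. The source also does not say whether the bias is added before or after the usual 1/sqrt(d) scaling; the sketch adds it after.

```python
import math

def build_pointer_map(tokens, entities):
    """Map each known entity to the token positions where it appears.
    Plain string matching here; the source uses a spaCy extractor."""
    wanted = {e.lower() for e in entities}
    pointers = {}
    for pos, tok in enumerate(tokens):
        key = tok.lower()
        if key in wanted:
            pointers.setdefault(key, []).append(pos)
    return pointers

def make_bias(seq_len, pointers, boost=2.0):
    """Additive bias per key position: `boost` at entity hits, 0 elsewhere.
    `boost` is an illustrative value, not one reported in the source."""
    bias = [0.0] * seq_len
    for positions in pointers.values():
        for pos in positions:
            bias[pos] = boost
    return bias

def softmax(scores):
    peak = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attention_weights(query, keys, bias=None):
    """softmax(q.k / sqrt(d) + bias) for one query vector."""
    scale = math.sqrt(len(query))
    scores = [sum(q * k for q, k in zip(query, key)) / scale for key in keys]
    if bias is not None:
        scores = [s + b for s, b in zip(scores, bias)]
    return softmax(scores)

# Toy context: eight tokens, one of which is a known entity.
tokens = ["filler", "filler", "Yojana", "filler",
          "filler", "filler", "filler", "filler"]
pointers = build_pointer_map(tokens, ["yojana"])
bias = make_bias(len(tokens), pointers)

# Identical keys make unbiased attention uniform, so the bias effect is easy to see.
query = [1.0, 0.0]
keys = [[1.0, 0.0] for _ in tokens]
plain = attention_weights(query, keys)
guided = attention_weights(query, keys, bias)
print(round(plain[2], 3), round(guided[2], 3))
```

With uniform keys, the unbiased weight on the entity position is 1/8, and the additive bias lifts it to roughly half of the total attention mass while the weights still sum to 1.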
Training specifications
- Base model: Qwen3.5-2B-Base
- Method: LoRA (r=16, alpha=16) via Unsloth
- Data: 2,168 examples distilled from DeepSeek V3 across MS MARCO, TyDi QA, NQ Open, MLQA Hindi, IndicQA, Dolly-15K
- Training time: 15 minutes on Modal (single GPU)
- Train/val loss: 1.369 / 1.385, a small gap that suggests no significant overfitting
The supervised fine-tuning teaches the model to answer in a specific conversational style (markdown, bold key insights, source grounding), while the Engram bias handles attention navigation at long contexts.
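For readers unfamiliar with the LoRA settings listed above, the following is a minimal sketch of what `r` and `alpha` control in the standard LoRA formulation, where the adapted weight is W + (alpha / r) · B · A. It is a toy in plain Python, not the Unsloth training code: the helper names are hypothetical, and the matrices are tiny rank-1 examples chosen so the arithmetic is easy to follow. With r=16 and alpha=16 the scaling factor is 1.0, so the low-rank update is applied unscaled.

```python
def lora_scale(r, alpha):
    """Scaling applied to the low-rank update in standard LoRA."""
    return alpha / r

def matmul(a, b):
    """Plain-Python matrix product of a (n x k) and b (k x m)."""
    return [[sum(a[i][t] * b[t][j] for t in range(len(b)))
             for j in range(len(b[0]))]
            for i in range(len(a))]

def lora_effective_weight(W, A, B, alpha):
    """Return W + (alpha / r) * B @ A, where r is the number of rows in A."""
    r = len(A)
    scale = lora_scale(r, alpha)
    delta = matmul(B, A)
    return [[w + scale * d for w, d in zip(w_row, d_row)]
            for w_row, d_row in zip(W, delta)]

# Scaling factor for the settings reported in the article.
reported_scale = lora_scale(r=16, alpha=16)

# Toy rank-1 adapter on a 3x3 identity weight, using the same alpha/r ratio of 1.
W = [[1.0, 0.0, 0.0],
     [0.0, 1.0, 0.0],
     [0.0, 0.0, 1.0]]
B = [[1.0], [0.0], [2.0]]   # shape 3 x r, with r = 1
A = [[0.5, 0.0, 1.0]]       # shape r x 3
W_adapted = lora_effective_weight(W, A, B, alpha=1.0)
print(reported_scale, W_adapted)
```

Only A and B are trained; W stays frozen, which is one reason a run over a couple of thousand examples can be short and cheap.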
Evaluation results
Evaluation was conducted by Claude Opus 4.6 using Google search result chunks padded to 8K tokens:
- Vanilla Qwen3.5-2B: 50% correct answers at 8K tokens, 14% failures/refusals
- Drissy (the fine-tuned model) + RAG-Engram: 93% correct answers at 8K tokens, 0% failures/refusals
In this test the combination eliminated outright failures and refusals, although 7% of answers were still not counted as correct. The developer reports that the entire project, from spec to HuggingFace release, took about 2 weeks and cost less than a cup of coffee.
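Because the evaluation covers only 14 queries, each reported percentage corresponds to a whole number of queries. The source does not state the raw counts, so the sketch below infers them by searching for the count out of 14 that rounds to each reported figure; the function name and the inferred counts are assumptions, not numbers taken from the post.

```python
TOTAL_QUERIES = 14  # from the article

def counts_matching(percent, total=TOTAL_QUERIES):
    """All query counts k (0..total) whose share rounds to `percent`."""
    return [k for k in range(total + 1) if round(100 * k / total) == percent]

reported = {
    "vanilla correct": 50,
    "vanilla failures/refusals": 14,
    "Drissy + RAG-Engram correct": 93,
    "Drissy + RAG-Engram failures/refusals": 0,
}

inferred = {label: counts_matching(pct) for label, pct in reported.items()}
for label, counts in inferred.items():
    print(f"{label}: {counts} of {TOTAL_QUERIES}")

# Size of one query as a share of the test set, in percentage points.
points_per_query = 100 / TOTAL_QUERIES
```

Each percentage maps to exactly one count under this reading, and a single query is worth about 7 percentage points, which is worth keeping in mind when comparing the two systems.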
Model availability
The fine-tuned model is available as:
- Model: drissea-ai/drissy-qwen3.5-2b
- GGUF: drissea-ai/drissy-qwen3.5-2b-GGUF
📖 Read the full source: r/LocalLLaMA