Qwen3.5-2B Fine-Tune: RAG-Engram Boosts Accuracy from 50% to 93%

Fine-tuning approach for improved RAG performance

A developer has fine-tuned Qwen3.5-2B to address the 'lost in the middle' phenomenon and the hallucinations that small language models show when the context window is filled with roughly 8K tokens of retrieved data. Combined with a custom architecture called RAG-Engram, the fine-tune raised the share of correct answers at 8K tokens from 50% to 93% across 14 real-world queries.

Architecture details

RAG-Engram is a two-level system built around Qwen3.5-2B's hybrid Gated DeltaNet architecture:

Level 1 — Static Engram Table: 135K pre-computed entity embeddings (Indian proper nouns, government schemes, Hindi phrases, financial terms) stored in CPU RAM. This frees the model's attention from having to reconstruct known entities.
Level 2 — Dynamic Chunk Navigation: At inference time, a lightweight spaCy extractor (~15MB) scans the retrieved chunks, builds a pointer map of where key entities appear, and generates an attention bias matrix. The bias is added to the Q·K^T scores before softmax at layers 3 and 15, the full-attention layers in the hybrid architecture; the other 18 layers are Gated DeltaNet layers, which have no softmax attention.

The approach tells attention heads where to look instead of having the model blindly scan 8,000 tokens hoping to find answers.
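
The two steps of Level 2 (pointer map, then additive bias before softmax) can be illustrated with a minimal NumPy sketch. This is not the developer's implementation: a plain token matcher stands in for the spaCy extractor, and the function names, the `boost` value, the single-head layout, and the standard 1/sqrt(d) scaling are all assumptions made for illustration.

```python
import numpy as np

def build_pointer_map(tokens, entities):
    """Map each known entity to the token positions where it appears.
    (Hypothetical stand-in for the spaCy-based extractor.)"""
    wanted = {e.lower() for e in entities}
    pointer_map = {}
    for pos, tok in enumerate(tokens):
        key = tok.lower()
        if key in wanted:
            pointer_map.setdefault(key, []).append(pos)
    return pointer_map

def build_attention_bias(seq_len, pointer_map, query_entities, boost=2.0):
    """Additive bias of shape (seq_len, seq_len): every query position gets
    `boost` on each key position that holds an entity from the question.
    The value of `boost` is an arbitrary illustrative choice."""
    bias = np.zeros((seq_len, seq_len))
    for ent in query_entities:
        for pos in pointer_map.get(ent.lower(), []):
            bias[:, pos] += boost
    return bias

def biased_attention(q, k, v, bias):
    """Single-head causal attention with the bias added before softmax."""
    seq_len, d = q.shape
    scores = q @ k.T / np.sqrt(d) + bias
    # Causal mask: a position may not attend to later positions.
    future = np.triu(np.ones((seq_len, seq_len), dtype=bool), k=1)
    scores = np.where(future, -np.inf, scores)
    scores = scores - scores.max(axis=-1, keepdims=True)  # numerical stability
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=-1, keepdims=True)
    return weights @ v, weights

# Toy example: one entity buried in the middle of a short "chunk".
tokens = ["intro", "text", "PM-KISAN", "details", "more", "text", "question"]
pmap = build_pointer_map(tokens, ["PM-KISAN"])

rng = np.random.default_rng(0)
q, k, v = (rng.normal(size=(len(tokens), 8)) for _ in range(3))

no_bias = np.zeros((len(tokens), len(tokens)))
bias = build_attention_bias(len(tokens), pmap, ["PM-KISAN"])
_, plain_w = biased_attention(q, k, v, no_bias)
_, biased_w = biased_attention(q, k, v, bias)

# The last position now puts more weight on the entity's position (index 2).
print(plain_w[-1, 2], biased_w[-1, 2])
```

Because the bias is added to the logits rather than applied as a hard mask, the model can still attend to every other token; the entity positions are only made more likely to win the softmax.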

Training specifications

Base model: Qwen3.5-2B-Base
Method: LoRA (r=16, alpha=16) via Unsloth
Data: 2,168 examples distilled from DeepSeek V3 across MS MARCO, TyDi QA, NQ Open, MLQA Hindi, IndicQA, Dolly-15K
Training time: 15 minutes on Modal (single GPU)
Train/Val loss: 1.369 / 1.385 (a gap of 0.016, indicating no significant overfitting)
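
To make the LoRA settings concrete, here is a small NumPy sketch of the standard LoRA update for one linear layer. Only r=16 and alpha=16 come from the specifications above; the layer size of 1024 is a hypothetical toy value, not the model's real dimension, and this is not the Unsloth training code.

```python
import numpy as np

r, alpha = 16, 16            # values reported in the training specifications
d_in, d_out = 1024, 1024     # hypothetical toy layer size

rng = np.random.default_rng(0)
W = rng.normal(size=(d_out, d_in))       # frozen base weight
A = rng.normal(size=(r, d_in)) * 0.01    # trainable, small random init
B = np.zeros((d_out, r))                 # trainable, zero init

scaling = alpha / r  # with r == alpha the low-rank update is applied unscaled

def lora_forward(x):
    """Base projection plus the scaled low-rank correction."""
    return x @ W.T + scaling * (x @ A.T @ B.T)

trainable = A.size + B.size
frozen = W.size
print(f"scaling = {scaling}")
print(f"trainable params per layer: {trainable} vs frozen: {frozen} "
      f"({100 * trainable / frozen:.3f}%)")
```

Since B starts at zero, the adapted layer initially reproduces the base model exactly, and only the small A and B matrices receive gradients. That small trainable footprint is consistent with a short run on a single GPU.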

The supervised fine-tuning teaches the model to answer in a specific conversational style (markdown, bold key insights, source grounding), while the Engram bias handles attention navigation at long contexts.

Evaluation results

Claude Opus 4.6 served as the evaluator, with Google search result chunks padded to 8K tokens as context:

Vanilla Qwen3.5-2B: 50% correct answers at 8K tokens, 14% failures/refusals
Drissy (the fine-tuned model) + RAG-Engram: 93% correct answers at 8K tokens, 0% failures/refusals
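
With only 14 queries, each percentage corresponds to a small whole number of answers. The sketch below works out which counts are consistent with the reported figures. It assumes every percentage was computed over the same 14 queries and rounded to the nearest integer; the source does not report raw counts, so the results are inferred, not quoted.

```python
N_QUERIES = 14  # number of evaluation queries stated in the article

def counts_matching(reported_pct, total=N_QUERIES):
    """Return every answer count whose rounded percentage equals reported_pct."""
    return [c for c in range(total + 1) if round(100 * c / total) == reported_pct]

for label, pct in [
    ("vanilla correct", 50),
    ("vanilla failures/refusals", 14),
    ("fine-tune correct", 93),
    ("fine-tune failures/refusals", 0),
]:
    print(f"{label}: {pct}% -> {counts_matching(pct)} of {N_QUERIES}")
```

Under these assumptions each figure maps to a single count, so the headline gain amounts to a handful of additional correct answers. The sample is small, and the percentages should be read with that in mind.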

The combination eliminated 'lost in the middle' failures completely. The developer reports the entire project from spec to HuggingFace took about 2 weeks and cost less than a coffee.

Model availability

The fine-tuned model is available as:

Model: drissea-ai/drissy-qwen3.5-2b
GGUF: drissea-ai/drissy-qwen3.5-2b-GGUF

📖 Read the full source: r/LocalLLaMA

