Codebook Lossless LLM Compression: 10-25% RAM Reduction with Bitwise Packing

By OpenClawRadar. Published: March 15, 2026

A developer has published proof-of-concept code for lossless LLM compression that reduces memory usage by 10-25% by replacing weights with codebook indices and packing those indices at the bit level. The technique trades inference speed for a smaller memory footprint, which can make it possible to run larger models on hardware with limited VRAM.

How It Works

The developer started by asking how many unique values actually exist in LLM layers. Analysis showed that although fp16 stores each weight in 16 bits, the distinct values that actually appear in most models can be indexed with only about 12-13 bits. By storing each weight as an index into a table of those values and packing the indices into blocks, the technique achieves compression without losing precision.
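The idea above can be illustrated with a minimal NumPy sketch. This is not the repository's code: the function names `compress` and `decompress` are hypothetical, and the one-row-per-weight bit matrix is chosen for readability rather than memory efficiency. It assumes the weights are fp16 and compares raw bit patterns so that special values such as NaN and negative zero survive the round trip.

```python
import numpy as np

def compress(weights: np.ndarray):
    """Encode an fp16 array as (codebook, packed index bits, bits per index)."""
    # Work on raw 16-bit patterns so every distinct value is kept exactly.
    raw = weights.astype(np.float16).ravel().view(np.uint16)
    codebook, indices = np.unique(raw, return_inverse=True)
    bits = max(1, int(np.ceil(np.log2(len(codebook)))))
    # Expand each index into its binary digits, most significant bit first.
    shifts = np.arange(bits - 1, -1, -1, dtype=np.uint32)
    bit_matrix = ((indices.astype(np.uint32)[:, None] >> shifts) & 1).astype(np.uint8)
    packed = np.packbits(bit_matrix.ravel())  # final byte is zero-padded
    return codebook, packed, bits

def decompress(codebook, packed, bits, shape):
    """Rebuild the original fp16 array from the packed indices."""
    count = int(np.prod(shape))
    bit_matrix = np.unpackbits(packed)[: count * bits].reshape(count, bits)
    place_values = 1 << np.arange(bits - 1, -1, -1, dtype=np.uint32)
    indices = (bit_matrix.astype(np.uint32) * place_values).sum(axis=1)
    return codebook[indices.astype(np.intp)].view(np.float16).reshape(shape)
```

Decompression has to unpack and look up every weight before it can be used, which is consistent with the slower inference reported for this approach.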

Performance Characteristics

  • RAM reduction: 10-25%+ across tested models
  • Speed impact: Inference speed approximately halved in example tests
  • Test hardware: NVIDIA P2200 (5GB) and CPU, with updates being developed for AMD MI50 (32GB)
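The reported range is roughly what index width alone would predict. As a back-of-the-envelope sketch (the function `ram_reduction` and its parameters are illustrative, and it assumes a single codebook of fp16 values with no other metadata), the saving for a tensor can be estimated as:

```python
def ram_reduction(n_weights: int, n_unique: int, index_bits: int,
                  value_bits: int = 16) -> float:
    """Fraction of memory saved by storing packed indices plus a codebook."""
    original = n_weights * value_bits
    # Packed indices for every weight, plus one full-width entry per unique value.
    compressed = n_weights * index_bits + n_unique * value_bits
    return 1 - compressed / original
```

Ignoring the codebook, 12-bit indices save at most 25% (1 - 12/16), 13-bit indices 18.75%, and 14-bit indices 12.5%. For small tensors the codebook overhead can cancel the gain entirely.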

Implementation Details

The developer worked on this project for several weeks using AI coding assistants including Claude, Qwen, and Gemini. The repository includes both lossless and lossy/balanced versions, though the lossy version hasn't been extensively tested yet.

The developer suggests this compression approach might also serve as a way to measure a model's "compactness": how efficiently it uses its parameter space.
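One possible reading of that suggestion, offered only as an assumption, is to report how many of the 16 available bits each layer's set of distinct values needs. The names `used_bits` and `compactness` below are hypothetical and do not come from the repository.

```python
import math
import numpy as np

def used_bits(layer: np.ndarray) -> float:
    """log2 of the number of distinct fp16 bit patterns in a layer."""
    patterns = np.asarray(layer, dtype=np.float16).ravel().view(np.uint16)
    return math.log2(len(np.unique(patterns)))

def compactness(layers: dict) -> dict:
    """Per-layer fraction of the 16 fp16 bits that the value set requires."""
    return {name: used_bits(w) / 16 for name, w in layers.items()}
```

Under this reading, a layer scoring near 0.75 would correspond to the 12-bit case described earlier, while a score near 1.0 would indicate little room for lossless savings.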

Code Availability

The proof-of-concept code is available on GitHub: https://github.com/bigattichouse/Codebook-Quantization

📖 Read the full source: r/LocalLLaMA
