Codebook Lossless LLM Compression: 10-25% RAM Reduction with Bitwise Packing

A developer has published proof-of-concept code for lossless LLM compression that cuts memory usage by 10-25% through generic bitwise packing of indexed weights. The technique trades inference speed for a smaller memory footprint, which can make it possible to run larger models on hardware with limited VRAM.
How It Works
The developer started by asking how many unique values actually exist in LLM layers. The analysis showed that although fp16 allocates 16 bits per weight, most models contain few enough distinct values that only about 12-13 bits are needed to index them. Storing each distinct value once in a codebook and packing the per-weight indices into blocks at that reduced bit width shrinks the model without losing any precision.
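The codebook-plus-packing idea can be illustrated with a minimal NumPy sketch. It assumes one codebook of unique fp16 bit patterns per layer and a single dense little-endian bitstream for the indices. The function names and the synthetic layer are hypothetical illustrations, not code from the linked repository, whose block layout may differ.

```python
import numpy as np


def build_codebook(weights: np.ndarray):
    """Replace each fp16 weight with an index into a table of the layer's unique values."""
    assert weights.dtype == np.float16, "sketch assumes fp16 weights"
    # Use the raw 16-bit patterns so +0.0/-0.0 and NaN payloads stay distinct,
    # which keeps the round trip bit-exact.
    bits = weights.view(np.uint16).ravel()
    codebook, indices = np.unique(bits, return_inverse=True)
    n_bits = max(1, (len(codebook) - 1).bit_length())  # bits needed per index
    return codebook, indices.astype(np.uint32), n_bits


def pack_indices(indices: np.ndarray, n_bits: int) -> np.ndarray:
    """Write the low n_bits of every index into one dense little-endian bitstream."""
    shifts = np.arange(n_bits, dtype=np.uint32)
    bit_matrix = ((indices[:, None] >> shifts) & 1).astype(np.uint8)
    return np.packbits(bit_matrix.ravel(), bitorder="little")


def unpack_indices(packed: np.ndarray, n_bits: int, count: int) -> np.ndarray:
    """Inverse of pack_indices: read `count` indices of n_bits each."""
    bits = np.unpackbits(packed, count=count * n_bits, bitorder="little")
    bit_matrix = bits.reshape(count, n_bits).astype(np.uint32)
    shifts = np.arange(n_bits, dtype=np.uint32)
    return (bit_matrix << shifts).sum(axis=1).astype(np.uint32)


def decompress(codebook, packed, n_bits, shape):
    """Look each unpacked index up in the codebook and restore the fp16 tensor."""
    count = int(np.prod(shape))
    indices = unpack_indices(packed, n_bits, count)
    return codebook[indices].view(np.float16).reshape(shape)


# Synthetic stand-in for a layer that only uses a few thousand distinct fp16 values.
rng = np.random.default_rng(0)
pool = rng.normal(0.0, 0.02, size=4000).astype(np.float16)
layer = rng.choice(pool, size=(512, 512))

codebook, indices, n_bits = build_codebook(layer)
packed = pack_indices(indices, n_bits)
restored = decompress(codebook, packed, n_bits, layer.shape)

original_bytes = layer.nbytes
compressed_bytes = packed.nbytes + codebook.nbytes
print(f"unique values: {len(codebook)}, bits per index: {n_bits}")
print(f"bytes: {original_bytes} -> {compressed_bytes}")
assert np.array_equal(layer.view(np.uint16), restored.view(np.uint16))  # bit-exact
```

Building the codebook from uint16 bit patterns rather than float values is deliberate: float comparison would merge +0.0 and -0.0 (and collapse NaN payloads), which would break the bit-exact guarantee.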
Performance Characteristics
- RAM reduction: 10-25% or more across tested models
- Speed impact: inference speed roughly halved in the example tests
- Test hardware: NVIDIA P2200 (5 GB) and CPU, with updates in development for the AMD MI50 (32 GB)
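As a rough check on the reported range, the savings for one layer can be estimated from the bits each index needs plus the cost of storing the codebook itself. This back-of-the-envelope sketch assumes one fp16 codebook per layer and ignores any per-block metadata the real format might add; the function name, layer size, and unique-value counts are hypothetical.

```python
def estimated_savings(n_weights: int, n_unique: int) -> float:
    """Fraction of fp16 storage saved by packed codebook indices (hypothetical model)."""
    index_bits = max(1, (n_unique - 1).bit_length())  # bits per packed index
    compressed_bits = n_weights * index_bits + n_unique * 16  # indices + fp16 codebook
    return 1 - compressed_bits / (n_weights * 16)


n = 4096 * 4096  # a hypothetical 4096x4096 layer
print(f"{estimated_savings(n, 4096):.1%}")  # 12-bit indices -> 25.0%
print(f"{estimated_savings(n, 8000):.1%}")  # 13-bit indices -> 18.7%
```

For large layers the codebook overhead is negligible, so each extra bit per index costs about 1/16 (6.25 percentage points) of savings; a layer needing 14 bits would land near 12.5%.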
Implementation Details
The developer worked on this project for several weeks using AI coding assistants including Claude, Qwen, and Gemini. The repository includes both lossless and lossy/balanced versions, though the lossy version hasn't been extensively tested yet.
The developer suggests this compression approach might serve as a way to measure a model's "compactness" - how efficiently it uses its parameter space.
Code Availability
The proof-of-concept code is available on GitHub: https://github.com/bigattichouse/Codebook-Quantization
Source: r/LocalLLaMA