JANG Quantization Method Improves MLX Performance for Large Models

Performance Gap Between Standard MLX and JANG Quantizations
The source reports a large accuracy gap between standard MLX quantization and the JANG method on large language models. On a 200-question MMLU run, MiniMax-M2.5 quantized to 4-bit for MLX scored only 26.5% (53/200), while the same model quantized with the JANG_2S method scored 74% (148/200). All three MLX quantization levels (2-bit, 3-bit, and 4-bit) scored near the random-chance baseline of approximately 25%, and JANG outperformed every one of them.
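To make "near random chance" concrete, the sketch below recomputes the two reported scores from their raw counts and measures how far each sits from the 25% guessing baseline of a four-option multiple-choice test. The helper names and the normal approximation to the binomial are illustrative assumptions, not part of the source's benchmark harness.

```python
import math

def accuracy(correct: int, total: int) -> float:
    """Fraction of questions answered correctly."""
    return correct / total

def z_vs_chance(correct: int, total: int, num_choices: int = 4) -> float:
    """Distance from the guessing baseline, in standard errors.

    Assumption: each question has `num_choices` equally likely options,
    and a normal approximation to the binomial is adequate for n = 200.
    """
    p0 = 1.0 / num_choices
    standard_error = math.sqrt(p0 * (1.0 - p0) / total)
    return (accuracy(correct, total) - p0) / standard_error

# Counts as reported in the source for MiniMax-M2.5 on 200 MMLU questions.
for label, correct in [("MLX 4-bit", 53), ("JANG_2S", 148)]:
    print(f"{label}: {accuracy(correct, 200):.1%}, "
          f"{z_vs_chance(correct, 200):+.1f} standard errors from chance")
```

Under these assumptions the MLX 4-bit result lands well within one standard error of pure guessing, while the JANG_2S result is far outside it.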
Specific Benchmark Results
The per-subject MMLU breakdown (20 questions per subject) shows JANG_2L consistently outperforming MLX 4-bit:
- Abstract Algebra: JANG_2L 10/20 vs MLX 4-bit 3/20
- Astronomy: JANG_2L 20/20 vs MLX 4-bit 7/20
- College CS: JANG_2L 13/20 vs MLX 4-bit 4/20
- HS Biology: JANG_2L 18/20 vs MLX 4-bit 4/20
The source attributes the poor MLX scores to output format rather than missing knowledge: "MLX generates meta-commentary instead of direct answers on this model."
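The distinction between a direct answer and meta-commentary can be illustrated with a strict answer extractor. The source does not describe how its benchmark grades responses, so the pattern and the `extract_choice` function below are hypothetical; they only show how a grader that expects a letter on the first line would treat the two kinds of output.

```python
import re
from typing import Optional

# Hypothetical pattern: accepts a bare letter ("B", "B.", "(B)") or a short
# "Answer: B" form. It is not taken from the source's benchmark harness.
ANSWER_RE = re.compile(
    r"^\s*(?:answer\s*[:\-]?\s*)?\(?([ABCD])\)?[.\s]*$",
    re.IGNORECASE,
)

def extract_choice(response: str) -> Optional[str]:
    """Return the chosen letter if the first line is a direct answer, else None."""
    stripped = response.strip()
    if not stripped:
        return None
    first_line = stripped.splitlines()[0]
    match = ANSWER_RE.match(first_line)
    return match.group(1).upper() if match else None

direct = "Answer: C"
meta = "Let me think about what this question is really asking before I respond."

print(extract_choice(direct))
print(extract_choice(meta))
```

A response that opens with commentary yields no extractable letter under this sketch, so it cannot be credited even when the model would have known the answer.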
Model Size and Performance Comparisons
For the Qwen 3.5 122B model:
- JANG_4K: 86% MMLU score, 69 GB size
- MLX 4-bit: 85% MMLU score, 64 GB size
- JANG_2S: 79% MMLU score, 38 GB size
- MLX 2-bit: 56.5% MMLU score, 36 GB size
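The size and accuracy trade-off in the list above can be tabulated directly. This sketch uses only the four figures reported for Qwen 3.5 122B; the `QuantResult` type and the "MMLU points per GB" metric are illustrative choices, not measures used by the source.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class QuantResult:
    name: str
    mmlu_pct: float  # MMLU score in percent, as reported
    size_gb: float   # model size in GB, as reported

RESULTS = [
    QuantResult("JANG_4K", 86.0, 69.0),
    QuantResult("MLX 4-bit", 85.0, 64.0),
    QuantResult("JANG_2S", 79.0, 38.0),
    QuantResult("MLX 2-bit", 56.5, 36.0),
]

def points_per_gb(result: QuantResult) -> float:
    """Illustrative efficiency metric: MMLU percentage points per GB."""
    return result.mmlu_pct / result.size_gb

def compare(a: QuantResult, b: QuantResult) -> tuple[float, float]:
    """Return (accuracy gain in points, extra size in GB) of a over b."""
    return a.mmlu_pct - b.mmlu_pct, a.size_gb - b.size_gb

by_name = {r.name: r for r in RESULTS}

for r in sorted(RESULTS, key=points_per_gb, reverse=True):
    print(f"{r.name:10s} {r.mmlu_pct:5.1f}%  {r.size_gb:4.0f} GB  "
          f"{points_per_gb(r):.2f} points/GB")

gain, extra = compare(by_name["JANG_2S"], by_name["MLX 2-bit"])
print(f"JANG_2S vs MLX 2-bit: {gain:+.1f} points for {extra:+.0f} GB")
```

At the 2-bit tier, JANG_2S gains 22.5 points over MLX 2-bit for 2 GB of extra size, whereas at the 4-bit tier JANG_4K gains 1 point over MLX 4-bit for 5 GB.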
The author notes that "People trade the M chip speed for coherency, with no GGUF equivalent on MLX" and that "Qwen 3.5 on Macs when using GGUF is also 1/3rd slower than MLX."
MiniMax-M2.5 Code Generation Issue
From referenced benchmarks: "MiniMax-M2.5 can't code — 10% on HumanEval+ despite 87% tool calling and 80% reasoning. Something is off with its code generation format. Great for reasoning though."
Availability and Implementation
JANG is currently available through:
- MLX Studio: https://mlx.studio/ - includes a native JANG_Q inference engine
- Repository: For self-installation and model quantization
The method allows running models like MiniMax-M2.5 at "2bit MLX equivalent while getting test results that just wasn't possible before on MLX."
📖 Read the full source: r/LocalLLaMA