JANG Quantization Boosts MLX Performance on Large Models

Performance Gap Between MLX and GGUF Quantizations

The source reports a significant accuracy problem with standard MLX quantization on large language models. On the MMLU benchmark (200 questions), MiniMax-M2.5 quantized to 4-bit for MLX scored only 26.5% (53/200), while the same model quantized with the JANG_2S method scored 74% (148/200). JANG outperformed every MLX quantization level tested (2-bit, 3-bit, and 4-bit), all of which scored near the 25% random-chance baseline.
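The percentages above follow directly from the raw counts. As a minimal sketch, the snippet below recomputes them and measures each score against the 25% chance level of a four-choice benchmark such as MMLU. The names `accuracy_pct` and `RESULTS` are illustrative and do not come from the source.

```python
# Recompute the reported MMLU percentages from the raw correct/total counts.
# Function and variable names are illustrative, not from the source.

CHANCE_PCT = 25.0  # MMLU questions have four answer choices


def accuracy_pct(correct: int, total: int) -> float:
    """Return accuracy as a percentage."""
    return 100.0 * correct / total


# Counts as reported for MiniMax-M2.5 on 200 MMLU questions.
RESULTS = {
    "MLX 4-bit": (53, 200),
    "JANG_2S": (148, 200),
}

for name, (correct, total) in RESULTS.items():
    pct = accuracy_pct(correct, total)
    print(f"{name}: {pct:.1f}% ({pct - CHANCE_PCT:+.1f} points vs. chance)")
```

Seen this way, the MLX 4-bit result sits 1.5 points above guessing, while the JANG result sits 49 points above it.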

Specific Benchmark Results

A detailed MMLU subject breakdown shows JANG_2L consistently outperforming MLX 4-bit quantization:

Abstract Algebra: JANG_2L 10/20 vs MLX 4-bit 3/20
Astronomy: JANG_2L 20/20 vs MLX 4-bit 7/20
College CS: JANG_2L 13/20 vs MLX 4-bit 4/20
HS Biology: JANG_2L 18/20 vs MLX 4-bit 4/20
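To see how the four listed subjects add up, the sketch below totals the per-subject counts. It covers only the subjects shown here (80 questions in all), so its totals are not the full 200-question score. The dictionary layout and variable names are illustrative assumptions.

```python
# Aggregate the four per-subject results listed above.
# Each entry maps subject -> (JANG_2L correct, MLX 4-bit correct), out of 20.
QUESTIONS_PER_SUBJECT = 20

SUBJECTS = {
    "Abstract Algebra": (10, 3),
    "Astronomy": (20, 7),
    "College CS": (13, 4),
    "HS Biology": (18, 4),
}

total_questions = QUESTIONS_PER_SUBJECT * len(SUBJECTS)
jang_total = sum(jang for jang, _ in SUBJECTS.values())
mlx_total = sum(mlx for _, mlx in SUBJECTS.values())

print(f"JANG_2L:   {jang_total}/{total_questions} "
      f"({100.0 * jang_total / total_questions:.2f}%)")
print(f"MLX 4-bit: {mlx_total}/{total_questions} "
      f"({100.0 * mlx_total / total_questions:.2f}%)")
```

Across these four subjects, JANG_2L answers 61 of 80 correctly against 18 of 80 for MLX 4-bit, which is below the 20 correct answers that guessing would be expected to produce.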

The root cause the source identifies for the poor MLX scores is that "MLX generates meta-commentary instead of direct answers on this model."
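The source does not describe how answers were graded. As a purely hypothetical illustration of why meta-commentary can push a multiple-choice score toward chance, the sketch below uses a deliberately naive extractor that only accepts responses beginning with an answer letter. The regular expression, the function `extract_choice`, and the sample responses are all assumptions, not the benchmark's actual harness.

```python
import re
from typing import Optional

# Hypothetical, deliberately strict extractor: the response must START with
# a choice letter A-D, optionally parenthesized, followed by punctuation,
# whitespace, or the end of the string. Not the source's grading code.
_ANSWER_RE = re.compile(r"^\s*\(?([A-D])\)?(?:[.:)\s]|$)")


def extract_choice(response: str) -> Optional[str]:
    """Return the leading answer letter, or None if there is none."""
    match = _ANSWER_RE.match(response)
    return match.group(1) if match else None


# Direct answers are parsed.
print(extract_choice("B"))                    # B
print(extract_choice("(A) Mitochondria"))     # A

# Meta-commentary yields no usable answer and would be graded as wrong.
print(extract_choice("Let me think about what the question is asking."))  # None
print(extract_choice("The user wants me to pick one of four options."))   # None
```

Under a grader like this, a model that knows the right answer but talks about the question instead of stating a letter scores no better than one that does not know it.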

Model Size and Performance Comparisons

For the Qwen 3.5 122B model:

JANG_4K: 86% MMLU score, 69 GB size
MLX 4-bit: 85% MMLU score, 64 GB size
JANG_2S: 79% MMLU score, 38 GB size
MLX 2-bit: 56.5% MMLU score, 36 GB size
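The trade-off in these four rows is easier to read as paired differences. The sketch below compares each JANG variant with the MLX quantization nearest to it in size. The `Quant` type and `compare` helper are illustrative names, and the pairing itself is an assumption made for the comparison, not something the source specifies.

```python
from typing import NamedTuple, Tuple


class Quant(NamedTuple):
    name: str
    mmlu_pct: float
    size_gb: float


# Figures as listed above for the Qwen 3.5 122B model.
QUANTS = [
    Quant("JANG_4K", 86.0, 69.0),
    Quant("MLX 4-bit", 85.0, 64.0),
    Quant("JANG_2S", 79.0, 38.0),
    Quant("MLX 2-bit", 56.5, 36.0),
]


def compare(a: Quant, b: Quant) -> Tuple[float, float]:
    """Return (MMLU point difference, size difference in GB) of a minus b."""
    return a.mmlu_pct - b.mmlu_pct, a.size_gb - b.size_gb


by_name = {q.name: q for q in QUANTS}

# Pair each JANG variant with the MLX quantization closest to it in size.
for jang, mlx in [("JANG_4K", "MLX 4-bit"), ("JANG_2S", "MLX 2-bit")]:
    d_score, d_size = compare(by_name[jang], by_name[mlx])
    print(f"{jang} vs. {mlx}: {d_score:+.1f} MMLU points, {d_size:+.0f} GB")
```

On these numbers the two methods are nearly tied at around 4 bits (1 point for 5 GB more), while at the low end JANG_2S gains 22.5 points over MLX 2-bit for 2 GB more.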

The author notes that "People trade the M chip speed for coherency, with no GGUF equivalent on MLX" and that "Qwen 3.5 on Macs when using GGUF is also 1/3rd slower than MLX."

MiniMax-M2.5 Code Generation Issue

From referenced benchmarks: "MiniMax-M2.5 can't code — 10% on HumanEval+ despite 87% tool calling and 80% reasoning. Something is off with its code generation format. Great for reasoning though."