Benchmark Results for Qwen3.5 Models with 2K to 400K Context on RTX 4090

✍️ OpenClawRadar · 📅 Published: March 7, 2026

Qwen3.5 Performance Testing on RTX 4090

A developer shared benchmark results for Qwen3.5 models running on an RTX 4090 GPU, testing context windows from 2,048 to 400,000 tokens. The tests were originally planned to stop at a 262k-token context but were extended to 400k using YaRN and other methods.
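
As a rough illustration of what the extension from 262k to 400k implies: YaRN-style RoPE scaling is commonly parameterized by a scale factor, the target context divided by the context the model was originally configured for. The sketch below assumes an original context of 262,144 tokens (the source only says 262k was the planned limit) and a single scale factor. The source does not give the actual settings, and `rope_scale_factor` is a hypothetical helper, not part of the developer's script.

```python
# Assumption: 262,144 is the original context the extension starts from.
NATIVE_CTX = 262_144
# Largest context length reported in the benchmark.
TARGET_CTX = 400_000


def rope_scale_factor(target_ctx: int, native_ctx: int) -> float:
    """Return the context-extension factor; 1.0 means no scaling is needed."""
    if native_ctx <= 0:
        raise ValueError("native_ctx must be positive")
    return max(1.0, target_ctx / native_ctx)


factor = rope_scale_factor(TARGET_CTX, NATIVE_CTX)
print(f"scale factor: {factor:.3f}")  # 400000 / 262144, about 1.526
```

Under these assumptions, the 400k runs need only about a 1.5x extension over the planned limit.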

Models Tested

The following Qwen3.5 model variants were benchmarked:

  • Qwen3.5-0.8B-Q4_K_M
  • Qwen3.5-0.8B-bf16
  • Qwen3.5-2B-Q4_K_M
  • Qwen3.5-2B-bf16
  • Qwen3.5-4B-Q4_K_M
  • Qwen3.5-4B-bf16
  • Qwen3.5-9B-Q4_K_M
  • Qwen3.5-9B-bf16
  • Qwen3.5-27B-Q4_K_M
  • Qwen3.5-35B-A3B-Q4_K_M

Context Windows Tested

The models were evaluated at these specific context lengths: 2048, 4096, 8192, 32768, 65536, 98304, 131072, 196608, 262144, 327680, 360448, 393216, and 400000 tokens.
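
To get a sense of the size of this sweep, the two lists above can be combined into a run grid. This is only a sketch: it assumes every model was run at every context length, which the source does not confirm, and the names `MODELS`, `CONTEXT_LENGTHS`, and `PLANNED_LIMIT` are illustrative.

```python
from itertools import product

MODELS = [
    "Qwen3.5-0.8B-Q4_K_M", "Qwen3.5-0.8B-bf16",
    "Qwen3.5-2B-Q4_K_M", "Qwen3.5-2B-bf16",
    "Qwen3.5-4B-Q4_K_M", "Qwen3.5-4B-bf16",
    "Qwen3.5-9B-Q4_K_M", "Qwen3.5-9B-bf16",
    "Qwen3.5-27B-Q4_K_M", "Qwen3.5-35B-A3B-Q4_K_M",
]

CONTEXT_LENGTHS = [
    2048, 4096, 8192, 32768, 65536, 98304, 131072,
    196608, 262144, 327680, 360448, 393216, 400000,
]

# Assumption: 262,144 is the "262k" limit the tests were originally planned for.
PLANNED_LIMIT = 262_144

# Full grid, assuming every model is paired with every context length.
runs = list(product(MODELS, CONTEXT_LENGTHS))

# Context lengths that only exist because the sweep was extended past 262k.
extended = [n for n in CONTEXT_LENGTHS if n > PLANNED_LIMIT]

print(len(runs))   # 130
print(extended)    # [327680, 360448, 393216, 400000]
```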

Testing Methodology

The benchmark script was tuned for the highest possible tokens-per-second throughput, using NGL (number of GPU layers) settings together with 8-bit and 4-bit KV cache quantization. The developer noted that while the initial time-to-first-token (TTFT) looks long, the Warm TTFT Avg (s) column shows better performance once the KV cache is populated. The full context was intentionally loaded in the first interaction.
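
The distinction between the initial TTFT and the warm average can be sketched as follows. This assumes "Warm TTFT Avg" means the mean TTFT of every interaction after the first one in a session, which the source implies but does not define. The sample timings are made up for illustration and are not benchmark results.

```python
from statistics import mean


def summarize_ttft(ttft_seconds):
    """Split per-interaction TTFT samples into (cold, warm_avg).

    cold     -- TTFT of the first interaction, which processes the full context
    warm_avg -- mean TTFT of the remaining interactions, or None if there are none
    """
    if not ttft_seconds:
        raise ValueError("need at least one TTFT sample")
    cold = ttft_seconds[0]
    warm = ttft_seconds[1:]
    warm_avg = mean(warm) if warm else None
    return cold, warm_avg


# Made-up timings in seconds, NOT taken from the benchmark.
samples = [40.0, 1.0, 2.0, 3.0]
cold, warm_avg = summarize_ttft(samples)
print(f"cold TTFT: {cold:.1f}s, warm TTFT avg: {warm_avg:.1f}s")
```

Reporting the two numbers separately keeps the one-time cost of ingesting the context from hiding how quickly later turns respond.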

To test long-context capability, each model was given a one-sentence prompt asking it to summarize logs, followed by 2k to 400k tokens of log data. The developer reported some discrepancies but satisfactory performance overall.
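
A prompt of this shape could be assembled roughly as shown below. This is a sketch and not the developer's script: `build_prompt` and `approx_token_count` are hypothetical names, and the whitespace-based counter is only a stand-in for the model's real tokenizer, which would give different counts.

```python
def approx_token_count(text: str) -> int:
    """Crude stand-in for a tokenizer: counts whitespace-separated words."""
    return len(text.split())


def build_prompt(instruction, log_lines, target_tokens, count_tokens=approx_token_count):
    """Append log lines after the instruction until the token budget is reached.

    Returns the prompt text and the number of (approximate) tokens used.
    """
    parts = [instruction]
    used = count_tokens(instruction)
    for line in log_lines:
        cost = count_tokens(line)
        if used + cost > target_tokens:
            break  # the next line would exceed the target context size
        parts.append(line)
        used += cost
    return "\n".join(parts), used


# Illustrative input: a one-sentence instruction plus synthetic log lines.
instruction = "Summarize these logs."
logs = ["ts=1 level=info msg=ok status=200"] * 10  # 4 "tokens" per line
prompt, used = build_prompt(instruction, logs, target_tokens=11)
print(used)                      # 11
print(prompt.count("\n"))        # 2 log lines were appended
```

Swapping `count_tokens` for a real tokenizer's length function would let the same loop hit each target context size more precisely.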

Current Status and Next Steps

Three models failed during testing and are undergoing KV offload tests: Qwen3.5-4B-bf16, Qwen3.5-27B-Q4_K_M, and Qwen3.5-35B-A3B-Q4_K_M. The developer had to restart these tests after a script issue wasted 24 hours of runtime.
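
The source does not say why these three runs failed. As general background on why KV cache placement matters at long context, the cache size for a standard transformer with full attention in every layer can be estimated as 2 (keys and values) x layers x KV heads x head dimension x context length x bytes per element. The architecture numbers below are hypothetical placeholders, not Qwen3.5's actual configuration, and real quantized cache formats add a small per-block overhead that this estimate ignores.

```python
GIB = 1024 ** 3


def kv_cache_bytes(n_layers, n_kv_heads, head_dim, n_ctx, bytes_per_elem):
    """Estimate KV cache size, assuming full attention in every layer."""
    # Factor 2: one key tensor and one value tensor per layer.
    return int(2 * n_layers * n_kv_heads * head_dim * n_ctx * bytes_per_elem)


# Hypothetical architecture values, chosen only to illustrate the arithmetic.
shape = dict(n_layers=32, n_kv_heads=8, head_dim=128, n_ctx=262_144)

for label, width in [("16-bit", 2.0), ("8-bit", 1.0), ("4-bit", 0.5)]:
    size_gib = kv_cache_bytes(**shape, bytes_per_elem=width) / GIB
    print(f"{label} KV cache: {size_gib:.1f} GiB")
```

With these placeholder values, the cache alone comes to 32, 16, and 8 GiB respectively. Since an RTX 4090 has 24 GB of VRAM, this shows how cache precision and offloading can decide whether a long-context run fits on the card.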

Once the VRAM offloading tests complete, the developer plans to compare the results against foundational models; the outputs have been saved for that analysis. The developer was particularly surprised by the performance of the 9B and 27B dense models.

The developer is seeking community input on which models to compare against and what grading methodology to use for evaluation.

📖 Read the full source: r/openclaw
