Benchmark Results for Qwen3.5 Models with 2K to 400K Context on RTX 4090

Qwen3.5 Performance Testing on RTX 4090
A developer shared benchmark results for Qwen3.5 models running on an RTX 4090 GPU, testing context windows from 2,048 to 400,000 tokens. The tests were originally planned for a 262k context but were extended to 400k using YaRN and other methods.
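As a rough illustration of what the extension from 262k to 400k implies, the sketch below computes the minimum context-scaling factor a YaRN-style method would need. It assumes the 262k figure corresponds to 262,144 tokens (the value that appears in the tested context list) and that the factor is simply target length divided by original length; the post does not state the settings actually used.

```python
# Assumption: 262,144 is the pre-extension window; 400,000 is the target from the post.
ORIGINAL_CONTEXT = 262_144
TARGET_CONTEXT = 400_000


def context_scale_factor(target: int, original: int) -> float:
    """Minimum extension factor needed so `original` stretches to cover `target` tokens."""
    if original <= 0:
        raise ValueError("original context must be positive")
    return target / original


factor = context_scale_factor(TARGET_CONTEXT, ORIGINAL_CONTEXT)
print(f"scale factor: {factor:.3f}")
```

In practice a slightly larger, rounder factor may be chosen; this only shows the lower bound.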
Models Tested
The following Qwen3.5 model variants were benchmarked:
- Qwen3.5-0.8B-Q4_K_M
- Qwen3.5-0.8B-bf16
- Qwen3.5-2B-Q4_K_M
- Qwen3.5-2B-bf16
- Qwen3.5-4B-Q4_K_M
- Qwen3.5-4B-bf16
- Qwen3.5-9B-Q4_K_M
- Qwen3.5-9B-bf16
- Qwen3.5-27B-Q4_K_M
- Qwen3.5-35B-A3B-Q4_K_M
Context Windows Tested
The models were evaluated at these specific context lengths: 2048, 4096, 8192, 32768, 65536, 98304, 131072, 196608, 262144, 327680, 360448, 393216, and 400000 tokens.
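The model list and context lengths above define a benchmark grid. The following sketch enumerates it, assuming every model is paired with every context length; the post does not say whether all combinations were actually run, and the variable names are illustrative only.

```python
from itertools import product

# Model variants and context lengths as listed in the post.
MODELS = [
    "Qwen3.5-0.8B-Q4_K_M", "Qwen3.5-0.8B-bf16",
    "Qwen3.5-2B-Q4_K_M", "Qwen3.5-2B-bf16",
    "Qwen3.5-4B-Q4_K_M", "Qwen3.5-4B-bf16",
    "Qwen3.5-9B-Q4_K_M", "Qwen3.5-9B-bf16",
    "Qwen3.5-27B-Q4_K_M", "Qwen3.5-35B-A3B-Q4_K_M",
]
CONTEXTS = [
    2048, 4096, 8192, 32768, 65536, 98304, 131072,
    196608, 262144, 327680, 360448, 393216, 400000,
]

# Assumption: a full cross product of models and context lengths.
runs = list(product(MODELS, CONTEXTS))
print(f"{len(MODELS)} models x {len(CONTEXTS)} contexts = {len(runs)} runs")
```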
Testing Methodology
The benchmark script was tuned for the highest tokens-per-second throughput, using NGL (GPU layer offload) settings together with 8-bit and 4-bit quantized KV caches. The developer noted that although the initial time-to-first-token (TTFT) looks long, the Warm TTFT Avg (s) column shows better performance once the KV cache is loaded. The full context was deliberately loaded in the first interaction.
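One way to separate the cold first request from the warm ones is sketched below. It assumes the first measurement in a session is the cold TTFT (full context load) and that a warm average is the mean of the remaining measurements; the post does not describe how its Warm TTFT Avg (s) column was computed, and the sample numbers are made up.

```python
from statistics import mean


def summarize_ttft(ttft_seconds: list[float]) -> dict[str, float]:
    """Split TTFT samples into the cold first request and the warm average."""
    if len(ttft_seconds) < 2:
        raise ValueError("need one cold and at least one warm measurement")
    cold, *warm = ttft_seconds
    return {"cold_ttft_s": cold, "warm_ttft_avg_s": mean(warm)}


# Hypothetical measurements in seconds, not taken from the benchmark.
sample = [42.0, 1.5, 2.0, 2.5]
summary = summarize_ttft(sample)
print(summary)
```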
To test context handling, each model received a one-sentence prompt asking it to summarize logs, followed by 2k to 400k tokens of log data. The developer reported some discrepancies but found overall performance satisfactory.
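A prompt of this shape could be assembled roughly as follows. This is a sketch only: the instruction text, the synthetic log lines, and the four-characters-per-token estimate are all assumptions, and a real harness would count tokens with the model's own tokenizer.

```python
def build_prompt(instruction: str, log_lines: list[str], token_budget: int,
                 chars_per_token: int = 4) -> str:
    """Append log lines after the instruction until the estimated budget is reached."""
    # Assumption: crude length estimate of `chars_per_token` characters per token.
    char_budget = token_budget * chars_per_token
    parts = [instruction]
    used = len(instruction)
    for line in log_lines:
        cost = len(line) + 1  # +1 for the newline that joins this line to the prompt
        if used + cost > char_budget:
            break
        parts.append(line)
        used += cost
    return "\n".join(parts)


# Synthetic placeholder logs, not the data used in the benchmark.
logs = [f"INFO worker={i} status=ok latency_ms={i % 97}" for i in range(5000)]
prompt = build_prompt("Summarize the following logs.", logs, token_budget=2048)
print(len(prompt), "characters")
```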
Current Status and Next Steps
Three models failed during testing and are undergoing KV offload tests: Qwen3.5-4B-bf16, Qwen3.5-27B-Q4_K_M, and Qwen3.5-35B-A3B-Q4_K_M. The developer had to restart these tests after a script issue wasted 24 hours of runtime.
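To give a sense of why KV offloading can matter at these context lengths, the sketch below estimates KV cache size for a generic full-attention transformer. The layer count, KV head count, and head dimension are hypothetical placeholders, not Qwen3.5's actual configuration, and the estimate ignores quantization metadata, model weights, and any hybrid attention layers. The only hardware fact used is the RTX 4090's 24 GB of VRAM.

```python
GIB = 1024 ** 3
RTX_4090_VRAM_GIB = 24  # the RTX 4090 ships with 24 GB of VRAM


def kv_cache_bytes(layers: int, kv_heads: int, head_dim: int,
                   tokens: int, bytes_per_elem: float) -> float:
    """Keys plus values (factor 2) across all layers for `tokens` cached positions."""
    return 2 * layers * kv_heads * head_dim * tokens * bytes_per_elem


# Hypothetical architecture values, chosen only for illustration.
LAYERS, KV_HEADS, HEAD_DIM = 32, 8, 128

for label, width in [("8-bit", 1.0), ("4-bit", 0.5)]:
    size_gib = kv_cache_bytes(LAYERS, KV_HEADS, HEAD_DIM, 400_000, width) / GIB
    fits = size_gib < RTX_4090_VRAM_GIB
    print(f"{label} KV cache at 400k tokens: {size_gib:.1f} GiB (below 24 GiB: {fits})")
```

Under these placeholder values the 8-bit cache alone would exceed the card's memory before any weights are loaded, which is the kind of pressure that offloading is meant to relieve.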
Once the VRAM offloading tests complete, the developer plans to compare results against foundational models and has saved outputs for analysis. The developer expressed particular surprise at the performance of the 9B and 27B dense models.
The developer is seeking community input on which models to compare against and what grading methodology to use for evaluation.
📖 Read the full source: r/openclaw