DeepSeek-V4 Pro and Flash: 1.6T Parameters, 1M Token Context, Hybrid Attention

By OpenClawRadar | Published: April 24, 2026

DeepSeek AI has released a preview of the DeepSeek-V4 series on Hugging Face. The lineup includes two Mixture-of-Experts (MoE) language models:

  • DeepSeek-V4-Pro: 1.6 trillion total parameters, 49 billion activated per token
  • DeepSeek-V4-Flash: 284 billion total parameters, 13 billion activated per token

Both models support a context length of one million tokens.
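To put the parameter counts above in perspective, the short sketch below computes what share of each model's weights is activated per token. It uses only the figures quoted in this article; the dictionary layout and the function name `active_fraction` are illustrative choices, not anything published by DeepSeek.

```python
# Parameter counts in billions, copied from the figures quoted above.
MODELS = {
    "DeepSeek-V4-Pro": {"total_b": 1600, "active_b": 49},
    "DeepSeek-V4-Flash": {"total_b": 284, "active_b": 13},
}


def active_fraction(total_b: float, active_b: float) -> float:
    """Return the share of parameters activated per token."""
    return active_b / total_b


for name, p in MODELS.items():
    frac = active_fraction(p["total_b"], p["active_b"])
    print(f"{name}: {frac:.1%} of parameters active per token")
# DeepSeek-V4-Pro: 3.1% of parameters active per token
# DeepSeek-V4-Flash: 4.6% of parameters active per token
```

In both cases only a few percent of the weights take part in any single token, which is the usual reason MoE models are cheaper to run than their total size suggests.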

Architectural Upgrades

The V4 series introduces a hybrid attention mechanism combining:

  • Compressed Sparse Attention (CSA)
  • Heavily Compressed Attention (HCA)

At the 1M-token context length, DeepSeek-V4-Pro requires only 27% of the single-token inference FLOPs and 10% of the KV cache compared to DeepSeek-V3.2.
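The relative figures above can be restated as reductions and savings factors. The following is a minimal sketch of that arithmetic, assuming the 27% and 10% values are simple ratios against DeepSeek-V3.2 at the same 1M-token context; the helper names are hypothetical.

```python
def reduction(relative_cost: float) -> float:
    """Fraction saved when the new cost is `relative_cost` times the baseline."""
    return 1.0 - relative_cost


def savings_factor(relative_cost: float) -> float:
    """How many times smaller the new cost is than the baseline."""
    return 1.0 / relative_cost


# Ratios quoted above for DeepSeek-V4-Pro vs DeepSeek-V3.2 at 1M tokens.
RELATIVE_TO_V3_2 = {"single-token FLOPs": 0.27, "KV cache": 0.10}

for metric, rel in RELATIVE_TO_V3_2.items():
    print(f"{metric}: {reduction(rel):.0%} lower, about {savings_factor(rel):.1f}x less")
# single-token FLOPs: 73% lower, about 3.7x less
# KV cache: 90% lower, about 10.0x less
```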

Additionally, the models incorporate Manifold-Constrained Hyper-Connections (mHC) to strengthen residual connections, improving training stability.

Model Details

  • Repository: deepseek-ai/DeepSeek-V4-Pro on Hugging Face
  • Pipeline tag: text-generation
  • Auto model class: AutoModelForCausalLM
  • License: MIT
  • Weights: sharded safetensors, including BF16, F32, F8_E8M0, F8_E4M3, and INT8 formats
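  • Total parameter count from safetensors: ~862 billion parameters. The listing reads this as a total across all experts, but it does not match the 1.6 trillion total quoted above, and the source does not explain the gap.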
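Given the repository name and auto model class listed above, loading would presumably follow the standard Hugging Face `transformers` pattern. The sketch below is untested against this checkpoint and rests on several assumptions: that the installed `transformers` version supports the V4 architecture, that the `accelerate` package is available for `device_map="auto"`, and that enough accelerator memory exists for a model of this size. The function name `load_v4` is hypothetical, and the repository may need extra arguments not shown here.

```python
REPO_ID = "deepseek-ai/DeepSeek-V4-Pro"  # repository named in the listing above


def load_v4(repo_id: str = REPO_ID):
    """Load tokenizer and model via the standard Auto classes (untested sketch)."""
    # Imported lazily so that defining this function needs no heavy dependencies.
    from transformers import AutoModelForCausalLM, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(repo_id)
    model = AutoModelForCausalLM.from_pretrained(
        repo_id,
        torch_dtype="auto",  # keep the dtypes stored in the checkpoint
        device_map="auto",   # shard across available devices; requires `accelerate`
    )
    return tokenizer, model
```

Calling `load_v4()` would start downloading the full set of sharded weights, so it is worth checking the repository's model card for hardware requirements first.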

Benchmarks and Efficiency

According to the technical report (not yet fully public), the hybrid attention design is what drives the long-context efficiency gains. In the 1M-token setting, DeepSeek-V4-Pro uses 73% fewer single-token inference FLOPs and a 90% smaller KV cache than DeepSeek-V3.2, which restates the 27% and 10% figures given earlier.

For developers running long-context applications (e.g., document analysis, codebase understanding, multi-turn agents), this makes DeepSeek-V4 a compelling option: context can grow toward 1M tokens without a proportional increase in compute cost.

Who It's For

This release targets developers building AI agents that need to process very long documents, large codebases, or multi-turn conversations with full context retention.

📖 Read the full source: HN AI Agents
