DeepSeek-V4 Pro and Flash: 1.6T Parameters, 1M Token Context, Hybrid Attention

DeepSeek AI has released a preview of the DeepSeek-V4 series on Hugging Face. The lineup includes two Mixture-of-Experts (MoE) language models:
- DeepSeek-V4-Pro: 1.6 trillion total parameters, 49 billion activated per token
- DeepSeek-V4-Flash: 284 billion total parameters, 13 billion activated per token
Both models support a context length of one million tokens.
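To make the MoE sparsity concrete, here is a minimal sketch that computes the share of parameters activated per token from the figures quoted above. The dictionary and function names are illustrative, not part of any DeepSeek tooling.

```python
# Parameter counts as quoted in the article: (total, activated per token).
MODELS = {
    "DeepSeek-V4-Pro": (1.6e12, 49e9),
    "DeepSeek-V4-Flash": (284e9, 13e9),
}


def active_fraction(total_params: float, active_params: float) -> float:
    """Return the share of the model's parameters used for a single token."""
    return active_params / total_params


if __name__ == "__main__":
    for name, (total, active) in MODELS.items():
        share = active_fraction(total, active)
        print(f"{name}: {share:.1%} of parameters active per token")
```

By these numbers, Pro activates roughly 3% of its parameters per token and Flash roughly 4.6%.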
Architectural Upgrades
The V4 series introduces a hybrid attention mechanism combining:
- Compressed Sparse Attention (CSA)
- Heavily Compressed Attention (HCA)
At the 1M-token context length, DeepSeek-V4-Pro requires only 27% of the single-token inference FLOPs and 10% of the KV cache compared to DeepSeek-V3.2.
Additionally, the models incorporate Manifold-Constrained Hyper-Connections (mHC) to strengthen residual connections, improving training stability.
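The relative-cost figures for the 1M-token setting can be turned into reductions, or applied to a baseline budget, with simple arithmetic. The sketch below assumes a hypothetical 100 GB KV cache for DeepSeek-V3.2 purely for illustration; the source gives no absolute figure.

```python
# Relative costs of DeepSeek-V4-Pro vs DeepSeek-V3.2 at 1M tokens, per the article.
V4_PRO_RELATIVE_FLOPS = 0.27  # 27% of single-token inference FLOPs
V4_PRO_RELATIVE_KV = 0.10     # 10% of the KV cache

# Hypothetical baseline, NOT from the source: used only to show the scaling.
HYPOTHETICAL_V32_KV_GB = 100.0


def reduction(relative_cost: float) -> float:
    """Convert 'X of the baseline' into 'reduced by 1 - X'."""
    return 1.0 - relative_cost


def scaled_cost(baseline: float, relative_cost: float) -> float:
    """Apply a relative cost to an absolute baseline figure."""
    return baseline * relative_cost


if __name__ == "__main__":
    print(f"FLOPs reduction: {reduction(V4_PRO_RELATIVE_FLOPS):.0%}")
    print(f"KV cache reduction: {reduction(V4_PRO_RELATIVE_KV):.0%}")
    kv_gb = scaled_cost(HYPOTHETICAL_V32_KV_GB, V4_PRO_RELATIVE_KV)
    print(f"KV cache for a hypothetical {HYPOTHETICAL_V32_KV_GB:.0f} GB baseline: {kv_gb:.0f} GB")
```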
Model Details
- Repository: deepseek-ai/DeepSeek-V4-Pro on Hugging Face
- Pipeline tag: text-generation
- Auto model class: AutoModelForCausalLM
- License: MIT
- Weights: sharded safetensors, including BF16, F32, F8_E8M0, F8_E4M3, and INT8 formats
- Total parameter count from safetensors: ~862 billion parameters. This does not match the 1.6 trillion headline figure, and the source does not explain the difference.
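For a rough sense of checkpoint size, the sketch below computes what ~862 billion parameters would occupy if every one of them were stored in a single dtype. The real checkpoint mixes the listed formats in proportions the source does not give, so these are bounds for illustration, not a measurement. The byte widths are the standard sizes for each format; the function name is illustrative.

```python
# Bytes per element for the weight formats listed in the article.
BYTES_PER_PARAM = {
    "F32": 4,
    "BF16": 2,
    "F8_E4M3": 1,
    "F8_E8M0": 1,
    "INT8": 1,
}

# Parameter count reported in the safetensors metadata, per the article.
SAFETENSORS_PARAMS = 862e9


def uniform_size_gb(num_params: float, dtype: str) -> float:
    """Size in decimal gigabytes if every parameter used the given dtype."""
    return num_params * BYTES_PER_PARAM[dtype] / 1e9


if __name__ == "__main__":
    for dtype in BYTES_PER_PARAM:
        size = uniform_size_gb(SAFETENSORS_PARAMS, dtype)
        print(f"all {dtype}: {size:,.0f} GB")
```

The actual download size falls somewhere between the one-byte and four-byte cases, depending on how many tensors use each format.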
Benchmarks and Efficiency
The technical report (not yet fully public) attributes the long-context efficiency gains to the hybrid attention. In the 1M-token setting, DeepSeek-V4-Pro uses 73% fewer single-token inference FLOPs and 90% less KV cache than DeepSeek-V3.2.
For developers running long-context applications (e.g., document analysis, codebase understanding, multi-turn agents), this makes DeepSeek-V4 a compelling option: context length can grow without a proportional increase in compute and memory cost.
Who It's For
This release targets developers building AI agents that need to process very long documents, large codebases, or multi-turn conversations with full context retention.
📖 Read the full source: HN AI Agents