Beyond DeepSeek V4: A Trillion‑Parameter LLM Trained End‑to‑End on Domestic Chips
The article analyzes how DeepSeek V4 and Meituan's LongCat‑2.0‑P preview, each with trillion‑scale parameters and a 1 M‑token context, were trained and served entirely on Chinese‑made accelerators, detailing the memory optimizations, deterministic operators, MoE redesigns, and massive multi‑card clusters that show domestic compute can handle top‑tier AI workloads.
On June 24, 2026, DeepSeek released a preview of its next‑generation model DeepSeek‑V4, whose total parameter count reaches the trillion scale and which supports a million‑token context window. On the same day, Meituan quietly announced the preview of its own trillion‑parameter model, LongCat‑2.0‑Preview (LongCat‑2.0‑P).
Although LongCat‑2.0‑P matches DeepSeek‑V4 in parameter magnitude, the real breakthrough lies in its training‑inference pipeline: the entire workflow runs on domestic accelerators, with zero NVIDIA hardware. Insider reports confirm that the training phase used a domestic‑chip cluster of 5–6 million cards, setting a new upper bound for Chinese‑made compute supporting ultra‑large models.
The technical report reveals three key signals that explain how this was achieved. First, the word “accelerator” appears in the training chapter while “GPU” is omitted, indicating a deliberate avoidance of NVIDIA hardware. Second, the peak training memory was compressed from the typical 80 GB to 60 GB after applying the V‑ZB optimizer, a reduction that is critical for chips with limited per‑card memory. Third, the report emphasizes deterministic operator implementations (e.g., the custom FAG version of FlashAttention) that guarantee reproducible results across long training runs.
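The 80 GB → 60 GB compression becomes intuitive once optimizer state is sharded across ranks. The V‑ZB optimizer's internals are not public, so the sketch below is only a generic back‑of‑envelope: the function name, byte counts, and parameter figures are illustrative assumptions, showing why spreading fp32 master weights and Adam moments across more ranks lowers the per‑card peak.

```python
# Illustrative per-card memory accounting under optimizer-state sharding.
# Byte counts assume bf16 weights/grads and fp32 master weights plus two
# Adam moments; all figures here are assumptions, not LongCat's numbers.

def per_card_gigabytes(params: float, cards: int,
                       weight_bytes: int = 2,    # bf16 weights (replicated)
                       grad_bytes: int = 2,      # bf16 gradients (replicated)
                       optim_bytes: int = 12):   # fp32 master + 2 Adam moments
    """Rough per-card memory (GB) when optimizer state is sharded across
    `cards` ranks while weights and gradients stay replicated."""
    replicated = params * (weight_bytes + grad_bytes)
    sharded = params * optim_bytes / cards
    return (replicated + sharded) / 1e9

# Hypothetical 5B-parameter shard: more ranks sharing the 12-byte-per-param
# optimizer state means a lower per-card peak.
print(per_card_gigabytes(5e9, cards=8))   # → 27.5
print(per_card_gigabytes(5e9, cards=64))  # → 20.9375
```

The same logic applies regardless of the exact state layout: only the replicated terms set the per‑card floor, so the optimizer's job is to move as much as possible into the sharded term.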
Deterministic implementations on domestic chips originally suffered a 20‑70× slowdown compared to nondeterministic versions. LongCat’s team rewrote the FAG operator, limiting performance loss to about 5 % while preserving determinism. Similar rewrites were performed for Scatter‑type operators, achieving tens‑fold speedups when parallelized across all available compute units.
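The determinism problem with scatter‑type operators comes from floating‑point addition not being associative: atomic adds land in hardware‑dependent order, so sums can differ run to run. A common fix, sketched below in plain Python with an illustrative function name (not LongCat's actual kernel), is to group updates by destination index and reduce each group in a fixed order.

```python
# Sketch of a deterministic scatter-add: instead of letting parallel
# hardware accumulate updates in arrival order (atomic adds, which make
# float sums run-to-run nondeterministic), group updates by destination
# index and reduce each group left to right in a fixed order.

from collections import defaultdict

def scatter_add_deterministic(out, indices, values):
    """Accumulate values[i] into out[indices[i]] in a reproducible order."""
    groups = defaultdict(list)
    for idx, val in zip(indices, values):
        groups[idx].append(val)          # preserves input order per index
    for idx in sorted(groups):
        acc = 0.0
        for v in groups[idx]:            # fixed left-to-right reduction
            acc += v
        out[idx] += acc
    return out

result = scatter_add_deterministic([0.0] * 4, [2, 0, 2, 1], [1.5, 2.0, 0.5, 3.0])
# → [2.0, 3.0, 2.0, 0.0]
```

On an accelerator the grouping step is typically a sort-by-index followed by a segmented reduction, which is what makes the parallelized version competitive with the atomic one.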
Beyond low‑level kernels, the architecture underwent substantial redesign. The MoE backbone was augmented with N‑gram embedding, moving part of the expert parameters into the embedding layer to reduce expert‑to‑expert communication. Sparse attention with cross‑layer flow‑aware indexing further cuts redundant full‑attention calculations, enabling stable support for 1 M‑token contexts while keeping inference latency and cost under control.
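The report does not spell out the N‑gram embedding mechanics. A plausible reading, sketched below with made‑up names, table size, and hash (all assumptions, not the published design), is a hashed n‑gram lookup added to the token embedding, so that this slice of model capacity is served by purely local table reads instead of routed expert traffic.

```python
# Hypothetical sketch of an n-gram embedding path: hash each token's
# trailing n-gram into a local table and add the result to the token
# embedding. Everything is a local lookup, so no expert-to-expert
# communication is needed for this part of the parameters.

TABLE_SIZE = 1 << 16  # illustrative bucket count

def ngram_bucket(tokens, i, n):
    """Stable polynomial hash of the n-gram ending at position i."""
    h = 0
    for t in tokens[max(0, i - n + 1): i + 1]:
        h = (h * 1000003 + t) % TABLE_SIZE
    return h

def embed_with_ngrams(tokens, tok_table, gram_table, n=2):
    """Token embedding plus hashed n-gram embedding, all local reads."""
    out = []
    for i, t in enumerate(tokens):
        base = tok_table[t]
        bonus = gram_table[ngram_bucket(tokens, i, n)]
        out.append([a + b for a, b in zip(base, bonus)])
    return out

tok_table = {7: [1.0, 1.0]}                # per-token embeddings
gram_table = [[0.5, -0.5]] * TABLE_SIZE    # hashed n-gram table
print(embed_with_ngrams([7], tok_table, gram_table))  # → [[1.5, 0.5]]
```

The trade‑off this illustrates is the one the paragraph describes: embedding‑table parameters cost memory but no network hops, whereas expert parameters cost all‑to‑all communication on every forward pass.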
Training such a model required sophisticated parallelism. The team re‑engineered the classic expert‑parallel (EP), tensor‑parallel (TP), and pipeline‑parallel (PP) strategies to fit a cluster where each card’s HBM capacity and bandwidth lag behind high‑end NVIDIA GPUs. Fine‑grained memory‑aware scheduling, dynamic expert slicing, and a custom fault‑tolerance layer (link awareness, auto‑rescheduling, multi‑level anomaly detection) ensured that hardware failures or network jitter did not derail the 30‑day training cycle.
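At its core, memory‑aware pipeline scheduling reduces to a packing constraint: contiguous groups of layers must fit each card's memory budget. The greedy split below is a minimal illustration with invented costs, not the production scheduler, which also handles dynamic expert slicing and fault tolerance.

```python
# Minimal sketch of memory-aware pipeline partitioning: split per-layer
# memory costs (GB, made-up numbers) into contiguous pipeline stages so
# that no stage exceeds a per-card budget.

def partition_layers(layer_costs, budget):
    """Greedy contiguous split: open a new stage whenever adding the next
    layer would exceed the budget. Raises if one layer alone exceeds it."""
    stages, current, used = [], [], 0.0
    for i, cost in enumerate(layer_costs):
        if cost > budget:
            raise ValueError(f"layer {i} exceeds the per-card budget alone")
        if used + cost > budget:
            stages.append(current)
            current, used = [], 0.0
        current.append(i)
        used += cost
    if current:
        stages.append(current)
    return stages

print(partition_layers([20, 25, 15, 30, 10, 35], budget=60))
# → [[0, 1, 2], [3, 4], [5]]
```

A real scheduler would also balance compute time across stages and re‑solve the split when a card fails, but the budget constraint shown here is what a 60 GB card forces on every such plan.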
On the software side, the team avoided relying on the mature CUDA ecosystem. Core operators such as GEMM and attention were hand‑optimized for the domestic ISA, and deterministic computation paths were introduced to guarantee numerical stability. The DORA asynchronous training framework paper is cited, confirming that the production cluster operates with roughly 60 GB of device memory per accelerator.
Overall, the LongCat‑2.0‑P preview demonstrates that, despite current gaps in raw memory and bandwidth, domestic chips can match international standards in correctness, numerical precision, and long‑duration training stability. The model's 1.6 T total parameters, average of 48 B activated parameters, and 1 M‑token context illustrate a successful end‑to‑end domestic AI stack—from hardware to software, from training to inference.
Meituan has opened internal testing of LongCat‑2.0‑P, offering 1,000,000 free tokens per day to developers; the platform's API is publicly accessible at https://longcat.chat/platform/usage, inviting the community to evaluate the capabilities of a fully domestically trained trillion‑parameter LLM.
This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contact us and we will review it promptly.