TeleChat3-105B: China’s First 100B‑Scale MoE Model and Its Technical Breakthroughs
The article analyzes TeleChat3-105B-A4.7-Thinking, the first domestically built Mixture-of-Experts model at the 100-billion-parameter scale, detailing its multi-dimensional evaluation, three-stage training pipeline, hardware-level optimizations, fine-grained architecture, and significance for the evolving AI competition landscape.
In recent years, large-model competition has entered a new dimension, moving from trillion-parameter MoE models to Omni and now Agentic capabilities, pushing the frontier into deeper waters.
The latest highlight is TeleChat3-105B-A4.7-Thinking, the first openly disclosed Chinese-built MoE model exceeding 100 billion parameters. TeleAI's evaluation compares it with top open-source models across question answering, writing, mathematics, coding, and agent tasks, showing competitive performance on all fronts.
Training pipeline: TeleAI reveals a three-stage "devil training" process. Stage 1 pre-trains on general knowledge, stage 2 intensively adds STEM and code data, and stage 3 feeds in repository-scale code and agent-task data. The final stage balances logical rigor and aesthetic quality by applying rule-based checks to code and a dedicated reward model (RM) to score text creation and role-play; a sketch of this hybrid reward routing follows below.
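To make the hybrid reward concrete, here is a minimal Python sketch of how such a reward router could work. The toy subprocess "sandbox", the test-case format, and the stand-in RM scorer are all illustrative assumptions, not TeleAI's actual pipeline:

```python
# Hypothetical sketch of a hybrid reward router: rule-based checks for code,
# a learned reward model for creative text. Not TeleAI's implementation.
import ast
import subprocess
import sys

def run_sandboxed(code: str, stdin_text: str, timeout: float = 5.0) -> str:
    """Run candidate code in a subprocess and capture stdout (toy sandbox)."""
    proc = subprocess.run([sys.executable, "-c", code], input=stdin_text,
                          capture_output=True, text=True, timeout=timeout)
    return proc.stdout.strip()

def rule_based_code_reward(code: str, tests: list[tuple[str, str]]) -> float:
    """Reward 1.0 only if the code parses and passes every I/O test case."""
    try:
        ast.parse(code)  # cheap syntactic gate before execution
    except SyntaxError:
        return 0.0
    try:
        ok = all(run_sandboxed(code, i) == o for i, o in tests)
    except subprocess.TimeoutExpired:
        return 0.0
    return 1.0 if ok else 0.0

def rm_text_reward(prompt: str, response: str) -> float:
    # Stand-in for a learned RM scoring creative writing / role-play;
    # a real RM is a neural scorer, not this length heuristic.
    return min(len(response) / 500.0, 1.0)

def reward(sample: dict) -> float:
    """Route code samples to rule-based checks, everything else to the RM."""
    if sample["task"] == "code":
        return rule_based_code_reward(sample["response"], sample["tests"])
    return rm_text_reward(sample["prompt"], sample["response"])

# Example usage
code_sample = {"task": "code",
               "response": "print(int(input()) * 2)",
               "tests": [("3", "6"), ("10", "20")]}
print(reward(code_sample))  # 1.0 if both tests pass
```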
Hardware innovations: To run the model on domestic compute, the team re-engineered the MoE communication layer, converting tensor-parallel domains into expert-parallel domains and confining the costly "All-to-All" exchanges within nodes (see the sketch below). They also introduced micro-level dynamic stitching to mitigate load imbalance during long-sequence training.
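The intra-node confinement can be illustrated with a toy simulation. In the sketch below, each node holds a replica of the full expert set sharded across its local GPUs, so every token's expert owner sits on the same node as the token. The node/GPU counts and round-robin expert placement are assumptions for illustration, not TeleAI's actual layout:

```python
# Toy simulation of keeping MoE All-to-All traffic inside a node.
# All sizes and the placement scheme are illustrative assumptions.
import random

NODES, GPUS_PER_NODE, N_EXPERTS = 2, 8, 192
WORLD = NODES * GPUS_PER_NODE

def node_of(rank: int) -> int:
    return rank // GPUS_PER_NODE

def owner_rank(expert: int, node: int) -> int:
    # Each node replicates all 192 experts, sharded round-robin over its
    # 8 local GPUs (24 experts per GPU), so the owner is always local.
    return node * GPUS_PER_NODE + expert % GPUS_PER_NODE

random.seed(0)
cross_node = 0
for src in range(WORLD):                      # every GPU dispatches its tokens
    for _ in range(16):                       # 16 routed tokens per GPU (toy)
        expert = random.randrange(N_EXPERTS)  # router's choice for this token
        dst = owner_rank(expert, node_of(src))
        cross_node += node_of(dst) != node_of(src)
print("cross-node All-to-All messages:", cross_node)  # 0 by construction
```

The trade-off is extra expert memory per node in exchange for never paying inter-node latency on the token shuffle, which is the bottleneck this redesign targets.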
Model architecture: TeleChat3 employs a fine-grained MoE whose total parameters surpass 100 billion (105B), with only 4.7 billion active per token at inference, an activation ratio of roughly 4.5%. It uses one shared expert plus 192 routed experts, a high expert-sparsity ratio that enables precise knowledge-point activation, akin to a student calling on the exact specialist for each question; a minimal sketch of this layer structure appears below.
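As an illustration of the shared-plus-routed structure, here is a minimal PyTorch sketch of such a layer. The hidden sizes, the top-k value, and the naive per-token dispatch loop are assumptions chosen for clarity, not TeleChat3's actual configuration or kernels:

```python
# Minimal sketch of a fine-grained MoE block: one always-on shared expert
# plus top-k routing over 192 specialists. Sizes/top_k are assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F

class FineGrainedMoE(nn.Module):
    def __init__(self, d_model=512, d_ff=256, n_experts=192, top_k=8):
        super().__init__()
        self.top_k = top_k
        ffn = lambda: nn.Sequential(nn.Linear(d_model, d_ff), nn.SiLU(),
                                    nn.Linear(d_ff, d_model))
        self.shared = ffn()                      # shared expert sees every token
        self.experts = nn.ModuleList(ffn() for _ in range(n_experts))
        self.router = nn.Linear(d_model, n_experts, bias=False)

    def forward(self, x):                        # x: (tokens, d_model)
        weights, idx = self.router(x).topk(self.top_k, dim=-1)
        weights = F.softmax(weights, dim=-1)     # normalize over the chosen k
        routed = torch.zeros_like(x)
        for t in range(x.size(0)):               # naive dispatch for clarity;
            for k in range(self.top_k):          # real kernels batch by expert
                e = int(idx[t, k])
                routed[t] = routed[t] + weights[t, k] * self.experts[e](x[t])
        return self.shared(x) + routed

x = torch.randn(4, 512)
print(FineGrainedMoE()(x).shape)  # torch.Size([4, 512])
```

Only the k selected experts (plus the shared one) run per token, which is how a 100B-scale model keeps its active compute near the 4.7B mark.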
Beyond the 105B flagship, TeleAI also open-sourced a 36B dense version with targeted improvements in logic and agent capabilities. Interested readers can visit the GitHub repository at https://github.com/Tele-AI/TeleChat3 to explore the code and model weights.
The article concludes with a forward‑looking view: while U.S. giants leverage massive capital and compute to chase AGI, Chinese firms, constrained by domestic hardware, can achieve localized breakthroughs through talent leverage and innovative engineering, carving a distinctive path toward practical AGI.
Baobao Algorithm Notes
Author of the BaiMian large model, offering technology and industry insights.
