Tencent Open-Sources HunYuan DiT: First Chinese-Native DiT Text-to-Image Model, with 1.5B Parameters
Tencent has open‑sourced its upgraded 1.5‑billion‑parameter HunYuan DiT model—the first Chinese‑native, bilingual (Chinese‑English) text‑to‑image diffusion‑with‑transformer system—delivering about 20% visual quality improvement, multi‑round generation, video‑generation potential, and free commercial use, with full weights, inference code, and algorithms available on Hugging Face and GitHub for developers and enterprises.
On May 14, Tencent officially open-sourced the upgraded HunYuan text-to-image large model—the first Chinese-native DiT architecture (same architecture as Sora) text-to-image open-source model, supporting bilingual (Chinese and English) input and understanding with 1.5 billion parameters.
The upgraded HunYuan text-to-image model not only supports text-to-image generation but can also serve as the foundation for video and other multimodal visual generation. The model has been released on Hugging Face and GitHub, including complete model assets such as model weights, inference code, and model algorithms, available for free commercial use by individuals and enterprise developers.
Technical Highlights:
1. DiT Architecture (Same as Sora): The upgraded model adopts the DiT (Diffusion Transformer) architecture—a diffusion model whose denoising backbone is a Transformer—the same architectural approach and key technology behind Sora and Stable Diffusion 3. The Tencent HunYuan text-to-image team has pursued the DiT direction since July 2023, when it began R&D on the new-generation model. The model accepts text prompts of up to 256 characters, an industry-leading length, and innovatively supports multi-round image generation and dialogue, letting users refine an initially generated image through natural-language instructions. In general scenarios, the DiT-based model improves overall visual generation quality by about 20%, with across-the-board gains in realism, texture and detail, and spatial composition.
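To make the "diffusion model with a Transformer backbone" idea concrete, here is a minimal NumPy sketch of a single DiT-style block: self-attention plus an MLP over image-patch tokens, with the diffusion-timestep embedding injected via adaptive layer-norm modulation. This is a toy illustration of the general DiT design, not Tencent's implementation; every name and dimension below is hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

def layer_norm(x, eps=1e-6):
    mu = x.mean(-1, keepdims=True)
    var = x.var(-1, keepdims=True)
    return (x - mu) / np.sqrt(var + eps)

def softmax(x):
    e = np.exp(x - x.max(-1, keepdims=True))
    return e / e.sum(-1, keepdims=True)

def self_attention(x, wq, wk, wv, wo):
    # Single-head scaled dot-product attention over the token sequence.
    q, k, v = x @ wq, x @ wk, x @ wv
    att = softmax(q @ k.T / np.sqrt(q.shape[-1]))
    return (att @ v) @ wo

def dit_block(x, cond, p):
    # adaLN: the conditioning vector (timestep embedding) produces
    # per-channel shift/scale applied after each layer norm.
    shift1, scale1, shift2, scale2 = (cond @ p["w_mod"]).reshape(4, -1)
    h = layer_norm(x) * (1 + scale1) + shift1
    x = x + self_attention(h, p["wq"], p["wk"], p["wv"], p["wo"])
    h = layer_norm(x) * (1 + scale2) + shift2
    x = x + np.maximum(h @ p["w1"], 0) @ p["w2"]  # ReLU MLP, residual add
    return x

d = 32  # hypothetical hidden size (real models use much larger widths)
params = {
    "w_mod": rng.normal(0, 0.02, (d, 4 * d)),
    "wq": rng.normal(0, 0.02, (d, d)), "wk": rng.normal(0, 0.02, (d, d)),
    "wv": rng.normal(0, 0.02, (d, d)), "wo": rng.normal(0, 0.02, (d, d)),
    "w1": rng.normal(0, 0.02, (d, 4 * d)), "w2": rng.normal(0, 0.02, (4 * d, d)),
}
tokens = rng.normal(size=(16, d))  # 16 image-patch tokens
t_emb = rng.normal(size=(d,))      # diffusion-timestep embedding
out = dit_block(tokens, t_emb, params)
print(out.shape)  # (16, 32): same shape as the input, so blocks stack
```

A full DiT stacks many such blocks and trains them to predict the noise at each diffusion step; the key departure from U-Net-based diffusion models like the original Stable Diffusion is that the backbone is this pure-Transformer stack.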
2. Native Chinese Understanding Capability: Previous mainstream open-source text-to-image models such as Stable Diffusion were trained primarily on English datasets, so Chinese applications typically translate prompts into English first—a step that often introduces misinterpretation and yields nonsensical images. HunYuan text-to-image is the first Chinese-native DiT model with bilingual (Chinese and English) understanding and generation, and it performs especially well on Chinese cultural elements such as ancient poetry, idioms, traditional architecture, and Chinese cuisine.
3. Fully Open-Sourced, Identical to the Online Version: Developers and enterprises can run inference on the open-sourced model directly, without training from scratch, and can build their own AI painting applications and services on top of HunYuan text-to-image, saving substantial manpower and computing resources. The transparently open algorithm also helps ensure the model's safety and reliability. Notably, this open-source release is identical to the latest version powering Tencent HunYuan text-to-image products (including the WeChat mini-program, web version, and cloud API), trained on Tencent's massive application scenarios, and both individual and enterprise developers can use it commercially for free.
Tencent Cloud Developer
Official Tencent Cloud community account that brings together developers, shares practical tech insights, and fosters an influential tech exchange community.