Tagged articles
18 articles
Old Zhang's AI Learning
Mar 8, 2026 · Artificial Intelligence

Twinkle – A Lightweight, Fully Chinese Large‑Model Training Framework from ModelScope

Twinkle is a lightweight client‑server training framework open‑sourced by ModelScope. It abstracts away Ray clusters, data parallelism, and model parallelism, and offers three run modes (torchrun, Ray, HTTP), multi‑tenant LoRA training, dual back‑ends (Transformers and Megatron), and a serverless Training‑as‑a‑Service gateway for enterprise and individual developers.

LoRA · ModelScope · TaaS
0 likes · 14 min read
Fun with Large Models
Jan 12, 2026 · Artificial Intelligence

Why You Should Master Large‑Model Training: A Full‑Process Practical Guide

The article explains why mastering large‑model training is crucial for professionals, researchers, and enterprises, outlines the end‑to‑end pipeline—from data preparation and pre‑training to instruction fine‑tuning and RLHF alignment—compares training with RAG, and presents a structured learning roadmap.

AI agents · PyTorch · RAG
0 likes · 14 min read
Architects' Tech Alliance
Jul 23, 2025 · Artificial Intelligence

Why Do AI Large‑Model Training Clusters Need Specialized Network Topologies?

The article explains how AI large‑model training demands massive GPU resources and how carefully designed network architectures—such as Clos/Fat‑Tree, Spine‑Leaf, multi‑rail versus single‑rail connections, Dragonfly, and Torus—impact performance, scalability, cost, and reliability, guiding the selection of optimal data‑center networks.

AI · Data center · GPU clusters
0 likes · 9 min read
Architects' Tech Alliance
May 23, 2025 · Artificial Intelligence

Why High‑Performance Networks Are Critical for Large‑Scale AI Model Training

The whitepaper explains that AI model training and inference rely on massive data computation, with model sizes reaching billions of parameters, demanding low‑latency, high‑bandwidth, stable, scalable, and manageable networks; it compares RDMA‑based InfiniBand and RoCE solutions and offers design recommendations for future AI compute clusters.

AI · High‑Performance Networking · InfiniBand
0 likes · 10 min read
DataFunSummit
Feb 17, 2025 · Artificial Intelligence

NorthStar Large‑Model Training Framework: Architecture, APIs, Pipeline and Multi‑GPU Strategies

The article introduces the NorthStar large‑model training framework developed by DeWu, detailing its background challenges, pipeline architecture, rich API support, multi‑GPU training modes, multi‑level embedding storage, hardware selection considerations, and a brief Q&A on data versus model parallelism.

AI Framework · Embedding Storage · large model training
0 likes · 9 min read
Baidu Geek Talk
Feb 5, 2025 · Artificial Intelligence

How to Unlock Full GPU Efficiency for Enterprise AI Platforms

This article analyzes common GPU efficiency problems in enterprise AI compute platforms—such as low utilization, long fault‑resolution times, and limited performance gains—and presents three practical solutions: dynamic resource allocation, systematic fault‑tolerance, and system‑level tuning, illustrated with real‑world case studies.

AI Platform · GPU utilization · large model training
0 likes · 11 min read
Baidu Intelligent Cloud Tech Hub
Sep 29, 2024 · Artificial Intelligence

How Baidu’s Baige 4.0 Redefines AI Infrastructure for Large‑Model Training

The article details Baidu Baige 4.0’s four‑layer AI infrastructure—hardware, cluster components, training‑inference acceleration, and platform tools—highlighting its heterogeneous computing, high‑performance networking, fault‑tolerant communication library, and optimizations that boost large‑model training and inference efficiency.

AI Infrastructure · High‑Performance Networking · heterogeneous computing
0 likes · 17 min read
Baidu Geek Talk
Jul 10, 2024 · Artificial Intelligence

Baidu HPN Network: Solving Hash Collision for 95% Physical Network Bandwidth Efficiency in Large Model Training

Baidu's HPN network solves hash‑collision bottlenecks in large‑model training by combining TOR‑affinity scheduling with Dynamic Load Balancing on self‑developed switches, boosting physical network bandwidth efficiency to about 95%, improving throughput by roughly 10% and adding a further 1.5% training‑speed gain via the BCCL library.

Baidu Cloud · DLB Dynamic Load Balancing · HPN Network
0 likes · 12 min read
Baidu Intelligent Cloud Tech Hub
Jul 3, 2024 · Operations

How to Eliminate Network Hash Collisions in Large‑Model Training

This article examines the impact of GPU communication bottlenecks on large‑model training, analyzes hash‑collision issues in high‑performance networks, and presents three practical solutions—including increasing RDMA streams, affinity‑aware scheduling, and dynamic load balancing—to boost effective network bandwidth up to 95%.

Hash Collision · RDMA · dynamic load balancing
0 likes · 11 min read
Alibaba Cloud Big Data AI Platform
Feb 23, 2024 · Artificial Intelligence

How PAI‑TorchAcc Supercharges Large‑Model Training on Alibaba Cloud

PAI‑TorchAcc, an Alibaba Cloud AI platform accelerator, offers a seamless PyTorch interface that integrates HuggingFace models and employs LazyTensor‑based static graph conversion, multi‑strategy distributed training, and extensive GPU optimizations to dramatically boost throughput for 1B‑175B parameter models, surpassing both native PyTorch and Megatron‑LM performance.

AI acceleration · Alibaba Cloud · GPU Optimization
0 likes · 13 min read
DataFunTalk
Jan 29, 2024 · Artificial Intelligence

PAI‑ChatLearn: A Flexible Large‑Scale RLHF Training Framework for Massive Models

The article introduces PAI‑ChatLearn, a flexible and high‑performance framework developed by Alibaba Cloud's PAI team that supports full‑pipeline RLHF training for large models, explains the evolution of parallel training strategies, details the framework’s architecture and configuration, and showcases performance results and practical usage examples.

AI Framework · PAI-ChatLearn · RLHF
0 likes · 17 min read
Architects' Tech Alliance
Sep 11, 2023 · Artificial Intelligence

Open Acceleration Specification AI Server Design Guide (2023): Architecture, OAM Modules, UBB Board, and System Design

The 2023 Open Acceleration Specification AI Server Design Guide details the hardware architecture, OAM module and UBB board specifications, cooling, management, fault diagnosis, and software platform needed to build high‑performance, scalable AI compute clusters for large‑model training.

AI acceleration · OAM · UBB board
0 likes · 10 min read
Baidu Tech Salon
May 11, 2023 · Artificial Intelligence

Inside Baidu’s High‑Performance GPU Cluster: Powering the Next‑Gen AI Models

The article details Baidu's development of a massive high‑performance GPU/IB cluster, its architectural design, the challenges of training trillion‑parameter models, and how the integrated AI stack—spanning hardware, framework, and resource management—overcomes compute, memory, and communication bottlenecks to accelerate large‑model training.

AI Infrastructure · Baidu AI Base · Distributed Training
0 likes · 17 min read
Baidu Intelligent Cloud Tech Hub
May 9, 2023 · Artificial Intelligence

How Baidu’s High‑Performance GPU Cluster Powers the Next Generation of Large‑Scale AI Models

This article explains how Baidu built a massive, high‑performance GPU/IB cluster, optimized its architecture and software stack, and integrated AI frameworks and resource management to overcome compute, memory, and communication bottlenecks, enabling efficient training of trillion‑parameter large models.

AI Infrastructure · Distributed Training · GPU clusters
0 likes · 19 min read
Tencent Cloud Developer
Apr 14, 2023 · Artificial Intelligence

Tencent Cloud's Next-Generation HCC High-Performance Computing Cluster for Large Model Training

Tencent Cloud's new HCC high‑performance computing cluster triples the performance of the previous generation, with 3.2 Tbps server bandwidth and Xingsha servers equipped with NVIDIA H800 GPUs delivering up to 1979 TFLOPS. Its Xingmai 3.2 T ETH RDMA network, TB‑level storage via COS + GooseFS, and multi‑form access (bare metal, cloud servers, containers, functions) enable efficient large‑model training.

AI computing · GPU cluster · High‑performance computing
0 likes · 9 min read
Tencent Cloud Developer
Mar 22, 2023 · Artificial Intelligence

How AngelPTM Cuts Large Model Training Costs with ZeRO-Cache Optimizations

This article analyzes Tencent's AngelPTM framework, detailing its ZeRO-Cache strategy, unified storage management, multi‑stream async execution, SSD tiered storage, and performance benchmarks that show up to 95% larger model capacity and over 44% speedup compared to community solutions.

AI Infrastructure · GPU Acceleration · Memory Optimization
0 likes · 12 min read
Baidu Intelligent Cloud Tech Hub
Feb 23, 2023 · Artificial Intelligence

How Baidu’s Cloud Infrastructure Tackles the Challenges of Training Massive AI Models

This article explains how Baidu's intelligent cloud overcomes the compute and storage walls of large‑scale model training by combining hardware design, network topology, and software optimizations such as pipeline, tensor, and expert parallelism, cost‑model‑driven placement, and future‑proof AI infrastructure evolution.

AI Infrastructure · Baidu Cloud · Cost Model
0 likes · 28 min read