Tagged articles

18 articles

Mar 8, 2026 · Artificial Intelligence

Twinkle – A Lightweight, Fully Chinese Large‑Model Training Framework from ModelScope

Twinkle is a lightweight client‑server training framework open‑sourced by ModelScope. It abstracts away Ray clusters and data and model parallelism, and offers three run modes (torchrun, Ray, HTTP), multi‑tenant LoRA training, dual back‑ends (Transformers and Megatron), and a serverless Training‑as‑a‑Service gateway for enterprise and individual developers.

LoRA · ModelScope · TaaS

0 likes · 14 min read

Fun with Large Models

Jan 12, 2026 · Artificial Intelligence

Why You Should Master Large‑Model Training: A Full‑Process Practical Guide

The article explains why mastering large‑model training is crucial for professionals, researchers, and enterprises, outlines the end‑to‑end pipeline—from data preparation and pre‑training to instruction fine‑tuning and RLHF alignment—compares training with RAG, and presents a structured learning roadmap.

AI agents · PyTorch · RAG

0 likes · 14 min read

Architects' Tech Alliance

Jul 23, 2025 · Artificial Intelligence

Why Do AI Large‑Model Training Clusters Need Specialized Network Topologies?

The article explains how AI large‑model training demands massive GPU resources and how carefully designed network architectures—such as Clos/Fat‑Tree, Spine‑Leaf, multi‑rail versus single‑rail connections, Dragonfly, and Torus—impact performance, scalability, cost, and reliability, guiding the selection of optimal data‑center networks.

AI · Data center · GPU clusters

0 likes · 9 min read

Architects' Tech Alliance

May 23, 2025 · Artificial Intelligence

Why High‑Performance Networks Are Critical for Large‑Scale AI Model Training

The whitepaper explains that AI model training and inference rely on massive data computation, with model sizes reaching billions of parameters, demanding low‑latency, high‑bandwidth, stable, scalable, and manageable networks; it compares RDMA‑based InfiniBand and RoCE solutions and offers design recommendations for future AI compute clusters.

AI · High‑Performance Networking · InfiniBand

0 likes · 10 min read

DataFunSummit

Feb 17, 2025 · Artificial Intelligence

NorthStar Large‑Model Training Framework: Architecture, APIs, Pipeline and Multi‑GPU Strategies

The article introduces the NorthStar large‑model training framework developed by DeWu, detailing its background challenges, pipeline architecture, rich API support, multi‑GPU training modes, multi‑level embedding storage, hardware selection considerations, and a brief Q&A on data versus model parallelism.

AI Framework · Embedding Storage · large model training

0 likes · 9 min read

Baidu Geek Talk

Feb 5, 2025 · Artificial Intelligence

How to Unlock Full GPU Efficiency for Enterprise AI Platforms

This article analyzes common GPU efficiency problems in enterprise AI compute platforms—such as low utilization, long fault‑resolution times, and limited performance gains—and presents three practical solutions: dynamic resource allocation, systematic fault‑tolerance, and system‑level tuning, illustrated with real‑world case studies.

AI Platform · GPU utilization · large model training

0 likes · 11 min read

Baidu Intelligent Cloud Tech Hub

Sep 29, 2024 · Artificial Intelligence

How Baidu’s Baige 4.0 Redefines AI Infrastructure for Large‑Model Training

The article details Baidu Baige 4.0’s four‑layer AI infrastructure—hardware, cluster components, training‑inference acceleration, and platform tools—highlighting its heterogeneous computing, high‑performance networking, fault‑tolerant communication library, and optimizations that boost large‑model training and inference efficiency.

AI Infrastructure · High‑Performance Networking · heterogeneous computing

0 likes · 17 min read

Baidu Geek Talk

Jul 10, 2024 · Artificial Intelligence

Baidu HPN Network: Solving Hash Collision for 95% Physical Network Bandwidth Efficiency in Large Model Training

Baidu's HPN network solves hash‑collision bottlenecks in large‑model training by combining TOR‑affinity scheduling with Dynamic Load Balancing on self‑developed switches, boosting physical network bandwidth efficiency to about 95%, improving throughput by roughly 10% and adding a further 1.5% training‑speed gain via the BCCL library.

Baidu Cloud · DLB Dynamic Load Balancing · HPN Network

0 likes · 12 min read

Baidu Intelligent Cloud Tech Hub

Jul 3, 2024 · Operations

How to Eliminate Network Hash Collisions in Large‑Model Training

This article examines the impact of GPU communication bottlenecks on large‑model training, analyzes hash‑collision issues in high‑performance networks, and presents three practical solutions (increasing the number of RDMA streams, affinity‑aware scheduling, and dynamic load balancing) that raise effective network bandwidth to as much as 95%.

Hash Collision · RDMA · dynamic load balancing

0 likes · 11 min read

Alibaba Cloud Big Data AI Platform

Feb 23, 2024 · Artificial Intelligence

How PAI‑TorchAcc Supercharges Large‑Model Training on Alibaba Cloud

PAI‑TorchAcc, an Alibaba Cloud AI platform accelerator, offers a seamless PyTorch interface that integrates HuggingFace models and employs LazyTensor‑based static graph conversion, multi‑strategy distributed training, and extensive GPU optimizations to dramatically boost throughput for 1B‑175B parameter models, surpassing PyTorch native and Megatron‑LM performance.

AI acceleration · Alibaba Cloud · GPU Optimization

0 likes · 13 min read

DataFunTalk

Jan 29, 2024 · Artificial Intelligence

PAI‑ChatLearn: A Flexible Large‑Scale RLHF Training Framework for Massive Models

The article introduces PAI‑ChatLearn, a flexible and high‑performance framework developed by Alibaba Cloud's PAI team that supports full‑pipeline RLHF training for large models, explains the evolution of parallel training strategies, details the framework’s architecture and configuration, and showcases performance results and practical usage examples.

AI Framework · PAI-ChatLearn · RLHF

0 likes · 17 min read

Architects' Tech Alliance

Sep 11, 2023 · Artificial Intelligence

Open Acceleration Specification AI Server Design Guide (2023): Architecture, OAM Modules, UBB Board, and System Design

The 2023 Open Acceleration Specification AI Server Design Guide details the hardware architecture, OAM module and UBB board specifications, cooling, management, fault diagnosis, and software platform needed to build high‑performance, scalable AI compute clusters for large‑model training.

AI acceleration · OAM · UBB board

0 likes · 10 min read

Baidu Intelligent Cloud Tech Hub

Jul 24, 2023 · Artificial Intelligence

How PaddlePaddle Powers Large‑Model Distributed Training: Techniques & Optimizations

This article explains the challenges of training massive AI models and details PaddlePaddle's 4D hybrid parallelism, MoE acceleration, long‑sequence strategies, end‑to‑end performance optimizations, and practical code examples for building and scaling large models efficiently.

AI · Distributed Training · PaddlePaddle

0 likes · 12 min read

Baidu Tech Salon

May 11, 2023 · Artificial Intelligence

Inside Baidu’s High‑Performance GPU Cluster: Powering the Next‑Gen AI Models

The article details Baidu's development of a massive high‑performance GPU/IB cluster, its architectural design, the challenges of training trillion‑parameter models, and how the integrated AI stack—spanning hardware, framework, and resource management—overcomes compute, memory, and communication bottlenecks to accelerate large‑model training.

AI Infrastructure · Baidu AI Base · Distributed Training

0 likes · 17 min read

Baidu Intelligent Cloud Tech Hub

May 9, 2023 · Artificial Intelligence

How Baidu’s High‑Performance GPU Cluster Powers the Next Generation of Large‑Scale AI Models

This article explains how Baidu built a massive, high‑performance GPU/IB cluster, optimized its architecture and software stack, and integrated AI frameworks and resource management to overcome compute, memory, and communication bottlenecks, enabling efficient training of trillion‑parameter large models.

AI Infrastructure · Distributed Training · GPU clusters

0 likes · 19 min read

Tencent Cloud Developer

Apr 14, 2023 · Artificial Intelligence

Tencent Cloud's Next-Generation HCC High-Performance Computing Cluster for Large Model Training

Tencent Cloud's new HCC high‑performance computing cluster triples the performance of the previous generation, with 3.2 TB/s server bandwidth, Xingsha servers, and NVIDIA H800 GPUs delivering up to 1979 TFlops. Its Xingmai 3.2 T ETH RDMA network, TB‑level storage via COS + GooseFS, and multi‑form access (bare metal, cloud servers, containers, functions) enable efficient large‑model training.

AI computing · GPU cluster · High‑performance computing

0 likes · 9 min read

Tencent Cloud Developer

Mar 22, 2023 · Artificial Intelligence

How AngelPTM Cuts Large Model Training Costs with ZeRO-Cache Optimizations

This article analyzes Tencent's AngelPTM framework, detailing its ZeRO-Cache strategy, unified storage management, multi‑stream async execution, SSD tiered storage, and performance benchmarks that show up to 95% larger model capacity and over 44% speedup compared to community solutions.

AI Infrastructure · GPU Acceleration · Memory Optimization

0 likes · 12 min read

Baidu Intelligent Cloud Tech Hub

Feb 23, 2023 · Artificial Intelligence

How Baidu’s Cloud Infrastructure Tackles the Challenges of Training Massive AI Models

This article explains how Baidu's intelligent cloud overcomes the compute and storage walls of large‑scale model training by combining hardware design, network topology, and software optimizations such as pipeline, tensor, and expert parallelism, cost‑model‑driven placement, and future‑proof AI infrastructure evolution.

AI Infrastructure · Baidu Cloud · Cost Model

0 likes · 28 min read
