Tagged articles

AI training

94 articles · Page 1 of 1

Jun 27, 2026 · Artificial Intelligence

Lilian Weng’s Deep Dive Overturns Three Years of Large‑Model Scaling Law Assumptions

In a ten‑thousand‑word analysis, former OpenAI safety VP Lilian Weng retraces the history of model scaling laws from Kaplan’s 2020 formulation, demonstrates how DeepMind’s Chinchilla overturns the original parameter‑to‑data ratio, uncovers two critical bugs in the Chinchilla paper, and warns that the impending 2026‑2028 data wall makes naïve scaling of parameters and compute unsustainable.

AI training · chinchilla · data wall

0 likes · 10 min read

Machine Learning Algorithms & Natural Language Processing

Jun 6, 2026 · Artificial Intelligence

How to Systematically Build More Realistic Mobile Agent Environments for Large‑Scale Training

PhoneWorld reconstructs mock Android apps from real‑world usage traces, creating scalable, resettable, and verifiable environments that let Mobile Agents train on realistic page structures, navigation paths, and state changes, and the paper shows substantial gains across four mobile benchmarks.

AI training · Mobile Agent · PhoneWorld

0 likes · 12 min read

Machine Heart

May 21, 2026 · Industry Insights

Zuckerberg’s 4 AM Layoff of 8,000 Staff: Using Employees’ Keyboard Data to Train AI

Meta’s third 2026 layoff round cut roughly 8,000 employees, with notices delivered at 4 am in Singapore. The article cites a secret program that tracks keyboard and mouse activity and captures screenshots to train AI, notes Evercore’s estimate of $3 billion in annual savings, and reports that the company is reallocating remaining staff to an AI‑focused division.

AI budget · AI training · Evercore

0 likes · 7 min read

Alibaba Cloud Big Data AI Platform

May 21, 2026 · Artificial Intelligence

FluxVLA Engine and Alibaba Cloud PAI Team Up to Accelerate Embodied Intelligence into the Physical World

LimX Dynamics partners with Alibaba Cloud PAI to migrate training workloads, achieving a 10% boost in training efficiency and a 17% drop in operational complexity, while open‑sourcing the FluxVLA Engine to lower the barrier for deploying embodied‑intelligence models at scale.

AI training · Alibaba Cloud PAI · Embodied Intelligence

0 likes · 5 min read

Weekly Large Model Application

May 5, 2026 · Artificial Intelligence

Why More GPUs and Data Aren’t Enough: Defining Scenarios and Data for Speech Model Training

The article argues that successful speech model training starts with understanding user scenarios, then selecting appropriate data, and finally choosing metrics, detailing six key questions, data sourcing strategies, evaluation criteria, and compliance considerations to avoid the misconception that sheer data volume guarantees performance.

AI training · data collection · model evaluation

0 likes · 6 min read

CodeTrend

Apr 24, 2026 · Artificial Intelligence

How Large Language Models Acquire Tool‑Calling Ability: SFT, RLHF & LoRA Explained

The article explains why pretrained LLMs cannot call tools, then breaks down the three‑stage training pipeline—Supervised Fine‑Tuning, Reinforcement Learning from Human Feedback, and knowledge distillation—showing how each step teaches models to read tool schemas, decide when to invoke a tool, generate JSON calls, and finally transfer the capability to smaller models with LoRA.

AI training · Function Calling · LLM

0 likes · 19 min read

Machine Heart

Apr 23, 2026 · Industry Insights

Meta Forces Employee Mouse‑and‑Keyboard Tracking to Train AI, Sparking Outrage

Meta is installing software on U.S. employees' computers to capture mouse movements, clicks, and keystrokes for AI model training, a move detailed in an internal memo that has provoked strong backlash, raised privacy concerns, and highlighted the company's broader push toward autonomous AI agents amid industry‑wide automation trends.

AI Agents · AI training · Meta

0 likes · 9 min read

Machine Heart

Apr 19, 2026 · Artificial Intelligence

How Google Turns Your CAPTCHA Clicks into Training Data for the Next Generation of AI

The article explains how YouTube’s AI‑video rating and Google’s reCAPTCHA system covertly collect billions of user interactions each day, converting them into labeled data that fuels Google’s computer‑vision models such as Veo, Maps and Waymo, effectively turning routine security checks into a massive, unpaid AI training workforce.

AI training · Google · Waymo

0 likes · 7 min read

Fun with Large Models

Apr 17, 2026 · Artificial Intelligence

Mastering Large Model Training: Practical Parameter Tuning from Beginner to Pro

This guide walks you through interpreting training logs and loss curves, diagnosing common issues such as oscillation, under‑fitting, and over‑fitting, and applying concrete adjustments to learning rate, LoRA settings, batch size, and epochs, with scenario‑specific strategies to turn a novice into a tuning expert.

AI training · LoRA · hyperparameters

0 likes · 23 min read

Machine Learning Algorithms & Natural Language Processing

Apr 4, 2026 · Artificial Intelligence

Why the Best SFT Checkpoint May Hurt RL Performance: Adaptive Early‑Stop Loss (AESL) for LLM Cold‑Start

The paper reveals that over‑optimizing supervised fine‑tuning (SFT) for large language models can diminish their reinforcement‑learning (RL) potential, proposes an Adaptive Early‑Stop Loss (AESL) that balances accuracy and output diversity during cold‑start, and demonstrates across multiple LLMs that AESL consistently yields superior RL results.

AI training · Adaptive Early‑Stop Loss · LLM

0 likes · 11 min read

21CTO

Mar 26, 2026 · Industry Insights

GitHub Will Harvest Your Copilot Data to Train AI – What Developers Need to Know

Starting April 24, GitHub will collect user interaction data—including code inputs, outputs, snippets, context, comments, repository structure, and feedback—to train its AI models, affecting Copilot Free, Pro, and Pro+ users while offering an opt‑out option via settings, and mirroring similar policies at Anthropic, JetBrains, and Microsoft.

AI training · Copilot · GitHub

0 likes · 4 min read

AI Engineering

Mar 16, 2026 · Artificial Intelligence

Does Synthetic Data Have a Future? Evidence‑Based Conclusions

A detailed investigation of two public programming‑training datasets shows that AI‑only synthetic data suffers from severe quality issues, and even AI‑plus‑expert review yields only about ten percent usable examples, proving that high‑quality training data still requires domain experts and rigorous quality‑control processes.

AI training · data labeling · expert review

0 likes · 16 min read

Alibaba Cloud Infrastructure

Feb 12, 2026 · Cloud Native

How to Seamlessly Move AI Data Between OSS and CPFS with Kubernetes VolumePopulator

This article explains how Kubernetes VolumePopulator can automatically transfer AI training data from low‑cost OSS storage to high‑performance CPFS volumes, enabling on‑demand model loading, cost‑effective hot‑cold data management, and fully automated lifecycle handling in cloud‑native AI workloads.

AI training · CPFS · OSS

0 likes · 9 min read

Wu Shixiong's Large Model Academy

Feb 3, 2026 · Artificial Intelligence

Why Loss Masking Is the Hidden Key to Effective LLM Fine‑Tuning

The article explains how loss masking in supervised fine‑tuning of large language models prevents the model from learning irrelevant tokens such as user inputs, system prompts, tool outputs, and padding, thereby focusing training on the assistant’s responses and improving performance and generalization.

AI training · LLM · Prompt Engineering

0 likes · 10 min read
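
The masking described in this summary can be illustrated with a minimal sketch. It assumes each token carries a role tag and that masked positions receive the label -100, which is the default `ignore_index` of PyTorch's `CrossEntropyLoss`. The function name `build_labels`, the role names, and the token ids are hypothetical and are not taken from the article.

```python
IGNORE_INDEX = -100  # default ignore_index of PyTorch's CrossEntropyLoss


def build_labels(token_ids, roles):
    """Return labels where only assistant tokens keep their ids.

    Tokens from any other role (system, user, tool, pad) are replaced
    with IGNORE_INDEX so a loss that skips that value ignores them.
    """
    labels = []
    for token_id, role in zip(token_ids, roles):
        if role == "assistant":
            labels.append(token_id)
        else:
            labels.append(IGNORE_INDEX)
    return labels


# Hypothetical tokenized conversation: one id and one role per token.
token_ids = [1, 11, 12, 21, 22, 31, 23, 0]
roles = ["system", "user", "user", "assistant", "assistant",
         "tool", "assistant", "pad"]

labels = build_labels(token_ids, roles)
print(labels)
```

Only the three assistant tokens keep their ids as labels, so only they would contribute to the loss.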

Data Party THU

Jan 22, 2026 · Artificial Intelligence

Unlocking Large Model Training: Pretraining, Fine‑Tuning, and Alignment Explained

This article breaks down the three core stages of large language model training—pretraining, supervised fine‑tuning, and alignment—detailing their objectives, typical data formats, scale requirements, and the latest techniques such as RLHF and DPO.

AI training · alignment · pretraining

0 likes · 11 min read

Baobao Algorithm Notes

Dec 25, 2025 · Artificial Intelligence

TeleChat3-105B: China’s First 100B‑Scale MoE Model and Its Technical Breakthroughs

The article analyzes TeleChat3-105B-A4.7-Thinking, the first domestically built 100‑billion‑parameter Mixture‑of‑Experts model, detailing its multi‑dimensional evaluation, three‑stage training pipeline, hardware‑level optimizations, fine‑grained architecture, and its significance for the evolving AI competition landscape.

AI training · Chinese AI · Mixture of Experts

0 likes · 6 min read

Alibaba Cloud Infrastructure

Dec 17, 2025 · Cloud Native

AI Training Revives Gang Scheduling in Kubernetes for Elastic Resource Orchestration

The article examines how the rise of large‑model AI training reintroduces the need for gang scheduling in Kubernetes, contrasting the rigid resource requirements of HPC‑style workloads with cloud‑native elasticity, and outlines the historical evolution, current implementations, and future directions for achieving more flexible, high‑throughput compute orchestration.

AI training · Cloud Native · Gang Scheduling

0 likes · 22 min read

ShiZhen AI

Dec 17, 2025 · Artificial Intelligence

Step-by-Step Guide: Train a LeRobot Robotic Arm from Scratch on GPUFree

This tutorial walks you through renting a GPUFree RTX 4090 cloud instance, uploading your LeRobot dataset, launching training via a lightweight Flask web UI, automatically shutting down the server, and downloading the trained model, all with detailed code snippets and practical tips.

AI training · Flask · GPUFree

0 likes · 11 min read

Amazon Cloud Developers

Nov 20, 2025 · Cloud Computing

Double Bandwidth, 1.5× Memory: Boost AI Workloads with EC2 P6‑B300

The newly available Amazon EC2 P6‑B300 instance, powered by NVIDIA Blackwell Ultra GPUs, offers up to 2× network bandwidth and 1.5× GPU memory compared with its predecessor, delivering 6.4 Tbps EFA throughput, 2.1 TB GPU memory, and optimized storage options for large‑scale AI training and deployment, especially for MoE and multimodal models.

AI training · AWS · EC2

0 likes · 5 min read

Alibaba Cloud Big Data AI Platform

Nov 17, 2025 · Artificial Intelligence

End-to-End Navigation Model Training with Isaac Sim, MobilityGen, and Cosmos Augmentation

This tutorial walks through a complete workflow for building a navigation model using Isaac Sim and MobilityGen to generate synthetic data, applying Cosmos‑Transfer1‑7B for visual data augmentation, training the X‑Mobility model via imitation learning, converting it for ROS2 deployment, and performing software‑in‑the‑loop validation.

AI training · Data Augmentation · Isaac Sim

0 likes · 19 min read

Alibaba Cloud Infrastructure

Nov 10, 2025 · Cloud Native

Koordinator v1.7.0 Brings Network‑Aware Scheduling and Job‑Level Preemption for AI Workloads

Koordinator v1.7.0, the open‑source Kubernetes scheduler, adds network‑topology‑aware scheduling, job‑level preemption, and support for Ascend NPU and Cambricon MLU, delivering unified heterogeneous device management, enhanced GPU sharing, comprehensive API documentation, and best‑practice guides to improve large‑scale AI training efficiency and cluster operations.

AI training · Heterogeneous Devices · Job Preemption

0 likes · 17 min read

Instant Consumer Technology Team

Nov 7, 2025 · Artificial Intelligence

How Game‑TARS Redefines Game AI with Human‑Native Interaction and Sparse Reasoning

Game‑TARS, a general‑purpose game AI from ByteDance's Seed team, replaces custom function calls with low‑level keyboard‑mouse actions, leverages massive multimodal data, sparse‑thinking and decaying‑loss algorithms, and achieves zero‑shot mastery across diverse games, surpassing top large models like GPT‑5 and Gemini‑2.5‑Pro.

AI training · Multimodal Data · game AI

0 likes · 10 min read

Open Source Linux

Nov 4, 2025 · Artificial Intelligence

Designing High‑Performance Networks for Large‑Scale AI Model Training

This article examines the challenges of building scalable, low‑latency, and cost‑effective network architectures—such as Clos/Fat‑Tree, Spine‑Leaf, Dragonfly, and Torus—for massive GPU clusters used in training trillion‑parameter AI models, comparing multi‑rail and single‑rail designs and highlighting real‑world implementations from Tencent and Alibaba.

AI training · CLOS · Dragonfly

0 likes · 8 min read

IT Services Circle

Nov 2, 2025 · Artificial Intelligence

Is Windows Gaming Copilot Secretly Training AI with Your Game Screenshots?

The article reveals that Microsoft's Gaming Copilot feature captures on‑screen text via OCR and uploads it to the cloud for AI model training, discusses privacy concerns, performance impacts on games like Battlefield 6, and provides steps to disable or uninstall the feature.

AI training · Gaming Copilot · Privacy

0 likes · 6 min read

IT Services Circle

Oct 20, 2025 · Artificial Intelligence

How NanoChat Lets Anyone Train a ChatGPT‑Like Model for $100

NanoChat, an open‑source full‑stack AI model solution created by Andrej Karpathy, enables users to train a functional chat model on a modest $100 cloud GPU rental, offering a low‑cost, hands‑on alternative to proprietary large‑language‑model services.

AI training · Large Language Model · cost-effective

0 likes · 4 min read

vivo Internet Technology

Oct 15, 2025 · Backend Development

Inside Vivo’s 2025 VDC: Traffic Management, Microservice Optimizations & AI GPU Platforms

The 2025 Vivo Developer Conference showcased cutting‑edge advances in traffic‑driven growth, microservice and Dubbo performance tuning, full‑link multi‑version environment automation, and GPU‑container AI training platforms, highlighting how these innovations boost efficiency, reliability, and cost‑effectiveness across Vivo’s internet services.

AI training · Dubbo · GPU containers

0 likes · 9 min read

Architects' Tech Alliance

Oct 12, 2025 · Artificial Intelligence

How InfiniBand Powers AI Training: Deep Dive into RDMA, RoCEv2, and High‑Speed Interconnects

This article explains how InfiniBand’s architecture, native RDMA, GPUDirect, and evolving bandwidth enable ultra‑low‑latency, high‑throughput communication for AI model training, compares it with Ethernet, and details the role of RoCEv2 and other high‑performance interconnect technologies.

AI training · GPU interconnect · InfiniBand

0 likes · 9 min read

DataFunSummit

Sep 23, 2025 · Artificial Intelligence

How PCache Supercharges Large‑Scale AI Training Storage Performance

This talk explores large‑scale AI training storage challenges and presents PCache, a high‑performance, cloud‑native caching system that optimizes metadata, read/write paths, deployment, and high‑availability, delivering significant throughput gains and cost savings for massive model training workloads.

AI training · Caching · PCache

0 likes · 25 min read

Architects' Tech Alliance

Sep 15, 2025 · Artificial Intelligence

Why NVLink Beats PCIe for AI Training: A Deep Dive into GPU Interconnects

This article examines the differences between Scale‑Out and Scale‑Up networking in AI compute clusters, comparing PCIe, Ethernet, InfiniBand, NVLink, UALink, and emerging standards like UB‑Mesh, and explains how each technology impacts bandwidth, latency, scalability, and cost for large‑scale model training.

AI training · GPU interconnect · NVLink

0 likes · 28 min read

DataFunTalk

Sep 3, 2025 · Artificial Intelligence

How Alluxio’s Distributed Cache Boosts AI Training to 99.57% GPU Utilization

Alluxio’s distributed caching dramatically accelerates AI training and checkpointing workloads, achieving up to 99.57% GPU utilization and linear scaling across clusters in the MLPerf Storage v2.0 benchmark, while using cost‑effective commodity hardware to eliminate I/O bottlenecks.

AI training · Alluxio · GPU Utilization

0 likes · 11 min read

IT Services Circle

Aug 31, 2025 · Artificial Intelligence

Meta’s Dirty Secret: Training AI with 2,396 Adult Films

Meta has been accused of illegally downloading 2,396 paid adult videos since 2018 to train its AI models, including Meta Movie Gen and LLaMA, prompting lawsuits that could cost up to $359 million, highlighting broader industry concerns over copyright infringement in AI training data.

AI training · Legal lawsuit · Meta

0 likes · 6 min read

Architects' Tech Alliance

Aug 18, 2025 · Artificial Intelligence

How Large Model Training Dominates Compute and What New Techniques Can Change It

This article explains why pre‑training large AI models consumes 90‑99% of total compute, describes the full training and inference pipelines, introduces resource‑saving strategies such as PD‑separation, and reviews market trends and infrastructure challenges shaping the next generation of AI systems.

AI Infrastructure · AI training · GPU architecture

0 likes · 13 min read

Architects' Tech Alliance

Jul 19, 2025 · Artificial Intelligence

Best GPU Cluster Network for Large‑Scale AI: NVLink, InfiniBand, RoCE & DDC

This article compares the main networking technologies used in large‑scale AI GPU clusters (NVLink, InfiniBand, RoCE Ethernet, and the emerging DDC fully scheduled fabric), examining latency, lossless transmission, congestion control, cost, power and scalability to help engineers choose the optimal solution for training massive language models.

AI training · DDC · Data Center

0 likes · 15 min read

Smart Era Software Development

May 30, 2025 · Artificial Intelligence

How Tencent’s TRMT Boosted DeepSeek’s Communication: A Chinese Open‑Source Success

Tencent’s Star‑Network team partnered with DeepSeek to open‑source the DeepEP communication library, then used its self‑developed TRMT stack to overcome RoCE limitations, achieving up to 100% speedup on RoCEv2 and 30% on InfiniBand, cutting training costs and inference latency for large MoE models.

AI training · DeepEP · DeepSeek

0 likes · 8 min read

Architect

May 26, 2025 · Artificial Intelligence

Parallelism Strategies for Large-Scale Model Training: Data, Tensor, Pipeline, Sequence, and Expert Parallelism

This article explains the memory limits of a single GPU and systematically introduces data parallelism, tensor parallelism, pipeline parallelism, sequence parallelism, and expert parallelism, describing their communication costs, advantages, drawbacks, and practical implementation details for training large AI models.

AI training · Expert Parallelism · data parallelism

0 likes · 14 min read

Architect

May 18, 2025 · Artificial Intelligence

How Much GPU Memory Can One Model Use? A Deep Dive into Transformer Memory Accounting

This article breaks down GPU memory consumption for large Transformer models, explains how to estimate each component—parameters, optimizer state, activations, gradients—and shows how parallelism, mixed precision, and recomputation strategies can dramatically reduce the footprint.

AI training · GPU memory · Memory optimization

0 likes · 14 min read
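
A rough version of the accounting this summary describes can be sketched as follows. The byte counts are assumptions based on the commonly cited mixed-precision Adam setup (fp16 weights and gradients plus fp32 master weights, momentum, and variance), not figures from the article. Activations are left out because they depend on batch size, sequence length, and recomputation. The function name `training_state_bytes` is hypothetical.

```python
def training_state_bytes(n_params, weight_bytes=2, grad_bytes=2,
                         optimizer_bytes=12):
    """Bytes for weights, gradients, and optimizer state.

    Defaults assume fp16 weights (2), fp16 gradients (2), and fp32
    master weights + Adam momentum + Adam variance (4 + 4 + 4 = 12).
    Activation memory is deliberately excluded.
    """
    return n_params * (weight_bytes + grad_bytes + optimizer_bytes)


n_params = 7_000_000_000  # hypothetical 7B-parameter model
total = training_state_bytes(n_params)
print(f"{total / 1e9:.0f} GB before activations")
```

Under these assumptions a 7B-parameter model needs 16 bytes per parameter, about 112 GB, before any activations are counted, which is why sharding the state across GPUs or recomputing activations matters.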

Baidu Geek Talk

May 14, 2025 · Industry Insights

How RapidFS Boosts AI Model Training with 10 TiB/s Throughput

The article explains how large‑scale AI model training and inference require massive data handling, describes the RapidFS storage acceleration cluster deployed on a 30,000‑card Kunlun chip system with hundreds of domestic CPU servers, and presents performance tests showing throughput scaling linearly to more than 1 TiB/s, demonstrating the impact of high‑performance storage on compute efficiency.

AI training · High-performance computing · RapidFS

0 likes · 5 min read

Baidu Intelligent Cloud Tech Hub

Apr 25, 2025 · Operations

How RapidFS Accelerates AI Model Training with 10 TiB/s Storage Performance

The article explains how RapidFS, a near‑compute storage acceleration solution built on BOS object storage, delivers up to 10 TiB/s throughput for massive AI model training, detailing its architecture, deployment on a 30,000‑card Kunlun cluster, and performance test results that show linear scaling from 20 to 70 nodes.

AI training · High-performance computing · RapidFS

0 likes · 6 min read

Architects' Tech Alliance

Apr 10, 2025 · Artificial Intelligence

Which NVIDIA GPU Is Right for Your AI Compute Center? A Deep Dive into A100, H100, A800, H800, and H20

This article analyzes NVIDIA's A100, H100, A800, H800, and H20 GPUs, compares their architectures, performance, and pricing, and provides a step‑by‑step guide for building a private AI compute center tailored to training, inference, and high‑performance computing workloads.

A100 · AI training · GPU

0 likes · 11 min read

Architects' Tech Alliance

Apr 6, 2025 · Fundamentals

PCIe vs NVLink: How Modern GPU Interconnects Power AI Training

As AI models grow to trillion‑parameter scales, training them demands massive GPU clusters whose performance is increasingly limited by network bandwidth; this article examines why traditional PCIe interconnects become bottlenecks and how NVIDIA's NVLink and NVSwitch technologies dramatically improve multi‑GPU communication and overall system efficiency.

AI training · GPU · High-performance computing

0 likes · 12 min read

Architects' Tech Alliance

Apr 3, 2025 · Artificial Intelligence

Which Nvidia GPU Wins the AI Race? A Deep Dive into A100, H100, A800, H800 & H20

This article examines the latest Nvidia GPU lineup—including A100, H100, A800, H800, and the upcoming H20—detailing their architectures, performance metrics for AI training and inference, cost considerations, and provides a step‑by‑step guide for building a high‑performance compute center.

AI training · Compute cluster · GPU performance

0 likes · 11 min read

AI Algorithm Path

Apr 2, 2025 · Artificial Intelligence

Master the Three Essential LLM Training Stages for 2025

The article breaks down the three core stages of large‑language‑model training—pre‑training, supervised fine‑tuning, and RLHF—explaining their purpose, methods, and concrete examples while noting DeepSeek‑R1’s recent breakthrough and its implications for AI development.

AI training · DeepSeek · LLM

0 likes · 5 min read

DataFunTalk

Mar 24, 2025 · Artificial Intelligence

DeepSeek R1: Open‑Source Reasoning Model and Multi‑Stage Training Insights

The interview explores DeepSeek R1's open‑source weights, its multi‑stage training pipeline—including pre‑training, supervised fine‑tuning, and RLHF—alongside innovations such as self‑consistency, chain‑of‑thought prompting, distillation, MoE architectures, and cost considerations, highlighting its impact on the future of large language models.

AI training · Chain-of-Thought · DeepSeek

0 likes · 20 min read

Baidu Geek Talk

Mar 17, 2025 · Industry Insights

From Manual Restarts to Automated Fault Tolerance: The Evolution of AI Training Stability

This article traces the decade‑long evolution of AI training stability—from early small‑model manual operations to large‑scale, multi‑thousand‑GPU clusters—detailing metrics like invalid training time, fault‑tolerance architectures, eBPF‑based hidden‑fault detection, BCCL enhancements, multi‑level restart strategies, and trigger‑based checkpointing that together shrink downtime from minutes to seconds.

AI training · distributed systems · eBPF

0 likes · 22 min read

Baobao Algorithm Notes

Mar 16, 2025 · Artificial Intelligence

Can a 7B LLM Master Sudoku From Scratch Using Reinforcement Learning?

This article details how a 7B‑parameter language model, fine‑tuned with DeepSeek's GRPO reinforcement‑learning algorithm and a carefully crafted multi‑component reward system, learned to solve Sudoku puzzles without any cold‑start data, outperforming a comparable 3B model and revealing key insights for structured reasoning tasks.

AI training · GRPO · Qwen

0 likes · 15 min read

Volcano Engine Developer Services

Mar 14, 2025 · Fundamentals

How 3FS, vePFS, and CloudFS Stack Up in AI Training Workloads – A Deep Dive

This article compares 3FS, vePFS, and CloudFS across metadata and data planes, presents detailed benchmark results for AI training scenarios, analyzes architectural trade‑offs, and draws insights for future cloud‑native file storage development.

AI training · Cloud Native · distributed file systems

0 likes · 38 min read

Architect

Mar 10, 2025 · Artificial Intelligence

What Makes DeepSeek’s New Architecture a Game‑Changer? Inside MLA, GRPO, and MoE Innovations

This article analyzes DeepSeek’s latest large‑model breakthroughs, covering the MLA attention compression, GRPO alignment algorithm, MoE load‑balancing redesign, multi‑stage training pipelines, reinforcement‑learning tricks, and performance comparisons with GPT‑4o‑Mini and Llama 3.1, highlighting both strengths and remaining challenges.

AI training · DeepSeek · GRPO

0 likes · 19 min read

Baidu Intelligent Cloud Tech Hub

Mar 10, 2025 · Artificial Intelligence

How Baidu Baige Achieves Near‑Zero Downtime in Massive AI Model Training

The article examines how Baidu Baige evolved AI training stability from manual operations to precise engineering, detailing metrics, fault‑perception techniques, eBPF‑based diagnostics, multi‑level restart strategies, and trigger‑based checkpointing that together achieve sub‑minute recovery and 99.5% effective training time on massive GPU clusters.

AI training · Large-Scale Clusters · checkpointing

0 likes · 25 min read

Volcano Engine Developer Services

Mar 7, 2025 · Operations

Inside 3FS: How DeepSeek’s Parallel File System Powers AI Training

This article dives deep into DeepSeek's 3FS parallel file system, detailing its four-component architecture, RDMA‑based high‑speed networking, client options, metadata and storage services, replication protocols, dynamic stripe sizing, and recovery mechanisms that enable efficient AI model training and inference.

AI training · Distributed File System · RDMA

0 likes · 21 min read

Architects' Tech Alliance

Feb 15, 2025 · Industry Insights

Choosing the Right NVIDIA GPU for AI: A100, H100, A800, H800 & H20 Explained

This article provides a detailed technical analysis of NVIDIA's A100, H100, A800, H800 and H20 GPUs, compares their architectures, performance and cost, and offers step‑by‑step guidance on building a private AI compute center, selecting hardware, software stacks and budgeting for different workloads.

AI training · GPU · Hardware Selection

0 likes · 11 min read

JavaEdge

Feb 8, 2025 · Artificial Intelligence

Why DeepSeek R1 Rivals ChatGPT o1: Architecture, Training, and Cost Insights

This article provides a detailed technical analysis of DeepSeek's R1 large language model, covering its background, architecture, training methods, hardware optimizations, performance claims, user impressions, deployment options, and the challenges of reproducing its results.

AI training · DeepSeek · GPU Cost

0 likes · 16 min read

AI Cyberspace

Feb 8, 2025 · Artificial Intelligence

Why 8‑GPU Servers Are Essential for LLM Training and Which Interconnect Wins

With modern large‑language‑model workloads demanding massive parallelism, 8‑GPU servers have become the norm; this article explains the roles of CPUs, compares GPU‑to‑GPU interconnect options—including PCIe direct, PCIe Switch, NVLink, and NVSwitch—detailing their architectures, bandwidths, topologies, and trade‑offs for AI training.

8-GPU server · AI training · GPU interconnect

0 likes · 14 min read

Baidu Intelligent Cloud Tech Hub

Nov 12, 2024 · Big Data

Why Data Lake Storage Acceleration Is the New Standard in Cloud‑Native AI

The article examines the evolution of data lake storage acceleration, compares various solutions, and explains how metadata, read/write, and end‑to‑end optimizations enable scalable, cost‑effective AI and big‑data workloads in cloud‑native environments.

AI training · Big Data · Data Lake

0 likes · 24 min read

DaTaobao Tech

Aug 21, 2024 · Artificial Intelligence

Mastering Custom Large‑Model Training: Data Strategies, LoRA Tricks, and Resource Planning

This article provides a comprehensive, step‑by‑step guide to training customized large language models, covering industry‑specific needs, data privacy, meticulous data cleaning, optimal data‑ratio balancing, token budgeting, GPU memory accounting, LoRA fine‑tuning techniques, and practical evaluation metrics for robust AI deployment.

AI training · Data preprocessing · GPU memory

0 likes · 23 min read

Architects' Tech Alliance

Aug 18, 2024 · Artificial Intelligence

RDMA, InfiniBand, RoCE, and iWARP: High‑Performance Networking for Large‑Scale Generative AI Model Training

The article explains how RDMA technologies—including InfiniBand, RoCE, and iWARP—provide high‑throughput, low‑latency, CPU‑free data transfer for massive generative AI model training, compares their architectures, and discusses modern network designs and load‑balancing strategies to optimize AI‑focused data‑center networks.

AI training · High‑Performance Computing · InfiniBand

0 likes · 11 min read

Baobao Algorithm Notes

Aug 13, 2024 · Industry Insights

What Meta’s RDMA‑over‑Ethernet Paper Reveals About Scaling AI Training Networks

This article provides a detailed technical analysis of Meta's SIGCOMM paper on RDMA over Ethernet for large‑scale AI training, examining the physical network deployment, congestion‑control mechanisms, topology choices, routing strategies, hardware design, and the practical challenges that remain.

AI training · Congestion Control · Network Architecture

0 likes · 23 min read

DataFunSummit

Jul 23, 2024 · Big Data

Multi-Cloud Unified Data Acceleration Layer at Xiaohongshu: Challenges, Alluxio Solution, and Performance Gains

This article presents Xiaohongshu's multi‑cloud unified data acceleration layer built with Alluxio, detailing the challenges of multi‑cloud architectures, the design goals, Alluxio's architecture and features, real‑world case studies in AI training and recommendation indexing, performance improvements, and future plans.

AI training · Alluxio · Big Data

0 likes · 22 min read

Baobao Algorithm Notes

Jul 10, 2024 · Artificial Intelligence

How to Effectively Continue Pretrain Large Language Models: Scaling Laws, Data Ratios, and Practical Tips

This article explains the motivations behind domain‑specific continue pretraining for large language models, outlines a three‑step workflow—including vocabulary expansion, data replay, ratio control, and scaling‑law calculations—provides concrete hyper‑parameter recommendations, and discusses challenges across different domain types and future research directions.

AI training

0 likes · 12 min read

Architects' Tech Alliance

Jul 7, 2024 · Operations

Overview of Popular GPU/TPU Cluster Networking Technologies: NVLink, InfiniBand, RoCE, and DDC

This article reviews the main GPU/TPU cluster networking solutions—including NVLink, InfiniBand, RoCE Ethernet, and DDC full‑schedule fabrics—examining their latency, loss‑free transmission, congestion control, cost, scalability, and suitability for large‑scale LLM training workloads.

AI training · DDC · GPU networking

0 likes · 16 min read

DataFunSummit

Jun 20, 2024 · Big Data

Data+AI Data Lake Technologies: Apache Iceberg, PyIceberg, and Vector Table Solutions

This article presents a comprehensive overview of modern Data+AI data lake challenges and solutions, covering the evolution of data lakes, an introduction to Apache Iceberg, practical use of PyIceberg for AI training and inference pipelines, and advanced vector table and indexing techniques for efficient similarity search.

AI training · Apache Iceberg · Big Data

0 likes · 22 min read

DataFunTalk

Jun 14, 2024 · Artificial Intelligence

Midjourney’s Diverse Data Sources: Public Datasets, Academic Research, Partner and Proprietary Data

Midjourney enhances its AI models by integrating a wide range of data sources—including public datasets like ImageNet and COCO, academic research from top conferences, partner collaborations, and its own proprietary data—while continuously updating and managing these datasets for quality, privacy, and security.

AI training · Bright Data · COCO

0 likes · 9 min read

Baidu Intelligent Cloud Tech Hub

May 31, 2024 · Artificial Intelligence

How Multi‑Chip Heterogeneous Clusters Power Next‑Gen Large Model Training

Using a martial‑arts analogy, the article explains why training massive AI models now requires thousands of GPUs or mixed‑chip clusters, outlines three key steps—inter‑connect, distributed parallel strategies, and accelerator acceleration—and shows how Baidu’s Baige platform achieves near‑full efficiency across GPU, Kunlun and Ascend chips.

AI training · GPU interconnect · accelerator optimization

0 likes · 11 min read

Architects' Tech Alliance

May 19, 2024 · Industry Insights

How to Build a 10,000‑GPU Supercluster: Core Design Principles and Architecture

This article analyzes the challenges and solutions for constructing a super‑large GPU training cluster, outlining five fundamental design principles, a four‑layer plus one‑domain architecture, and practical considerations for hardware, networking, and operational reliability in AI workloads.

AI training · GPU Cluster · High-performance computing

0 likes · 8 min read

IT Services Circle

May 13, 2024 · Information Security

The Hidden Costs and Ineffectiveness of CAPTCHAs

CAPTCHAs, originally designed as human‑based computation tools to block bots, have become costly, discriminatory, and largely ineffective security measures that waste billions of dollars annually while providing profit to service providers, prompting a 2024 debate on their continued use.

AI training · Human Computation · accessibility

0 likes · 8 min read

Architects' Tech Alliance

May 5, 2024 · Artificial Intelligence

Why InfiniBand Is the Secret Weapon for AIGC Training Performance

The article examines how InfiniBand’s specialized features—collective communication, in‑network computing, adaptive routing, congestion control, cut‑through forwarding, shallow buffering, and self‑healing—are optimized for large‑scale AI‑generated content (AIGC) training, delivering higher bandwidth, lower latency, and greater fault tolerance than Ethernet alternatives.

AI training · AIGC · Adaptive routing

0 likes · 10 min read

360 Smart Cloud

Apr 25, 2024 · Cloud Native

Building High‑Performance RoCE v2 and InfiniBand Networks in a Cloud‑Native Environment for Large‑Model Training

This article explains how to construct high‑performance RoCE v2 and InfiniBand networks within a cloud‑native Kubernetes environment, detailing the underlying technologies, required components, configuration steps, and performance test results that demonstrate significant communication speed improvements for large‑scale AI model training.

AI training · Cloud Native · InfiniBand

0 likes · 12 min read

Architects' Tech Alliance

Apr 15, 2024 · Artificial Intelligence

Decoding GPU Server Topologies: From PCIe to NVLink for Large‑Model Training

This article provides a detailed technical overview of modern multi‑GPU server architectures—including PCIe switches, NVLink, NVSwitch, and HBM—explaining their hardware topologies, bandwidth characteristics, monitoring methods, and network choices to help engineers design efficient AI training clusters.

AI training · GPU · HBM

0 likes · 18 min read

Architects' Tech Alliance

Apr 10, 2024 · Industry Insights

Inside the GPU Server: Architecture of A100/A800 and H100/H800 Nodes

This article provides a detailed technical breakdown of modern multi‑GPU server nodes, covering component composition, storage network cards, NVSwitch interconnects, bandwidth calculations, and the architectural differences between NVIDIA A100/A800 and H100/H800 configurations for AI training workloads.

A100 · AI training · GPU

0 likes · 12 min read

Architects' Tech Alliance

Mar 27, 2024 · Industry Insights

Why AI Large‑Model Training Needs Ultra‑High‑Bandwidth, Low‑Latency Networks

The rapid growth of AI model sizes has created unprecedented demands on network bandwidth, latency, stability, and automation, making efficient RDMA‑based interconnects, advanced congestion control, and intelligent deployment essential for scaling distributed training clusters to thousands of GPUs.

AI Infrastructure · AI training · Distributed Computing

0 likes · 11 min read

Model Perspective

Mar 16, 2024 · Artificial Intelligence

What Watching a TV Drama Reveals About AI Model Training and Learning Strategies

The article draws parallels between expert viewers dissecting the drama "The Legend of Zhen Huan," efficient paper‑reading techniques, and the active‑prediction plus contrast‑learning approach that underpins modern AI model training, highlighting how proactive thinking boosts both personal and machine learning outcomes.

AI training · Active Learning · Strategic Thinking

0 likes · 8 min read

Ops Development & AI Practice

Mar 13, 2024 · Artificial Intelligence

How Vector Retrieval Powers AI Model Training and Real-World Applications

Vector retrieval, based on converting data into high‑dimensional vectors and measuring similarity, enables fast, accurate search across massive datasets, supporting AI tasks such as search engines, recommendation, NLP, and computer vision, and plays a crucial role in large‑model training for data selection, anomaly detection, and model optimization.

AI training · Information Retrieval · Recommendation Systems

0 likes · 6 min read

Architects' Tech Alliance

Feb 29, 2024 · Industry Insights

Choosing the Right GPU Cluster Network: NVLink, InfiniBand, RoCE & DDC Explained

This article examines the key GPU/TPU cluster networking options—NVLink, InfiniBand, RoCE Ethernet, and emerging DDC full‑scheduling fabrics—detailing their latency, loss‑less transmission, congestion control, cost, power, and scalability considerations for large‑scale AI training deployments.

AI training · DDC fabric · GPU networking

0 likes · 18 min read

Alibaba Cloud Native

Feb 21, 2024 · Cloud Native

How Fluid & JindoCache Accelerate Large‑Scale AI Training in a Cloud‑Native Environment

This article examines the challenges of data‑intensive AI training on heterogeneous cloud‑native infrastructure and explains how the Fluid framework combined with JindoCache and KubeDL provides distributed caching, metadata acceleration, and seamless POSIX access to dramatically improve I/O performance, GPU utilization, and cost efficiency.

AI training · Data Caching · Fluid

0 likes · 18 min read

Architects' Tech Alliance

Sep 9, 2023 · Industry Insights

Can NSLB Double AI Training Speed? Inside the 113% Performance Gain Over ECMP

The article analyzes AI‑training traffic patterns, critiques existing flow‑based, flowlet‑based, and packet‑based ECMP load‑balancing, introduces the NSLB solution tailored for AI clusters, and presents experimental results showing up to 113% speed improvement and sub‑millisecond failover with DPFF, while also discussing direct‑topology and intelligent lossless networking techniques.

AI training · DPFF · Data Center

0 likes · 11 min read

Alibaba Cloud Native

Aug 3, 2023 · Cloud Native

How Koordinator + KubeDL Revolutionize AI Model Training on Kubernetes

This article explains how the open‑source Koordinator scheduler, combined with KubeDL, tackles the resource‑intensive demands of large‑scale AI and LLM training on Kubernetes by introducing heterogeneous resource management, elastic quota, coscheduling, and fine‑grained GPU & RDMA allocation.

AI training · GPU · Koordinator

0 likes · 17 min read

Alibaba Cloud Big Data AI Platform

Jun 19, 2023 · Cloud Computing

Predictable Network: Alibaba Cloud’s Ethernet Edge for Faster AI Training

This article examines the challenges of scaling AI model training beyond single-chip limits, introduces Alibaba Cloud’s Predictable Network architecture—including high‑performance Ethernet, dual‑uplink, and adaptive routing—and compares its performance, scalability, and reliability against InfiniBand, showing how Ethernet can meet AI workloads with minimal loss.

AI training · Ethernet vs InfiniBand · Predictable Network

0 likes · 27 min read

Alibaba Cloud Infrastructure

Jun 16, 2023 · Cloud Computing

Predictable Network and High‑Performance Network Architecture for Large‑Scale AI Training

The article examines how Alibaba Cloud’s Predictable Network, InfiniBand versus Ethernet trade‑offs, and the HPN high‑performance network design together address the extreme bandwidth, latency, scalability and reliability requirements of modern large‑model AI training workloads in cloud data centers.

AI training · Cloud Computing · Ethernet

0 likes · 24 min read

DataFunTalk

May 25, 2023 · Artificial Intelligence

Optimizing Distributed Cache for Large-Scale Deep Learning Training with Alluxio and SiloD

This article examines the storage bottlenecks in large‑scale AI training, evaluates local‑disk and Alluxio‑based distributed caching strategies, proposes uniform cache eviction and replica‑aware global policies, and introduces the SiloD framework for coordinated compute‑storage scheduling to dramatically improve GPU utilization and overall cluster throughput.

AI training · Alluxio · Cache Eviction

0 likes · 16 min read

Programmer DD

Apr 21, 2023 · Artificial Intelligence

Is Microsoft Illegally Using Twitter Data to Train AI? Elon Musk Threatens Lawsuit

Elon Musk announced plans to sue Microsoft for allegedly using Twitter data without permission to train AI models, sparking a heated debate over data rights, API pricing changes, and the broader competition between major tech platforms in the AI landscape.

AI training · Elon Musk · Microsoft

0 likes · 9 min read

Tencent Cloud Developer

Mar 22, 2023 · Artificial Intelligence

Tencent Star Network: High‑Performance GPU Cluster Architecture for Large‑Scale AI Model Training

Tencent’s Star Network combines a 1.6 Tbps Ethernet‑RDMA fabric, a fat‑tree topology supporting up to 4K GPUs, multi‑track traffic aggregation, adaptive heterogeneous links, and a custom TCCL library. It cuts AllReduce overhead from 35% to 3.7% and speeds AI training iterations by 32%, while automating deployment and providing sub‑second self‑healing.

AI training · Distributed Computing · GPU clusters

0 likes · 19 min read

Baidu Geek Talk

Dec 27, 2022 · Artificial Intelligence

How to Supercharge AI Model Training: Bottlenecks and Cutting‑Edge Acceleration Techniques

This article systematically examines the major performance bottlenecks in AI model training, explains the underlying hardware and software causes, and presents a comprehensive set of acceleration strategies—including data‑loading optimizations, compute‑side enhancements, communication tricks, and the AIAK‑Training suite—backed by real‑world case studies and quantitative results.

AI training · AIAK-Training · GPU Acceleration

0 likes · 33 min read

Baidu Intelligent Cloud Tech Hub

Dec 22, 2022 · Artificial Intelligence

How to Supercharge AI Model Training: Bottlenecks and Acceleration Techniques

This article systematically analyzes the main performance bottlenecks in AI model training, explains why acceleration is essential, and presents current hardware‑ and software‑based solutions—including data‑loading optimizations, operator fusion, mixed‑precision and Tensor Core usage, as well as distributed communication strategies—followed by real‑world case studies of Baidu's AIAK‑Training suite that demonstrate significant speed‑ups.

AI training · GPU Acceleration · Performance Optimization

0 likes · 31 min read

JavaScript

Nov 7, 2022 · Artificial Intelligence

Can GitHub’s Copilot Legally Use Open‑Source Code? A New Lawsuit Unveils the Debate

Veteran programmers have filed a class-action lawsuit alleging that GitHub’s Copilot AI violates copyright by training on their contributed code without proper attribution, licensing notices, or permission; the complaint alleges repeated breaches of Section 1202 of the DMCA and infringement of creators’ rights.

AI training · Copilot · GitHub

0 likes · 1 min read

Baidu Intelligent Cloud Tech Hub

Oct 19, 2022 · Artificial Intelligence

Why Storage Systems Bottleneck AI Training and How to Accelerate Them

This article examines the comprehensive challenges AI applications face from storage to compute, traces the evolution of AI training infrastructure, analyzes key bottlenecks such as compute acceleration, resource scheduling, massive data handling and data flow, and presents Baidu Cloud's storage acceleration solutions—including parallel file systems, caching, and the Fluid scheduler—to dramatically improve AI training performance.

AI training · Cloud Native · Data Lake

0 likes · 38 min read

AntTech

Oct 9, 2022 · Cloud Computing

Sky Computing: A Multi‑Cloud Computing Platform for Transparent Resource Utilization

Sky Computing, introduced by Ant Technology Research Institute, proposes a cloud‑agnostic platform that abstracts heterogeneous public and private clouds into a unified service layer, enabling applications to seamlessly migrate workloads across clouds, reduce costs, avoid vendor lock‑in, and support AI training via the SkyML prototype.

AI training · Cloud Computing · Multi-Cloud

0 likes · 54 min read

Baidu Geek Talk

Jul 26, 2022 · Industry Insights

How Baidu’s Canghai Storage Powers High‑Performance Computing: Challenges and Solutions

This article analyzes the storage challenges of high‑performance computing—including traditional HPC, AI‑driven HPC, and high‑performance data analysis—examines Baidu’s internal practices, and presents the Canghai storage platform with its object storage, parallel file system (PFS) and RapidFS solutions that address throughput, latency, and scalability requirements.

AI training · High-performance computing · cloud storage

0 likes · 31 min read

Baidu Intelligent Cloud Tech Hub

Jul 21, 2022 · Cloud Computing

How Baidu’s Cloud Storage Powers High‑Performance Computing and AI Workloads

This article explains the storage challenges of high‑performance computing—including traditional HPC, AI‑driven HPC, and HPDA—then details Baidu’s unified storage platform, object storage BOS, and runtime solutions PFS and RapidFS, illustrating their architecture, features, and a real‑world autonomous‑driving customer case.

AI training · Data Lake · cloud storage

0 likes · 29 min read

Architects' Tech Alliance

Aug 25, 2021 · Industry Insights

Can Storage Class Memory Transform Data Centers? A Deep Dive into SCM Benefits and Challenges

This article examines the emerging Storage Class Memory (SCM) market, outlines its various technologies, evaluates performance and cost trade‑offs, explores three concrete use cases—AI training acceleration, instant data recovery, and greener data‑center operation—and discusses the latency and workload‑model challenges that must be solved for widespread adoption.

AI training · Data Center · Memory Technology

0 likes · 16 min read

Tencent Architect

Feb 23, 2021 · Artificial Intelligence

Analysis and Optimization of CephFS I/O Performance for AI Training on the Xingchen Compute Platform

This article investigates why AI training tasks on Tencent's Xingchen compute platform experience severe I/O slowdown when using CephFS, analyzes the underlying Ceph‑FUSE and MDS mechanisms, and proposes metadata‑caching and file‑caching optimizations that can accelerate training speed by three to four times.

AI training · Ceph-FUSE · CephFS

0 likes · 21 min read

Alibaba Cloud Native

May 12, 2020 · Artificial Intelligence

Boosting Cloud‑Native AI Training with Alluxio: Performance Tuning on Kubernetes

This article examines the challenges of large‑scale deep‑learning model training on Kubernetes, analyzes performance bottlenecks caused by Alluxio‑FUSE integration, and presents a series of configuration and system‑level optimizations that dramatically improve data‑access speed and overall training throughput.

AI training · Alluxio · Cloud Native

0 likes · 22 min read

UCloud Tech

Mar 24, 2020 · Artificial Intelligence

Why Does PyTorch Struggle with UFS Storage? Insights and Optimizations

A detailed case study reveals why PyTorch training on UFS file storage suffers severe I/O bottlenecks, compares it with local SSD and SSHFS, and presents practical optimizations such as using cv2.imdecode, caching DataLoader handles, and converting small‑file datasets into large UFS files to close the performance gap.

AI training · Optimization · PyTorch

0 likes · 14 min read

Architects' Tech Alliance

Dec 24, 2019 · Fundamentals

Design Considerations and Benefits of Storage Class Memory (SCM) for Data‑Intensive Applications

The article examines the emerging Storage Class Memory (SCM) market, outlines its various technologies, and discusses performance and cost trade‑offs. It highlights how SCM can accelerate AI training, enable fast data recovery, and reduce data‑center power consumption, and it presents the remaining challenges of latency and system integration.

AI training · SCM · Storage Class Memory

0 likes · 15 min read

UCloud Tech

May 8, 2019 · Artificial Intelligence

How UAI-Train Accelerated Face Recognition Model Training by 85% for a FinTech Leader

The UAI-Train distributed GPU platform cut a 7‑million‑image face‑recognition training cycle from a week to a day, slashed GPU costs by up to 90%, and boosted algorithm optimization efficiency by 85.7% for the fintech company Paipaidai.

AI training · Insightface · MXNet

0 likes · 7 min read

iQIYI Technical Product Team

Jan 4, 2019 · Artificial Intelligence

Building a Deep Learning Training Platform on Cloud: Challenges, Runonce Service, and Storage Optimization

iQIYI built a cloud‑based deep‑learning training platform called Jarvis, replacing the initial Runonce service, by containerizing GPU tasks, adopting Ceph S3 storage with FUSE, optimizing data pipelines, and addressing compute, storage, and networking challenges to improve scalability and reduce GPU idle time.

AI training · GPU computing · Storage Optimization

0 likes · 9 min read
