Tagged articles
88 articles
Weekly Large Model Application
May 5, 2026 · Artificial Intelligence

Why More GPUs and Data Aren’t Enough: Defining Scenarios and Data for Speech Model Training

The article argues that successful speech model training starts with understanding user scenarios, then selecting appropriate data, and finally choosing metrics, detailing six key questions, data sourcing strategies, evaluation criteria, and compliance considerations to avoid the misconception that sheer data volume guarantees performance.

AI training · Model Evaluation · data collection
0 likes · 6 min read
CodeTrend
Apr 24, 2026 · Artificial Intelligence

How Large Language Models Acquire Tool‑Calling Ability: SFT, RLHF & LoRA Explained

The article explains why pretrained LLMs cannot call tools, then breaks down the three‑stage training pipeline—Supervised Fine‑Tuning, Reinforcement Learning from Human Feedback, and knowledge distillation—showing how each step teaches models to read tool schemas, decide when to invoke a tool, generate JSON calls, and finally transfer the capability to smaller models with LoRA.

AI training · Function Calling · LLM
0 likes · 19 min read
Machine Heart
Apr 23, 2026 · Industry Insights

Meta Forces Employee Mouse‑and‑Keyboard Tracking to Train AI, Sparking Outrage

Meta is installing software on U.S. employees' computers to capture mouse movements, clicks, and keystrokes for AI model training, a move detailed in an internal memo that has provoked strong backlash, raised privacy concerns, and highlighted the company's broader push toward autonomous AI agents amid industry‑wide automation trends.

AI Agents · AI training · Meta
0 likes · 9 min read
Machine Heart
Apr 19, 2026 · Artificial Intelligence

How Google Turns Your CAPTCHA Clicks into Training Data for the Next Generation of AI

The article explains how YouTube’s AI‑video rating and Google’s reCAPTCHA system covertly collect billions of user interactions each day, converting them into labeled data that fuels Google’s computer‑vision models such as Veo, Maps and Waymo, effectively turning routine security checks into a massive, unpaid AI training workforce.

AI training · Computer Vision · Google
0 likes · 7 min read
Fun with Large Models
Apr 17, 2026 · Artificial Intelligence

Mastering Large Model Training: Practical Parameter Tuning from Beginner to Pro

This guide walks you through interpreting training logs and loss curves, diagnosing common issues such as oscillation, under‑fitting, and over‑fitting, and applying concrete adjustments to learning rate, LoRA settings, batch size, and epochs, with scenario‑specific strategies to turn a novice into a tuning expert.

AI training · Large Model · LoRA
0 likes · 23 min read
Machine Learning Algorithms & Natural Language Processing
Apr 4, 2026 · Artificial Intelligence

Why the Best SFT Checkpoint May Hurt RL Performance: Adaptive Early‑Stop Loss (AESL) for LLM Cold‑Start

The paper reveals that over‑optimizing supervised fine‑tuning (SFT) for large language models can diminish their reinforcement‑learning (RL) potential, proposes an Adaptive Early‑Stop Loss (AESL) that balances accuracy and output diversity during cold‑start, and demonstrates across multiple LLMs that AESL consistently yields superior RL results.

AI training · Adaptive Early‑Stop Loss · LLM
0 likes · 11 min read
21CTO
Mar 26, 2026 · Industry Insights

GitHub Will Harvest Your Copilot Data to Train AI – What Developers Need to Know

Starting April 24, GitHub will collect user interaction data—including code inputs, outputs, snippets, context, comments, repository structure, and feedback—to train its AI models, affecting Copilot Free, Pro, and Pro+ users while offering an opt‑out option via settings, and mirroring similar policies at Anthropic, JetBrains, and Microsoft.

AI training · Copilot · GitHub
0 likes · 4 min read
AI Engineering
Mar 16, 2026 · Artificial Intelligence

Does Synthetic Data Have a Future? Evidence‑Based Conclusions

A detailed investigation of two public programming‑training datasets shows that AI‑only synthetic data suffers from severe quality issues, and even AI‑plus‑expert review yields only about ten percent usable examples, proving that high‑quality training data still requires domain experts and rigorous quality‑control processes.

AI training · Model Evaluation · data labeling
0 likes · 16 min read
Alibaba Cloud Infrastructure
Feb 12, 2026 · Cloud Native

How to Seamlessly Move AI Data Between OSS and CPFS with Kubernetes VolumePopulator

This article explains how Kubernetes VolumePopulator can automatically transfer AI training data from low‑cost OSS storage to high‑performance CPFS volumes, enabling on‑demand model loading, cost‑effective hot‑cold data management, and fully automated lifecycle handling in cloud‑native AI workloads.

AI training · CPFS · Cloud Native Storage
0 likes · 9 min read
Wu Shixiong's Large Model Academy
Feb 3, 2026 · Artificial Intelligence

Why Loss Masking Is the Hidden Key to Effective LLM Fine‑Tuning

The article explains how loss masking in supervised fine‑tuning of large language models prevents the model from learning irrelevant tokens such as user inputs, system prompts, tool outputs, and padding, thereby focusing training on the assistant’s responses and improving performance and generalization.

AI training · Fine-tuning · LLM
0 likes · 10 min read
Baobao Algorithm Notes
Dec 25, 2025 · Artificial Intelligence

TeleChat3-105B: China’s First 100B‑Scale MoE Model and Its Technical Breakthroughs

The article analyzes TeleChat3-105B-A4.7-Thinking, the first domestically built 100‑billion‑parameter Mixture‑of‑Experts model, detailing its multi‑dimensional evaluation, three‑stage training pipeline, hardware‑level optimizations, fine‑grained architecture, and its significance for the evolving AI competition landscape.

AI training · Chinese AI · Mixture of Experts
0 likes · 6 min read
Alibaba Cloud Infrastructure
Dec 17, 2025 · Cloud Native

AI Training Revives Gang Scheduling in Kubernetes for Elastic Resource Orchestration

The article examines how the rise of large‑model AI training reintroduces the need for gang scheduling in Kubernetes, contrasting the rigid resource requirements of HPC‑style workloads with cloud‑native elasticity, and outlines the historical evolution, current implementations, and future directions for achieving more flexible, high‑throughput compute orchestration.

AI training · Cloud Native · Gang Scheduling
0 likes · 22 min read
ShiZhen AI
Dec 17, 2025 · Artificial Intelligence

Step-by-Step Guide: Train a Lerobot Robotic Arm from Scratch on GPUFree

This tutorial walks you through renting a GPUFree RTX 4090 cloud instance, uploading your Lerobot dataset, launching training via a lightweight Flask web UI, automatically shutting down the server, and downloading the trained model, all with detailed code snippets and practical tips.

AI training · Flask · GPUFree
0 likes · 11 min read
Alibaba Cloud Big Data AI Platform
Nov 17, 2025 · Artificial Intelligence

End-to-End Navigation Model Training with Isaac Sim, MobilityGen, and Cosmos Augmentation

This tutorial walks through a complete workflow for building a navigation model using Isaac Sim and MobilityGen to generate synthetic data, applying Cosmos‑Transfer1‑7B for visual data augmentation, training the X‑Mobility model via imitation learning, converting it for ROS2 deployment, and performing software‑in‑the‑loop validation.

AI training · Isaac Sim · ROS2
0 likes · 19 min read
Alibaba Cloud Infrastructure
Nov 10, 2025 · Cloud Native

Koordinator v1.7.0 Brings Network‑Aware Scheduling and Job‑Level Preemption for AI Workloads

Koordinator v1.7.0, the open‑source Kubernetes scheduler, adds network‑topology‑aware scheduling, job‑level preemption, and support for Ascend NPU and Cambricon MLU, delivering unified heterogeneous device management, enhanced GPU sharing, comprehensive API documentation, and best‑practice guides to improve large‑scale AI training efficiency and cluster operations.

AI training · Heterogeneous Devices · Job Preemption
0 likes · 17 min read
Instant Consumer Technology Team
Nov 7, 2025 · Artificial Intelligence

How Game‑TARS Redefines Game AI with Human‑Native Interaction and Sparse Reasoning

Game‑TARS, a general‑purpose game AI from ByteDance's Seed team, replaces custom function calls with low‑level keyboard‑mouse actions, leverages massive multimodal data, sparse‑thinking and decaying‑loss algorithms, and achieves zero‑shot mastery across diverse games, surpassing top large models like GPT‑5 and Gemini‑2.5‑Pro.

AI training · Multimodal Data · game AI
0 likes · 10 min read
Open Source Linux
Nov 4, 2025 · Artificial Intelligence

Designing High‑Performance Networks for Large‑Scale AI Model Training

This article examines the challenges of building scalable, low‑latency, and cost‑effective network architectures—such as Clos/Fat‑Tree, Spine‑Leaf, Dragonfly, and Torus—for massive GPU clusters used in training trillion‑parameter AI models, comparing multi‑rail and single‑rail designs and highlighting real‑world implementations from Tencent and Alibaba.

AI training · CLOS · Dragonfly
0 likes · 8 min read
IT Services Circle
Nov 2, 2025 · Artificial Intelligence

Is Windows Gaming Copilot Secretly Training AI with Your Game Screenshots?

The article reveals that Microsoft's Gaming Copilot feature captures on‑screen text via OCR and uploads it to the cloud for AI model training, discusses privacy concerns, performance impacts on games like Battlefield 6, and provides steps to disable or uninstall the feature.

AI training · Gaming Copilot · Windows
0 likes · 6 min read
IT Services Circle
Oct 20, 2025 · Artificial Intelligence

How NanoChat Lets Anyone Train a ChatGPT‑Like Model for $100

NanoChat, an open‑source full‑stack AI model solution created by Andrej Karpathy, enables users to train a functional chat model on a modest $100 cloud GPU rental, offering a low‑cost, hands‑on alternative to proprietary large‑language‑model services.

AI training · cost-effective · large language model
0 likes · 4 min read
vivo Internet Technology
Oct 15, 2025 · Backend Development

Inside Vivo’s 2025 VDC: Traffic Management, Microservice Optimizations & AI GPU Platforms

The 2025 Vivo Developer Conference showcased cutting‑edge advances in traffic‑driven growth, microservice and Dubbo performance tuning, full‑link multi‑version environment automation, and GPU‑container AI training platforms, highlighting how these innovations boost efficiency, reliability, and cost‑effectiveness across Vivo’s internet services.

AI training · DevOps · Dubbo
0 likes · 9 min read
Architects' Tech Alliance
Oct 12, 2025 · Artificial Intelligence

How InfiniBand Powers AI Training: Deep Dive into RDMA, RoCEv2, and High‑Speed Interconnects

This article explains how InfiniBand’s architecture, native RDMA, GPUDirect, and evolving bandwidth enable ultra‑low‑latency, high‑throughput communication for AI model training, compares it with Ethernet, and details the role of RoCEv2 and other high‑performance interconnect technologies.

AI training · GPU interconnect · High‑Performance Networking
0 likes · 9 min read
DataFunSummit
Sep 23, 2025 · Artificial Intelligence

How PCache Supercharges Large‑Scale AI Training Storage Performance

This talk explores large‑scale AI training storage challenges and presents PCache, a high‑performance, cloud‑native caching system that optimizes metadata, read/write paths, deployment, and high‑availability, delivering significant throughput gains and cost savings for massive model training workloads.

AI training · PCache · Storage Optimization
0 likes · 25 min read
Architects' Tech Alliance
Sep 15, 2025 · Artificial Intelligence

Why NVLink Beats PCIe for AI Training: A Deep Dive into GPU Interconnects

This article examines the differences between Scale‑Out and Scale‑Up networking in AI compute clusters, comparing PCIe, Ethernet, InfiniBand, NVLink, UALink, and emerging standards like UB‑Mesh, and explains how each technology impacts bandwidth, latency, scalability, and cost for large‑scale model training.

AI training · GPU interconnect · NVLink
0 likes · 28 min read
DataFunTalk
Sep 3, 2025 · Artificial Intelligence

How Alluxio’s Distributed Cache Boosts AI Training to 99.57% GPU Utilization

Alluxio’s distributed caching dramatically accelerates AI training and checkpointing workloads, achieving up to 99.57% GPU utilization and linear scaling across clusters in the MLPerf Storage v2.0 benchmark, while using cost‑effective commodity hardware to eliminate I/O bottlenecks.

AI training · Alluxio · GPU utilization
0 likes · 11 min read
IT Services Circle
Aug 31, 2025 · Artificial Intelligence

Meta’s Dirty Secret: Training AI with 2,396 Adult Films

Meta has been accused of illegally downloading 2,396 paid adult videos since 2018 to train its AI models, including Meta Movie Gen and LLaMA, prompting lawsuits that could cost up to $359 million, highlighting broader industry concerns over copyright infringement in AI training data.

AI training · Legal lawsuit · Meta
0 likes · 6 min read
Architects' Tech Alliance
Aug 18, 2025 · Artificial Intelligence

How Large Model Training Dominates Compute and What New Techniques Can Change It

This article explains why pre‑training large AI models consumes 90‑99% of total compute, describes the full training and inference pipelines, introduces resource‑saving strategies such as PD‑separation, and reviews market trends and infrastructure challenges shaping the next generation of AI systems.

AI Infrastructure · AI training · GPU architecture
0 likes · 13 min read
Architects' Tech Alliance
Jul 19, 2025 · Artificial Intelligence

Best GPU Cluster Network for Large‑Scale AI: NVLink, InfiniBand, RoCE & DDC

This article compares the main networking technologies used in large‑scale AI GPU clusters—NVLink, InfiniBand, RoCE Ethernet, and the emerging DDC full‑schedule fabric—examining latency, lossless transmission, congestion control, cost, power and scalability to help engineers choose the optimal solution for training massive language models.

AI training · DDC · Data center
0 likes · 15 min read
Architect
May 26, 2025 · Artificial Intelligence

Parallelism Strategies for Large-Scale Model Training: Data, Tensor, Pipeline, Sequence, and Expert Parallelism

This article explains the memory limits of a single GPU and systematically introduces data parallelism, tensor parallelism, pipeline parallelism, sequence parallelism, and expert parallelism, describing their communication costs, advantages, drawbacks, and practical implementation details for training large AI models.

AI training · Data Parallelism · Expert Parallelism
0 likes · 14 min read
Architect
May 18, 2025 · Artificial Intelligence

How Much GPU Memory Can One Model Use? A Deep Dive into Transformer Memory Accounting

This article breaks down GPU memory consumption for large Transformer models, explains how to estimate each component—parameters, optimizer state, activations, gradients—and shows how parallelism, mixed precision, and recomputation strategies can dramatically reduce the footprint.

AI training · GPU Memory · Memory Optimization
0 likes · 14 min read
Baidu Geek Talk
May 14, 2025 · Industry Insights

How RapidFS Boosts AI Model Training with 10 TiB/s Throughput

The article explains how large‑scale AI model training and inference require massive data handling, describes the RapidFS storage acceleration cluster deployed on a 30,000‑card Kunlun chip system with hundreds of domestic CPU servers, and presents performance tests showing throughput scaling linearly to more than 1 TiB/s, demonstrating the impact of high‑performance storage on compute efficiency.

AI training · High‑performance computing · Performance Testing
0 likes · 5 min read
Baidu Intelligent Cloud Tech Hub
Apr 25, 2025 · Operations

How RapidFS Accelerates AI Model Training with 10 TiB/s Storage Performance

The article explains how RapidFS, a near‑compute storage acceleration solution built on BOS object storage, delivers up to 10 TiB/s throughput for massive AI model training, detailing its architecture, deployment on a 30,000‑card Kunlun cluster, and performance test results that show linear scaling from 20 to 70 nodes.

AI training · High‑performance computing · Performance Testing
0 likes · 6 min read
Architects' Tech Alliance
Apr 6, 2025 · Fundamentals

PCIe vs NVLink: How Modern GPU Interconnects Power AI Training

As AI models grow to trillion‑parameter scales, training them demands massive GPU clusters whose performance is increasingly limited by network bandwidth; this article examines why traditional PCIe interconnects become bottlenecks and how NVIDIA's NVLink and NVSwitch technologies dramatically improve multi‑GPU communication and overall system efficiency.

AI training · GPU · High‑performance computing
0 likes · 12 min read
Architects' Tech Alliance
Apr 3, 2025 · Artificial Intelligence

Which Nvidia GPU Wins the AI Race? A Deep Dive into A100, H100, A800, H800 & H20

This article examines the latest Nvidia GPU lineup—including A100, H100, A800, H800, and the upcoming H20—detailing their architectures, performance metrics for AI training and inference, cost considerations, and provides a step‑by‑step guide for building a high‑performance compute center.

AI training · Compute cluster · GPU performance
0 likes · 11 min read
AI Algorithm Path
Apr 2, 2025 · Artificial Intelligence

Master the Three Essential LLM Training Stages for 2025

The article breaks down the three core stages of large‑language‑model training—pre‑training, supervised fine‑tuning, and RLHF—explaining their purpose, methods, and concrete examples while noting DeepSeek‑R1’s recent breakthrough and its implications for AI development.

AI training · DeepSeek · LLM
0 likes · 5 min read
DataFunTalk
Mar 24, 2025 · Artificial Intelligence

DeepSeek R1: Open‑Source Reasoning Model and Multi‑Stage Training Insights

The interview explores DeepSeek R1's open‑source weights, its multi‑stage training pipeline—including pre‑training, supervised fine‑tuning, and RLHF—alongside innovations such as self‑consistency, chain‑of‑thought prompting, distillation, MoE architectures, and cost considerations, highlighting its impact on the future of large language models.

AI training · DeepSeek · RLHF
0 likes · 20 min read
Baidu Geek Talk
Mar 17, 2025 · Industry Insights

From Manual Restarts to Automated Fault Tolerance: The Evolution of AI Training Stability

This article traces the decade‑long evolution of AI training stability—from early small‑model manual operations to large‑scale, multi‑thousand‑GPU clusters—detailing metrics like invalid training time, fault‑tolerance architectures, eBPF‑based hidden‑fault detection, BCCL enhancements, multi‑level restart strategies, and trigger‑based checkpointing that together shrink downtime from minutes to seconds.

AI training · Distributed Systems · Infrastructure
0 likes · 22 min read
Baobao Algorithm Notes
Mar 16, 2025 · Artificial Intelligence

Can a 7B LLM Master Sudoku From Scratch Using Reinforcement Learning?

This article details how a 7B parameter language model, fine‑tuned with DeepSeek's GRPO reinforcement‑learning algorithm and a carefully crafted multi‑component reward system, learned to solve Sudoku puzzles without any cold‑start data, outperforming a comparable 3B model and revealing key insights for structured reasoning tasks.

AI training · GRPO · Qwen
0 likes · 15 min read
Architect
Mar 10, 2025 · Artificial Intelligence

What Makes DeepSeek’s New Architecture a Game‑Changer? Inside MLA, GRPO, and MoE Innovations

This article analyzes DeepSeek’s latest large‑model breakthroughs, covering the MLA attention compression, GRPO alignment algorithm, MoE load‑balancing redesign, multi‑stage training pipelines, reinforcement‑learning tricks, and performance comparisons with GPT‑4o‑Mini and Llama 3.1, highlighting both strengths and remaining challenges.

AI training · DeepSeek · GRPO
0 likes · 19 min read
Baidu Intelligent Cloud Tech Hub
Mar 10, 2025 · Artificial Intelligence

How Baidu Baige Achieves Near‑Zero Downtime in Massive AI Model Training

The article examines how Baidu Baige evolved AI training stability from manual operations to precise engineering, detailing metrics, fault‑perception techniques, eBPF‑based diagnostics, multi‑level restart strategies, and trigger‑based checkpointing that together achieve sub‑minute recovery and 99.5% effective training time on massive GPU clusters.

AI training · Large-Scale Clusters · checkpointing
0 likes · 25 min read
Volcano Engine Developer Services
Mar 7, 2025 · Operations

Inside 3FS: How DeepSeek’s Parallel File System Powers AI Training

This article dives deep into DeepSeek's 3FS parallel file system, detailing its four-component architecture, RDMA‑based high‑speed networking, client options, metadata and storage services, replication protocols, dynamic stripe sizing, and recovery mechanisms that enable efficient AI model training and inference.

AI training · Distributed File System · RDMA
0 likes · 21 min read
JavaEdge
Feb 8, 2025 · Artificial Intelligence

Why DeepSeek R1 Rivals ChatGPT o1: Architecture, Training, and Cost Insights

This article provides a detailed technical analysis of DeepSeek's R1 large language model, covering its background, architecture, training methods, hardware optimizations, performance claims, user impressions, deployment options, and the challenges of reproducing its results.

AI training · DeepSeek · GPU Cost
0 likes · 16 min read
AI Cyberspace
Feb 8, 2025 · Artificial Intelligence

Why 8‑GPU Servers Are Essential for LLM Training and Which Interconnect Wins

With modern large‑language‑model workloads demanding massive parallelism, 8‑GPU servers have become the norm; this article explains the roles of CPUs, compares GPU‑to‑GPU interconnect options—including PCIe direct, PCIe Switch, NVLink, and NVSwitch—detailing their architectures, bandwidths, topologies, and trade‑offs for AI training.

8-GPU server · AI training · GPU interconnect
0 likes · 14 min read
DaTaobao Tech
Aug 21, 2024 · Artificial Intelligence

Mastering Custom Large‑Model Training: Data Strategies, LoRA Tricks, and Resource Planning

This article provides a comprehensive, step‑by‑step guide to training customized large language models, covering industry‑specific needs, data privacy, meticulous data cleaning, optimal data‑ratio balancing, token budgeting, GPU memory accounting, LoRA fine‑tuning techniques, and practical evaluation metrics for robust AI deployment.

AI training · Fine-tuning · GPU Memory
0 likes · 23 min read
Architects' Tech Alliance
Aug 18, 2024 · Artificial Intelligence

RDMA, InfiniBand, RoCE, and iWARP: High‑Performance Networking for Large‑Scale Generative AI Model Training

The article explains how RDMA technologies—including InfiniBand, RoCE, and iWARP—provide high‑throughput, low‑latency, CPU‑free data transfer for massive generative AI model training, compares their architectures, and discusses modern network designs and load‑balancing strategies to optimize AI‑focused data‑center networks.

AI training · High‑Performance Computing · InfiniBand
0 likes · 11 min read
DataFunSummit
Jul 23, 2024 · Big Data

Multi-Cloud Unified Data Acceleration Layer at Xiaohongshu: Challenges, Alluxio Solution, and Performance Gains

This article presents Xiaohongshu's multi‑cloud unified data acceleration layer built with Alluxio, detailing the challenges of multi‑cloud architectures, the design goals, Alluxio's architecture and features, real‑world case studies in AI training and recommendation indexing, performance improvements, and future plans.

AI training · Alluxio · Big Data
0 likes · 22 min read
Baobao Algorithm Notes
Jul 10, 2024 · Artificial Intelligence

How to Effectively Continue Pretrain Large Language Models: Scaling Laws, Data Ratios, and Practical Tips

This article explains the motivations behind domain‑specific continued pretraining for large language models, outlines a three‑step workflow that covers vocabulary expansion, data replay, ratio control, and scaling‑law calculations, provides concrete hyper‑parameter recommendations, and discusses challenges across different domain types and future research directions.

AI training
0 likes · 12 min read
DataFunSummit
Jun 20, 2024 · Big Data

Data+AI Data Lake Technologies: Apache Iceberg, PyIceberg, and Vector Table Solutions

This article presents a comprehensive overview of modern Data+AI data lake challenges and solutions, covering the evolution of data lakes, an introduction to Apache Iceberg, practical use of PyIceberg for AI training and inference pipelines, and advanced vector table and indexing techniques for efficient similarity search.

AI training · Apache Iceberg · Big Data
0 likes · 22 min read
DataFunTalk
Jun 14, 2024 · Artificial Intelligence

Midjourney’s Diverse Data Sources: Public Datasets, Academic Research, Partner and Proprietary Data

Midjourney enhances its AI models by integrating a wide range of data sources—including public datasets like ImageNet and COCO, academic research from top conferences, partner collaborations, and its own proprietary data—while continuously updating and managing these datasets for quality, privacy, and security.

AI training · Bright Data · COCO
0 likes · 9 min read
Baidu Intelligent Cloud Tech Hub
May 31, 2024 · Artificial Intelligence

How Multi‑Chip Heterogeneous Clusters Power Next‑Gen Large Model Training

Using a martial‑arts analogy, the article explains why training massive AI models now requires thousands of GPUs or mixed‑chip clusters, outlines three key steps—inter‑connect, distributed parallel strategies, and accelerator acceleration—and shows how Baidu’s Baige platform achieves near‑full efficiency across GPU, Kunlun and Ascend chips.

AI training · GPU interconnect · accelerator optimization
0 likes · 11 min read
Architects' Tech Alliance
May 19, 2024 · Industry Insights

How to Build a 10,000‑GPU Supercluster: Core Design Principles and Architecture

This article analyzes the challenges and solutions for constructing a super‑large GPU training cluster, outlining five fundamental design principles, a four‑layer plus one‑domain architecture, and practical considerations for hardware, networking, and operational reliability in AI workloads.

AI training · GPU cluster · High‑performance computing
0 likes · 8 min read
IT Services Circle
May 13, 2024 · Information Security

The Hidden Costs and Ineffectiveness of CAPTCHAs

CAPTCHAs, originally designed as human‑based computation tools to block bots, have become costly, discriminatory, and largely ineffective security measures that waste billions of dollars annually while providing profit to service providers, prompting a 2024 debate on their continued use.

AI training · Captcha · Human Computation
0 likes · 8 min read
Architects' Tech Alliance
May 5, 2024 · Artificial Intelligence

Why InfiniBand Is the Secret Weapon for AIGC Training Performance

The article examines how InfiniBand’s specialized features—collective communication, in‑network computing, adaptive routing, congestion control, cut‑through forwarding, shallow buffering, and self‑healing—are optimized for large‑scale AI‑generated content (AIGC) training, delivering higher bandwidth, lower latency, and greater fault tolerance than Ethernet alternatives.

AI training, AIGC, Adaptive routing
0 likes · 10 min read
360 Smart Cloud
Apr 25, 2024 · Cloud Native

Building High‑Performance RoCE v2 and InfiniBand Networks in a Cloud‑Native Environment for Large‑Model Training

This article explains how to construct high‑performance RoCE v2 and InfiniBand networks within a cloud‑native Kubernetes environment, detailing the underlying technologies, required components, configuration steps, and performance test results that demonstrate significant communication speed improvements for large‑scale AI model training.

AI training, Cloud Native, High‑Performance Networking
0 likes · 12 min read
Architects' Tech Alliance
Apr 10, 2024 · Industry Insights

Inside the GPU Server: Architecture of A100/A800 and H100/H800 Nodes

This article provides a detailed technical breakdown of modern multi‑GPU server nodes, covering component composition, storage network cards, NVSwitch interconnects, bandwidth calculations, and the architectural differences between NVIDIA A100/A800 and H100/H800 configurations for AI training workloads.

A100, AI training, GPU
0 likes · 12 min read
Model Perspective
Mar 16, 2024 · Artificial Intelligence

What Watching a TV Drama Reveals About AI Model Training and Learning Strategies

The article draws parallels between expert viewers dissecting the drama "The Legend of Zhen Huan," efficient paper‑reading techniques, and the active‑prediction plus contrast‑learning approach that underpins modern AI model training, highlighting how proactive thinking boosts both personal and machine learning outcomes.

AI training, Prediction, active learning
0 likes · 8 min read
Ops Development & AI Practice
Mar 13, 2024 · Artificial Intelligence

How Vector Retrieval Powers AI Model Training and Real-World Applications

Vector retrieval, based on converting data into high‑dimensional vectors and measuring similarity, enables fast, accurate search across massive datasets, supporting AI tasks such as search engines, recommendation, NLP, and computer vision, and plays a crucial role in large‑model training for data selection, anomaly detection, and model optimization.

AI training, Recommendation Systems, Vector Retrieval
0 likes · 6 min read
Alibaba Cloud Native
Feb 21, 2024 · Cloud Native

How Fluid & JindoCache Accelerate Large‑Scale AI Training in a Cloud‑Native Environment

This article examines the challenges of data‑intensive AI training on heterogeneous cloud‑native infrastructure and explains how the Fluid framework combined with JindoCache and KubeDL provides distributed caching, metadata acceleration, and seamless POSIX access to dramatically improve I/O performance, GPU utilization, and cost efficiency.

AI training, Data Caching, Fluid
0 likes · 18 min read
Architects' Tech Alliance
Sep 9, 2023 · Industry Insights

Can NSLB Double AI Training Speed? Inside the 113% Performance Gain Over ECMP

The article analyzes AI‑training traffic patterns, critiques existing flow‑based, flowlet‑based, and packet‑based ECMP load‑balancing, introduces the NSLB solution tailored for AI clusters, and presents experimental results showing up to 113% speed improvement and sub‑millisecond failover with DPFF, while also discussing direct‑topology and intelligent lossless networking techniques.

AI training, DPFF, Data center
0 likes · 11 min read
Alibaba Cloud Big Data AI Platform
Jun 19, 2023 · Cloud Computing

Predictable Network: Alibaba Cloud’s Ethernet Edge for Faster AI Training

This article examines the challenges of scaling AI model training beyond single-chip limits, introduces Alibaba Cloud’s Predictable Network architecture—including high‑performance Ethernet, dual‑uplink, and adaptive routing—and compares its performance, scalability, and reliability against InfiniBand, showing how Ethernet can meet AI workloads with minimal loss.

AI training, Ethernet vs InfiniBand, High‑Performance Networking
0 likes · 27 min read
Alibaba Cloud Infrastructure
Jun 16, 2023 · Cloud Computing

Predictable Network and High‑Performance Network Architecture for Large‑Scale AI Training

The article examines how Alibaba Cloud’s Predictable Network and HPN high‑performance network design, together with an analysis of InfiniBand versus Ethernet trade‑offs, address the extreme bandwidth, latency, scalability and reliability requirements of modern large‑model AI training workloads in cloud data centers.

AI training, High‑performance computing, InfiniBand
0 likes · 24 min read
DataFunTalk
May 25, 2023 · Artificial Intelligence

Optimizing Distributed Cache for Large-Scale Deep Learning Training with Alluxio and SiloD

This article examines the storage bottlenecks in large‑scale AI training, evaluates local‑disk and Alluxio‑based distributed caching strategies, proposes uniform cache eviction and replica‑aware global policies, and introduces the SiloD framework for coordinated compute‑storage scheduling to dramatically improve GPU utilization and overall cluster throughput.

AI training, Alluxio, Cache Eviction
0 likes · 16 min read
Tencent Cloud Developer
Mar 22, 2023 · Artificial Intelligence

Tencent Star Network: High‑Performance GPU Cluster Architecture for Large‑Scale AI Model Training

Tencent’s Star Network combines a 1.6 Tbps Ethernet‑RDMA fabric, a fat‑tree topology supporting up to 4K GPUs, multi‑track traffic aggregation, adaptive heterogeneous links, and a custom TCCL library. Together these cut AllReduce overhead from 35% to 3.7% and speed AI training iterations by 32%, while automating deployment and providing sub‑second self‑healing.

AI training, GPU clusters, RDMA
0 likes · 19 min read
Baidu Geek Talk
Dec 27, 2022 · Artificial Intelligence

How to Supercharge AI Model Training: Bottlenecks and Cutting‑Edge Acceleration Techniques

This article systematically examines the major performance bottlenecks in AI model training, explains the underlying hardware and software causes, and presents a comprehensive set of acceleration strategies—including data‑loading optimizations, compute‑side enhancements, communication tricks, and the AIAK‑Training suite—backed by real‑world case studies and quantitative results.

AI training, AIAK-Training, Distributed Training
0 likes · 33 min read
Baidu Intelligent Cloud Tech Hub
Dec 22, 2022 · Artificial Intelligence

How to Supercharge AI Model Training: Bottlenecks and Acceleration Techniques

This article systematically analyzes the main performance bottlenecks in AI model training, explains why acceleration is essential, and presents current hardware‑ and software‑based solutions—including data‑loading optimizations, operator fusion, mixed‑precision and Tensor Core usage, as well as distributed communication strategies—followed by real‑world case studies of Baidu's AIAK‑Training suite that demonstrate significant speed‑ups.

AI training, Distributed Training, GPU Acceleration
0 likes · 31 min read
Baidu Intelligent Cloud Tech Hub
Oct 19, 2022 · Artificial Intelligence

Why Storage Systems Bottleneck AI Training and How to Accelerate Them

This article examines the comprehensive challenges AI applications face from storage to compute, traces the evolution of AI training infrastructure, analyzes key bottlenecks such as compute acceleration, resource scheduling, massive data handling and data flow, and presents Baidu Cloud's storage acceleration solutions—including parallel file systems, caching, and the Fluid scheduler—to dramatically improve AI training performance.

AI training, Cloud Native, Data Lake
0 likes · 38 min read
AntTech
Oct 9, 2022 · Cloud Computing

Sky Computing: A Multi‑Cloud Computing Platform for Transparent Resource Utilization

Sky Computing, introduced by Ant Technology Research Institute, proposes a cloud‑agnostic platform that abstracts heterogeneous public and private clouds into a unified service layer, enabling applications to seamlessly migrate workloads across clouds, reduce costs, avoid vendor lock‑in, and support AI training via the SkyML prototype.

AI training, Cost Optimization, cloud computing
0 likes · 54 min read
Baidu Geek Talk
Jul 26, 2022 · Industry Insights

How Baidu’s Canghai Storage Powers High‑Performance Computing: Challenges and Solutions

This article analyzes the storage challenges of high‑performance computing—including traditional HPC, AI‑driven HPC, and high‑performance data analysis—examines Baidu’s internal practices, and presents the Canghai storage platform with its object storage, parallel file system (PFS) and RapidFS solutions that address throughput, latency, and scalability requirements.

AI training, High‑performance computing, cloud storage
0 likes · 31 min read
Baidu Intelligent Cloud Tech Hub
Jul 21, 2022 · Cloud Computing

How Baidu’s Cloud Storage Powers High‑Performance Computing and AI Workloads

This article explains the storage challenges of high‑performance computing—including traditional HPC, AI‑driven HPC, and HPDA—then details Baidu’s unified storage platform, object storage BOS, and runtime solutions PFS and RapidFS, illustrating their architecture, features, and a real‑world autonomous‑driving customer case.

AI training, Data Lake, cloud storage
0 likes · 29 min read
Architects' Tech Alliance
Aug 25, 2021 · Industry Insights

Can Storage Class Memory Transform Data Centers? A Deep Dive into SCM Benefits and Challenges

This article examines the emerging Storage Class Memory (SCM) market, outlines its various technologies, evaluates performance and cost trade‑offs, explores three concrete use cases—AI training acceleration, instant data recovery, and greener data‑center operation—and discusses the latency and workload‑model challenges that must be solved for widespread adoption.

AI training, Data center, Memory Technology
0 likes · 16 min read
Tencent Architect
Feb 23, 2021 · Artificial Intelligence

Analysis and Optimization of CephFS I/O Performance for AI Training on the Xingchen Compute Platform

This article investigates why AI training tasks on Tencent's Xingchen compute platform experience severe I/O slowdown when using CephFS, analyzes the underlying Ceph‑FUSE and MDS mechanisms, and proposes metadata‑caching and file‑caching optimizations that can accelerate training speed by three to four times.

AI training, Ceph-FUSE, CephFS
0 likes · 21 min read
Alibaba Cloud Native
May 12, 2020 · Artificial Intelligence

Boosting Cloud‑Native AI Training with Alluxio: Performance Tuning on Kubernetes

This article examines the challenges of large‑scale deep‑learning model training on Kubernetes, analyzes performance bottlenecks caused by Alluxio‑FUSE integration, and presents a series of configuration and system‑level optimizations that dramatically improve data‑access speed and overall training throughput.

AI training, Alluxio, Cloud Native
0 likes · 22 min read
UCloud Tech
Mar 24, 2020 · Artificial Intelligence

Why Does PyTorch Struggle with UFS Storage? Insights and Optimizations

A detailed case study reveals why PyTorch training on UFS file storage suffers severe I/O bottlenecks, compares it with local SSD and SSHFS, and presents practical optimizations such as using cv2.imdecode, caching DataLoader handles, and converting small‑file datasets into large UFS files to close the performance gap.

AI training, PyTorch, UFS
0 likes · 14 min read
Architects' Tech Alliance
Dec 24, 2019 · Fundamentals

Design Considerations and Benefits of Storage Class Memory (SCM) for Data‑Intensive Applications

The article examines the emerging Storage Class Memory (SCM) market, outlines its various technologies, and discusses performance and cost trade‑offs. It highlights how SCM can accelerate AI training, enable fast data recovery, and reduce data‑center power consumption, and it presents the challenges of latency and system integration.

AI training, SCM, Storage Class Memory
0 likes · 15 min read
iQIYI Technical Product Team
Jan 4, 2019 · Artificial Intelligence

Building a Deep Learning Training Platform on Cloud: Challenges, Runonce Service, and Storage Optimization

iQIYI replaced its initial Runonce service with Jarvis, a cloud‑based deep‑learning training platform. The team containerized GPU tasks, adopted Ceph S3 storage with FUSE, optimized data pipelines, and addressed compute, storage, and networking challenges to improve scalability and reduce GPU idle time.

AI training, Deep Learning, GPU computing
0 likes · 9 min read