Tagged articles

186 articles

Apr 14, 2025 · Artificial Intelligence

PaddlePaddle Framework 3.0: Five Core Breakthroughs Reshaping Large Model Development

PaddlePaddle Framework 3.0 delivers five breakthroughs—dynamic‑static unified automatic parallelism, integrated training‑inference pipelines, high‑order scientific differentiation, a neural‑network compiler with automatic operator fusion, and streamlined heterogeneous chip adaptation—drastically reducing development effort, boosting training speed, and expanding compatibility for large‑scale AI models.

AI Infrastructure · Distributed Training · Model Inference Optimization

0 likes · 23 min read

Fighter's World

Mar 29, 2025 · Industry Insights

A Year in AI: Key Insights from the Unsupervised Learning & Latent Space Podcast

The podcast recap dissects a year of rapid AI change, highlighting surprisingly fast open‑source model releases, shifting foundation‑model dynamics, the rise of GPT wrappers, over‑hyped agents, undervalued memory, product‑market fit debates, infrastructure opportunities, and lingering mysteries such as RL in non‑verifiable domains.

AI Infrastructure · AI trends · GPT wrappers

0 likes · 22 min read

Baobao Algorithm Notes

Mar 13, 2025 · Artificial Intelligence

Why EP Outperforms TP for DeepSeek V3/R1 Inference: Cost, Performance, and Reliability

This article analyzes DeepSeek's EP‑based inference architecture for V3/R1 models, comparing it with TP, detailing how EP reduces memory and compute overhead, boosts batch size, cuts GPU memory usage, and introduces reliability, scalability, and maintainability challenges for large‑scale deployments.

AI Infrastructure · Expert Parallelism · GPU memory optimization

0 likes · 18 min read

Architects' Tech Alliance

Mar 9, 2025 · Industry Insights

DeepSeek’s AI Ecosystem: From Core Tech to Market Impact

This article provides a comprehensive analysis of DeepSeek, covering its foundational AI research, technology stack, product offerings, and the broader upstream, midstream, and downstream AI industry landscape, including hardware, server, cloud, and market trends.

AI Infrastructure · Artificial Intelligence · DeepSeek

0 likes · 13 min read

AntData

Mar 5, 2025 · Cloud Native

DeepSeek 3FS Network Communication Module: Design, Implementation, and Impact on AI Infrastructure

This article provides an in‑depth analysis of DeepSeek's open‑source 3FS distributed storage system, focusing on its network communication module, RDMA‑based design, core classes such as IBSocket, Listener, and IOWorker, and how these innovations advance high‑performance AI infrastructure.

AI Infrastructure · Folly Coroutines · RDMA

0 likes · 15 min read

dbaplus Community

Feb 23, 2025 · Databases

Why Vector Databases Are Really Just Search Engines in Disguise

The article traces the evolution of embedding technology from a secret weapon of tech giants to a mainstream developer tool, explains the rapid rise and subsequent integration of vector databases into traditional search engines, and argues that vector databases are essentially search engines with added vector capabilities.

AI Infrastructure · RAG · database integration

0 likes · 9 min read

Architects' Tech Alliance

Feb 19, 2025 · Industry Insights

Why DeepSeek One‑Stop AI Machines Are Redefining Private Model Deployment

The surge in demand for private AI deployment has prompted multiple vendors to launch DeepSeek one‑stop machines—integrated hardware solutions that support the full DeepSeek model family, offering higher stability, easier setup, customization, cost savings, and data security across diverse industry scenarios.

AI Infrastructure · AI hardware · DeepSeek

0 likes · 7 min read

Alibaba Cloud Infrastructure

Jan 20, 2025 · Cloud Computing

2024 Alibaba Cloud Infrastructure Network Team: AI‑Scale Network Innovations, Academic Achievements, Open‑Source Contributions and Industry Outreach

The 2024 report of Alibaba Cloud's Infrastructure Network team details AI‑driven network breakthroughs, high‑performance protocol stacks, large‑scale monitoring systems, numerous top‑conference paper acceptances, open‑source ecosystem initiatives, and extensive industry outreach, highlighting the evolving AI infra landscape.

AI Infrastructure · Conference Papers · Data Center Networking

0 likes · 19 min read

21CTO

Jan 7, 2025 · Artificial Intelligence

Why AI Data Centers Will Keep Spending Billions Through 2025 – Is the Boom Sustainable?

The article examines the massive AI data‑center spending surge, highlighting Microsoft's $80 billion pledge, Amazon's $75 billion capex plans, market forecasts that AI servers will dominate the server market by 2025, and the sustainability concerns surrounding this rapid growth.

AI Infrastructure · AI servers · Amazon

0 likes · 6 min read

DataFunSummit

Dec 30, 2024 · Artificial Intelligence

Colossal-AI: A Scalable Framework for Distributed Training of Large Models

This presentation introduces the challenges of the large‑model era, describes the Colossal‑AI architecture—including N‑dimensional parallelism, heterogeneous storage, and zero‑code experience—shows benchmark results and real‑world use cases, and answers audience questions about its integration with PyTorch and advanced parallel strategies.

AI Infrastructure · Colossal-AI · Heterogeneous Storage

0 likes · 11 min read

AI Cyberspace

Dec 17, 2024 · Artificial Intelligence

Why AWS’s Self‑Designed Chips Are Redefining AI Infrastructure

At AWS re:Invent 2024, Amazon unveiled its self‑designed AI hardware trio—Graviton 4 CPU, Nitro 5 DPU, and Trainium 2 accelerator—explaining the innovation, efficiency, and cost advantages driving the strategy, and detailing how these chips power next‑generation cloud services, ultra‑high‑performance servers, and massive AI super‑computing clusters.

AI Infrastructure · AI hardware · AWS

0 likes · 20 min read

Alibaba Cloud Developer

Nov 28, 2024 · Artificial Intelligence

Mooncake: Open-Source KVCache-Centric Architecture Boosting Large-Model Inference

Mooncake, an open-source KVCache-centric inference architecture co-developed by Alibaba Cloud and Tsinghua University's MADSys lab, dramatically improves large-model throughput and reduces cost by decoupling resources, standardizing cache pooling, and integrating with frameworks like vLLM, sparking broad industry interest.

AI Infrastructure · KVCache · Open source

0 likes · 4 min read

DevOps

Nov 27, 2024 · Artificial Intelligence

Elon Musk’s Colossus Supercomputer: Building 100,000 GPUs in 122 Days and Its Impact on AI Infrastructure

The article analyzes Elon Musk’s Colossus AI supercomputer—its 100,000 NVIDIA H100 GPUs, record‑fast 122‑day construction, vertical‑integration strategy, and the broader implications for U.S. AI infrastructure dominance and China’s competing challenges in funding and chip supply.

AI Infrastructure · AI strategy · Elon Musk

0 likes · 13 min read

Architects' Tech Alliance

Nov 17, 2024 · Industry Insights

What Drives China's AI Server Market? A Deep Dive into Supply Chain, Demand, and Competition

This article provides a comprehensive analysis of China's AI server industry, covering upstream component markets, midstream shipment and revenue trends, downstream application demand, server classifications, major players, and future policy and technology drivers, all backed by recent market data and charts.

AI Infrastructure · AI servers · China

0 likes · 16 min read

Alibaba Cloud Infrastructure

Nov 13, 2024 · Industry Insights

Why GPU Scale‑Up Interconnects Need a New Protocol – Inside UALink and Alibaba’s Alink

The article analyzes the growing demand for high‑bandwidth, low‑latency GPU Scale‑Up interconnects in AI clusters, explains why existing Ethernet and RDMA solutions fall short, and examines the industry‑wide UALink alliance and Alibaba's Alink System as a new open‑ecosystem solution.

AI Infrastructure · Alink System · GPU

0 likes · 12 min read

Baidu Intelligent Cloud Tech Hub

Oct 28, 2024 · Cloud Native

How Baidu Smart Cloud Reinvents Cloud‑Native Infrastructure for the AI‑Native Era

The talk outlines Baidu Smart Cloud's comprehensive cloud‑native redesign—including ultra‑elastic compute, AI‑focused storage, high‑performance networking, AI‑driven operations, and edge‑distributed services—illustrated with automotive and fintech case studies that demonstrate how enterprises can accelerate digital transformation in the AI‑native age.

AI Infrastructure · Data Lake · Edge Computing

0 likes · 12 min read

Architects' Tech Alliance

Oct 21, 2024 · Fundamentals

Why, What, and How of RDMA in AI Networks: Architecture, Protocols, and Future Directions

This article explains the motivations behind RDMA, describes its architecture, key components, and protocols such as RoCEv2, and discusses future technical challenges for scaling RDMA in large AI and HPC data‑center networks.

AI Infrastructure · Data Center Networking · Network Protocols

0 likes · 19 min read

360 Tech Engineering

Oct 15, 2024 · Artificial Intelligence

Implementation and Optimization of 360 AI Compute Center: Infrastructure, Network, Kubernetes, and Training/Inference Acceleration

The article details the design and deployment of 360's AI Compute Center, covering GPU server selection, high‑performance networking, Kubernetes‑based cluster management, advanced scheduling, training and inference acceleration techniques, and a comprehensive AI development platform with visualization and fault‑tolerance features.

AI Infrastructure · GPU cluster · Inference Acceleration

0 likes · 21 min read

Alibaba Cloud Infrastructure

Oct 12, 2024 · Fundamentals

Alibaba Cloud Server R&D Team Publishes Three Papers on High‑Density PCIe 6.0, 100G‑PAM4 Ethernet, and Immersion‑Cooling PCB Materials at IEEE EPEPS 2024 and PCB West 2024

Alibaba Cloud's server R&D team presented three research papers at IEEE EPEPS 2024 and PCB West 2024 covering high‑density PCIe 6.0 crosstalk optimization, 100G‑PAM4 Ethernet performance under air and immersion cooling, and sustainable low‑cost PCB materials for immersion‑cooled computer systems, highlighting their relevance to AI infrastructure and data‑center design.

AI Infrastructure · High-speed interconnect · Immersion Cooling

0 likes · 10 min read

360 Zhihui Cloud Developer

Oct 11, 2024 · Artificial Intelligence

How 360 Built a Thousand‑GPU AI Supercomputer with Kubernetes and Advanced Scheduling

This article details the design and implementation of 360’s AI Computing Center, covering server selection, network topology, Kubernetes scheduling, training and inference acceleration, and the AI platform’s core, visualization, and fault‑tolerance capabilities for large‑scale AI workloads.

AI Infrastructure · Distributed Training · GPU cluster

0 likes · 22 min read

Baidu Geek Talk

Oct 9, 2024 · Artificial Intelligence

How Baidu’s Baige 4.0 Architecture Redefines AI Compute Efficiency

This article analyzes Baidu's Baige 4.0 AI infrastructure, detailing its four‑layer architecture, XMAN 5.0 hardware, HPN network, BCCL communication library, and AIAK inference upgrades, and explains how these innovations address large‑model training and inference challenges while boosting performance, utilization, and cost efficiency.

AI Infrastructure · Cluster Management · GPU Acceleration

0 likes · 16 min read

Baidu Intelligent Cloud Tech Hub

Sep 29, 2024 · Artificial Intelligence

How Baidu’s Baige 4.0 Redefines AI Infrastructure for Large‑Model Training

The article details Baidu Baige 4.0’s four‑layer AI infrastructure—hardware, cluster components, training‑inference acceleration, and platform tools—highlighting its heterogeneous computing, high‑performance networking, fault‑tolerant communication library, and optimizations that boost large‑model training and inference efficiency.

AI Infrastructure · High‑Performance Networking · heterogeneous computing

0 likes · 17 min read

DataFunSummit

Sep 24, 2024 · Artificial Intelligence

Streaming Data Pipelines and Scaling Laws for Efficient Large‑Model Training

The article discusses the challenges of training ever‑larger AI models on internet‑scale data, critiques traditional batch ETL pipelines, and proposes a streaming data‑flow architecture with dynamic data selection and a shared‑memory/Alluxio middle layer to decouple data processing from model training, improving efficiency and scalability.

AI Infrastructure · Multimodal Data · data pipelines

0 likes · 20 min read

Data Thinking Notes

Sep 19, 2024 · Artificial Intelligence

Why AI Has Only a Seven-Year History—and What AI+ Means for the Future

In this speech, Wang Jian reflects on the evolution of artificial intelligence, arguing that modern AI is fundamentally different from its early concepts, emphasizing the pivotal roles of data, models, and infrastructure, and exploring the transformative impact of AI+, transformers, and cloud platforms on future innovation.

AI Infrastructure · AI+ · Artificial Intelligence

0 likes · 18 min read

Architects' Tech Alliance

Sep 17, 2024 · Industry Insights

Why Intelligent Computing Centers Are the Backbone of China’s AI Boom

The article explains what an Intelligent Computing Center (智算中心) is, analyzes its extensive upstream and downstream industry chain, describes the cutting‑edge AI computing architecture that powers it, forecasts massive growth in AI compute capacity by 2028, and outlines regional deployment strategies and service models such as leasing, data, operation, and talent cultivation.

AI Infrastructure · AI computing · Intelligent Computing Center

0 likes · 11 min read

21CTO

Sep 10, 2024 · Artificial Intelligence

Why AI Has Only a Seven‑Year History and What AI Infrastructure Means for the Future

In this speech, academician Wang Jian reflects on the short, seven‑year history of modern AI, distinguishes AI, AI+ and AI infrastructure, explains how data, models and compute power have become the new foundational layer, and examines the roles of Google, OpenAI, transformers, and cloud services in shaping today’s AI revolution.

@Data · AI Infrastructure · AI+

0 likes · 20 min read

Baobao Algorithm Notes

Jul 24, 2024 · Artificial Intelligence

What Powers Meta’s Llama 3 405B? Inside the Architecture, Scaling Laws, and Massive Training Infrastructure

This article dissects Meta’s Llama 3 405‑billion‑parameter model, covering its dense Transformer design, data‑mixing strategy, two‑stage scaling‑law prediction, 4‑D parallelism, custom hardware clusters, training schedules, post‑training alignment methods, and the extensive evaluation results that benchmark it against leading LLMs.

AI Infrastructure · Distributed Training · Llama 3

0 likes · 56 min read

NewBeeNLP

Jul 24, 2024 · Industry Insights

From Black Iron to Silver: The Evolution of Large Model Infrastructure (2019‑2024)

The article traces the evolution of large‑model training and inference infrastructure from the early “black‑iron” era (2019‑2021) through the “golden” boom (2022‑2023) to the emerging “silver” phase (2024‑), highlighting key research breakthroughs, open‑source frameworks, hardware trends, market dynamics, and practical challenges for engineers entering the field.

AI Infrastructure · Inference · Large Model

0 likes · 22 min read

Architects' Tech Alliance

Jul 15, 2024 · Artificial Intelligence

Why Model-as-a-Service (MaaS) Is Shaping the Future of AI Deployment

This article examines the Model-as-a-Service (MaaS) paradigm, tracing its origins, defining its expanded capabilities for large‑model ecosystems, outlining the full‑stack services it offers, and analyzing current industry adoption, deployment models, and the technical and regulatory challenges that must be addressed for scalable AI rollout.

AI Infrastructure · AI deployment · Cloud AI

0 likes · 11 min read

DataFunTalk

Jul 8, 2024 · Artificial Intelligence

Challenges and Techniques for Distributed Training of Large Language Models

This article discusses the historical background of large language models, the major challenges of training them, such as massive compute and memory demands, and the technical ecosystem that enables efficient distributed training, including data parallelism, pipeline parallelism, and optimization strategies such as DeepSpeed and 1F1B.

AI Infrastructure · DeepSpeed · Pipeline Parallelism

0 likes · 22 min read

21CTO

Jun 7, 2024 · Artificial Intelligence

Why AI Gateways Are the Next Evolution of API Gateways

AI gateways have emerged as essential infrastructure for modern AI applications, offering specialized security, load balancing, cost management, and observability that go beyond traditional API gateways, and understanding their differences and deployment considerations is crucial for developers and ops teams.

AI Infrastructure · AI gateway · Cost Management

0 likes · 10 min read

Alibaba Cloud Big Data AI Platform

May 24, 2024 · Artificial Intelligence

How DeepRec Extension Boosts Distributed Sparse Model Training with Elasticity and Fault Tolerance

DeepRec Extension enhances large‑scale sparse model training by adding automatic elastic training, resource‑aware scheduling, real‑time monitoring, and efficient fault‑tolerance mechanisms, enabling lower cost, higher throughput, and more reliable distributed training for AI workloads.

AI Infrastructure · DeepRec · Sparse Models

0 likes · 13 min read

Baidu Tech Salon

May 15, 2024 · Artificial Intelligence

Accelerating Large Model Training and Inference with Baidu Baige AIAK‑LLM

Baidu Baige’s AIAK‑LLM suite accelerates large‑model training and inference by boosting Model FLOPs Utilization (MFU) through techniques such as TP communication overlap, hybrid recompute, zero‑offload, automatic parallel‑strategy search, multi‑chip support, and inference‑specific optimizations, achieving over 60% speedup and seamless Hugging Face integration.

AI Infrastructure · AIAK-LLM · Baidu Baige

0 likes · 26 min read

Baidu Geek Talk

May 15, 2024 · Artificial Intelligence

Accelerating Large Model Training and Inference with Baidu Baige AIAK‑LLM: Challenges, Techniques, and Optimizations

The talk outlines how Baidu’s Baige AIAK‑LLM suite tackles the exploding compute demands of trillion‑parameter models by boosting Model FLOPS Utilization through advanced parallelism, memory‑saving recompute, zero‑offload, adaptive scheduling, and cross‑chip orchestration, delivering 30‑60% training and inference speedups and a unified cloud product.

AI Infrastructure · Baidu · Inference Optimization

0 likes · 25 min read

Baidu Intelligent Cloud Tech Hub

May 15, 2024 · Artificial Intelligence

How Baidu’s AIAK‑LLM Supercharges Large‑Model Training and Inference

The article explains the scaling challenges of ever‑larger LLMs, introduces the MFU performance metric, surveys industry parallelism and memory‑saving techniques, and details Baidu’s AIAK‑LLM suite—including resource, component and acceleration layers—as well as concrete training and inference optimizations that raise MFU by 30‑60% and cut deployment costs.

AI Infrastructure · Large Model · MFU

0 likes · 25 min read

ZhongAn Tech Team

May 13, 2024 · Artificial Intelligence

Weekly Tech Overview: AI Advances, Mobile Game Store, and Industry Insights

This weekly tech roundup covers Microsoft’s upcoming mobile game store, Alibaba Cloud’s Tongyi Qianwen 2.5 AI model, Google DeepMind’s AlphaFold 3 for drug discovery, TikTok’s AI‑content labeling, an AI‑native product from DCITS (神州信息), Apple’s on‑device AI chips, expert views on scaling laws, and news on Fei‑Fei Li’s startup, Apple’s China tax, and Buffett’s Apple stake reduction.

AI · AI Infrastructure · Entrepreneurship

0 likes · 7 min read

Architects' Tech Alliance

May 9, 2024 · Artificial Intelligence

AI Servers: Market Opportunities, Architecture, and Future Demand Driven by Generative AI

The article examines how the surge of generative AI (AIGC) is fueling rapid growth in AI server demand, detailing the emerging AIGC ecosystem, server hardware composition, model scaling, heterogeneous computing, training vs. inference workloads, market size forecasts, and the competitive landscape of AI server manufacturers.

AI Infrastructure · AI servers · GPU

0 likes · 15 min read

ITPUB

Apr 27, 2024 · Databases

How Vector Databases Enable High‑Dimensional Stock Quant Analysis

This interview‑style guide explores how vector databases handle massive, high‑dimensional time‑series data for quantitative stock trading, detailing data scaling challenges, selection criteria, and why the research team chose LanceDB over alternatives for efficient, scalable financial analysis.

AI Infrastructure · LanceDB · Quantitative Finance

0 likes · 7 min read

Architects' Tech Alliance

Apr 25, 2024 · Industry Insights

What China’s AI Labs Learned from Scaling Domestic Large‑Model Training

The article analyzes the computational characteristics and system challenges of training large AI models on domestic platforms, examines framework parallelism and future algorithms, and proposes six strategic measures—including scaling compute, improving data management, building a national R&D team, and boosting AI‑chip investment—to accelerate China’s AI leadership.

AI Infrastructure · Model Training · domestic AI

0 likes · 5 min read

DataFunSummit

Mar 31, 2024 · Artificial Intelligence

Challenges and Techniques in Distributed Training of Large Language Models

This article reviews the rapid development of large language models since 2019, outlines the historical background, identifies key challenges such as massive compute demand, memory constraints, and system complexity, and then details distributed training technologies—including data parallelism, pipeline parallelism, and advanced optimization strategies—while also discussing future research directions and answering common questions.

AI Infrastructure · Data Parallelism · DeepSpeed

0 likes · 23 min read

Architects' Tech Alliance

Mar 27, 2024 · Industry Insights

Why AI Large‑Model Training Needs Ultra‑High‑Bandwidth, Low‑Latency Networks

The rapid growth of AI model sizes has created unprecedented demands on network bandwidth, latency, stability, and automation, making efficient RDMA‑based interconnects, advanced congestion control, and intelligent deployment essential for scaling distributed training clusters to thousands of GPUs.

AI Infrastructure · AI training · RDMA

0 likes · 11 min read

Bilibili Tech

Mar 15, 2024 · Artificial Intelligence

Hardware Resource Estimation and Bottleneck Analysis for Large Language Models (LLMs)

The article analyzes the compute, memory, and communication resources required to train and run large language models, quantifies bottlenecks such as the massive FLOP demand, terabyte‑scale GPU memory, and high‑bandwidth interconnect needs, and evaluates parallelism strategies and bandwidth estimates to guide hardware and software design for scaling LLMs.

AI Infrastructure · Hardware · LLM

0 likes · 53 min read

DataFunSummit

Mar 14, 2024 · Artificial Intelligence

Multi‑Level Efficiency Challenges and Emerging Paradigms for Large AI Models

The article examines how large AI models are moving toward a unified, low‑knowledge‑density paradigm that raises computational efficiency challenges across model, algorithm, framework, and infrastructure layers, while also highlighting NVIDIA's GTC 2024 China AI Day sessions that showcase practical solutions and upcoming training opportunities.

AI Infrastructure · AI conferences · NVIDIA GTC

0 likes · 10 min read

Baidu Geek Talk

Mar 6, 2024 · Artificial Intelligence

How Baidu’s BCCL Boosts Large‑Model Training with Real‑Time Observability and Fault Diagnosis

The article explains why collective communication is critical for distributed large‑model training, outlines the new requirements for system reliability, and introduces Baidu’s Collective Communication Library (BCCL), detailing its enhanced observability, fault‑diagnosis, stability, and performance optimizations that raise effective training time to 98% and bandwidth utilization to 95%.

AI Infrastructure · Distributed Training · Fault Diagnosis

0 likes · 11 min read

Baidu Intelligent Cloud Tech Hub

Mar 1, 2024 · Artificial Intelligence

How Baidu’s BCCL Boosts Distributed AI Training with Real‑Time Observability and Fault Diagnosis

Baidu’s Collective Communication Library (BCCL) enhances large‑model distributed training by improving real‑time bandwidth monitoring, fault diagnosis, network stability, and performance, leveraging RDMA networks and GPU‑specific optimizations to increase effective training time to 98% and bandwidth utilization to 95%.

AI Infrastructure · Distributed Training · Fault Diagnosis

0 likes · 11 min read

JD Retail Technology

Jan 30, 2024 · Artificial Intelligence

Next-Generation Multi‑GPU Synchronous Training Architecture for Large‑Scale Sparse Recommendation Models

The article details JD Retail's evolution from TensorFlow‑based sparse training to a custom high‑performance parameter server and a fully GPU‑accelerated, multi‑node, multi‑card synchronous training framework that leverages GPU‑RDMA, two‑level CPU‑DRAM/GPU‑HBM caching, and pipeline parallelism to overcome storage, I/O, and compute challenges of trillion‑parameter recommendation systems.

AI Infrastructure · GPU Acceleration · Parameter Server

0 likes · 12 min read

DataFunSummit

Jan 22, 2024 · Artificial Intelligence

Improving Efficiency of Large‑Scale AI Model Training, Fine‑tuning, and Deployment with Colossal‑AI

This article introduces Colossal‑AI, an open‑source platform that tackles the challenges of training, fine‑tuning, and deploying massive AI models by leveraging efficient memory management, N‑dimensional parallelism, and high‑performance inference to dramatically reduce cost and improve scalability across thousands of GPUs.

AI Infrastructure · Colossal-AI · Distributed Training

0 likes · 21 min read

Alibaba Cloud Infrastructure

Nov 11, 2023 · Cloud Computing

Alibaba Cloud Executive Discusses IPv6 Deployment, Global Collaboration, and AI‑Driven Network Evolution at the 2023 Wuzhen Internet Forum

In a detailed interview at the 2023 Wuzhen Internet Forum, Alibaba Cloud’s infrastructure lead Cai Dezhong outlines the three‑phase IPv6 rollout, highlights organizational and technical innovations, stresses the need for global cooperation, and explains how IPv6 underpins the next generation AI infrastructure and predictable high‑performance networking.

AI Infrastructure · Global Collaboration · High‑Performance Networking

0 likes · 9 min read

Baidu Intelligent Cloud Tech Hub

Oct 10, 2023 · Artificial Intelligence

How AI Infrastructure Fuels High‑Quality Digital Economy Growth

The article summarizes a Baidu Cloud Intelligence conference speech and whitepaper, explaining how AI foundations and large‑model infrastructure reshape applications, boost enterprise digital transformation, and drive regional economic development, offering a roadmap for high‑quality digital economy advancement.

AI Infrastructure · Digital Economy · enterprise transformation

0 likes · 11 min read

Baidu Intelligent Cloud Tech Hub

Sep 21, 2023 · Artificial Intelligence

How Baidu Cloud Integrates AI and Cloud to Accelerate Autonomous Driving

At the 2023 Baidu Cloud Intelligence Conference, Baidu AI Cloud outlined a comprehensive, four‑layer solution—spanning distributed cloud infrastructure, AI‑focused compute, data compliance, and end‑to‑end toolchains—to address the challenges of electric, intelligent vehicles, large‑model deployment, and regulatory compliance in autonomous driving.

AI Infrastructure · autonomous driving · cloud computing

0 likes · 12 min read

Alibaba Cloud Big Data AI Platform

Sep 19, 2023 · Artificial Intelligence

BladeLLM: Ultra‑Long Context LLM Inference via RaggedAttention & AutoTuner

BladeLLM, Alibaba Cloud’s large‑model inference engine, pushes the limits of LLMs by supporting ultra‑long context lengths up to 70 K tokens, leveraging novel RaggedAttention and a DNN‑based AutoTuner to deliver superior performance, memory efficiency, and low‑latency inference across diverse workloads.

AI Infrastructure · AutoTuner · LLM inference

0 likes · 11 min read

Efficient Ops

Jun 11, 2023 · Artificial Intelligence

Why Network Bandwidth Is the Real Bottleneck for AIGC and How DDC Solves It

The article explains how AIGC models demand massive GPU compute, why network bandwidth and latency become the critical limiting factors, and how the Distributed Disaggregated Chassis (DDC) architecture addresses these challenges with scalable, high‑throughput networking solutions.

AI Infrastructure · AIGC · DDC

0 likes · 13 min read

Alibaba Cloud Infrastructure

May 19, 2023 · Artificial Intelligence

Immersion Liquid Cooling Forum on AI Infrastructure: Key Insights and Industry Perspectives

The May 15 Beijing forum gathered experts from leading tech firms and research institutes to discuss immersion liquid cooling as a vital solution for AI infrastructure's growing compute and thermal challenges, presenting current trends, technical designs, material research, and future sustainable development directions.

AI Infrastructure · Immersion Cooling · data center

0 likes · 7 min read

Baidu Tech Salon

May 11, 2023 · Artificial Intelligence

Inside Baidu’s High‑Performance GPU Cluster: Powering the Next‑Gen AI Models

The article details Baidu's development of a massive high‑performance GPU/IB cluster, its architectural design, the challenges of training trillion‑parameter models, and how the integrated AI stack—spanning hardware, framework, and resource management—overcomes compute, memory, and communication bottlenecks to accelerate large‑model training.

AI Infrastructure · Baidu AI Base · Distributed Training

0 likes · 17 min read

Amap Tech

May 11, 2023 · Artificial Intelligence

A 20‑Year Review of AI Infrastructure Milestones

Over the past two decades, AI infrastructure has evolved from early distributed storage and MapReduce to GPU programming, modern package managers, in‑memory processing, deep‑learning frameworks, parameter servers, AI compilers, synthetic data pipelines, open‑source model hubs, and today’s large‑scale Kubernetes‑based clusters, forming the essential foundation for every breakthrough.

AI Compilers · AI Infrastructure · Big Data

0 likes · 29 min read

Baidu Intelligent Cloud Tech Hub

May 9, 2023 · Artificial Intelligence

How Baidu’s High‑Performance GPU Cluster Powers the Next Generation of Large‑Scale AI Models

This article explains how Baidu built a massive, high‑performance GPU/IB cluster, optimized its architecture and software stack, and integrated AI frameworks and resource management to overcome compute, memory, and communication bottlenecks, enabling efficient training of trillion‑parameter large models.

AI Infrastructure · Distributed Training · GPU clusters

0 likes · 19 min read

DataFunSummit

Apr 27, 2023 · Artificial Intelligence

Baidu's Interoperability Solutions for Federated Learning: Principles, JinKe Alliance, and the Open‑Source HIGHFLIP Protocol

The article presents Baidu's comprehensive approach to federated‑learning interoperability, covering the underlying principles, the JinKe Alliance bottom‑layer solution, the high‑level HIGHFLIP protocol, and a comparative discussion of white‑box, gray‑box, and black‑box integration strategies.

AI Infrastructure · Baidu · Federated Learning

0 likes · 11 min read

Alibaba Cloud Big Data AI Platform

Apr 24, 2023 · Artificial Intelligence

How Alibaba’s TePDist Automates Distributed Deep Learning for Large Models

Alibaba Cloud’s PAI platform unveils TePDist, an HLO‑based automatic distributed deep‑learning system that decouples strategy search from model code, offers client/server architecture, supports SPMD and pipeline parallelism, delivers high performance on GPT, MoE and other models, and is now open‑source.

AI Infrastructure · Distributed Deep Learning · HLO IR

0 likes · 4 min read

DataFunSummit

Apr 20, 2023 · Artificial Intelligence

SenseTime Unveils Multimodal ‘SenseNova’ Large Model System and Its Industry Applications

SenseTime introduced its visual‑centric multimodal large‑model platform SenseNova, detailing model scaling, extensive AI infrastructure, diverse industry deployments such as autonomous driving and generative content, and the challenges of compute efficiency and data acquisition in the race for advanced AI.

AI Infrastructure · Computer Vision · large models

0 likes · 13 min read

DataFunTalk

Mar 31, 2023 · Artificial Intelligence

Estimating the Resource and Cost Requirements for Large Language Model Training and Inference

The article analyses the computational resources, hardware costs, and human investment needed to train and serve large language models such as GPT‑3, discusses practical cost calculations, highlights the challenges faced by Chinese AI teams, and argues for sustained, long‑term funding to achieve meaningful breakthroughs.

AI Infrastructure · China AI · Inference

0 likes · 14 min read

21CTO

Mar 31, 2023 · Artificial Intelligence

How ColossalChat Replicates ChatGPT with a Complete Open‑Source RLHF Pipeline

ColossalChat, an open‑source project built on LLaMA, offers a full RLHF pipeline—including supervised fine‑tuning, reward‑model training, and reinforcement learning—enabling low‑cost, bilingual ChatGPT‑like models with 4‑bit quantized inference, detailed code, dataset, and performance optimizations.

AI Infrastructure · ColossalAI · Model Quantization

0 likes · 12 min read

Tencent Cloud Developer

Mar 22, 2023 · Artificial Intelligence

How AngelPTM Cuts Large Model Training Costs with ZeRO-Cache Optimizations

This article analyzes Tencent's AngelPTM framework, detailing its ZeRO-Cache strategy, unified storage management, multi‑stream async execution, SSD tiered storage, and performance benchmarks that show up to 95% larger model capacity and over 44% speedup compared to community solutions.

AI Infrastructure · GPU Acceleration · Memory Optimization

0 likes · 12 min read

Baidu Geek Talk

Mar 21, 2023 · Artificial Intelligence

Infrastructure Challenges and Solutions for Large‑Scale AI Model Training

The article explains how the massive compute and storage demands of today’s large language models create a “compute wall” and “storage wall,” and describes Baidu Intelligent Cloud’s four‑layer full‑stack infrastructure—combining advanced parallelism techniques, optimized GPU networking, static‑graph compilation, and cost‑model‑driven placement—to train trillion‑parameter models efficiently.

AI Infrastructure · Cost Model · Distributed Training

0 likes · 27 min read

Python Programming Learning Circle

Mar 21, 2023 · Artificial Intelligence

Why Replicating ChatGPT in China Demands Massive AI Infrastructure and Cloud Computing

The article explains that reproducing ChatGPT in China is not just a matter of funding but requires extensive expertise in large‑scale language model training, massive compute resources, optimized cloud infrastructure, and deep AI research, as demonstrated by Alibaba's DAMO Academy efforts.

AI Infrastructure · ChatGPT · Model Training

0 likes · 10 min read

Hulu Beijing

Mar 16, 2023 · Artificial Intelligence

Inside Hulu’s Distributed Training Platform: Architecture, Challenges, and Solutions

This article explores Hulu’s five‑year‑old machine‑learning training platform, detailing its three‑layer architecture, the shift from single‑node to distributed training, and the technical solutions—including parameter servers, Ring AllReduce, Kubernetes, Volcano, and Horovod—that enable scalable AI workloads across GPU, CPU, and storage resources.

AI Infrastructure · Distributed Training · Hulu

0 likes · 13 min read

AntTech

Mar 13, 2023 · Artificial Intelligence

Thoughts on the Next‑Generation AI Infrastructure: Green and Shared Model‑as‑a‑Service

In this conference talk, He Zhengyu of Ant Group outlines the challenges of large‑model AI, proposes a green, shared, model‑centric infrastructure built on foundation models, cloud‑native MLOps, and Model‑as‑a‑Service (MaaS) to lower cost and accelerate AI adoption across industries.

AI Infrastructure · Cloud Native · MLOps

0 likes · 14 min read

Tencent Advertising Technology

Mar 2, 2023 · Artificial Intelligence

Tencent's HunYuan‑NLP 1T Large‑Scale AI Model: Training Techniques, Optimization, and Real‑World Applications

This article details Tencent's development of the 1‑trillion‑parameter HunYuan‑NLP model, covering its MoE architecture, cost‑effective pre‑training strategies, distributed training framework, model compression toolkit, and successful deployment across advertising, gaming, and other Tencent services.

AI Infrastructure · Mixture of Experts · large language model

0 likes · 17 min read

Baidu Intelligent Cloud Tech Hub

Feb 23, 2023 · Artificial Intelligence

How Baidu’s Cloud Infrastructure Tackles the Challenges of Training Massive AI Models

This article explains how Baidu's intelligent cloud overcomes the compute and storage walls of large‑scale model training by combining hardware design, network topology, and software optimizations such as pipeline, tensor, and expert parallelism, cost‑model‑driven placement, and future‑proof AI infrastructure evolution.

AI Infrastructure · Baidu Cloud · Cost Model

0 likes · 28 min read

DataFunSummit

Nov 18, 2022 · Artificial Intelligence

DataFun Summit 2022: AI Foundations, Large‑Scale Model Training, and AI Infrastructure

The DataFun Summit 2022 brings together leading AI researchers and industry experts to discuss deep‑learning frameworks, ultra‑large model training, AI chips, compilers, MLOps, and end‑to‑end AI infrastructure, offering live streaming of six thematic forums and dozens of technical talks.

AI · AI Infrastructure · MLOps

0 likes · 30 min read

Xiaohongshu Tech REDtech

Nov 11, 2022 · Artificial Intelligence

Large-Scale Deep Learning Systems and Their Application at Xiaohongshu (RED)

Xiaohongshu’s in‑house LarC platform powers real‑time, multimodal recommendation, life‑search, and generative‑AI commercial content for its 200 million‑user community by processing billions of daily feedback samples, employing conflict‑free parameter servers, diversified sequence modeling, and large‑scale representation learning to deliver personalized, fresh, and diverse user experiences.

AI Infrastructure · Machine Learning Platform · Multimodal AI

0 likes · 13 min read

Baidu Intelligent Cloud Tech Hub

Jul 13, 2022 · Artificial Intelligence

Unlocking GPU Efficiency: Baidu’s Dual‑Engine Container Virtualization for AI

This article explores Baidu’s cutting‑edge GPU container virtualization architecture, detailing the challenges of low GPU utilization in AI workloads, the dual‑engine (user‑space and kernel‑space) isolation mechanisms, various mixing strategies, performance evaluations, and best‑practice recommendations for maximizing resource efficiency in large‑scale AI deployments.

AI Infrastructure · GPU virtualization · Mixed Scheduling

0 likes · 31 min read

Baidu Geek Talk

Jul 6, 2022 · Artificial Intelligence

Why Training Massive AI Models Demands New Cluster Architectures and Parallelism Strategies

The article examines the industry trend toward ever‑larger AI models, compares their parameter scale to the human brain, outlines the computational and memory challenges of training such models, and details advanced parallelism techniques and Baidu's high‑performance cluster solutions that enable efficient, stable large‑scale model training.

AI Infrastructure · Baidu · Cluster Computing

0 likes · 17 min read

ITPUB

Jun 2, 2022 · Artificial Intelligence

Why AI Needs Modular Infrastructure: Lessons from LLVM and the Future of ML Systems

The article examines how monolithic AI toolchains hinder innovation, recounts the historical fragmentation of software in the 1990s, highlights LLVM's modular architecture as a turning point, and argues for a new, composable AI infrastructure to make machine learning more accessible and scalable.

AI Infrastructure · LLVM · ML compilers

0 likes · 11 min read

DataFunTalk

Apr 17, 2022 · Artificial Intelligence

DeepRec: Alibaba’s Sparse Model Training Engine – Architecture, Features, and Open‑Source Status

DeepRec, developed by Alibaba since 2016, is a specialized sparse‑model training engine that addresses feature elasticity, training performance, and deployment challenges through dynamic elastic features, optimized runtimes, distributed training frameworks, incremental model export, and multi‑level storage, and is now being open‑sourced for broader industry collaboration.

AI Infrastructure · DeepRec · Runtime Optimization

0 likes · 15 min read

DataFunTalk

Mar 16, 2021 · Artificial Intelligence

Weibo Multimodal Content Understanding Service Architecture and GPU Heterogeneous Cluster Solutions

This article details Weibo's large‑scale multimodal content understanding platform, covering its background, data and model heterogeneity challenges, the end‑to‑end workflow, GPU‑heterogeneous cluster design, resource scheduling, performance optimization for distributed training and online inference, and comprehensive monitoring to ensure stable, low‑latency AI services.

AI Infrastructure · Distributed Training · GPU clustering

0 likes · 17 min read

DataFunTalk

Nov 24, 2020 · Artificial Intelligence

Building Next‑Generation Data Intelligence Infrastructure with Knowledge Graphs: From New Infrastructure to Cognitive AI Platforms

This presentation explains how knowledge graphs serve as the foundation for new‑infrastructure initiatives, detailing the evolution of AI from perception to cognition, the role of big‑data centers, DIKW modeling, intelligent data governance, and the construction of a cognitive AI middle‑platform for industry applications.

AI Infrastructure · Artificial Intelligence · Big Data

0 likes · 18 min read

360 Tech Engineering

Sep 14, 2020 · Artificial Intelligence

TensorNet: A Distributed Training Framework Optimized for Large-Scale Sparse Feature Models on TensorFlow

TensorNet is a TensorFlow‑based distributed training framework that tackles the challenges of massive data and billions of sparse parameters in advertising and recommendation systems by enabling near‑infinite sparse feature dimensions, drastically reducing synchronization overhead, and delivering up to 35% inference speed improvements.

AI Infrastructure · Distributed Training · TensorFlow

0 likes · 8 min read

JD Tech Talk

Jun 3, 2020 · Artificial Intelligence

JD Digital Science Unveils Fast Secure Federated Learning Framework and Two Industry‑First Techniques

JD Digital Science introduced its fast secure federated learning framework, highlighted two pioneering technologies—a kernel‑based nonlinear federated learning algorithm and a distributed fast homomorphic encryption method—both accepted at KDD 2020, and discussed their industrial applications, privacy benefits, and regulatory relevance.

AI Infrastructure · Federated Learning · KDD2020

0 likes · 6 min read

Alibaba Cloud Developer

Mar 17, 2020 · Artificial Intelligence

How AI Engineering Powers Modern Enterprises: From Deep Learning to Cloud Infrastructure

This article explores the fundamentals and evolution of artificial intelligence, its applications in perception and decision‑making, the role of deep learning, the importance of compute power and cloud platforms, and how enterprises can strategically adopt AI and data‑driven solutions to drive business value.

AI Infrastructure · machine learning

0 likes · 15 min read

AntTech

Oct 17, 2019 · Artificial Intelligence

From a 30‑Year Coding Journey to AI Infrastructure: Wang Yi’s Story and the Open‑Source Projects SQLFlow and ElasticDL

The article chronicles Wang Yi’s three‑decade programming career, his moves across Tencent, Google, Baidu and Ant Financial, and explains how his open‑source AI infrastructure projects SQLFlow and ElasticDL transform model development for analysts while promoting a culture of code review and practical engineering.

AI Infrastructure · Code review · ElasticDL

0 likes · 12 min read

Alibaba Cloud Developer

Jun 12, 2019 · Artificial Intelligence

How Alibaba’s PAISoar Accelerates Deep Learning: 101× Speedup on 128 GPUs

Alibaba engineers detail the PAISoar distributed training framework, showing how RDMA‑optimized hardware, Ring AllReduce algorithms, and user‑friendly APIs boost deep‑learning models—like the GreenNet CNN—to 101‑fold speedups on 128 GPUs, dramatically reducing training time from days to under a day.

AI Infrastructure · Deep Learning · Distributed Training

0 likes · 17 min read

Didi Tech

Apr 4, 2019 · Artificial Intelligence

DiDi Machine Learning Platform: From Workshop‑Style Production to Cloud‑Native Architecture

Since 2016 DiDi has evolved its machine‑learning platform from isolated, workshop‑style GPU servers to a cloud‑native, Kubernetes‑driven architecture that unifies resource management, introduces custom parameter‑server and serving frameworks, provides autotuning, external SaaS offerings such as Elastic Inference and JianShu, and aims for a 3.0 unified internal‑external AI marketplace.

AI Infrastructure · GPU computing · Kubernetes

0 likes · 19 min read

Alibaba Cloud Developer

Jan 18, 2019 · Artificial Intelligence

How Alibaba’s Open‑Source Euler Framework Powers Large‑Scale Graph Deep Learning

Euler, Alibaba's newly open‑sourced graph deep‑learning framework, combines distributed graph processing with neural network training to handle billions of nodes and edges, supports heterogeneous graphs, offers built‑in algorithms, and has already boosted advertising, fraud detection, and other industry applications.

AI Infrastructure · Euler framework · distributed computing

0 likes · 11 min read

Meituan Technology Team

Oct 25, 2018 · Artificial Intelligence

Deep Learning System Design and Parallel Computing Solutions at Meituan

Meituan built a custom deep‑learning platform that combines data‑parallel and hybrid parallelism across multi‑GPU/cluster hardware, uses coarse‑grained scheduling and Kaldi‑derived acoustic algorithms, and supports fast NLU model hot‑updates, achieving near‑linear GPU scaling and 6–7× speedups over traditional solutions.

AI Infrastructure · NLU · System Architecture

0 likes · 13 min read

Architecture Digest

Aug 15, 2017 · Artificial Intelligence

Why AI Engineers Must Understand Basic Infrastructure: From Big Data to Deep Learning

The article explains why AI engineers need foundational infrastructure knowledge—covering big‑data processing, cloud services, containerization, MapReduce, and deep‑learning platforms—to effectively solve real‑world problems, collaborate with teams, and build scalable, maintainable AI solutions.

AI Infrastructure · Big Data · MapReduce

0 likes · 14 min read

21CTO

Jul 16, 2017 · Artificial Intelligence

Why Every AI Engineer Must Master Infrastructure Basics

In the AI era, engineers need more than cutting‑edge algorithms—they must understand infrastructure, deployment, scalability, and team collaboration, as illustrated by four practical reasons and Google’s architectural breakthroughs that bridge big data, machine learning, and deep learning.

AI Infrastructure · Google · Software Architecture

0 likes · 17 min read
