Tagged articles
186 articles
Page 2 of 2
Baidu Geek Talk
Apr 14, 2025 · Artificial Intelligence

PaddlePaddle Framework 3.0: Five Core Breakthroughs Reshaping Large Model Development

PaddlePaddle Framework 3.0 delivers five breakthroughs—dynamic‑static unified automatic parallelism, integrated training‑inference pipelines, high‑order scientific differentiation, a neural‑network compiler with automatic operator fusion, and streamlined heterogeneous chip adaptation—drastically reducing development effort, boosting training speed, and expanding compatibility for large‑scale AI models.

AI Infrastructure · Distributed Training · Model Inference Optimization
0 likes · 23 min read
Fighter's World
Mar 29, 2025 · Industry Insights

A Year in AI: Key Insights from the Unsupervised Learning & Latent Space Podcast

The podcast recap dissects a year of rapid AI change, highlighting surprisingly fast open‑source model releases, shifting foundation‑model dynamics, the rise of GPT wrappers, over‑hyped agents, undervalued memory, product‑market fit debates, infrastructure opportunities, and lingering mysteries like RL in non‑verifiable domains.

AI Infrastructure · AI trends · GPT wrappers
0 likes · 22 min read
Baobao Algorithm Notes
Mar 13, 2025 · Artificial Intelligence

Why EP Outperforms TP for DeepSeek V3/R1 Inference: Cost, Performance, and Reliability

This article analyzes DeepSeek's EP‑based inference architecture for the V3/R1 models and compares it with TP, detailing how EP reduces memory and compute overhead, allows larger batch sizes, and cuts GPU memory usage, while introducing reliability, scalability, and maintainability challenges for large‑scale deployments.

AI Infrastructure · Expert Parallelism · GPU memory optimization
0 likes · 18 min read
Architects' Tech Alliance
Mar 9, 2025 · Industry Insights

DeepSeek’s AI Ecosystem: From Core Tech to Market Impact

This article provides a comprehensive analysis of DeepSeek, covering its foundational AI research, technology stack, product offerings, and the broader upstream, midstream, and downstream AI industry landscape, including hardware, server, cloud, and market trends.

AI Infrastructure · Artificial Intelligence · DeepSeek
0 likes · 13 min read
dbaplus Community
Feb 23, 2025 · Databases

Why Vector Databases Are Really Just Search Engines in Disguise

The article traces the evolution of embedding technology from a secret weapon of tech giants to a mainstream developer tool, explains the rapid rise and subsequent integration of vector databases into traditional search engines, and argues that vector databases are essentially search engines with added vector capabilities.

AI Infrastructure · RAG · database integration
0 likes · 9 min read
Architects' Tech Alliance
Feb 19, 2025 · Industry Insights

Why DeepSeek One‑Stop AI Machines Are Redefining Private Model Deployment

The surge in demand for private AI deployment has prompted multiple vendors to launch DeepSeek one‑stop machines—integrated hardware solutions that support the full DeepSeek model family, offering higher stability, easier setup, customization, cost savings, and data security across diverse industry scenarios.

AI Infrastructure · AI hardware · DeepSeek
0 likes · 7 min read
Alibaba Cloud Infrastructure
Jan 20, 2025 · Cloud Computing

2024 Alibaba Cloud Infrastructure Network Team: AI‑Scale Network Innovations, Academic Achievements, Open‑Source Contributions and Industry Outreach

The 2024 report of Alibaba Cloud's Infrastructure Network team details AI‑driven network breakthroughs, high‑performance protocol stacks, large‑scale monitoring systems, numerous top‑conference paper acceptances, open‑source ecosystem initiatives, and extensive industry outreach, highlighting the evolving AI infra landscape.

AI Infrastructure · Conference Papers · Data Center Networking
0 likes · 19 min read
DataFunSummit
Dec 30, 2024 · Artificial Intelligence

Colossal-AI: A Scalable Framework for Distributed Training of Large Models

This presentation introduces the challenges of the large‑model era, describes the Colossal‑AI architecture—including N‑dimensional parallelism, heterogeneous storage, and zero‑code experience—shows benchmark results and real‑world use cases, and answers audience questions about its integration with PyTorch and advanced parallel strategies.

AI Infrastructure · Colossal-AI · Heterogeneous Storage
0 likes · 11 min read
AI Cyberspace
Dec 17, 2024 · Artificial Intelligence

Why AWS’s Self‑Designed Chips Are Redefining AI Infrastructure

At AWS re:Invent 2024, Amazon unveiled its self‑designed AI hardware trio—Graviton 4 CPU, Nitro 5 DPU, and Trainium 2 accelerator—explaining the innovation, efficiency, and cost advantages driving the strategy, and detailing how these chips power next‑generation cloud services, ultra‑high‑performance servers, and massive AI super‑computing clusters.

AI Infrastructure · AI hardware · AWS
0 likes · 20 min read
Alibaba Cloud Developer
Nov 28, 2024 · Artificial Intelligence

Mooncake: Open-Source KVCache-Centric Architecture Boosting Large-Model Inference

Mooncake, an open-source KVCache-centric inference architecture co-developed by Alibaba Cloud and Tsinghua University's MADSys lab, dramatically improves large-model throughput and reduces cost by decoupling resources, standardizing cache pooling, and integrating with frameworks like vLLM, sparking broad industry interest.

AI Infrastructure · KVCache · Open source
0 likes · 4 min read
DevOps
Nov 27, 2024 · Artificial Intelligence

Elon Musk’s Colossus Supercomputer: Building 100,000 GPUs in 122 Days and Its Impact on AI Infrastructure

The article analyzes Elon Musk’s Colossus AI supercomputer—its 100,000 NVIDIA H100 GPUs, record‑fast 122‑day construction, vertical‑integration strategy, and the broader implications for U.S. AI infrastructure dominance and China’s competing challenges in funding and chip supply.

AI Infrastructure · AI strategy · Elon Musk
0 likes · 13 min read
Architects' Tech Alliance
Nov 17, 2024 · Industry Insights

What Drives China's AI Server Market? A Deep Dive into Supply Chain, Demand, and Competition

This article provides a comprehensive analysis of China's AI server industry, covering upstream component markets, midstream shipment and revenue trends, downstream application demand, server classifications, major players, and future policy and technology drivers, all backed by recent market data and charts.

AI Infrastructure · AI servers · China
0 likes · 16 min read
Alibaba Cloud Infrastructure
Nov 13, 2024 · Industry Insights

Why GPU Scale‑Up Interconnects Need a New Protocol – Inside UALink and Alibaba’s Alink

The article analyzes the growing demand for high‑bandwidth, low‑latency GPU Scale‑Up interconnects in AI clusters, explains why existing Ethernet and RDMA solutions fall short, and examines the industry‑wide UALink alliance and Alibaba's Alink System as a new open‑ecosystem solution.

AI Infrastructure · Alink System · GPU
0 likes · 12 min read
Baidu Intelligent Cloud Tech Hub
Oct 28, 2024 · Cloud Native

How Baidu Smart Cloud Reinvents Cloud‑Native Infrastructure for the AI‑Native Era

The talk outlines Baidu Smart Cloud's comprehensive cloud‑native redesign—including ultra‑elastic compute, AI‑focused storage, high‑performance networking, AI‑driven operations, and edge‑distributed services—illustrated with automotive and fintech case studies that demonstrate how enterprises can accelerate digital transformation in the AI‑native age.

AI Infrastructure · Data Lake · Edge Computing
0 likes · 12 min read
360 Tech Engineering
Oct 15, 2024 · Artificial Intelligence

Implementation and Optimization of 360 AI Compute Center: Infrastructure, Network, Kubernetes, and Training/Inference Acceleration

The article details the design and deployment of 360's AI Compute Center, covering GPU server selection, high‑performance networking, Kubernetes‑based cluster management, advanced scheduling, training and inference acceleration techniques, and a comprehensive AI development platform with visualization and fault‑tolerance features.

AI Infrastructure · GPU cluster · Inference Acceleration
0 likes · 21 min read
Alibaba Cloud Infrastructure
Oct 12, 2024 · Fundamentals

Alibaba Cloud Server R&D Team Publishes Three Papers on High‑Density PCIe 6.0, 100G‑PAM4 Ethernet, and Immersion‑Cooling PCB Materials at IEEE EPEPS 2024 and PCB West 2024

Alibaba Cloud's server R&D team presented three research papers at IEEE EPEPS 2024 and PCB West 2024 covering high‑density PCIe 6.0 crosstalk optimization, 100G‑PAM4 Ethernet performance under air and immersion cooling, and sustainable low‑cost PCB materials for immersion‑cooled computer systems, highlighting their relevance to AI infrastructure and data‑center design.

AI Infrastructure · High-speed interconnect · Immersion Cooling
0 likes · 10 min read
360 Zhihui Cloud Developer
Oct 11, 2024 · Artificial Intelligence

How 360 Built a Thousand‑GPU AI Supercomputer with Kubernetes and Advanced Scheduling

This article details the design and implementation of 360’s AI Computing Center, covering server selection, network topology, Kubernetes scheduling, training and inference acceleration, and the AI platform’s core, visualization, and fault‑tolerance capabilities for large‑scale AI workloads.

AI Infrastructure · Distributed Training · GPU cluster
0 likes · 22 min read
Baidu Geek Talk
Oct 9, 2024 · Artificial Intelligence

How Baidu’s Baige 4.0 Architecture Redefines AI Compute Efficiency

This article analyzes Baidu's Baige 4.0 AI infrastructure, detailing its four‑layer architecture, XMAN 5.0 hardware, HPN network, BCCL communication library, and AIAK inference upgrades, and explains how these innovations address large‑model training and inference challenges while boosting performance, utilization, and cost efficiency.

AI Infrastructure · Cluster Management · GPU Acceleration
0 likes · 16 min read
Baidu Intelligent Cloud Tech Hub
Sep 29, 2024 · Artificial Intelligence

How Baidu’s Baige 4.0 Redefines AI Infrastructure for Large‑Model Training

The article details Baidu Baige 4.0’s four‑layer AI infrastructure—hardware, cluster components, training‑inference acceleration, and platform tools—highlighting its heterogeneous computing, high‑performance networking, fault‑tolerant communication library, and optimizations that boost large‑model training and inference efficiency.

AI Infrastructure · High‑Performance Networking · heterogeneous computing
0 likes · 17 min read
DataFunSummit
Sep 24, 2024 · Artificial Intelligence

Streaming Data Pipelines and Scaling Laws for Efficient Large‑Model Training

The article discusses the challenges of training ever‑larger AI models on internet‑scale data, critiques traditional batch ETL pipelines, and proposes a streaming data‑flow architecture with dynamic data selection and a shared‑memory/Alluxio middle layer to decouple data processing from model training, improving efficiency and scalability.

AI Infrastructure · Multimodal Data · data pipelines
0 likes · 20 min read
Data Thinking Notes
Sep 19, 2024 · Artificial Intelligence

Why AI Has Only a Seven-Year History—and What AI+ Means for the Future

In this speech, Wang Jian reflects on the evolution of artificial intelligence, arguing that modern AI is fundamentally different from its early concepts, emphasizing the pivotal roles of data, models, and infrastructure, and exploring the transformative impact of AI+, transformers, and cloud platforms on future innovation.

AI Infrastructure · AI+ · Artificial Intelligence
0 likes · 18 min read
Architects' Tech Alliance
Sep 17, 2024 · Industry Insights

Why Intelligent Computing Centers Are the Backbone of China’s AI Boom

The article explains what an Intelligent Computing Center (智算中心) is, analyzes its extensive upstream and downstream industry chain, describes the cutting‑edge AI computing architecture that powers it, forecasts massive growth in AI compute capacity by 2028, and outlines regional deployment strategies and service models such as leasing, data, operation, and talent cultivation.

AI Infrastructure · AI computing · Intelligent Computing Center
0 likes · 11 min read
21CTO
Sep 10, 2024 · Artificial Intelligence

Why AI Has Only a Seven‑Year History and What AI Infrastructure Means for the Future

In this speech, academician Wang Jian reflects on the short, seven‑year history of modern AI, distinguishes AI, AI+ and AI infrastructure, explains how data, models and compute power have become the new foundational layer, and examines the roles of Google, OpenAI, transformers, and cloud services in shaping today’s AI revolution.

@Data · AI Infrastructure · AI+
0 likes · 20 min read
Baobao Algorithm Notes
Jul 24, 2024 · Artificial Intelligence

What Powers Meta’s Llama 3 405B? Inside the Architecture, Scaling Laws, and Massive Training Infrastructure

This article dissects Meta’s Llama 3 405‑billion‑parameter model, covering its dense Transformer design, data‑mixing strategy, two‑stage scaling‑law prediction, 4‑D parallelism, custom hardware clusters, training schedules, post‑training alignment methods, and the extensive evaluation results that benchmark it against leading LLMs.

AI Infrastructure · Distributed Training · Llama 3
0 likes · 56 min read
NewBeeNLP
Jul 24, 2024 · Industry Insights

From Black Iron to Silver: The Evolution of Large Model Infrastructure (2019‑2024)

The article traces the evolution of large‑model training and inference infrastructure from the early “black‑iron” era (2019‑2021) through the “golden” boom (2022‑2023) to the emerging “silver” phase (2024‑), highlighting key research breakthroughs, open‑source frameworks, hardware trends, market dynamics, and practical challenges for engineers entering the field.

AI Infrastructure · Inference · Large Model
0 likes · 22 min read
Architects' Tech Alliance
Jul 15, 2024 · Artificial Intelligence

Why Model-as-a-Service (MaaS) Is Shaping the Future of AI Deployment

This article examines the Model-as-a-Service (MaaS) paradigm, tracing its origins, defining its expanded capabilities for large‑model ecosystems, outlining the full‑stack services it offers, and analyzing current industry adoption, deployment models, and the technical and regulatory challenges that must be addressed for scalable AI rollout.

AI Infrastructure · AI deployment · Cloud AI
0 likes · 11 min read
DataFunTalk
Jul 8, 2024 · Artificial Intelligence

Challenges and Techniques for Distributed Training of Large Language Models

This article discusses the historical background, major challenges such as massive compute and memory demands, and the technical ecosystem—including data parallelism, pipeline parallelism, and optimization strategies like DeepSpeed and 1F1B—to enable efficient distributed training of large language models.

AI Infrastructure · DeepSpeed · Pipeline Parallelism
0 likes · 22 min read
21CTO
Jun 7, 2024 · Artificial Intelligence

Why AI Gateways Are the Next Evolution of API Gateways

AI gateways have emerged as essential infrastructure for modern AI applications, offering specialized security, load balancing, cost management, and observability that go beyond traditional API gateways, and understanding their differences and deployment considerations is crucial for developers and ops teams.

AI Infrastructure · AI gateway · Cost Management
0 likes · 10 min read
Alibaba Cloud Big Data AI Platform
May 24, 2024 · Artificial Intelligence

How DeepRec Extension Boosts Distributed Sparse Model Training with Elasticity and Fault Tolerance

DeepRec Extension enhances large‑scale sparse model training by adding automatic elastic training, resource‑aware scheduling, real‑time monitoring, and efficient fault‑tolerance mechanisms, enabling lower cost, higher throughput, and more reliable distributed training for AI workloads.

AI Infrastructure · DeepRec · Sparse Models
0 likes · 13 min read
Baidu Tech Salon
May 15, 2024 · Artificial Intelligence

Accelerating Large Model Training and Inference with Baidu Baige AIAK‑LLM

Baidu Baige’s AIAK‑LLM suite accelerates large‑model training and inference by boosting Model FLOPS Utilization through techniques such as TP communication overlap, hybrid recompute, zero‑offload, automatic parallel‑strategy search, multi‑chip support, and inference‑specific optimizations, achieving over 60% speedup and seamless Hugging Face integration.

AI Infrastructure · AIAK-LLM · Baidu Baige
0 likes · 26 min read
Baidu Geek Talk
May 15, 2024 · Artificial Intelligence

Accelerating Large Model Training and Inference with Baidu Baige AIAK‑LLM: Challenges, Techniques, and Optimizations

The talk outlines how Baidu’s Baige AIAK‑LLM suite tackles the exploding compute demands of trillion‑parameter models by boosting Model FLOPS Utilization through advanced parallelism, memory‑saving recompute, zero‑offload, adaptive scheduling, and cross‑chip orchestration, delivering 30‑60% training and inference speedups and a unified cloud product.

AI Infrastructure · Baidu · Inference Optimization
0 likes · 25 min read
Baidu Intelligent Cloud Tech Hub
May 15, 2024 · Artificial Intelligence

How Baidu’s AIAK‑LLM Supercharges Large‑Model Training and Inference

The article explains the scaling challenges of ever‑larger LLMs, introduces the MFU performance metric, surveys industry parallelism and memory‑saving techniques, and details Baidu’s AIAK‑LLM suite—including resource, component and acceleration layers—as well as concrete training and inference optimizations that raise MFU by 30‑60% and cut deployment costs.

AI Infrastructure · Large Model · MFU
0 likes · 25 min read
ZhongAn Tech Team
May 13, 2024 · Artificial Intelligence

Weekly Tech Overview: AI Advances, Mobile Game Store, and Industry Insights

This weekly tech roundup covers Microsoft’s upcoming mobile game store, Alibaba Cloud’s Tongyi Qianwen 2.5 AI model, Google DeepMind’s AlphaFold 3 for drug discovery, TikTok’s AI‑content labeling, 神州信息’s AI‑native product, Apple’s on‑device AI chips, expert views on scaling laws, and news on Fei‑Fei Li’s startup, Apple’s China tax, and Buffett’s Apple stake reduction.

AI · AI Infrastructure · Entrepreneurship
0 likes · 7 min read
Architects' Tech Alliance
May 9, 2024 · Artificial Intelligence

AI Servers: Market Opportunities, Architecture, and Future Demand Driven by Generative AI

The article examines how the surge of generative AI (AIGC) is fueling rapid growth in AI server demand, detailing the emerging AIGC ecosystem, server hardware composition, model scaling, heterogeneous computing, training vs. inference workloads, market size forecasts, and the competitive landscape of AI server manufacturers.

AI Infrastructure · AI servers · GPU
0 likes · 15 min read
ITPUB
Apr 27, 2024 · Databases

How Vector Databases Enable High‑Dimensional Stock Quant Analysis

This interview‑style guide explores how vector databases handle massive, high‑dimensional time‑series data for quantitative stock trading, detailing data scaling challenges, selection criteria, and why the research team chose LanceDB over alternatives for efficient, scalable financial analysis.

AI Infrastructure · LanceDB · Quantitative Finance
0 likes · 7 min read
Architects' Tech Alliance
Apr 25, 2024 · Industry Insights

What China’s AI Labs Learned from Scaling Domestic Large‑Model Training

The article analyzes the computational characteristics and system challenges of training large AI models on domestic platforms, examines framework parallelism and future algorithms, and proposes six strategic measures—including scaling compute, improving data management, building a national R&D team, and boosting AI‑chip investment—to accelerate China’s AI leadership.

AI Infrastructure · Model Training · domestic AI
0 likes · 5 min read
DataFunSummit
Mar 31, 2024 · Artificial Intelligence

Challenges and Techniques in Distributed Training of Large Language Models

This article reviews the rapid development of large language models since 2019, outlines the historical background, identifies key challenges such as massive compute demand, memory constraints, and system complexity, and then details distributed training technologies—including data parallelism, pipeline parallelism, and advanced optimization strategies—while also discussing future research directions and answering common questions.

AI Infrastructure · Data Parallelism · DeepSpeed
0 likes · 23 min read
Bilibili Tech
Mar 15, 2024 · Artificial Intelligence

Hardware Resource Estimation and Bottleneck Analysis for Large Language Models (LLMs)

The article analyzes the compute, memory, and communication resources required to train and run large language models, quantifies bottlenecks such as the massive FLOP demand, terabyte‑scale GPU memory, and high‑bandwidth interconnect needs, and evaluates parallelism strategies and bandwidth estimates to guide hardware and software design for scaling LLMs.

AI Infrastructure · Hardware · LLM
0 likes · 53 min read
DataFunSummit
Mar 14, 2024 · Artificial Intelligence

Multi‑Level Efficiency Challenges and Emerging Paradigms for Large AI Models

The article examines how large AI models are moving toward a unified, low‑knowledge‑density paradigm that raises computational efficiency challenges across model, algorithm, framework, and infrastructure layers, while also highlighting NVIDIA's GTC 2024 China AI Day sessions that showcase practical solutions and upcoming training opportunities.

AI Infrastructure · AI conferences · NVIDIA GTC
0 likes · 10 min read
Baidu Geek Talk
Mar 6, 2024 · Artificial Intelligence

How Baidu’s BCCL Boosts Large‑Model Training with Real‑Time Observability and Fault Diagnosis

The article explains why collective communication is critical for distributed large‑model training, outlines the new requirements for system reliability, and introduces Baidu’s Collective Communication Library (BCCL), detailing its enhanced observability, fault‑diagnosis, stability, and performance optimizations that raise effective training time to 98% and bandwidth utilization to 95%.

AI Infrastructure · Distributed Training · Fault Diagnosis
0 likes · 11 min read
Baidu Intelligent Cloud Tech Hub
Mar 1, 2024 · Artificial Intelligence

How Baidu’s BCCL Boosts Distributed AI Training with Real‑Time Observability and Fault Diagnosis

Baidu’s Collective Communication Library (BCCL) enhances large‑model distributed training by improving real‑time bandwidth monitoring, fault diagnosis, network stability, and performance, leveraging RDMA networks and GPU‑specific optimizations to increase effective training time to 98% and bandwidth utilization to 95%.

AI Infrastructure · Distributed Training · Fault Diagnosis
0 likes · 11 min read
JD Retail Technology
Jan 30, 2024 · Artificial Intelligence

Next-Generation Multi‑GPU Synchronous Training Architecture for Large‑Scale Sparse Recommendation Models

The article details JD Retail's evolution from TensorFlow‑based sparse training to a custom high‑performance parameter server and a fully GPU‑accelerated, multi‑node, multi‑card synchronous training framework that leverages GPU‑RDMA, two‑level CPU‑DRAM/GPU‑HBM caching, and pipeline parallelism to overcome storage, I/O, and compute challenges of trillion‑parameter recommendation systems.

AI Infrastructure · GPU Acceleration · Parameter Server
0 likes · 12 min read
DataFunSummit
Jan 22, 2024 · Artificial Intelligence

Improving Efficiency of Large‑Scale AI Model Training, Fine‑tuning, and Deployment with Colossal‑AI

This article introduces Colossal‑AI, an open‑source platform that tackles the challenges of training, fine‑tuning, and deploying massive AI models by leveraging efficient memory management, N‑dimensional parallelism, and high‑performance inference to dramatically reduce cost and improve scalability across thousands of GPUs.

AI Infrastructure · Colossal-AI · Distributed Training
0 likes · 21 min read
Alibaba Cloud Infrastructure
Nov 11, 2023 · Cloud Computing

Alibaba Cloud Executive Discusses IPv6 Deployment, Global Collaboration, and AI‑Driven Network Evolution at the 2023 Wuzhen Internet Forum

In a detailed interview at the 2023 Wuzhen Internet Forum, Alibaba Cloud’s infrastructure lead Cai Dezhong outlines the three‑phase IPv6 rollout, highlights organizational and technical innovations, stresses the need for global cooperation, and explains how IPv6 underpins the next generation AI infrastructure and predictable high‑performance networking.

AI Infrastructure · Global Collaboration · High‑Performance Networking
0 likes · 9 min read
Baidu Intelligent Cloud Tech Hub
Oct 10, 2023 · Artificial Intelligence

How AI Infrastructure Fuels High‑Quality Digital Economy Growth

The article summarizes a Baidu Cloud Intelligence conference speech and whitepaper, explaining how AI foundations and large‑model infrastructure reshape applications, boost enterprise digital transformation, and drive regional economic development, offering a roadmap for high‑quality digital economy advancement.

AI Infrastructure · Digital Economy · enterprise transformation
0 likes · 11 min read
Baidu Intelligent Cloud Tech Hub
Sep 21, 2023 · Artificial Intelligence

How Baidu Cloud Integrates AI and Cloud to Accelerate Autonomous Driving

At the 2023 Baidu Cloud Intelligence Conference, Baidu AI Cloud outlined a comprehensive, four‑layer solution—spanning distributed cloud infrastructure, AI‑focused compute, data compliance, and end‑to‑end toolchains—to address the challenges of electric, intelligent vehicles, large‑model deployment, and regulatory compliance in autonomous driving.

AI Infrastructure · autonomous driving · cloud computing
0 likes · 12 min read
Alibaba Cloud Big Data AI Platform
Sep 19, 2023 · Artificial Intelligence

BladeLLM: Ultra‑Long Context LLM Inference via RaggedAttention & AutoTuner

BladeLLM, Alibaba Cloud’s large‑model inference engine, pushes the limits of LLMs by supporting ultra‑long context lengths up to 70 K tokens, leveraging novel RaggedAttention and a DNN‑based AutoTuner to deliver superior performance, memory efficiency, and low‑latency inference across diverse workloads.

AI Infrastructure · AutoTuner · LLM inference
0 likes · 11 min read
Efficient Ops
Jun 11, 2023 · Artificial Intelligence

Why Network Bandwidth Is the Real Bottleneck for AIGC and How DDC Solves It

The article explains how AIGC models demand massive GPU compute, why network bandwidth and latency become the critical limiting factors, and how the Distributed Disaggregated Chassis (DDC) architecture addresses these challenges with scalable, high‑throughput networking solutions.

AI Infrastructure · AIGC · DDC
0 likes · 13 min read
Alibaba Cloud Infrastructure
May 19, 2023 · Artificial Intelligence

Immersion Liquid Cooling Forum on AI Infrastructure: Key Insights and Industry Perspectives

The May 15 Beijing forum gathered experts from leading tech firms and research institutes to discuss immersion liquid cooling as a vital solution for AI infrastructure's growing compute and thermal challenges, presenting current trends, technical designs, material research, and future sustainable development directions.

AI Infrastructure · Immersion Cooling · data center
7 min read
Baidu Tech Salon
May 11, 2023 · Artificial Intelligence

Inside Baidu’s High‑Performance GPU Cluster: Powering the Next‑Gen AI Models

The article details Baidu's development of a massive high‑performance GPU/IB cluster, its architectural design, the challenges of training trillion‑parameter models, and how the integrated AI stack—spanning hardware, framework, and resource management—overcomes compute, memory, and communication bottlenecks to accelerate large‑model training.

AI Infrastructure · Baidu AI Base · Distributed Training
17 min read
Amap Tech
May 11, 2023 · Artificial Intelligence

A 20‑Year Review of AI Infrastructure Milestones

Over the past two decades, AI infrastructure has evolved from early distributed storage and MapReduce to GPU programming, modern package managers, in‑memory processing, deep‑learning frameworks, parameter servers, AI compilers, synthetic data pipelines, open‑source model hubs, and today’s large‑scale Kubernetes‑based clusters, forming the essential foundation for every breakthrough.

AI Compilers · AI Infrastructure · Big Data
29 min read
Baidu Intelligent Cloud Tech Hub
May 9, 2023 · Artificial Intelligence

How Baidu’s High‑Performance GPU Cluster Powers the Next Generation of Large‑Scale AI Models

This article explains how Baidu built a massive, high‑performance GPU/IB cluster, optimized its architecture and software stack, and integrated AI frameworks and resource management to overcome compute, memory, and communication bottlenecks, enabling efficient training of trillion‑parameter large models.

AI Infrastructure · Distributed Training · GPU clusters
19 min read
DataFunSummit
Apr 27, 2023 · Artificial Intelligence

Baidu's Interoperability Solutions for Federated Learning: Principles, JinKe Alliance, and the Open‑Source HIGHFLIP Protocol

The article presents Baidu's comprehensive approach to federated‑learning interoperability, covering the underlying principles, the JinKe Alliance bottom‑layer solution, the high‑level HIGHFLIP protocol, and a comparative discussion of white‑box, gray‑box, and black‑box integration strategies.

AI Infrastructure · Baidu · Federated Learning
11 min read
Alibaba Cloud Big Data AI Platform
Apr 24, 2023 · Artificial Intelligence

How Alibaba’s TePDist Automates Distributed Deep Learning for Large Models

Alibaba Cloud’s PAI platform unveils TePDist, an HLO‑based automatic distributed deep‑learning system that decouples strategy search from model code, offers client/server architecture, supports SPMD and pipeline parallelism, delivers high performance on GPT, MoE and other models, and is now open‑source.

AI Infrastructure · Distributed Deep Learning · HLO IR
4 min read
DataFunSummit
Apr 20, 2023 · Artificial Intelligence

SenseTime Unveils Multimodal ‘SenseNova’ Large Model System and Its Industry Applications

SenseTime introduced its visual‑centric multimodal large‑model platform SenseNova, detailing model scaling, extensive AI infrastructure, diverse industry deployments such as autonomous driving and generative content, and the challenges of compute efficiency and data acquisition in the race for advanced AI.

AI Infrastructure · Computer Vision · large models
13 min read
DataFunTalk
Mar 31, 2023 · Artificial Intelligence

Estimating the Resource and Cost Requirements for Large Language Model Training and Inference

The article analyses the computational resources, hardware costs, and human investment needed to train and serve large language models such as GPT‑3, discusses practical cost calculations, highlights the challenges faced by Chinese AI teams, and argues for sustained, long‑term funding to achieve meaningful breakthroughs.

AI Infrastructure · China AI · Inference
14 min read
21CTO
Mar 31, 2023 · Artificial Intelligence

How ColossalChat Replicates ChatGPT with a Complete Open‑Source RLHF Pipeline

ColossalChat, an open‑source project built on LLaMA, offers a full RLHF pipeline—including supervised fine‑tuning, reward‑model training, and reinforcement learning—enabling low‑cost, bilingual ChatGPT‑like models with 4‑bit quantized inference, detailed code, dataset, and performance optimizations.

AI Infrastructure · ColossalAI · Model Quantization
12 min read
Tencent Cloud Developer
Mar 22, 2023 · Artificial Intelligence

How AngelPTM Cuts Large Model Training Costs with ZeRO-Cache Optimizations

This article analyzes Tencent's AngelPTM framework, detailing its ZeRO-Cache strategy, unified storage management, multi‑stream async execution, SSD tiered storage, and performance benchmarks that show up to 95% larger model capacity and over 44% speedup compared to community solutions.

AI Infrastructure · GPU Acceleration · Memory Optimization
12 min read
Baidu Geek Talk
Mar 21, 2023 · Artificial Intelligence

Infrastructure Challenges and Solutions for Large‑Scale AI Model Training

The article explains how the massive compute and storage demands of today’s large language models create a “compute wall” and “storage wall,” and describes Baidu Intelligent Cloud’s four‑layer full‑stack infrastructure—combining advanced parallelism techniques, optimized GPU networking, static‑graph compilation, and cost‑model‑driven placement—to train trillion‑parameter models efficiently.

AI Infrastructure · Cost Model · Distributed Training
27 min read
Python Programming Learning Circle
Mar 21, 2023 · Artificial Intelligence

Why Replicating ChatGPT in China Demands Massive AI Infrastructure and Cloud Computing

The article explains that reproducing ChatGPT in China is not just a matter of funding but requires extensive expertise in large‑scale language model training, massive compute resources, optimized cloud infrastructure, and deep AI research, as demonstrated by Alibaba's DAMO Academy efforts.

AI Infrastructure · ChatGPT · Model Training
10 min read
Hulu Beijing
Mar 16, 2023 · Artificial Intelligence

Inside Hulu’s Distributed Training Platform: Architecture, Challenges, and Solutions

This article explores Hulu’s five‑year‑old machine‑learning training platform, detailing its three‑layer architecture, the shift from single‑node to distributed training, and the technical solutions—including parameter servers, Ring AllReduce, Kubernetes, Volcano, and Horovod—that enable scalable AI workloads across GPU, CPU, and storage resources.

AI Infrastructure · Distributed Training · Hulu
13 min read
Tencent Advertising Technology
Mar 2, 2023 · Artificial Intelligence

Tencent's HunYuan‑NLP 1T Large‑Scale AI Model: Training Techniques, Optimization, and Real‑World Applications

This article details Tencent's development of the 1‑trillion‑parameter HunYuan‑NLP model, covering its MoE architecture, cost‑effective pre‑training strategies, distributed training framework, model compression toolkit, and successful deployment across advertising, gaming, and other Tencent services.

AI Infrastructure · Mixture of Experts · large language model
17 min read
Baidu Intelligent Cloud Tech Hub
Feb 23, 2023 · Artificial Intelligence

How Baidu’s Cloud Infrastructure Tackles the Challenges of Training Massive AI Models

This article explains how Baidu's intelligent cloud overcomes the compute and storage walls of large‑scale model training by combining hardware design, network topology, and software optimizations such as pipeline, tensor, and expert parallelism, cost‑model‑driven placement, and future‑proof AI infrastructure evolution.

AI Infrastructure · Baidu Cloud · Cost Model
28 min read
Xiaohongshu Tech REDtech
Nov 11, 2022 · Artificial Intelligence

Large-Scale Deep Learning Systems and Their Application at Xiaohongshu (RED)

Xiaohongshu’s in‑house LarC platform powers real‑time, multimodal recommendation, life‑search, and generative‑AI commercial content for its 200 million‑user community by processing billions of daily feedback samples, employing conflict‑free parameter servers, diversified sequence modeling, and large‑scale representation learning to deliver personalized, fresh, and diverse user experiences.

AI Infrastructure · Machine Learning Platform · Multimodal AI
13 min read
Baidu Intelligent Cloud Tech Hub
Jul 13, 2022 · Artificial Intelligence

Unlocking GPU Efficiency: Baidu’s Dual‑Engine Container Virtualization for AI

This article explores Baidu’s cutting‑edge GPU container virtualization architecture, detailing the challenges of low GPU utilization in AI workloads, the dual‑engine (user‑space and kernel‑space) isolation mechanisms, various mixing strategies, performance evaluations, and best‑practice recommendations for maximizing resource efficiency in large‑scale AI deployments.

AI Infrastructure · GPU virtualization · Mixed Scheduling
31 min read
Baidu Geek Talk
Jul 6, 2022 · Artificial Intelligence

Why Training Massive AI Models Demands New Cluster Architectures and Parallelism Strategies

The article examines the industry trend toward ever‑larger AI models, compares their parameter scale to the human brain, outlines the computational and memory challenges of training such models, and details advanced parallelism techniques and Baidu's high‑performance cluster solutions that enable efficient, stable large‑scale model training.

AI Infrastructure · Baidu · Cluster Computing
17 min read
ITPUB
Jun 2, 2022 · Artificial Intelligence

Why AI Needs Modular Infrastructure: Lessons from LLVM and the Future of ML Systems

The article examines how monolithic AI toolchains hinder innovation, recounts the historical fragmentation of software in the 1990s, highlights LLVM's modular architecture as a turning point, and argues for a new, composable AI infrastructure to make machine learning more accessible and scalable.

AI Infrastructure · LLVM · ML compilers
11 min read
DataFunTalk
Apr 17, 2022 · Artificial Intelligence

DeepRec: Alibaba’s Sparse Model Training Engine – Architecture, Features, and Open‑Source Status

DeepRec, developed since 2016 by Alibaba, is a specialized sparse‑model training engine that addresses feature elasticity, training performance, and deployment challenges through dynamic elastic features, optimized runtimes, distributed training frameworks, incremental model export, and multi‑level storage, and is now being open‑sourced for broader industry collaboration.

AI Infrastructure · DeepRec · Runtime Optimization
15 min read
DataFunTalk
Mar 16, 2021 · Artificial Intelligence

Weibo Multimodal Content Understanding Service Architecture and GPU Heterogeneous Cluster Solutions

This article details Weibo's large‑scale multimodal content understanding platform, covering its background, data and model heterogeneity challenges, the end‑to‑end workflow, GPU‑heterogeneous cluster design, resource scheduling, performance optimization for distributed training and online inference, and comprehensive monitoring to ensure stable, low‑latency AI services.

AI Infrastructure · Distributed Training · GPU clustering
17 min read
DataFunTalk
Nov 24, 2020 · Artificial Intelligence

Building Next‑Generation Data Intelligence Infrastructure with Knowledge Graphs: From New Infrastructure to Cognitive AI Platforms

This presentation explains how knowledge graphs serve as the foundation for new‑infrastructure initiatives, detailing the evolution of AI from perception to cognition, the role of big‑data centers, DIKW modeling, intelligent data governance, and the construction of a cognitive AI middle‑platform for industry applications.

AI Infrastructure · Artificial Intelligence · Big Data
18 min read
360 Tech Engineering
Sep 14, 2020 · Artificial Intelligence

TensorNet: A Distributed Training Framework Optimized for Large-Scale Sparse Feature Models on TensorFlow

TensorNet is a TensorFlow‑based distributed training framework that tackles the challenges of massive data and billions of sparse parameters in advertising and recommendation systems by enabling near‑infinite sparse feature dimensions, drastically reducing synchronization overhead, and delivering up to 35% inference speed improvements.

AI Infrastructure · Distributed Training · TensorFlow
8 min read
JD Tech Talk
Jun 3, 2020 · Artificial Intelligence

JD Digital Science Unveils Fast Secure Federated Learning Framework and Two Industry‑First Techniques

JD Digital Science introduced its fast secure federated learning framework, highlighted two pioneering technologies—a kernel‑based nonlinear federated learning algorithm and a distributed fast homomorphic encryption method—both accepted at KDD 2020, and discussed their industrial applications, privacy benefits, and regulatory relevance.

AI Infrastructure · Federated Learning · KDD2020
6 min read
Alibaba Cloud Developer
Mar 17, 2020 · Artificial Intelligence

How AI Engineering Powers Modern Enterprises: From Deep Learning to Cloud Infrastructure

This article explores the fundamentals and evolution of artificial intelligence, its applications in perception and decision‑making, the role of deep learning, the importance of compute power and cloud platforms, and how enterprises can strategically adopt AI and data‑driven solutions to drive business value.

AI Infrastructure · machine learning
15 min read
AntTech
Oct 17, 2019 · Artificial Intelligence

From a 30‑Year Coding Journey to AI Infrastructure: Wang Yi’s Story and the Open‑Source Projects SQLFlow and ElasticDL

The article chronicles Wang Yi’s three‑decade programming career, his moves across Tencent, Google, Baidu and Ant Financial, and explains how his open‑source AI infrastructure projects SQLFlow and ElasticDL transform model development for analysts while promoting a culture of code review and practical engineering.

AI Infrastructure · Code review · ElasticDL
12 min read
Alibaba Cloud Developer
Jun 12, 2019 · Artificial Intelligence

How Alibaba’s PAISoar Accelerates Deep Learning: 101× Speedup on 128 GPUs

Alibaba engineers detail the PAISoar distributed training framework, showing how RDMA‑optimized hardware, Ring AllReduce algorithms, and user‑friendly APIs boost deep‑learning models—like the GreenNet CNN—to 101‑fold speedups on 128 GPUs, dramatically reducing training time from days to under a day.

AI Infrastructure · Deep Learning · Distributed Training
17 min read
Didi Tech
Apr 4, 2019 · Artificial Intelligence

DiDi Machine Learning Platform: From Workshop‑Style Production to Cloud‑Native Architecture

Since 2016 DiDi has evolved its machine‑learning platform from isolated, workshop‑style GPU servers to a cloud‑native, Kubernetes‑driven architecture that unifies resource management, introduces custom parameter‑server and serving frameworks, provides autotuning, external SaaS offerings such as Elastic Inference and JianShu, and aims for a 3.0 unified internal‑external AI marketplace.

AI Infrastructure · GPU computing · Kubernetes
19 min read
Alibaba Cloud Developer
Jan 18, 2019 · Artificial Intelligence

How Alibaba’s Open‑Source Euler Framework Powers Large‑Scale Graph Deep Learning

Euler, Alibaba's newly open‑sourced graph deep‑learning framework, combines distributed graph processing with neural network training to handle billions of nodes and edges, supports heterogeneous graphs, offers built‑in algorithms, and has already boosted advertising, fraud detection, and other industry applications.

AI Infrastructure · Euler framework · distributed computing
11 min read
Meituan Technology Team
Oct 25, 2018 · Artificial Intelligence

Deep Learning System Design and Parallel Computing Solutions at Meituan

Meituan built a custom deep‑learning platform that combines data‑parallel and hybrid parallelism across multi‑GPU/cluster hardware, uses coarse‑grained scheduling and Kaldi‑derived acoustic algorithms, and supports fast NLU model hot‑updates, achieving near‑linear GPU scaling and 6–7× speedups over traditional solutions.

AI Infrastructure · NLU · System Architecture
13 min read
Architecture Digest
Aug 15, 2017 · Artificial Intelligence

Why AI Engineers Must Understand Basic Infrastructure: From Big Data to Deep Learning

The article explains why AI engineers need foundational infrastructure knowledge—covering big‑data processing, cloud services, containerization, MapReduce, and deep‑learning platforms—to effectively solve real‑world problems, collaborate with teams, and build scalable, maintainable AI solutions.

AI Infrastructure · Big Data · MapReduce
14 min read
21CTO
Jul 16, 2017 · Artificial Intelligence

Why Every AI Engineer Must Master Infrastructure Basics

In the AI era, engineers need more than cutting‑edge algorithms—they must understand infrastructure, deployment, scalability, and team collaboration, as illustrated by four practical reasons and Google’s architectural breakthroughs that bridge big data, machine learning, and deep learning.

AI Infrastructure · Google · Software Architecture
17 min read