Tagged articles

AI Infrastructure

216 articles · Page 3 of 3
Baidu Intelligent Cloud Tech Hub
Jul 13, 2022 · Artificial Intelligence

Unlocking GPU Efficiency: Baidu’s Dual‑Engine Container Virtualization for AI

This article explores Baidu’s cutting‑edge GPU container virtualization architecture, detailing the challenges of low GPU utilization in AI workloads, the dual‑engine (user‑space and kernel‑space) isolation mechanisms, various mixing strategies, performance evaluations, and best‑practice recommendations for maximizing resource efficiency in large‑scale AI deployments.

AI Infrastructure · GPU virtualization · Mixed Scheduling
0 likes · 31 min read
Baidu Geek Talk
Jul 6, 2022 · Artificial Intelligence

Why Training Massive AI Models Demands New Cluster Architectures and Parallelism Strategies

The article examines the industry trend toward ever‑larger AI models, compares their parameter scale to the human brain, outlines the computational and memory challenges of training such models, and details advanced parallelism techniques and Baidu's high‑performance cluster solutions that enable efficient, stable large‑scale model training.

AI Infrastructure · Baidu · Cluster Computing
0 likes · 17 min read
ITPUB
Jun 2, 2022 · Artificial Intelligence

Why AI Needs Modular Infrastructure: Lessons from LLVM and the Future of ML Systems

The article examines how monolithic AI toolchains hinder innovation, recounts the historical fragmentation of software in the 1990s, highlights LLVM's modular architecture as a turning point, and argues for a new, composable AI infrastructure to make machine learning more accessible and scalable.

AI Infrastructure · LLVM · ML compilers
0 likes · 11 min read
DataFunTalk
Apr 17, 2022 · Artificial Intelligence

DeepRec: Alibaba’s Sparse Model Training Engine – Architecture, Features, and Open‑Source Status

DeepRec, developed since 2016 by Alibaba, is a specialized sparse‑model training engine that addresses feature elasticity, training performance, and deployment challenges through dynamic elastic features, optimized runtimes, distributed training frameworks, incremental model export, and multi‑level storage, and is now being open‑sourced for broader industry collaboration.

AI Infrastructure · DeepRec · runtime optimization
0 likes · 15 min read
DataFunTalk
Mar 16, 2021 · Artificial Intelligence

Weibo Multimodal Content Understanding Service Architecture and GPU Heterogeneous Cluster Solutions

This article details Weibo's large‑scale multimodal content understanding platform, covering its background, data and model heterogeneity challenges, the end‑to‑end workflow, GPU‑heterogeneous cluster design, resource scheduling, performance optimization for distributed training and online inference, and comprehensive monitoring to ensure stable, low‑latency AI services.

AI Infrastructure · GPU clustering · Multimodal AI
0 likes · 17 min read
DataFunTalk
Nov 24, 2020 · Artificial Intelligence

Building Next‑Generation Data Intelligence Infrastructure with Knowledge Graphs: From New Infrastructure to Cognitive AI Platforms

This presentation explains how knowledge graphs serve as the foundation for new‑infrastructure initiatives, detailing the evolution of AI from perception to cognition, the role of big‑data centers, DIKW modeling, intelligent data governance, and the construction of a cognitive AI middle‑platform for industry applications.

AI Infrastructure · Big Data · Data Governance
0 likes · 18 min read
360 Tech Engineering
Sep 14, 2020 · Artificial Intelligence

TensorNet: A Distributed Training Framework Optimized for Large-Scale Sparse Feature Models on TensorFlow

TensorNet is a TensorFlow‑based distributed training framework that tackles the challenges of massive data and billions of sparse parameters in advertising and recommendation systems by enabling near‑infinite sparse feature dimensions, drastically reducing synchronization overhead, and delivering up to 35% inference speed improvements.

AI Infrastructure · TensorFlow · distributed training
0 likes · 8 min read
JD Tech Talk
Jun 3, 2020 · Artificial Intelligence

JD Digital Science Unveils Fast Secure Federated Learning Framework and Two Industry‑First Techniques

JD Digital Science introduced its fast secure federated learning framework, highlighted two pioneering technologies (a kernel‑based nonlinear federated learning algorithm and a distributed fast homomorphic encryption method), both accepted at KDD 2020, and discussed their industrial applications, privacy benefits, and regulatory relevance.

AI Infrastructure · KDD2020 · Kernel Methods
0 likes · 6 min read
Alibaba Cloud Developer
Mar 17, 2020 · Artificial Intelligence

How AI Engineering Powers Modern Enterprises: From Deep Learning to Cloud Infrastructure

This article explores the fundamentals and evolution of artificial intelligence, its applications in perception and decision‑making, the role of deep learning, the importance of compute power and cloud platforms, and how enterprises can strategically adopt AI and data‑driven solutions to drive business value.

AI Infrastructure · machine learning
0 likes · 15 min read
AntTech
Oct 17, 2019 · Artificial Intelligence

From a 30‑Year Coding Journey to AI Infrastructure: Wang Yi’s Story and the Open‑Source Projects SQLFlow and ElasticDL

The article chronicles Wang Yi’s three‑decade programming career, his moves across Tencent, Google, Baidu and Ant Financial, and explains how his open‑source AI infrastructure projects SQLFlow and ElasticDL transform model development for analysts while promoting a culture of code review and practical engineering.

AI Infrastructure · ElasticDL · SQLFlow
0 likes · 12 min read
Alibaba Cloud Developer
Jun 12, 2019 · Artificial Intelligence

How Alibaba’s PAISoar Accelerates Deep Learning: 101× Speedup on 128 GPUs

Alibaba engineers detail the PAISoar distributed training framework, showing how RDMA‑optimized hardware, Ring AllReduce algorithms, and user‑friendly APIs give deep‑learning models such as the GreenNet CNN a 101‑fold speedup on 128 GPUs, cutting training time from days to under a day.

AI Infrastructure · GPU Acceleration · PAISoar
0 likes · 17 min read
Didi Tech
Apr 4, 2019 · Artificial Intelligence

DiDi Machine Learning Platform: From Workshop‑Style Production to Cloud‑Native Architecture

Since 2016, DiDi has evolved its machine‑learning platform from isolated, workshop‑style GPU servers into a cloud‑native, Kubernetes‑driven architecture. The platform unifies resource management, introduces custom parameter‑server and serving frameworks, provides autotuning, and offers external SaaS products such as Elastic Inference and JianShu; its 3.0 goal is a unified internal‑external AI marketplace.

AI Infrastructure · GPU computing · Platform Engineering
0 likes · 19 min read
Alibaba Cloud Developer
Jan 18, 2019 · Artificial Intelligence

How Alibaba’s Open‑Source Euler Framework Powers Large‑Scale Graph Deep Learning

Euler, Alibaba's newly open‑sourced graph deep‑learning framework, combines distributed graph processing with neural network training to handle billions of nodes and edges, supports heterogeneous graphs, offers built‑in algorithms, and has already boosted advertising, fraud detection, and other industry applications.

AI Infrastructure · Distributed Computing · Euler framework
0 likes · 11 min read
Meituan Technology Team
Oct 25, 2018 · Artificial Intelligence

Deep Learning System Design and Parallel Computing Solutions at Meituan

Meituan built a custom deep‑learning platform that combines data‑parallel and hybrid parallelism across multi‑GPU/cluster hardware, uses coarse‑grained scheduling and Kaldi‑derived acoustic algorithms, and supports fast NLU model hot‑updates, achieving near‑linear GPU scaling and 6–7× speedups over traditional solutions.

AI Infrastructure · NLU · acoustic modeling
0 likes · 13 min read
Architecture Digest
Aug 15, 2017 · Artificial Intelligence

Why AI Engineers Must Understand Basic Infrastructure: From Big Data to Deep Learning

The article explains why AI engineers need foundational infrastructure knowledge—covering big‑data processing, cloud services, containerization, MapReduce, and deep‑learning platforms—to effectively solve real‑world problems, collaborate with teams, and build scalable, maintainable AI solutions.

AI Infrastructure · Big Data · Cloud Computing
0 likes · 14 min read
21CTO
Jul 16, 2017 · Artificial Intelligence

Why Every AI Engineer Must Master Infrastructure Basics

In the AI era, engineers need more than cutting‑edge algorithms: they must also understand infrastructure, deployment, scalability, and team collaboration, as illustrated by four practical reasons and by Google’s architectural breakthroughs that bridge big data, machine learning, and deep learning.

AI Infrastructure · Cloud Computing · Google
0 likes · 17 min read