Tagged articles
4 articles
Baidu Intelligent Cloud Tech Hub
Sep 9, 2025 · Artificial Intelligence

How Baidu Built a 32,000‑Card AI Super‑Compute Cluster and Boosted Efficiency by 50%

This article details how Baidu Intelligent Cloud designed, built, and operates a 32,000‑card hybrid AI compute cluster. It covers challenges in power, cooling, networking, multi‑cluster scheduling, and security, and explains how hardware, software, and operational strategies together improved MFU (model FLOPs utilization) by more than 50% and set industry‑first performance records.

AI Infrastructure · GPU clusters · hybrid cloud
0 likes · 15 min read
HelloTech
Mar 30, 2023 · Operations

Emergency Response Planning and Practice at Hello (哈啰) for Large‑Scale Promotions

Hello’s technical‑risk team built a comprehensive emergency‑response system for large‑scale promotions. It prioritizes core scenarios, runs high‑frequency drills, models fault portraits, and defines metric‑based triggers with clear rollback actions. The system delivered zero incidents during the 930 Big Sale and covered more than 80% of core lines; the team now aims to automate plan selection and execution.

case study · emergency planning · incident response
0 likes · 16 min read
Xiao Lou's Tech Notes
Nov 28, 2022 · Backend Development

Re‑engineering a Scalable Service Health‑Check System for Cloud‑Native Ops

This article details the redesign of a service health‑check component, covering its original limitations, industry alternatives, the chosen centralized active checking approach, architectural modules, concurrency model, scaling mechanisms, gray‑release strategy, and performance optimizations for reliable distributed systems.

Backend Architecture · go concurrency · operational reliability
0 likes · 17 min read
Didi Tech
Dec 26, 2018 · Industry Insights

How Didi Implements Full‑Chain Data Tiered Protection for Reliable Operations

Under growing data‑driven pressure, Didi designed a full‑chain tiered data protection framework that defines classification standards, carries data levels through the entire pipeline, and applies concrete safeguards and tooling to improve resource allocation, backup reliability, and overall data reliability.

Big Data · Data Governance · Didi
0 likes · 7 min read