Tagged articles

large-scale systems

59 articles · Page 1 of 1

Apr 22, 2026 · Backend Development

Can AI Safely Write Code for High‑Risk Backend Systems? Lessons from Tencent’s CDN

This article analyses how Tencent applied AI coding to its massive, high‑risk CDN LEGO backend, built a Rust‑based Nonstop proxy to probe AI limits, designed a five‑layer Harness Engineering framework with multi‑model adversarial review, identified concrete failure modes, and quantified efficiency gains while redefining developer roles.

AI coding · AI safety · Backend Development

0 likes · 20 min read


Tencent Technical Engineering

Apr 21, 2026 · Backend Development

Can AI Safely Write Code for High‑Risk Backend Systems? Lessons from Tencent’s CDN LEGO Project

While AI coding hype focuses on front‑end page generation, the real challenge is whether AI can be trusted to write code for a million‑line, high‑availability CDN backend. This article details Tencent’s systematic exploration, a 20‑day Rust proxy prototype, a five‑layer Harness Engineering framework, and concrete data showing both breakthroughs and remaining risks.

AI coding · Backend Development · Harness Engineering

0 likes · 25 min read


ByteDance Data Platform

Feb 2, 2026 · Big Data

How StreamShield Powers Production‑Grade Resilience for Apache Flink at Massive Scale

ByteDance’s StreamShield delivers a three‑layer resiliency framework—engine self‑healing, hybrid replication at the cluster level, and chaos‑tested releases—that enables over 70,000 concurrent Flink jobs on 11 million CPU cores to meet strict SLAs with second‑level startup and robust fault tolerance.

Apache Flink · ByteDance · Real-Time Computing

0 likes · 6 min read


JD Tech

Jul 11, 2025 · Artificial Intelligence

How JD’s PODM‑MI Model Revolutionizes E‑Commerce Search Ranking

JD’s algorithm engineer recounts how his team transformed e‑commerce search by developing the PODM‑MI re‑ranking framework, uncovering a novel “hourglass” bottleneck in generative retrieval, and implementing lightweight solutions that boosted diversity, relevance, and order volume, culminating in a SIGIR publication.

Gaussian modeling · Re‑ranking · e-commerce

0 likes · 8 min read


JD Retail Technology

Jul 11, 2025 · Artificial Intelligence

How JD’s PODM‑MI Model Boosted E‑commerce Search Diversity and Sales

JD’s algorithm engineer describes how a three‑layer PODM‑MI re‑ranking framework, combining Gaussian preference modeling, mutual‑information optimization, and utility‑matrix fusion, overcame the hourglass bottleneck in generative retrieval, dramatically improving search diversity and user experience and generating over ten million additional orders.

AI · Re‑ranking · e-commerce

0 likes · 9 min read


Alibaba Cloud Infrastructure

Jun 11, 2025 · Cloud Computing

How Alibaba’s Qi Tian Platform Secures Large-Scale Cloud Networks

This article examines Alibaba Cloud’s Qi Tian integrated operation‑management platform, detailing the challenges of massive cloud network management and the innovative data‑fusion, automated change, intent‑aware monitoring, and multi‑plane self‑healing technologies that enable secure, high‑performance operation at million‑device scale.

AI · Cloud Computing · Data Management

0 likes · 11 min read


SF Technology Team

May 26, 2025 · Frontend Development

How We Cut LCP by 73% for a Billion‑User Membership Site

Facing the challenges of a billion‑user membership platform, we analyzed front‑end performance metrics, applied resource slimming, lazy loading, network optimizations, and SSR/pre‑rendering techniques, achieving up to 73% LCP reduction and dramatically improving page load speed and user retention.

SSR · frontend performance · large-scale systems

0 likes · 15 min read


macrozheng

Dec 28, 2024 · Operations

What Makes China’s 12306 Railway Ticketing System So Resilient?

The article examines China’s 12306 railway ticketing platform, tracing its evolution from early Unix‑based reservation software to a massive, real‑time, three‑tier distributed system that handles billions of requests during peak travel periods, highlighting its architectural challenges, high‑concurrency solutions, and unique national centralization.

China · High concurrency · distributed systems

0 likes · 9 min read


Alibaba Cloud Developer

Nov 23, 2024 · Cloud Native

How Cloud‑Native Edge Collaboration Won Zhejiang’s Top Science Award

Alibaba Cloud and Zhejiang University’s cloud‑native edge‑computing platform, recognized with Zhejiang’s top science award, tackles massive data processing challenges by enabling efficient, real‑time cloud‑edge collaboration, supporting millions of edge nodes, dynamic workload scheduling, and delivering impactful applications across transportation, finance, healthcare, and major events.

cloud-edge collaboration · cloud-native · edge computing

0 likes · 5 min read


Xiaohongshu Tech REDtech

Sep 23, 2024 · Artificial Intelligence

AlignRec: A Joint Training Framework for Aligning Multimodal Representations with Personalized Recommendation

AlignRec is a joint‑training framework that synchronizes multimodal encoders with personalized recommendation models through a staged alignment strategy and three specialized loss functions, preserving both content and ID signals, and achieving state‑of‑the‑art performance on multiple datasets while releasing superior Amazon multimodal features.

AI · Evaluation Metrics · joint training

0 likes · 11 min read


DataFunSummit

Sep 18, 2024 · Artificial Intelligence

Multi‑Scenario Modeling for NetEase Cloud Music Recommendation: Architecture, Challenges, and Results

This article presents NetEase Cloud Music's multi‑scenario recommendation modeling work, covering background, overall system architecture, key modules such as unified and private domain networks, modeling objectives and difficulties, experimental results, future outlook, and a detailed Q&A session.

AI · NetEase Cloud Music · large-scale systems

0 likes · 13 min read


ITPUB

Jul 2, 2024 · Operations

Why Cost‑Cutting Undermines Tech Reliability: Lessons from Massive Internet Outages

The article examines how unrealistic cost‑reduction targets, ignored expert advice, and short‑term resource cuts have repeatedly caused large‑scale outages in major internet platforms, highlighting the labor‑, knowledge‑, and asset‑intensive nature of technical reliability and proposing sustained, expert‑led planning as a remedy.

IT Management · large-scale systems · system reliability

0 likes · 11 min read


DaTaobao Tech

Jun 12, 2024 · Backend Development

Refactoring Large-Scale Video Streaming Engineering: Theory and Practice

The article presents a comprehensive guide to large‑scale video‑streaming system refactoring, combining theory on continuous improvement, architectural evolution, code‑quality criteria, and challenges with a practical roadmap that leverages automation, systematic analysis, engineering safeguards, static‑analysis tools, and design patterns to safely transform legacy monoliths into modular, containerized platforms.

Component Architecture · code quality · engineering practices

0 likes · 16 min read


DataFunSummit

Feb 11, 2024 · Artificial Intelligence

GPU-Accelerated Model Service and Optimization Practices at Xiaohongshu

This article details Xiaohongshu's end‑to‑end GPU‑based transformation of its recommendation and search models, covering background, model characteristics, training and inference frameworks, system‑level and GPU‑level optimizations, compilation tricks, hardware upgrades, and future directions for large‑scale machine‑learning infrastructure.

GPU · Optimization · Training

0 likes · 18 min read


Meituan Technology Team

Oct 12, 2023 · Operations

Pattern-Based Reliability Governance for Billion-Scale Traffic Systems

The article analyzes reliability governance challenges in Meituan's billion‑traffic systems, introduces pattern mining as a way to uncover common reliability issues, and presents three concrete case studies—idempotency, dependency, and over‑privilege governance—demonstrating how large‑scale traffic data and environment isolation enable low‑cost, automated reliability solutions.

Access Control · Reliability Engineering · dependency governance

0 likes · 19 min read


Continuous Delivery 2.0

Sep 13, 2023 · Fundamentals

Overview of Google’s Software Engineering Practices

Google’s software engineering practices—including a unified source repository, Blaze build system, rigorous code review, automated testing, continuous integration, and structured project and personnel management—are detailed, offering insights and comparisons for other organizations seeking to adopt similar high‑scale development methodologies.

Continuous Integration · Google · build systems

0 likes · 46 min read


JD Retail Technology

Aug 5, 2023 · Operations

JDV Visual Big‑Screen Platform: Architecture, Challenges, and Technical Innovations for JD.com’s 618 Promotion

The article details JDV, JD.com’s internal visual‑big‑screen data platform, describing its architecture, the demanding real‑time, cross‑midnight, and high‑stability requirements during the 618 promotion, the technical challenges faced, and the innovative solutions—including request state control, heartbeat monitoring, video recording, orchestration tools, precise stop handling, and proxy data sources—that ensured reliable large‑scale screen deployment.

Data Visualization · backend-architecture · large-scale systems

0 likes · 17 min read


Xiaohongshu Tech REDtech

Mar 21, 2023 · Artificial Intelligence

From Daily to Minute-Level Updates: Real-Time Recommendation System Enhancements at Xiaohongshu

Xiaohongshu transformed its recommendation pipeline from daily to minute‑level updates by redesigning recall, ranking and feature‑joining components, deploying a base‑plus‑incremental training scheme, migrating Spark to Flink, rewriting services in C++, and optimizing RocksDB, which yielded over 10% longer dwell time, 15% more interactions and roughly 50% higher new‑note efficiency.

Real-time Training · large-scale systems · model serving

0 likes · 20 min read


Alimama Tech

Aug 10, 2022 · Artificial Intelligence

Overview of Alibaba Mama’s Recent Papers on Online Advertising and Recommendation Systems

Alimama’s technical team presented ten CIKM 2022 papers that introduce novel advertising and recommendation methods, including adaptive domain networks, neural‑metric ANN search, control‑based livestream bidding, graph‑based relevance learning, hierarchical ad exposure, knowledge‑extraction pretraining, traffic forecasting, overfitting analysis, adaptive sparsity, and visual debiasing, each deployed to boost revenue and performance on Alibaba’s platforms.

AI · large-scale systems · recommendation

0 likes · 15 min read


DaTaobao Tech

Jul 18, 2022 · Artificial Intelligence

Walle: An End-to-End, General-Purpose, Large-Scale Device-Cloud Collaborative Machine Learning System

Walle is Alibaba’s first end‑to‑end, general‑purpose, large‑scale device‑cloud collaborative machine‑learning platform that manages billions of mobile devices, provides a full‑stack data and compute pipeline, cuts cloud load by 87%, reduces latency to ~100 ms, and already powers over a trillion daily ML invocations across dozens of Alibaba apps.

MNN · OSDI · device-cloud collaboration

0 likes · 11 min read


Alimama Tech

Jun 1, 2022 · Artificial Intelligence

Advances in Alibaba's Advertising Engine: Serverless Architecture, Recall, Strategy, and Creative Technologies

Alimama’s advertising engine has been transformed into a serverless, cloud‑native platform that unifies runtime, data, and business abstractions, adopts vector‑ and model‑based recall with offline pre‑computed pipelines, implements multi‑stage AI‑driven bidding and auction mechanisms, and leverages large‑scale generative AI for creative assets, thereby accelerating feature rollout, cutting latency, and boosting merchant value.

AI · Serverless · Strategy

0 likes · 18 min read


Zuoyebang Tech Team

Apr 7, 2022 · Cloud Native

How Fluid Transforms Large‑Scale Data Retrieval on Kubernetes

This article explains how Zuoyebang redesigned its massive data retrieval platform by separating compute and storage with the Fluid project on Kubernetes, achieving minute‑level hundred‑TB distribution, elastic caching, and improved stability for real‑time educational services.

Compute-Storage Separation · Data Retrieval · Fluid

0 likes · 8 min read


Meituan Technology Team

Feb 17, 2022 · Cloud Native

Meituan's Cloud‑Native Cluster Scheduling System: Design, Challenges, and Future Directions

Meituan’s cloud‑native cluster scheduling system, built on a customized Kubernetes engine, unifies multi‑cluster management, improves CPU utilization, reduces costs, and enhances stability by balancing throughput, complexity, and reliability while addressing large‑scale deployment, fault‑tolerance, and dynamic resource allocation challenges.

Cloud Native · Cluster Scheduling · Meituan

0 likes · 21 min read


JD Retail Technology

Dec 20, 2021 · Artificial Intelligence

Large-Scale Graph Technology in JD.com E‑commerce: Practice and AI Computing Directions

The article summarizes JD.com Vice President Bao Yongjun's presentation on applying ultra‑large‑scale graph technology to e‑commerce, covering data foundations, recommendation and fraud detection use cases, technical challenges, the Galileo graph engine, and future AI computing development directions such as chips, auto‑learning, application layers, and privacy protection.

e-commerce · fraud detection · graph computing

0 likes · 7 min read


Top Architect

Dec 11, 2021 · Databases

Scaling Zhihu’s Moneta Service with TiDB: Architecture, Performance, and Lessons Learned

Zhihu’s Moneta service, handling over a trillion rows and billions of daily writes, migrated from MySQL to TiDB, achieving millisecond query latency, high availability, and horizontal scalability, and the article details the architecture, performance metrics, migration challenges, and lessons learned from this large‑scale deployment.

Data Migration · TiDB · database scalability

0 likes · 13 min read


dbaplus Community

Nov 18, 2021 · Databases

When Should You Split Your Database Tables? Practical Guidelines and Real‑World Cases

This article examines the signs that a database table has reached its limits, explains why vertical and horizontal sharding are needed, offers concrete sizing formulas, compares hash, range and consistent‑hash partitioning, and shares large‑scale case studies from Suning, JD, Meituan, Ant Financial and Taobao.

Horizontal Partitioning · Performance Optimization · distributed transactions

0 likes · 14 min read


Alibaba Cloud Developer

Aug 4, 2021 · Cloud Computing

How Partitioned Synchronization Scales Alibaba’s Massive Cloud Clusters

At USENIX ATC 2021, Alibaba Cloud’s Fuxi 2.0 team presented research that won a best paper award, showing how a partitioned‑synchronization (ParSync) scheduling architecture dramatically reduces conflicts and latency in ultra‑large production clusters, balancing efficiency, quality, and fairness without adding resources.

Cloud Computing · Cluster Scheduling · Resource Management

0 likes · 17 min read


Baidu Intelligent Testing

Aug 3, 2021 · Operations

Stability Governance and Observability in Baidu Search: From Kepler 1.0 to Kepler 2.0

This article examines how Baidu Search achieves five‑nine‑plus availability by analyzing stability challenges, introducing the Kepler 1.0 observability stack, evolving to Kepler 2.0 with full‑trace collection, custom compression, and practical use‑cases that dramatically improve fault diagnosis and capacity management in a massive micro‑service environment.

Metrics · Stability · backend

0 likes · 18 min read


Baidu Geek Talk

Jun 30, 2021 · Operations

How Baidu Achieves 5‑9+ Availability: Inside Its Stability Engineering and Observability

This article dissects Baidu Search's ultra‑large micro‑service architecture, detailing the challenges of maintaining five‑nine‑plus availability, the diverse failure modes, and the step‑by‑step evolution of its observability stack, from early log‑only analysis to Kepler 1.0/Kepler 2.0 tracing, full‑log indexing, custom span‑id generation, and compression techniques that together enable rapid root‑cause diagnosis at massive scale.

Baidu Search · Distributed Tracing · Metrics

0 likes · 21 min read


FunTester

Jun 14, 2021 · Industry Insights

How Leading Tech Companies Design Scalable Quality Assurance Systems

The article reviews four in‑depth talks from MTSC2021 Shanghai, detailing how ZTO, Meituan, ByteDance and Kujiale build large‑scale testing frameworks, event‑tracking QA, advertising system reliability, and multi‑dimensional online inspection to ensure product quality across complex business scenarios.

industry practices · large-scale systems · online stability

0 likes · 9 min read


IT Architects Alliance

Jun 8, 2021 · Industry Insights

Inside Toutiao’s 11B Daily‑Active‑User Architecture: Data, Recommendations & Scaling

This article dissects Toutiao’s rapid growth from a small startup to a platform with over 5 billion registered users, detailing its data collection pipeline, user‑modeling techniques, recommendation engine, micro‑service architecture, PaaS infrastructure, storage strategies, and push‑notification system.

Recommendation Engine · Toutiao · data pipeline

0 likes · 9 min read


IT Architects Alliance

Jun 7, 2021 · Industry Insights

How WeChat Scales: Agile Practices and Architecture Behind Billions of Users

The article analyzes WeChat's success by detailing its three‑pronged strategy of precise product timing, agile project management, and robust technical support, and explains how the team applies agile attitudes, modular design, extensible protocols, disaster‑recovery mechanisms, and fine‑grained monitoring to operate a massive, highly available system.

Monitoring · Scalable Architecture · WeChat

0 likes · 18 min read


58 Tech

Apr 12, 2021 · Artificial Intelligence

Deep Interest Modeling and Multi‑Channel Recommendation for 58.com Home Page

This article presents the challenges of large‑scale home‑page recommendation at 58.com, describes how behavior‑sequence models such as DIN, DIEN and Transformer are applied and evolved into double‑channel and multi‑channel deep interest architectures, and details offline and online performance optimizations that yielded significant gains in click‑through and conversion rates.

AI · large-scale systems · recommendation

0 likes · 19 min read


ITFLY8 Architecture Home

Feb 10, 2021 · Backend Development

How WeChat Scales: Inside Its Agile, Massive‑Scale Architecture

This article reveals the three‑in‑one strategy, agile mindset, modular design, extensibility, gray‑release process, and monitoring techniques that enable WeChat to handle billions of users with high availability and rapid feature delivery.

Monitoring · Scalable Architecture · WeChat

0 likes · 18 min read


Efficient Ops

Feb 1, 2021 · Operations

How to Detect Anomalous Nodes in Massive Compute Clusters Using Intelligent Ops

This article explains how internet companies can reduce soaring manual operations costs by applying intelligent monitoring techniques—such as pattern recognition and statistical anomaly detection—to automatically identify abnormal nodes among thousands of servers, streamline fault diagnosis, and improve service quality.

Anomaly Detection · Monitoring · Operations

0 likes · 4 min read


Continuous Delivery 2.0

Apr 13, 2020 · Operations

Facebook Configuration Management: Practices, Statistics, and Cultural Insights

This article summarizes Facebook's holistic configuration management practices, presenting cultural influences, storage growth, size distribution, update frequency, change magnitude, and author collaboration statistics, while linking to a series of translated articles that explore tools such as Configerator, GateKeeper, and MobileConfig.

Operations · configuration management · large-scale systems

0 likes · 10 min read


Efficient Ops

Mar 25, 2020 · Operations

How JD Logistics Built a 300‑Million‑Metric Real‑Time Monitoring System for 99.999% Uptime

This article details JD Logistics' journey to design and implement a massive, AI‑enhanced monitoring platform that handles over three million metrics across hundreds of warehouses, addressing challenges of scale, network complexity, frequent asset changes, and integrating AIOps for proactive fault detection and resolution.

AIOps · Anomaly Detection · CMDB

0 likes · 23 min read


Meituan Technology Team

Dec 12, 2019 · Cloud Native

How Meituan Scaled Service Governance with OCTO Mesh: Architecture & Lessons

Meituan’s OCTO Mesh transforms its massive service governance by adopting a Service Mesh architecture with sidecar proxies, a custom control plane, and meta‑server driven routing, addressing multi‑language support, middleware coupling, heterogeneous integration, and scalability challenges while detailing design choices, health‑check strategies, and operational tooling.

Cloud Native · Control Plane · Service Governance

0 likes · 20 min read


dbaplus Community

Oct 29, 2019 · Cloud Native

How Meituan‑Dianping Scaled Kubernetes to 100k+ Nodes with HULK2.0

Meituan‑Dianping describes its evolution from a custom Docker‑based scheduler (HULK1.0) to an open‑source Kubernetes‑based platform (HULK2.0), detailing architecture, resource‑management strategies, scheduler optimizations, Kubelet enhancements, and online‑cluster tuning that together enable stable, cost‑effective operation of a 100k+ node fleet.

Cloud Native · Scheduler Optimization · cluster management

0 likes · 19 min read


21CTO

Jun 3, 2019 · Backend Development

How Didi Engineered a Scalable Large‑Scale Microservice Framework with Go

In this detailed talk, Didi senior engineer Du Huan explains the challenges of building large microservice frameworks, outlines design principles such as the Rule of Least Power, describes the evolution of service frameworks, and shares concrete implementation techniques and business benefits of Didi's Go‑based platform.

Microservices · Reliability · Service Architecture

0 likes · 29 min read


Architecture Digest

May 27, 2019 · Backend Development

Design Practices for Large-Scale Microservice Frameworks

The article presents a comprehensive overview of the challenges, evolution, design principles, and concrete implementation techniques behind building a large‑scale microservice framework at Didi, illustrating how systematic abstraction, reliable I/O handling, and strict interface stability can dramatically improve development efficiency and system robustness.

Go · Service Architecture · framework design

0 likes · 28 min read


Didi Tech

May 23, 2019 · Cloud Native

Design Practices for Large‑Scale Microservice Frameworks

In his Go China talk, senior Didi engineer Du Huan outlined the design and implementation of a large‑scale microservice framework that abstracts I/O, injects tracing via protocol hijacking, optimizes timers, and enforces fail‑fast circuit breaking, delivering faster development, higher stability, seamless upgrades, and a unified operating‑system‑like layer for thousands of services.

Go · Reliability · Service Architecture

0 likes · 29 min read

21CTO

Mar 1, 2019 · Operations

Inside Baidu’s Epic Spring Festival Red‑Envelope Operation: How 100 000 Servers Powered a Nation‑Wide Live Event

This article recounts how Baidu’s engineering, operations, and cloud teams orchestrated a massive, month‑long effort—designing a task force, procuring tens of thousands of servers, optimizing app traffic, and executing a flawless red‑envelope rollout during the 2019 Chinese New Year gala watched by over a billion people.

Baidu · Red Envelope · Spring Festival Gala

0 likes · 29 min read


DataFunTalk

Jan 8, 2019 · Artificial Intelligence

Yoo Video Bottom‑Page Recommendation System: From Zero to One Practice

This article details the end‑to‑end design, recall and ranking techniques, engineering implementation, and future research directions of Tencent's Yoo video bottom‑page recommendation system, illustrating how large‑scale video recommendation is built from business needs to deep learning models.

Embedding · large-scale systems · machine learning

0 likes · 13 min read


Java Backend Technology

Oct 19, 2018 · Operations

How to Ensure Stability for Billion-Request Websites: Proven Strategies

Ensuring stability for sites handling up to 100,000 requests per minute requires a combination of configuration management, feature toggles, phased deployment, robust error handling, comprehensive logging, real-time monitoring, traffic-aware throttling, service degradation, and disaster-recovery tactics, all of which are detailed in this guide.

Stability · deployment · large-scale systems

0 likes · 9 min read


Efficient Ops

Aug 16, 2018 · Operations

How Tencent Automates Massive Storage, CDN, and Network Operations at Scale

This article introduces three Tencent TEG sessions that reveal the automated operation systems behind massive storage and CDN services, billion‑level promotional event guarantees, and intelligent DCI network management, highlighting the challenges, solutions, and speaker expertise.

Automation · CDN · Network Management

0 likes · 7 min read


ITPUB

May 30, 2018 · Backend Development

How JD.com Engineered Its Own Distributed Storage System for Billions of Files

This article chronicles JD.com's journey from recognizing massive storage demands to designing, building, and evolving a self‑developed distributed storage platform—JFS—that handles small and large files, powers a custom image system, object storage, and future container‑native workloads.

Backend Engineering · Distributed storage · JFS

0 likes · 16 min read


MaGe Linux Operations

Apr 18, 2018 · Operations

Essential Skills and Challenges for Large‑Scale Website Operations Engineers

This article outlines what large‑scale website operations entail, describes the full product lifecycle involvement of ops engineers, lists the technical skills and personal qualities required, examines current industry issues, and highlights key technologies such as cluster management, monitoring, fault handling, and automation.

large-scale systems · site reliability

0 likes · 19 min read


Alibaba Cloud Developer

Aug 17, 2017 · Artificial Intelligence

From Alibaba’s Ad Algorithms to Deep Learning: A Senior AI Engineer’s Journey

Senior Alibaba algorithm expert Jing Shi shares his six‑year journey from joining the company, the challenges of large‑scale ad click‑through‑rate modeling, the evolution from linear logistic regression to deep learning, and practical advice for aspiring AI engineers and interview candidates.

AI · Advertising · Career Advice

0 likes · 29 min read


21CTO

Aug 9, 2017 · Artificial Intelligence

How Jeff Dean Builds Intelligent Systems with Large‑Scale Deep Learning

Jeff Dean, Google Senior Fellow and head of Google Brain, presents a comprehensive overview of constructing intelligent systems using large‑scale deep learning, covering architectural strategies, scaling techniques, key challenges, and real‑world applications, with insights drawn from his seminal research and industry experience.

Google Brain · Jeff Dean · large-scale systems

0 likes · 2 min read


Alibaba Cloud Developer

Jul 4, 2017 · Cloud Computing

Inside Alibaba Cloud’s Apsara: How Massive Scale and Open‑Source Drive Innovation

Alibaba Cloud’s chief architect Tang Hong recounts the company’s evolution from its 2009 launch, detailing the Apsara operating system’s milestones, massive scaling achievements, virtualization and container innovations, and future directions in lightweight virtualization, high‑speed hardware, and heterogeneous security, illustrating how open‑source collaboration fuels its growth.

Alibaba Cloud · Apsara · Container Technology

0 likes · 16 min read


Architecture Digest

Jun 21, 2017 · Backend Development

Optimizing Billion‑Scale Video Playback: Architecture, Bandwidth, Startup, Buffering, and Success‑Rate Improvements

The talk details Tencent's QQ Space video team’s technical practices for scaling daily video playback from 50 million to over a billion views, covering rapid deployment, bandwidth control, H.265 adoption, startup latency reduction, buffering mitigation, and comprehensive success‑rate monitoring across iOS and Android platforms.

Bandwidth Control · H.265 · large-scale systems

0 likes · 19 min read

21CTO

Feb 26, 2017 · Operations

How YouTube Handles 500M Daily Video Plays: Inside Its Scalable Architecture

This article dissects YouTube's massive infrastructure, detailing the basic platform, web and video services, thumbnail handling, database evolution, CDN usage, and data‑center strategies that enable over half a billion daily video clicks with a surprisingly small engineering team.

CDN · YouTube · database

0 likes · 12 min read

dbaplus Community

Feb 9, 2017 · Operations

Scalable Monitoring for Massive Physical & Container Clusters: JD's MDC Insights

This article describes the design of JD’s large‑scale monitoring system (MDC), covering its three‑tier architecture, agent‑based data collection, performance optimizations for SNMP/IPMI, low‑overhead deployment, high‑availability strategies, and practical lessons on scaling monitoring across thousands of physical machines and containers.

JD · MDC · Monitoring

0 likes · 10 min read

Efficient Ops

Feb 6, 2017 · Operations

Building Billion‑Scale Web Systems That Auto‑Extinguish Failures

The article shares Tencent’s practical fault‑tolerance journey for a billion‑scale activity platform, covering retry strategies, automatic removal of faulty nodes, timeout tuning, business‑level safeguards, service degradation, and decoupling techniques that together reduce manual firefighting and improve system resilience.

Operations · fault tolerance · large-scale systems

0 likes · 25 min read

ITFLY8 Architecture Home

Jan 18, 2017 · Backend Development

Inside the Architecture of the World’s Biggest Websites: From Wikipedia to Youku

This article surveys the technical architectures of major web platforms—including Wikipedia, Facebook, Yahoo Mail, Twitter, Google App Engine, Amazon, and Youku—highlighting their load‑balancing, caching, database, and scaling strategies to reveal how they handle massive traffic and data volumes.

Caching · Databases · backend

0 likes · 10 min read

Meituan Technology Team

Dec 27, 2016 · Backend Development

Ensuring Data Consistency in Meituan Hotel Direct Connection Platform

To keep its rapidly expanding hotel‑direct platform consistent despite unstable supplier interfaces, Meituan evolved from full‑batch pulls to segmented fetching, predictive trigger‑based updates, and finally supplier‑initiated pushes, creating a hybrid pull‑push architecture that ensures low‑latency, reliable product and inventory data.

Backend Development · Data Consistency · MySQL replication

0 likes · 18 min read

Efficient Ops

Aug 28, 2016 · Operations

Six Proven Methods to Optimize Server Capacity and Cut Costs in Large‑Scale Social Networks

Tencent's SNG team shares six practical capacity‑management techniques—performance, density, feature, fragmentation, barrel, and hardware selection methods—that helped reduce operational expenses by over a hundred million yuan annually while supporting hundreds of millions of daily active users.

Operations · capacity management · cloud infrastructure

0 likes · 10 min read

Efficient Ops

Aug 25, 2016 · Operations

How Tencent Scales Ops Automation for Hundreds of Thousands of Servers

This article explains how Tencent turned the massive operational pressure of serving billions of users on half a million servers into an automated, standardized workflow by defining clear goals, building a layered CMDB, integrating Dev and Ops, and implementing a six‑step deployment pipeline that balances efficiency with safety.

CMDB · Operations Automation · Tencent

0 likes · 21 min read
