Tagged articles
59 articles
Tencent Architect
Apr 22, 2026 · Backend Development

Can AI Safely Write Code for High‑Risk Backend Systems? Lessons from Tencent’s CDN

This article analyses how Tencent applied AI coding to its massive, high‑risk CDN LEGO backend, built a Rust‑based Nonstop proxy to probe AI limits, designed a five‑layer Harness Engineering framework with multi‑model adversarial review, identified concrete failure modes, and quantified efficiency gains while redefining developer roles.

AI Coding · AI Safety · Backend Development
0 likes · 20 min read
Tencent Technical Engineering
Apr 21, 2026 · Backend Development

Can AI Safely Write Code for High‑Risk Backend Systems? Lessons from Tencent’s CDN LEGO Project

When AI coding hype focuses on front‑end page generation, the real challenge is whether AI can be trusted to write code for a million‑line, high‑availability CDN backend; this article details Tencent’s systematic exploration, a 20‑day Rust proxy prototype, a five‑layer Harness Engineering framework, and concrete data showing both breakthroughs and remaining risks.

AI Coding · Backend Development · Harness Engineering
0 likes · 25 min read
ByteDance Data Platform
Feb 2, 2026 · Big Data

How StreamShield Powers Production‑Grade Resilience for Apache Flink at Massive Scale

ByteDance’s StreamShield delivers a three‑layer resiliency framework—engine self‑healing, hybrid replication at the cluster level, and chaos‑tested releases—that enables over 70,000 concurrent Flink jobs on 11 million CPU cores to meet strict SLAs with second‑level startup and robust fault tolerance.

Apache Flink · ByteDance · Real‑Time Computing
0 likes · 6 min read
JD Tech
Jul 11, 2025 · Artificial Intelligence

How JD’s PODM‑MI Model Revolutionizes E‑Commerce Search Ranking

JD’s algorithm engineer recounts how his team transformed e‑commerce search by developing the PODM‑MI re‑ranking framework, uncovering a novel “hourglass” bottleneck in generative retrieval, and implementing lightweight solutions that boosted diversity, relevance, and order volume, culminating in a SIGIR publication.

Gaussian modeling · e‑commerce · large-scale systems
0 likes · 8 min read
JD Retail Technology
Jul 11, 2025 · Artificial Intelligence

How JD’s PODM‑MI Model Boosted E‑commerce Search Diversity and Sales

JD’s algorithm engineer describes how a three‑layer PODM‑MI re‑ranking framework, combining Gaussian preference modeling, mutual‑information optimization, and utility‑matrix fusion, overcame the hourglass bottleneck in generative retrieval, improving search diversity and user experience and generating over ten million additional orders.

AI · e‑commerce · large-scale systems
0 likes · 9 min read
Alibaba Cloud Infrastructure
Jun 11, 2025 · Cloud Computing

How Alibaba’s Qi Tian Platform Secures Large-Scale Cloud Networks

This article examines Alibaba Cloud’s Qi Tian integrated operation‑management platform, detailing the challenges of massive cloud network management and the innovative data‑fusion, automated change, intent‑aware monitoring, and multi‑plane self‑healing technologies that enable secure, high‑performance operation at million‑device scale.

AI · Data Management · cloud computing
0 likes · 11 min read
SF Technology Team
May 26, 2025 · Frontend Development

How We Cut LCP by 73% for a Billion‑User Membership Site

Facing the challenges of a billion‑user membership platform, we analyzed front‑end performance metrics, applied resource slimming, lazy loading, network optimizations, and SSR/pre‑rendering techniques, achieving up to 73% LCP reduction and dramatically improving page load speed and user retention.

Resource Optimization · SSR · frontend performance
0 likes · 15 min read
macrozheng
Dec 28, 2024 · Operations

What Makes China’s 12306 Railway Ticketing System So Resilient?

The article examines China’s 12306 railway ticketing platform, tracing its evolution from early Unix‑based reservation software to a massive, real‑time, three‑tier distributed system that handles billions of requests during peak travel periods, highlighting its architectural challenges, high‑concurrency solutions, and unique national centralization.

China · Distributed Systems · high concurrency
0 likes · 9 min read
Alibaba Cloud Developer
Nov 23, 2024 · Cloud Native

How Cloud‑Native Edge Collaboration Won Zhejiang’s Top Science Award

Alibaba Cloud and Zhejiang University’s cloud‑native edge‑computing platform, recognized with Zhejiang’s top science award, tackles massive data processing challenges by enabling efficient, real‑time cloud‑edge collaboration, supporting millions of edge nodes, dynamic workload scheduling, and delivering impactful applications across transportation, finance, healthcare, and major events.

Edge Computing · cloud-edge collaboration · cloud-native
0 likes · 5 min read
Xiaohongshu Tech REDtech
Sep 23, 2024 · Artificial Intelligence

AlignRec: A Joint Training Framework for Aligning Multimodal Representations with Personalized Recommendation

AlignRec is a joint‑training framework that synchronizes multimodal encoders with personalized recommendation models through a staged alignment strategy and three specialized loss functions, preserving both content and ID signals, and achieving state‑of‑the‑art performance on multiple datasets while releasing superior Amazon multimodal features.

AI · Evaluation Metrics · joint training
0 likes · 11 min read
DataFunSummit
Sep 18, 2024 · Artificial Intelligence

Multi‑Scenario Modeling for NetEase Cloud Music Recommendation: Architecture, Challenges, and Results

This article presents NetEase Cloud Music's multi‑scenario recommendation modeling work, covering background, overall system architecture, key modules such as unified and private domain networks, modeling objectives and difficulties, experimental results, future outlook, and a detailed Q&A session.

AI · NetEase Cloud Music · large-scale systems
0 likes · 13 min read
ITPUB
Jul 2, 2024 · Operations

Why Cost‑Cutting Undermines Tech Reliability: Lessons from Massive Internet Outages

The article examines how unrealistic cost‑reduction targets, ignored expert advice, and short‑term resource cuts have repeatedly caused large‑scale outages in major internet platforms, highlighting the labor‑, knowledge‑, and asset‑intensive nature of technical reliability and proposing sustained, expert‑led planning as a remedy.

IT Management · large-scale systems · system reliability
0 likes · 11 min read
DaTaobao Tech
Jun 12, 2024 · Backend Development

Refactoring Large-Scale Video Streaming Engineering: Theory and Practice

The article presents a comprehensive guide to large‑scale video‑streaming system refactoring, combining theory on continuous improvement, architectural evolution, code‑quality criteria, and challenges with a practical roadmap that leverages automation, systematic analysis, engineering safeguards, static‑analysis tools, and design patterns to safely transform legacy monoliths into modular, containerized platforms.

Component Architecture · Software Architecture · code quality
0 likes · 16 min read
DataFunSummit
Feb 11, 2024 · Artificial Intelligence

GPU-Accelerated Model Service and Optimization Practices at Xiaohongshu

This article details Xiaohongshu's end‑to‑end GPU‑based transformation of its recommendation and search models, covering background, model characteristics, training and inference frameworks, system‑level and GPU‑level optimizations, compilation tricks, hardware upgrades, and future directions for large‑scale machine‑learning infrastructure.

GPU · Model Serving · Training
0 likes · 18 min read
Meituan Technology Team
Oct 12, 2023 · Operations

Pattern-Based Reliability Governance for Billion-Scale Traffic Systems

The article analyzes reliability governance challenges in Meituan's billion‑traffic systems, introduces pattern mining as a way to uncover common reliability issues, and presents three concrete case studies—idempotency, dependency, and over‑privilege governance—demonstrating how large‑scale traffic data and environment isolation enable low‑cost, automated reliability solutions.

Idempotency · access control · dependency governance
0 likes · 19 min read
Continuous Delivery 2.0
Sep 13, 2023 · Fundamentals

Overview of Google’s Software Engineering Practices

Google’s software engineering practices—including a unified source repository, Blaze build system, rigorous code review, automated testing, continuous integration, and structured project and personnel management—are detailed, offering insights and comparisons for other organizations seeking to adopt similar high‑scale development methodologies.

Google · build systems · continuous integration
0 likes · 46 min read
JD Retail Technology
Aug 5, 2023 · Operations

JDV Visual Big‑Screen Platform: Architecture, Challenges, and Technical Innovations for JD.com’s 618 Promotion

The article details JDV, JD.com’s internal visual‑big‑screen data platform, describing its architecture, the demanding real‑time, cross‑midnight, and high‑stability requirements during the 618 promotion, the technical challenges faced, and the innovative solutions—including request state control, heartbeat monitoring, video recording, orchestration tools, precise stop handling, and proxy data sources—that ensured reliable large‑scale screen deployment.

Backend Architecture · Data visualization · large-scale systems
0 likes · 17 min read
Xiaohongshu Tech REDtech
Mar 21, 2023 · Artificial Intelligence

From Daily to Minute-Level Updates: Real-Time Recommendation System Enhancements at Xiaohongshu

Xiaohongshu transformed its recommendation pipeline from daily to minute‑level updates by redesigning recall, ranking and feature‑joining components, deploying a base‑plus‑incremental training scheme, migrating Spark to Flink, rewriting services in C++, and optimizing RocksDB, which yielded over 10% longer dwell time, 15% more interactions and roughly 50% higher new‑note efficiency.

Model Serving · Real-time Training · large-scale systems
0 likes · 20 min read
Alimama Tech
Aug 10, 2022 · Artificial Intelligence

Overview of Alimama’s Recent Papers on Online Advertising and Recommendation Systems

Alimama’s technical team presented ten CIKM 2022 papers that introduce novel advertising and recommendation methods—including adaptive domain networks, neural‑metric ANN search, control‑based livestream bidding, graph‑based relevance learning, hierarchical ad exposure, knowledge‑extraction pretraining, traffic forecasting, overfitting analysis, adaptive sparsity, and visual debiasing—each deployed to boost revenue and performance on Alibaba’s platforms.

AI · large-scale systems · recommendation
0 likes · 15 min read
DaTaobao Tech
Jul 18, 2022 · Artificial Intelligence

Walle: An End-to-End, General-Purpose, Large-Scale Device-Cloud Collaborative Machine Learning System

Walle is Alibaba’s first end‑to‑end, general‑purpose, large‑scale device‑cloud collaborative machine‑learning platform that manages billions of mobile devices, provides a full‑stack data and compute pipeline, cuts cloud load by 87%, reduces latency to ~100 ms, and already powers over a trillion daily ML invocations across dozens of Alibaba apps.

MNN · OSDI · device-cloud collaboration
0 likes · 11 min read
Alimama Tech
Jun 1, 2022 · Artificial Intelligence

Advances in Alibaba's Advertising Engine: Serverless Architecture, Recall, Strategy, and Creative Technologies

Alimama’s advertising engine has been transformed into a serverless, cloud‑native platform that unifies runtime, data, and business abstractions, adopts vector‑ and model‑based recall with offline pre‑computed pipelines, implements multi‑stage AI‑driven bidding and auction mechanisms, and leverages large‑scale generative AI for creative assets, thereby accelerating feature rollout, cutting latency, and boosting merchant value.

AI · Serverless · cloud-native
0 likes · 18 min read
Zuoyebang Tech Team
Apr 7, 2022 · Cloud Native

How Fluid Transforms Large‑Scale Data Retrieval on Kubernetes

This article explains how Zuoyebang redesigned its massive data retrieval platform by separating compute and storage with the Fluid project on Kubernetes, achieving minute‑level hundred‑TB distribution, elastic caching, and improved stability for real‑time educational services.

Compute-Storage Separation · Data Retrieval · Fluid
0 likes · 8 min read
Meituan Technology Team
Feb 17, 2022 · Cloud Native

Meituan's Cloud‑Native Cluster Scheduling System: Design, Challenges, and Future Directions

Meituan’s cloud‑native cluster scheduling system, built on a customized Kubernetes engine, unifies multi‑cluster management, improves CPU utilization, reduces costs, and enhances stability by balancing throughput, complexity, and reliability while addressing large‑scale deployment, fault‑tolerance, and dynamic resource allocation challenges.

Cloud Native · Cluster Scheduling · Kubernetes
0 likes · 21 min read
JD Retail Technology
Dec 20, 2021 · Artificial Intelligence

Large-Scale Graph Technology in JD.com E‑commerce: Practice and AI Computing Directions

The article summarizes JD.com Vice President Bao Yongjun's presentation on applying ultra‑large‑scale graph technology to e‑commerce, covering data foundations, recommendation and fraud detection use cases, technical challenges, the Galileo graph engine, and future AI computing development directions such as chips, auto‑learning, application layers, and privacy protection.

e‑commerce · fraud detection · graph computing
0 likes · 7 min read
Top Architect
Dec 11, 2021 · Databases

Scaling Zhihu’s Moneta Service with TiDB: Architecture, Performance, and Lessons Learned

Zhihu’s Moneta service, handling over a trillion rows and billions of daily writes, migrated from MySQL to TiDB, achieving millisecond query latency, high availability, and horizontal scalability, and the article details the architecture, performance metrics, migration challenges, and lessons learned from this large‑scale deployment.

Data Migration · TiDB · database scalability
0 likes · 13 min read
dbaplus Community
Nov 18, 2021 · Databases

When Should You Split Your Database Tables? Practical Guidelines and Real‑World Cases

This article examines the signs that a database table has reached its limits, explains why vertical and horizontal sharding are needed, offers concrete sizing formulas, compares hash, range and consistent‑hash partitioning, and shares large‑scale case studies from Suning, JD, Meituan, Ant Financial and Taobao.

Distributed Transactions · Performance Optimization · horizontal partitioning
0 likes · 14 min read
Alibaba Cloud Developer
Aug 4, 2021 · Cloud Computing

How Partitioned Synchronization Scales Alibaba’s Massive Cloud Clusters

At USENIX ATC 2021, Alibaba Cloud’s Fuxi 2.0 team presented best‑paper‑award‑winning research showing how a partitioned‑synchronization (ParSync) scheduling architecture dramatically reduces conflicts and latency in ultra‑large production clusters, balancing efficiency, quality, and fairness without adding resources.

Cluster Scheduling · Resource Management · cloud computing
0 likes · 17 min read
Baidu Intelligent Testing
Aug 3, 2021 · Operations

Stability Governance and Observability in Baidu Search: From Kepler 1.0 to Kepler 2.0

This article examines how Baidu Search achieves five‑nine‑plus availability by analyzing stability challenges, introducing the Kepler 1.0 observability stack, evolving to Kepler 2.0 with full‑trace collection, custom compression, and practical use‑cases that dramatically improve fault diagnosis and capacity management in a massive micro‑service environment.

Backend · large-scale systems · metrics
0 likes · 18 min read
Baidu Geek Talk
Jun 30, 2021 · Operations

How Baidu Achieves 5‑9+ Availability: Inside Its Stability Engineering and Observability

This article dissects Baidu Search's ultra‑large micro‑service architecture, detailing the challenges of maintaining five‑nine‑plus availability, the diverse failure modes, and the step‑by‑step evolution of its observability stack—from early log‑only analysis to the Kepler 1.0 and Kepler 2.0 tracing, full‑log indexing, custom span‑id generation, and compression techniques that together enable rapid root‑cause diagnosis at massive scale.

Baidu Search · Distributed Tracing · Observability
0 likes · 21 min read
FunTester
Jun 14, 2021 · Industry Insights

How Leading Tech Companies Design Scalable Quality Assurance Systems

The article reviews four in‑depth talks from MTSC 2021 Shanghai, detailing how ZTO, Meituan, ByteDance and Kujiale build large‑scale testing frameworks, event‑tracking QA, advertising system reliability, and multi‑dimensional online inspection to ensure product quality across complex business scenarios.

Performance Testing · Software Testing · industry practices
0 likes · 9 min read
IT Architects Alliance
Jun 8, 2021 · Industry Insights

Inside Toutiao’s 11B Daily‑Active‑User Architecture: Data, Recommendations & Scaling

This article dissects Toutiao’s rapid growth from a small startup to a platform with over 5 billion registered users, detailing its data collection pipeline, user‑modeling techniques, recommendation engine, micro‑service architecture, PaaS infrastructure, storage strategies, and push‑notification system.

Recommendation Engine · Toutiao · data pipeline
0 likes · 9 min read
IT Architects Alliance
Jun 7, 2021 · Industry Insights

How WeChat Scales: Agile Practices and Architecture Behind Billions of Users

The article analyzes WeChat's success by detailing its three‑pronged strategy of precise product timing, agile project management, and robust technical support, and explains how the team applies agile attitudes, modular design, extensible protocols, disaster‑recovery mechanisms, and fine‑grained monitoring to operate a massive, highly available system.

Agile Development · WeChat · industry insights
0 likes · 18 min read
58 Tech
Apr 12, 2021 · Artificial Intelligence

Deep Interest Modeling and Multi‑Channel Recommendation for 58.com Home Page

This article presents the challenges of large‑scale home‑page recommendation at 58.com, describes how behavior‑sequence models such as DIN, DIEN and Transformer are applied and evolved into double‑channel and multi‑channel deep interest architectures, and details offline and online performance optimizations that yielded significant gains in click‑through and conversion rates.

AI · Sequence Modeling · large-scale systems
0 likes · 19 min read
Efficient Ops
Feb 1, 2021 · Operations

How to Detect Anomalous Nodes in Massive Compute Clusters Using Intelligent Ops

This article explains how internet companies can reduce soaring manual operations costs by applying intelligent monitoring techniques—such as pattern recognition and statistical anomaly detection—to automatically identify abnormal nodes among thousands of servers, streamline fault diagnosis, and improve service quality.

Operations · anomaly detection · large-scale systems
0 likes · 4 min read
Continuous Delivery 2.0
Apr 13, 2020 · Operations

Facebook Configuration Management: Practices, Statistics, and Cultural Insights

This article summarizes Facebook's holistic configuration management practices, presenting cultural influences, storage growth, size distribution, update frequency, change magnitude, and author collaboration statistics, while linking to a series of translated articles that explore tools such as Configerator, GateKeeper, and MobileConfig.

Configuration Management · Operations · Tooling
0 likes · 10 min read
Meituan Technology Team
Dec 12, 2019 · Cloud Native

How Meituan Scaled Service Governance with OCTO Mesh: Architecture & Lessons

Meituan’s OCTO Mesh transforms its massive service governance by adopting a Service Mesh architecture with sidecar proxies, a custom control plane, and meta‑server driven routing, addressing multi‑language support, middleware coupling, heterogeneous integration, and scalability challenges while detailing design choices, health‑check strategies, and operational tooling.

Cloud Native · Control Plane · Service Mesh
0 likes · 20 min read
dbaplus Community
Oct 29, 2019 · Cloud Native

How Meituan‑Dianping Scaled Kubernetes to 100k+ Nodes with HULK2.0

Meituan‑Dianping describes its evolution from a custom Docker‑based scheduler (HULK1.0) to an open‑source Kubernetes‑based platform (HULK2.0), detailing architecture, resource‑management strategies, scheduler optimizations, Kubelet enhancements, and online‑cluster tuning that together enable stable, cost‑effective operation of a 100k+ node fleet.

Cloud Native · Cluster Management · Kubernetes
0 likes · 19 min read
21CTO
Jun 3, 2019 · Backend Development

How Didi Engineered a Scalable Large‑Scale Microservice Framework with Go

In this detailed talk, Didi senior engineer Du Huan explains the challenges of building large microservice frameworks, outlines design principles such as the Rule of Least Power, describes the evolution of service frameworks, and shares concrete implementation techniques and business benefits of Didi's Go‑based platform.

Microservices · Reliability · Service Architecture
0 likes · 29 min read
Architecture Digest
May 27, 2019 · Backend Development

Design Practices for Large-Scale Microservice Frameworks

The article presents a comprehensive overview of the challenges, evolution, design principles, and concrete implementation techniques behind building a large‑scale microservice framework at Didi, illustrating how systematic abstraction, reliable I/O handling, and strict interface stability can dramatically improve development efficiency and system robustness.

Go · Service Architecture · framework design
0 likes · 28 min read
Didi Tech
May 23, 2019 · Cloud Native

Design Practices for Large‑Scale Microservice Frameworks

In his Go China talk, senior Didi engineer Du Huan outlined the design and implementation of a large‑scale microservice framework that abstracts I/O, injects tracing via protocol hijacking, optimizes timers, and enforces fail‑fast circuit breaking, delivering faster development, higher stability, seamless upgrades, and a unified operating‑system‑like layer for thousands of services.

Go · Reliability · Service Architecture
0 likes · 29 min read
21CTO
Mar 1, 2019 · Operations

Inside Baidu’s Epic Spring Festival Red‑Envelope Operation: How 100 000 Servers Powered a Nation‑Wide Live Event

This article recounts how Baidu’s engineering, operations, and cloud teams orchestrated a massive, month‑long effort—designing a task force, procuring tens of thousands of servers, optimizing app traffic, and executing a flawless red‑envelope rollout during the 2019 Chinese New Year gala watched by over a billion people.

Baidu · Red Envelope · Spring Festival Gala
0 likes · 29 min read
DataFunTalk
Jan 8, 2019 · Artificial Intelligence

Yoo Video Bottom‑Page Recommendation System: From Zero to One Practice

This article details the end‑to‑end design, recall and ranking techniques, engineering implementation, and future research directions of Tencent's Yoo video bottom‑page recommendation system, illustrating how large‑scale video recommendation is built from business needs to deep learning models.

Embedding · large-scale systems · machine learning
0 likes · 13 min read
Java Backend Technology
Oct 19, 2018 · Operations

How to Ensure Stability for Billion-Request Websites: Proven Strategies

Ensuring stability for sites handling up to 100,000 requests per minute requires a combination of configuration management, feature toggles, phased deployment, robust error handling, comprehensive logging, real-time monitoring, traffic-aware throttling, service degradation, and disaster-recovery tactics, all of which are detailed in this guide.

Deployment · large-scale systems · rate limiting
0 likes · 9 min read
Efficient Ops
Aug 16, 2018 · Operations

How Tencent Automates Massive Storage, CDN, and Network Operations at Scale

This article introduces three Tencent TEG sessions that reveal the automated operation systems behind massive storage and CDN services, billion‑level promotional event guarantees, and intelligent DCI network management, highlighting the challenges, solutions, and speaker expertise.

Automation · CDN · cloud operations
0 likes · 7 min read
ITPUB
May 30, 2018 · Backend Development

How JD.com Engineered Its Own Distributed Storage System for Billions of Files

This article chronicles JD.com's journey from recognizing massive storage demands to designing, building, and evolving a self‑developed distributed storage platform—JFS—that handles small and large files, powers a custom image system, object storage, and future container‑native workloads.

Backend Engineering · JFS · distributed storage
0 likes · 16 min read
MaGe Linux Operations
Apr 18, 2018 · Operations

Essential Skills and Challenges for Large‑Scale Website Operations Engineers

This article outlines what large‑scale website operations entail, describes the full product lifecycle involvement of ops engineers, lists the technical skills and personal qualities required, examines current industry issues, and highlights key technologies such as cluster management, monitoring, fault handling, and automation.

large-scale systems · site reliability
0 likes · 19 min read
21CTO
Aug 9, 2017 · Artificial Intelligence

How Jeff Dean Builds Intelligent Systems with Large‑Scale Deep Learning

Jeff Dean, Google Senior Fellow and head of Google Brain, presents a comprehensive overview of constructing intelligent systems using large‑scale deep learning, covering architectural strategies, scaling techniques, key challenges, and real‑world applications, with insights drawn from his seminal research and industry experience.

Google Brain · Jeff Dean · Neural Networks
0 likes · 2 min read
Alibaba Cloud Developer
Jul 4, 2017 · Cloud Computing

Inside Alibaba Cloud’s Apsara: How Massive Scale and Open‑Source Drive Innovation

Alibaba Cloud’s chief architect Tang Hong recounts the company’s evolution from its 2009 launch, detailing the Apsara operating system’s milestones, massive scaling achievements, virtualization and container innovations, and future directions in lightweight virtualization, high‑speed hardware, and heterogeneous security, illustrating how open‑source collaboration fuels its growth.

Alibaba Cloud · Apsara · Container Technology
0 likes · 16 min read
Architecture Digest
Jun 21, 2017 · Backend Development

Optimizing Billion‑Scale Video Playback: Architecture, Bandwidth, Startup, Buffering, and Success‑Rate Improvements

The talk details Tencent's QQ Space video team’s technical practices for scaling daily video playback from 50 million to over a billion views, covering rapid deployment, bandwidth control, H.265 adoption, startup latency reduction, buffering mitigation, and comprehensive success‑rate monitoring across iOS and Android platforms.

Bandwidth Control · H.265 · Video Streaming
0 likes · 19 min read
21CTO
Feb 26, 2017 · Operations

How YouTube Handles 500M Daily Video Plays: Inside Its Scalable Architecture

This article dissects YouTube's massive infrastructure, detailing the basic platform, web and video services, thumbnail handling, database evolution, CDN usage, and data‑center strategies that enable over half a billion daily video clicks with a surprisingly small engineering team.

CDN · YouTube · database
0 likes · 12 min read
dbaplus Community
Feb 9, 2017 · Operations

Scalable Monitoring for Massive Physical & Container Clusters: JD's MDC Insights

This article shares JD’s large‑scale monitoring system (MDC) design, covering its three‑tier architecture, agent‑based data collection, performance optimizations for SNMP/IPMI, low‑overhead deployment, high‑availability strategies, and practical lessons on scaling monitoring across thousands of physical machines and containers.

JD · Performance Optimization · container monitoring
0 likes · 10 min read
Efficient Ops
Feb 6, 2017 · Operations

Building Billion‑Scale Web Systems That Auto‑Extinguish Failures

The article shares Tencent’s practical fault‑tolerance journey for a billion‑scale activity platform, covering retry strategies, automatic removal of faulty nodes, timeout tuning, business‑level safeguards, service degradation, and decoupling techniques that together reduce manual firefighting and improve system resilience.

Operations · fault tolerance · large-scale systems
0 likes · 25 min read
Meituan Technology Team
Dec 27, 2016 · Backend Development

Ensuring Data Consistency in Meituan Hotel Direct Connection Platform

To keep its rapidly expanding hotel‑direct platform consistent despite unstable supplier interfaces, Meituan evolved from full‑batch pulls to segmented fetching, predictive trigger‑based updates, and finally supplier‑initiated pushes, creating a hybrid pull‑push architecture that ensures low‑latency, reliable product and inventory data.

Backend Development · Data Consistency · MySQL replication
0 likes · 18 min read
Efficient Ops
Aug 28, 2016 · Operations

Six Proven Methods to Optimize Server Capacity and Cut Costs in Large‑Scale Social Networks

Tencent's SNG team shares six practical capacity‑management techniques—performance, density, feature, fragmentation, barrel, and hardware selection methods—that helped reduce operational expenses by over a hundred million yuan annually while supporting hundreds of millions of daily active users.

Cost Optimization · Operations · capacity management
0 likes · 10 min read
Efficient Ops
Aug 25, 2016 · Operations

How Tencent Scales Ops Automation for Hundreds of Thousands of Servers

This article explains how Tencent transformed massive operational pressure from billions of users and half a million servers into an automated, standardized workflow by defining clear goals, building a layered CMDB, integrating Dev and Ops, and implementing a six‑step deployment pipeline that balances efficiency with safety.

CMDB · DevOps · Infrastructure
0 likes · 21 min read