Tagged articles
59 articles
Raymond Ops
Jan 17, 2026 · Operations

Scaling Ansible: From Manual Deployments to Managing Thousands of Servers

This article walks through the challenges of manual server deployment, explains why Ansible is ideal for large‑scale environments, and provides a complete reference architecture, optimized configuration, dynamic inventory scripts, modular playbooks, performance tuning, monitoring, security hardening, rollback mechanisms, cost analysis, and practical lessons learned for automating deployments across thousands of machines.

Ansible, Automation, Deployment
20 min read
MaGe Linux Operations
Aug 25, 2025 · Operations

How Ansible Turns Manual Deployments into 10x Faster Automation for 1000+ Servers

This article walks through the author's real‑world experience automating deployments across a thousand‑plus server cluster with Ansible, covering tool selection, architecture design, performance tuning, security practices, rollback mechanisms, cost‑benefit analysis, and common pitfalls, demonstrating how automation can boost efficiency tenfold.

Ansible, large scale, performance
18 min read
Alibaba Cloud Big Data AI Platform
Aug 7, 2025 · Cloud Native

How GitOps Powers Cloud‑Native Large‑Scale Cluster Management

This article details the challenges Alibaba Cloud’s intelligent operations team faced in managing thousands of cloud‑native clusters and the solutions it adopted, covering the multi‑layered operation architecture, GitOps workflow, infrastructure‑as‑code integration, and the role of AI‑driven intelligent operations in large‑scale environments.

GitOps, Kubernetes, cloud-native
23 min read
Xiaohongshu Tech REDtech
May 22, 2025 · Artificial Intelligence

Scalable Overload-Aware Graph-Based Index Construction for 10‑Billion‑Scale Vector Similarity Search (SOGAIC)

The paper introduces SOGAIC, a scalable overload‑aware graph‑based index construction system for billion‑scale vector similarity search that uses adaptive overlapping partitioning and load‑balanced distributed scheduling to cut construction time by 47.3% while maintaining high recall.

ANN, Distributed Scheduling, graph index
13 min read
Alibaba Cloud Infrastructure
Nov 22, 2024 · Cloud Native

Large‑Scale Cloud‑Edge Collaborative Technology Based on Cloud‑Native Wins Zhejiang Province Science and Technology Progress Award

Alibaba Cloud, together with Zhejiang University, Alipay and Xieyun Technology, received the Zhejiang Province Science and Technology Progress First Prize for their cloud‑native large‑scale cloud‑edge collaborative platform, which addresses edge resource constraints, real‑time computing, and massive node management, and has been widely applied across multiple industries.

CNCF, Container, Real‑Time Computing
5 min read
Full-Stack DevOps & Kubernetes
Sep 9, 2024 · Cloud Native

Optimizing Network and Storage for 5,000‑Node Kubernetes Clusters

This guide outlines practical strategies for designing and optimizing network and storage in Kubernetes clusters of over 5,000 nodes, covering overlay networks, IP pool segmentation, bandwidth allocation, load balancing, security policies, distributed storage options, performance tuning, and reliable backup solutions.

Cloud Native, IP Pool, Kubernetes
9 min read
Baidu Geek Talk
Nov 8, 2023 · Databases

BES Engineering Practices for Large‑Scale Vector Database Scenarios

At QCon 2023, Baidu’s BES team detailed how their cloud‑native Elasticsearch service has been engineered for large‑scale vector search, describing architecture, C++ plugin integration, memory‑saving storage tricks, HNSW/IVF optimizations, filter strategies, and real‑world multimodal video and LLM knowledge‑base deployments.

AI, BES, Elasticsearch
16 min read
DataFunSummit
Nov 2, 2023 · Databases

Understanding TiKV: Features, Architecture, and Large‑Scale Operational Challenges

This article introduces the distributed transactional KV store TiKV, explains its role as TiDB’s storage engine, details its multi‑layered architecture and Raft‑based consistency model, and discusses the performance and resource challenges encountered at massive data scales along with the engineering solutions implemented to address them.

Performance Optimization, Raft, TiKV
14 min read
AntTech
Oct 30, 2023 · Artificial Intelligence

AntM2C: A Large-Scale Multi‑Scenario Multi‑Modal CTR Prediction Dataset from Alipay

AntM2C is a publicly released, billion‑sample click‑through‑rate (CTR) dataset covering five distinct Alipay business scenarios, providing both ID and rich multi‑modal (text and image) features to enable comprehensive evaluation of multi‑scenario, cold‑start, and multi‑modal CTR models at industrial scale.

CTR, large scale, multi-modal
14 min read
JD Tech
Oct 10, 2023 · Operations

Technical Case Study of JDV Visual Dashboard Platform for the 618 Promotion

This article details how JDV, JD.com’s internal visual dashboard platform, tackled the massive data‑intensive 618 promotion by implementing real‑time updates, cross‑midnight count stops, request‑state control, heartbeat monitoring, proxy data sources, and a suite of developer tools to ensure stability, performance, and rapid feature delivery.

Data Platform, Real-Time, large scale
18 min read
DataFunTalk
Sep 22, 2023 · Big Data

Design and Practice of Baidu's Tape Library Storage Architecture Based on the Aries Cloud Storage System

This article presents a comprehensive overview of Baidu Intelligent Cloud's tape‑library solution, detailing tape and tape‑library fundamentals, the Aries cloud storage stack, data and access models, the end‑to‑end data flow, key architectural design choices, implementation details, and a real‑world case study demonstrating large‑scale cold‑data storage, backup, and retrieval performance.

aries, cold data, data archiving
28 min read
DataFunSummit
Jun 21, 2023 · Databases

Forum on Building Ultra‑Scale Storage Systems: Insights from Baidu, Meituan, Ant Group, Xiaomi and Baidu Cloud

The forum gathers senior experts from Baidu, Meituan, Ant Group, Xiaomi and Baidu Cloud to share practical experiences and future trends on constructing ultra‑large‑scale file, block, KV and NoSQL storage systems, focusing on low‑cost, high‑performance solutions and architectural challenges.

Distributed Systems, KV storage, block storage
8 min read
Tencent Cloud Developer
May 10, 2023 · Cloud Native

Tencent's Large‑Scale Cloud‑Native Migration: Challenges and Solutions

In October 2022 Tencent finished migrating its flagship services, including QQ, WeChat, and Honor of Kings, to a cloud‑native architecture spanning over 50 million CPU cores. The migration had to solve millisecond‑level upgrades, stateful in‑place refresh, massive cross‑region scaling, and heterogeneous hardware; the TKEx platform addressed these with sidecar upgrades, three‑container patterns, the Global Scaler Operator, machine‑type abstraction, and Clusternet‑based application‑centric orchestration, raising CPU utilization to 65% and establishing China’s largest cloud‑native practice.

Container Upgrade, Tencent, cloud-native
19 min read
Continuous Delivery 2.0
May 8, 2023 · Operations

Google’s Monolithic Code Repository: Scale, Architecture, and Practices

Google’s monolithic repository, managed by the proprietary Piper system and accessed via the cloud‑based CitC client, stores over a billion files and billions of lines of code, supports tens of thousands of engineers, and relies on trunk‑based development, extensive tooling, and strict security to enable large‑scale, efficient software development.

DevOps, Google, Monorepo
17 min read
Alimama Tech
Feb 8, 2023 · Artificial Intelligence

Evolution of Recall Indexes in Alibaba Advertising: From Quantization to Graph-based HNSW

Alibaba’s advertising pipeline progressed from low‑dimensional quantization partitions to hierarchical tree indexes, then to graph‑based HNSW structures—including multi‑category, multi‑level graphs and a BlazeOp‑driven scoring service—dramatically boosting recall efficiency, scalability and maintainability while meeting strict latency constraints.

HNSW, large scale, recall
13 min read
Huawei Cloud Developer Alliance
Dec 12, 2022 · Cloud Native

How Karmada Powers Multi‑Cloud, Multi‑Cluster Production at Cloud Native Days China 2022

The Karmada community's Cloud Native Days China 2022 session in Nanjing gathered over 30 enterprises and developers to share multi‑cloud, multi‑cluster production practices, large‑scale testing results, and real‑world implementations from Huawei Cloud, vivo, Hurricane Engine, China Mobile, DaoCloud, and Zhejiang University, highlighting Karmada's scalability and ecosystem growth.

Karmada, Kubernetes, Multi-Cluster
9 min read
Java High-Performance Architecture
Oct 11, 2022 · Operations

How Meituan Optimized Kafka for Massive Scale: Reducing Latency and Managing Clusters

This article details Meituan's real‑world challenges with a 15,000‑node Kafka deployment and explains the application‑layer and system‑layer optimizations—such as disk balancing, migration pipeline acceleration, fetcher isolation, RAID acceleration, cgroup isolation, and an SSD‑based cache—that together dramatically cut read/write latency and simplify large‑scale cluster management.

Cluster Management, Meituan, Streaming
23 min read
ITPUB
Sep 24, 2022 · Operations

How Alibaba Cloud Log Service Scales Billion‑Task Scheduling: Design and Practice

This article explains how Alibaba Cloud Log Service implements a billion‑scale task scheduling framework for its observability platform, covering background, task types, design goals, architecture, key design points, and practical examples such as aggregation jobs and various scheduling scenarios.

Log Service, large scale, master-worker
22 min read
WeChat Backend Team
Aug 5, 2022 · Artificial Intelligence

How WeChat’s Ekko Achieves Ultra‑Low‑Latency Model Updates for Billion‑User Recommendations

At the 16th OSDI conference, Tencent’s WeChat team presented the award‑winning Ekko system, an ultra‑low‑latency model‑update solution for massive recommendation workloads that dramatically speeds up updates, supports models at over‑trillion scale, and has already boosted user engagement across billions of daily users.

Low latency, Model Update, Recommendation Systems
5 min read
DataFunSummit
Jul 26, 2022 · Artificial Intelligence

Multi-step Reasoning over Large-scale Knowledge Graphs: Query2Box and SMORE Framework

This talk presents recent advances in multi-step reasoning over large-scale, noisy knowledge graphs, introducing the Query2Box model that uses box embeddings for complex queries and the SMORE framework that enables efficient multi-hop inference on massive graphs through scalable query generation, embedding computation, and training pipelines.

AI, Knowledge Graph, Query2Box
14 min read
Volcano Engine Developer Services
Jul 12, 2022 · Cloud Native

How ByteDance Scales Cloud‑Native Infrastructure with Hybrid K8s Scheduling

ByteDance’s cloud‑native ecosystem combines a multi‑layered architecture, dynamic resource over‑provisioning control, hybrid online‑offline scheduling, and federated cluster management to boost container utilization from 23% to 63%, reduce costs by 40%, and support massive events like the 2021 Spring Festival Gala.

Cloud Native, hybrid deployment, large scale
16 min read
Efficient Ops
Jun 23, 2022 · Cloud Native

How Vivo Scales Kubernetes: Automated Multi‑Cluster Management with a Custom Operator

Vivo’s rapid migration to Kubernetes across multiple data centers required a secure, efficient, and reliable way to manage thousands of nodes, leading them to develop a custom k8s‑operator that streamlines cluster deployment, CI testing, declarative APIs, and automated repair for large‑scale cloud‑native environments.

Cloud Native, Cluster Automation, DevOps
3 min read
Alipay Experience Technology
Jun 14, 2022 · Frontend Development

How Ant Group’s Serverless Front‑End Platform Boosts Large‑Scale Development

This talk explains Ant Group’s serverless‑based front‑end platform, detailing the current front‑end architecture, challenges of large‑scale financial services, and three core efficiency ideas—document‑as‑code, function‑level deployment, and integrated plugin capabilities—to streamline development and ensure safe production.

BFF, Serverless, document-as-code
9 min read
Laravel Tech Community
Jun 6, 2022 · Artificial Intelligence

What an Open‑Source Twitter Algorithm Would Look Like: Architecture, Data Model, and Engineering Challenges

This article examines the practical aspects of open‑sourcing Twitter’s recommendation algorithm, covering the platform’s data model, timeline views, ranking features, a TypeScript pseudocode illustration, and the major engineering challenges of scale, real‑time processing, reliability, and security.

Twitter, algorithm, large scale
14 min read
Alimama Tech
Feb 9, 2022 · Artificial Intelligence

Online Allocation Strategies for Guaranteed Display Advertising: Modeling, Distributed Solving, and Adaptive Pacing

The paper presents a guarantee‑based, distributed allocation framework for Alibaba’s off‑site brand contract ads that extends the SHALE algorithm with effect‑driven objectives and explicit over‑allocation constraints, solves dual variables via coordinate descent, and employs adaptive probability‑based pacing to meet volume guarantees while significantly boosting average CTR.

Online Optimization, allocation, guaranteed display
11 min read
DataFunTalk
Dec 29, 2021 · Artificial Intelligence

Entity Alignment in Product Knowledge Graphs: Techniques and Applications

This article presents a comprehensive overview of building and applying product knowledge graphs for e‑commerce, covering background, recent advances in graph neural network‑based entity alignment, online prediction pipelines, data construction, evaluation metrics, attribute extraction, and future research directions.

Graph Neural Network, Knowledge Graph, attribute extraction
23 min read
DataFunTalk
Dec 13, 2021 · Artificial Intelligence

Dual Vector Foil (DVF): Decoupled Index and Model for Large‑Scale Retrieval

The article introduces the Dual Vector Foil (DVF) algorithm system, which decouples index construction from model training to enable lightweight, high‑precision large‑scale recall using arbitrary complex models, and details its two‑stage and one‑stage solutions, graph‑based retrieval implementation, performance optimizations, and experimental results.

Deep Learning, Recommendation Systems, algorithm
28 min read
DataFunTalk
Oct 9, 2021 · Databases

Building and Optimizing a Large‑Scale Graph Platform for Financial Risk Control at Du Xiaoman Financial

This article describes how Du Xiaoman Financial designed, built, and continuously optimized a massive graph platform—including data governance, graph learning, query performance, data import, and online deployment—to improve credit risk assessment using billions of nodes and edges, and shares practical lessons on graph databases, distributed training, and real‑time inference.

DGL, JanusGraph, financial analytics
19 min read
DataFunTalk
Jan 20, 2021 · Artificial Intelligence

Techniques for Reducing the Computational Complexity of Large-Scale Graph Neural Networks

This article presents an overview of graph neural networks, explains their computational framework, analyzes space and time complexities, and proposes ten practical strategies—including edge avoidance, dimensionality reduction, selective iteration, memory baking, distillation, partitioning, sparse computation, routing, and cross-sample feature sharing—to significantly lower the cost of large‑scale GNN processing.

Computational Complexity, Deep Learning, large scale
14 min read
JD Cloud Developers
Dec 16, 2020 · Backend Development

How JD Logistics Scaled a Billion‑Level Async Messaging System for 11.11

JD Logistics architect Chen Haolong detailed the design, scalability strategies, and operational practices behind the billion‑level asynchronous messaging system that powered JD.com’s massive 11.11 shopping festival, revealing how the platform handled unprecedented traffic and ensured reliability.

JD Logistics, Operations, async messaging
2 min read
Alibaba Cloud Native
Sep 24, 2020 · Cloud Native

Tackling Ultra‑Large‑Scale Service Mesh Deployment: Lessons from Alibaba

This article details Alibaba's practical experience deploying Service Mesh at massive scale, covering architectural evolution, key challenges, traffic interception, hot‑upgrade mechanisms, performance optimizations, and operational tooling that together enable reliable, low‑overhead service communication in a cloud‑native environment.

Cloud Native, Envoy, Istio
22 min read
Full-Stack Internet Architecture
Sep 15, 2020 · Cloud Native

Challenges and Solutions for Large-Scale Service Mesh Deployment at Alibaba

Alibaba’s large‑scale Service Mesh deployment faces challenges such as smooth technology evolution, business‑technical balance, technical debt, massive sidecar operations, and scaling, which it addresses through staged architecture evolution, traffic‑transparent interception, hot upgrades, and open‑source contributions to Istio and Envoy.

Cloud Native, Envoy, Istio
19 min read
DataFunTalk
Aug 20, 2020 · Artificial Intelligence

Weibo Recommendation Algorithm Practice and Machine Learning Platform Evolution

This article shares Weibo’s experience in building and evolving its recommendation algorithms, covering the recommendation scenario, machine learning workflow, feature engineering, model upgrades, large‑scale challenges, deployment via the Weiflow platform, and the capabilities of its machine‑learning infrastructure.

Online Learning, Weibo, feature engineering
14 min read
MaGe Linux Operations
Jul 28, 2020 · Big Data

How Leading Chinese Companies Scale Elasticsearch for Billions of Orders

This article surveys how major Chinese tech firms such as JD.com, Ctrip, Didi, and 58.com deploy and evolve Elasticsearch clusters to handle massive order data, log analysis, real‑time monitoring, and security tasks, detailing architecture choices, shard strategies, multi‑cluster designs, and performance optimizations.

Big Data, Elasticsearch, Order Management
11 min read
DataFunTalk
Dec 4, 2019 · Artificial Intelligence

Joint Optimization of Tree‑based Index and Deep Model (JTM) for Large‑Scale Recommendation

This article presents JTM, a joint optimization framework that simultaneously learns a tree‑based index and a deep scoring model to overcome the limitations of traditional recommendation pipelines, demonstrating significant recall improvements on Amazon Books and Alibaba UserBehavior datasets through hierarchical user interest modeling and efficient tree learning.

Deep Learning, joint optimization, large scale
19 min read
Alibaba Cloud Developer
Nov 13, 2019 · Cloud Native

Scaling Kubernetes at Ant Financial: The Kube‑on‑Kube Architecture Explained

Ant Financial’s article details how its large‑scale Kubernetes management system—built on a meta‑cluster, end‑state operators, and a Kube‑on‑Kube design—ensures reliable creation, upgrade, and self‑healing of thousands of nodes, while providing gray‑scale changes, risk assessment, and fault‑tolerant automation.

Cloud Native, Cluster Management, Kubernetes
15 min read
Alibaba Cloud Native
Oct 27, 2019 · Cloud Native

How Ant Financial Scales Kubernetes: Inside Their Kube‑on‑Kube Management System

This article explains how Ant Financial designs and operates a large‑scale, highly available Kubernetes management platform—using end‑state driven operators, custom CRDs, self‑healing mechanisms, and risk‑mitigation strategies—to reliably run thousands of nodes and dozens of business clusters in production.

Cluster Management, Kube-on-Kube, Kubernetes
15 min read
DataFunTalk
Oct 16, 2019 · Artificial Intelligence

Deep Learning Practices for Personalized Recommendation at Meitu: From Recall to Ranking

This article details Meitu's large‑scale personalized recommendation pipeline, describing the business scenario, challenges of massive data, latency and long‑tail distribution, and the application of deep learning techniques such as Item2vec, YouTubeNet, dual‑tower DNN, NFM, NFwFM and multi‑task learning to improve click‑through rate, conversion and user engagement.

Deep Learning, Recommendation Systems, large scale
20 min read
DataFunTalk
Aug 16, 2019 · Artificial Intelligence

Tree‑based Deep Match (TDM): Design, Implementation, and Applications in Large‑Scale Retrieval

This article presents a comprehensive overview of the Tree‑based Deep Match (TDM) algorithm, describing the evolution of retrieval technology, the limitations of traditional Match‑Rank pipelines, the design of a one‑stage tree‑indexed deep matching model, its training methodology, performance gains on public datasets, and its deployment in Alibaba’s advertising and e‑commerce platforms.

Recommendation Systems, TDM, large scale
23 min read
AntTech
Aug 15, 2019 · Cloud Native

Design and Implementation of Ant Financial’s Large‑Scale Kubernetes Cluster Management System

This article explains how Ant Financial designs a highly reliable, end‑state‑driven Kubernetes management platform that handles lifecycle operations, node self‑healing, and risk‑controlled changes for clusters with tens of thousands of nodes, using operators, custom resources, and a meta‑cluster architecture.

Cluster Management, Kubernetes, large scale
9 min read
Efficient Ops
Jun 11, 2019 · Operations

What Powers WeChat’s Billion‑User Scale? Inside Its DevOps Journey

WeChat, China’s top social app with over a billion users, has applied DevOps practices to dramatically improve development efficiency, code quality, and accelerate the feedback cycle from requirements to delivery, while confronting real‑world challenges in tooling, processes, reliability, and automation costs.

Continuous Delivery, Operations, WeChat
3 min read
dbaplus Community
Aug 14, 2018 · Operations

How Ant Financial Automates Tens of Thousands of Kubernetes Nodes with Operators

Ant Financial tackles the challenge of managing dozens of Kubernetes clusters and over a hundred thousand worker nodes by employing a meta‑cluster with Kube‑on‑Kube and Node Operators, enabling automated lifecycle management, scaling, upgrades, and fault recovery for both master components and worker nodes.

Automation, Cluster Management, Kubernetes
12 min read
Meitu Technology
Jul 12, 2018 · Artificial Intelligence

DeepHash: Large-Scale Multimedia Content Analysis and Retrieval for Short Video Platforms

DeepHash is Meitu’s large‑scale short‑video analysis and retrieval system that converts deep‑learned visual features into compact binary hash codes via a MobileNet‑based CNN and triplet‑loss training, enabling fast, robust similarity search across billions of videos with sub‑second latency and minimal storage.

feature extraction, large scale, multimedia retrieval
15 min read
MaGe Linux Operations
Jun 29, 2018 · Operations

Essential Skills and Roadmap for Large‑Scale Website Operations Engineers

This comprehensive guide explains what large‑scale website operations entail, outlines the product lifecycle involvement of ops engineers, details the technical and personal skills required, and discusses current challenges, future prospects, and key technologies such as cluster management, monitoring, fault handling, and automation.

Automation, DevOps, Infrastructure
18 min read
58 Tech
Jun 1, 2018 · Backend Development

Design and Implementation of Real-Time Indexing in 58.com’s ESearch Search Engine

This article explains how 58.com’s in‑house C++ search kernel ESearch was architected to provide second‑level real‑time indexing, high‑concurrency low‑latency querying, flexible ranking models, and efficient storage structures for billions of daily queries across massive classified data.

Backend, C++, large scale
13 min read
Efficient Ops
Feb 5, 2018 · Operations

How WeChat Scales Massive Real-Time Monitoring: Design & Practices

This article details the architecture and practical techniques behind WeChat's large‑scale monitoring system, covering lightweight data collection, classification of real‑time, non‑real‑time and user‑specific metrics, anomaly detection algorithms, automated configuration, and high‑performance storage solutions for billions of events per minute.

Operations, Real-Time, data collection
14 min read
ITPUB
Nov 14, 2017 · Operations

How Alibaba’s Dragonfly P2P System Supercharges Large‑Scale File and Container Image Distribution

Alibaba’s Dragonfly (蜻蜓) is a self‑developed P2P file distribution platform that dramatically speeds up massive file and container image delivery, reduces bandwidth consumption, supports intelligent compression and flow control, and has become a core infrastructure component powering billions of transactions during major events like Double 11.

File Distribution, Infrastructure, P2P
20 min read
Alibaba Cloud Developer
Nov 14, 2017 · Operations

How Alibaba’s Dragonfly P2P System Supercharges File and Image Distribution

Alibaba’s Dragonfly (蜻蜓) leverages P2P networking, intelligent compression, and flow control to dramatically accelerate large‑scale file and container image distribution, reducing bandwidth usage by over 99%, achieving up to 57× speedup, and supporting tens of thousands of concurrent hosts during peak events like Double 11.

File Distribution, P2P, container images
19 min read
Efficient Ops
May 19, 2017 · Operations

Mastering Continuous Feedback in Massive Operations: DevOps Strategies from Tencent

This article shares insights from Tencent’s SNG operations leader on building effective continuous feedback loops for massive‑scale services, covering monitoring, alerting, operational metrics, multi‑dimensional analysis, and practical DevOps techniques to improve reliability, availability, and automated self‑healing.

Continuous Feedback, DevOps, large scale
26 min read
Ctrip Technology
Jan 5, 2017 · Artificial Intelligence

Design and Implementation of a Billion‑Scale Generalized Recommendation System at Tencent Cloud

This article explains how Tencent built a billion‑scale, generalized recommendation system by designing a reusable algorithm library, deploying a low‑latency, highly available real‑time streaming platform (R2), and offering a cloud‑based recommendation engine that simplifies integration for internet businesses.

AI, Real‑Time Computing, cloud computing
11 min read
Art of Distributed System Architecture Design
Mar 20, 2015 · Backend Development

Design and Architecture of Facebook Haystack Image Storage System

The article analyzes Facebook's massive image storage challenges and explains the Haystack architecture, detailing its components (Directory, Store, and Cache), how it reduces I/O, manages metadata, and handles read/write operations at a scale of billions of images, while also addressing CDN dependency and fault tolerance.

Backend, Facebook, Haystack
10 min read