Tagged articles

Large Scale

60 articles · Page 1 of 1

Jun 9, 2026 · Operations

Mastering Elasticsearch Shard Management: From Fundamentals to 100k‑Shard Scale

This article explains Elasticsearch shard fundamentals, primary and replica roles, allocation rules, recovery and rebalance mechanisms, tuning parameters, best‑practice sizing, and presents real‑world production cases—including a 100,000‑shard cluster—along with concrete API commands for effective shard operations.

Cluster Operations · Elasticsearch · Large Scale

0 likes · 28 min read


Raymond Ops

Jan 17, 2026 · Operations

Scaling Ansible: From Manual Deployments to Managing Thousands of Servers

This article walks through the challenges of manual server deployment, explains why Ansible is ideal for large‑scale environments, and provides a complete reference architecture, optimized configuration, dynamic inventory scripts, modular playbooks, performance tuning, monitoring, security hardening, rollback mechanisms, cost analysis, and practical lessons learned for automating deployments across thousands of machines.

Ansible · Automation · Large Scale

0 likes · 20 min read


MaGe Linux Operations

Aug 25, 2025 · Operations

How Ansible Turns Manual Deployments into 10x Faster Automation for 1000+ Servers

This article walks through the author's real‑world experience automating deployments across a thousand‑plus server cluster with Ansible, covering tool selection, architecture design, performance tuning, security practices, rollback mechanisms, cost‑benefit analysis, and common pitfalls, demonstrating how automation can boost efficiency tenfold.

Ansible · Large Scale · performance

0 likes · 18 min read


Alibaba Cloud Big Data AI Platform

Aug 7, 2025 · Cloud Native

How GitOps Powers Cloud‑Native Large‑Scale Cluster Management

This article describes the challenges Alibaba Cloud's intelligent operations team faced in managing thousands of cloud‑native clusters and the solutions it adopted, covering its multi‑layered operation architecture, GitOps workflow, infrastructure‑as‑code integration, and the role of AI‑driven intelligent operations in large‑scale environments.

GitOps · IaC · Kubernetes

0 likes · 23 min read


Xiaohongshu Tech REDtech

May 22, 2025 · Artificial Intelligence

Scalable Overload-Aware Graph-Based Index Construction for 10‑Billion‑Scale Vector Similarity Search (SOGAIC)

The paper introduces SOGAIC, a scalable overload‑aware graph‑based index construction system for billion‑scale vector similarity search that uses adaptive overlapping partitioning and load‑balanced distributed scheduling to cut construction time by 47.3% while maintaining high recall.

ANN · Distributed Scheduling · Large Scale

0 likes · 13 min read


Alibaba Cloud Infrastructure

Nov 22, 2024 · Cloud Native

Large‑Scale Cloud‑Edge Collaborative Technology Based on Cloud‑Native Wins Zhejiang Province Science and Technology Progress Award

Alibaba Cloud, together with Zhejiang University, Alipay and Xieyun Technology, received the Zhejiang Province Science and Technology Progress First Prize for their cloud‑native large‑scale cloud‑edge collaborative platform, which addresses edge resource constraints, real‑time computing, and massive node management, and has been widely applied across multiple industries.

CNCF · Large Scale · Real-Time Computing

0 likes · 5 min read


Full-Stack DevOps & Kubernetes

Sep 9, 2024 · Cloud Native

Optimizing Network and Storage for 5,000‑Node Kubernetes Clusters

This guide outlines practical strategies for designing and optimizing network and storage in Kubernetes clusters of over 5,000 nodes, covering overlay networks, IP pool segmentation, bandwidth allocation, load balancing, security policies, distributed storage options, performance tuning, and reliable backup solutions.

Cloud Native · IP Pool · Kubernetes

0 likes · 9 min read


Baidu Geek Talk

Nov 8, 2023 · Databases

BES Engineering Practices for Large‑Scale Vector Database Scenarios

At QCon 2023, Baidu’s BES team detailed how their cloud‑native Elasticsearch service has been engineered for large‑scale vector search, describing architecture, C++ plugin integration, memory‑saving storage tricks, HNSW/IVF optimizations, filter strategies, and real‑world multimodal video and LLM knowledge‑base deployments.

AI · BES · Cloud

0 likes · 16 min read


DataFunSummit

Nov 2, 2023 · Databases

Understanding TiKV: Features, Architecture, and Large‑Scale Operational Challenges

This article introduces the distributed transactional KV store TiKV, explains its role as TiDB’s storage engine, details its multi‑layered architecture and Raft‑based consistency model, and discusses the performance and resource challenges encountered at massive data scales along with the engineering solutions implemented to address them.

Large Scale · Performance Optimization · Raft

0 likes · 14 min read


AntTech

Oct 30, 2023 · Artificial Intelligence

AntM2C: A Large-Scale Multi‑Scenario Multi‑Modal CTR Prediction Dataset from Alipay

AntM2C is a publicly released, billion‑sample click‑through‑rate (CTR) dataset covering five distinct Alipay business scenarios, providing both ID and rich multi‑modal (text and image) features to enable comprehensive evaluation of multi‑scenario, cold‑start, and multi‑modal CTR models at industrial scale.

CTR · Large Scale · Multi-modal

0 likes · 14 min read


JD Tech

Oct 10, 2023 · Operations

Technical Case Study of JDV Visual Dashboard Platform for the 618 Promotion

This article details how JDV, JD.com’s internal visual dashboard platform, tackled the massive data‑intensive 618 promotion by implementing real‑time updates, cross‑midnight count stops, request‑state control, heartbeat monitoring, proxy data sources, and a suite of developer tools to ensure stability, performance, and rapid feature delivery.

Data Platform · Large Scale · Monitoring

0 likes · 18 min read


DataFunTalk

Sep 22, 2023 · Big Data

Design and Practice of Baidu's Tape Library Storage Architecture Based on the Aries Cloud Storage System

This article presents a comprehensive overview of Baidu Intelligent Cloud's tape‑library solution, detailing tape and tape‑library fundamentals, the Aries cloud storage stack, data and access models, the end‑to‑end data flow, key architectural design choices, implementation details, and a real‑world case study demonstrating large‑scale cold‑data storage, backup, and retrieval performance.

Data Archiving · Large Scale · aries

0 likes · 28 min read


DataFunSummit

Jun 21, 2023 · Databases

Forum on Building Ultra‑Scale Storage Systems: Insights from Baidu, Meituan, Ant Group, Xiaomi and Baidu Cloud

The forum gathers senior experts from Baidu, Meituan, Ant Group, Xiaomi and Baidu Cloud to share practical experiences and future trends on constructing ultra‑large‑scale file, block, KV and NoSQL storage systems, focusing on low‑cost, high‑performance solutions and architectural challenges.

KV storage · Large Scale · block storage

0 likes · 8 min read


Tencent Cloud Developer

May 10, 2023 · Cloud Native

Tencent's Large‑Scale Cloud‑Native Migration: Challenges and Solutions

In October 2022 Tencent finished migrating its flagship services, including QQ, WeChat, and Honor of Kings, to a cloud‑native architecture spanning over 50 million CPU cores. The migration had to solve millisecond‑level upgrades, stateful in‑place refresh, massive cross‑region scaling, and heterogeneous hardware; it did so with the TKEx platform’s sidecar upgrades, three‑container patterns, Global Scaler Operator, machine‑type abstraction, and Clusternet‑based application‑centric orchestration. CPU utilization rose to 65%, and the project became China’s largest cloud‑native practice.

Container Upgrade · Large Scale · Tencent

0 likes · 19 min read


Continuous Delivery 2.0

May 8, 2023 · Operations

Google’s Monolithic Code Repository: Scale, Architecture, and Practices

Google’s monolithic repository, managed by the proprietary Piper system and accessed via the cloud‑based CitC client, stores over a billion files and billions of lines of code, supports tens of thousands of engineers, and relies on trunk‑based development, extensive tooling, and strict security to enable large‑scale, efficient software development.

DevOps · Google · Large Scale

0 likes · 17 min read


Alimama Tech

Feb 8, 2023 · Artificial Intelligence

Evolution of Recall Indexes in Alibaba Advertising: From Quantization to Graph-based HNSW

Alibaba’s advertising pipeline progressed from low‑dimensional quantization partitions to hierarchical tree indexes, then to graph‑based HNSW structures—including multi‑category, multi‑level graphs and a BlazeOp‑driven scoring service—dramatically boosting recall efficiency, scalability and maintainability while meeting strict latency constraints.

HNSW · Large Scale · recall

0 likes · 13 min read


Huawei Cloud Developer Alliance

Dec 12, 2022 · Cloud Native

How Karmada Powers Multi‑Cloud, Multi‑Cluster Production at Cloud Native Days China 2022

The Karmada community's Cloud Native Days China 2022 session in Nanjing gathered over 30 enterprises and developers to share multi‑cloud, multi‑cluster production practices, large‑scale testing results, and real‑world implementations from Huawei Cloud, vivo, Hurricane Engine, China Mobile, DaoCloud, and Zhejiang University, highlighting Karmada's scalability and ecosystem growth.

Karmada · Kubernetes · Large Scale

0 likes · 9 min read


NetEase LeiHuo Testing Center

Dec 9, 2022 · Game Development

BattleBit Remastered – An In‑Depth Analysis of Its Large‑Scale Multiplayer FPS Design

BattleBit Remastered is a low‑poly, large‑scale multiplayer FPS that supports up to 254 players per match, offering extensive class and weapon options, a minimalist UI, destructible environments, and strong team‑communication tools, all while running on minimal hardware requirements.

BattleBit Remastered · Large Scale · UI

0 likes · 12 min read


FunTester

Oct 24, 2022 · Backend Development

Optimizing Large-Scale API Parameter Combination Testing with Concurrency and QPS Control

This article describes how to efficiently test billions of API parameter combinations by replacing naive nested loops with a queue‑based concurrent approach, dynamically controlling QPS, and addressing memory‑pressure issues using thread‑safe data structures.

API testing · Java · Large Scale

0 likes · 8 min read


Java High-Performance Architecture

Oct 11, 2022 · Operations

How Meituan Optimized Kafka for Massive Scale: Reducing Latency and Managing Clusters

This article details Meituan's real‑world challenges with a 15,000‑node Kafka deployment and explains the application‑layer and system‑layer optimizations—such as disk balancing, migration pipeline acceleration, fetcher isolation, RAID acceleration, cgroup isolation, and an SSD‑based cache—that together dramatically cut read/write latency and simplify large‑scale cluster management.

Large Scale · Meituan · Optimization

0 likes · 23 min read


ITPUB

Sep 24, 2022 · Operations

How Alibaba Cloud Log Service Scales Billion‑Task Scheduling: Design and Practice

This article explains how Alibaba Cloud Log Service implements a billion‑scale task scheduling framework for its observability platform, covering background, task types, design goals, architecture, key design points, and practical examples such as aggregation jobs and various scheduling scenarios.

Large Scale · Log Service · Task scheduling

0 likes · 22 min read


WeChat Backend Team

Aug 5, 2022 · Artificial Intelligence

How WeChat’s Ekko Achieves Ultra‑Low‑Latency Model Updates for Billion‑User Recommendations

At the 16th OSDI conference, Tencent’s WeChat team presented the award‑winning Ekko system, an ultra‑low‑latency model‑update solution for massive recommendation workloads that dramatically speeds up updates, supports trillion‑scale models, and has already boosted user engagement across billions of daily users.

Large Scale · Recommendation Systems · WeChat

0 likes · 5 min read


DataFunSummit

Jul 26, 2022 · Artificial Intelligence

Multi-step Reasoning over Large-scale Knowledge Graphs: Query2Box and SMORE Framework

This talk presents recent advances in multi-step reasoning over large-scale, noisy knowledge graphs, introducing the Query2Box model that uses box embeddings for complex queries and the SMORE framework that enables efficient multi-hop inference on massive graphs through scalable query generation, embedding computation, and training pipelines.

AI · Knowledge Graph · Large Scale

0 likes · 14 min read


Volcano Engine Developer Services

Jul 12, 2022 · Cloud Native

How ByteDance Scales Cloud‑Native Infrastructure with Hybrid K8s Scheduling

ByteDance’s cloud‑native ecosystem combines a multi‑layered architecture, dynamic resource over‑provisioning control, hybrid online‑offline scheduling, and federated cluster management to boost container utilization from 23% to 63%, reduce costs by 40%, and support massive events like the 2021 Spring Festival Gala.

Cloud Native · Large Scale · Resource Scheduling

0 likes · 16 min read


Efficient Ops

Jun 23, 2022 · Cloud Native

How Vivo Scales Kubernetes: Automated Multi‑Cluster Management with a Custom Operator

Vivo’s rapid migration to Kubernetes across multiple data centers required a secure, efficient, and reliable way to manage thousands of nodes, leading them to develop a custom k8s‑operator that streamlines cluster deployment, CI testing, declarative APIs, and automated repair for large‑scale cloud‑native environments.

Cloud Native · Cluster Automation · DevOps

0 likes · 3 min read


Alipay Experience Technology

Jun 14, 2022 · Frontend Development

How Ant Group’s Serverless Front‑End Platform Boosts Large‑Scale Development

This talk explains Ant Group’s serverless‑based front‑end platform, detailing the current front‑end architecture, challenges of large‑scale financial services, and three core efficiency ideas—document‑as‑code, function‑level deployment, and integrated plugin capabilities—to streamline development and ensure safe production.

BFF · Large Scale · Serverless

0 likes · 9 min read


Laravel Tech Community

Jun 6, 2022 · Artificial Intelligence

What an Open‑Source Twitter Algorithm Would Look Like: Architecture, Data Model, and Engineering Challenges

This article examines the practical aspects of open‑sourcing Twitter’s recommendation algorithm, covering the platform’s data model, timeline views, ranking features, a TypeScript pseudocode illustration, and the major engineering challenges of scale, real‑time processing, reliability, and security.

Large Scale · Twitter · algorithm

0 likes · 14 min read


Alibaba Cloud Native

Mar 1, 2022 · Cloud Native

How Alibaba’s KubeProbe Tackles Large‑Scale Kubernetes Stability Challenges

This article explains how Alibaba Cloud's self‑built KubeProbe combines universal link probing and targeted inspections to detect, diagnose, and remediate issues in massive multi‑cluster Kubernetes environments, improving reliability and reducing on‑call overhead.

ChatOps · Cloud Native · Kubernetes

0 likes · 19 min read


Alimama Tech

Feb 9, 2022 · Artificial Intelligence

Online Allocation Strategies for Guaranteed Display Advertising: Modeling, Distributed Solving, and Adaptive Pacing

The paper presents a guarantee‑based, distributed allocation framework for Alibaba’s off‑site brand contract ads that extends the SHALE algorithm with effect‑driven objectives and explicit over‑allocation constraints, solves dual variables via coordinate descent, and employs adaptive probability‑based pacing to meet volume guarantees while significantly boosting average CTR.

Large Scale · Online Optimization · allocation

0 likes · 11 min read


DataFunTalk

Dec 29, 2021 · Artificial Intelligence

Entity Alignment in Product Knowledge Graphs: Techniques and Applications

This article presents a comprehensive overview of building and applying product knowledge graphs for e‑commerce, covering background, recent advances in graph neural network‑based entity alignment, online prediction pipelines, data construction, evaluation metrics, attribute extraction, and future research directions.

Graph Neural Network · Knowledge Graph · Large Scale

0 likes · 23 min read


DataFunTalk

Dec 13, 2021 · Artificial Intelligence

Dual Vector Foil (DVF): Decoupled Index and Model for Large‑Scale Retrieval

The article introduces the Dual Vector Foil (DVF) algorithm system, which decouples index construction from model training to enable lightweight, high‑precision large‑scale recall using arbitrary complex models, and details its two‑stage and one‑stage solutions, graph‑based retrieval implementation, performance optimizations, and experimental results.

Large Scale · Recommendation Systems · algorithm

0 likes · 28 min read


Efficient Ops

Nov 9, 2021 · Operations

How Ant Group Scales etcd for 10k‑Node Kubernetes Clusters: High‑Availability Secrets

This article examines Ant Group's strategies for achieving high availability of the etcd key‑value store in a massive 10,000‑node Kubernetes cluster, detailing challenges, performance metrics, filesystem upgrades, tuning parameters, operational platform insights, and future directions for distributed etcd deployments.

Etcd · Kubernetes · Large Scale

0 likes · 21 min read


DataFunTalk

Oct 9, 2021 · Databases

Building and Optimizing a Large‑Scale Graph Platform for Financial Risk Control at Du Xiaoman Financial

This article describes how Du Xiaoman Financial designed, built, and continuously optimized a massive graph platform—including data governance, graph learning, query performance, data import, and online deployment—to improve credit risk assessment using billions of nodes and edges, and shares practical lessons on graph databases, distributed training, and real‑time inference.

DGL · Financial Analytics · JanusGraph

0 likes · 19 min read


DataFunTalk

Jan 20, 2021 · Artificial Intelligence

Techniques for Reducing the Computational Complexity of Large-Scale Graph Neural Networks

This article presents an overview of graph neural networks, explains their computational framework, analyzes space and time complexities, and proposes ten practical strategies—including edge avoidance, dimensionality reduction, selective iteration, memory baking, distillation, partitioning, sparse computation, routing, and cross-sample feature sharing—to significantly lower the cost of large‑scale GNN processing.

Computational Complexity · Large Scale · deep learning

0 likes · 14 min read


JD Cloud Developers

Dec 16, 2020 · Backend Development

How JD Logistics Scaled a Billion‑Level Async Messaging System for 11.11

JD Logistics architect Chen Haolong detailed the design, scalability strategies, and operational practices behind the billion‑level asynchronous messaging system that powered JD.com’s massive 11.11 shopping festival, revealing how the platform handled unprecedented traffic and ensured reliability.

JD Logistics · Large Scale · Operations

0 likes · 2 min read


Alibaba Cloud Native

Sep 24, 2020 · Cloud Native

Tackling Ultra‑Large‑Scale Service Mesh Deployment: Lessons from Alibaba

This article details Alibaba's practical experience deploying Service Mesh at massive scale, covering architectural evolution, key challenges, traffic interception, hot‑upgrade mechanisms, performance optimizations, and operational tooling that together enable reliable, low‑overhead service communication in a cloud‑native environment.

Cloud Native · Envoy · Istio

0 likes · 22 min read


Full-Stack Internet Architecture

Sep 15, 2020 · Cloud Native

Challenges and Solutions for Large-Scale Service Mesh Deployment at Alibaba

Alibaba’s large‑scale Service Mesh deployment faces challenges such as smooth technology evolution, business‑technical balance, technical debt, massive sidecar operations, and scaling, which it addresses through staged architecture evolution, traffic‑transparent interception, hot upgrades, and open‑source contributions to Istio and Envoy.

Cloud Native · Envoy · Istio

0 likes · 19 min read


DataFunTalk

Aug 20, 2020 · Artificial Intelligence

Weibo Recommendation Algorithm Practice and Machine Learning Platform Evolution

This article shares Weibo’s experience in building and evolving its recommendation algorithms, covering the recommendation scenario, machine learning workflow, feature engineering, model upgrades, large‑scale challenges, deployment via the Weiflow platform, and the capabilities of its machine‑learning infrastructure.

Large Scale · Weibo · feature engineering

0 likes · 14 min read


MaGe Linux Operations

Jul 28, 2020 · Big Data

How Leading Chinese Companies Scale Elasticsearch for Billions of Orders

This article surveys how major Chinese tech firms such as JD.com, Ctrip, Didi, and 58.com deploy and evolve Elasticsearch clusters to handle massive order data, log analysis, real‑time monitoring, and security tasks, detailing architecture choices, shard strategies, multi‑cluster designs, and performance optimizations.

Big Data · Elasticsearch · Large Scale

0 likes · 11 min read


DataFunTalk

Dec 4, 2019 · Artificial Intelligence

Joint Optimization of Tree‑based Index and Deep Model (JTM) for Large‑Scale Recommendation

This article presents JTM, a joint optimization framework that simultaneously learns a tree‑based index and a deep scoring model to overcome the limitations of traditional recommendation pipelines, demonstrating significant recall improvements on Amazon Books and Alibaba UserBehavior datasets through hierarchical user interest modeling and efficient tree learning.

Large Scale · deep learning · joint optimization

0 likes · 19 min read


Alibaba Cloud Developer

Nov 13, 2019 · Cloud Native

Scaling Kubernetes at Ant Financial: The Kube‑on‑Kube Architecture Explained

Ant Financial’s article details how its large‑scale Kubernetes management system—built on a meta‑cluster, end‑state operators, and a Kube‑on‑Kube design—ensures reliable creation, upgrade, and self‑healing of thousands of nodes, while providing gray‑scale changes, risk assessment, and fault‑tolerant automation.

Cloud Native · Kubernetes · Large Scale

0 likes · 15 min read


dbaplus Community

Nov 4, 2019 · Cloud Native

How Ant Financial Scales Kubernetes: Inside Their Cloud‑Native Cluster Management System

This article explains how Ant Financial designs and operates a large‑scale, highly available Kubernetes management platform—detailing its architecture, core operators, desired‑state controllers, fault‑self‑healing mechanisms, risk mitigation, and practical Q&A for production environments.

Automation · Cloud Native · Kubernetes

0 likes · 16 min read


Alibaba Cloud Native

Oct 27, 2019 · Cloud Native

How Ant Financial Scales Kubernetes: Inside Their Kube‑on‑Kube Management System

This article explains how Ant Financial designs and operates a large‑scale, highly available Kubernetes management platform—using end‑state driven operators, custom CRDs, self‑healing mechanisms, and risk‑mitigation strategies—to reliably run thousands of nodes and dozens of business clusters in production.

Kube-on-Kube · Kubernetes · Large Scale

0 likes · 15 min read


DataFunTalk

Oct 16, 2019 · Artificial Intelligence

Deep Learning Practices for Personalized Recommendation at Meitu: From Recall to Ranking

This article details Meitu's large‑scale personalized recommendation pipeline, describing the business scenario, challenges of massive data, latency and long‑tail distribution, and the application of deep learning techniques such as Item2vec, YouTubeNet, dual‑tower DNN, NFM, NFwFM and multi‑task learning to improve click‑through rate, conversion and user engagement.

Large Scale · Multi-Task Learning · Recommendation Systems

0 likes · 20 min read


DataFunTalk

Aug 16, 2019 · Artificial Intelligence

Tree‑based Deep Match (TDM): Design, Implementation, and Applications in Large‑Scale Retrieval

This article presents a comprehensive overview of the Tree‑based Deep Match (TDM) algorithm, describing the evolution of retrieval technology, the limitations of traditional Match‑Rank pipelines, the design of a one‑stage tree‑indexed deep matching model, its training methodology, performance gains on public datasets, and its deployment in Alibaba’s advertising and e‑commerce platforms.

Large Scale · Recommendation Systems · TDM

0 likes · 23 min read


AntTech

Aug 15, 2019 · Cloud Native

Design and Implementation of Ant Financial’s Large‑Scale Kubernetes Cluster Management System

This article explains how Ant Financial designs a highly reliable, end‑state‑driven Kubernetes management platform that handles lifecycle operations, node self‑healing, and risk‑controlled changes for clusters with tens of thousands of nodes, using operators, custom resources, and a meta‑cluster architecture.

Kubernetes · Large Scale · cluster management

0 likes · 9 min read


Efficient Ops

Jun 11, 2019 · Operations

What Powers WeChat’s Billion‑User Scale? Inside Its DevOps Journey

WeChat, China’s top social app with over a billion users, has applied DevOps practices to dramatically improve development efficiency and code quality and to accelerate the feedback cycle from requirements to delivery, while confronting real‑world challenges in tooling, processes, reliability, and automation costs.

Continuous Delivery · Large Scale · Operations

0 likes · 3 min read


dbaplus Community

Aug 14, 2018 · Operations

How Ant Financial Automates Tens of Thousands of Kubernetes Nodes with Operators

Ant Financial tackles the challenge of managing dozens of Kubernetes clusters and over a hundred thousand worker nodes by employing a meta‑cluster with Kube‑on‑Kube and Node Operators, enabling automated lifecycle management, scaling, upgrades, and fault recovery for both master components and worker nodes.

Automation · Kubernetes · Large Scale

0 likes · 12 min read


JD Tech

Aug 6, 2018 · Operations

Ensuring Stability in Large‑Scale Kubernetes Clusters: Three Key Questions and Operational Practices

The article shares practical experience on operating massive Kubernetes clusters, focusing on three stability questions, data collection and visualization, and a suite of operational tools to achieve reliable, high‑availability services in production environments.

Cluster Operations · Kubernetes · Large Scale

0 likes · 12 min read


Meitu Technology

Jul 12, 2018 · Artificial Intelligence

DeepHash: Large-Scale Multimedia Content Analysis and Retrieval for Short Video Platforms

DeepHash is Meitu’s large‑scale short‑video analysis and retrieval system that converts deep‑learned visual features into compact binary hash codes via a MobileNet‑based CNN and triplet‑loss training, enabling fast, robust similarity search across billions of videos with sub‑second latency and minimal storage.

Large Scale · feature extraction · multimedia retrieval

0 likes · 15 min read


MaGe Linux Operations

Jun 29, 2018 · Operations

Essential Skills and Roadmap for Large‑Scale Website Operations Engineers

This comprehensive guide explains what large‑scale website operations entail, outlines the product lifecycle involvement of ops engineers, details the technical and personal skills required, and discusses current challenges, future prospects, and key technologies such as cluster management, monitoring, fault handling, and automation.

Automation · DevOps · Large Scale

0 likes · 18 min read


58 Tech

Jun 1, 2018 · Backend Development

Design and Implementation of Real-Time Indexing in 58.com’s ESearch Search Engine

This article explains how 58.com’s in‑house C++ search kernel ESearch was architected to provide second‑level real‑time indexing, high‑concurrency low‑latency querying, flexible ranking models, and efficient storage structures for billions of daily queries across massive classified data.

C++ · Large Scale · Search Engine

0 likes · 13 min read

Efficient Ops

Feb 5, 2018 · Operations

How WeChat Scales Massive Real-Time Monitoring: Design & Practices

This article details the architecture and practical techniques behind WeChat's large‑scale monitoring system, covering lightweight data collection, classification of real‑time, non‑real‑time and user‑specific metrics, anomaly detection algorithms, automated configuration, and high‑performance storage solutions for billions of events per minute.

Large Scale · Monitoring · Operations

0 likes · 14 min read

Efficient Ops

Dec 5, 2017 · Operations

How Alibaba’s Sunfire Achieves Second‑Level Monitoring at Trillion‑Transaction Scale

This article explains how Alibaba’s Sunfire monitoring platform processes terabytes of logs per minute, uses a pull‑based architecture with Brain‑Reduce‑Map roles, tackles scalability and reliability challenges, and outlines future directions such as MQL standardization and intelligent baselines.

Large Scale · Log Processing · Monitoring

0 likes · 17 min read

ITPUB

Nov 14, 2017 · Operations

How Alibaba’s Dragonfly P2P System Supercharges Large‑Scale File and Container Image Distribution

Alibaba’s Dragonfly (蜻蜓) is a self‑developed P2P file distribution platform that dramatically speeds up massive file and container image delivery, reduces bandwidth consumption, supports intelligent compression and flow control, and has become a core infrastructure component powering billions of transactions during major events like Double 11.

File Distribution · Large Scale · P2P

0 likes · 20 min read

Alibaba Cloud Developer

Nov 14, 2017 · Operations

How Alibaba’s Dragonfly P2P System Supercharges File and Image Distribution

Alibaba’s Dragonfly (蜻蜓) leverages P2P networking, intelligent compression, and flow control to dramatically accelerate large‑scale file and container image distribution, reducing bandwidth usage by over 99%, achieving up to 57× speedup, and supporting tens of thousands of concurrent hosts during peak events like Double 11.

File Distribution · Large Scale · P2P

0 likes · 19 min read

Efficient Ops

May 19, 2017 · Operations

Mastering Continuous Feedback in Massive Operations: DevOps Strategies from Tencent

This article shares insights from Tencent’s SNG operations leader on building effective continuous feedback loops for massive‑scale services, covering monitoring, alerting, operational metrics, multi‑dimensional analysis, and practical DevOps techniques to improve reliability, availability, and automated self‑healing.

Continuous Feedback · DevOps · Large Scale

0 likes · 26 min read

Ctrip Technology

Jan 5, 2017 · Artificial Intelligence

Design and Implementation of a Billion‑Scale Generalized Recommendation System at Tencent Cloud

This article explains how Tencent built a billion‑scale, generalized recommendation system by designing a reusable algorithm library, deploying a low‑latency, highly available real‑time streaming platform (R2), and offering a cloud‑based recommendation engine that simplifies integration for internet businesses.

AI · Cloud Computing · Large Scale

0 likes · 11 min read

360 Zhihui Cloud Developer

Dec 15, 2016 · Operations

How Qcmd Revolutionizes Large‑Scale Server Automation Compared to SaltStack

This article explains how 360's Qcmd, a Golang‑based real‑time command execution system, overcomes SaltStack's limitations to reliably manage tens of thousands of servers with high success rates, flexible scripting, detailed monitoring, and efficient message handling.

Automation · Command Execution · Large Scale

0 likes · 7 min read

Art of Distributed System Architecture Design

Mar 20, 2015 · Backend Development

Design and Architecture of Facebook Haystack Image Storage System

The article analyzes Facebook's massive image storage challenges and explains the Haystack architecture, detailing its components (Directory, Store, and Cache), how it reduces I/O, manages metadata, and handles read/write operations across billions of images, and how it addresses CDN dependency and fault tolerance.

Facebook · Haystack · Image storage

0 likes · 10 min read