Tagged articles
2179 articles
21CTO
Jul 6, 2017 · Big Data

How HBase Boosted Tencent Monitoring Platform Performance 3‑5×

Facing the challenge of storing over 120 billion daily monitoring points from hundreds of thousands of servers, Tencent’s monitoring platform migrated from a custom solution and OpenTSDB to a finely tuned HBase architecture, achieving 3‑5× higher throughput, improved reliability, and significant storage savings.

DistributedStorage · HBase · PerformanceTuning
0 likes · 11 min read
Qunar Tech Salon
Jul 4, 2017 · Big Data

Design and Evolution of Airbnb's Log Data Storage and Query Platform

The article describes how Airbnb's data infrastructure team built a next‑generation log storage and query platform to improve data quality, timeliness, flexibility, and anomaly detection, outlining the system architecture, key requirements, five improvement areas, and the resulting benefits.

Airbnb · data pipeline · log platform
0 likes · 7 min read
Suning Technology
Jul 3, 2017 · Operations

Inside Suning’s Intelligent Ops Forum: How Tech Leaders Automate and AI‑Boost Operations

The Suning Cloud Commerce IT headquarters hosted a comprehensive Intelligent Operations forum featuring experts from Alibaba, Weibo, Meituan, 360, Meizu and PPD, who shared practical insights on automation, platformization, AI‑driven big‑data analytics, network automation, security, and monitoring across modern IT operations.

Intelligent Operations · monitoring
0 likes · 8 min read
Efficient Ops
Jun 11, 2017 · Operations

How Bilibili Scaled Its Ops: From DIY Deployments to Prometheus Monitoring

From early manual deployments to a sophisticated, multi-layered monitoring stack—including ELK, Zabbix, Statsd, Grafana, and Prometheus—Bilibili’s ops team shares the evolution, challenges, and lessons learned in building scalable, automated infrastructure for massive internet traffic.

DevOps · ELK · Grafana
0 likes · 8 min read
ITPUB
Jun 9, 2017 · Operations

Mastering Effective Monitoring: From Basics to the USE Method

This article explains the fundamentals of monitoring, distinguishes traditional OPS from SRE perspectives, defines monitoring objects and metrics, introduces quantitative thinking with SLI/SLO, and presents the USE method with a MySQL example to help engineers detect and prevent failures efficiently.

Metrics · Operations · SLI
0 likes · 10 min read
Baidu Waimai Technology Team
Jun 6, 2017 · Backend Development

Design and Optimization of Baidu Waimai Activity Module Architecture

This article presents a comprehensive redesign of Baidu Waimai’s client‑side activity module, detailing background challenges, design goals, functional and performance specifications, trade‑off analyses of three architectural alternatives, and the chosen parallel HTTP‑request solution with monitoring, degradation, and phased rollout plans.

Backend · Performance Optimization · Scalability
0 likes · 8 min read
ITPUB
May 31, 2017 · Operations

Automate Bulk Host Addition for Cacti and Nagios with Simple Scripts

The article explains how to automate the tedious process of adding multiple hosts to Cacti and Nagios by using shell‑wrapped PHP scripts and custom templates, provides download links, and shares practical tips to avoid common installation pitfalls.

Automation · Batch · Cacti
0 likes · 5 min read
Qunar Tech Salon
May 19, 2017 · Mobile Development

Zero‑Instrumentation Interaction and Performance Monitoring for Large‑Scale Mobile Apps

The article presents a comprehensive approach to solving crash and performance issues in large‑scale mobile applications by reconstructing user interaction traces through a no‑track analytics platform, compile‑time AOP instrumentation, and unified data aggregation, ultimately improving debugging efficiency and reducing operational overhead.

Analytics · aop · monitoring
0 likes · 9 min read
ITPUB
May 15, 2017 · Operations

Mastering Online Incident Management: From Detection to Prevention

This article outlines a comprehensive methodology for handling large‑scale online service incidents, covering goals, the "jump‑fill‑avoid" framework, step‑by‑step processes for detection, diagnosis, remediation, and post‑mortem analysis, as well as essential monitoring, logging, and escalation infrastructure.

Operations · SRE · incident management
0 likes · 18 min read
Qunar Tech Salon
May 11, 2017 · Operations

Designing Performance Test Scenarios: Models, Metrics, and Strategies

This article explains how to design performance testing scenarios, covering test models, metrics, script preparation, concurrency calculations, pressure strategies, run times, delay settings, user termination, monitoring methods, and various typical scenario types such as baseline, load, mixed, capacity, large‑concurrency, stability and scalability tests.

Load Testing · Performance Testing · TPS
0 likes · 24 min read
MaGe Linux Operations
May 10, 2017 · Operations

Step‑by‑Step: Monitor Nginx and PHP‑FPM Status with Zabbix

This guide walks through configuring Zabbix to monitor Nginx and PHP‑FPM status, covering software installation paths, enabling status modules, creating extraction scripts, setting up Zabbix agent userparameters, restarting services, testing data retrieval, and adding server‑side templates for items, triggers, and graphs.

Linux · Nginx · Ops
0 likes · 9 min read
Efficient Ops
May 9, 2017 · Backend Development

How Tencent Scaled QQ Red Packet to 100k QPS: Architecture & Lessons

This article details how Tencent's AMS system was analyzed, traffic‑estimated, and redesigned for high‑availability during the QQ Spring Festival Red Packet event, covering architecture mapping, scaling strategies, overload protection, flexible availability, disaster recovery, monitoring, and practical lessons learned.

Backend · disaster-recovery · high-availability
0 likes · 25 min read
DevOps
May 9, 2017 · Operations

A Clear and Concise DevOps Implementation Framework: 11 Core Service Capabilities

This article introduces a straightforward DevOps implementation framework that maps eleven essential service capabilities across the software development lifecycle, explains why adopting DevOps is a multi‑year journey, and uses a fitness analogy to illustrate how enterprises can progressively build these capabilities.

Continuous Delivery · DevOps · Operations
0 likes · 4 min read
Efficient Ops
May 3, 2017 · Operations

How Tencent Scales NBA Live Streams to Millions: Behind the Tech and Operations

This article details Tencent's large‑scale live streaming architecture for NBA games, covering the rapid growth of live video, key technical features, network transmission challenges, multi‑angle production, CDN deployment, monitoring, big‑data processing, and strategies for ensuring low latency and high reliability for millions of concurrent viewers.

Big Data · CDN · Operations
0 likes · 25 min read
DevOps
Apr 25, 2017 · Operations

Analyzing and Visualizing Docker Logs with the ELK Stack (Part Two)

This article explains how to analyze and visualize Docker container logs using the ELK stack, covering preparation, parsing tips, Kibana query techniques, and example visualizations to help monitor Dockerized environments effectively in production.

Docker · ELK · Kibana
0 likes · 7 min read
MaGe Linux Operations
Apr 17, 2017 · Operations

Essential Linux & Server Commands: From Log Cleanup to RAID and Monitoring

This guide presents practical Linux and server administration commands, covering log cleanup, nginx IP analysis, tcpdump capture, Python date formatting and string reversal, subprocess execution, multiprocessing, iptables port forwarding, cron scheduling, file relocation, RAID concepts, Oracle backup strategies, port checking, Apache MPM modes, and monitoring tool comparisons.

Linux · Networking · RAID
0 likes · 10 min read
Efficient Ops
Apr 16, 2017 · Operations

How China Life Built a Self‑Developed Automated Ops Platform from Scratch

China Life’s Shanghai Data Center team transformed chaotic, multi‑system operations into a unified, automated platform by standardizing hardware, processes, and tools, leveraging OpenStack, Docker, Zabbix, and custom scripts, ultimately achieving efficient monitoring, change management, and a mobile‑enabled DevOps workflow.

Automation · Ops Platform · cloud
0 likes · 17 min read
dbaplus Community
Apr 13, 2017 · Backend Development

Scalable Small .NET E‑Commerce Architecture: Monitoring, DB Master‑Slave & Capacity Planning

This guide walks through the evolution of a small .NET‑based e‑commerce system, covering its initial LAMP‑style setup, detailed backend architecture, logging and monitoring solutions, master‑slave database design, shared‑storage image server, mobile M‑site construction, capacity estimation methods, and caching strategies.

architecture · capacity planning · database
0 likes · 22 min read
Efficient Ops
Apr 12, 2017 · Operations

Mastering Enterprise Monitoring: From Basics to Advanced Toolchains

This comprehensive guide explains why monitoring is vital for operations, outlines clear objectives and methods, compares popular open‑source and commercial tools, details a Zabbix‑based workflow, and covers hardware, system, application, network, security, API, performance, and business metrics with practical alerting strategies.

Alerting · Operations · Zabbix
0 likes · 21 min read
Tongcheng Travel Technology Center
Apr 10, 2017 · Operations

Sentinel Monitoring System: Real‑Time Business Log Monitoring and Incident Detection for an Airline Ticket Platform

The Sentinel system was built to provide real‑time, zero‑modification monitoring of airline ticket business services by consuming Tianwang logs through a Storm cluster, offering flexible rule configuration, addressing performance pitfalls, and planning future enhancements such as custom monitoring scripts and visual dashboards.

Kafka · Log Processing · Real-Time
0 likes · 6 min read
Efficient Ops
Apr 9, 2017 · Cloud Native

How Ctrip Achieved Seconds‑Level Scaling with a Container Cloud

Ctrip built a private container cloud to handle massive seasonal traffic spikes, enabling rapid, automated scaling and shrinking of resources, improving deployment speed, resource utilization, and operational intelligence across more than 20 business units.

Ctrip · cloud-native · containerization
0 likes · 16 min read
Alibaba Cloud Developer
Apr 6, 2017 · Backend Development

How Alibaba Scaled GitLab to Support Millions of Users with Sharding and High‑Availability

This article details Alibaba Group's journey of transforming its GitLab deployment from a single‑node setup to a distributed, sharded architecture that handles tens of millions of daily requests, achieves near‑perfect reliability, and incorporates performance, monitoring, and disaster‑recovery innovations.

GitLab · Performance Optimization · high availability
0 likes · 15 min read
360 Zhihui Cloud Developer
Apr 6, 2017 · Operations

How SRE’s Dialectical Thinking Redefines Modern Operations

An insightful reflection on Google’s SRE philosophy shows how dialectical thinking—questioning absolute stability, embracing limited toil, prioritizing simple monitoring, recognizing automation’s hidden risks, and practicing real‑world failure drills—can reshape operations, encouraging smarter, more resilient system design.

Automation · Reliability · SRE
0 likes · 7 min read
Efficient Ops
Mar 30, 2017 · Backend Development

Designing a Scalable, Configurable Distributed Web Crawler

This article outlines the motivation, requirements, modular decomposition, and architecture of a distributed web crawling platform that emphasizes reusability, lightweight modules, real‑time monitoring, and easy configuration for diverse data‑collection tasks.

Backend Architecture · Configuration · Pipeline
0 likes · 10 min read
Baidu Waimai Technology Team
Mar 30, 2017 · Backend Development

Design and Implementation of a Unified Voucher Issuance Platform for Baidu Waimai

This article describes the design, architecture, and operational features of Baidu Waimai's unified voucher issuance platform, detailing its four‑layer backend structure, permission and strategy configurations, flow‑control mechanisms, service isolation, monitoring visualizations, and re‑entrancy safeguards to support large‑scale marketing distribution.

Backend Architecture · Flow Control · System Design
0 likes · 7 min read
Qunar Tech Salon
Mar 23, 2017 · Cloud Native

Ctrip Container Cloud: Architecture, Scaling, and Operational Practices

The article details Ctrip's rapid business growth driving the need for elastic scaling, the adoption of container technology to achieve second‑level provisioning, the design of their container cloud platform—including deployment principles, network choices, orchestration evaluations, monitoring solutions, and the CDOS overview—providing practical insights for large‑scale cloud‑native operations.

DevOps · Orchestration · cloud-native
0 likes · 16 min read
Baidu Intelligent Testing
Mar 21, 2017 · Operations

Server Monitoring Solution: Requirements, Design Decisions, and Implementation Details

This article presents a comprehensive server‑side monitoring solution covering functional and performance requirements, monitoring objects, design choices between self‑monitoring and centralized reporting, system architecture, API definitions, key challenges such as key collisions, data formats, storage options, and operational considerations.

Alerting · Metrics · Operations
0 likes · 12 min read
ITPUB
Mar 20, 2017 · Operations

How to Diagnose Linux Performance in the First 60 Seconds

Learn the essential Linux command-line tools and step-by-step commands you need to run within the first minute of logging into a server to quickly assess process activity, resource usage, and potential bottlenecks, enabling effective performance troubleshooting in production environments.

Command-line · monitoring · performance
0 likes · 12 min read
Architecture Digest
Mar 18, 2017 · Backend Development

Technical Strategies for Startup Engineering Teams: Simplicity, Cloud Servers, Databases, Caching, and DevOps

The article outlines practical engineering guidelines for internet startups, emphasizing simplicity, rapid development, resource efficiency, and the use of cloud servers, MySQL, caching, asynchronous processing, logging, monitoring, documentation, and integrated build‑deploy pipelines to build stable, low‑cost backend systems.

Backend Development · caching · cloud servers
0 likes · 16 min read
Ctrip Technology
Mar 17, 2017 · Cloud Computing

Ctrip Container Cloud: Architecture, Elastic Scaling, and Monitoring Practices

This article details Ctrip's journey in building a private container cloud to support rapid business growth, covering elasticity challenges, container deployment principles, orchestration platform choices, network design, operational issues, custom executors, monitoring solutions, and the overarching CDOS system.

Docker · Mesos · cdos
0 likes · 16 min read
High Availability Architecture
Mar 16, 2017 · Operations

Stormcrow: Dropbox’s Scalable Feature‑Flag Platform for Rapid Deployment and A/B Testing

The article describes Dropbox’s Stormcrow system, a configurable feature‑gate platform that enables fast, safe rollout of new functionality across web, desktop, and mobile clients, supports granular A/B testing, leverages custom data fields, and integrates deployment, monitoring, and audit tooling for large‑scale operations.

A/B testing · Deployment · Scalable Systems
0 likes · 15 min read
Efficient Ops
Mar 1, 2017 · Operations

How Metrics-Driven Development Transforms Software Iteration and Ops

Metrics‑Driven Development (MDD) extends test‑driven principles by embedding real‑time monitoring into design, enabling rapid, precise, and granular software iterations, improving early problem detection, decision support, and aligning development with DevOps culture.

Metrics · Observability · monitoring
0 likes · 13 min read
Efficient Ops
Feb 28, 2017 · Operations

Prepare Your E‑Commerce System for Mega‑Sales: Proactive Prevention & Rapid Response

This article outlines a comprehensive PDCA‑based methodology for e‑commerce platforms to proactively prevent issues, quickly detect anomalies, and execute rapid decisions during large‑scale promotions, covering system goal definition, performance evaluation, capacity planning, SLA management, and team/process maturity.

Operations · capacity planning · e‑commerce
0 likes · 18 min read
Meituan Technology Team
Feb 24, 2017 · Operations

Improvements and Architecture of Mt-Falcon Monitoring System

Mt‑Falcon, Meituan’s re‑engineered successor to Zabbix, introduces a modular architecture—Agent, Transfer, HBS, Judge, Graph, Alarm, Portal—and extensive refactorings that boost memory efficiency, asynchronous data handling, multi‑condition alerts, and API exposure, enabling over one million QPS, 200 million metrics, and robust, scalable monitoring across the company.

Alerting · architecture · monitoring
0 likes · 24 min read
Efficient Ops
Feb 21, 2017 · Mobile Development

How Alibaba Scales Mobile App Ops: Gray Release, Monitoring, and Rapid Fixes

This article details Alibaba's mobile app operational practices, covering the challenges of client-side maintenance, their high‑frequency release pipeline, gray‑release mechanisms, monitoring, trace systems, remote logging, and rapid issue resolution to ensure stability and performance at massive scale.

Mobile · Operations · gray release
0 likes · 21 min read
转转QA
Feb 13, 2017 · Databases

Redis Connection Pool Saturation: A Debugging Tale

A developer recounts how a Redis connection pool overflow across dozens of clusters was traced to a single misbehaving service, diagnosed with netstat and ps commands, and resolved by adjusting configuration and stopping the offending process, illustrating practical troubleshooting of connection limits.

Connection Pool · Operations · monitoring
0 likes · 4 min read
dbaplus Community
Feb 9, 2017 · Operations

Scalable Monitoring for Massive Physical & Container Clusters: JD's MDC Insights

This article shares JD’s large‑scale monitoring system (MDC) design, covering its three‑tier architecture, agent‑based data collection, performance optimizations for SNMP/IPMI, low‑overhead deployment, high‑availability strategies, and practical lessons on scaling monitoring across thousands of physical machines and containers.

JD · Performance Optimization · container monitoring
0 likes · 10 min read
dbaplus Community
Feb 6, 2017 · Operations

How JD’s CallGraph Transforms Distributed Tracing for Real‑Time Operations

CallGraph, JD.com’s in‑house distributed tracing platform, provides low‑intrusion, high‑performance monitoring for micro‑service ecosystems, enabling real‑time call‑graph analysis, TP metrics, flexible configuration, and future extensions such as deep‑learning‑driven insights.

Distributed Tracing · Log Processing · monitoring
0 likes · 15 min read
Node Underground
Jan 24, 2017 · Operations

11 Essential Practices to Master Node.js Application Monitoring

Effective Node.js monitoring boosts competitiveness, user experience, and cost efficiency, and this guide outlines eleven key recommendations—from tracking downtime and response thresholds to linking performance with business metrics and leveraging third‑party APM tools—ensuring robust, noise‑free alerts and secure, scalable applications.

APM · DevOps · Node.js
0 likes · 3 min read
Efficient Ops
Jan 22, 2017 · Operations

What 2016 Ops Teams Learned About Monitoring Tools and Alert Patterns

The 2016 Ops Alert Report reveals Zabbix’s dominance, preferred notification channels, monthly and daily alert trends, peak alert times, regional distribution, and quirky usage statistics, offering valuable insights for operations teams to optimize monitoring and incident response.

Operations · Zabbix · alerts
0 likes · 5 min read

Building a Scalable Business Monitoring System: Architecture, Modules & Lessons

This article presents a comprehensive case study of a business monitoring system, covering its background, architectural analysis, module design, time‑series database selection, visualization with Grafana, alerting strategies, decision‑making logic, and intelligent monitoring experiments, followed by key takeaways and lessons learned.

Grafana · InfluxDB · Operations
0 likes · 12 min read
Liulishuo Tech Team
Dec 31, 2016 · Cloud Native

Designing Scalable and Reliable Backend Services at English Fluently: Architecture, Service Discovery, Monitoring, and Autoscaling

This article shares the engineering team’s experience of building a high‑growth, reliable backend for English Fluently, covering inter‑service communication with gRPC, service discovery, Docker‑based deployment, health‑checking, monitoring, autoscaling, Kubernetes orchestration, and multi‑cell availability strategies.

Docker · Kubernetes · Microservices
0 likes · 10 min read
dbaplus Community
Dec 26, 2016 · Databases

How to Build a Scalable, Automated MySQL Operations Platform

This article explains how to standardize and automate MySQL management at scale, covering dedicated instance deployment, configuration consistency, multi‑instance creation, metadata collection, backup, monitoring, high‑availability with Zookeeper, and task orchestration using DBTask to achieve rapid, reliable database services.

Automation · DBTask · Database operations
0 likes · 12 min read
Alibaba Cloud Developer
Dec 26, 2016 · Operations

How Alibaba’s SunFire Powers Real‑Time Monitoring for Billion‑Scale Transactions

Alibaba’s SunFire platform delivers massive‑scale, real‑time log collection, processing, and visualization for e‑commerce spikes like Double 11, using low‑overhead agents, asynchronous Map/Reduce pipelines, fault‑tolerant task scheduling, and shared inputs to ensure accurate, low‑latency monitoring across billions of transactions.

Alibaba · Operations · Real-Time
0 likes · 18 min read
Weidian Tech Team
Dec 15, 2016 · Databases

How to Build a Scalable Automated MySQL Operations Platform

This article explains how to standardize and automate MySQL operations—including multi‑instance deployment, metadata collection, monitoring, backup, and high‑availability using Zookeeper—so that large‑scale database services can be provisioned, managed, and scaled with minimal human intervention.

Backup · Database operations · high availability
0 likes · 11 min read
Alibaba Cloud Developer
Dec 12, 2016 · Cloud Native

How Alibaba Built a Decade-Long Microservice Architecture: Challenges and Lessons

This article chronicles Alibaba's ten‑year journey from monolithic Java EE deployments to a cloud‑native microservice ecosystem, detailing the technical challenges, the evolution of its EDAS RPC frameworks, comprehensive monitoring, capacity planning, and the strategies that enabled resilient large‑scale services during massive traffic events.

Cloud Native · capacity planning · monitoring
0 likes · 11 min read
Ctrip Technology
Dec 2, 2016 · Backend Development

Challenges and Practices in Service‑Oriented Splitting of Qunar Payment System

The article details the technical challenges encountered during the service‑oriented decomposition of Qunar's payment platform, covering Dubbo and HTTP service conventions, database sharding and read/write separation, asynchronous processing, multi‑system management, and comprehensive monitoring and alerting solutions.

asynchronous processing · database sharding · monitoring
0 likes · 10 min read
ITFLY8 Architecture Home
Nov 21, 2016 · Operations

Taobao’s Scaling Secrets: Stateless Sessions, Caching, Service Splitting & Sharding

This article explains how Taobao achieves horizontal scalability by adopting stateless session handling, efficient client‑side cookie storage, multi‑level caching, service splitting with HSF, database sharding via TDDL, asynchronous messaging, unstructured data storage, and comprehensive monitoring and configuration management.

Service Splitting · caching · monitoring
0 likes · 18 min read
Efficient Ops
Nov 20, 2016 · Operations

Why Most Log‑Analysis Features Are Overrated and What Really Matters

The article critiques popular but unnecessary log‑analysis features—such as sub‑second alerts, endless pagination, flashy maps, full SQL support, bulk downloads, and live tail—arguing that focusing on practical alert content, efficient querying, and proper architecture yields far more value for IT operations.

Alerting · DSL · Data visualization
0 likes · 10 min read
ITFLY8 Architecture Home
Nov 20, 2016 · Backend Development

How Meizu Scales Real‑Time Push to 6 M Messages/min: Architecture, Pitfalls & Solutions

The article details Meizu’s real‑time push system that supports 25 million online users and 6 million messages per minute, describing its four‑layer architecture, power‑saving strategies, network‑instability fixes, massive‑connection handling, monitoring practices, and gray‑release deployment techniques.

Distributed Systems · high concurrency · monitoring
0 likes · 12 min read
Efficient Ops
Nov 17, 2016 · Operations

How Qunar Built an Automated Network Device Operations Platform to Boost Efficiency

This article explains how Qunar tackled growing network device management workload, low‑efficiency manual processes, and operational risk by designing an integrated platform that automates common tasks, enforces permission‑based controls, records audits, and provides real‑time monitoring and scalable data collection.

Automation · monitoring · permission control
0 likes · 8 min read
Efficient Ops
Nov 14, 2016 · Operations

How a Banking Card Organization Built a Scalable Cloud Operations Platform

This article details the evolution from manual, standardized operations to an automated, intelligent cloud operations platform for a banking card organization, describing its motivations, core features, key scenarios, technical architecture, scheduling algorithms, data visualization, and real‑world outcomes.

Automation · Operations Management · Service Orchestration
0 likes · 13 min read
Qunar Tech Salon
Nov 12, 2016 · Backend Development

Challenges and Solutions in Service‑Oriented Splitting of Qunar Payment System

The article examines the technical challenges encountered during the service‑oriented decomposition of Qunar's payment platform—including development efficiency, interface conventions, concurrency, security, monitoring, database sharding, read‑write separation, and asynchronous processing—and presents concrete solutions and best‑practice recommendations.

Backend Development · asynchronous processing · database sharding
0 likes · 10 min read
ITPUB
Nov 11, 2016 · Databases

Essential Oracle SQL Queries for Performance Monitoring and Troubleshooting

This guide compiles a comprehensive set of Oracle SQL statements and explanations for detecting fragmented tables, index fragmentation, high clustering factor tables, session and process mapping, DML lock analysis, DDL lock inspection, active SQL tracking, resource usage statistics, and various performance‑related metrics, helping DBAs diagnose and tune database behavior efficiently.

Administration · Oracle · Queries
0 likes · 26 min read
Architecture Digest
Nov 10, 2016 · Operations

Interview with Lu Pengcheng on Mogu Street’s Monitoring System Architecture and Evolution

In this interview, Lu Pengcheng, a platform architect at Mogu Street, discusses the company’s large‑scale e‑commerce architecture, the evolution of its monitoring platform, design choices for high‑availability distributed systems, and future open‑source plans, providing practical insights for engineers and technical managers.

C++ · Distributed Systems · Operations
0 likes · 9 min read
Nightwalker Tech
Nov 9, 2016 · Operations

Best Practices for Service Monitoring and Alerting in E‑commerce Systems

The discussion outlines essential service‑monitoring techniques—including health checks, JVM metrics, period‑over‑period ("ring‑ratio") analysis of traffic and payments, client‑side exception tracking, third‑party CDN monitoring, alert thresholds, instrumentation via AOP or SDKs, and tooling such as Datadog, Zabbix, and the Elastic stack—to reliably detect and respond to incidents in e‑commerce environments.

Alerting · e‑commerce · incident response
0 likes · 10 min read
Efficient Ops
Oct 29, 2016 · Databases

Why Your System Slows Down: Uncover Hidden Database Bottlenecks

The article explains how unnoticed database issues often cause system slowness, outlines key diagnostic questions for operations teams, and presents a three‑step approach—discover, solve, prevent—to regularly health‑check and optimize databases for reliable performance.

Operations · databases · monitoring
0 likes · 8 min read
Meituan Technology Team
Oct 28, 2016 · Big Data

Design and Architecture of the CAT Real-Time Monitoring System

The CAT real‑time monitoring system, open‑sourced in 2014 for Java applications, combines a lightweight ThreadLocal‑based client SDK, Netty‑driven asynchronous transport, and a highly scalable backend that processes ~100 TB of logs daily across 70 machines, using custom binary serialization, in‑memory modeling, segmented storage with 48‑bit indexing, and hourly aggregation to provide near‑full‑volume fault detection, localization, and performance analysis.

Distributed Systems · Java · Real-Time
0 likes · 18 min read
360 Zhihui Cloud Developer
Oct 19, 2016 · Operations

Wonder Monitoring: Scaling Ops with Open‑Falcon‑Powered Automation

This article explains how the internally built Wonder monitoring system, based on Open‑Falcon, tackles large‑scale operational challenges by offering automated agent updates, customizable metrics, log and port monitoring, persistent alarm storage, enhanced alert content, and comprehensive dashboards for thousands of devices.

Alerting · Automation · Infrastructure
0 likes · 7 min read
Efficient Ops
Oct 17, 2016 · Operations

How Shanda Games Built a Scalable Automated Operations System

This article details Shanda Games' journey in designing and implementing a comprehensive automated operations platform—including installation, deployment, security, client and server updates, data analysis, backup, and monitoring—to efficiently manage hundreds of games across diverse hardware and operating systems.

Automation · Deployment · Operations
0 likes · 22 min read
360 Zhihui Cloud Developer
Oct 14, 2016 · Operations

Mastering Rapid Incident Response: An Ops Engineer’s 4‑Step Method

This guide teaches operations engineers a four‑step “Wang‑Wen‑Wen‑Qie” methodology—observing system health, listening via alerts, questioning changes, and pulse‑checking with profiling tools—to rapidly diagnose and resolve high‑impact incidents while maintaining clear communication and post‑mortem learning.

Operations · incident response · monitoring
0 likes · 5 min read
360 Zhihui Cloud Developer
Sep 18, 2016 · Artificial Intelligence

How Linear Regression Can Tame Your Nighttime Alert Fatigue

This article explores how historical monitoring alerts can be analyzed and predicted using linear regression, guiding operations engineers to preprocess data, build regression models, and forecast future alert trends to reduce manual alarm handling and improve system stability.

Operations · alert prediction · linear regression
0 likes · 8 min read
Architecture Digest
Sep 7, 2016 · Backend Development

Design and Maintenance of High‑Peak E‑Commerce Systems for Traditional Enterprises

The article examines common pitfalls and best‑practice solutions for traditional enterprises building e‑commerce platforms that must handle traffic spikes, covering large‑scale query optimization, distributed architecture, database design, service degradation strategies, and comprehensive monitoring and operations.

caching · e‑commerce · monitoring
0 likes · 13 min read
High Availability Architecture
Aug 30, 2016 · Operations

Evolution of Meizu Flyme Operations Architecture and High‑Availability Practices

The article details Meizu's Flyme operations platform evolution—from a single‑cabinet setup in 2011 to a multi‑IDC, 6000‑server infrastructure—highlighting challenges, architectural upgrades, monitoring, cost control, automation, and future high‑availability directions for large‑scale internet services.

Infrastructure · cost control · high availability
0 likes · 13 min read
Efficient Ops
Aug 22, 2016 · Cloud Computing

Mastering OpenStack Monitoring: Key Metrics and Best Practices

This article explains what OpenStack is, outlines its core modules, and details the most important monitoring metrics for Nova, Neutron, Keystone, hypervisors, tenants, and RabbitMQ, helping engineers build a robust, scalable OpenStack monitoring solution.

Metrics · NOVA · OpenStack
0 likes · 11 min read
Efficient Ops
Aug 21, 2016 · Operations

How to Build a Standardized CI/CD Pipeline for Enterprise Delivery

This article explains how enterprises can overcome manual deployment challenges by standardizing their tech stack, defining mutable and immutable deployment modes, designing an XY‑axis model for system components, and implementing continuous integration, delivery, feedback, and monitoring using tools like Jenkins.

Continuous Delivery · DevOps · Jenkins
0 likes · 21 min read
ITPUB
Aug 21, 2016 · Backend Development

How to Diagnose and Prevent Redis Data Loss in Production

This article examines common causes of Redis data loss, walks through a real‑world incident where 90,000 keys vanished, and provides concrete monitoring, configuration, and operational safeguards to detect and avoid such failures.

Backend · Data loss · Operations
0 likes · 11 min read
ITPUB
Aug 18, 2016 · Backend Development

Step-by-Step Guide to Building a Django‑Based Operations Platform

This article fills the gap of detailed tutorials by chronologically documenting how the author constructed a Django‑powered operations platform, adding monitoring, dashboards, log viewing, task submission, and various management features, each illustrated with screenshots and brief explanations.

Backend Development · Django · Python
0 likes · 8 min read
Baidu Intelligent Testing
Aug 16, 2016 · Mobile Development

Building a Comprehensive Monitoring System for Mobile Apps: Problem Discovery, Localization, and Damage Control

This article explains how to design a complete mobile app monitoring framework that covers problem discovery through key quality metrics and user feedback, systematic log instrumentation, effective issue localization methods, and rapid damage‑control strategies such as cloud‑based feature toggles and hot‑fix mechanisms.

Mobile · crash analysis · logging
0 likes · 12 min read
Ctrip Technology
Aug 12, 2016 · Big Data

Ctrip's Real-Time Data Platform: Architecture, Practices, and Lessons Learned

This article details Ctrip's journey building a unified real-time data platform—covering business motivations, architectural requirements, technology choices like Kafka and Storm, implementation of Avro schemas, monitoring, alerting, operational lessons, and future explorations such as Streaming CQL and JStorm.

Alerting · Big Data · Kafka
0 likes · 15 min read
Baidu Intelligent Testing
Aug 11, 2016 · Mobile Development

Common Mobile App Quality Issues and Monitoring Challenges

The article outlines typical mobile app quality problems such as adaptation, user experience, and traffic consumption, discusses their characteristics and impact assessment, and emphasizes the need for a comprehensive monitoring system to quickly detect, locate, and mitigate issues in production.

Mobile · User experience · app quality
0 likes · 8 min read
Architecture Digest
Aug 7, 2016 · Operations

Designing a Three‑Dimensional High‑Availability Architecture for Alibaba's Game Integration System

The article describes how Alibaba's game integration platform achieved business‑oriented high availability by abandoning traditional system‑centric designs and implementing a three‑dimensional architecture that combines clear HA goals, multi‑active deployment, client‑side retries, functional isolation, automated monitoring, and rapid fault recovery, ultimately meeting a 3‑minute issue‑location and 5‑minute business‑recovery target.

Operations · System Architecture · business‑oriented HA
0 likes · 21 min read
Architecture Digest
Jul 30, 2016 · Frontend Development

Evolution and Architecture of Taobao Home Page: From PHP to Node, Performance Optimization, Stability, and Agile Operations

This article details the evolution of Taobao's home page over a year and a half, covering its background, migration from PHP to Node, modular architecture, performance tuning, stability mechanisms, and agile operational practices that keep a billion‑scale front‑end service reliable and fast.

CDN · monitoring · nodejs
0 likes · 18 min read
Architecture Digest
Jul 19, 2016 · Operations

Designing a Multi‑Dimensional High‑Availability Architecture for a Game Access System

The article presents a business‑oriented, three‑layer high‑availability architecture for a large‑scale game access platform, detailing measurable goals, client‑side retry with HTTP‑DNS, functional separation and degradation, multi‑region active‑active deployment, and automated, visual monitoring to achieve rapid fault detection, isolation, and recovery.

Operations · distributed-systems · fault-tolerance
0 likes · 20 min read