Tagged articles
2179 articles
Qunhe Technology Quality Tech
Jun 8, 2021 · Cloud Native

How We Stabilized International Services with a Multi‑Phase Cloud‑Native Migration

This article details a four‑stage migration project that rebuilt international services on a cloud‑native stack, introducing temporary Istio monitoring, standardized change processes, Helm‑based deployments, and full microservice integration while sharing practical quality‑assurance lessons and pitfalls.

Deployment, Istio, cloud-native
0 likes · 14 min read
ByteFE
Jun 8, 2021 · Frontend Development

Design and Implementation of Monitor and Monitor‑Tracer SDKs for Frontend Event Tracking

This article explains the architecture of a complete event‑tracking system, introduces two kinds of front‑end events, and details the technical design and implementation of the monitor‑tracer SDK for page visibility/active time as well as the monitor SDK for custom trigger events, including lifecycle monitoring, DOM observation, decorators, and React hooks.

Page lifecycle, event tracking, monitoring
0 likes · 27 min read
IT Architects Alliance
Jun 7, 2021 · Industry Insights

How WeChat Scales: Agile Practices and Architecture Behind Billions of Users

The article analyzes WeChat's success by detailing its three‑pronged strategy of precise product timing, agile project management, and robust technical support, and explains how the team applies agile attitudes, modular design, extensible protocols, disaster‑recovery mechanisms, and fine‑grained monitoring to operate a massive, highly available system.

Agile Development, WeChat, industry insights
0 likes · 18 min read
Dada Group Technology
Jun 4, 2021 · Databases

JD Daojia MySQL Containerization: Architecture, Implementation, and Operational Practices

This article presents JD Daojia's practice of containerizing MySQL, detailing the underlying resource platform, custom container scheduling algorithm, high‑availability design, monitoring system, and an automated operations platform that together improve performance, cut costs, and boost operational efficiency.

automation, containerization, monitoring
0 likes · 9 min read
Youzan Coder
Jun 4, 2021 · Operations

How to Tame Massive Product‑Sync Traffic in a Multi‑Store Chain System

This article analyzes the stability challenges of a multi‑store chain’s product‑copy mechanism, outlines design goals for isolation and scalability, and presents short‑ and long‑term monitoring, flow‑control, and emergency‑response strategies to ensure reliable large‑scale operations.

Flow Control, Operations, Scalability
0 likes · 12 min read
Ops Development Stories
Jun 4, 2021 · Operations

Step-by-Step Guide to Monitoring MongoDB with Zabbix Agent 2

This tutorial explains how to use Zabbix Agent 2 to monitor MongoDB databases and clusters, covering the required read‑only user setup, relevant Zabbix templates, key metrics such as jumbo chunks, connection pool stats, server status, collection and replSet information, and practical configuration examples.

Agent2, MongoDB, Operations
0 likes · 6 min read
Big Data Technology Architecture
Jun 2, 2021 · Big Data

Practical Operations of NetEase Big Data Platform: Architecture, EasyOps, Monitoring, and Experience Sharing

The presentation details NetEase's big data platform operations, covering current usage, the internally built EasyOps control system, a generic service‑operation framework based on Ansible, Prometheus‑Grafana monitoring, configuration management, network and storage optimizations, and lessons learned from cloud migration.

Ansible, Big Data, EasyOps
0 likes · 9 min read
JD Retail Technology
May 31, 2021 · Operations

JD Health Technical Product Team's 2021 618 Promotion Preparation: Architecture Review, Performance Tuning, Security Drills, and Monitoring

The JD Health Technical Product Team organized a comprehensive 618 promotion preparation in 2021, covering system architecture reviews, capacity assessments, performance stress testing, offensive‑defensive drills, system optimization, 24‑hour monitoring, and product operation training to ensure high availability and stable service during the large‑scale sales event.

Performance Testing, System Architecture, e‑commerce
0 likes · 10 min read
Aikesheng Open Source Community
May 28, 2021 · Databases

Comprehensive MySQL Inspection Checklist and Command Reference

This guide presents a detailed MySQL inspection checklist covering operating‑system metrics, critical MySQL parameters, status queries, replication health, high‑availability components, and useful SQL scripts, enabling DBAs to efficiently monitor performance, detect issues, and maintain reliable database services.

Replication, high availability, inspection
0 likes · 11 min read
Amap Tech
May 28, 2021 · Operations

System Observability Practices in Gaode Ride-Hailing: From Unified Logging to Fault Defense

Gaode Ride‑Hailing created a comprehensive 360° observability platform—standardized logging, distributed tracing, multi‑domain metrics, visual dashboards, and an incident workflow—that transforms raw data into actionable insights, accelerates root‑cause analysis, and enables automated fault defense for its large‑scale cloud‑native microservice system.

Distributed Systems, fault tolerance, logging
0 likes · 22 min read
TAL Education Technology
May 27, 2021 · Big Data

Big Data Monitoring System: Architecture, Basic and Advanced Monitoring, and Alert Convergence & Grading

This article outlines the challenges of operating petabyte‑scale big‑data clusters and presents a comprehensive monitoring framework—including basic and upgraded monitoring layers, metric collection, alerting pipelines, and strategies for alarm convergence and grading—to ensure reliable, proactive SRE operations.

Alerting, Grafana, Operations
0 likes · 12 min read
Liangxu Linux
May 27, 2021 · Operations

How I Built an Automated Redis Sentinel to Seamlessly Handle Failover

A sysadmin narrates how he monitors four Redis nodes, detects master failure with PING, promotes a slave using SLAVEOF, reconfigures the remaining replicas, and ultimately automates the entire process with a custom Sentinel program and a multi‑node Sentinel cluster for high availability.

Operations, automation, c++
0 likes · 11 min read
Liulishuo Tech Team
May 26, 2021 · Operations

Custom Prometheus Monitoring Architecture and GitOps Practices at Liulishuo

This article details Liulishuo's customized Prometheus monitoring architecture, including data backup to Aliyun SLS, ECS service discovery, advanced alerting with PagerDuty and Goalert, GitOps-driven config management, cloud resource exporters, SLA monitoring, and future plans for storage and alert pipelines.

Alerting, cloud-native, monitoring
0 likes · 9 min read
IT Architects Alliance
May 24, 2021 · Cloud Native

What Is Microservices? A Complete Guide to Architecture, Benefits, and Tools

This article provides a comprehensive overview of microservices, explaining their definition, core characteristics, advantages and drawbacks, suitable organizational contexts, and the essential technical components such as service discovery, API gateways, configuration centers, communication protocols, monitoring, circuit breaking, and container orchestration platforms.

Cloud Native, Configuration Center, Microservices
0 likes · 19 min read
New Oriental Technology
May 24, 2021 · Operations

Overview of SkyWalking UI: Dashboard, Topology, Tracing, Profiling, and Alerts

The article provides a comprehensive English overview of SkyWalking UI, detailing its dashboard metrics, topology visualization, trace analysis, performance profiling workflow, and alarm management, illustrating how the tool monitors microservice and cloud‑native environments with metrics such as throughput, latency, Apdex, and JVM statistics.

APM, Distributed Tracing, SkyWalking
0 likes · 11 min read
dbaplus Community
May 18, 2021 · Operations

Mastering End‑to‑End Monitoring: From Purpose to Prometheus Implementation

This guide explains why monitoring is essential throughout a product lifecycle, outlines monitoring modes and methods, compares health checks, logs, tracing and metric solutions, and provides a detailed Prometheus‑based monitoring architecture with concrete metric definitions, alerting rules, and incident‑response procedures.

Alerting, Metrics, Operations
0 likes · 25 min read
ITPUB
May 17, 2021 · Operations

How DBSCAN Clustering and Bayesian Inference Boost Root‑Cause Detection in Securities Trading Systems

This article describes how a Chinese securities firm applied big‑data‑driven clustering and Bayesian methods to automate root‑cause analysis of trading‑system anomalies, detailing the challenges, algorithmic designs, practical implementations, and evaluation results that demonstrate significant reductions in false alarms and faster recovery.

Bayesian inference, Operations, Root Cause Analysis
0 likes · 17 min read
Xianyu Technology
May 13, 2021 · Frontend Development

Front-End Disaster Recovery for Page Stability

To prevent page failures and white‑screen errors, the team built a front‑end SDK that fetches fallback data from OSS + CDN, offers configurable black/white‑list rules, lightweight validation, and a visual backend, cutting error rates from over 8% to 0.55% and dramatically improving interface stability.

CDN, OSS, SDK
0 likes · 9 min read
JD Cloud Developers
May 11, 2021 · Cloud Native

How JD Built a Cloud‑Native Monitoring & Logging System for Massive Scale

This article explains the fundamental differences between traditional and cloud‑native monitoring systems, outlines the challenges each faces, and details JD.com's evolution from physical servers to JDOS 2.0, describing its modular architecture, deployment model, and ongoing optimization efforts.

JD.com, Operations, architecture
0 likes · 10 min read
Dada Group Technology
May 7, 2021 · Operations

How JD Daojia Built a Scalable Load‑Testing Platform to Reduce Test Time to 15 Minutes

Facing rising traffic, JD Daojia redesigned its in‑house load‑testing platform to automate script management, enable distributed JMeter execution, integrate real‑time monitoring, and support custom RPC protocols, reducing manual effort and cutting test cycles from an hour to fifteen minutes while preserving system stability.

Distributed Systems, JMeter, Load Testing
0 likes · 12 min read
macrozheng
May 6, 2021 · Operations

How I Built an Automated Redis Sentinel System to Handle Failover

An operations engineer narrates how he monitors a four‑node Redis cluster, detects master failure with continuous PINGs, promotes a slave to master, reconfigures replicas, and automates the entire process with a sentinel program and a sentinel cluster for high availability.

automation, failover, monitoring
0 likes · 11 min read
Youzan Coder
Apr 30, 2021 · Backend Development

Uncovering Hidden Flaws in a Distributed Lock Implementation: A Structured Code Review

This article examines the business context and collection workflow of a BOS‑integrated service, dissects the distributed lock logic and its execution sequence, and conducts a thorough structured code review that reveals logical, exception, non‑functional, and testability issues while offering concrete improvement recommendations.

Backend, Code review, best practices
0 likes · 8 min read
MaGe Linux Operations
Apr 29, 2021 · Operations

Essential Kubernetes Production Best Practices for Reliable Ops

This article outlines essential Kubernetes best‑practice guidelines for production environments, covering health probes, resource allocation, RBAC, cluster configuration, networking policies, monitoring, logging, stateless design, autoscaling, runtime security, and strategies for zero‑downtime and failure recovery.

Kubernetes, Operations, monitoring
0 likes · 12 min read
Programmer DD
Apr 27, 2021 · Cloud Native

Top Open‑Source Tools Every SRE Should Master for Scalable, Reliable Systems

This article surveys the most popular open‑source projects for Site Reliability Engineering and DevOps, covering monitoring, deployment, chaos testing, and observability tools such as Cloudprober, Istio, Prometheus, Litmus, and more, highlighting their key features and how they help build scalable, high‑reliability cloud‑native systems.

DevOps, Kubernetes, SRE
0 likes · 11 min read
Big Data Technology & Architecture
Apr 26, 2021 · Operations

Comprehensive Guide to Prometheus: Installation, Configuration, PromQL, Exporters, Grafana, and Alerting

This article provides a complete tutorial on Prometheus, covering its origins, core features, installation methods (binary and Docker), configuration file structure, PromQL basics, HTTP API usage, Grafana integration, various exporters for metrics collection, and alerting with Alertmanager, all within a cloud‑native monitoring context.

Alerting, Exporters, Grafana
0 likes · 32 min read
Top Architect
Apr 25, 2021 · Operations

Linux Server Monitoring: CPU, Memory, Disk I/O, and Network Tools Overview

This article introduces essential Linux monitoring utilities—including top, vmstat, pidstat, iostat, netstat, sar, and tcpdump—explaining their output fields, typical usage scenarios, and how to interpret CPU, memory, disk, and network performance metrics for effective system troubleshooting and optimization.

Linux, System Tools, monitoring
0 likes · 17 min read
iQIYI Technical Product Team
Apr 23, 2021 · Cloud Native

ByteDance Stateful Application Cloud‑Native Practices

ByteDance’s cloud‑native migration of stateful services uses a custom SolarService extending StatefulSet with Budset CRD to handle versioned data, shard‑aware routing, NUMA‑aware scheduling, advanced storage, eBPF monitoring, and automated PDB eviction, delivering efficiency, cost savings, and reliable rolling upgrades.

Kubernetes, Scheduling, automation
0 likes · 18 min read
Alibaba Cloud Native
Apr 22, 2021 · Cloud Native

How KubeProbe Enables Early Problem Detection in Large‑Scale Cloud‑Native Clusters

This article explains how Alibaba's KubeProbe system combines black‑box probing and directed inspections to detect issues in massive ASI Kubernetes clusters before users notice them, detailing the architecture, implementation, integration with release pipelines, and real‑world results that improve reliability and operational efficiency.

KubeProbe, Kubernetes, cloud-native
0 likes · 17 min read
Python Crawling & Data Mining
Apr 22, 2021 · Databases

MongoDB Mastery: Install, Configure, and Perform CRUD

This comprehensive tutorial walks you through installing MongoDB on Windows, configuring data and log directories, setting environment variables, creating and managing databases, collections, indexes, aggregation pipelines, backup and restore procedures, monitoring tools, advanced query operators, user management, and using a visual tool like Navicat for MongoDB.

Backup, CRUD, Installation
0 likes · 15 min read
vivo Internet Technology
Apr 21, 2021 · Operations

System Health Check: Principles and Implementation

System health checks, akin to medical exams, are vital for modern IT infrastructure, using active and passive monitoring, failover strategies, and tools like Spring Boot Actuator to detect hardware, network, load, or software issues, prevent single points of failure, and ensure continuous high‑availability service operation.

Network Reliability, RocketMQ, Spring Boot Actuator
0 likes · 12 min read
Efficient Ops
Apr 20, 2021 · Operations

How Dada’s Intelligent Elastic Scaling Cuts Costs and Boosts Delivery Performance

This article details Dada Group’s implementation of an intelligent elastic scaling architecture that automatically adjusts capacity during peak promotions and low‑traffic periods, improving delivery reliability, reducing costs, and supporting multi‑cloud and multi‑runtime environments through sophisticated monitoring and auto‑scaler mechanisms.

Auto Scaling, Operations, capacity management
0 likes · 17 min read
Code Ape Tech Column
Apr 20, 2021 · Cloud Native

A Beginner's Guide to Designing, Implementing, and Deploying Microservices on Kubernetes

This article provides a comprehensive step‑by‑step tutorial on designing, implementing, and deploying a simple Java Spring Boot microservice system on Kubernetes, covering architecture design, registration center, monitoring with Prometheus and Grafana, logging, tracing, flow control, and verification using tools such as Zipkin and Sentinel.

Deployment, Kubernetes, Microservices
0 likes · 18 min read
DevOps Cloud Academy
Apr 14, 2021 · Operations

Six Core DevOps Capabilities That Drive Business Success

The article outlines the six essential DevOps capabilities—collaboration, automation, continuous integration, continuous testing, continuous delivery, and continuous monitoring—explaining how each contributes to improving efficiency, reducing errors, and ensuring reliable software delivery in modern IT environments.

Collaboration, Continuous Delivery, DevOps
0 likes · 5 min read
Architect
Apr 10, 2021 · Cloud Native

A Beginner’s Guide to Building High‑Availability Microservices on Kubernetes

This article walks readers through the complete lifecycle of designing, implementing, deploying, and validating a simple Java Spring‑Boot microservice system on Kubernetes, covering service design, registration, monitoring, tracing, traffic control, high‑availability deployment, and practical verification steps.

cloud-native, monitoring, spring-boot
0 likes · 20 min read
ITPUB
Apr 7, 2021 · Operations

8 Real-World Production Failures and How to Diagnose Them Quickly

The article shares eight authentic production incident cases—from frequent JVM Full GC and memory leaks to cache avalanches, DNS hijacking, and database deadlocks—detailing their root causes, diagnostic steps, code snippets, and practical remediation strategies for engineers facing similar challenges.

Cache, JVM, Operations
0 likes · 17 min read
Full-Stack Internet Architecture
Apr 2, 2021 · Operations

Understanding Redis Sentinel: High‑Availability Mechanism and Automatic Failover

This article explains how Redis Sentinel provides high‑availability for Redis by continuously monitoring master and replica nodes, detecting failures through subjective and objective down states, electing a new master via quorum‑based voting, and notifying clients of the failover using Pub/Sub events.

Replication, failover, high availability
0 likes · 19 min read
Architecture Digest
Apr 2, 2021 · Backend Development

Understanding the Essence of Architecture: A Deep Dive into Weibo’s Large‑Scale System Design

The article explores the fundamental concepts of software architecture, illustrating how massive platforms like Weibo handle millions of users through layered design, service decomposition, multi‑level caching, distributed tracing, and capacity planning to achieve high scalability and reliability.

Backend, Distributed Systems, architecture
0 likes · 21 min read
Sohu Tech Products
Mar 31, 2021 · Operations

Improving Cluster Stability: CI/CD, Monitoring, Logging, Documentation, and Traffic Management Solutions

The article analyzes the instability of a company's Kubernetes clusters, identifies root causes such as unstable release processes, lack of monitoring, logging, and documentation, and proposes comprehensive solutions including a Kubernetes‑centric CI/CD pipeline, federated Prometheus monitoring, Elasticsearch logging, centralized documentation, and integrated traffic management with Kong and Istio.

DevOps, Kubernetes, Operations
0 likes · 10 min read
HelloTech
Mar 26, 2021 · Big Data

Data Quality and Interface Semantic Monitoring for Algorithm Testing Platform

The article describes how algorithm testing teams tackled data‑quality and interface‑semantic monitoring problems by building a unified business monitoring platform that checks table, storage and service consistency, validates response semantics, and, through dashboards, alerts and correction tools, quickly identified dozens of offline and online issues, guiding future reliability enhancements.

AI, Big Data, Data Quality
0 likes · 26 min read
Top Architect
Mar 26, 2021 · Operations

Top Open‑Source Projects for SREs and DevOps

This article presents a curated list of popular open‑source tools for monitoring, deployment, chaos testing, and reliability engineering, explaining their main features and how they help SREs and DevOps engineers build scalable, highly available cloud‑native systems.

Cloud Native, DevOps, SRE
0 likes · 10 min read
FunTester
Mar 18, 2021 · Operations

Real‑Time QPS Monitoring and Asynchronous Progress Display for Load Tests

This article explains how to enrich a performance‑testing framework with an asynchronous progress bar that also reports real‑time QPS, detailing the design of a unified Progress class, its handling of different thread models, the QPS calculation logic, a sample Groovy test script, and the resulting console output.

Load Testing, Performance Testing, QPS
0 likes · 12 min read
IT Architects Alliance
Mar 14, 2021 · Backend Development

Evolution and Performance Optimization of Ximalaya’s HTTP Gateway: From Tomcat NIO to Netty Full‑Async Architecture

This article describes how Ximalaya’s high‑traffic HTTP gateway evolved from a Tomcat NIO + AsyncServlet design to a Netty‑based fully asynchronous architecture, detailing the challenges of blocking I/O, memory copying, GC pressure, and how layered redesign, lock‑free connection pools, comprehensive monitoring, and performance optimizations enabled stable handling of over 200 billion daily calls with peak QPS exceeding 40 k per machine.

Asynchronous, Netty, gateway
0 likes · 15 min read
iQIYI Technical Product Team
Mar 12, 2021 · Operations

Implementation and Practice of LEDAO‑CAT Monitoring System for iQIYI Content Platform

To meet the LEDAO platform’s need for rapid anomaly detection, full‑stack observability, and reliable alerting across more than 100 microservices, iQIYI evaluated OpenFalcon, Prometheus and CAT, selected CAT, deployed separate mainland and overseas clusters, added configurable access, health‑check and integrated alert channels, enabling five‑minute service onboarding, near‑zero‑intrusion instrumentation, and real‑time business‑level monitoring.

Alerting, CAT, DevOps
0 likes · 12 min read
Sohu Tech Products
Mar 10, 2021 · Databases

Elasticsearch Deployment Best Practices: Memory, CPU, Sharding, Replicas, Hot/Warm Architecture, Node Roles, Monitoring and Troubleshooting

This article presents practical best‑practice guidelines for configuring Elasticsearch in production, covering heap memory sizing, CPU considerations, shard and replica planning, hot‑warm node architecture, node role settings, common pitfalls, monitoring APIs, and troubleshooting tips.

Cluster Tuning, Elasticsearch, Memory Management
0 likes · 15 min read
Zhuanzhuan Tech
Mar 10, 2021 · Backend Development

Service Governance Architecture and Practices at Zhuanzhuan

This article explains how Zhuanzhuan’s service management platform implements comprehensive service governance—including registration, discovery, configuration, monitoring, authentication, rate limiting, and alerting—to support micro‑service architectures and improve reliability, scalability, and operational efficiency.

Configuration Management, Microservices, monitoring
0 likes · 10 min read
IT Architects Alliance
Mar 9, 2021 · Backend Development

Understanding the Essence of Architecture and Scaling Strategies for Billion‑User Systems

This article explores the fundamental concepts of system architecture, illustrating how large‑scale services like Weibo handle massive traffic through layered design, sharding, caching, service decomposition, monitoring, and operational practices to achieve high performance and reliability.

Distributed Systems, Microservices, Scalability
0 likes · 20 min read
Alibaba Cloud Developer
Mar 8, 2021 · Operations

How to Ensure System Stability During Massive Sales Events: A Complete SRE Playbook

This article presents a comprehensive, step‑by‑step framework for guaranteeing system reliability during high‑traffic promotional periods, covering SRE hierarchy, stability criteria, profiling, monitoring, capacity planning, incident response, and post‑event analysis to help teams build resilient services.

SRE, capacity planning, incident response
0 likes · 21 min read
Top Architect
Mar 6, 2021 · Operations

Spring Boot Monitoring with Prometheus and Grafana: A Step‑by‑Step Guide

This article provides a comprehensive tutorial on setting up Spring Boot application monitoring using Prometheus and Grafana, covering project creation, dependency configuration, security setup, Prometheus server installation, Grafana dashboard creation, email alerting configuration, and testing the end‑to‑end alert workflow.

Alerting, Backend, Spring Boot
0 likes · 10 min read
IT Architects Alliance
Mar 5, 2021 · Backend Development

Understanding the Essence of System Architecture: Insights from Weibo’s Large‑Scale Design

The article explores the fundamental concepts of system architecture, illustrating how large‑scale services like Weibo handle massive traffic through layered design, abstraction, caching, service decomposition, monitoring, and operational practices to achieve scalability, reliability, and performance.

Backend, Distributed Systems, Scalable Design
0 likes · 20 min read
Architects' Tech Alliance
Mar 3, 2021 · Backend Development

Understanding the Essence of Architecture: Insights from Weibo’s Large‑Scale System Design

This article explores the fundamental principles of system architecture by analyzing Weibo’s evolution to a multi‑layer, high‑traffic platform, covering scalability, service decomposition, caching strategies, distributed tracing, and operational best practices for building robust backend systems.

Scalability, Weibo, caching
0 likes · 21 min read
21CTO
Mar 1, 2021 · Backend Development

How to Build a Scalable WebSocket Long‑Connection Gateway with Netty

This article explains the challenges of server‑push in HTTP, reviews WebSocket as the mainstream solution, and details the design, implementation, session management, monitoring, and performance testing of a Netty‑based distributed WebSocket long‑connection gateway used at iQIYI.

Netty, WebSocket, distributed architecture
0 likes · 12 min read
DataFunTalk
Mar 1, 2021 · Artificial Intelligence

Online Learning and Real‑Time Model Updating in JD Retail Search Using Flink

The article describes JD's end‑to‑end online learning pipeline for retail search, covering the background, system architecture, real‑time feature collection, sample stitching, Flink‑based incremental training, parameter updates, and full‑link monitoring to achieve low‑latency, high‑accuracy model serving.

Flink, Model Serving, Online Learning
0 likes · 9 min read
Didi Tech
Feb 25, 2021 · Industry Insights

Why DiDi’s Obsuite Is Redefining Hybrid‑Cloud Observability

Obsuite, DiDi’s open‑source observability suite, tackles hybrid‑cloud monitoring challenges by combining metrics, logs, and traces. The article analyzes market trends and private‑cloud demand, then covers the product’s architecture, open‑source components, and the OCE certification program for enterprise users.

Log Management · Metrics · hybrid cloud
0 likes · 6 min read
ITPUB
Feb 23, 2021 · Backend Development

Building a Complete Backend Stack for Startups: Languages, Services, and Tools

This guide walks through the essential layers of a backend technology stack for startups, covering language choices, core components like DNS, load balancing, CDN, RPC frameworks, databases, messaging, logging, monitoring, configuration, deployment, and operational best‑practice processes.

DevOps · Messaging · architecture
0 likes · 31 min read
High Availability Architecture
Feb 22, 2021 · Backend Development

Evolution and Performance Optimization of Ximalaya's High‑Throughput HTTP Gateway

This article details the design evolution, architectural redesign, and performance‑tuning techniques of Ximalaya's gateway—from an initial Tomcat NIO implementation to a fully asynchronous Netty‑based solution—covering traffic management, timeout handling, monitoring, and future HTTP/2 migration.

Asynchronous · HTTP · monitoring
0 likes · 16 min read
Top Architect
Feb 7, 2021 · Cloud Native

A Comprehensive Guide to Docker Tools and the Container Ecosystem

This article provides an extensive overview of the most popular Docker‑related tools—including orchestration, CI/CD, monitoring, logging, security, storage, networking, service discovery, image building, and management solutions—detailing their core features, official websites, pricing models, and typical use cases for developers and operations engineers.

Container · DevOps · Docker
0 likes · 23 min read
Efficient Ops
Feb 1, 2021 · Operations

How to Detect Anomalous Nodes in Massive Compute Clusters Using Intelligent Ops

This article explains how internet companies can reduce soaring manual operations costs by applying intelligent monitoring techniques—such as pattern recognition and statistical anomaly detection—to automatically identify abnormal nodes among thousands of servers, streamline fault diagnosis, and improve service quality.

Operations · anomaly detection · large-scale systems
0 likes · 4 min read
Open Source Linux
Jan 29, 2021 · Operations

Essential Kubernetes Production Best Practices for Secure, Scalable Ops

This article outlines comprehensive production‑grade Kubernetes best practices—including health probes, RBAC, resource management, network policies, monitoring, autoscaling, image security, and zero‑downtime strategies—to help teams run secure, efficient, and highly available workloads.

Kubernetes · Operations · autoscaling
0 likes · 11 min read
DevOps Cloud Academy
Jan 25, 2021 · Cloud Native

Blackbox Monitoring with Prometheus Blackbox Exporter in Kubernetes

This guide explains how to complement Prometheus white‑box monitoring with black‑box probes by deploying the Blackbox Exporter in a Kubernetes cluster, configuring ConfigMaps, Deployments, Services, and Prometheus scrape jobs for HTTP, DNS, TCP, and ICMP checks, and using annotations for automatic service discovery.

Blackbox Exporter · Prometheus · monitoring
0 likes · 10 min read
Efficient Ops
Jan 20, 2021 · Operations

Log vs Network Data: Which Wins Full‑Link Monitoring in Modern Distributed Systems?

With the shift from monolithic to distributed architectures, this article compares log‑based and network‑data‑based monitoring across data sources, precision, monitoring paths, and implementation methods, concluding that network‑data monitoring offers superior real‑time insight, lower cost, and faster deployment for full‑link observability.

full‑link · log analysis · monitoring
0 likes · 11 min read
转转QA
Jan 19, 2021 · Operations

Comprehensive Full-Link Performance Testing Process and Practices for E-commerce Platforms

This article details a systematic full‑link performance testing workflow—including background, timing, scenario design, data preparation, capacity planning, monitoring, issue analysis, and post‑test cleanup—aimed at reliably evaluating and scaling e‑commerce services during major promotional events.

Operations · Performance Testing · capacity planning
0 likes · 18 min read
Architecture Digest
Jan 17, 2021 · Operations

System Performance Issue Analysis and Optimization Process

This article outlines a comprehensive process for diagnosing and optimizing performance problems in production business systems, covering hardware, OS, database, middleware, JVM tuning, code inefficiencies, monitoring tools, and the limitations of pre‑release testing, with practical guidelines and visual references.

APM · Database Tuning · System optimization
0 likes · 16 min read
Programmer DD
Jan 15, 2021 · Operations

Why Does Prometheus Sometimes Fail to Trigger Alerts?

This article explains why Prometheus alerts may not fire or may fire unexpectedly, covering the role of the for parameter, sampling intervals, Grafana range queries, and practical steps to diagnose and fix alerting issues.

Alerting · Grafana · Ops
0 likes · 7 min read
Didi Tech
Jan 14, 2021 · Cloud Computing

Design and Implementation of Didi's Logi‑KafkaManager Multi‑tenant Kafka Cloud Platform

Didi’s Logi‑KafkaManager is a multi‑tenant Kafka cloud platform that consolidates dozens of clusters into a secure, isolated, gateway‑driven service. It offers web‑based topic management, real‑time metrics visualization, automated diagnostics, quota governance, and safe scaling, and has achieved high internal satisfaction and enterprise commercialization.

Big Data · Kafka · cloud platform
0 likes · 17 min read
HelloTech
Jan 13, 2021 · Mobile Development

Hybrid Container Optimization for Puhui Ride‑Hailing Business

The report details a comprehensive overhaul of Puhui’s hybrid container for ride‑hailing services, introducing full‑link tracing, container reuse, offline resource caching, WebP image conversion, and pre‑fetching, which together cut first‑screen load times by over 70% and raise the 1‑second open rate from 12% to 91%.

Hybrid Container · Mobile Frontend · WebView
0 likes · 19 min read
ITPUB
Jan 13, 2021 · Operations

How to Diagnose and Optimize Business System Performance Issues

This article outlines a step‑by‑step approach for identifying root causes of performance bottlenecks in production business systems, covering common scenarios such as high concurrency, data growth, hardware limits, database and middleware tuning, code inefficiencies, and the role of monitoring and APM tools.

APM · Database Tuning · JVM
0 likes · 15 min read
Alibaba Terminal Technology
Jan 7, 2021 · Frontend Development

How Frontend Chaos Engineering Boosts Reliability: Lessons from Alibaba

This article explores the challenges of frontend availability, introduces chaos engineering concepts, and details Alibaba's practical approach to frontend fault injection—including static resource hijacking, a safe isolated environment, monitoring integration, and a real‑world drill that demonstrates how to measure and improve detection and response capabilities.

Fault Injection · Reliability · chaos engineering
0 likes · 18 min read
Ops Development Stories
Jan 7, 2021 · Operations

Master Blackbox Exporter: Install, Configure, and Alert with Prometheus

This guide walks through the concepts of white‑box vs black‑box monitoring, explains Prometheus Blackbox Exporter capabilities, shows step‑by‑step installation, Kubernetes configuration, probe definitions for HTTP, TCP, ICMP and SSL, and provides ready‑to‑use alert rules and Grafana dashboard integration.

Alerting · Blackbox Exporter · Kubernetes
0 likes · 11 min read
Alibaba Terminal Technology
Jan 6, 2021 · Frontend Development

Mastering Gray-Scale Cross-Platform Monitoring for Front-End Apps

This article explains the background, technical architecture, real‑world case, and key takeaways of implementing gray‑scale monitoring across web, Weex, mini‑programs, and other cross‑platform front‑end solutions to improve issue detection and reduce mean time to recovery.

Operations · cross‑platform · frontend
0 likes · 10 min read