Tagged articles

2179 articles

Jun 8, 2021 · Cloud Native

How We Stabilized International Services with a Multi‑Phase Cloud‑Native Migration

This article details a four‑stage migration project that rebuilt international services on a cloud‑native stack, introducing temporary Istio monitoring, standardized change processes, Helm‑based deployments, and full microservice integration while sharing practical quality‑assurance lessons and pitfalls.

Deployment · Istio · cloud-native

0 likes · 14 min read

ByteFE

Jun 8, 2021 · Frontend Development

Design and Implementation of Monitor and Monitor‑Tracer SDKs for Frontend Event Tracking

This article explains the architecture of a complete event‑tracking system, introduces two kinds of front‑end events, and details the technical design and implementation of the monitor‑tracer SDK for page visibility/active time as well as the monitor SDK for custom trigger events, including lifecycle monitoring, DOM observation, decorators, and React hooks.

Page lifecycle · event tracking · monitoring

0 likes · 27 min read

IT Architects Alliance

Jun 7, 2021 · Industry Insights

How WeChat Scales: Agile Practices and Architecture Behind Billions of Users

The article analyzes WeChat's success by detailing its three‑pronged strategy of precise product timing, agile project management, and robust technical support, and explains how the team applies agile attitudes, modular design, extensible protocols, disaster‑recovery mechanisms, and fine‑grained monitoring to operate a massive, highly available system.

Agile Development · WeChat · industry insights

0 likes · 18 min read

Dada Group Technology

Jun 4, 2021 · Databases

JD Daojia MySQL Containerization: Architecture, Implementation, and Operational Practices

This article presents JD Daojia's practice of containerizing MySQL, detailing the underlying resource platform, custom container scheduling algorithm, high‑availability design, monitoring system, and an automated operations platform that together improve performance, cut costs, and boost operational efficiency.

automation · containerization · monitoring

0 likes · 9 min read

Youzan Coder

Jun 4, 2021 · Operations

How to Tame Massive Product‑Sync Traffic in a Multi‑Store Chain System

This article analyzes the stability challenges of a multi‑store chain’s product‑copy mechanism, outlines design goals for isolation and scalability, and presents short‑ and long‑term monitoring, flow‑control, and emergency‑response strategies to ensure reliable large‑scale operations.

Flow Control · Operations · Scalability

0 likes · 12 min read

Ops Development Stories

Jun 4, 2021 · Operations

Step-by-Step Guide to Monitoring MongoDB with Zabbix Agent 2

This tutorial explains how to use Zabbix Agent 2 to monitor MongoDB databases and clusters, covering the required read‑only user setup, relevant Zabbix templates, key metrics such as jumbo chunks, connection pool stats, server status, collection and replSet information, and practical configuration examples.

Agent2 · MongoDB · Operations

0 likes · 6 min read

Big Data Technology Architecture

Jun 2, 2021 · Big Data

Practical Operations of NetEase Big Data Platform: Architecture, EasyOps, Monitoring, and Experience Sharing

The presentation details NetEase's big data platform operations, covering current usage, the internally built EasyOps control system, a generic service‑operation framework based on Ansible, Prometheus‑Grafana monitoring, configuration management, network and storage optimizations, and lessons learned from cloud migration.

Ansible · Big Data · EasyOps

0 likes · 9 min read

Java High-Performance Architecture

Jun 1, 2021 · Backend Development

How Ximalaya Scaled Its Gateway to 200B Daily Calls: Architecture & Optimizations

This article details Ximalaya's evolution of its HTTP gateway—from a Tomcat NIO prototype to a fully asynchronous Netty design—covering architectural diagrams, performance bottlenecks, traffic management features, monitoring, GC tuning, and future plans for HTTP/2 and graceful degradation.

HTTP2 · asynchronous processing · gateway

0 likes · 15 min read

JD Retail Technology

May 31, 2021 · Operations

JD Health Technical Product Team's 2021 618 Promotion Preparation: Architecture Review, Performance Tuning, Security Drills, and Monitoring

The JD Health Technical Product Team organized a comprehensive 618 promotion preparation in 2021, covering system architecture reviews, capacity assessments, performance stress testing, offensive‑defensive drills, system optimization, 24‑hour monitoring, and product operation training to ensure high availability and stable service during the large‑scale sales event.

Performance Testing · System Architecture · e‑commerce

0 likes · 10 min read

Aikesheng Open Source Community

May 28, 2021 · Databases

Comprehensive MySQL Inspection Checklist and Command Reference

This guide presents a detailed MySQL inspection checklist covering operating‑system metrics, critical MySQL parameters, status queries, replication health, high‑availability components, and useful SQL scripts, enabling DBAs to efficiently monitor performance, detect issues, and maintain reliable database services.

Replication · high availability · inspection

0 likes · 11 min read

Amap Tech

May 28, 2021 · Operations

System Observability Practices in Gaode Ride-Hailing: From Unified Logging to Fault Defense

Gaode Ride‑Hailing created a comprehensive 360° observability platform—standardized logging, distributed tracing, multi‑domain metrics, visual dashboards, and an incident workflow—that transforms raw data into actionable insights, accelerates root‑cause analysis, and enables automated fault defense for its large‑scale cloud‑native microservice system.

Distributed Systems · fault tolerance · logging

0 likes · 22 min read

TAL Education Technology

May 27, 2021 · Big Data

Big Data Monitoring System: Architecture, Basic and Advanced Monitoring, and Alert Convergence & Grading

This article outlines the challenges of operating petabyte‑scale big‑data clusters and presents a comprehensive monitoring framework—including basic and upgraded monitoring layers, metric collection, alerting pipelines, and strategies for alarm convergence and grading—to ensure reliable, proactive SRE operations.

Alerting · Grafana · Operations

0 likes · 12 min read

Liangxu Linux

May 27, 2021 · Operations

How I Built an Automated Redis Sentinel to Seamlessly Handle Failover

A sysadmin narrates how he monitors four Redis nodes, detects master failure with PING, promotes a slave using SLAVEOF, reconfigures the remaining replicas, and ultimately automates the entire process with a custom Sentinel program and a multi‑node Sentinel cluster for high availability.

Operations · automation · c++

0 likes · 11 min read

Liulishuo Tech Team

May 26, 2021 · Operations

Custom Prometheus Monitoring Architecture and GitOps Practices at Liulishuo

This article details Liulishuo's customized Prometheus monitoring architecture, including data backup to Aliyun SLS, ECS service discovery, advanced alerting with PagerDuty and Goalert, GitOps-driven config management, cloud resource exporters, SLA monitoring, and future plans for storage and alert pipelines.

Alerting · cloud-native · monitoring

0 likes · 9 min read

IT Architects Alliance

May 24, 2021 · Cloud Native

What Is Microservices? A Complete Guide to Architecture, Benefits, and Tools

This article provides a comprehensive overview of microservices, explaining their definition, core characteristics, advantages and drawbacks, suitable organizational contexts, and the essential technical components such as service discovery, API gateways, configuration centers, communication protocols, monitoring, circuit breaking, and container orchestration platforms.

Cloud Native · Configuration Center · Microservices

0 likes · 19 min read

New Oriental Technology

May 24, 2021 · Operations

Overview of SkyWalking UI: Dashboard, Topology, Tracing, Profiling, and Alerts

The article provides a comprehensive English overview of SkyWalking UI, detailing its dashboard metrics, topology visualization, trace analysis, performance profiling workflow, and alarm management, illustrating how the tool monitors microservice and cloud‑native environments with metrics such as throughput, latency, Apdex, and JVM statistics.

APM · Distributed Tracing · SkyWalking

0 likes · 11 min read

dbaplus Community

May 18, 2021 · Operations

Mastering End‑to‑End Monitoring: From Purpose to Prometheus Implementation

This guide explains why monitoring is essential throughout a product lifecycle, outlines monitoring modes and methods, compares health checks, logs, tracing and metric solutions, and provides a detailed Prometheus‑based monitoring architecture with concrete metric definitions, alerting rules, and incident‑response procedures.

Alerting · Metrics · Operations

0 likes · 25 min read

ITPUB

May 17, 2021 · Operations

How DBSCAN Clustering and Bayesian Inference Boost Root‑Cause Detection in Securities Trading Systems

This article describes how a Chinese securities firm applied big‑data‑driven clustering and Bayesian methods to automate root‑cause analysis of trading‑system anomalies, detailing the challenges, algorithmic designs, practical implementations, and evaluation results that demonstrate significant reductions in false alarms and faster recovery.

Bayesian inference · Operations · Root Cause Analysis

0 likes · 17 min read

MaGe Linux Operations

May 16, 2021 · Operations

Top 10 Open‑Source Tools Every SRE Should Use for Reliable Cloud Operations

This article introduces ten popular open‑source projects for monitoring, deployment, and reliability engineering, detailing each tool's purpose, key features, and how they help Site Reliability Engineers build scalable, highly reliable cloud‑native systems.

DevOps · SRE · monitoring

0 likes · 10 min read

Xianyu Technology

May 13, 2021 · Frontend Development

Front-End Disaster Recovery for Page Stability

To prevent page failures and white‑screen errors, the team built a front‑end SDK that fetches fallback data from OSS + CDN, offers configurable black/white‑list rules, lightweight validation, and a visual backend, cutting error rates from over 8% to 0.55% and dramatically improving interface stability.

CDN · OSS · SDK

0 likes · 9 min read

Tencent Cloud Middleware

May 11, 2021 · Operations

Mastering High Availability: Core Concepts, Metrics, and Design Strategies

This article explains high availability fundamentals, defines availability, outlines design targets, presents common metrics such as MTBF, MTTR, MTTF, SA, RPO, RTO, discusses CAP theory, essential design elements, and answers practical Q&A on cost, architecture, fault tolerance, testing, and implementation guidance.

CAP theorem · SLA · failover

0 likes · 15 min read

JD Cloud Developers

May 11, 2021 · Cloud Native

How JD Built a Cloud‑Native Monitoring & Logging System for Massive Scale

This article explains the fundamental differences between traditional and cloud‑native monitoring systems, outlines the challenges each faces, and details JD.com's evolution from physical servers to JDOS 2.0, describing its modular architecture, deployment model, and ongoing optimization efforts.

JD.com · Operations · architecture

0 likes · 10 min read

Dada Group Technology

May 7, 2021 · Operations

How JD Daojia Built a Scalable Load‑Testing Platform to Reduce Test Time to 15 Minutes

Facing rising traffic, JD Daojia’s in‑house load‑testing platform was redesigned to automate script management, enable distributed JMeter execution, integrate real‑time monitoring, and support custom RPC protocols, dramatically lowering manual effort, cutting test cycles from an hour to fifteen minutes while ensuring system stability.

Distributed Systems · JMeter · Load Testing

0 likes · 12 min read

macrozheng

May 6, 2021 · Operations

How I Built an Automated Redis Sentinel System to Handle Failover

An operations engineer narrates how he monitors a four‑node Redis cluster, detects master failure with continuous PINGs, promotes a slave to master, reconfigures replicas, and automates the entire process with a sentinel program and a sentinel cluster for high availability.

automation · failover · monitoring

0 likes · 11 min read

Youzan Coder

Apr 30, 2021 · Backend Development

Uncovering Hidden Flaws in a Distributed Lock Implementation: A Structured Code Review

This article examines the business context and collection workflow of a BOS‑integrated service, dissects the distributed lock logic and its execution sequence, and conducts a thorough structured code review that reveals logical, exception, non‑functional, and testability issues while offering concrete improvement recommendations.

Backend · Code review · best practices

0 likes · 8 min read

MaGe Linux Operations

Apr 29, 2021 · Operations

Essential Kubernetes Production Best Practices for Reliable Ops

This article outlines essential Kubernetes best‑practice guidelines for production environments, covering health probes, resource allocation, RBAC, cluster configuration, networking policies, monitoring, logging, stateless design, autoscaling, runtime security, and strategies for zero‑downtime and failure recovery.

Kubernetes · Operations · monitoring

0 likes · 12 min read

Programmer DD

Apr 27, 2021 · Cloud Native

Top Open‑Source Tools Every SRE Should Master for Scalable, Reliable Systems

This article surveys the most popular open‑source projects for Site Reliability Engineering and DevOps, covering monitoring, deployment, chaos testing, and observability tools such as Cloudprober, Istio, Prometheus, Litmus, and more, highlighting their key features and how they help build scalable, high‑reliability cloud‑native systems.

DevOps · Kubernetes · SRE

0 likes · 11 min read

Big Data Technology & Architecture

Apr 26, 2021 · Operations

Comprehensive Guide to Prometheus: Installation, Configuration, PromQL, Exporters, Grafana, and Alerting

This article provides a complete tutorial on Prometheus, covering its origins, core features, installation methods (binary and Docker), configuration file structure, PromQL basics, HTTP API usage, Grafana integration, various exporters for metrics collection, and alerting with Alertmanager, all within a cloud‑native monitoring context.

Alerting · Exporters · Grafana

0 likes · 32 min read

Top Architect

Apr 25, 2021 · Operations

Linux Server Monitoring: CPU, Memory, Disk I/O, and Network Tools Overview

This article introduces essential Linux monitoring utilities—including top, vmstat, pidstat, iostat, netstat, sar, and tcpdump—explaining their output fields, typical usage scenarios, and how to interpret CPU, memory, disk, and network performance metrics for effective system troubleshooting and optimization.

Linux · System Tools · monitoring

0 likes · 17 min read

iQIYI Technical Product Team

Apr 23, 2021 · Cloud Native

ByteDance Stateful Application Cloud‑Native Practices

ByteDance’s cloud‑native migration of stateful services uses a custom SolarService extending StatefulSet with Budset CRD to handle versioned data, shard‑aware routing, NUMA‑aware scheduling, advanced storage, eBPF monitoring, and automated PDB eviction, delivering efficiency, cost savings, and reliable rolling upgrades.

Kubernetes · Scheduling · automation

0 likes · 18 min read

Alibaba Cloud Native

Apr 22, 2021 · Cloud Native

How KubeProbe Enables Early Problem Detection in Large‑Scale Cloud‑Native Clusters

This article explains how Alibaba's KubeProbe system combines black‑box probing and directed inspections to detect issues in massive ASI Kubernetes clusters before users notice them, detailing the architecture, implementation, integration with release pipelines, and real‑world results that improve reliability and operational efficiency.

KubeProbe · Kubernetes · cloud-native

0 likes · 17 min read

Volcano Engine Developer Services

Apr 22, 2021 · Cloud Native

How ByteDance Scaled Stateful Applications with Cloud‑Native Kubernetes

This article details ByteDance's journey of migrating stateful services to a cloud‑native Kubernetes platform, covering challenges in state management, infrastructure enhancements, storage solutions, monitoring, and automated operations that together improve efficiency and reduce costs at massive scale.

Scheduling · cloud-native · monitoring

0 likes · 17 min read

Python Crawling & Data Mining

Apr 22, 2021 · Databases

MongoDB Mastery: Install, Configure, and Perform CRUD

This comprehensive tutorial walks you through installing MongoDB on Windows, configuring data and log directories, setting environment variables, creating and managing databases, collections, indexes, aggregation pipelines, backup and restore procedures, monitoring tools, advanced query operators, user management, and using a visual tool like Navicat for MongoDB.

Backup · CRUD · Installation

0 likes · 15 min read

vivo Internet Technology

Apr 21, 2021 · Operations

System Health Check: Principles and Implementation

System health checks, akin to medical exams, are vital for modern IT infrastructure, using active and passive monitoring, failover strategies, and tools like Spring Boot Actuator to detect hardware, network, load, or software issues, prevent single points of failure, and ensure continuous high‑availability service operation.

Network Reliability · RocketMQ · Spring Boot Actuator

0 likes · 12 min read

Efficient Ops

Apr 20, 2021 · Operations

How Dada’s Intelligent Elastic Scaling Cuts Costs and Boosts Delivery Performance

This article details Dada Group’s implementation of an intelligent elastic scaling architecture that automatically adjusts capacity during peak promotions and low‑traffic periods, improving delivery reliability, reducing costs, and supporting multi‑cloud and multi‑runtime environments through sophisticated monitoring and auto‑scaler mechanisms.

Auto Scaling · Operations · capacity management

0 likes · 17 min read

Code Ape Tech Column

Apr 20, 2021 · Cloud Native

A Beginner's Guide to Designing, Implementing, and Deploying Microservices on Kubernetes

This article provides a comprehensive step‑by‑step tutorial on designing, implementing, and deploying a simple Java Spring Boot microservice system on Kubernetes, covering architecture design, registration center, monitoring with Prometheus and Grafana, logging, tracing, flow control, and verification using tools such as Zipkin and Sentinel.

Deployment · Kubernetes · Microservices

0 likes · 18 min read

ITPUB

Apr 18, 2021 · Operations

Choosing the Right Monitoring Tool: Cacti, Nagios, Zabbix, Grafana, and Prometheus Compared

This article reviews five popular open‑source monitoring solutions—Cacti, Nagios, Zabbix, Prometheus, and Grafana—highlighting their core architectures, key features, strengths, and limitations to help IT professionals select the most suitable tool for their environment.

Cacti · Grafana · IT Operations

0 likes · 7 min read

Liangxu Linux

Apr 18, 2021 · Operations

Deploy a Complete Prometheus Monitoring Stack with Docker, Exporters, and Alertmanager

This guide walks through installing Prometheus, building a custom Docker image, configuring service discovery, adding node, cadvisor, Redis, JMX and process exporters, setting up Alertmanager with WeChat alerts, creating PromQL rules, and visualizing metrics in Grafana for a production‑grade monitoring solution.

Alertmanager · Consul · Exporters

0 likes · 30 min read

Node Underground

Apr 16, 2021 · Operations

How to Integrate Grafana & Prometheus Monitoring into Midway Applications

Learn step‑by‑step how to install Midway’s Prometheus plugin, configure Docker‑based Prometheus and Grafana, expose metrics from a Node.js app, and visualize them in Grafana dashboards, enabling effective monitoring and operations for your services.

Docker · Grafana · Midway

0 likes · 7 min read

DevOps Cloud Academy

Apr 14, 2021 · Operations

Six Core DevOps Capabilities That Drive Business Success

The article outlines the six essential DevOps capabilities—collaboration, automation, continuous integration, continuous testing, continuous delivery, and continuous monitoring—explaining how each contributes to improving efficiency, reducing errors, and ensuring reliable software delivery in modern IT environments.

Collaboration · Continuous Delivery · DevOps

0 likes · 5 min read

Full-Stack Internet Architecture

Apr 14, 2021 · Operations

Introduction to Prometheus: Concepts, Deployment, and Integration with Grafana

This article introduces Prometheus as an open‑source monitoring system and time‑series database, explains its core components such as Server, Exporter, PushGateway and Service Discovery, and provides step‑by‑step Docker deployment instructions together with Grafana integration for visualizing metrics.

Docker · Exporter · Grafana

0 likes · 6 min read

Architect

Apr 10, 2021 · Cloud Native

A Beginner’s Guide to Building High‑Availability Microservices on Kubernetes

This article walks readers through the complete lifecycle of designing, implementing, deploying, and validating a simple Java Spring‑Boot microservice system on Kubernetes, covering service design, registration, monitoring, tracing, traffic control, high‑availability deployment, and practical verification steps.

cloud-native · monitoring · spring-boot

0 likes · 20 min read

Top Architect

Apr 10, 2021 · Backend Development

Understanding Java Thread Pools: Benefits, Workflow, Configuration, Optimization, and Monitoring

This article provides a comprehensive overview of Java thread pools, covering their advantages, internal workflow, creation parameters, various execution states, shutdown procedures, performance tuning strategies, and key monitoring metrics, supplemented with practical code examples.

ThreadPoolExecutor · backend-development · concurrency

0 likes · 15 min read

ITPUB

Apr 7, 2021 · Operations

8 Real-World Production Failures and How to Diagnose Them Quickly

The article shares eight authentic production incident cases—from frequent JVM Full GC and memory leaks to cache avalanches, DNS hijacking, and database deadlocks—detailing their root causes, diagnostic steps, code snippets, and practical remediation strategies for engineers facing similar challenges.

Cache · JVM · Operations

0 likes · 17 min read

Programmer DD

Apr 3, 2021 · Operations

Why a Massive KEYS * Command Crashed Our Redis Service and How to Fix It

The article recounts a sudden Redis performance crisis caused by massive KEYS * commands, explains how monitoring, INFO, COMMANDSTATS and SLOWLOG revealed the issue, and outlines temporary and long‑term remediation steps for preventing similar outages.

Slowlog · monitoring

0 likes · 7 min read

Full-Stack Internet Architecture

Apr 2, 2021 · Operations

Understanding Redis Sentinel: High‑Availability Mechanism and Automatic Failover

This article explains how Redis Sentinel provides high‑availability for Redis by continuously monitoring master and replica nodes, detecting failures through subjective and objective down states, electing a new master via quorum‑based voting, and notifying clients of the failover using Pub/Sub events.

Replication · failover · high availability

0 likes · 19 min read

Architecture Digest

Apr 2, 2021 · Backend Development

Understanding the Essence of Architecture: A Deep Dive into Weibo’s Large‑Scale System Design

The article explores the fundamental concepts of software architecture, illustrating how massive platforms like Weibo handle millions of users through layered design, service decomposition, multi‑level caching, distributed tracing, and capacity planning to achieve high scalability and reliability.

Backend · Distributed Systems · architecture

0 likes · 21 min read

Sohu Tech Products

Mar 31, 2021 · Operations

Improving Cluster Stability: CI/CD, Monitoring, Logging, Documentation, and Traffic Management Solutions

The article analyzes the instability of a company's Kubernetes clusters, identifies root causes such as unstable release processes, lack of monitoring, logging, and documentation, and proposes comprehensive solutions including a Kubernetes‑centric CI/CD pipeline, federated Prometheus monitoring, Elasticsearch logging, centralized documentation, and integrated traffic management with Kong and Istio.

DevOps · Kubernetes · Operations

0 likes · 10 min read

HelloTech

Mar 26, 2021 · Big Data

Data Quality and Interface Semantic Monitoring for Algorithm Testing Platform

The article describes how algorithm testing teams tackled data‑quality and interface‑semantic monitoring problems by building a unified business monitoring platform that checks table, storage and service consistency, validates response semantics, and, through dashboards, alerts and correction tools, quickly identified dozens of offline and online issues, guiding future reliability enhancements.

AI · Big Data · Data Quality

0 likes · 26 min read

Top Architect

Mar 26, 2021 · Operations

Top Open‑Source Projects for SREs and DevOps

This article presents a curated list of popular open‑source tools for monitoring, deployment, chaos testing, and reliability engineering, explaining their main features and how they help SREs and DevOps engineers build scalable, highly available cloud‑native systems.

Cloud Native · DevOps · SRE

0 likes · 10 min read

Architecture Digest

Mar 24, 2021 · Cloud Native

A Beginner's Guide to Designing, Implementing, and Deploying Microservices on Kubernetes

This article provides a step‑by‑step tutorial on designing a simple microservice system with separate front end and back end, implementing it with Java Spring Boot, and deploying the whole solution on a Kubernetes cluster with high‑availability features, monitoring, logging, tracing, and traffic control.

Kubernetes · Spring Boot · cloud-native

0 likes · 18 min read

MaGe Linux Operations

Mar 23, 2021 · Operations

Build a Full‑Stack Prometheus Monitoring System with Docker, Exporters & Alertmanager

This guide walks through deploying Prometheus, its exporters, Alertmanager, and Grafana using Docker, configuring service discovery with Consul, writing PromQL alerts, and visualizing metrics, providing a complete end‑to‑end monitoring solution for cloud‑native environments.

Alertmanager · Consul · Exporters

0 likes · 31 min read

Architecture Digest

Mar 19, 2021 · Backend Development

Evolution and Performance Optimization of a High‑Throughput HTTP Gateway at Ximalaya

This article details the design evolution, architectural choices, performance tuning, monitoring, and future plans of Ximalaya's high‑traffic HTTP gateway, covering its migration from Tomcat NIO to a fully asynchronous Netty implementation and the associated engineering challenges and solutions.

Asynchronous · HTTP · Netty

0 likes · 15 min read

FunTester

Mar 18, 2021 · Operations

Real‑Time QPS Monitoring and Asynchronous Progress Display for Load Tests

This article explains how to enrich a performance‑testing framework with an asynchronous progress bar that also reports real‑time QPS, detailing the design of a unified Progress class, its handling of different thread models, the QPS calculation logic, a sample Groovy test script, and the resulting console output.

Load Testing · Performance Testing · QPS

0 likes · 12 min read

Open Source Linux

Mar 15, 2021 · Databases

Mastering MongoDB Clusters: Setup, Monitoring, and Performance Tuning

This comprehensive guide explains MongoDB cluster components, common use cases, monitoring commands, basic operational tasks, data migration procedures, troubleshooting of production issues, and optimization recommendations to achieve high performance and scalability.

Cluster · MongoDB · Operations

0 likes · 20 min read

IT Architects Alliance

Mar 14, 2021 · Backend Development

Evolution and Performance Optimization of Ximalaya’s HTTP Gateway: From Tomcat NIO to Netty Full‑Async Architecture

This article describes how Ximalaya’s high‑traffic HTTP gateway evolved from a Tomcat NIO + AsyncServlet design to a fully asynchronous Netty‑based architecture. It covers the problems of blocking I/O, memory copying, and GC pressure, and explains how a layered redesign, lock‑free connection pools, comprehensive monitoring, and performance optimizations enabled stable handling of over 200 billion daily calls with peak QPS exceeding 40k per machine.

Asynchronous · Netty · gateway

0 likes · 15 min read

iQIYI Technical Product Team

Mar 12, 2021 · Operations

Implementation and Practice of LEDAO‑CAT Monitoring System for iQIYI Content Platform

To meet the LEDAO platform’s need for rapid anomaly detection, full‑stack observability, and reliable alerting across more than 100 microservices, iQIYI evaluated OpenFalcon, Prometheus and CAT, selected CAT, deployed separate mainland and overseas clusters, added configurable access, health‑check and integrated alert channels, enabling five‑minute service onboarding, near‑zero‑intrusion instrumentation, and real‑time business‑level monitoring.

Alerting · CAT · DevOps

0 likes · 12 min read

Sohu Tech Products

Mar 10, 2021 · Databases

Elasticsearch Deployment Best Practices: Memory, CPU, Sharding, Replicas, Hot/Warm Architecture, Node Roles, Monitoring and Troubleshooting

This article presents practical best‑practice guidelines for configuring Elasticsearch in production, covering heap memory sizing, CPU considerations, shard and replica planning, hot‑warm node architecture, node role settings, common pitfalls, monitoring APIs, and troubleshooting tips.

Cluster Tuning · Elasticsearch · Memory Management

0 likes · 15 min read

Zhuanzhuan Tech

Mar 10, 2021 · Backend Development

Service Governance Architecture and Practices at Zhuanzhuan

This article explains how Zhuanzhuan’s service management platform implements comprehensive service governance—including registration, discovery, configuration, monitoring, authentication, rate limiting, and alerting—to support micro‑service architectures and improve reliability, scalability, and operational efficiency.

Configuration Management · Microservices · monitoring

0 likes · 10 min read

IT Architects Alliance

Mar 9, 2021 · Backend Development

Understanding the Essence of Architecture and Scaling Strategies for Billion‑User Systems

This article explores the fundamental concepts of system architecture, illustrating how large‑scale services like Weibo handle massive traffic through layered design, sharding, caching, service decomposition, monitoring, and operational practices to achieve high performance and reliability.

Distributed Systems · Microservices · Scalability

0 likes · 20 min read

Alibaba Cloud Developer

Mar 8, 2021 · Operations

How to Ensure System Stability During Massive Sales Events: A Complete SRE Playbook

This article presents a comprehensive, step‑by‑step framework for guaranteeing system reliability during high‑traffic promotional periods, covering SRE hierarchy, stability criteria, profiling, monitoring, capacity planning, incident response, and post‑event analysis to help teams build resilient services.

SRE · capacity planning · incident response

0 likes · 21 min read

Top Architect

Mar 6, 2021 · Operations

Spring Boot Monitoring with Prometheus and Grafana: A Step‑by‑Step Guide

This article provides a comprehensive tutorial on setting up Spring Boot application monitoring using Prometheus and Grafana, covering project creation, dependency configuration, security setup, Prometheus server installation, Grafana dashboard creation, email alerting configuration, and testing the end‑to‑end alert workflow.

Alerting · Backend · Spring Boot

0 likes · 10 min read

IT Architects Alliance

Mar 5, 2021 · Backend Development

Understanding the Essence of System Architecture: Insights from Weibo’s Large‑Scale Design

The article explores the fundamental concepts of system architecture, illustrating how large‑scale services like Weibo handle massive traffic through layered design, abstraction, caching, service decomposition, monitoring, and operational practices to achieve scalability, reliability, and performance.

Backend · Distributed Systems · Scalable Design

0 likes · 20 min read

Architects' Tech Alliance

Mar 3, 2021 · Backend Development

Understanding the Essence of Architecture: Insights from Weibo’s Large‑Scale System Design

This article explores the fundamental principles of system architecture by analyzing Weibo’s evolution to a multi‑layer, high‑traffic platform, covering scalability, service decomposition, caching strategies, distributed tracing, and operational best practices for building robust backend systems.

Scalability · Weibo · caching

0 likes · 21 min read

Efficient Ops

Mar 1, 2021 · Operations

Mastering Monitoring: From Fundamentals to Prometheus in Cloud‑Native Environments

This comprehensive guide explains the purpose, models, and methods of monitoring across the entire software lifecycle, compares health checks, logging, tracing, and metric collection, and details practical implementations using tools like ELK, SkyWalking, and Prometheus for cloud‑native operations.

Operations · cloud-native · monitoring

0 likes · 24 min read

21CTO

Mar 1, 2021 · Backend Development

How to Build a Scalable WebSocket Long‑Connection Gateway with Netty

This article explains the challenges of server‑push in HTTP, reviews WebSocket as the mainstream solution, and details the design, implementation, session management, monitoring, and performance testing of a Netty‑based distributed WebSocket long‑connection gateway used at iQIYI.

Netty · WebSocket · distributed architecture

0 likes · 12 min read

DataFunTalk

Mar 1, 2021 · Artificial Intelligence

Online Learning and Real‑Time Model Updating in JD Retail Search Using Flink

The article describes JD's end‑to‑end online learning pipeline for retail search, covering the background, system architecture, real‑time feature collection, sample stitching, Flink‑based incremental training, parameter updates, and full‑link monitoring to achieve low‑latency, high‑accuracy model serving.

Flink · Model Serving · Online Learning

0 likes · 9 min read

Didi Tech

Feb 25, 2021 · Industry Insights

Why DiDi’s Obsuite Is Redefining Hybrid‑Cloud Observability

Obsuite, DiDi’s open‑source observability suite, tackles hybrid‑cloud monitoring challenges by combining metrics, logs, and traces, while the article analyzes market trends, private‑cloud demand, and the product’s architecture, open‑source components, and the OCE certification program for enterprise users.

Log Management · Metrics · hybrid cloud

0 likes · 6 min read

Sohu Tech Products

Feb 24, 2021 · Operations

Redis Monitoring and Alerting Practices: Metrics, Thresholds, and Troubleshooting

This article presents a comprehensive guide to Redis monitoring and alerting, covering metric classification, threshold settings, client traffic collection, host resource usage, instance health checks, cluster failover diagnostics, and detailed explanations of Redis INFO sections with practical code examples.

Alerting · Metrics · Operations

0 likes · 23 min read

ITPUB

Feb 23, 2021 · Backend Development

Building a Complete Backend Stack for Startups: Languages, Services, and Tools

This guide walks through the essential layers of a backend technology stack for startups, covering language choices, core components like DNS, load balancing, CDN, RPC frameworks, databases, messaging, logging, monitoring, configuration, deployment, and operational best‑practice processes.

DevOps · Messaging · architecture

0 likes · 31 min read

High Availability Architecture

Feb 22, 2021 · Backend Development

Evolution and Performance Optimization of Ximalaya's High‑Throughput HTTP Gateway

This article details the design evolution, architectural redesign, and performance‑tuning techniques of Ximalaya's gateway—from an initial Tomcat NIO implementation to a fully asynchronous Netty‑based solution—covering traffic management, timeout handling, monitoring, and future HTTP/2 migration.

Asynchronous · HTTP · monitoring

0 likes · 16 min read

Java Captain

Feb 21, 2021 · Operations

Exposing Spring Boot Metrics with Prometheus and Visualizing Them in Grafana

This tutorial explains how to add Actuator and Prometheus dependencies to a Spring Boot application, configure security, expose metrics endpoints, run Prometheus and Grafana via Docker, and set up Grafana dashboards for real‑time monitoring of Spring Boot services.

Actuator · Docker · Grafana

0 likes · 4 min read

DevOps Cloud Academy

Feb 18, 2021 · Cloud Native

Comprehensive Guide to Deploying and Configuring Prometheus Monitoring on Kubernetes

This article provides a step‑by‑step tutorial on installing Prometheus, configuring its components, deploying it in a Kubernetes cluster with proper RBAC and persistent storage, and extending monitoring to applications and exporters using /metrics endpoints.

Cloud Native · DevOps · Prometheus

0 likes · 19 min read

Architect's Tech Stack

Feb 16, 2021 · Backend Development

Exposing Spring Boot Metrics for Prometheus Monitoring and Visualizing with Grafana

This tutorial explains how to add Actuator and Prometheus dependencies to a Spring Boot application, configure security, run the app to expose Prometheus‑format metrics, and then set up Docker‑based Prometheus and Grafana containers to collect and visualize those metrics.

Backend · Docker · Grafana

0 likes · 5 min read

ITFLY8 Architecture Home

Feb 10, 2021 · Backend Development

How WeChat Scales: Inside Its Agile, Massive‑Scale Architecture

This article reveals the three‑in‑one strategy, agile mindset, modular design, extensibility, gray‑release process, and monitoring techniques that enable WeChat to handle billions of users with high availability and rapid feature delivery.

Agile Development · WeChat · large-scale systems

0 likes · 18 min read

Liangxu Linux

Feb 8, 2021 · Operations

How a Single “return null” Caused a Million-Dollar Loss: Lessons in Release Management

A seemingly harmless code change that returned null triggered a massive production outage, costing millions, and the author recounts the incident, the emergency rollback, root‑cause analysis, and the broader lessons about code review, testing, monitoring, and disciplined release practices.

Backend · Code review · incident management

0 likes · 7 min read

Top Architect

Feb 7, 2021 · Cloud Native

A Comprehensive Guide to Docker Tools and the Container Ecosystem

This article provides an extensive overview of the most popular Docker‑related tools—including orchestration, CI/CD, monitoring, logging, security, storage, networking, service discovery, image building, and management solutions—detailing their core features, official websites, pricing models, and typical use cases for developers and operations engineers.

Container · DevOps · Docker

0 likes · 23 min read

Practical DevOps Architecture

Feb 5, 2021 · Operations

Using RocketMQ mqadmin Commands for Monitoring and Managing Topics, Producers, and Consumers

This guide demonstrates how to navigate to the RocketMQ bin directory and use various mqadmin CLI commands to list available commands, check help, monitor producer and consumer connections, query topic status, retrieve messages by offset or ID, and shut down the NameServer and brokers.

CLI · Operations · RocketMQ

0 likes · 4 min read

Ops Development Stories

Feb 5, 2021 · Operations

How to Monitor Ceph Clusters with Zabbix: 3 Practical Methods

This guide explains three ways to monitor Ceph distributed storage using Zabbix—Agent2 with the RESTful module, Zabbix Sender, and custom scripts—providing step‑by‑step commands, configuration tips, and troubleshooting notes for reliable operations.

Agent2 · Ceph · Operations

0 likes · 5 min read

Java Architect Essentials

Feb 2, 2021 · Operations

Server and Business Monitoring Practices Using Netdata, Spring AOP, and Javamelody

This article explains how to monitor both Linux servers and Java business applications by selecting lightweight tools like Netdata, implementing request‑time logging with Spring AOP, and integrating Javamelody, while providing configuration snippets and code examples for a comprehensive monitoring solution.

Javamelody · Netdata · java

0 likes · 9 min read

Efficient Ops

Feb 1, 2021 · Operations

How to Detect Anomalous Nodes in Massive Compute Clusters Using Intelligent Ops

This article explains how internet companies can reduce soaring manual operations costs by applying intelligent monitoring techniques—such as pattern recognition and statistical anomaly detection—to automatically identify abnormal nodes among thousands of servers, streamline fault diagnosis, and improve service quality.

Operations · anomaly detection · large-scale systems

0 likes · 4 min read

Open Source Linux

Jan 29, 2021 · Operations

Essential Kubernetes Production Best Practices for Secure, Scalable Ops

This article outlines comprehensive production‑grade Kubernetes best practices—including health probes, RBAC, resource management, network policies, monitoring, autoscaling, image security, and zero‑downtime strategies—to help teams run secure, efficient, and highly available workloads.

Kubernetes · Operations · autoscaling

0 likes · 11 min read

MaGe Linux Operations

Jan 28, 2021 · Cloud Native

Master Prometheus: Step‑by‑Step Container & Host Monitoring with Grafana

This guide introduces Prometheus, explains its advantages over traditional monitoring tools, walks through installation, configuration, and Docker deployment, and demonstrates practical monitoring of Docker containers, Linux hosts, and visualization with Grafana, providing complete code snippets and screenshots.

Grafana · Prometheus · cAdvisor

0 likes · 7 min read

Big Data Technology & Architecture

Jan 25, 2021 · Operations

Real‑time Redis Monitoring with redis‑exporter, Prometheus and Grafana Using Docker

This guide shows how to set up a complete Redis monitoring stack with Docker: two Redis instances, a redis‑exporter collector, and Prometheus with Grafana for visualization. It covers container creation, IP discovery, configuration files, datasource setup, and dashboard creation.

Docker · Exporter · Grafana

0 likes · 7 min read

DevOps Cloud Academy

Jan 25, 2021 · Cloud Native

Blackbox Monitoring with Prometheus Blackbox Exporter in Kubernetes

This guide explains how to complement Prometheus white‑box monitoring with black‑box probes by deploying the Blackbox Exporter in a Kubernetes cluster, configuring ConfigMaps, Deployments, Services, and Prometheus scrape jobs for HTTP, DNS, TCP, and ICMP checks, and using annotations for automatic service discovery.

Blackbox Exporter · Prometheus · monitoring

0 likes · 10 min read

Efficient Ops

Jan 20, 2021 · Operations

Log vs Network Data: Which Wins Full‑Link Monitoring in Modern Distributed Systems?

With the shift from monolithic to distributed architectures, this article compares log‑based and network‑data‑based monitoring across data sources, precision, monitoring paths, and implementation methods, concluding that network‑data monitoring offers superior real‑time insight, lower cost, and faster deployment for full‑link observability.

full‑link · log analysis · monitoring

0 likes · 11 min read

Liangxu Linux

Jan 20, 2021 · Operations

Deploy Nightingale Monitoring Platform with Docker and Grafana – A Complete Guide

This tutorial walks through installing the open‑source Nightingale monitoring system using Docker, configuring its components, adding node agents, integrating with Grafana, and setting up alerting, providing all commands, configuration files, and screenshots needed for a production‑ready deployment.

Deployment · Grafana · Operations

0 likes · 8 min read

Practical DevOps Architecture

Jan 20, 2021 · Operations

Deploying Zabbix Agent on Hundreds of Servers Using Ansible

This guide explains how to use Ansible to automatically install and configure Zabbix‑agent on a large number of Linux servers, covering inventory setup, a deployment script, a playbook, execution commands, and verification of the agent listening on port 10050.

Ansible · Operations · Zabbix

0 likes · 4 min read

Zhuanzhuan QA

Jan 19, 2021 · Operations

Comprehensive Full-Link Performance Testing Process and Practices for E-commerce Platforms

This article details a systematic full‑link performance testing workflow—including background, timing, scenario design, data preparation, capacity planning, monitoring, issue analysis, and post‑test cleanup—aimed at reliably evaluating and scaling e‑commerce services during major promotional events.

Operations · Performance Testing · capacity planning

0 likes · 18 min read

Big Data Technology & Architecture

Jan 17, 2021 · Operations

Building an Enterprise‑Level Flink Monitoring System with Prometheus, Grafana and Pushgateway

This article explains how to use the cloud‑native Prometheus ecosystem (Prometheus Server, exporters, Pushgateway, Alertmanager, and Grafana) to collect, store, query, and visualize Flink job metrics, providing a complete monitoring solution for production clusters.

Cloud Native · Flink · Grafana

0 likes · 13 min read

Architecture Digest

Jan 17, 2021 · Operations

System Performance Issue Analysis and Optimization Process

This article outlines a comprehensive process for diagnosing and optimizing performance problems in production business systems, covering hardware, OS, database, middleware, JVM tuning, code inefficiencies, monitoring tools, and the limitations of pre‑release testing, with practical guidelines and visual references.

APM · Database Tuning · System optimization

0 likes · 16 min read

Programmer DD

Jan 15, 2021 · Operations

Why Does Prometheus Sometimes Fail to Trigger Alerts?

This article explains why Prometheus alerts may not fire or may fire unexpectedly, covering the role of the for parameter, sampling intervals, Grafana range queries, and practical steps to diagnose and fix alerting issues.

Alerting · Grafana · Ops

0 likes · 7 min read

Didi Tech

Jan 14, 2021 · Cloud Computing

Design and Implementation of Didi's Logi‑KafkaManager Multi‑tenant Kafka Cloud Platform

Didi’s Logi‑KafkaManager is a multi‑tenant Kafka cloud platform that consolidates dozens of clusters into a secure, isolated gateway‑driven service offering intuitive web‑based topic management, real‑time metrics visualization, automated diagnostics, quota governance and safe scaling, delivering high internal satisfaction and enterprise commercialization.

Big Data · Kafka · cloud platform

0 likes · 17 min read

HelloTech

Jan 13, 2021 · Mobile Development

Hybrid Container Optimization for Puhui Ride‑Hailing Business

The report details a comprehensive overhaul of Puhui’s hybrid container for ride‑hailing services, introducing full‑link tracing, container reuse, offline resource caching, image WebP conversion and pre‑fetching, which together slash first‑screen load times by over 70 % and boost the 1‑second open rate from 12 % to 91 %.

Hybrid Container · Mobile Frontend · WebView

0 likes · 19 min read

ITPUB

Jan 13, 2021 · Operations

How to Diagnose and Optimize Business System Performance Issues

This article outlines a step‑by‑step approach for identifying root causes of performance bottlenecks in production business systems, covering common scenarios such as high concurrency, data growth, hardware limits, database and middleware tuning, code inefficiencies, and the role of monitoring and APM tools.

APM · Database Tuning · JVM

0 likes · 15 min read

Java Backend Technology

Jan 13, 2021 · Backend Development

Why Did Our HttpClient Crash the Server? Uncovering evictExpiredConnections and OOM

A detailed post explains how a mis‑configured HttpClient caused thread explosion and OOM on four servers, walks through the investigation using APM metrics, clarifies keep‑alive mechanics, and presents the fix of using a singleton HttpClient with proper connection eviction.

Backend · HttpClient · Keep-Alive

0 likes · 9 min read

MaGe Linux Operations

Jan 9, 2021 · Operations

How to Monitor Kubernetes API with Python and Zabbix Sender – Step‑by‑Step Guide

This tutorial walks you through using Python's requests library and Zabbix Sender to retrieve Kubernetes API metrics, covering API endpoint discovery, token generation, script deployment, host configuration, and manually triggering checks to visualize the data.

API · Kubernetes · Ops

0 likes · 3 min read

Alibaba Terminal Technology

Jan 7, 2021 · Frontend Development

How Frontend Chaos Engineering Boosts Reliability: Lessons from Alibaba

This article explores the challenges of frontend availability, introduces chaos engineering concepts, and details Alibaba's practical approach to frontend fault injection—including static resource hijacking, a safe isolated environment, monitoring integration, and a real‑world drill that demonstrates how to measure and improve detection and response capabilities.

Fault Injection · Reliability · chaos engineering

0 likes · 18 min read

Ops Development Stories

Jan 7, 2021 · Operations

Master Blackbox Exporter: Install, Configure, and Alert with Prometheus

This guide walks through the concepts of white‑box vs black‑box monitoring, explains Prometheus Blackbox Exporter capabilities, shows step‑by‑step installation, Kubernetes configuration, probe definitions for HTTP, TCP, ICMP and SSL, and provides ready‑to‑use alert rules and Grafana dashboard integration.

Alerting · Blackbox Exporter · Kubernetes

0 likes · 11 min read

Alibaba Terminal Technology

Jan 6, 2021 · Frontend Development

Mastering Gray-Scale Cross-Platform Monitoring for Front-End Apps

This article explains the background, technical architecture, real‑world case, and key takeaways of implementing gray‑scale monitoring across web, Weex, mini‑programs, and other cross‑platform front‑end solutions to improve issue detection and reduce mean time to recovery.

Operations · cross‑platform · frontend

0 likes · 10 min read
