Tagged articles

2179 articles

Aug 26, 2020 · Backend Development

A Comprehensive Guide to Evolving a Monolithic Online Store into a Robust Microservice Architecture

This article walks through the transformation of a simple online supermarket from a monolithic design to a fully fledged microservice system, explaining the motivations, architectural changes, component selection, common pitfalls, and best‑practice solutions such as service decomposition, database sharding, monitoring, tracing, service mesh, resilience patterns, and testing strategies.

Microservices · Resilience · architecture

0 likes · 22 min read

Architecture Digest

Aug 25, 2020 · Operations

Best Practices and Advanced Topics for Prometheus Monitoring in Kubernetes

This article provides a comprehensive guide on using Prometheus for Kubernetes monitoring, covering fundamental principles, exporter selection, Grafana dashboard creation, memory and storage optimization, high‑availability designs, query performance, cardinality management, and integration with alerting and logging systems.

Exporters · Grafana · Kubernetes

0 likes · 33 min read

dbaplus Community

Aug 24, 2020 · Operations

How Zhongtong Scaled Elasticsearch Monitoring with ESPaaS: Architecture, Alerts, and Diagnosis

Zhongtong built the ESPaaS platform to automate deployment, unify monitoring, and provide real‑time alerts and diagnostic capabilities for over 40 Elasticsearch clusters, handling petabytes of data with Prometheus, Grafana, and DingTalk integrations while sharing practical lessons learned.

Alerting · Prometheus · diagnosis

0 likes · 9 min read

Aikesheng Open Source Community

Aug 24, 2020 · Operations

Prometheus Data Query Basics and Practical Usage Guide

This article introduces Prometheus' query language PromQL, explains instant and range vector selectors, label matching, offset handling, storage design, common functions and aggregation operators, and provides practical advice for efficient querying and avoiding performance issues.

Operations · PromQL · Prometheus

0 likes · 13 min read

58 Tech

Aug 19, 2020 · Backend Development

Design and Implementation of a Testing Quality System for the 58.com SSP Advertising Platform

The article details the architecture of 58.com’s SSP advertising platform, identifies three key reliability challenges—data consistency, interface regression, and storage synchronization—and presents a three‑layer testing quality system comprising web‑layer validation, service‑layer automated testing, and data‑layer monitoring with concrete tools and future improvement plans.

SSP · advertising platform · automation

0 likes · 14 min read

Open Source Linux

Aug 17, 2020 · Operations

Step-by-Step Guide to Install and Configure Zabbix on CentOS 7

This tutorial walks you through installing Zabbix on CentOS 7: disabling SELinux and the firewall as prerequisites, adding repositories, installing the server, web, and database components, editing configuration files, securing MariaDB, starting services, and completing the web‑based setup with language customization.

CentOS · Installation · Linux

0 likes · 7 min read

Full-Stack DevOps & Kubernetes

Aug 16, 2020 · Cloud Native

How to Configure Alertmanager, Add WeChat Alerts, and Enable Automatic Service Discovery in Kubernetes

This guide walks through modifying Alertmanager to use a NodePort service, decoding and editing its secret to add custom receivers and a WeChat template, recreating the secret, and extending Prometheus Operator with additional scrape configs for automatic service discovery, including RBAC adjustments and verification steps.

Kubernetes · RBAC · ServiceDiscovery

0 likes · 10 min read

Liangxu Linux

Aug 15, 2020 · Fundamentals

Why Does `free` Show More Used Memory Than `ps aux`? A Deep Dive into Linux Memory Accounting

This article explains why Linux's `free` command often reports higher used memory than the RSS values shown by `ps aux`, covering buffer/cache reclaimable memory, slab and page‑table consumption, and provides Bash scripts to accurately calculate total memory usage.

Bash · Free · RSS

0 likes · 10 min read

Tencent Cloud Developer

Aug 12, 2020 · Databases

How Autonomous Databases Evolve: From Stone Age to AI‑Driven Self‑Healing

This article traces the evolution of database autonomy from manual, knowledge‑driven operations through tool‑assisted and expert‑level stages to cloud‑native intelligent services, and details Tencent's DBbrain platform, its architecture, performance‑optimization, security, monitoring, cost‑based analysis, and future self‑healing capabilities.

AI Ops · Cloud Databases · DBbrain

0 likes · 29 min read

Java Architect Essentials

Aug 11, 2020 · Operations

Four Essential Linux Monitoring Tools for Operations Engineers

This article introduces four widely used Linux monitoring tools—iotop, htop, IPTraf, and Monit—explaining their features, usage scenarios, and how they help operations engineers diagnose performance issues without a GUI, including real‑time I/O tracking, visual CPU/memory graphs, network traffic analysis, and flexible alerting.

IPTraf · Linux · Monit

0 likes · 7 min read

IT Architects Alliance

Aug 10, 2020 · Operations

Step‑by‑Step Guide to Building a Filebeat‑Kafka‑ELK Logging Pipeline

This tutorial walks through installing and configuring Filebeat, Kafka, Logstash, Elasticsearch, and Kibana, detailing version requirements, file permissions, YAML settings, startup commands, topic verification, and how to ingest and visualize log data in Kibana.

ELK · Elasticsearch · Filebeat

0 likes · 13 min read

Programmer DD

Aug 9, 2020 · Backend Development

Why Did My Java Service’s Response Time Spike? A Deep Dive into QPS, GC, and CPU Load

An internal Java‑based HTTP service suddenly suffered high latency and timeouts, prompting a systematic investigation that uncovered excessive QPS, frequent ParNew GCs, CPU load spikes, and large response payloads, leading to concrete performance and design improvements.

java · monitoring

0 likes · 9 min read

MaGe Linux Operations

Aug 8, 2020 · Operations

Step-by-Step Guide to Installing and Configuring Zabbix on CentOS 7

This tutorial walks you through disabling SELinux and the firewall, adding Zabbix and EPEL repositories, installing Zabbix server, web, and database components, configuring files, securing MariaDB, starting services, and completing the web‑based setup to get a fully functional monitoring system.

CentOS · Installation · Open-source

0 likes · 7 min read

Big Data Technology & Architecture

Aug 8, 2020 · Big Data

Setting Up InfluxDB and Grafana for Flink Metrics Monitoring

This guide walks through installing InfluxDB and Grafana on CentOS, configuring InfluxDB for Flink metrics storage, creating databases and retention policies, integrating the Flink InfluxDB reporter, and building Grafana dashboards to visualize real‑time Flink job metrics.

Big Data · Flink · Grafana

0 likes · 8 min read

MaGe Linux Operations

Aug 7, 2020 · Operations

How to Diagnose Linux Server Issues in the First 60 Seconds with 10 Essential Commands

This article explains how Netflix's performance team uses ten standard Linux command‑line tools to quickly assess system health within the first minute, focusing on error detection, resource saturation, and utilization across CPU, memory, disk, and network to pinpoint performance problems.

Ops · System Administration · command-line

0 likes · 18 min read

dbaplus Community

Aug 3, 2020 · Operations

How iQIYI Built a Full‑Link Automated Monitoring Platform for Microservices

iQIYI’s tech product team designed a unified full‑link automated monitoring platform that integrates link, metric, and log collection with deep analysis, enhancing fault localization, performance insight, and scalability across microservices, while addressing limitations of existing tools like ELK, Prometheus, and Dapper.

Metrics · full‑link · log collection

0 likes · 15 min read

转转QA

Jul 31, 2020 · Operations

Design and Implementation of a Real-Time Log Collection and Query System for Distributed Deployment

The article describes the challenges of troubleshooting distributed deployments across many machines and presents a solution built on the ELK stack that centralizes logs from Java and Go services, enabling near‑real‑time search, visualization, and faster issue resolution.

Distributed Systems · Operations · log collection

0 likes · 5 min read

Xianyu Technology

Jul 28, 2020 · Operations

ShenTan: Automated Fault Localization System for Online Services

ShenTan is an automated fault‑localization platform for online services that quickly (under five seconds) pinpoints server‑side issues with developer‑level accuracy by aggregating real‑time metrics, applying a decision‑tree model enriched by expert knowledge and dynamic thresholds, and presenting results through an integrated alert and visualization system, while planning broader endpoint coverage and multi‑tenant support.

Big Data · Fault Localization · Operations

0 likes · 12 min read

Top Architect

Jul 27, 2020 · Operations

10 Practical Tips to Boost Web Application Performance Up to 10× with NGINX

This article presents ten actionable recommendations—including reverse‑proxy deployment, load balancing, caching, compression, SSL/TLS tuning, HTTP/2 adoption, software upgrades, Linux and web‑server tuning, and real‑time monitoring—to dramatically improve web application performance, often achieving tenfold speed gains.

Nginx · Web Performance · caching

0 likes · 22 min read

DevOps Cloud Academy

Jul 27, 2020 · Operations

Monitoring GitLab Runner and GitLab CI Pipelines with Prometheus

This guide details how to enable Prometheus metrics on GitLab Runner, configure Prometheus to scrape those metrics, and set up the gitlab-ci-pipelines-exporter with Grafana dashboards to monitor both runner performance and CI/CD pipeline health.

DevOps · GitLab Runner · Grafana

0 likes · 7 min read

WecTeam

Jul 23, 2020 · Backend Development

How We Reduced WebMonitor Latency from Minutes to Seconds – Architecture & Performance Secrets

This article chronicles the evolution of the WebMonitor front‑end monitoring system, detailing its three‑tier stack, data pipeline upgrades from raw disk sampling to HDFS and Elasticsearch, extensive collector‑side optimizations, Jetty thread and timeout tuning, and the resulting performance gains that lowered response times from minutes to sub‑second levels.

Jetty · data pipeline · java

0 likes · 15 min read

dbaplus Community

Jul 20, 2020 · Operations

How to Build Reliable Monitoring for Low‑Frequency Financial Services

After two years transitioning from e‑commerce to finance, the team shares practical monitoring strategies for low‑frequency financial services, contrasting e‑commerce traffic‑based methods with finance‑specific challenges, and detailing point‑based metrics, hourly success‑rate alerts, aspect‑oriented exception handling, white‑list filtering, and Sentinel‑based circuit breaking.

Alerting · Aspect Oriented Programming · Circuit Breaking

0 likes · 16 min read

Liangxu Linux

Jul 19, 2020 · Operations

How to Diagnose Linux Performance Issues with Flame Graphs and System Tools

This guide explains how to systematically analyze Linux performance problems (CPU, memory, disk I/O, network, and load) using the 5W2H method, built‑in monitoring commands, perf, and flame‑graph visualizations, with a real‑world Nginx case study showing how to pinpoint and resolve bottlenecks.

flamegraph · monitoring · performance

0 likes · 19 min read

360 Tech Engineering

Jul 17, 2020 · Big Data

Qbus Service Overview: Architecture, Use Cases, and Implementation Details

This article introduces Qbus, a cloud‑based queue service built on Kafka, covering its architecture, core components such as log collection, SDKs, HDFS persistence, monitoring with Prometheus, business integration methods, use‑case scenarios, and future development directions.

Cloud Queue · HDFS · Kafka

0 likes · 6 min read

Qunhe Technology Quality Tech

Jul 17, 2020 · Operations

How We Built a Robust Monitoring System for Construction Drawing Production

This article describes how our team designed and implemented a comprehensive online monitoring system for construction drawing generation, covering business background, technical architecture analysis, metric definition, monitoring methods, and the resulting dashboards that improve quality, stability, and rapid issue resolution.

Metrics · Operations · construction drawing

0 likes · 10 min read

Full-Stack DevOps & Kubernetes

Jul 16, 2020 · Cloud Native

How to Install HAProxy and Exporter on Kubernetes and Monitor It with Prometheus

This guide walks through installing HAProxy on a Kubernetes master node, compiling and configuring it, adding the HAProxy exporter, creating a ServiceMonitor YAML for the Prometheus Operator, and verifying that metrics are correctly scraped and displayed in the Prometheus UI.

Exporter · HAProxy · Kubernetes

0 likes · 10 min read

Full-Stack Internet Architecture

Jul 12, 2020 · Operations

Monitoring Practices for Low‑Frequency Financial Services: Lessons from E‑commerce and Reliable Alerting Techniques

This article shares practical monitoring strategies for financial services with low‑frequency operations, contrasting e‑commerce monitoring methods, outlining the challenges of financial monitoring, and presenting reliable solutions such as success‑rate alerts, aspect‑oriented exception handling with whitelists, and circuit‑breaker degradation using Sentinel.

Alerting · Aspect Oriented Programming · Financial Services

0 likes · 14 min read

Big Data Technology & Architecture

Jul 11, 2020 · Operations

Prometheus Overview: Features, Architecture, Data Model, and Installation Guide with Grafana Integration

This article introduces Prometheus, covering its architecture, key features, data model, metric types, installation steps, integration with node_exporter and Grafana, and outlines suitable and unsuitable use cases for this open‑source monitoring system.

Grafana · Installation · Prometheus

0 likes · 8 min read

Full-Stack DevOps & Kubernetes

Jul 11, 2020 · Cloud Native

Managing Prometheus Alerts and Alertmanager with the Prometheus Operator

This guide walks through creating PrometheusRule resources, deploying Alertmanager via the Prometheus Operator, configuring custom alerting rules, exposing Alertmanager with a Service, and applying custom Prometheus configuration files using Kubernetes secrets and kubectl commands.

Kubernetes · Prometheus Operator · YAML

0 likes · 11 min read

Full-Stack DevOps & Kubernetes

Jul 9, 2020 · Cloud Native

Deploy and Manage Prometheus Operator on Kubernetes: A Step‑by‑Step Guide

This article explains what the Prometheus Operator is, how it extends Kubernetes with custom resources, lists the CRDs it provides, and walks through a complete deployment—including cloning the source, creating a monitoring namespace, applying RBAC, installing the operator, creating a Prometheus instance, configuring ServiceMonitor, and troubleshooting common permission errors—using concrete YAML manifests and kubectl commands.

Kubernetes · Prometheus Operator · RBAC

0 likes · 18 min read

HaoDF Tech Team

Jul 8, 2020 · Operations

How We Rebuilt Our Monitoring System into a Scalable Alert Service

After two months of intensive development, the team launched a new monitoring and alerting platform that transforms a legacy system into a service‑oriented solution, addressing pain points such as inflexible escalation, noisy alerts, and poor ownership while introducing phone alerts, automated escalation, Prometheus integration, and a unified rule engine.

Alerting · DevOps · Prometheus

0 likes · 16 min read

Full-Stack DevOps & Kubernetes

Jul 8, 2020 · Cloud Native

How to Deploy a Redis Exporter on Kubernetes for Prometheus Monitoring

This guide shows how to configure a Redis exporter alongside a Redis pod in Kubernetes, add Prometheus scrape annotations, apply the deployment and service manifests, and visualize metrics in Grafana, providing step‑by‑step commands, YAML examples, and screenshots of the monitoring dashboard.

Exporter · Kubernetes · YAML

0 likes · 5 min read

ITPUB

Jul 7, 2020 · Operations

Top 2020 DevOps Tools: A Complete Guide to Building Your CI/CD Stack

This article categorizes the most popular 2020 DevOps tools across development, testing, deployment, runtime, and collaboration, explains why each tool leads its class, lists key advantages and competitors, and offers a practical checklist for assembling a full CI/CD pipeline.

Collaboration · DevOps · automation

0 likes · 24 min read

ITPUB

Jul 5, 2020 · Operations

2020’s Best DevOps Tools by Category – From CI/CD to Collaboration

This article categorises the most popular 2020 DevOps tools—development/build, automated testing, deployment, runtime, and collaboration—explains why each tool topped its class, lists key advantages, and compares notable competitors to help teams build a complete CI/CD pipeline.

Collaboration · automation · monitoring

0 likes · 27 min read

Architecture Digest

Jul 3, 2020 · Cloud Native

Understanding Loki: Architecture, Benefits, and Comparison with ELK

This article explains the motivations behind Loki, its architecture and components, how it reduces the cost and complexity of log and metric querying compared to ELK, and details its write‑read pipeline, scalability, and integration with Kubernetes and Prometheus.

Loki · cloud-native · logging

0 likes · 7 min read

dbaplus Community

Jul 2, 2020 · Information Security

How 58 Daojia Secures Data in the DT Era: Threats, Practices, and Lessons

This article summarizes Liu Huan's presentation on data security in the DT era, covering the current security landscape, internal and external threats to enterprise data, and 58 Daojia's practical approaches to data discovery, classification, authentication, monitoring, and incident response.

DT era · data security · enterprise security

0 likes · 14 min read

Java High-Performance Architecture

Jul 2, 2020 · Operations

4 Essential Linux Monitoring Tools Every Sysadmin Should Master

Discover four high‑usage Linux monitoring utilities—iotop, htop, IPTraf, and Monit—that help you quickly diagnose I/O, CPU, memory, network, and process issues, with visual insights and flexible alerting to keep single or multiple servers running smoothly.

Linux · System Administration · htop

0 likes · 4 min read

Full-Stack DevOps & Kubernetes

Jul 1, 2020 · Cloud Native

How to Install and Configure mysql_exporter on a Kubernetes Master Node

This guide walks through downloading the mysql_exporter package, extracting it on a Kubernetes master, installing the binary, creating a dedicated MySQL user with proper permissions, configuring a password‑less client file, launching the exporter, and updating Prometheus via kubectl so MySQL metrics are exposed on port 9104.

Cloud Native · DevOps · Kubernetes

0 likes · 4 min read

Taobao Frontend Technology

Jul 1, 2020 · Frontend Development

How Taobao’s Front‑End Team Delivered a Lightning‑Fast 618 Shopping Experience

This article explains how Taobao’s front‑end engineers tackled the massive traffic of the 618 promotion by optimizing resource requests, data fetching, module loading, monitoring, and fallback strategies, ultimately achieving smooth, high‑performance pages for billions of shoppers.

caching · e‑commerce · monitoring

0 likes · 10 min read

Top Architect

Jul 1, 2020 · Backend Development

Understanding Microservices Architecture: Concepts, Benefits, and Key Components

Microservices, introduced in 2012 and popularized by Martin Fowler, decompose applications into small, independent services that communicate via lightweight protocols, enabling modular development, flexible technology choices, independent deployment, and improved scalability, while also introducing challenges such as distributed data consistency, testing complexity, and operational overhead.

Backend Architecture · Configuration Management · Microservices

0 likes · 16 min read

dbaplus Community

Jun 28, 2020 · Databases

How to Build a Visual MongoDB Slow Query Dashboard with PHP

This guide explains how to set up a PHP‑based web platform that collects MongoDB slow‑query logs via remote profiling, stores them in MySQL, and visualizes the data, including installation of required PHP extensions, database preparation, configuration, cron scheduling, and enabling profiling on MongoDB.

MongoDB · PHP · monitoring

0 likes · 7 min read

MaGe Linux Operations

Jun 25, 2020 · Databases

How to Monitor Redis with Zabbix: Auto‑Discovery Scripts and Templates

This guide walks you through creating Zabbix auto‑discovery scripts to extract all Redis INFO parameters, configuring custom keys, setting permissions, and building a complete Redis monitoring template with step‑by‑step screenshots.

Auto-discovery · Database Monitoring · Zabbix

0 likes · 7 min read

Qunar Tech Salon

Jun 23, 2020 · Operations

A Simple Gray Release Solution for High‑Concurrency Flight Ticket Systems

This article presents a lightweight gray release approach for complex flight ticket services, comparing traditional hardware and soft‑routing isolation methods, describing the authors' traffic‑based gray identification, business‑focused monitoring, implementation details, and automated safeguards to enable safe incremental deployments.

Backend · Deployment · Operations

0 likes · 8 min read

Aikesheng Open Source Community

Jun 22, 2020 · Operations

Introduction to the Prometheus Data Collection Process

This article explains the complete Prometheus data collection workflow, covering key concepts such as targets, samples, and meta labels, detailing the relabeling steps, configuration options, example use‑cases, and the final scrape and storage phases for effective monitoring.

Configuration · Prometheus · data collection

0 likes · 8 min read

Ops Development Stories

Jun 18, 2020 · Operations

Forward Zabbix Alerts to WeChat via Kafka – Complete Step‑by‑Step Guide

This guide shows how to route Zabbix alarm messages through a Kafka cluster and then deliver them to Enterprise WeChat using Python scripts, covering host configuration, Kafka/Zookeeper startup, topic creation, alert‑sending scripts, and Zabbix action setup.

Alerting · Enterprise WeChat · Kafka

0 likes · 6 min read

JD Retail Technology

Jun 17, 2020 · Operations

How JD’s Data Platforms Scaled for the 618 Mega‑Sale: Operations, Stress‑Testing, and Dual‑Stream Architecture

The article details JD’s data product teams’ systematic preparation for the 618 shopping festival, covering pressure estimation, capacity expansion, stress testing, emergency downgrade strategies, dual‑data‑center isolation, high‑fidelity end‑to‑end testing, and continuous monitoring to ensure stable, real‑time data services during massive traffic spikes.

Big Data · Data Platform · JD.com

0 likes · 10 min read

Xianyu Technology

Jun 17, 2020 · Backend Development

Lottery System Risk Management and SDK Integration

Xianyu mitigated lottery‑related financial loss by centralizing rights management, decoupling UI from business logic, and providing a unified SDK with simple draw APIs, while adding real‑time log backflow, comprehensive accounting and monitoring, cutting configuration time by over 50% and eliminating UI‑only risk.

Backend · Lottery System · SDK

0 likes · 10 min read

Laravel Tech Community

Jun 16, 2020 · Mobile Development

Kuaishou’s APM Platform and Mobile Performance Optimization: Insights from Yang Kai

In a mobile‑first world where limited device resources and unstable networks threaten user retention, Kuaishou’s performance team built an APM monitoring platform and applied systematic memory, startup, and jank optimizations that cut startup time by 40%, reduced package size by 23 MB, and significantly improved key product metrics.

APM · Kuaishou · Memory Management

0 likes · 9 min read

dbaplus Community

Jun 15, 2020 · Cloud Native

Deploying Prometheus on Kubernetes with Operator, Grafana, and Alertmanager

This guide walks through setting up a complete Prometheus monitoring stack on a Kubernetes cluster, covering both traditional YAML deployments and the Prometheus Operator, configuring services, integrating Grafana dashboards, and enabling Alertmanager notifications including WeChat alerts.

Prometheus · monitoring

0 likes · 34 min read

Liangxu Linux

Jun 13, 2020 · Operations

Mastering Monitoring: From Basics to Advanced Zabbix Practices

This comprehensive guide explains why monitoring is essential for operations, outlines monitoring goals and methods, reviews a wide range of open‑source tools, details a Zabbix‑based workflow, enumerates key metrics across hardware, system, application, network, security and business layers, and offers practical alerting and interview tips.

Alerting · Operations · Zabbix

0 likes · 21 min read

JD Retail Technology

Jun 10, 2020 · Operations

Logistics R&D Preparation for the 618 Promotion: System Readiness, Stress Testing, and Real‑Time Monitoring

The logistics R&D team spent 62 days preparing for the 618 promotion by analyzing core processes, applying stress tests, implementing fault‑tolerant architectures, planning capacity, and deploying real‑time monitoring tools to ensure system stability and performance under peak traffic.

Operations · Performance Testing · System Design

0 likes · 7 min read

Full-Stack DevOps & Kubernetes

Jun 9, 2020 · Operations

Configure Alertmanager to Send Alerts to Email, DingTalk, and WeChat

This guide walks you through modifying Alertmanager’s configuration to deliver alerts via QQ email, DingTalk chat‑bot webhooks, and Enterprise WeChat, including SMTP settings, webhook plugin installation, and the required wechat_configs parameters for seamless integration.

DevOps · DingTalk · Kubernetes

0 likes · 7 min read

Manbang Technology Team

Jun 8, 2020 · Cloud Native

Design and Implementation of a Zookeeper Operator for Kubernetes

This article outlines the design, functional requirements, CRD definition, architecture, deployment, scaling, monitoring, fault‑tolerance, and upgrade strategies of a Zookeeper operator on Kubernetes, including code examples, service configurations, and integration with Prometheus and OAM standards.

CRD · Cloud Native · Kubernetes

0 likes · 18 min read

Efficient Ops

Jun 3, 2020 · Operations

Understanding Kubernetes vs VM Monitoring: CPU, Memory, Disk & Network

This article compares monitoring metrics for CPU, memory, disk, and network between traditional KVM-based servers and Kubernetes pods, explaining why their indicators differ, how resource isolation works, and what key metrics users should watch to diagnose performance bottlenecks.

CPU · Kubernetes · memory

0 likes · 11 min read

Open Source Linux

Jun 1, 2020 · Operations

Why Inodes Fill Up Before Disk Space? Diagnose and Fix Linux Filesystem Limits

This article explains what inodes are, why they can become exhausted even when disk space remains, and provides step‑by‑step Linux commands and cleanup techniques to monitor and resolve inode exhaustion issues.

Filesystem · cleanup · cron

0 likes · 6 min read

iQIYI Technical Product Team

May 29, 2020 · Big Data

iQIYI's Full-Link Automated Monitoring Platform: Design and Implementation

iQIYI’s full‑link automated monitoring platform unifies tracing, metric and log collection with deep offline and real‑time analysis, delivering a DAG‑based call graph, near‑real‑time ingestion of tens of millions of logs, multi‑dimensional alerts and rapid root‑cause diagnosis that cut error‑lookup time by over 50% and now serves as a core component of the company’s microservice reference architecture.

Big Data · Metrics · architecture

0 likes · 12 min read

FunTester

May 26, 2020 · Fundamentals

Understanding Load Testing: Key Strategies and Best Practices

This article clarifies common misconceptions about load testing, defines it within performance testing, and provides practical strategies for test volume, load generators, scripting, think time, ramp-up/down, monitoring, diagnosis, and data analysis to ensure reliable performance assessments.

Software Testing · Test Strategy · monitoring

0 likes · 11 min read

dbaplus Community

May 25, 2020 · Operations

Scaling CAT Monitoring at Ctrip: Thread Model, Client Computation & Memory Tweaks

This article details how Ctrip optimized the CAT monitoring system—covering its large‑scale deployment, thread‑model redesign, offloading calculations to clients, double‑buffered reporting, and string handling improvements—to dramatically cut CPU usage, GC pressure, and memory consumption while handling billions of messages daily.

Distributed Systems · Thread Model · gc

0 likes · 25 min read

Aikesheng Open Source Community

May 25, 2020 · Operations

Understanding Prometheus Data Collection: Formats, Types, and Best Practices

This article explains Prometheus data collection by describing metric syntax, label usage, time‑series concepts, the four logical metric types (Counter, Gauge, Histogram, Summary), and provides practical naming, labeling, and selection guidelines for effective monitoring.

Counter · Gauge · Histogram

0 likes · 7 min read

Programmer DD

May 22, 2020 · Operations

Grafana 7.0 Released: New UX, Plugin Platform, Transformations & CloudWatch Support

Grafana 7.0 introduces a revamped user experience, a unified data model, a new plugin platform, Jaeger tracing support, powerful data transformations, AWS CloudWatch Logs integration, and enterprise usage analytics, offering enhanced visualization and monitoring capabilities across major data sources.

Dashboard · Data visualization · Grafana

0 likes · 3 min read

Top Architect

May 21, 2020 · Backend Development

Comprehensive Guide to Java Application Performance Optimization and Diagnosis

This article provides an in‑depth overview of Java application performance optimization, covering a four‑layer model (application, database, framework, JVM), on‑site and post‑mortem analysis methods, OS and JVM diagnostic tools, common code and GC issues, database deadlock handling, and practical tuning recommendations.

Database Tuning · JVM · diagnostics

0 likes · 23 min read

Efficient Ops

May 20, 2020 · Operations

How to Build a Sustainable CMDB: Three Essential Phases for Reliable Operations

This article explains how to design, implement, and maintain a robust Configuration Management Database (CMDB) by focusing on simple modeling, establishing data closure loops, and efficiently handling existing inventory, while leveraging Kafka, Flink, Elasticsearch, and Neo4j for fast querying and topology visualization.

CMDB · Configuration Management · automation

0 likes · 19 min read

Efficient Ops

May 19, 2020 · Cloud Native

Mastering Prometheus on Kubernetes: Practical Tips, Exporter Guide, and Capacity Planning

This article explores the history and principles of Prometheus monitoring, offers guidance on version selection, highlights its limitations, details common Kubernetes exporters, shows Grafana dashboard setups, and provides in‑depth strategies for exporter aggregation, golden metrics, multi‑cluster scraping, GPU monitoring, timezone handling, memory optimization, capacity planning, and rate calculations.

Grafana · Kubernetes · Prometheus

0 likes · 19 min read

Ops Development Stories

May 14, 2020 · Operations

How to Set Up Zabbix VMware Monitoring: Step-by-Step Configuration Guide

Learn how to enable Zabbix’s VMware monitoring by configuring collectors, editing the server config, linking vCenter templates, adding CPU, memory, and disk usage items, and creating triggers, with detailed code snippets and screenshots to ensure comprehensive virtual machine performance tracking.

Configuration · VMware · Zabbix

0 likes · 6 min read

HomeTech

May 14, 2020 · Cloud Native

Design and Implementation of the Next‑Generation Cloud‑Native Monitoring System at Autohome

The article describes Autohome's third‑generation cloud‑native monitoring platform, detailing its background, strategic goals for R&D efficiency, mobile‑first design, Prometheus‑based architecture with multi‑replica storage and InfluxDB remote storage, its operational impact, and future directions such as AI‑driven noise reduction.

Containers · cloud-native · monitoring

0 likes · 7 min read

Programmer DD

May 12, 2020 · Operations

Boost RabbitMQ Reliability: Proven Strategies for Producers, Consumers, and Ops

This comprehensive guide explains how to enhance RabbitMQ reliability by covering confirmation mechanisms, producer and consumer configurations, queue mirroring, alerting, monitoring metrics, and health‑check commands, providing actionable steps for developers and operations teams to ensure stable message delivery.

Message Queue · Operations · RabbitMQ

0 likes · 22 min read

MaGe Linux Operations

May 10, 2020 · Databases

How to Build a Complete MySQL Monitoring Dashboard with Prometheus and Grafana

This guide walks through deploying mysqld_exporter, configuring Prometheus and Grafana, and monitoring essential MySQL metrics such as replication health, query throughput, slow‑query counts, connection usage, and InnoDB buffer‑pool statistics, while also showing how to set up alert rules for proactive database operations.

Alerting · Exporters · Grafana

0 likes · 15 min read

ITPUB

May 3, 2020 · Operations

Mastering IT Monitoring: Goals, Methods, Tools, and Best Practices

This comprehensive guide explains why monitoring is essential for reliable operations, outlines clear monitoring objectives, walks through practical monitoring methods, compares popular open‑source tools, details a Zabbix‑based workflow, and lists key hardware, system, application, network, security, API, performance, and business metrics to track.

IT infrastructure · Operations · Zabbix

0 likes · 19 min read

Laravel Tech Community

May 2, 2020 · Operations

Comprehensive MySQL and Linux Operations Interview Guide

This guide compiles essential MySQL security steps, master‑slave replication principles, backup scripts, Linux boot overview, common port services, virus mitigation, monitoring tools, nginx optimization, InnoDB lock troubleshooting, replication lag reduction, high‑availability components, data migration utilities, and automation configuration management techniques for operations engineers.

Linux · Operations · automation

0 likes · 13 min read

Top Architect

May 1, 2020 · Operations

Comprehensive Guide to Java Runtime Error Diagnosis: CPU, Memory, Disk, GC, and Network Troubleshooting

This article presents a systematic approach to diagnosing and resolving Java runtime problems by examining CPU usage, disk I/O, memory consumption, garbage‑collection behavior, and network anomalies, offering practical commands, analysis techniques, and visual aids to pinpoint root causes in production environments.

Operations · gc · java

0 likes · 22 min read

Liangxu Linux

Apr 29, 2020 · Operations

How to Build a Complete Monitoring System: Goals, Methods, Tools & Best Practices

This guide explains why monitoring is essential for the entire operations lifecycle, outlines key monitoring objectives, describes practical methods and workflows, reviews a range of open‑source tools (including Zabbix, MRTG, Ganglia, Nagios, Smokeping, OpenTSDB), and details metric categories such as hardware, system, application, network, log, security, API, performance and business monitoring.

Alerting · Metrics · Zabbix

0 likes · 22 min read

vivo Internet Technology

Apr 29, 2020 · Cloud Native

Prometheus Architecture and Design Principles: A Deep Dive into Cloud-Native Monitoring

Prometheus, a CNCF‑graduated, cloud‑native monitoring system, combines pull‑based target discovery, a label‑rich time‑series data model, and four core metric types—gauge, counter, histogram, and summary—to provide near‑real‑time visibility, short‑term retention, alerting via AlertManager, and integration with Grafana and remote storage for scalable observability.

Alertmanager · CNCF · DevOps

0 likes · 11 min read

Qunhe Technology Quality Tech

Apr 29, 2020 · Operations

How Our Team Built a Stable SIT Environment: Lessons in Test Environment Governance

This article documents the step‑by‑step practices of a six‑person test‑environment availability team that unified middleware, streamlined deployment pipelines, piloted business usage, introduced monitoring and recovery mechanisms, and created a comprehensive SIT environment handbook to improve integration testing stability and operational efficiency.

Deployment · Operations · SIT

0 likes · 19 min read

UCloud Tech

Apr 28, 2020 · Cloud Native

How We Built a Highly Available Kubernetes Platform for Multi‑Cluster Deployments

This article explains why Kubernetes was chosen, describes the overall architecture, high‑availability master design, multi‑IDC cluster deployment, logging, monitoring, service exposure, image building, lifecycle hooks, CI/CD, multi‑cluster management, encountered challenges, and future plans for operators and automated scaling.

Kubernetes · Multi-Cluster · ci/cd

0 likes · 11 min read

Aikesheng Open Source Community

Apr 27, 2020 · Operations

Detailed Introduction to Prometheus: Architecture, Quick Deployment, Advantages and Drawbacks

This article provides a comprehensive overview of Prometheus, covering its origins, architecture, step‑by‑step deployment, configuration, web UI usage, as well as its key advantages and limitations for cloud‑native monitoring and operations.

Alertmanager · Cloud Native · Deployment

0 likes · 6 min read

DevOps Cloud Academy

Apr 23, 2020 · Operations

Step-by-Step Guide to Installing and Configuring Prometheus, Node Exporter, Alertmanager, and Grafana

This tutorial provides a beginner-friendly, step-by-step walkthrough for downloading, installing, configuring, and verifying Prometheus, Node Exporter, Alertmanager, and Grafana on a Linux server, including service setup, configuration files, and a simple alert test.

Alertmanager · Grafana · Installation

0 likes · 7 min read

dbaplus Community

Apr 22, 2020 · Operations

How 58 Daojia Built a Cloud‑Native Ops Platform to Streamline Migration and Cut Costs

This article recounts 58 Daojia’s four‑year journey from migrating its IDC infrastructure to public cloud, the challenges encountered, and how the team designed and evolved a multi‑generation operations platform that centralizes asset, cost, domain, and monitoring management, ultimately improving efficiency and reducing expenses.

Cost Management · asset management · cloud migration

0 likes · 14 min read

MaGe Linux Operations

Apr 22, 2020 · Operations

Why Kubernetes CPU Metrics Differ from Traditional VMs: A Deep Dive

This article compares CPU, memory, disk, and network monitoring metrics between traditional KVM servers and Kubernetes pods, explaining the underlying reasons for differences and offering guidance on interpreting the metrics for effective performance troubleshooting.

CPU · Kubernetes · monitoring

0 likes · 11 min read

Architects' Tech Alliance

Apr 19, 2020 · Operations

IO Performance Evaluation: Models, Tools, Metrics, and Optimization Strategies

This article explains common IO latency problems, introduces how to define and refine IO models, lists disk and network evaluation tools, describes key monitoring metrics, and provides practical tuning methods and case studies for improving storage and network performance.

monitoring · network · optimization

0 likes · 14 min read

21CTO

Apr 16, 2020 · Backend Development

How JD’s API Gateway Handles Tens of Millions of Concurrent Requests

This article explains how JD Retail built a high‑performance, secure, and observable API gateway that supports massive traffic, implements asynchronous processing for high concurrency, provides fine‑grained traffic control, gray‑release capabilities, and automated operations to serve native, web, and mini‑program clients.

api-gateway · automation · gray release

0 likes · 10 min read

FunTester

Apr 14, 2020 · Operations

Spot Performance Problems Without Writing a Single Line of Code

Experienced developers can often identify performance bottlenecks simply by reviewing code implementations, configuration settings (timeouts, intervals, database and Redis parameters), service monitoring data, and container and JVM configurations, which lets them avoid unnecessary test scripts and code changes.

Configuration · DevOps · Operations

0 likes · 2 min read

MaGe Linux Operations

Apr 13, 2020 · Operations

Step-by-Step Guide to Install and Configure Zabbix 4.4 on CentOS 8

This tutorial walks you through preparing a CentOS 8 system, installing Zabbix 4.4 and its dependencies, configuring the database and server, and completing the web‑based setup, providing all necessary commands and screenshots for a successful monitoring solution.

CentOS8 · Installation · Zabbix

0 likes · 8 min read

Cloud Native Technology Community

Apr 8, 2020 · Operations

Decoding Thanos Architecture: From Query to Compact for Scalable Monitoring

This article provides a detailed analysis of Thanos' architecture, explaining each core component—Query, Sidecar, Store Gateway, Ruler, Compact, and the upcoming Receiver—how they enable global view, high availability, and long‑term storage for distributed Prometheus deployments, and discusses design trade‑offs and optimization strategies.

Cloud Native · Long‑term Storage · Prometheus

0 likes · 12 min read

Ops Development Stories

Apr 8, 2020 · Operations

Deploy Zabbix Monitoring with Docker and Docker‑Compose on CentOS

This guide walks through preparing a CentOS 7 host, installing Docker, configuring a Zabbix server and MySQL containers, and optionally using docker‑compose to set up Zabbix components, including the web UI and agent, with detailed commands and volume mappings for persistent monitoring.

CentOS · Docker · Docker Compose

0 likes · 18 min read

DevOps

Apr 8, 2020 · Operations

Bilibili DevOps Case Study: Culture, Community, User‑Driven Demand Management, High‑Performance Microservices, and Data Operations

This article presents a comprehensive DevOps case study of Bilibili, covering its cultural background, community ecosystem, user‑centric demand management, migration to high‑performance microservices, and the implementation of logging, monitoring, and real‑time data platforms to support rapid, reliable delivery.

Bilibili · Data Platform · DevOps

0 likes · 17 min read

Efficient Ops

Apr 6, 2020 · Databases

How to Build a MySQL Monitoring Platform with Prometheus and Grafana

This article walks through setting up a production‑grade MySQL monitoring solution using Prometheus and Grafana, covering exporter installation, MySQL user configuration, systemd service setup, Prometheus job definition, key MySQL performance metrics, and basic alerting rules.

Grafana · Metrics · Prometheus

0 likes · 15 min read

ITFLY8 Architecture Home

Apr 5, 2020 · Backend Development

Master Spring Boot Actuator: Quick Start, Key Endpoints, and Security

This tutorial walks through what Spring Boot Actuator is, how to quickly create a demo project, configure endpoint exposure, explore essential endpoints such as health, metrics, loggers, and shutdown, and secure them with Spring Security, providing code snippets and configuration examples.

Actuator · Endpoints · Spring Boot

0 likes · 14 min read

Java Backend Technology

Apr 5, 2020 · Backend Development

Mastering Micrometer: From Counters to Grafana Dashboards in Spring Boot

This tutorial walks through Micrometer's metric types, how to register them with MeterRegistry, apply tags and naming conventions, and integrate the framework into Spring Boot applications with Actuator, Prometheus scraping, and Grafana visualization for comprehensive backend monitoring.

Grafana · Metrics · Micrometer

0 likes · 27 min read

360 Quality & Efficiency

Apr 3, 2020 · Operations

Prometheus Monitoring System: Concepts, Architecture, and Hands‑On Deployment with Node Exporter and Grafana

This article introduces the core concepts and architecture of the open‑source Prometheus monitoring system, explains its data model and metric types, and provides a step‑by‑step guide to install a Prometheus server, collect host metrics with Node Exporter, and visualize them using Grafana.

Grafana · Metrics · Prometheus

0 likes · 10 min read

Efficient Ops

Apr 1, 2020 · Operations

How to Use Nagios for Business-Level Service Monitoring: A Step-by-Step Guide

This article explains why traditional server and service monitoring (e.g., Zabbix) may miss business outages, then walks through setting up Nagios on Debian to monitor web page URLs, API health checks, and related services, including configuration files, plugins, and a desktop alert tool, Nagstamon.

Linux · Nagios · Ops

0 likes · 18 min read

Alibaba Terminal Technology

Apr 1, 2020 · Frontend Development

How to Build a Robust Frontend Safety Production System for High‑Reliability Web Apps

This article explains the concept of frontend safety production, outlines its evolution from basic monitoring to a systematic, cloud‑enabled framework, and details the core capabilities—pre‑change CI checks, gray‑release gating, and real‑time monitoring—required to ensure high‑quality, risk‑free frontend deployments.

CI · automation · frontend

0 likes · 12 min read

Java Captain

Apr 1, 2020 · Operations

Comprehensive Guide to Online Environment Deployment and Operations Practices

This article provides a thorough overview of planning, provisioning, and managing online production environments—including user sizing, bandwidth estimation, database design, OS versus container deployment, middleware selection, security, monitoring, SSH shortcuts, file transfer tools, automation scripts, Docker setup, and log viewing techniques—aimed at giving developers a complete operational perspective.

Deployment · Docker · Operations

0 likes · 16 min read

FunTester

Mar 31, 2020 · Operations

Interface Performance Testing – Tools, Scripts, and Guides

This article compiles a comprehensive list of resources—including tools, scripts, and tutorials—for conducting interface performance testing on Linux and other platforms, covering topics such as netdata localization, timewatch utility, load testing strategies, JVM heap dumps, and visualizing test data.

API · Linux · monitoring

0 likes · 6 min read

Continuous Delivery 2.0

Mar 30, 2020 · Operations

Dynamic Runtime Configuration Management at Facebook: Use Cases and Tooling

The article explains how Facebook manages dynamic runtime configuration for millions of services—covering feature gating, experiments, traffic control, topology balancing, monitoring, machine‑learning model updates, and internal behavior—using a suite of tools such as Configerator, Gatekeeper, Package Vessel, Sitevars, and MobileConfig.

AB testing · cloud operations · configuration-management

0 likes · 8 min read

Efficient Ops

Mar 26, 2020 · Operations

Why SRE Exists and How It Solves Reliability Challenges

This article explains why Site Reliability Engineering (SRE) emerged, outlines its core responsibilities, required skill set, and how it addresses reliability challenges through decoupling, SLO‑driven monitoring, and scenario‑based drills, while highlighting key observations and focus areas for modern operations teams.

SLO · SRE · monitoring

0 likes · 13 min read

Ops Development Stories

Mar 26, 2020 · Operations

How to Auto‑Discover and Monitor Redis Ports with Zabbix

This guide explains how to use Zabbix's auto‑discovery feature to automatically find Redis instances on a server, create shell or Python scripts for port detection, configure Zabbix agent keys, set up server‑side templates, discovery rules, item prototypes, graphs, and triggers, and finally apply the template to monitored hosts.

Auto-discovery · Python · Shell

0 likes · 9 min read

Efficient Ops

Mar 25, 2020 · Operations

How JD Logistics Built a 300‑Million‑Metric Real‑Time Monitoring System for 99.999% Uptime

This article details JD Logistics' journey to design and implement a massive, AI‑enhanced monitoring platform that handles over three million metrics across hundreds of warehouses, addressing challenges of scale, network complexity, frequent asset changes, and integrating AIOps for proactive fault detection and resolution.

CMDB · Kafka · LSTM

0 likes · 23 min read

Didi Tech

Mar 21, 2020 · Operations

Why Didi’s Nightingale Is Redefining Cloud‑Native Monitoring

Nightingale, Didi’s open‑source enterprise monitoring platform, builds on Open‑Falcon but adds a hierarchical object tree, in‑memory indexing, Gorilla‑compressed time‑series storage, a hybrid push‑pull alert engine, built‑in log monitoring, and a unified monapi module, delivering scalable, cloud‑native observability for both container and bare‑metal workloads.

Cloud Native · Open-Falcon · architecture

0 likes · 10 min read

Open Source Linux

Mar 19, 2020 · Operations

Essential Ops Playbook: Avoid Costly Mistakes in Server Management

This guide shares practical Linux server operation rules, emphasizing thorough testing, careful use of destructive commands, strict access control, regular backups, security hardening, continuous monitoring, and disciplined performance tuning to prevent costly outages and data loss.

Backup · monitoring · performance tuning

0 likes · 13 min read

Efficient Ops

Mar 16, 2020 · Cloud Native

Designing a Scalable, High‑Availability Kubernetes Monitoring Solution at Xiaomi

This article details Xiaomi's implementation of a highly available, persistent, and dynamically scalable Kubernetes monitoring system, covering challenges, architecture choices, Prometheus federation, performance testing, and future enhancements for cloud‑native observability.

Kubernetes · Prometheus · monitoring

0 likes · 18 min read
