Tagged articles
2179 articles
Efficient Ops
Jul 7, 2022 · Operations

Master Zabbix: From Installation to Advanced Custom Monitoring and Alerts

This comprehensive guide explains why server monitoring is essential, details the concept of high‑availability "nines", walks through Zabbix installation, web UI setup, custom item creation, trigger configuration, alert integration, distributed monitoring, SNMP support, and practical scripts for managing large‑scale server farms.

Zabbix · automation · monitoring
0 likes · 29 min read
IT Services Circle
Jul 6, 2022 · Databases

Understanding MySQL COUNT() Performance and Strategies for Large Tables

This article explains how MySQL COUNT() works under different storage engines, why counting rows becomes slow on large InnoDB tables, and presents practical methods such as using EXPLAIN rows, auxiliary count tables, batch processing, and transaction‑based updates to obtain approximate or exact row counts efficiently.

InnoDB · _count · database
0 likes · 12 min read
dbaplus Community
Jul 4, 2022 · Operations

Why Most Monitoring Systems Fail: Lessons from a Veteran Ops Engineer

A seasoned operations professional shares personal experiences and hard‑earned insights on why traditional monitoring often becomes ineffective, how over‑automation and noisy dashboards hurt teams, and what a capability‑focused, user‑centric approach to observability should look like.

Operations · SRE · monitoring
0 likes · 12 min read

Data Indicator Testing Platform and Quality Assurance

The article presents an Indicator Testing Platform that automates metric validation—covering timeliness, completeness, accuracy, and consistency—through model‑level comparison, regression, online monitoring, and TDD‑style testing, dramatically reducing manual effort and enabling rapid detection and correction of data quality issues across thousands of business indicators.

Automated Testing · Data Platform · Data Quality
0 likes · 10 min read
Architecture Digest
Jul 2, 2022 · Operations

Design and Evolution of Vivo Server‑Side Monitoring System

This article systematically outlines the design, components, data flow, and evolution of Vivo’s server‑side monitoring system, covering data collection, transmission, storage with OpenTSDB, visualization, alerting mechanisms, and comparisons with other monitoring solutions.

Alerting · OpenTSDB · Operations
0 likes · 19 min read
macrozheng
Jul 1, 2022 · Operations

How to Optimize Server Performance: Config, Load Analysis, and Kernel Tuning

Learn practical methods to boost server performance by selecting appropriate hardware configurations, analyzing CPU, memory, disk I/O and network loads, and fine‑tuning kernel parameters such as file limits and TCP settings, with step‑by‑step commands and monitoring tools like htop, iostat, and nload.

Kernel Parameters · Linux · monitoring
0 likes · 14 min read
21CTO
Jun 28, 2022 · Operations

Master Prometheus: From Metrics Collection to Alerts and Grafana Visualization

This comprehensive guide walks you through Prometheus fundamentals, including metric exposure, scraping, storage, querying with PromQL, custom exporter creation in Go, dynamic configuration reloading, and visualizing data with Grafana, while also covering alerting with Alertmanager and best practices for accurate histogram bucket design.

Alerting · Grafana · Metrics
0 likes · 20 min read
Architect's Tech Stack
Jun 28, 2022 · Backend Development

Using Alibaba Druid Connection Pool with Spring Boot: Dependencies, Configuration, Monitoring, and Customization

This article explains how to integrate Alibaba's Druid database connection pool into a Spring Boot application, covering Maven dependencies, YAML configuration, built‑in filters, monitoring pages, ad removal techniques, and programmatic access to Druid statistics for comprehensive backend performance management.

Configuration · Database Connection Pool · Druid
0 likes · 14 min read
Alibaba Cloud Developer
Jun 28, 2022 · Big Data

How Kuaishou Guarantees Real‑Time Data Warehouse Reliability During Billion‑Scale Events

This article details Kuaishou’s real‑time data warehouse architecture and its comprehensive assurance framework—including forward lifecycle standards, reverse fault‑injection testing, and Spring Festival event practices—highlighting challenges of massive traffic, high timeliness, accuracy, and stability, and outlining future plans for automation, batch‑stream integration, and cost reduction.

Flink · Real-time Streaming · SLA
0 likes · 23 min read
AntTech
Jun 28, 2022 · Operations

AntMonitor: Evolution, Features, and Core Technologies of Ant Group’s Observability Platform

The article details Ant Group’s AntMonitor observability platform, covering its development timeline, holographic monitoring capabilities, integrated performance analysis, efficient data integration, built‑in AI‑driven analytics, Monitoring‑as‑a‑Service, and the underlying high‑performance time‑series database and cloud‑native architecture that support massive real‑time data processing.

CloudNative · TimeSeriesDatabase · aiops
0 likes · 17 min read
IT Architects Alliance
Jun 27, 2022 · Operations

Comprehensive Guide to Prometheus: Metrics Collection, Storage, Querying, Alerting and Visualization

This article provides a detailed overview of Prometheus, covering its architecture, metric exposure, scraping models, storage format, metric types, custom exporter implementation in Go, PromQL query language, built‑in functions, Grafana integration, and alerting with Alertmanager, offering practical code examples throughout.

Alerting · Go · Grafana
0 likes · 20 min read
Top Architect
Jun 27, 2022 · Backend Development

Microservices and Kubernetes: A Comprehensive Guide to Design, Implementation, and High‑Availability Deployment

This article presents a step‑by‑step tutorial on designing a simple front‑end/back‑end separated microservice system, implementing it with Spring Boot, deploying it on Kubernetes, and enhancing reliability with multi‑instance registration, monitoring, logging, tracing, and traffic control mechanisms.

Microservices · Spring Boot · monitoring
0 likes · 18 min read
Architecture Talk
Jun 27, 2022 · Operations

Why Build an SRE System? A Complete Guide to Site Reliability Engineering

This article explains the motivations behind Site Reliability Engineering (SRE), outlines its strategic goals, defines key concepts such as SLI, SLO, SLA and error budget, introduces the four golden metrics for monitoring distributed systems, and provides practical guidance on building, operating, and continuously improving an SRE practice.

Error Budget · SLI · SLO
0 likes · 14 min read
IT Architects Alliance
Jun 26, 2022 · Cloud Native

How to Build a High‑Availability Microservices System on Kubernetes from Scratch

This guide walks through designing a simple Java Spring Boot microservice architecture, implementing the services, deploying them on a Kubernetes cluster with Eureka registration, adding Prometheus‑Grafana monitoring, configuring logging, tracing, and flow‑control, and validating high availability using real‑time dashboards and tools like Lens.

Cloud Native · Kubernetes · Microservices
0 likes · 19 min read
Efficient Ops
Jun 22, 2022 · Operations

Top 13 Essential Linux Ops Tools Every Sysadmin Should Master

This guide introduces thirteen practical Linux operations tools, from network bandwidth monitors like Nethogs to security scanners such as Nmap, with concise descriptions, installation commands, and usage tips to help system administrators efficiently manage and secure their servers.

Operations · Sysadmin · monitoring
0 likes · 12 min read
Top Architect
Jun 22, 2022 · Backend Development

Designing API Error Codes and Messages: Best Practices

This article explains how to design clear and consistent API error codes and messages by borrowing the segmentation logic of HTTP status codes, defining code‑message pairs, providing personalized user‑facing messages, and centralizing handling for monitoring and alerting, ultimately reducing communication and maintenance costs.

Error Codes · HTTP status · api-design
0 likes · 6 min read
Inke Technology
Jun 22, 2022 · Operations

How InnoLive Cut Monitoring Costs by 86% with Nightingale

This article details InnoLive's migration from Open‑Falcon to the Nightingale monitoring platform, describing the pain points of their previous system, the selection process, deployment architecture, collection practices, and the substantial cost and performance benefits achieved.

Cost reduction · Open-Falcon · Operations
0 likes · 10 min read
Qunar Tech Salon
Jun 22, 2022 · Operations

Design and Implementation of Multi‑Cluster HPA Metrics Collection, Analysis, and Reporting in Kubernetes

This article explains the background, benefits, and measurement criteria of the Kubernetes Horizontal Pod Autoscaler (HPA), describes the creation of metric tables and SQL queries for collecting scaling events and CPU usage, and presents a Python‑based workflow that aggregates the data, stores daily reports, validates results, and sends automated email summaries.

HPA · Kubernetes · Operations
0 likes · 19 min read
HaoDF Tech Team
Jun 21, 2022 · Operations

Evolution and High‑Availability Construction of the Haodafu Offline Message Push System

This article describes how the Haodafu offline push service grew from a simple PHP notification tool into a robust, highly available microservice platform. The team redesigned the architecture, adopted vendor push channels, added message‑queue reliability, and built comprehensive monitoring, observability, and a fault‑diagnosis platform to ensure delivery rates and operational stability.

Mobile Backend · SRE · high availability
0 likes · 21 min read
ITPUB
Jun 18, 2022 · Operations

How MDD and SRE Cut Mini‑Program Image‑Upload Failures from Days to Minutes

This article recounts a three‑day image‑upload outage in a mini‑program, analyzes the multi‑layer causes, and shows how combining Metrics‑Driven Development with SRE and a custom observability platform dramatically reduces diagnosis time and improves reliability.

Metrics-Driven Development · Mini Program · Operations
0 likes · 20 min read
Xingsheng Youxuan Technology Community
Jun 17, 2022 · Frontend Development

How Prism Transformed Front‑End Monitoring at Scale: Architecture, Challenges & Insights

This article details the design, challenges, and solutions behind Prism, a self‑built front‑end monitoring platform that collects multi‑device SDK data, processes it through Kafka, Flink, and ClickHouse, visualizes metrics, and integrates with A/B testing. It also outlines future enhancements for broader enterprise adoption.

AB testing · frontend · monitoring
0 likes · 14 min read
Architecture Digest
Jun 17, 2022 · Cloud Native

Vivo Container Cluster Monitoring Architecture and Cloud‑Native Practices

This article describes Vivo's practical experience building a cloud‑native monitoring system for large‑scale container clusters, covering the shortcomings of traditional monitoring, the Prometheus‑centric ecosystem, high‑availability architecture, challenges faced, and future directions such as automation and AI‑driven operations.

Prometheus · VictoriaMetrics · Vivo
0 likes · 13 min read
Ops Development Stories
Jun 16, 2022 · Operations

How to Streamline Call Center Incident Management: From Rapid Diagnosis to Automated Recovery

This article outlines a comprehensive approach to handling call‑center incidents, covering fault boundary definition, emergency recovery actions, rapid root‑cause localization, enhanced monitoring strategies, clear alerting, proactive automation, and the creation of concise, regularly exercised emergency response plans.

Operations · call center · fault-recovery
0 likes · 14 min read
Selected Java Interview Questions
Jun 13, 2022 · Backend Development

Guide to Setting Up Spring Boot Admin for Monitoring Spring Boot Applications

This article provides a step‑by‑step tutorial on installing and configuring Spring Boot Admin, including Maven dependencies, server and client setup, YML properties, security, Nacos registration, email notifications, custom health indicators, and Micrometer metrics to monitor Spring Boot services.

Configuration · Metrics · Microservices
0 likes · 14 min read
Top Architect
Jun 13, 2022 · Backend Development

Microservice Architecture Roadmap and Key Components Explained

This article outlines a comprehensive roadmap for adopting microservice architecture, describing its benefits, core concepts, and essential tools such as Docker, container orchestration, API gateways, load balancing, service discovery, event buses, logging, monitoring, tracing, persistence, caching, and cloud providers.

Microservices · architecture · caching
0 likes · 15 min read
Java Baker
Jun 12, 2022 · Operations

System Capacity Checklist: Key Metrics Every Architect Should Track

Architects should treat system capacity like a pre‑flight checklist, using this comprehensive guide to monitor resource usage across services, databases, and queues, and to define business metrics and state‑machine indicators that reveal bottlenecks and guide scaling decisions.

Metrics · Operations · architecture
0 likes · 5 min read
Top Architect
Jun 11, 2022 · Operations

Comprehensive Fault Handling and Emergency Response Guide for Call Center Systems

This guide details a call‑center system fault scenario and provides a step‑by‑step approach for operations teams to identify symptoms, assess impact, implement rapid recovery actions, improve monitoring, and maintain an effective emergency response plan, ensuring faster resolution and long‑term fault self‑healing.

Operations · call center · emergency plan
0 likes · 12 min read
NetEase Game Operations Platform
Jun 10, 2022 · Databases

Apache Doris Deployment and Optimization at NetEase Interactive Entertainment

This article details NetEase Interactive Entertainment's adoption of Apache Doris for large‑scale game data analytics, covering background, Doris architecture, cluster governance, tablet and compaction tuning, scaling strategies, monitoring, alerting, and fault‑handling practices to improve performance and stability.

Apache Doris · Big Data · Cluster Management
0 likes · 22 min read
Huolala Tech
Jun 9, 2022 · Mobile Development

How Huolala Optimized iOS App Startup Speed: Tools, Metrics, and Best Practices

This article details Huolala's systematic approach to improving iOS app startup performance, covering metric definitions, monitoring setup, tool usage, optimization techniques across launch phases, long‑tail handling, anti‑degradation measures, and the resulting performance gains.

app startup · iOS · monitoring
0 likes · 19 min read
JavaEdge
Jun 3, 2022 · Operations

How to Scale Systems: From Load Metrics to Architecture Strategies

This article explains how to describe current system load, choose appropriate load parameters, analyze Twitter's scaling challenges, compare relational and push‑based timeline designs, clarify latency versus response time, emphasize percentile monitoring, and evaluate vertical versus horizontal scaling and hybrid approaches for handling increasing traffic.

Latency · Load Testing · Scalability
0 likes · 15 min read
Top Architect
Jun 2, 2022 · Cloud Native

A Beginner's Guide to Designing, Implementing, and Deploying Microservices on Kubernetes

This article walks readers through the complete lifecycle of a microservice system—from architectural design and Java Spring Boot implementation to Kubernetes deployment, high‑availability setup, monitoring with Prometheus/Grafana, tracing with Zipkin, and flow‑control with Sentinel—providing practical code snippets and step‑by‑step instructions.

Kubernetes · Microservices · cloud-native
0 likes · 21 min read
Architecture Digest
Jun 2, 2022 · Operations

Incident Handling and Fault Recovery Practices for Call Center Systems

The article outlines a comprehensive approach to diagnosing, responding to, and preventing call‑center system failures by describing typical fault scenarios, step‑by‑step recovery actions, monitoring enhancements, emergency plan components, and continuous improvement strategies for operations teams.

Operations · call center · emergency procedures
0 likes · 13 min read
Tencent Cloud Developer
May 30, 2022 · Cloud Native

An Introduction to Prometheus: Metrics Collection, Storage, Querying, Visualization and Alerting

Prometheus is an open‑source monitoring system that scrapes metrics from services or exporters, stores them in a time‑series database, lets users query with PromQL, visualizes data via its web UI or Grafana, and sends alerts through Alertmanager, supporting custom Go metrics, various discovery methods, and four metric types.

Alerting · Go · Grafana
0 likes · 21 min read
Efficient Ops
May 29, 2022 · Operations

How to Build a Semi‑Automated Prometheus Monitoring Stack for Small Teams

This article details a practical, semi‑automated monitoring solution for environments with fewer than 500 nodes, covering active monitoring concepts, Prometheus data modeling, service‑framework instrumentation, data scraping and visualization with Grafana, and alert handling via Alertmanager.

Grafana · Operations · Prometheus
0 likes · 13 min read
Liangxu Linux
May 26, 2022 · Operations

Master Linux Disk Usage: Essential Commands to Monitor and Manage Filesystems

This guide compiles practical Linux one‑liners for sorting directories by size, listing subdirectories, monitoring disk usage changes, removing unwanted .svn folders, checking inode and partition statistics, and measuring read/write performance, providing sysadmins with quick, actionable tools.

Filesystem · command-line · disk usage
0 likes · 5 min read
dbaplus Community
May 26, 2022 · Databases

How Meituan Built a Scalable Autonomous Database System to Slash MTTR

This article details Meituan's journey from rapid database growth and operational bottlenecks to a multi‑year roadmap that combines platform‑level monitoring, rule‑based and AI‑enhanced root‑cause analysis, and automated remediation, ultimately delivering measurable improvements in alert accuracy, recall rates, and overall database reliability.

AI · Scalability · automation
0 likes · 19 min read
Top Architect
May 26, 2022 · Backend Development

Designing API Return Codes and Messages: Best Practices for Backend Services

This article explains how to design clear, consistent API return codes and messages by referencing HTTP status codes, defining code‑message pairs, supporting personalized messages for different clients, and using unified handling for monitoring and alerting, ultimately improving communication and reducing maintenance costs.

Error Codes · HTTP status codes · Software Architecture
0 likes · 6 min read
Open Source Linux
May 26, 2022 · Operations

Optimizing Zabbix Agent Monitoring for Linux and Windows: Best Practices

This guide explains how Zabbix agent monitors Linux and Windows systems, compares active and passive modes, and provides detailed optimization tips for OS metrics, CPU, memory, filesystem, Windows services, performance counters, and event logs, including alarm suppression and macro usage.

Linux · Operations · Windows
0 likes · 11 min read
Architecture Digest
May 19, 2022 · Operations

Designing High‑Availability Stateless Services: Redundancy, Load Balancing, Scaling, and Monitoring

The article explains how to build highly available stateless services by using redundant deployment, vertical and horizontal scaling, appropriate load‑balancing algorithms, monitoring, and automated recovery, and also discusses high‑concurrency identification, CDN/OSS usage, and practical recommendations for cloud‑native environments.

Vertical Scaling · high availability · horizontal scaling
0 likes · 11 min read
Qunar Tech Salon
May 19, 2022 · Operations

Design and Optimization of a Large‑Scale Monitoring System at Qunar.com

This article describes the architecture, challenges, and performance optimizations of Qunar.com's Watcher monitoring platform, covering massive metric collection, master‑worker redesign, Graphite/Whisper storage enhancements, and future migration to Go‑based cloud‑native solutions.

Cloud Native · Distributed Systems · ci/cd
0 likes · 13 min read
MaGe Linux Operations
May 17, 2022 · Backend Development

Mastering Backend Development: Key Concepts, Architecture, and Best Practices

This comprehensive guide explores essential backend development concepts—from system design principles like high cohesion and low coupling, through architecture patterns such as high availability and load balancing, to network communication, fault handling, monitoring, and deployment strategies, providing clear explanations for developers.

Backend · Scalability · System Design
0 likes · 32 min read
ByteDance Data Platform
May 16, 2022 · Operations

How ByteDance’s SLA Assurance Platform Guarantees Data Reliability at Scale

This article explains how ByteDance’s self‑built SLA assurance platform addresses data pipeline communication costs, unclear responsibilities, and operational pressure by introducing roles, a streamlined signing workflow, checkpoint and recommendation calculations, and real‑time monitoring to achieve a 99.1% SLA compliance rate.

Operations · SLA · monitoring
0 likes · 9 min read
SQB Blog
May 9, 2022 · Operations

How Havok Enables Realistic Full‑Link Load Testing for Scalable Services

This article explains how the Havok full‑link load testing platform was designed and built to replay real traffic safely, provide capacity‑assessment data, support multiple test types, and offer real‑time monitoring and circuit‑breaker protection for large‑scale online services.

Load Testing · capacity planning · full‑link testing
0 likes · 16 min read
DevOps Cloud Academy
May 7, 2022 · Operations

Optimizing Zabbix Monitoring for Linux and Windows Systems

This article provides a comprehensive guide on configuring and optimizing Zabbix agent monitoring for Linux and Windows, covering agent types, passive and active modes, macro variables, LLD macros, CPU/memory/file‑system metrics, and Windows service, performance counter, and event‑log monitoring.

Linux · Operations · Windows
0 likes · 9 min read
Efficient Ops
Apr 29, 2022 · Operations

How Ctrip Scaled Its Cloud Platform to 10k Nodes: Real‑World Kubernetes Ops Lessons

This article shares Ctrip's practical experiences in scaling a hybrid private‑cloud platform to over ten thousand nodes, covering Kubernetes control‑plane stability, host monitoring, network observability, image management, and capacity planning to ensure high availability for massive online services.

Kubernetes · Network Observability · cloud operations
0 likes · 18 min read
Volcano Engine Developer Services
Apr 26, 2022 · Operations

How Volcano Engine’s TLS Transforms Log Management for Kubernetes at Scale

This article explains the challenges of traditional open‑source log collection in cloud‑native environments, describes Volcano Engine’s unified TLS architecture with its centralized configuration and CRD‑based deployment, and presents real‑world case studies that demonstrate improved availability, efficiency, and scalability.

Cloud Native · Distributed Systems · Kubernetes
0 likes · 15 min read
Kuaishou Frontend Engineering
Apr 21, 2022 · Mobile Development

How Kuaishou Optimized iOS App Startup and Prevented Performance Degradation

This article details Kuaishou's systematic approach to iOS app startup optimization, covering premain and postmain phases, dynamic library lazy loading, +load and static initializer monitoring, binary reordering, task scheduling, background fetch, prewarm mechanisms, and a comprehensive anti‑degradation framework to sustain launch performance.

Mobile Development · app startup · degradation prevention
0 likes · 26 min read
58 Tech
Apr 21, 2022 · Frontend Development

Interview with Li Yi on Building 58 Group’s Large Front‑End Technology Service System

In this interview, Li Yi, head of 58 Group’s Front‑End Technology Department, explains how the company built its large‑scale front‑end service system—including a Hybrid permission platform, a React Native hot‑update platform, and the Beidou monitoring system—while discussing cross‑platform frameworks, performance challenges, low‑code adoption, and advice for newcomers.

Flutter · cross‑platform · frontend
0 likes · 11 min read
DaTaobao Tech
Apr 20, 2022 · Operations

Understanding Wireless Operations and Maintenance: Origins, Challenges, and Future Directions

Wireless operations and maintenance (O&M) evolved from backend‑focused practices to address stability and performance of mobile‑device services, tackling low issue detection rates and delayed responses through improved monitoring, gray‑release tagging, phased rollouts, AI‑driven diagnostics, and automated release gates, while inviting collaborative development.

gray release · incident response · mobile maintenance
0 likes · 13 min read
Architect
Apr 16, 2022 · Operations

A Comprehensive Overview of Site Reliability Engineering (SRE) Roles and Practices

This article explains what SRE is, why it was created, how its responsibilities differ across companies, and breaks the work into Infrastructure, Platform, and Business SRE while covering deployment, on‑call, SLI/SLO, incident post‑mortems, capacity planning, user support, and career advice.

Oncall · Operations · SLI/SLO
0 likes · 22 min read
ELab Team
Apr 16, 2022 · Frontend Development

Master Front‑End Monitoring: From Data Collection to Performance Metrics

This article outlines the end‑to‑end workflow for front‑end monitoring in an APM platform, covering data collection, reporting, cleaning, storage, and consumption, and dives deep into environment info, exception handling, performance metrics, and efficient data upload strategies.

APM · Metrics · Web
0 likes · 18 min read
Maoyan Technology Team
Apr 13, 2022 · Big Data

Inside Maoyan’s Near‑Real‑Time Transaction Data Center

The article details Maoyan’s transaction data center, explaining its background, the need for a unified real‑time order model, the benefits of reduced coupling and improved data accuracy, and describes the system’s architecture, components, data collection, processing, task scheduling, monitoring, and future plans.

Real-Time · big-data · data center
0 likes · 29 min read
DeWu Technology
Apr 11, 2022 · Backend Development

Content Fallback Strategies for Community Feed Services

To keep community feeds smooth despite network or backend failures, the system employs multiple fallback mechanisms—including Redis‑cached recommendation pools, CDN‑mirrored response files, asynchronous content caching, specialized follow‑feed and outfit‑selection caches, detail‑page preloading, and fine‑grained monitoring with alerts—ensuring continuous scrolling and a better user experience.

Backend · CDN · content fallback
0 likes · 11 min read
dbaplus Community
Apr 10, 2022 · Operations

How to Build a Practical SRE Operations Framework for Large‑Scale Systems

This article presents a hands‑on SRE framework covering the full product lifecycle—code development, resource planning, deployment, operational reliability, and decommissioning—derived from real‑world practices at Xiaomi and Sina to help teams manage massive internet services efficiently and cost‑effectively.

Resource Management · SRE · System Lifecycle
0 likes · 16 min read
Yunxuetang Frontend Team
Apr 8, 2022 · Frontend Development

Essential Front-End Tech Insights: SSR, Monitoring, Babel, CSS 2022 & More

This article collection explores key front‑end topics, including Vue server‑side rendering, comprehensive monitoring strategies, Babel compilation fundamentals, the latest CSS 2022 features, decorator design patterns, and tile‑layout techniques for BI reporting, and also introduces the Cloud Classroom front‑end team.

CSS · SSR · babel
0 likes · 4 min read
Open Source Linux
Apr 6, 2022 · Cloud Native

Why Prometheus’s TSDB Makes Monitoring Scalable: A Deep Dive

This article explains how Prometheus’s time‑series database handles massive monitoring data, from basic concepts and query examples to storage engine design, indexing strategies, and powerful data computation techniques such as recording rules.

Prometheus · TSDB · Time Series
0 likes · 8 min read
SQB Blog
Apr 2, 2022 · Operations

Designing a Next‑Gen Observability Platform: From Zipkin to Hera

This article chronicles the evolution of a company's monitoring system from a Zipkin‑based tracing solution to a cloud‑native observability platform called Hera, detailing design goals, technology choices, challenges with MySQL storage, and the adoption of Prometheus‑compatible metrics, Jaeger tracing, and Kubernetes operators.

Distributed Tracing · Prometheus · jaeger
0 likes · 22 min read
Open Source Linux
Apr 2, 2022 · Operations

How to Speed Up Call Center Incident Recovery with Proven Ops Strategies

This article walks through a real call‑center outage scenario, outlines systematic fault‑identification steps, practical emergency recovery actions, monitoring enhancements, concise emergency‑plan design, and introduces intelligent event‑handling to help operations teams resolve incidents faster and more reliably.

Operations · automation · call center
0 likes · 13 min read
Alibaba Terminal Technology
Mar 31, 2022 · Frontend Development

Boosting Taobao Mini Program Performance: Key Lessons and Best Practices

This article examines how Taobao mini programs tackled stability and speed challenges by redefining user experience metrics, standardizing operational data, analyzing performance stages, and implementing best‑practice solutions such as engine reuse, data prefetching, and template snapshot rendering, resulting in measurable improvements.

Mini Program · Taobao · data analysis
0 likes · 9 min read
DeWu Technology
Mar 28, 2022 · Backend Development

Loss Prevention Architecture and Real-Time Data Reconciliation for E‑commerce Platforms

The e‑commerce platform’s loss‑prevention architecture combines domain‑modeled scenario identification, pre‑emptive checks, automated testing, and a real‑time data‑reconciliation pipeline using Dcheck and rule factories to detect anomalies, trigger alerts, and execute emergency response plans, thereby minimizing financial risk and ensuring transaction stability.

loss prevention · monitoring · real-time reconciliation
0 likes · 13 min read
IT Architects Alliance
Mar 27, 2022 · Backend Development

Simulating a 10‑Billion Red‑Envelope System with Go: From 3K to 6K QPS

This article details a step‑by‑step engineering experiment that reproduces a high‑throughput "red‑envelope" service, outlining the required hardware, software stack, load‑generation logic, monitoring setup, and performance results for handling up to 6,000 QPS on a 100‑million‑user scale.

Backend · Distributed Systems · Go
0 likes · 21 min read
Programmer DD
Mar 24, 2022 · Databases

Master RedisInsight: Install, Configure, and Use the Ultimate Redis GUI

Learn how to install, configure, and operate RedisInsight—an intuitive Redis GUI—on Linux and Kubernetes, covering package download, environment setup, service deployment, basic usage, memory analysis, and key management, with step‑by‑step commands and visual screenshots.

Database GUI · Installation · Kubernetes
0 likes · 8 min read
Efficient Ops
Mar 20, 2022 · Operations

How to Keep Your Kubernetes Nodes and Pods Stable: Essential Ops Practices

This guide walks through essential Kubernetes operations—from node kernel upgrades and Docker daemon tuning to pod resource limits, scheduling policies, health probes, logging standards, and comprehensive monitoring—providing practical commands and configurations to keep clusters stable and observable.

Kubernetes · Node Management · Operations
0 likes · 18 min read
Open Source Linux
Mar 18, 2022 · Operations

Evolution of Open‑Source Monitoring Tools: From Nagios to Prometheus

This article traces the development of open‑source monitoring solutions from early tools like Nagios and Cacti through modern platforms such as Prometheus and Nightingale, comparing their strengths, weaknesses, and typical use cases while also looking ahead to emerging observability trends in cloud‑native environments.

Nagios · Operations · Prometheus
0 likes · 14 min read
Top Architect
Mar 16, 2022 · Backend Development

Comprehensive Overview of Microservices Architecture

This article provides a detailed introduction to microservices, covering its definition, core principles such as small independent services, lightweight communication, independent deployment and management, the advantages and disadvantages, suitable organizational contexts, and the essential components like service discovery, API gateways, configuration centers, monitoring, circuit breaking, container orchestration, and service mesh.

Backend · Distributed Systems · architecture
0 likes · 15 min read
Efficient Ops
Mar 15, 2022 · Cloud Native

How eBPF Powers Seamless Observability in Cloud‑Native Kubernetes Environments

This article explains why the rise of Kubernetes as a cloud‑native standard brings new observability challenges, outlines how eBPF enables non‑intrusive, multi‑language, multi‑protocol data collection, and describes a comprehensive monitoring stack—including golden metrics, service topology, tracing, alerts, and network diagnostics—to achieve end‑to‑end visibility in complex Kubernetes deployments.

Cloud Native · Kubernetes · Operations
0 likes · 22 min read
DataFunTalk
Mar 15, 2022 · Big Data

Bilibili's Billion‑Scale Data Synchronization Using Apache SeaTunnel

This article details Bilibili's implementation of a hundred‑terabyte‑per‑day data synchronization pipeline, covering tool selection between DataX‑based Rider and SeaTunnel‑based AlterEgo, architecture design, performance tuning, logging optimization, rate‑limiting strategies, and comprehensive monitoring for large‑scale offline data ingestion and export.

Apache SeaTunnel · Big Data · TiDB
0 likes · 13 min read
Open Source Linux
Mar 11, 2022 · Operations

Essential Linux Ops Tools: Monitoring, Performance, and Security Utilities

This article presents a curated list of practical Linux operation tools—including Nethogs, IOzone, IOTop, IPtraf, IFTop, HTop, NMON, MultiTail, Fail2ban, Tmux, Agedu, NMap, and Httperf—detailing their purpose, download links, installation commands, and basic usage to help system administrators improve monitoring, performance testing, and security on Linux servers.

Linux · Operations · Sysadmin
0 likes · 12 min read
Efficient Ops
Mar 10, 2022 · Operations

Why Prometheus’s TSDB Makes Monitoring Scalable: A Deep Dive

This article explains how Prometheus transforms raw monitoring data into actionable insights by using a time‑series database (TSDB) that efficiently stores massive metric streams, supports powerful queries, and enables pre‑computed calculations for fast dashboards and alerts.

Prometheus · TSDB · TimeSeries
0 likes · 7 min read
DeWu Technology
Mar 9, 2022 · Backend Development

Evolution of a Message Platform from v1.0 to v3.0

The DeWu App’s messaging platform began as a tightly‑coupled v1.0 push API plagued by latency and priority issues. Version 2.0 introduced a unified interface with fatigue control, multi‑version sending, and robust monitoring. Version 3.0 moved to a modular, stateless architecture driven by an orchestration engine that separates core capabilities from business extensions while providing dynamic strategies, SLA‑level latency guarantees, and comprehensive stability monitoring.

System Architecture · message platform · monitoring
0 likes · 6 min read
Ops Development Stories
Mar 4, 2022 · Cloud Native

Why Observability Is the ‘Force’ Empowering Modern IT Systems

This talk explains why observability is essential for cloud‑native IT systems, covering its core value of empowerment, various definitions, evaluation criteria such as zero‑intrusion, multidimensionality and real‑time response, and practical building approaches using SaaS, open‑source and integration, illustrated with numerous industry case studies.

OLAP · SaaS · eBPF
0 likes · 24 min read
ITPUB
Mar 3, 2022 · Operations

Unlock Linux Performance: Master Metrics, Tools, and Optimization Techniques

This guide explains Linux performance optimization by defining key metrics such as throughput, latency, average load, and CPU usage, describing how to select and interpret tools like vmstat, pidstat, perf, and dstat, and offering concrete steps to diagnose and fix CPU, memory, I/O, and context‑switch bottlenecks.

CPU · Linux · Memory
0 likes · 46 min read
DevOps Cloud Academy
Mar 2, 2022 · Operations

Promoter: Rendering AlertManager Graphs for DingTalk Notifications Using Go

The article introduces Promoter, a Go‑based webhook that fetches Prometheus metrics, renders alert graphs with gonum/plot, stores the images in S3‑compatible object storage, and embeds them in DingTalk notifications, providing deployment instructions, template customization, and core implementation details.

Alertmanager · DingTalk · Go
0 likes · 10 min read
Efficient Ops
Mar 1, 2022 · Operations

Master Linux Performance: Key Metrics, Tools, and Optimization Techniques

This guide explains Linux performance optimization by defining core metrics such as throughput, latency, and average load, describing how to select and benchmark indicators, outlining essential analysis tools like vmstat, pidstat, and perf, and providing practical CPU and memory tuning strategies to eliminate bottlenecks.

CPU · Linux · Memory
0 likes · 47 min read
Java Architect Essentials
Feb 28, 2022 · Databases

Configuring Alibaba Druid DataSource and Monitoring in Spring Boot

This article explains the fundamentals of Alibaba Druid as a Java database connection pool, shows how to add Maven dependencies, configure properties and filters in Spring Boot, set up monitoring pages, remove built‑in ads, and retrieve runtime statistics via DruidStatManagerFacade.

Configuration · Database Connection Pool · Druid
0 likes · 15 min read
HaoDF Tech Team
Feb 28, 2022 · Information Security

Partner Data Security Closed‑Loop Management at Haodf Online

This article outlines how Haodf Online implements a closed‑loop partner data security framework—covering background regulations, SDL‑based lifecycle stages, partner information handling, security assessment, API testing, monitoring, and continuous improvement—to protect sensitive medical data across its ecosystem.

API Security · Information Security · SDL
0 likes · 14 min read