Tagged articles
2179 articles
Dada Group Technology
Feb 25, 2022 · Databases

Practical Deployment and Operation Guide for StarRocks OLAP Database

This article presents a comprehensive overview of StarRocks, covering its key features, deployment challenges, backup and synchronization methods, cluster configuration and upgrade procedures, as well as monitoring and alerting solutions, followed by practical lessons learned from real‑world usage.

Backup, StarRocks, monitoring
0 likes · 13 min read

IT Services Circle
Feb 24, 2022 · Databases

Diagnosing and Solving Redis Performance Issues

This article explains how to detect Redis latency problems, measure baseline performance, monitor slow commands, and address common causes such as network round‑trip delays, fork‑generated RDB snapshots, transparent huge pages, swap usage, AOF settings, key expiration, and big‑key handling, providing practical troubleshooting steps and solutions.

Latency, database, monitoring
0 likes · 20 min read

ITFLY8 Architecture Home
Feb 24, 2022 · Backend Development

How to Build a 100‑Billion Red‑Envelope System that Handles 60 k QPS

This article details the design, implementation, and performance testing of a scalable red‑envelope service capable of handling up to 100 billion requests, supporting 1 million concurrent users per server, achieving peak QPS of 60 k, and outlines hardware, software, and monitoring strategies.

Performance Testing, high concurrency, monitoring
0 likes · 17 min read

DaTaobao Tech
Feb 21, 2022 · Frontend Development

Focused Gray Release Monitoring and Alert Configuration for Frontend Quality

To raise front‑end quality, the team implements gray‑release monitoring that triggers log analysis at a 5 % rollout, automatically generates reports within ten minutes, and uses dynamic thresholds and noise‑reduction tactics to detect errors early, enabling rapid rollback or expansion and markedly improving stability and release efficiency.

Alerting, Metrics, frontend
0 likes · 9 min read

IT Architects Alliance
Feb 15, 2022 · Operations

What Real-World Performance Tuning Taught Us About Legacy Web Apps

After a traffic surge exposed severe latency in a 15-year-old multi-service web platform, we used monitoring to discover a DB-connection leak caused by a liveness probe, corrected it, and distilled four practical lessons on latency metrics, tooling, legacy maintenance, and code vigilance.

APM, Load Testing, Operations
0 likes · 9 min read

Top Architect
Feb 13, 2022 · Operations

Performance Optimization Lessons from a Legacy Web Application: Monitoring, Load Testing, and Maintaining Old Systems

The article shares a real‑world case study of a legacy multi‑service web platform where traffic spikes exposed DB connection leaks, leading to a 90% response‑time bottleneck, and outlines four key takeaways about tail‑latency metrics, investing in tools and people, actively maintaining legacy systems, and treating every line of code as critical for performance.

Backend, Load Testing, SRE
0 likes · 9 min read

MaGe Linux Operations
Feb 9, 2022 · Backend Development

Mastering Tars: Deploy, Manage, and Monitor a High‑Performance Microservice Framework

This guide provides a comprehensive overview of the Tars microservice framework, covering its core concepts, deployment methods across various environments, configuration management, service discovery, logging, monitoring, and operational features such as gray releases and circuit‑breaker strategies.

Configuration, Deployment, Microservices
0 likes · 18 min read

Architect
Feb 5, 2022 · Backend Development

Best Practices for Designing Consistent RESTful APIs

This article presents a concise, step‑by‑step guide to designing clean, consistent RESTful APIs, covering resource naming, URL conventions, HTTP methods, versioning, pagination, field selection, security, monitoring, error handling, and documentation tools, with concrete code examples for each rule.

HTTP methods, URL conventions, Versioning
0 likes · 10 min read

MaGe Linux Operations
Feb 5, 2022 · Operations

Essential Linux Bash Scripts for Server Operations and Automation

This article presents a collection of practical Bash scripts for Linux servers, covering DoS attack IP blocking, alert emailing, MySQL backup (single and multi‑loop), Nginx log rotation and analysis, real‑time network traffic monitoring, system initialization, and disk usage checks across multiple hosts.

Bash, Linux, Server
0 likes · 10 min read

Top Architect
Feb 3, 2022 · Backend Development

System Performance Issue Analysis, Diagnosis, and Optimization for Business Applications

This article explains how to analyze, diagnose, and optimize performance problems in production business systems, covering the typical causes such as high concurrency, data growth, hardware limits, and environment changes, and detailing practical steps for hardware, OS, database, middleware, JVM tuning, code review, and APM monitoring.

Backend, JVM, diagnosis
0 likes · 15 min read

DataFunTalk
Feb 1, 2022 · Big Data

Kafka at Meituan: Practices, Challenges, and Optimizations for Large‑Scale Data Platforms

This article presents Meituan's large‑scale Kafka deployment, describing the current state and challenges of massive data ingestion, detailing latency‑reduction techniques, cluster‑level optimizations, SSD‑based caching, isolation strategies, full‑link monitoring, lifecycle management, and future directions for high availability.

Cluster Management, Kafka, Meituan
0 likes · 22 min read

dbaplus Community
Jan 27, 2022 · Databases

Why ClickHouse Beats Elasticsearch for Microservice Governance Data at Scale

The article examines the data‑storage problems caused by rapid microservice growth, explains why traditional Hadoop/Spark stacks were rejected, presents benchmark comparisons that show ClickHouse’s superior performance and compression, and details practical ClickHouse deployment, schema design, sharding, TTL, indexing, and monitoring integrations for real‑time analytics.

DataAnalytics, DatabaseDesign, OLAP
0 likes · 27 min read

Architecture Digest
Jan 27, 2022 · Backend Development

Designing API Error Codes and Result Codes: Best Practices

This article explains why a well‑designed API error‑code system—using consistent numeric or string codes, clear messages, HTTP‑status‑like segmentation, personalized user messages, and unified handling for monitoring—reduces communication overhead, simplifies maintenance, and improves overall backend reliability.

Error Codes, HTTP status, api-design
0 likes · 6 min read

ITFLY8 Architecture Home
Jan 26, 2022 · Operations

Mastering Microservice Monitoring, Fault Tolerance, and Security: A Complete Guide

This article explains how to monitor microservice architectures, describes log, tracing, and metric monitoring, compares open‑source tracing tools, outlines fault‑tolerance strategies such as timeouts, rate limiting, degradation, async buffering, and circuit breaking, and details access‑security mechanisms including gateway authentication, service‑side auth, and OAuth 2.0 token flows, while also introducing container technology and its role in microservice deployment.

Containers, Microservices, fault-tolerance
0 likes · 43 min read

Youzan Coder
Jan 26, 2022 · Big Data

How to Build a Robust Data Quality Assurance Strategy for Large-Scale Data Platforms

This article outlines a comprehensive data quality assurance framework for a massive reporting platform, covering the data pipeline architecture, detailed testing methods for timeliness, completeness, and accuracy, as well as application‑level checks, downgrade and backup strategies, and future automation plans.

Data Quality, automation, big data testing
0 likes · 14 min read

IT Architects Alliance
Jan 23, 2022 · Operations

Microservice Monitoring, Fault Tolerance, Access Security, and Container Technology Overview

This article provides a comprehensive guide to microservice monitoring (log, tracing, and metrics approaches), fault‑tolerance isolation techniques, access‑security mechanisms such as API gateways and OAuth 2.0, and the role of container technologies like Docker in cloud‑native deployments.

Cloud Native, Containers, Microservices
0 likes · 30 min read

Xueersi Online School Tech Team
Jan 21, 2022 · Frontend Development

White‑Screen Detection and Performance Optimization for Front‑End Applications

The article explains the concept of white‑screen time, its impact on user experience, and presents multiple detection methods—including Navigation Timing API, MutationObserver, element‑point analysis, and headless‑browser simulation—along with implementation code and a monitoring‑alert architecture for front‑end performance optimization.

MutationObserver, Puppeteer, frontend
0 likes · 17 min read

Efficient Ops
Jan 20, 2022 · Operations

Mastering Prometheus Metrics: Best Practices for Effective Monitoring

This article outlines practical guidelines for designing Prometheus metrics, covering how to define monitoring targets, choose appropriate vectors and labels, name metrics and labels correctly, select histogram buckets, and leverage Grafana features to visualize and troubleshoot data effectively.

Grafana, Metrics, Prometheus
0 likes · 11 min read

IT Architects Alliance
Jan 20, 2022 · Cloud Native

How to Build a High‑Availability Microservices System on Kubernetes – A Complete Guide

This guide walks you through designing a simple front‑back separation microservice architecture, implementing it with Java Spring Boot, deploying multiple instances with Eureka, adding Prometheus‑Grafana monitoring, logging, tracing, flow control, and finally installing Kubernetes using K8seasy and verifying high‑availability across the cluster.

Cloud Native, Kubernetes, Microservices
0 likes · 19 min read

Baidu Geek Talk
Jan 19, 2022 · Big Data

Quantile Computation in Baidu Advertising System: Architecture and Implementation

Baidu’s advertising platform computes high‑precision response‑time quantiles at massive scale by intercepting each API call, locally summarizing data with mergeable T‑Digest histograms, periodically uploading compressed, Base64‑encoded summaries to a warehouse where they are merged on demand, enabling low‑latency, cost‑effective percentile analysis with sub‑0.1% error.

Quantile, T-Digest, data aggregation
0 likes · 11 min read

IT Xianyu
Jan 14, 2022 · Operations

Redis Monitoring, Data Migration, and Cluster Management Tools Overview

This article introduces essential Redis operational tools, covering the INFO command for monitoring, Prometheus‑based redis‑exporter visualization, the Redis‑shake data migration utility, Redis‑full‑check consistency verification, and the CacheCloud platform for comprehensive cluster management.

CacheCloud, Data Migration, Operations
0 likes · 10 min read

Top Architect
Jan 13, 2022 · Backend Development

Microservice Architecture Roadmap: Core Components and Recommended Tools

This article presents a comprehensive roadmap for adopting microservice architecture, explaining why it is chosen, outlining essential concerns such as Docker, container orchestration, API gateways, load balancing, service discovery, event buses, logging, monitoring, tracing, data persistence, caching, and cloud providers, and recommending popular tools for each component.

Docker, Kubernetes, Microservices
0 likes · 16 min read

Programmer DD
Jan 12, 2022 · Backend Development

How to Build a Complete Backend Stack for Your Startup from Scratch

This guide walks startup leaders through designing and assembling a full backend technology stack—from language and component choices to processes, systems, and deployment tools—providing practical recommendations, diagrams, and best‑practice tips for building scalable, maintainable services.

Backend Architecture, Cloud Services, DevOps
0 likes · 30 min read

DataFunTalk
Jan 7, 2022 · Artificial Intelligence

Building an Intelligent Risk Control Tool System: Architecture and Key Components

This article presents a comprehensive overview of constructing an intelligent risk control tool system, detailing its evolution from manual processes to automated platforms, describing the core "three‑piece" suite (model, decision, and feature platforms) along with supporting data and monitoring platforms, and explaining the functions and interactions of each module such as data ingestion, feature engineering, automated modeling, decision flow, and real‑time monitoring.

Data Platform, Modeling, decision engine
0 likes · 13 min read

HomeTech
Jan 6, 2022 · Operations

Design and Implementation of a Centralized Database Log Collection and Analysis Platform

This article describes the background, architecture, and implementation of a centralized database log collection and analysis platform built in 2021, detailing how logs from hosts, containers, and databases are normalized, streamed through Kafka, processed with Flink, stored in Elasticsearch, visualized with Kibana, and extended with alerting and configuration management to improve fault diagnosis and lay the groundwork for future AI‑driven operations.

Big Data, Kibana, log collection
0 likes · 5 min read

Practical DevOps Architecture
Jan 5, 2022 · Operations

Deploying Prometheus and Node Exporter on a Linux Host

This guide walks through installing Prometheus and Node Exporter on a Linux server, copying binaries to system paths, configuring Prometheus with scrape jobs for the local node and remote hosts, and running the exporters with specific collector options for system metrics.

Operations, Prometheus, monitoring
0 likes · 4 min read

Architects' Tech Alliance
Jan 5, 2022 · Backend Development

Essential Microservice Architecture Roadmap: Tools, Patterns, and Best Practices

This guide outlines why microservice architecture is preferred for large applications, presents a clear learning roadmap, and details each critical concern—such as Docker, orchestration, API gateways, load balancing, service discovery, event buses, logging, monitoring, tracing, persistence, caching, and cloud providers—along with recommended tools.

Backend Architecture, Cloud Native, Docker
0 likes · 14 min read

Zhuanzhuan Tech
Jan 5, 2022 · Operations

Design and Implementation of a Multi‑Dimensional Monitoring Platform Based on Prometheus and M3DB

This article details the background, research, architecture, performance testing, and deployment of a comprehensive monitoring system that leverages Prometheus, Grafana, and M3DB to provide flexible metric collection, automatic dashboard generation, and a custom alerting service for large‑scale business services.

Alerting, Metrics, Time Series
0 likes · 16 min read

Architecture Digest
Dec 31, 2021 · Backend Development

Why I Chose Microservice Architecture and a Roadmap of Its Core Components

This article explains why microservice architecture is preferred over monolithic applications, outlines a learning roadmap, and details essential components such as Docker, container orchestration, API gateways, load balancing, service discovery, event buses, logging, monitoring, distributed tracing, data persistence, caching, and cloud providers.

Backend Architecture, Docker, Kubernetes
0 likes · 13 min read

21CTO
Dec 30, 2021 · Backend Development

Why Choose Microservices? A Practical Roadmap and Tool Guide

This article outlines why microservice architecture is preferred over monolithic designs, presents a clear learning roadmap, and details essential concerns such as Docker, orchestration, API gateways, load balancing, service discovery, logging, monitoring, tracing, persistence, caching, and cloud providers, with recommended tools for each.

Docker, Microservices, api-gateway
0 likes · 15 min read

Liulishuo Tech Team
Dec 30, 2021 · Operations

Design and Implementation of an Alert Scheduling System (GoAlert) and Notification Center

This article explains why alerts and on‑call scheduling are needed, outlines the core principles of an alert scheduling system, describes the architecture evolution from PagerDuty to GoAlert and Notice‑Center, and details the implementation, code snippets, and future outlook for a comprehensive operations monitoring solution.

Alerting, Notification System, goalert
0 likes · 14 min read

HomeTech
Dec 30, 2021 · Operations

Open-falcon in Automotive Home: Application, Architecture, and Customizations

This article describes how the open‑falcon monitoring system is applied and customized at Automotive Home, covering its architecture, component roles, a comparison with other open‑source solutions, and the enhancements made for service‑tree based dynamic monitoring, alerting, self‑healing, and high‑availability deployment.

Open-Falcon, Operations, monitoring
0 likes · 11 min read

DataFunSummit
Dec 29, 2021 · Operations

How to Build an Operations Monitoring Platform with Spring Boot Admin

This article explains what Spring Boot Admin is, walks through creating a server and client to monitor Spring Boot applications, shows how to configure ports, enable the admin UI, and set up email and custom alert notifications for operational health monitoring.

Operations, Spring Boot, java
0 likes · 12 min read

DeWu Technology
Dec 24, 2021 · Operations

How to Quickly Attribute Live‑Streaming Alert Issues in a Kubernetes Environment

This article walks through a real‑world live‑streaming service alert where response time and goroutine spikes were traced through Grafana metrics, MySQL/Redis performance, routing logic, and Istio sidecar load, ultimately revealing a mis‑reported Istio metric and a resource‑allocation fix to prevent future jitter.

Istio, Kubernetes, Operations
0 likes · 11 min read

Architecture Digest
Dec 23, 2021 · Operations

Using Filebeat and Graylog for Centralized Log Collection and Monitoring

This article explains how to deploy and configure Filebeat and Graylog for centralized log collection, covering installation methods, configuration files, Docker deployment, input modules, pipelines, and practical examples for efficiently gathering and analyzing logs across multiple environments.

Docker, Filebeat, Graylog
0 likes · 15 min read

ITPUB
Dec 20, 2021 · Databases

From Database Developer to New DBA: Boosting MySQL Efficiency and Automation

The article shares a senior DBA's journey from early database engine development to modern MySQL operations, outlining practical methods for improving efficiency, automating monitoring, building data‑driven processes, and redefining the DBA role for proactive, high‑impact service delivery.

DBA, Database operations, automation
0 likes · 33 min read

Beike Product & Technology
Dec 17, 2021 · Operations

Practices for Monitoring, Resource Optimization, and Containerization of Large-Scale Flink Jobs at Beike

This article describes Beike's real‑time computing team's end‑to‑end practices for collecting and storing Flink metrics, building visual monitoring dashboards, implementing multi‑level alerting, analyzing logs, estimating CPU and memory resources, and deploying Flink on Kubernetes with containerization and storage separation to improve stability, resource utilization, and operational efficiency.

Flink, Kubernetes, Metrics
0 likes · 25 min read

Alibaba Cloud Native
Dec 16, 2021 · Cloud Native

From Legacy Monitoring to Modern Observability: A Cloud‑Native Journey

This article traces the 30‑year evolution of system monitoring, explains the differences between monitoring, APM and observability, outlines key practices for building an observability platform, and provides a step‑by‑step guide to implementing Prometheus + Grafana in a cloud‑native environment.

APM, ARMS, Grafana
0 likes · 18 min read

Architecture Digest
Dec 16, 2021 · Operations

System Performance Issue Analysis, Diagnosis, and Optimization Process

This article outlines a comprehensive approach to diagnosing and optimizing performance problems in production business systems, covering common causes such as concurrency spikes, data growth, and environment changes, and detailing hardware, middleware, database, JVM, code-level analyses, monitoring tools, and APM strategies.

JVM, database, diagnostics
0 likes · 15 min read

NetEase Smart Enterprise Tech+
Dec 14, 2021 · Backend Development

How NetEase Cloud’s Distributed Recording Cluster Ensures High‑Availability and Scalability

This article explains the architecture and key features of NetEase Cloud's local server‑side recording cluster, detailing how dynamic scaling, multi‑backup high availability, load‑balancing strategies, monitoring, and an embedded registration center enable secure, reliable, and scalable recording for data‑sensitive applications.

Distributed Systems, Java SDK, REST API
0 likes · 11 min read

Programmer DD
Dec 12, 2021 · Operations

How Netflix’s Telltale Transforms Monitoring for 100+ Services

This article explains Netflix’s home‑grown monitoring system Telltale, detailing its design, multi‑dimensional health‑assessment model, intelligent alerting, integration with Slack, deployment monitoring, and continuous optimization that together keep over a hundred production applications running smoothly.

Alerting, Microservices, Netflix
0 likes · 13 min read

IT Architects Alliance
Dec 12, 2021 · Operations

System Performance Issue Analysis and Optimization Process for Business Applications

The article outlines a comprehensive process for diagnosing and optimizing performance problems in production business systems, covering causes such as high concurrency, data growth, hardware constraints, and detailing analysis of hardware, OS, database, middleware, JVM settings, code inefficiencies, and the role of monitoring and APM tools.

Backend, Database Tuning, JVM
0 likes · 13 min read

Selected Java Interview Questions
Dec 10, 2021 · Backend Development

A Comprehensive Guide to Spring Boot Actuator: Quick Start, Endpoints, and Monitoring

This article provides a step‑by‑step tutorial on using Spring Boot Actuator to monitor microservice applications, covering quick setup, essential endpoints such as health, metrics, loggers, info, beans, heapdump, threaddump and shutdown, endpoint exposure configuration, and securing them with Spring Security.

Actuator, Backend, Endpoints
0 likes · 14 min read

Ctrip Technology
Dec 9, 2021 · Databases

TiDB Operational Practices at Ctrip: Architecture, Use Cases, Performance Tuning, Monitoring, and Tooling

This article details Ctrip's migration from MySQL to TiDB, describing the multi‑data‑center architecture, real‑world use cases such as the international CDP platform and hotel settlement, performance tuning measures, comprehensive monitoring and alerting, auxiliary tools, and future roadmap for the distributed NewSQL database.

HTAP, TiDB, monitoring
0 likes · 16 min read

IT Architects Alliance
Dec 9, 2021 · Backend Development

How to Build a Billion‑User Scalable User Center: Architecture, APIs, Token Fallback, and Security

This article presents a comprehensive, practical design for an ultra‑large‑scale user center, covering microservice architecture, API separation, token generation with graceful degradation, data‑sharding strategies, password encryption, asynchronous processing, and detailed monitoring to ensure high availability, performance, and security.

Microservices, Scalability, Token
0 likes · 16 min read

Alibaba Cloud Native
Dec 7, 2021 · Operations

How Information Entropy Powers AI‑Driven Alert Noise Reduction in Cloud‑Native Operations

This article explains how Shannon's information entropy and NLP are combined in Alibaba Cloud's ARMS intelligent noise reduction to quantify alert uncertainty, filter redundant notifications, and automatically prioritize critical incidents, offering a practical, self‑learning solution for modern monitoring environments.

Alert Noise Reduction, NLP, information entropy
0 likes · 11 min read

Efficient Ops
Dec 6, 2021 · Operations

How Scenario‑Based AIOps Transforms IT Operations: Insights from GOPS 2023

The article summarizes a GOPS conference presentation by Dingmao Technology on AIOps scenario‑driven construction, detailing challenges, definition of scenarios, technical methods, roadmap planning, and future prospects, while showcasing practical examples and supporting technologies for intelligent IT operations.

Artificial Intelligence, Data Integration, IT Operations
0 likes · 8 min read

Top Architect
Dec 3, 2021 · Operations

Centralized Log Collection with Filebeat and Graylog

This article explains how to use Filebeat together with Graylog to collect, process, and visualize logs from multiple services and environments, covering tool introductions, configuration files, component details, deployment methods, and practical code examples.

Docker, ELK, Filebeat
0 likes · 19 min read

Architecture Digest
Dec 3, 2021 · Backend Development

Design Practices for a Billion‑Scale User Center

This article presents a comprehensive set of design practices for building a highly available, high‑performance, and secure user‑center system that can handle hundreds of millions of users, covering service architecture, API design, token degradation, data sharding, security, asynchronous processing, and monitoring.

Scalability, Token, database sharding
0 likes · 15 min read

Alibaba Cloud Native
Nov 30, 2021 · Cloud Native

How to Detect and Resolve Slow Calls in Kubernetes: Best Practices & Real‑World Cases

This article explains why slow calls in Kubernetes can jeopardize user experience, project timelines, and system stability, outlines five common causes, introduces the golden‑signal and USE analysis framework, and walks through three practical case studies with step‑by‑step troubleshooting and remediation techniques.

ARMS, Kubernetes, monitoring
0 likes · 15 min read

Cloud Native Technology Community
Nov 25, 2021 · Databases

Why Is My Redis Slowing Down? A Complete Troubleshooting Guide

This article provides a systematic, step‑by‑step methodology for diagnosing Redis latency spikes, covering baseline performance testing, slow‑log analysis, high‑complexity commands, big‑key handling, expiration patterns, memory limits, fork overhead, huge‑page settings, AOF configurations, CPU binding, swap usage, memory fragmentation, network saturation, and practical monitoring tips.

Latency, database, monitoring
0 likes · 42 min read

Baidu Geek Talk
Nov 24, 2021 · Operations

How Baidu’s Fengjing Uses Holographic Logs to Debug Massive Microservices

Baidu’s Fengjing monitoring platform tackles the daunting challenge of pinpointing failures in its massive Java‑based microservice ecosystem by employing a non‑intrusive probe that captures log metadata, stores it in a database, and reconstructs full request‑level logs with minimal storage overhead.

Distributed Tracing, Microservices, holographic logging
0 likes · 9 min read

Efficient Ops
Nov 24, 2021 · Operations

Practical Prometheus in Kubernetes: Tips, Limits, and Scaling

This article shares practical experiences and best‑practice guidelines for deploying and operating Prometheus in Kubernetes, covering version selection, inherent limitations, exporter choices, metric design, multi‑cluster scraping, memory and storage planning, GPU monitoring, timezone handling, and alerting considerations.

Exporters · Grafana · Prometheus
0 likes · 21 min read
dbaplus Community
Nov 22, 2021 · Databases

Transforming MySQL Monitoring: From Nagios to Kafka‑Powered Alerts

Qunar’s DBA team overhauled their MySQL monitoring and alert system—originally built on Nagios and NRPE—by integrating a Kafka‑based pipeline, a custom alarm service, and MySQL‑stored alert templates, achieving flexible thresholds, granular silencing, high‑availability processing, and early‑stage intelligent management of alerts, slow queries, and disk space.

Alerting · DBA · Kafka
0 likes · 14 min read
JD Retail Technology
Nov 22, 2021 · Backend Development

Designing a High‑Performance Log Collection System with UDP, Compression, and ClickHouse

The article analyzes the high cost and scalability challenges of traditional log collection pipelines and proposes a streamlined architecture that uses in‑memory buffering, UDP transport, aggressive compression, and ClickHouse storage to achieve massive throughput while drastically reducing hardware and operational expenses.

High Throughput · UDP · clickhouse
0 likes · 15 min read
Ops Development Stories
Nov 22, 2021 · Cloud Native

Mastering Kubernetes Pod Resource Requests, Limits, and QoS

Learn how to properly configure CPU and Memory requests and limits for Kubernetes Pods, understand QoS classes, manage namespace quotas with LimitRange and ResourceQuota, and monitor resource usage using Prometheus queries and Grafana dashboards to ensure stable, efficient cluster operations.

Kubernetes · QoS · ResourceQuota
0 likes · 11 min read
IT Architects Alliance
Nov 20, 2021 · Operations

Analysis and Optimization of Business System Performance

This article outlines a comprehensive approach to diagnosing and optimizing performance problems in production business systems, covering analysis processes, hardware, OS, database, middleware, JVM tuning, code inefficiencies, and monitoring techniques to identify root causes and improve system reliability.

Database Tuning · Operations · System optimization
0 likes · 16 min read
vivo Internet Technology
Nov 17, 2021 · Operations

Design and Architecture of a Unified Alert Convergence System for Monitoring

The paper presents a unified alert convergence system that centralizes metric calculation, detection, and alarm handling across monitoring subsystems, employing mechanisms such as convergence, claiming, silencing, escalation, and a Redis‑based delayed queue integrated via Kafka or REST to reduce alarm fatigue, improve MTTA/MTTR, and enable future AIOps.

MTTA · MTTR · Operations
0 likes · 18 min read
Efficient Ops
Nov 16, 2021 · Operations

How to Build a Scalable Prometheus Monitoring System with Thanos on Kubernetes

This article explains why monitoring is essential for production stability, compares white‑box and black‑box approaches, and provides a step‑by‑step guide to deploying Prometheus, configuring scrape targets, using Pushgateway and Alertmanager, and scaling the solution with Thanos in a Kubernetes environment.

Alertmanager · Prometheus · Pushgateway
0 likes · 21 min read
Big Data Technology & Architecture
Nov 15, 2021 · Operations

A Comprehensive Overview of Kafka Monitoring Tools

This article provides a comprehensive overview of popular Kafka monitoring solutions—including JMX, Kafka Manager (CMAK), Kafka Eagle, and Logi‑KafkaManager—detailing their features, installation steps, configuration examples, and comparative advantages, while also mentioning custom setups using JMXTrans, InfluxDB, and Grafana.

CMAK · Kafka · Kafka Eagle
0 likes · 8 min read
Open Source Linux
Nov 14, 2021 · Databases

Essential Redis Monitoring Metrics Every Engineer Should Know

This guide outlines the key Redis monitoring metrics—including performance, memory, basic activity, persistence, and error indicators—explains their meanings, shows how to retrieve them with Redis commands, and provides practical tips for effective performance and health tracking.

Error · Metrics · monitoring
0 likes · 6 min read
dbaplus Community
Nov 14, 2021 · Operations

How to Boost Service Reliability: SRE Basics and Tackling Technical Debt

This article explains the fundamentals of Site Reliability Engineering, outlines a complete SRE workflow from prevention to post‑mortem, details key availability metrics and golden indicators, examines how technical debt arises and can be mitigated, and describes the tooling and practices needed to keep large‑scale services healthy.

SRE · Technical Debt · monitoring
0 likes · 18 min read
Alibaba Terminal Technology
Nov 10, 2021 · Operations

How QianNiu Cut Plugin Issues by 50%: Open Architecture & Monitoring Secrets

The article explains how Alibaba’s QianNiu multi‑platform workbench reduced open‑plugin related user complaints by half through defining open nodes, optimizing long plugin startup chains, building permission‑request loops, establishing comprehensive data‑driven metrics, and creating an open‑experience dashboard that monitors performance, reliability, and user‑perceived issues across mobile and desktop.

Open Platform · User experience · data metrics
0 likes · 14 min read
IT Architects Alliance
Nov 7, 2021 · Cloud Native

Why Microservices Matter: Core Architecture, Benefits, and Real-World Practices

This article provides a comprehensive overview of microservices, covering their origin, core architectural principles, key characteristics, advantages and drawbacks, suitable organizational contexts, and essential technical components such as service discovery, gateways, configuration centers, monitoring, circuit breaking, and container orchestration.

Microservices · architecture · container orchestration
0 likes · 17 min read
YunZhu Net Technology Team
Nov 5, 2021 · Backend Development

Practical Java Performance Optimization: Metrics, Bottleneck Identification, and Governance Strategies

This article shares practical Java performance‑optimization techniques, covering UI and non‑UI latency metrics, baseline data collection, bottleneck discovery with tools like Arthas, chronic issue handling, and a comprehensive set of governance measures ranging from network‑level caching to code‑level refactoring, asynchronous processing, and service splitting to achieve stable sub‑200 ms response times.

Arthas · Backend · caching
0 likes · 19 min read
Open Source Linux
Oct 31, 2021 · Operations

Designing Effective Metrics: From Requirements to Labels and Buckets

This guide explains how to define, name, and organize monitoring metrics—covering Google’s four golden indicators, system‑specific measurement objects, vector selection, label conventions, bucket design, and practical Grafana tips—for reliable observability of diverse services.

Metrics · labeling · monitoring
0 likes · 10 min read
Baidu Geek Talk
Oct 29, 2021 · Industry Insights

Baidu’s QCon 2021 Highlights: Elastic Scaling, Search Architecture, AI Chips

This article compiles Baidu engineers' QCon 2021 talks, covering micro‑service evolution, large‑scale container elastic scaling, search system elasticity, AI‑chip deployment at massive scale, and cost‑focused monitoring, each with abstracts, outlines and key takeaways for practitioners.

AI chips · Cloud Native · Microservices
0 likes · 11 min read
Huolala Tech
Oct 29, 2021 · Operations

How Huolala Guarantees Cloud‑Native Stability at Scale

In this detailed account of Huolala's 2021 Cloud Operations Best Practices talk, the company shares its multi‑cloud architecture, service‑oriented governance, capacity‑testing, monitoring, and risk‑prediction techniques that together ensure high availability and efficient scaling for its diverse logistics services.

Operations · capacity testing · monitoring
0 likes · 17 min read
政采云技术
Oct 28, 2021 · Backend Development

HikariCP Overview (Part 1): Initialization, Core Components, Monitoring and Configuration

This article provides a detailed analysis of HikariCP’s initialization, core components, startup flow, connection acquisition logic, monitoring metrics, and key configuration parameters, illustrating how Spring Boot 2.x leverages this high‑performance JDBC connection pool and offering guidance for tuning and extending it.

Configuration · Connection Pool · HikariCP
0 likes · 14 min read
dbaplus Community
Oct 26, 2021 · Databases

Scaling JD.com Customer Service with Doris OLAP: Architecture & Caching

JD.com’s customer service team leverages the open‑source MPP database Doris to power real‑time and offline OLAP dashboards, detailing data ingestion pipelines, full‑link monitoring, dual‑stream high‑availability design, dynamic partition management, multi‑layer caching strategies, and performance optimizations applied during the 2020 11.11 shopping festival.

Big Data · OLAP · Real-time analytics
0 likes · 15 min read
Baidu Geek Talk
Oct 20, 2021 · Operations

Practical Strategies for Building High‑Availability Systems

This article presents a comprehensive, step‑by‑step guide on improving system reliability through early fault detection, scope reduction, frequency reduction, and rapid incident handling, using real‑world practices from Baidu's commercial hosting platform.

Log Standardization · Operations · capacity planning
0 likes · 20 min read
Full-Stack Internet Architecture
Oct 16, 2021 · Backend Development

Handling MQ Failures: Encapsulation, Degradation, and Message Resend Strategies

The article explains how to properly deal with message‑queue (MQ) outages by first encapsulating MQ operations, then applying degradation tactics such as persisting failed messages to a database, disk, or log, and finally implementing scheduled or manual message‑replay mechanisms while emphasizing monitoring and fallback logic.

Backend · Failure Handling · MQ
0 likes · 5 min read
360 Tech Engineering
Oct 15, 2021 · Operations

Log Collection Architecture Using Filebeat, Logstash, and Kafka

This article describes a lightweight, resource‑efficient log collection solution that combines Filebeat agents, optional Logstash aggregation, and Kafka transport, detailing configuration choices, meta‑persistence, back‑pressure mechanisms, monitoring setup, and deployment architecture for reliable at‑least‑once delivery.

Filebeat · Logstash · Operations
0 likes · 14 min read