Tagged articles
2179 articles
Efficient Ops
Nov 29, 2022 · Operations

How to Retrieve and Process Prometheus Metrics via Its API

This article explains how to use the Prometheus HTTP API to query instant and range metrics, interpret the JSON responses, and fetch data programmatically with Python, providing code examples and details on request parameters, error handling, and practical usage.
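As a minimal sketch of the kind of request and response handling this summary describes, the snippet below builds URLs for the Prometheus `/api/v1/query` and `/api/v1/query_range` endpoints and parses an instant-vector response. The server address, the helper names, and the sample response body are assumptions made for illustration; they are not taken from the article.

```python
import json
from urllib.parse import urlencode

# Hypothetical server address; replace with your own Prometheus instance.
BASE_URL = "http://localhost:9090"


def instant_query_url(expr, time=None):
    """Build a URL for an instant query (GET /api/v1/query)."""
    params = {"query": expr}
    if time is not None:
        params["time"] = time
    return f"{BASE_URL}/api/v1/query?{urlencode(params)}"


def range_query_url(expr, start, end, step):
    """Build a URL for a range query (GET /api/v1/query_range)."""
    params = {"query": expr, "start": start, "end": end, "step": step}
    return f"{BASE_URL}/api/v1/query_range?{urlencode(params)}"


def parse_instant_vector(body):
    """Return {instance label: float value} from an instant-vector response."""
    payload = json.loads(body)
    if payload.get("status") != "success":
        # Error responses carry "errorType" and "error" fields.
        raise RuntimeError(payload.get("error", "unknown Prometheus error"))
    data = payload["data"]
    if data["resultType"] != "vector":
        raise ValueError(f"expected a vector, got {data['resultType']}")
    # Each sample's "value" is [unix_timestamp, "value-as-string"].
    return {
        sample["metric"].get("instance", ""): float(sample["value"][1])
        for sample in data["result"]
    }


# Made-up response body in the documented shape, so the sketch runs offline.
SAMPLE = (
    '{"status":"success","data":{"resultType":"vector","result":['
    '{"metric":{"__name__":"up","instance":"localhost:9090","job":"prometheus"},'
    '"value":[1669700000.0,"1"]}]}}'
)

if __name__ == "__main__":
    print(instant_query_url("up"))
    print(parse_instant_vector(SAMPLE))
```

To fetch live data, the URL can be passed to `urllib.request.urlopen` and the decoded body handed to `parse_instant_vector`; the network call is left out here so the example runs without a server.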

API · DevOps · Metrics
0 likes · 8 min read
DataFunTalk
Nov 28, 2022 · Databases

Optimizing Real‑Time Data Warehouse with Apache Doris at 360 DataTech

Facing stricter security, accuracy, and latency demands, 360 DataTech rebuilt its real‑time data warehouse by selecting Apache Doris for its high‑performance writes, SQL compatibility, low operational complexity, and active community, then detailed the architecture, ingestion, query acceleration, monitoring, troubleshooting, and future plans.

Apache Doris · Data Import Optimization · SQL acceleration
0 likes · 19 min read
TAL Education Technology
Nov 17, 2022 · Big Data

Real-Time Data Warehouse: Background, Value Assessment, and Half-Year Progress

This article outlines the background and terminology of data warehousing, presents a formula for evaluating warehouse value, and details the team's half‑year efforts—including architecture selection, quality assurance, stability governance, and data‑value externalization—to improve efficiency, quality, stability, and cost in real‑time data services.

Data Governance · Real-time analytics · data operations
0 likes · 10 min read
Java High-Performance Architecture
Nov 17, 2022 · Backend Development

Dynamic Thread Pools in Java: Real‑Time Monitoring and Auto‑Tuning with DynamicTp

This article introduces DynamicTp, a Java library that extends ThreadPoolExecutor with runtime parameter adjustment, real‑time monitoring, alerting, and integration with configuration centers, offering a lightweight, zero‑intrusion solution for managing thread pools in microservice environments.

DynamicTp · Microservices · SpringBoot
0 likes · 11 min read
vivo Internet Technology
Nov 16, 2022 · Industry Insights

Vivo 2022 Dev Conference: Frontend Compiler, Low‑Code, Real‑Time & Cloud‑Native

The 2022 Vivo developer conference showcased a series of technical breakthroughs—including a custom wepy‑chameleon compiler for frontend upgrades, low‑code platforms for backend and game development, a real‑time computing platform built on Flink, advanced graph scheduling, cloud‑native container strategies, monitoring enhancements, database automation, and large‑scale messaging middleware—highlighting Vivo's comprehensive push toward efficiency and innovation across its internet services.

Cloud Native · Container · Messaging
0 likes · 14 min read
Efficient Ops
Nov 15, 2022 · Operations

Master Linux Performance: Key Metrics, Tools, and Optimization Strategies

This comprehensive guide explains Linux performance optimization by defining key metrics such as throughput and latency, interpreting average load, analyzing CPU context switches, memory management, and I/O behavior, and recommending practical tools and techniques—including vmstat, pidstat, perf, and dstat—to identify and resolve bottlenecks.

CPU · Linux · Memory
0 likes · 45 min read
NetEase Yanxuan Technology Product Team
Nov 14, 2022 · Operations

Quantifying Internet Service Availability: Classic Metrics and the New User‑Uptime Indicator

The article reviews classic availability metrics such as Success‑Ratio, Incident‑Ratio, MTTR/MTTF, Error‑Budget, and SLA/SLO, then introduces User‑Uptime—a per‑user success time proportion that ignores long idle periods—and its windowed variant, showing how it complements existing indicators for more user‑centric reliability insight.
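The classic indicators named in this summary can be illustrated with a short sketch. The formulas below are the standard textbook definitions (success ratio as successful requests over total requests, availability as MTTF / (MTTF + MTTR), error budget as 1 minus the SLO target); the function names and sample numbers are assumptions for illustration. User-Uptime is left out because the summary does not give its exact computation.

```python
def success_ratio(successful_requests, total_requests):
    """Fraction of requests that succeeded."""
    if total_requests <= 0:
        raise ValueError("total_requests must be positive")
    return successful_requests / total_requests


def availability_from_mttf_mttr(mttf_hours, mttr_hours):
    """Steady-state availability: mean time to failure over total cycle time."""
    return mttf_hours / (mttf_hours + mttr_hours)


def error_budget_minutes(slo_target, window_days):
    """Downtime allowed by an SLO target over a window, in minutes."""
    return (1.0 - slo_target) * window_days * 24 * 60


if __name__ == "__main__":
    # Illustrative numbers only.
    print(success_ratio(99_950, 100_000))
    print(availability_from_mttf_mttr(999.0, 1.0))
    print(error_budget_minutes(0.999, 30))
```

With a 99.9% target over a 30-day window, the error budget works out to roughly 43 minutes of allowed downtime.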

Availability · Reliability · SRE
0 likes · 27 min read
Open Source Linux
Nov 14, 2022 · Operations

Essential Ops Checklist: From Safe Commands to Performance Tuning

This article shares practical operations guidelines covering safe command usage, backup strategies, security hardening, daily monitoring, performance tuning, and the right mindset to prevent data loss and ensure stable, secure Linux server management.

Backup · best practices · monitoring
0 likes · 12 min read
DevOps Operations Practice
Nov 13, 2022 · Operations

Deploying Zabbix Monitoring Platform with Docker Containers

This article provides a step‑by‑step guide to quickly set up the latest Zabbix monitoring platform using Docker, covering Docker installation, MySQL volume creation, deployment of Zabbix server, web UI, Java gateway, agents, and host configuration for comprehensive system monitoring.

Container Deployment · Linux · Zabbix
0 likes · 8 min read
dbaplus Community
Nov 7, 2022 · Operations

Automating Fault Self‑Healing: A Practical Guide for Operations Teams

This article explains why disk‑space alerts demand automated handling, introduces the concept of fault self‑healing, outlines required process standards, describes monitoring platform dimensions, details a multi‑source self‑healing platform architecture, and offers practical steps for integration, notification, and continuous improvement.

CMDB · Operations · fault self-healing
0 likes · 9 min read
Efficient Ops
Nov 6, 2022 · Operations

Visualizing Business‑Process Monitoring with Grafana, Diagram & FlowCharting

This article examines the evolution of a monitoring platform, identifies key challenges such as alarm overload and fragmented data, and presents a solution that combines Grafana with Diagram and FlowCharting plugins to create business‑process‑oriented, data‑driven visualizations for faster issue resolution.

Diagram · FlowCharting · Grafana
0 likes · 10 min read
Top Architect
Nov 6, 2022 · Backend Development

Comprehensive Guide to Backend Development: System Design, Architecture, Network Communication, Fault Handling, Monitoring, Service Governance and Deployment

This article provides a thorough overview of backend development, covering system development principles, architectural design patterns, network communication techniques, common faults and exceptions, monitoring and alerting strategies, service governance practices, and deployment workflows, all illustrated with clear explanations and practical examples.

Backend · Deployment · System Design
0 likes · 33 min read
Architect's Guide
Nov 5, 2022 · Backend Development

Comprehensive Guide to Backend Development: System Design, Architecture, Networking, Fault Handling, Monitoring, and Deployment

This article provides a comprehensive overview of backend development, covering system development principles, architecture design patterns, network communication techniques, fault and exception handling, monitoring and alerting strategies, service governance, testing methodologies, and deployment practices to help developers build robust, scalable, and maintainable services.

Deployment · System Design · architecture
0 likes · 33 min read
Tencent Cloud Developer
Nov 3, 2022 · Operations

Grafana Tutorial: Building Monitoring Dashboards for Frontend Performance

This tutorial shows how the QQ Live frontend team integrated Tencent Cloud RUM and used Grafana to create a monitoring dashboard, covering data source setup, plugin installation, panel selection and configuration, advanced features like thresholds and variables, and in‑Grafana transformations for flexible frontend performance observability.

Dashboard · Grafana · Plugins
0 likes · 21 min read
dbaplus Community
Nov 1, 2022 · Backend Development

Why Chengdu’s COVID Testing System Crashed and How to Build Resilient Backend Services

The article analyzes the Chengdu COVID‑19 testing system failure, outlining its architecture, estimating traffic, identifying infrastructure and software bottlenecks, and recommending sharding, message‑queue decoupling, comprehensive monitoring, and multi‑vendor coordination to build a more reliable backend platform.

Scalability · System Design · cloud
0 likes · 13 min read
AI Cyberspace
Nov 1, 2022 · Cloud Native

10 Essential Cloud‑Native Tools Every Agile Team Should Use

This article outlines ten indispensable cloud‑native tools—from Docker and Kubernetes to Serverless, Helm, Ansible, and Wireshark—explaining how each supports modern agile development, improves infrastructure stability, and accelerates digital transformation in post‑pandemic enterprises.

Cloud Native · DevOps · Docker
0 likes · 16 min read
Bilibili Tech
Nov 1, 2022 · Big Data

Design and Implementation of a Data Quality Platform for Large-Scale Data Processing

Bilibili built a scalable data‑quality platform that records metrics from heterogeneous sources, checks them with a rich DSL, alerts once with root‑cause analysis, and uses event‑driven and time‑window scheduling, automated workflows, and intelligent monitoring to ensure real‑time, accurate, trustworthy data for petabyte‑scale processing.

Data Quality · Root Cause Analysis · automation
0 likes · 20 min read
Open Source Linux
Oct 30, 2022 · Operations

Unlock Kubernetes Insights: Master Event Types, Monitoring, and Alerting

This guide explains what Kubernetes events are, how to list and filter them, categorizes common event types, and shows practical ways to collect, store, and alert on events using native commands and open‑source tools, helping teams reduce alert fatigue and improve cluster observability.

Alerting · Events · Kubernetes
0 likes · 11 min read
LOFTER Tech Team
Oct 26, 2022 · Operations

Efficient Nginx Log Analysis Using GoAccess and Practical Case Studies

This article explains why Nginx logs are critical, compares various log‑analysis tools, provides detailed installation and configuration steps for GoAccess, discusses selection criteria, and shares real‑world case studies that demonstrate how to extract valuable system and business insights from massive access logs.

Nginx · goaccess · log analysis
0 likes · 20 min read
Efficient Ops
Oct 25, 2022 · Cloud Native

How Guangdong Mobile Built a Resilient Container Cloud from Scratch

This article details Guangdong Mobile's end‑to‑end journey of designing, constructing, and operating a production‑grade container cloud platform, covering architecture decisions, monitoring, logging, high‑availability, scaling, network optimization, upgrade challenges, and lessons learned for cloud‑native practitioners.

Cloud Native · DevOps · Kubernetes
0 likes · 26 min read
Software Development Quality
Oct 23, 2022 · Operations

Top Observability Tools: Datadog, Grafana, Instana, New Relic, Prometheus

This article provides an overview of five leading observability solutions—Datadog, Grafana, Instana, New Relic, and Prometheus—detailing their core features, supported data sources, deployment models, and how they help teams monitor cloud‑native applications, infrastructure, and services to ensure reliability and performance.

DevOps · cloud-native · monitoring
0 likes · 4 min read
360 Quality & Efficiency
Oct 21, 2022 · Operations

Comprehensive Load‑Testing Plan and Implementation Using JMeter, Docker, and a Monitoring Stack

This document outlines the background, objectives, scenarios, strategies, TPS estimations, metric definitions, step‑by‑step testing process, component selection, script examples, various execution modes (GUI, CLI, distributed, Docker), and the monitoring architecture built with JMeter, InfluxDB, Prometheus, and Grafana for a large‑scale long‑connection service.

Docker · JMeter · Performance Testing
0 likes · 16 min read
Code Ape Tech Column
Oct 21, 2022 · Operations

Fundamentals and Comparative Overview of Open‑Source Monitoring Systems (Zabbix, Open‑Falcon, Prometheus)

This article systematically introduces monitoring fundamentals, explains the architecture and key metrics of typical monitoring objects, compares three popular open‑source monitoring solutions—Zabbix, Open‑Falcon, and Prometheus—and provides practical guidance for selecting the most suitable system.

Open-Falcon · Prometheus · System Architecture
0 likes · 20 min read
Top Architect
Oct 18, 2022 · Operations

Apache SkyWalking APM: Concepts, Docker Installation, and UI Guide

This article introduces Application Performance Management (APM), explains the features of Apache SkyWalking for micro‑service and cloud‑native monitoring, and provides step‑by‑step Docker‑compose installation, agent configuration, and a detailed walkthrough of the SkyWalking UI components.

APM · Docker · Microservices
0 likes · 13 min read
IT Architects Alliance
Oct 17, 2022 · Backend Development

Why Choose Microservice Architecture? A Roadmap and Key Concerns

This article explains why microservice architecture is preferred over monolithic applications, outlines a learning roadmap, and details essential concerns such as Docker, container orchestration, API gateways, load balancing, service discovery, event buses, logging, monitoring, distributed tracing, data persistence, caching, and cloud providers, with recommended tools for each.

Backend Architecture · api-gateway · cloud-native
0 likes · 17 min read
JD Retail Technology
Oct 14, 2022 · Industry Insights

Key Front‑End Innovations from JD’s 2022 Tech Salon: AR/VR, Low‑Code & More

The 2022 JD Front‑End Technology Salon gathered experts from leading companies to share practical insights on virtual digital humans, Node.js monitoring, lightweight cross‑platform rendering, low‑code platforms, micro‑frontend solutions, and other cutting‑edge front‑end technologies driving business growth.

AR/VR · Tech Conference · frontend
0 likes · 13 min read
58 Tech
Oct 11, 2022 · Operations

Design and Implementation of the “Sentinel” Monitoring System for Enterprise Data Reporting

The article details the background, five‑layer architecture, core modules, data model, processing, storage, and alert strategies of the Sentinel monitoring system built on Nebula Graph and integrated with Enterprise WeChat, highlighting its real‑time monitoring, task tracing, and the resulting improvements in reporting timeliness and reliability.

Enterprise WeChat · Graph Database · Nebula Graph
0 likes · 13 min read
dbaplus Community
Oct 10, 2022 · Databases

How to Collect Comprehensive MySQL Metrics Using Only Built‑In SHOW Commands

This guide explains how to gather extensive MySQL monitoring data—including connections, buffer cache, locks, SQL activity, statement counts, throughput, server variables, and slow‑query analysis—solely with MySQL's native SHOW statements, providing low‑overhead, real‑time insight for database administrators.

SHOW commands · database metrics · monitoring
0 likes · 10 min read

Solving Real‑World Data Quality Challenges with X‑Select’s DQC Platform

This article explains how X‑Select’s Data Quality Platform (DQC) addresses common data quality problems in large‑scale data development by defining six quality dimensions, leveraging open‑source solutions such as Apache Griffin and Qualitis, and implementing rule definition, execution, alerting, and workflow interruption within a Spark‑based architecture.

Big Data · Data Platform · Data Quality
0 likes · 15 min read
Code Ape Tech Column
Oct 8, 2022 · Cloud Native

A Practical Roadmap for Microservice Architecture: Concepts, Tools, and Best Practices

This article presents a comprehensive microservice architecture roadmap, explaining core concepts, why to adopt microservices, and recommending tools such as Docker, Kubernetes, API gateways, service discovery, logging, monitoring, tracing, data persistence, caching, and cloud providers for building scalable, resilient applications.

Docker · monitoring
0 likes · 15 min read
Liangxu Linux
Oct 2, 2022 · Operations

Essential Linux Ops Practices: Prevent Disasters and Boost Stability

Drawing from three and a half years of Linux operations, this guide outlines practical standards for testing, confirming commands, avoiding concurrent edits, mandatory backups, data safety, security hardening, continuous monitoring, performance tuning, and the right mindset to keep production environments stable and secure.

Backup · Linux · Operations
0 likes · 12 min read
政采云技术
Sep 29, 2022 · Operations

Design Considerations for Rebuilding a Procurement Warning Monitoring System

The article discusses the limitations of the existing procurement warning system, outlines the need for a comprehensive redesign to support complex business scenarios, and emphasizes continuous user feedback and seamless information and data flow to maintain long‑term competitiveness.

System Design · monitoring · procurement
0 likes · 2 min read
Aikesheng Open Source Community
Sep 27, 2022 · Operations

Refactoring Alertmanager: Reducing Noise, Improving Escalation, Suppression, and Silence Management

This article shares practical experiences and solutions for improving an Alertmanager‑based alert system, addressing problems such as noisy alerts, lack of escalation, missing recovery notifications, suppression limitations, and cumbersome silence management by redesigning architecture, adding custom scripts, and extending database support.

Alerting · Alertmanager · Operations
0 likes · 19 min read
dbaplus Community
Sep 26, 2022 · Backend Development

How Ctrip Replaced HBase with VictoriaMetrics & ClickHouse for Scalable Metrics Monitoring

Ctrip’s internal Dashboard monitoring platform, originally built on HBase, was redesigned by migrating its core writer and storage components to a hybrid VictoriaMetrics‑ClickHouse solution, delivering faster queries, higher write stability, and full Prometheus compatibility while keeping the user experience unchanged.

Dashboard · HBase · Metrics
0 likes · 13 min read
NetEase Yanxuan Technology Product Team
Sep 26, 2022 · Operations

How to Tame Alert Storms: Building a Systematic Monitoring and Alerting Framework for Microservices

This article analyzes the challenges of alert overload in large‑scale microservice environments and presents a systematic approach—including timeliness metrics, a maturity model, lifecycle tracking, feedback loops, downgrade mechanisms, and cross‑service aggregation—to improve alert effectiveness and reduce noise.

Alert Management · MTTR · Microservices
0 likes · 16 min read
IT Architects Alliance
Sep 23, 2022 · Cloud Native

How to Build a High‑Availability Microservices System on Kubernetes – A Complete Guide

This guide walks through designing a simple front‑end/back‑end microservices architecture, implementing it with Spring Boot and Eureka, deploying the services on a Kubernetes cluster using K8seasy, and adding high‑availability features such as multi‑instance registration, Prometheus‑Grafana monitoring, Zipkin tracing, and Sentinel flow‑control.

Cloud Native · Grafana · Kubernetes
0 likes · 20 min read
Huolala Tech
Sep 22, 2022 · Operations

How HuoLala Engineered a Scalable, High‑Availability Monitoring System for Multi‑Cloud

This article details the evolution of monitoring technologies, HuoLala's three‑phase monitoring architecture, the integration of Prometheus, VictoriaMetrics and SkyWalking, zero‑intrusion bytecode instrumentation, full‑link trace sampling, visual dashboards, metric‑trace‑log correlation, and future plans for root‑cause analysis and intelligent alerting.

Operations · bytecode · cloud
0 likes · 24 min read
ITFLY8 Architecture Home
Sep 20, 2022 · Cloud Native

Build a High‑Availability Microservices System on Kubernetes: Step‑by‑Step Guide

This tutorial walks through designing a simple microservice project with a separate front end and back end, implementing it with Spring Boot, deploying it on a Kubernetes cluster using K8seasy, and adding service registration, multi‑instance high availability, monitoring with Prometheus and Grafana, logging via Kafka, tracing with Zipkin, and flow control with Sentinel, with each feature verified through dashboards and tracing tools.

Deployment · Kubernetes · Microservices
0 likes · 21 min read
Open Source Linux
Sep 19, 2022 · Databases

Master RedisInsight: Install, Configure, and Use on Linux & Kubernetes

RedisInsight is a powerful GUI for Redis that enables monitoring, CLI interaction, and module support; this guide walks through its features, step‑by‑step installation on physical servers and Kubernetes, environment configuration, service startup, and basic usage including memory analysis and data operations.

Database GUI · Installation · Kubernetes
0 likes · 7 min read
Open Source Linux
Sep 15, 2022 · Databases

Master RedisInsight: Install, Deploy on Kubernetes, and Unlock Redis Monitoring

RedisInsight is a powerful GUI for Redis that offers cluster support, SSL/TLS connections, and memory analysis; this guide walks you through its features, step‑by‑step physical and Kubernetes installations, environment configuration, service startup, and basic usage for monitoring and managing Redis instances.

Database Management · Kubernetes · RedisInsight
0 likes · 7 min read
Alibaba Cloud Developer
Sep 14, 2022 · Operations

Mastering System Stability: From Fault Prevention to Emergency Response

This article outlines a comprehensive safety‑production framework that covers pre‑incident fault prevention, incident response, and post‑mortem improvement, detailing design‑for‑failure principles such as redundancy, isolation, idempotence, monitoring, automation, disaster recovery, scaling, rate‑limiting, and continuous testing to ensure reliable, resilient services.

Reliability · disaster recovery · incident management
0 likes · 16 min read
NetEase Yanxuan Technology Product Team
Sep 13, 2022 · Operations

How Yanxuan Built a Scalable Full‑Link Monitoring, Alerting, and Event‑Bus System for Microservices

This article details Yanxuan's four‑year evolution of a unified monitoring, alerting, and event‑bus platform for micro‑service architectures, covering design principles, technology selection, multi‑stage implementation, dynamic sampling, custom plugins, data modeling, visualization upgrades, and the final fault‑driven, system‑wide integration.

Alerting · Full‑Link Tracing · Microservices
0 likes · 23 min read
MaGe Linux Operations
Sep 12, 2022 · Databases

Master MySQL Monitoring with Built‑in SHOW Commands: A Complete Guide

This article explains how to collect comprehensive MySQL performance metrics using only native SHOW commands, covering connections, buffer pool, locks, SQL statements, throughput, server variables, and slow‑query analysis, while also offering practical tips for interpreting and optimizing the results.

monitoring · mysql · slow-query
0 likes · 10 min read
Top Architect
Sep 9, 2022 · Backend Development

Ensuring Reliable Message Delivery with Kafka: Preventing Message Loss

This article explains how to use a message queue like Kafka to decouple systems and control traffic, identifies the three main points where message loss can occur—producer, broker, and consumer—and provides practical detection methods and configuration recommendations to guarantee reliable, loss‑free message delivery.

Data Consistency · Message Queue · monitoring
0 likes · 12 min read
DeWu Technology
Sep 7, 2022 · Operations

DeWu Full-Chain Load Testing Platform: Design and Implementation

DeWu’s new full‑chain load‑testing platform replaces expensive 1:1 replica environments with a decentralized, container‑based system that isolates test data via middleware markers, supports multiple protocols (HTTP, Dubbo, gRPC, WebSocket, JDBC, Java), offers fixed‑QPS and thread modes, auto‑generates detailed reports, and achieves low CPU/memory usage while paving the way for future features such as data sanitization and dynamic throughput adjustment.

JMeter · Load Testing · monitoring
0 likes · 9 min read
Tencent Cloud Developer
Sep 7, 2022 · Cloud Native

Why Build Probe Capabilities Based on OpenTelemetry for Cloud‑Native Observability

Building probe capabilities on OpenTelemetry gives cloud‑native teams a vendor‑neutral, standardized way to extend monitoring into full observability—supporting large‑scale, language‑specific instrumentation, plug‑and‑play plugins, and seamless integration with APM backends—so developers and operators can detect, debug, and predict faults across distributed containers.

APM · Cloud Native · Node.js
0 likes · 15 min read
dbaplus Community
Sep 5, 2022 · Operations

How EyesTSDB Evolved into a Cloud‑Native, Second‑Level Monitoring Platform

This article details the evolution of NetEase's self‑built time‑series database EyesTSDB into a cloud‑native, second‑level monitoring solution, covering its architecture, core features, integration with VictoriaMetrics, custom plugin workflow, CMDB linkage, real‑world use cases, and future challenges.

CMDB integration · Metrics · Time Series Database
0 likes · 21 min read
JavaEdge
Sep 5, 2022 · Operations

Engineering Double‑11‑Scale E‑Commerce Events: A Complete Technical Playbook

From kickoff meetings and traffic forecasting to load‑testing strategies, rate‑limiting designs, emergency runbooks, and post‑event retrospectives, this guide walks engineers through the complete technical workflow required to ensure a Double‑11‑scale e‑commerce promotion runs smoothly and safely.

Load Testing · Traffic Engineering · incident response
0 likes · 12 min read
Wukong Talks Architecture
Sep 5, 2022 · Backend Development

Microservice Architecture: Evolution, Challenges, and Best Practices

This article explains the transition from monolithic to microservice architecture using an online supermarket example, highlights the problems caused by rapid, unplanned scaling, and presents systematic solutions such as service decomposition, database sharding, monitoring, tracing, gateway, service discovery, circuit breaking, testing, and service mesh adoption.

Backend · Microservices · Service Mesh
0 likes · 21 min read
Open Source Linux
Sep 1, 2022 · Operations

What’s New in Zabbix 6.0? Enhanced Monitoring, HA, AI & Cloud Features Explained

Zabbix 6.0 introduces a suite of enhancements—including high‑availability clustering, advanced business‑service monitoring with SLA calculations, root‑cause analysis, machine‑learning‑based anomaly detection, Kubernetes templates, a redesigned audit log, TLS certificate checks, UI improvements, customizable branding, and new integrations—aimed at boosting operational visibility and efficiency across cloud and on‑premise environments.

Kubernetes · Operations · Zabbix
0 likes · 12 min read
dbaplus Community
Sep 1, 2022 · Operations

How Vivo’s Server‑Side Monitoring Evolved: Architecture, Data Flow, and Alert Strategies

This article provides a comprehensive overview of Vivo's server‑side monitoring system, detailing its architecture evolution, data collection pipelines, OpenTSDB storage design, alerting mechanisms, and comparisons with other mainstream monitoring solutions, offering practical guidance for technology selection and implementation.

OpenTSDB · Operations · System Architecture
0 likes · 18 min read
DevOps
Aug 31, 2022 · Operations

Key Software Performance Metrics for Successful Development

This article explains why performance testing is essential before large‑scale deployment and outlines fourteen critical software performance metrics—such as response time, request rate, error rate, CPU utilization, and concurrent users—to help development teams measure, analyze, and improve their products.

QA · Software Testing · monitoring
0 likes · 7 min read
DevOps Cloud Academy
Aug 28, 2022 · Operations

Understanding the DevOps Lifecycle and Its Toolchain

This article explains the stages of the DevOps lifecycle—planning, coding, building, testing, releasing, deploying, operating, and monitoring—along with the popular tools used at each phase to enable continuous integration, delivery, and deployment.

DevOps · automation · ci/cd
0 likes · 6 min read
Zuoyebang Tech Team
Aug 26, 2022 · Operations

How We Built a Three‑Layer Stability System for Massive Scale Operations

This article details the operational mindset, stability framework, and transformation journey of the Zuoyebang infrastructure team, covering service lifecycle management, standardization, cloud‑native architecture, multi‑active deployment, incident pre‑plan platforms, traffic scheduling, monitoring, capacity planning, and future directions for SRE service‑orientation.

Infrastructure · Operations · SRE
0 likes · 20 min read
dbaplus Community
Aug 24, 2022 · Backend Development

From Monolith to Microservices: Transforming an Online Supermarket

This article walks through the evolution of an online supermarket from a simple monolithic web app to a fully fledged microservice architecture, detailing the motivations, design decisions, component breakdown, common pitfalls, and essential practices such as monitoring, tracing, logging, service discovery, resilience patterns, testing, and the role of service meshes.

Microservices · Service Mesh · circuit breaker
0 likes · 24 min read
Architects Research Society
Aug 24, 2022 · Operations

Choosing Appropriate SLIs and Defining SLOs for Reliable Services

This guide explains how to select suitable service‑level indicators (SLIs), define customer‑centric service‑level objectives (SLOs), use error budgets, and iteratively improve reliability for various system types such as services, data processing, and storage, with practical recommendations for Google Cloud environments.

Google Cloud · SLI · SLO
0 likes · 10 min read
Wukong Talks Architecture
Aug 24, 2022 · Cloud Native

Overview of ZuanZuan Cloud Platform: Architecture, Image Management, Release Upgrade, Container Monitoring, and Log Collection

The article introduces the ZuanZuan cloud platform, detailing its overall architecture, image management workflow that abstracts Dockerfiles, release‑upgrade strategies with custom controllers, container monitoring evolution to Prometheus, and log‑collection mechanisms that handle large Java‑based log volumes.

Container · Image Management · cloud-native
0 likes · 8 min read
Alibaba Cloud Developer
Aug 23, 2022 · Cloud Native

Why iLogtail’s Open‑Source Cloud‑Native Agent Is Redefining Observability

This article explores the open‑source release of Alibaba Cloud's iLogtail, detailing its lightweight, high‑performance design, multi‑tenant isolation, plugin architecture, Kubernetes integration, and the differences between its enterprise and community editions, while highlighting its role in modern observability pipelines.

Kubernetes · log collection · monitoring
0 likes · 27 min read
Shopee Tech Team
Aug 18, 2022 · Cloud Native

Shopee Druid Cloud Native Architecture Evolution: Design and Implementation

Shopee transformed its Druid analytics platform from a fragile physical‑machine setup into a cloud‑native, Kubernetes‑orchestrated solution that adds independent clusters, automatic scaling, traffic management, GitOps‑driven deployment, and container isolation, delivering higher stability, efficiency, lower cost, and stronger security alongside integrated monitoring and visualization tools.

Druid · Kubernetes · Scalability
0 likes · 20 min read
High Availability Architecture
Aug 15, 2022 · Big Data

Comprehensive Guide to Event Tracking Governance and the One‑Stop Tracking Management Platform

This article explains why event‑tracking governance is essential, outlines the methodology and practice of full‑link tracking management, and introduces the one‑stop tracking platform with its innovative features such as standardized processes, verification tools, real‑time dashboards, cross‑platform data unification, and future roadmap.

Analytics · Big Data · Data Governance
0 likes · 15 min read
Bilibili Tech
Aug 12, 2022 · Operations

SLO Implementation and Alerting Strategies – Bilibili SRE Practices

The article outlines Bilibili’s refined SLO framework—categorizing services into four business tiers, selecting availability, latency, and freshness SLIs, setting concrete SLO targets, and employing multi‑window error‑budget and consumption‑rate alerting strategies to improve stability and provide comprehensive quality dashboards.

Alerting · Metrics · SLO
0 likes · 18 min read
Open Source Linux
Aug 12, 2022 · Operations

What’s New in Grafana 9.0? Explore Visual Query Builders and UI Enhancements

Grafana 9.0 focuses on improving user experience for observability and data visualization, introducing visual Prometheus and Loki query builders, an Explore‑to‑dashboard workflow, a revamped heatmap panel, command palette, panel search, trace panels, navigation upgrades, and enhanced alerting, all aimed at making data discovery and investigation more intuitive and efficient.

Dashboard · Grafana · Loki
0 likes · 9 min read
Huolala Tech
Aug 11, 2022 · Operations

How Huolala Built an AI‑Powered Intelligent Monitoring Platform at Scale

This article details Huolala's journey from basic monitoring to an AI‑driven intelligent observability platform, covering AIOps concepts, a comprehensive monitoring framework, practical implementations, automated alert analysis, lessons learned, and future directions for large‑scale operations.

DevOps · Huolala · Operations
0 likes · 18 min read
Open Source Linux
Aug 11, 2022 · Operations

Master Zabbix: From Installation to Advanced Monitoring and Alerting

This comprehensive guide explains why monitoring is essential, describes reliability metrics, walks through Zabbix installation, web UI configuration, custom monitoring, trigger creation, alert integration, distributed monitoring, SNMP support, and large‑scale server monitoring using scripts, APIs, and auto‑discovery.

Alerting · Proxy · SNMP
0 likes · 24 min read
IT Architects Alliance
Aug 9, 2022 · Backend Development

Mastering Consistent API Design: 22 Essential Best Practices

This guide presents 22 practical rules for designing clean, consistent RESTful APIs—including resource-oriented URLs, kebab‑case paths, camelCase parameters, proper use of HTTP verbs, versioning, pagination, field selection, CORS, security, and monitoring—to help developers avoid common pitfalls and improve API usability.

HTTP methods · URL conventions · Versioning
0 likes · 9 min read
Alibaba Cloud Native
Aug 9, 2022 · Cloud Native

How AIA Built a Scalable Cloud‑Native Observability Platform for Insurance

This case study details how AIA Insurance transformed legacy insurance systems into a cloud‑native, micro‑service architecture and implemented a comprehensive observability platform using Kubernetes, Prometheus, Grafana and custom data pipelines to improve SLA, fault detection, and business‑level monitoring.

DevOps · Microservices · monitoring
0 likes · 11 min read
Efficient Ops
Aug 8, 2022 · Operations

Master Essential Linux Ops: xargs, Background Jobs, Process Monitoring & More

This guide walks you through practical Linux operations—from using xargs for efficient file handling and running commands in the background, to monitoring high‑memory and high‑CPU processes, viewing multiple logs with multitail, continuous ping logging, checking TCP states, identifying top IPs on port 80, and leveraging SSH for port forwarding.

Ops · SSH · Shell
0 likes · 10 min read
Senior Brother's Insights
Aug 8, 2022 · Databases

How to Install and Use RedisInsight for Redis Cluster Management

RedisInsight is a powerful GUI for Redis that supports cluster management, SSL connections, and memory analysis; this guide walks through downloading, installing on Linux, configuring environment variables, running as a service, deploying via Kubernetes, and using the web UI to monitor and operate Redis instances.

Database Management · GUI · Installation
0 likes · 7 min read
ByteDance Web Infra
Aug 8, 2022 · Frontend Development

Design and Architecture of a Multi‑Environment Frontend Monitoring SDK

The article explains how a frontend monitoring SDK can support diverse environments such as web, mini‑programs, and Electron by decoupling logic into interchangeable roles, providing a rich plugin lifecycle, enabling business‑driven extensions, on‑demand loading, and rigorous quality controls while minimizing impact on the host application.

SDK · architecture · frontend
0 likes · 15 min read
Ops Development Stories
Aug 6, 2022 · Cloud Native

8 Proven Strategies to Beat Alert Fatigue in Kubernetes

This article explains why alert fatigue harms on‑call teams in Kubernetes environments and offers eight practical techniques—ranging from metric definition to alert suppression—to reduce noise, improve response efficiency, and protect team well‑being.

Kubernetes · Operations · alert fatigue
0 likes · 8 min read
Aikesheng Open Source Community
Jul 30, 2022 · Databases

Weekly Technical Newsletter: MySQL Releases, Linux I/O Optimization, Monitoring Tips, and SQLE Updates

This weekly newsletter curates top community technical shares covering MySQL 8.0.30 GA, master‑slave replication recovery, InnoDB parameters, Linux I/O optimization, two‑phase commit, Prometheus‑Grafana monitoring pitfalls, as well as the latest SQLE 1.2207.0 release, development progress, and upcoming plans.

Linux I/O · monitoring · mysql
0 likes · 4 min read
dbaplus Community
Jul 25, 2022 · Operations

How to Build an Enterprise‑Grade Monitoring & Alerting Platform with Prometheus, Grafana, and AlertManager

Designing a unified, enterprise‑level monitoring and alerting platform, this article analyzes the shortcomings of standard Prometheus‑Grafana‑AlertManager setups, outlines platform‑vs‑business responsibilities, details architecture, user‑scenario requirements, component selection, high‑availability strategies, and deployment models to achieve scalable, easy‑to‑use observability.

Operations · high-availability · monitoring
0 likes · 12 min read
Open Source Linux
Jul 25, 2022 · Cloud Native

How to Decode Container CPU Metrics in Prometheus and Docker Stats

This article explains the key Prometheus metrics for Kubernetes container CPU usage, provides exact PromQL formulas for calculating per‑container CPU percentages, and details how Docker stats reports memory and CPU usage, including the necessary calculations and sample code.

CPU Metrics · Docker · Kubernetes
0 likes · 8 min read
dbaplus Community
Jul 21, 2022 · Operations

How Huolala Built an AI‑Powered End‑to‑End Monitoring Platform

This article details Huolala's journey from a fragmented monitoring stack to a unified, AI‑enhanced observability platform, covering AIOps concepts, the design of a comprehensive monitoring framework, concrete implementation of metrics, tracing, logging, alerting, and lessons learned for large‑scale operations.

DevOps · aiops · cloud
0 likes · 19 min read
TAL Education Technology
Jul 21, 2022 · Frontend Development

Front‑end Performance Monitoring and Optimization Techniques

This article explains the concepts of controllable and uncontrollable latency in front‑end performance, outlines methods for collecting static, dynamic, and API performance data using network waterfall, Chrome Performance, console timing, and Performance Timing API, and provides practical code examples and optimization strategies to improve user experience.

Performance API · Web · axios
0 likes · 14 min read
Top Architect
Jul 21, 2022 · Backend Development

Performance Monitoring and Optimization Practices for Backend Systems

The article outlines practical approaches to monitor and resolve performance bottlenecks in backend applications, covering database slow‑query logs, interface latency, message‑queue backlogs, segmentation timing, caching, batch calls, multithreading, and database tuning techniques such as indexing and transaction isolation.

Backend · caching · monitoring
0 likes · 9 min read
JD Retail Technology
Jul 21, 2022 · Backend Development

Design and Implementation of JD's 01 Payment Platform: Unified Payment Service Layer and One‑Stop Integration

The article outlines JD's 01 Payment Platform, describing how a unified payment service layer and a one‑stop integration platform were built to replace the legacy checkout system, improve performance, support multi‑channel SaaS/PaaS capabilities, enable intelligent monitoring, and successfully handle large‑scale promotional events.

PaaS · SaaS · architecture
0 likes · 10 min read
MaGe Linux Operations
Jul 20, 2022 · Operations

What a Typical Ops Day Looks Like—and How to Make It More Productive

The author recounts a chaotic typical day for Chinese operations engineers, then proposes a balanced schedule that prioritizes urgent firefighting tasks while dedicating most time to proactive monitoring, performance tuning, tool development, and continuous learning for long‑term system stability.

DevOps · Operations · monitoring
0 likes · 4 min read
IT Architects Alliance
Jul 18, 2022 · Operations

Comparison of Prometheus and Zabbix Monitoring Solutions

This article compares Prometheus and Zabbix, outlining their histories, architectures, storage models, configuration complexity, community activity, and suitability for different environments, and concludes with recommendations on when to choose each monitoring system.

Comparison · Operations · Prometheus
0 likes · 9 min read