Tagged articles
2179 articles
Bitu Technology
Mar 15, 2024 · Artificial Intelligence

Monitoring Quality Issues in Tubi’s Recommendation System

This article explains how Tubi monitors the quality of its recommendation system by identifying potential failure points, tracking key data streams such as model input, final recommendation output, and training data, and designing a scalable, real‑time monitoring solution with clear protocols and extensible metrics.

Data Quality · Real-Time · Scalability
0 likes · 11 min read
Practical DevOps Architecture
Mar 15, 2024 · Operations

Comprehensive Practical Guide to Prometheus Configuration, Optimization, and Source Code Development

This multi‑chapter guide provides in‑depth, hands‑on instruction for configuring and optimizing all Prometheus components, exploring Kubernetes monitoring, source‑code analysis, custom exporter development, high‑availability setups, service discovery, resource‑efficient scraping, and integrating Thanos for long‑term storage.

Kubernetes · Operations · Prometheus
0 likes · 4 min read
DevOps Operations Practice
Mar 14, 2024 · Operations

Resolving Frequent Crashes of a Single-Node Prometheus Deployment: Analysis and Solutions

This article analyzes why a single Prometheus instance repeatedly runs out of memory and crashes, explains the underlying storage mechanisms, and presents practical solutions such as metric reduction, retention tuning, federation architecture, and remote storage integration to improve stability and scalability.

Federation · Prometheus · monitoring
0 likes · 6 min read
Linux Cloud Computing Practice
Mar 13, 2024 · Operations

Top 10 Essential Tools Every Operations Engineer Should Master

This article introduces ten indispensable tools for operations engineers, detailing each tool's functionality, typical use cases, key advantages, and real‑world examples, helping professionals streamline automation, monitoring, configuration, and deployment tasks and improve overall system reliability.

Infrastructure · Operations · monitoring
0 likes · 6 min read
Linux Code Review Hub
Mar 5, 2024 · Operations

Why Did Opening a Log with Vim Kill the Java Process?

A port alarm revealed a missing Java process, which was later traced to an OOM kill triggered by vim loading a 37 GB nginx log into an 8 GB container, illustrating how editor behavior and Linux's OOM killer can unexpectedly terminate critical services.

Container · Linux · Nginx
0 likes · 7 min read
Architecture & Thinking
Mar 5, 2024 · Databases

How Database Middleware Solves High‑Traffic Challenges: Connection Pools, Sharding, and More

This article examines how database middleware tackles the demanding needs of large‑scale internet services by providing centralized connection‑pool management, transparent read‑write splitting, diverse load‑balancing algorithms, sharding support, automatic failover, security controls, comprehensive monitoring, and flexible backup‑recovery mechanisms.

Connection Pool · fault tolerance · monitoring
0 likes · 9 min read
JD Tech
Feb 28, 2024 · Databases

Detecting and Monitoring Database Deadlocks with EasyBI: A Practical Case Study

This article recounts how a production database deadlock was uncovered during testing, explains the use of the EasyBI monitoring tool to collect and visualize error and claim statistics, and shares the step‑by‑step configuration, analysis, and lessons learned for preventing similar issues in future systems.

EasyBI · Error Handling · database
0 likes · 8 min read
Huolala Tech
Feb 28, 2024 · Operations

How Huolala Created an Intelligent Automated Testing System to Raise Coverage & Cut Regression Costs

Facing rapid business expansion, Huolala’s quality assurance team tackled redundant code, high regression costs, and lack of coverage metrics by designing an intelligent automated testing framework that analyzes effective code, provides smart test case recommendations, visualizes progress, and integrates monitoring, resulting in significant coverage improvements and efficiency gains across services.

Automated Testing · JaCoCo · code coverage
0 likes · 25 min read
DevOps Operations Practice
Feb 16, 2024 · Operations

Linux, Networking, Container, and Monitoring Interview Questions

This article compiles a comprehensive set of interview-style questions covering Linux file handling, CPU metrics, link types, TCP handshakes, process vs thread, TCP/UDP differences, DDoS mitigation, Keepalived operation, TIME_WAIT optimization, container networking, Kubernetes components, deployment strategies, monitoring concepts, Prometheus architecture, and common web‑site operational issues.

Containers · Linux · interview
0 likes · 4 min read
MaGe Linux Operations
Feb 14, 2024 · Operations

Master Linux Performance: Key Factors and Essential Optimization Tools

This article examines the various hardware and OS resources that affect Linux performance—including CPU, memory, disk I/O, and network bandwidth—then details practical optimization techniques and essential monitoring tools such as vmstat, iostat, free, sar, and netstat to diagnose and improve system efficiency.

Linux · Sysadmin · monitoring
0 likes · 16 min read
MaGe Linux Operations
Feb 7, 2024 · Databases

How to Build a Real‑Time Data Guard System for Dameng Database

This guide walks through setting up a Dameng data‑guard service using a primary, standby, and monitor server, covering data preparation, configuration of dm.ini, dmmal.ini, dmarch.ini, dmwatcher.ini, starting services, OGUID setup, mode switching, and monitoring to achieve high‑availability replication.

Backup · Dameng · Data Guard
0 likes · 12 min read
DevOps Cloud Academy
Feb 2, 2024 · Operations

DevOps Tools for 2024: A Comprehensive Overview

An extensive overview of essential DevOps tools for 2024, covering categories such as version control, CI/CD, container orchestration, configuration management, infrastructure as code, monitoring, collaboration, artifact repositories, testing, security, deployment automation, serverless platforms, and database management to guide effective tool selection.

DevOps · Infrastructure as Code · Tooling
0 likes · 7 min read
Efficient Ops
Jan 29, 2024 · Operations

Mastering Incident Response: A Practical Guide to Faster Service Recovery

This guide walks ops teams through real‑world incident scenarios, from quick symptom identification and emergency recovery to improving monitoring, crafting concise emergency plans, and leveraging automation for smarter fault handling, helping organizations restore services faster and reduce downtime.

emergency planning · fault handling · incident response
0 likes · 14 min read
IT Services Circle
Jan 25, 2024 · Operations

How to Resolve Online Message Queue Backlog Issues

This article explains why message queues can become backlogged, identifies producer and consumer causes, and provides practical strategies—including adding consumers, increasing queue capacity, optimizing consumption logic, implementing failure handling, and rapid remediation steps—to quickly resolve backlog in production environments.

Backlog · Message Queue · Operations
0 likes · 7 min read
DevOps
Jan 23, 2024 · Operations

Collection of Bash Scripts for Server Monitoring, Automation, and Deployment

This article provides a curated set of Bash scripts covering MySQL replication monitoring, directory change detection, bulk user creation, website health checks, remote command execution, LNMP stack deployment, server resource reporting, high‑resource process identification, and automated deployment of Java and PHP projects, offering practical automation tools for system administrators.

Bash · Deployment · Operations
0 likes · 12 min read
Efficient Ops
Jan 22, 2024 · Operations

Mastering Monitoring: Black‑Box vs White‑Box, Metrics, and Prometheus in Practice

This guide explains monitoring fundamentals, clears common misconceptions, compares black‑box and white‑box approaches, outlines key metrics such as latency, traffic, errors and saturation, and provides a deep dive into Prometheus architecture, data model, query language, and practical examples for CPU, memory, and disk monitoring.

Prometheus · cloud-native · monitoring
0 likes · 15 min read
Efficient Ops
Jan 22, 2024 · Operations

How New Oriental Standardized Its Observability System to Cut Costs and Boost Efficiency

At the 21st GOPS Global Operations Conference, New Oriental's senior operations manager Qi Chen detailed the demand, technical, and focus pressures that drove a phased, full‑process observability standardization, leveraging OpenTelemetry, Telegraf, Loki and CMDB tagging to achieve cost reduction and higher stability.

Cost reduction · DevOps · OpenTelemetry
0 likes · 8 min read
Efficient Ops
Jan 21, 2024 · Operations

Essential Bash Scripts for Efficient Server Operations and Automation

This article compiles a set of practical Bash scripts that cover MySQL replication monitoring, directory change detection with real‑time sync, bulk user creation, website health checks, remote command execution, one‑click LNMP deployment, resource usage reporting, high‑CPU process identification, and automated Java/Tomcat and PHP project deployments.

Bash · Deployment · monitoring
0 likes · 12 min read
Rare Earth Juejin Tech Community
Jan 18, 2024 · Frontend Development

Comprehensive Guide to Front-End Performance Optimization

This article systematically outlines common front‑end performance optimization techniques, explains key web performance metrics such as Speed Index, FCP, CLS, LCP and TBT, and provides practical strategies for resource compression, network and code optimization, as well as monitoring and measurement best practices.

Resource Compression · monitoring · network
0 likes · 20 min read
Efficient Ops
Jan 14, 2024 · Operations

Mastering Incident Command: A Practical Guide for SRE Fault Handling

This article outlines a comprehensive, step‑by‑step approach for SRE incident commanders, covering fault perception, grading, team organization, remediation tactics, transparent communication, and post‑mortem practices to efficiently resolve service disruptions.

SRE · fault handling · incident management
0 likes · 14 min read
dbaplus Community
Jan 8, 2024 · Backend Development

How We Built an Automated Payment Channel Management System with Redis and Prometheus

To handle growing payment traffic and unreliable third‑party gateways, the team at Zhuanzhuan designed an automated payment‑channel management platform that uses a custom Redis‑based time‑series store, Prometheus monitoring, and a sliding‑window failure‑rate algorithm to detect, alert, and eventually auto‑switch faulty channels.

Prometheus · automation · fault-tolerance
0 likes · 10 min read
Zhuanzhuan Tech
Jan 5, 2024 · Operations

Building an Integrated Monitoring Platform: Architecture, Implementation, and Lessons from Zhuanzhuan

This article presents a detailed case study of how Zhuanzhuan designed and implemented a unified monitoring platform covering business services, middleware, and operations resources: selecting Prometheus and M3DB, automating Grafana dashboards, creating a low‑noise alerting system, and achieving large‑scale observability with significant cost and efficiency gains.

Alerting · M3DB · Operations
0 likes · 21 min read
Zhuanzhuan QA
Jan 4, 2024 · Operations

Automated Error Log Cleanup and Monitoring Mechanism for QA

This article describes how a QA team collaborated with developers to create an automated error‑log cleanup and monitoring system, detailing the background, offline follow‑up process, identified pain points, the design of a scheduled statistics solution, platform capabilities, observed benefits, and future improvement plans.

Error Logging · QA · monitoring
0 likes · 8 min read
Zhuanzhuan Tech
Jan 4, 2024 · Backend Development

Three‑Step Strategy for Identifying and Removing Zombie Services, Methods, and Component Dependencies

This article presents a detailed three‑step plan used by Zhuanzhuan to detect and eliminate zombie services, unused code methods, and obsolete component dependencies through monitoring, static analysis with Spoon, and Java‑agent based runtime tracing, achieving significant resource savings and improved code health.

backend optimization · java · monitoring
0 likes · 13 min read
Tencent Cloud Developer
Jan 3, 2024 · Backend Development

Exception Handling: Requirements, Modeling, and Best Practices in Backend Development

The article outlines backend exception‑handling best practices, detailing business requirements such as memory‑safe multithreaded throws, clear separation of concerns, framework fallback strategies, simple macro‑based APIs, unified error‑code monitoring, rich debugging information, extensible type‑erased models, and appropriate handling of critical, recoverable, and checked exceptions across development and production environments.

C++ · Exception Handling · debugging
0 likes · 28 min read
Liangxu Linux
Jan 2, 2024 · Information Security

How to Monitor Linux User Activity with Built‑In Commands and Auditd

This guide explains how to track Linux user activity and system events using native commands such as who, w, last, ps, ss, journalctl, and the auditd framework, providing step‑by‑step examples and advanced auditing techniques for security and compliance.

Auditd · Sysadmin · commands
0 likes · 7 min read
Goodme Frontend Team
Jan 1, 2024 · Frontend Development

How Guming’s Front‑End Data Center Enables Real‑Time Monitoring for Web, Mini‑Programs, Flutter & Node.js

Guming’s Front‑End Data Center integrates monitoring, performance, logging, and analytics for web, mini‑programs, Flutter clients, and Node.js services, offering real‑time alerts, high availability, sampling, multi‑channel data pipelines, custom charting, and detailed CPU/GC profiling to streamline issue diagnosis and business insights.

Data Platform · frontend · monitoring
0 likes · 10 min read
Architecture & Thinking
Dec 25, 2023 · Databases

How to Detect, Analyze, and Prevent Redis Hot Keys to Avoid Outages

This article explains what Redis hot keys are, the scenarios that generate them, their risks, and provides practical monitoring methods and mitigation strategies—including cache pre‑warming, distributed caching, rate limiting, and secondary caches—to keep production systems stable.

Hot Key · fault tolerance · monitoring
0 likes · 11 min read
Weimob Technology Center
Dec 22, 2023 · Big Data

Unlocking Elasticsearch at Scale: Real‑World Practices from Weimob

The Weimob Technology Salon session on "Elasticsearch in Weimob's Practice" shares practical usage recommendations, monitoring setups with Prometheus and Grafana, field‑type guidance, and solutions to common operational challenges, offering developers actionable insights for high‑performance search deployments.

Big Data · Elasticsearch · Weimob
0 likes · 5 min read
Zuoyebang Tech Team
Dec 22, 2023 · Databases

Unlocking Intelligent Database Operations: Inside Zyb’s Multi‑Cloud Platform

This article details how Zyb’s multi‑cloud database platform integrates diverse database types, a unified proxy layer, intelligent lifecycle management, automated task orchestration, monitoring, resource allocation, backup, and fault‑handling to achieve efficient, reliable, and secure database operations across cloud environments.

Backup · Intelligent Operations · databases
0 likes · 19 min read
dbaplus Community
Dec 20, 2023 · Operations

Scaling Kafka to 1000+ Nodes: Governance, Auto‑Balancing & Tiered Storage

This article outlines how a large‑scale Kafka deployment of over a thousand machines across dozens of clusters was engineered for stability and efficiency through a custom Guardian controller that adds partition‑level throttling, automatic balancing, multi‑tenant isolation, cross‑IDC management, tiered storage, audit capabilities, and fully automated operational workflows.

Cluster Management · Kafka · Operations
0 likes · 21 min read
Architect
Dec 15, 2023 · Industry Insights

How Bilibili Engineered a Scalable Live‑Commerce Platform from Zero to One

This article details Bilibili's step‑by‑step transformation of a fragmented, high‑coupling live‑commerce system into a modular, platform‑centric architecture, covering product middle‑platform construction, unified standards, storage migration, monitoring with Prometheus/Grafana, and performance gains such as a three‑fold query speedup and a reduction of development cycles from 46 to 5 person‑days.

Bilibili · Microservices · Scalability
0 likes · 24 min read
DevOps Cloud Academy
Dec 14, 2023 · Operations

CI/CD Observability via OpenTelemetry at Grafana Labs

The article explains the importance of CI/CD observability, outlines common pipeline problems, introduces Grafana's GraCIe plugin built on OpenTelemetry, and discusses how enhanced visibility can improve reliability, decision‑making, and future standardization across CI/CD platforms.

DevOps · Grafana · OpenTelemetry
0 likes · 13 min read
dbaplus Community
Dec 13, 2023 · Fundamentals

How to Design Scalable, Maintainable Software Architecture: From Principles to Practice

This article explores how to build a robust engineering architecture by prioritizing product value, defining clear layered and DDD structures, selecting appropriate technologies, and establishing standards for exception, logging, monitoring, and team collaboration to achieve scalability, maintainability, reliability, security, and high performance.

Domain-Driven Design · Exception Handling · Microservices
0 likes · 27 min read
Bilibili Tech
Dec 12, 2023 · Backend Development

Platformization of Bilibili's Live‑Streaming E‑Commerce Business: Architecture, Implementation and Governance

Bilibili transformed its fast‑growing live‑streaming e‑commerce operation by constructing a modular platform that separates product, user, and application layers, introduces a unified product middle‑platform, standardized capabilities, real‑time attribute handling, and robust monitoring and governance, thereby reducing technical debt, improving stability, and preparing for hundred‑billion‑level GMV scaling.

Bilibili · e-commerce platform · live streaming
0 likes · 24 min read
Code Ape Tech Column
Dec 12, 2023 · Operations

Centralized Log Collection with Filebeat and Graylog

This article explains how to use Filebeat together with Graylog to collect, ship, store, and analyze logs from multiple environments, covering tool introductions, configuration files, Docker deployment, Spring Boot integration, and practical search syntax for effective log monitoring.

Elasticsearch · Filebeat · Graylog
0 likes · 20 min read
DataFunSummit
Dec 11, 2023 · Big Data

Design and Implementation of a Big Data Metadata Warehouse at Bilibili

This article presents Bilibili's big‑data metadata warehouse, covering its background, technology selection between data‑lake and data‑warehouse solutions, the architecture built on Prometheus, StarRocks, Flink and Routine Load, performance comparisons, diagnostic system design, and future development plans.

Flink · Metadata Warehouse · StarRocks
0 likes · 20 min read
Efficient Ops
Dec 10, 2023 · Cloud Native

How to Build a Complete Kubernetes Monitoring Stack with Prometheus & Grafana

This guide walks through a full Kubernetes monitoring solution using cAdvisor, node_exporter, Prometheus, and Grafana, covering architecture, data collection, service discovery, deployment steps with DaemonSets, and detailed YAML configurations for a production‑ready observability stack.

Grafana · Kubernetes · Prometheus
0 likes · 6 min read
DevOps Coach
Dec 8, 2023 · Frontend Development

How to Add Elastic RUM Monitoring to a Hugo Site

This guide explains what Elastic Real User Monitoring (RUM) is, outlines its key benefits, and provides step‑by‑step instructions with code snippets for integrating the Elastic RUM JavaScript agent into a Hugo static site, including configuration parameters and how to view the collected data in Kibana.

APM · Hugo · RUM
0 likes · 14 min read
Yunxuetang Frontend Team
Dec 8, 2023 · Frontend Development

Key Front-End Trends and Techniques to Watch in 2023

2023 saw rapid evolution in the front‑end ecosystem, highlighted by major events, a controversial Gemini AI demo, SkyWalking‑based performance and error monitoring, innovative text‑overflow handling, CSS techniques that boost long‑list rendering by up to seven times, and an automatic, non‑intrusive skeleton‑screen generation solution.

2023 · Skeleton Screen · frontend
0 likes · 4 min read
Open Source Linux
Dec 8, 2023 · Operations

Top 5 Log Management Tools Every DevOps Engineer Should Know

This article reviews five leading log management solutions—Graylog, LogDNA, ELK Stack, Grafana Loki, and Splunk—detailing their core components, key features, and why they are valuable for monitoring, troubleshooting, and securing modern IT environments.

DevOps · ELK Stack · Grafana Loki
0 likes · 7 min read
HomeTech
Dec 8, 2023 · Mobile Development

Automotive Home Push Platform Architecture and Future Development

This article introduces the architecture and core functions of the Automotive Home Push Platform, covering its development history, technical implementation, monitoring system, and future plans for intelligent message distribution.

Cloud Native · Microservices · User experience
0 likes · 9 min read
Architect
Dec 5, 2023 · Backend Development

How to Build an Efficient, Low‑Complexity Microservices Architecture

This article outlines nine practical best‑practice steps for designing a low‑complexity, high‑efficiency microservices ecosystem, covering principles such as the Single Responsibility Principle, cross‑functional team organization, appropriate tooling, asynchronous communication, DevSecOps security, independent data stores, isolated deployment, orchestration, and effective monitoring, each illustrated with concrete examples.

Backend Architecture · DevOps · DevSecOps
0 likes · 14 min read
Efficient Ops
Dec 3, 2023 · Artificial Intelligence

How to Build a Zabbix Expert Advisor with GPT‑4 in Minutes

This guide walks you through why GPT‑4 outperforms GPT‑3.5, shows step‑by‑step how to create a Zabbix expert consultant using the new GPTs feature, and explains advanced configuration, knowledge‑base feeding, testing, and future possibilities for AI‑enhanced monitoring.

AI Assistant · GPT-4 · Knowledge Base
0 likes · 7 min read
Open Source Linux
Dec 1, 2023 · Operations

10 Essential Ops Tools Every Engineer Should Master

This article introduces ten indispensable tools for operations engineers, detailing each tool's functionality, suitable scenarios, advantages, and real‑world examples, and includes practical code snippets to help automate, monitor, and manage infrastructure efficiently.

Operations · automation · devops tools
0 likes · 8 min read
Architect
Nov 30, 2023 · Cloud Native

From Monolith to Resilient Microservices: A Step‑by‑Step Architecture Evolution

The article walks through a real‑world online supermarket project, showing how a simple monolithic system evolves into a fully‑featured microservice architecture, detailing each refactoring stage, the problems encountered, and the concrete solutions such as service extraction, database sharding, monitoring, tracing, gateways, service discovery, reliability patterns, testing, and service‑mesh adoption.

Cloud Native · Service Mesh · architecture
0 likes · 25 min read
DevOps
Nov 29, 2023 · Backend Development

Microservice Architecture Evolution: From Monolith to Service Mesh

This article walks through the journey of transforming a simple online supermarket from a monolithic application to a fully fledged microservice architecture, highlighting the motivations, design decisions, component breakdown, operational challenges, monitoring, tracing, resilience patterns, testing strategies, and the role of service meshes.

DevOps · Microservices · Service Mesh
0 likes · 21 min read
Architecture and Beyond
Nov 25, 2023 · Operations

Effective Log Management Strategy: Standards, SDK Integration, and Lifecycle Practices

The article outlines common logging problems and presents a comprehensive six‑step strategy—including clear logging standards, systematic standard management, a unified SDK, centralized log management systems, regular standard reviews, and lifecycle deprecation—to transform chaotic logs into a reliable tool that boosts development efficiency.

Log Management · Operations · SDK
0 likes · 7 min read
Architect
Nov 24, 2023 · Industry Insights

How We Evolved the Voice Chat Room Architecture to Scale with Real‑Time Interaction

This article chronicles the year‑long evolution of the voice‑chat room system, detailing how product‑driven requirements forced successive redesigns of both the live‑streaming and RTC subsystems, the introduction of session‑and‑channel abstractions, migration of mic‑seat management to the backend, and the implementation of monitoring, testing, and deployment practices that keep the architecture stable and extensible.

Domain-Driven Design · Microservices · RBAC
0 likes · 28 min read
dbaplus Community
Nov 23, 2023 · Operations

How to Cut Alert Noise in Monitoring: Proven Strategies and Code Samples

This article explains why monitoring alert noise harms efficiency, presents metrics such as recall and accuracy, details rule‑based, blacklist/whitelist, ratio‑based, and intelligent noise‑reduction techniques, shares Java code examples, and shows measurable results after applying the governance process.

Alert Noise Reduction · Operations · incident management
0 likes · 13 min read
Sanyou's Java Diary
Nov 23, 2023 · Backend Development

From Monolith to Microservices: A Complete Journey with Real‑World Examples

This article walks through the evolution of an online supermarket from a simple monolithic website to a fully decoupled microservice architecture, covering initial requirements, common pitfalls, service decomposition, database splitting, monitoring, tracing, logging, gateways, service discovery, circuit breaking, testing, frameworks, and service mesh, while illustrating each step with diagrams and practical advice.

Microservices · circuit breaker · monitoring
0 likes · 22 min read
Baidu Geek Talk
Nov 22, 2023 · Operations

Stability Assurance for Baidu Search Aladdin during Large-Scale Events

Baidu’s Aladdin search service safeguards stability during massive traffic spikes, such as the Gaokao and the Tokyo and Beijing Olympics, by mapping dependencies, deploying multi‑dimensional monitoring, adding scaling layers like multi‑region Redis, and establishing rapid‑response on‑call teams, achieving over 99.99 % uptime and near‑real‑time data updates.

backend operations · fault handling · large-scale traffic
0 likes · 9 min read
Qunar Tech Salon
Nov 22, 2023 · Operations

Optimizing Qunar's Monitoring System for Faster Fault Detection and Root‑Cause Analysis

This article details Qunar's comprehensive overhaul of its monitoring platform, which introduced per‑second metrics, redesigned storage with VictoriaMetrics, optimized client and server data collection, and built a root‑cause analysis tool, dramatically reducing order‑related fault discovery time from minutes to under one minute.

Microservices · Operations · TSDB
0 likes · 22 min read
Alibaba Cloud Native
Nov 18, 2023 · Cloud Native

How eBPF Powers Next‑Gen Observability and Root‑Cause Analysis in Kubernetes

This talk explains the three major observability challenges in Kubernetes, demonstrates how eBPF enables comprehensive, low‑overhead data collection across all stack layers, and outlines a practical workflow that combines architecture awareness, application‑level metrics, and fault‑tree analysis to achieve automated root‑cause diagnosis.

Fault Diagnosis · Kubernetes · eBPF
0 likes · 21 min read
Aikesheng Open Source Community
Nov 15, 2023 · Databases

Understanding Redis Hotkeys: Issues, Detection Methods, and Mitigation Strategies

This article explains what Redis hotkeys are, the performance and replication problems they cause, various techniques for detecting them—including client statistics, MONITOR, the HOTKEYS command, and TCP packet capture—and practical mitigation approaches such as sharding, multi‑level caching, and monitoring optimization.

HotKey · monitoring · performance
0 likes · 9 min read
JD Retail Technology
Nov 8, 2023 · Operations

Technical Strategies for Ensuring System Stability During Large‑Scale Promotional Events

The article analyzes the importance of system stability during major sales promotions, presents data‑driven insights on traffic and revenue, identifies key challenges such as massive traffic, data volume, and complex workflows, and offers comprehensive operational, application, storage, and monitoring measures to guarantee reliable performance under extreme load.

Deployment · database · large‑scale promotion
0 likes · 13 min read
DataFunSummit
Nov 6, 2023 · Big Data

Building and Managing Huolala's User Event Tracking System: Architecture, Governance, and Monitoring

This article details Huolala's user event tracking (埋点, literally "buried points") system, covering its background, challenges, the construction of a four‑module management platform, backend SDK design, monitoring and quality assurance mechanisms, and future plans for service integration, data lineage, and governance optimization.

Data Governance · backend SDK · data pipeline
0 likes · 16 min read
Architect's Guide
Nov 6, 2023 · Operations

Comparison of Prometheus and Zabbix Monitoring Tools

This article compares the open‑source monitoring solutions Prometheus and Zabbix, outlining their histories, architectures, data collection methods, scalability, storage models, configuration complexity, community activity, and suitability for different environments such as traditional servers versus cloud‑native container platforms.

Cloud Native · Operations · Prometheus
0 likes · 8 min read
NetEase LeiHuo Testing Center
Nov 3, 2023 · Operations

Best Practices for Third‑Party Interface Collaboration: Concepts, Rate Limiting, Monitoring, and Incident Handling

The article outlines how game QA and third‑party providers can improve cooperation by aligning basic performance concepts such as TPS, QPS and concurrency, selecting appropriate rate‑limiting strategies, establishing precise monitoring and alerting, and preparing clear incident‑response and delivery standards.

Operations · Performance Testing · monitoring
0 likes · 15 min read
Data Thinking Notes
Nov 2, 2023 · Operations

How Bilibili Built a Scalable Data Quality Assurance System for Its Data Warehouse

This article details Bilibili's data quality assurance framework, covering its evolution across four data platform stages, the architecture of its quality data warehouse, core capabilities such as a complete assurance system, digital‑driven continuous optimization, and efficient incident handling, plus case studies, future plans, and a Q&A session.

Big Data · Bilibili · Data Platform
0 likes · 27 min read
Efficient Ops
Nov 2, 2023 · Operations

How ICBC’s SRE Team Built a Panoramic Monitoring System for Digital Ops Transformation

The Industrial and Commercial Bank of China software development center created an SRE panoramic monitoring view system that unifies data channels, standardizes metrics, offers multi‑dimensional dashboards, and introduces an intelligent Ops Assistant, dramatically improving fault detection, response speed, and cross‑team operational efficiency.

Digital Transformation · ICBC · Operations
0 likes · 6 min read
MaGe Linux Operations
Oct 30, 2023 · Operations

Boost DevOps with Docker: Automation, Monitoring, and Log Management

This article explains how Docker integrates with DevOps practices to enhance automation, streamline continuous integration and deployment, enable comprehensive container, application, and infrastructure monitoring, and centralize log collection and analysis, providing practical code examples for building, testing, deploying, and managing services efficiently.

DevOps · Log Management · automation
0 likes · 8 min read
MaGe Linux Operations
Oct 27, 2023 · Cloud Native

Deploy Grafana and Prometheus on Kubernetes in Minutes

This guide walks you through preparing a Kubernetes cluster, creating deployment manifests, configuring Grafana and Prometheus, and verifying the monitoring setup, including code snippets and step‑by‑step commands for a seamless installation on a lightweight cloud server.

Cloud Native · DevOps · Grafana
0 likes · 7 min read
NetEase Cloud Music Tech Team
Oct 27, 2023 · Databases

Corona Technical Series: Time-Series Databases in Corona

The article explains how Corona leverages three time‑series databases: InfluxDB for storing pre‑aggregated user metrics and platform health data, ClickHouse for real‑time multidimensional log analysis with aggregations, and Elasticsearch for full‑text searchable log monitoring. It details their schema designs and query examples.

Corona · Database Architecture · InfluxDB
0 likes · 19 min read
Su San Talks Tech
Oct 27, 2023 · Operations

What We Learned from Yuque’s October 23 Outage: A Detailed Incident Review

This article walks through Yuque’s October 23 service disruption, detailing each timeline milestone, analyzing the root causes, highlighting the importance of monitoring and data integrity checks, and offering concrete post‑mortem recommendations to improve future incident handling.

Cloud Services · disaster recovery · incident response
0 likes · 12 min read
DevOps
Oct 26, 2023 · Operations

Design and Implementation of SLA for Object Storage Services

This article explains how to design SLA metrics for object storage services, describes the S3 protocol, proposes availability calculations, outlines monitoring and alerting rules, and provides practical implementation examples using s3cmd, Python boto, and Java SDK to ensure reliable cloud storage operations.

SLA · monitoring · object storage
0 likes · 16 min read
HomeTech
Oct 25, 2023 · Operations

How Metrics‑Driven Development Supercharges a Used‑Car Platform

This article examines how a metrics‑driven development approach, combined with observability tools like Prometheus, helped a large online used‑car marketplace improve system insight, accelerate business processes, and deliver measurable performance and efficiency gains across both customer‑facing and dealer‑facing operations.

Data-Driven Engineering · Metrics-Driven Development · Software Operations
0 likes · 16 min read
Efficient Ops
Oct 24, 2023 · Operations

How to Monitor Business Metrics with Prometheus in Kubernetes

This article explains how to use Prometheus to monitor business‑level metrics in a Kubernetes environment, covering observability fundamentals, metric definitions, metric types, exposing metrics via a /metrics endpoint, and practical Go code examples for defining, recording, and scraping custom metrics.

Go · Kubernetes · Metrics
0 likes · 11 min read
Java High-Performance Architecture
Oct 22, 2023 · Backend Development

How DynamicTp Turns Java ThreadPoolExecutor into a Real‑Time, Configurable Powerhouse

This article introduces DynamicTp, a Java framework that extends ThreadPoolExecutor with dynamic configuration, real‑time monitoring, and alerting, enabling developers to adjust thread‑pool parameters on the fly, integrate with popular configuration centers, and achieve high‑availability and scalability in microservice environments.

Configuration Center · Dynamic Thread Pool · DynamicTp
0 likes · 12 min read
DevOps Cloud Academy
Oct 18, 2023 · Operations

Comprehensive Overview of DevOps Tools for 2024

This article provides a detailed overview of the most widely used DevOps tools across categories such as version control, CI/CD, container orchestration, configuration management, infrastructure as code, monitoring, collaboration, artifact repositories, testing, security, deployment automation, serverless, and database management, helping practitioners choose the right solutions for their pipelines.

Collaboration · DevOps · Infrastructure as Code
0 likes · 7 min read
Efficient Ops
Oct 15, 2023 · Databases

How to Diagnose and Fix Slow Redis Responses: A Step-by-Step Guide

This article walks through practical methods for troubleshooting slow service alerts, diagnosing Redis performance bottlenecks, and reproducing issues with local demos and load simulations, offering concrete metrics, command‑line checks, and mitigation strategies such as scaling, rate‑limiting, and pipeline optimization.

Operations · monitoring · performance
0 likes · 22 min read
JD Tech
Oct 13, 2023 · Operations

Implementing a Real-Time Pre-Alert Monitoring System to Improve Fund Trading System Stability

This article presents a practical pre‑alert monitoring solution for a high‑volume fund trading system, detailing how simple time‑based key‑point checks and targeted alerts reduce instant and end‑of‑day alarms, improve issue detection within 15 minutes, and enhance overall system stability and reconciliation efficiency.

fund‑trading · monitoring · pre‑alert
0 likes · 11 min read
JD Tech
Oct 11, 2023 · Fundamentals

Key Considerations for Building System Engineering Architecture: Design, Technology Selection, and Consensus

This article comprehensively discusses the essential aspects of constructing a system engineering architecture, emphasizing value‑first decision making, layered and DDD architectural patterns, technology selection criteria, exception handling, logging, monitoring, and the importance of establishing shared consensus among teams.

DDD · Exception Handling · Software Architecture
0 likes · 26 min read
Liangxu Linux
Oct 10, 2023 · Operations

Master Kibana: Install, Configure, and Visualize Elasticsearch Data Step‑by‑Step

This guide walks you through installing Kibana, configuring its connection to Elasticsearch, creating index patterns, using Discover for searches, mastering Lucene‑based query syntax, building visualizations, assembling dashboards, and monitoring logs, all illustrated with clear screenshots and code examples.

Dashboard · Data visualization · Elasticsearch
0 likes · 14 min read
Alibaba Cloud Native
Oct 10, 2023 · Operations

Mastering Memcached: Features, Use Cases, and Prometheus Monitoring

This article explains Memcached’s architecture, key characteristics, suitable and unsuitable scenarios, memory management and LRU mechanisms, version details, and provides a comprehensive guide to monitoring its performance and health using Prometheus and Alibaba Cloud ARMS dashboards.

Cloud Native · Memcached · Operations
0 likes · 26 min read
JD Tech
Oct 10, 2023 · Operations

Technical Case Study of JDV Visual Dashboard Platform for the 618 Promotion

This article details how JDV, JD.com’s internal visual dashboard platform, tackled the massive data‑intensive 618 promotion by implementing real‑time updates, cross‑midnight count stops, request‑state control, heartbeat monitoring, proxy data sources, and a suite of developer tools to ensure stability, performance, and rapid feature delivery.

Data Platform · Real-Time · large scale
0 likes · 18 min read