Tagged articles
2179 articles
Page 8 of 22
DataFunTalk
Oct 8, 2023 · Big Data

Full-Process DataOps Practices for Large-Scale Business Data Reporting at Baidu

This article reveals how Baidu implements end‑to‑end DataOps for its commercial data products, covering challenges of massive report generation, the design of a layered data architecture, platform‑wide automation, serverless deployment, risk control, monitoring, and optimization to achieve scalable, reliable data pipelines.

Big Data · DataOps · Serverless
0 likes · 13 min read
Efficient Ops
Sep 26, 2023 · Operations

Mastering Zabbix: From Installation to Advanced Monitoring and Automation

This comprehensive guide walks you through Zabbix monitoring concepts, reliability calculations, installation methods, web UI configuration, host and template management, custom monitoring, alert integration with OneAlert, Grafana visualization, distributed monitoring, SNMP support, and practical scripts for large‑scale server environments.

Alerting · Grafana · Ops
0 likes · 28 min read
Selected Java Interview Questions
Sep 24, 2023 · Operations

Comparison of Six Open-Source Log Management Tools

This article reviews six open‑source log management solutions—OpenObserve, Grafana Loki, SigNoz, Graylog, Syslog‑ng, and Highlight.io—detailing their features, advantages, and drawbacks to help engineers select the most suitable tool for observability, monitoring, and cost‑effective log handling.

Log Management · Tool comparison · monitoring
0 likes · 15 min read
Alibaba Cloud Native
Sep 24, 2023 · Cloud Computing

Designing Highly Available Cloud‑Native Applications on Alibaba Cloud ACK

This article explains how to build robust, highly available cloud‑native applications on Alibaba Cloud Container Service for Kubernetes (ACK) by covering architecture principles, multi‑zone cluster design, Kubernetes HA features such as topology spread constraints and pod anti‑affinity, storage strategies, load‑balancing, virtual nodes, health probes, monitoring, and multi‑cluster deployment patterns.

ACK · Cloud Native · Kubernetes
0 likes · 35 min read
DevOps Coach
Sep 21, 2023 · Operations

What Is Observability (o11y) and Why It Matters for Modern Cloud‑Native Operations

The article explains the origins, common misconceptions, and a rigorous definition of observability (o11y), highlights its importance in cloud‑native environments, and describes how high‑cardinality, high‑dimensional telemetry enables effective debugging, troubleshooting, and performance analysis of modern distributed systems.

cloud-native · debugging · monitoring
0 likes · 11 min read
Architect
Sep 19, 2023 · Big Data

How Tianyan Beats ELK: Inside a High‑Performance Distributed Log Service

This article analyzes the challenges of logging in distributed services, compares the traditional ELK stack with Baidu's Tianyan platform, and details Tianyan's architecture, data collection, high‑throughput transmission, storage, retrieval, resource isolation, dynamic cleanup, and best‑practice recommendations, complete with code examples and performance insights.

Big Data · Distributed Systems · ELK
0 likes · 30 min read
Zhuanzhuan Tech
Sep 19, 2023 · Operations

Design and Implementation of an Integrated Monitoring System at Zhuanzhuan Using Prometheus, Grafana, and M3DB

This article describes how Zhuanzhuan unified dozens of legacy monitoring tools into a single, all‑in‑one observability platform by adopting Prometheus + Grafana, extending the Prometheus client to push metrics to M3DB, automating Grafana dashboard creation, and building a custom alerting service to reduce operational complexity and improve visibility across business, middleware, and infrastructure services.

Alerting · Grafana · M3DB
0 likes · 21 min read
DaTaobao Tech
Sep 18, 2023 · Databases

Comprehensive Approach to Slow SQL Detection and Governance

The Taobao platform’s slow‑SQL governance team implemented a comprehensive detection and governance pipeline. It combines internal slow‑log tools, database slow‑query logs, and JVM‑Sandbox instrumentation to capture full SQL details, scores high‑risk queries by execution time, scans, and standards violations, and prioritizes remediation through health scores, branch‑diff checks, and issue tracking. The result was a significant reduction in DB‑related incidents and improved system stability.

database · governance · jvm-sandbox
0 likes · 12 min read
Huolala Tech
Sep 14, 2023 · Operations

Designing an Effective UI for Monitoring Alerts: Insights from Huolala

This article shares Huolala's experience designing a unified monitoring platform UI, covering the evolution from open‑source dashboards to a fully self‑developed solution, simplification of PromQL, computed metrics, log and trace integration, and the challenges of alert configuration and visualization.

Alerting · Operations · Prometheus
0 likes · 16 min read
IT Services Circle
Sep 14, 2023 · Backend Development

Key Techniques for Designing High‑Concurrency Systems

This article outlines essential architectural and operational strategies for building robust, high‑concurrency backend systems: static page generation, CDN acceleration, caching layers, asynchronous processing, thread‑pool and MQ integration, sharding, connection pooling, read/write splitting, indexing, batch processing, clustering, load balancing, rate limiting, service degradation, failover, multi‑active deployment, stress testing, and monitoring.

Backend Architecture · caching · high concurrency
0 likes · 23 min read
MaGe Linux Operations
Sep 13, 2023 · Cloud Native

Mastering Prometheus Metrics: Counters, Gauges, Histograms & Summaries Explained

This article introduces the fundamentals of metrics in IT monitoring, explains the structure of metric data points, explores dimensional metrics, and provides an in‑depth guide to Prometheus metric types—Counters, Gauges, Histograms, and Summaries—along with practical code examples and usage considerations in cloud‑native environments.

Metrics · Prometheus · monitoring
0 likes · 19 min read
JD Cloud Developers
Sep 13, 2023 · Operations

Stability Engineering Explained: From Entropy Theory to Practical SRE

The article explores why building system stability is crucial by linking entropy theory to software reliability, introduces the availability formula, discusses common pitfalls and industry practices, and proposes a three‑stage governance framework—prevention, mitigation, and post‑mortem—to systematically improve operational resilience.

Availability · Operations · Reliability
0 likes · 13 min read
Efficient Ops
Sep 12, 2023 · Operations

Understanding Prometheus Metric Types: Counters, Gauges, Histograms & Summaries

This article explains how metrics are used to monitor software performance, introduces basic metric components and dimensional metrics, compares Prometheus, OpenMetrics and OpenTelemetry standards, and provides detailed guidance on Prometheus metric types—Counter, Gauge, Histogram, and Summary—with code examples and query patterns.

Metrics · Prometheus · Python
0 likes · 18 min read
Didi Tech
Sep 12, 2023 · Operations

Observability: Concepts, Challenges, and Didi’s Implementation

The article explains observability as the ability to infer any system state from external data, contrasts it with traditional monitoring, outlines challenges of high‑dimensional, high‑cardinality data and storage costs, and describes Didi’s hybrid MTL architecture that separates low‑ and high‑cardinality logs and metrics while linking them via TraceIDs to provide detailed, cost‑effective insight and streamlined debugging.

Didi · Microservices · logging
0 likes · 9 min read
Architect
Sep 7, 2023 · Cloud Native

How Vivo Scaled Container Monitoring with Prometheus, Kafka, and VictoriaMetrics

This article details how Vivo's container platform faced exploding metric volumes, component overload, data gaps, and storage spikes, and explains the step‑by‑step architectural redesign, metric governance, performance tuning, cAdvisor redeployment, and VictoriaMetrics upgrade that restored high‑availability, low‑latency monitoring across a large Kubernetes fleet.

Cloud Native · Kubernetes · Prometheus
0 likes · 18 min read
MaGe Linux Operations
Sep 2, 2023 · Operations

Top 5 Linux Monitoring Tools Every Ops Engineer Should Use

This article introduces five essential Linux monitoring tools, including iotop, htop, IPTraf, and Monit, along with related resources. It explains how each tool helps operations engineers diagnose I/O, CPU, memory, and network issues in real time without a GUI, and offers guidance on installation and practical use cases.

IPTraf · Linux · Monit
0 likes · 6 min read
dbaplus Community
Aug 30, 2023 · Operations

How Weibo Scales to Hundreds of Millions: Building a Resilient Hybrid‑Cloud Architecture

This article outlines Weibo's massive user‑scale challenges and presents a comprehensive high‑availability solution that combines capacity planning, distributed caching, micro‑service isolation, cross‑language RPC, service‑mesh governance, multi‑datacenter disaster recovery, containerization, and hybrid‑cloud scaling to ensure reliable service delivery.

Microservices · Service Mesh · hybrid cloud
0 likes · 15 min read
High Availability Architecture
Aug 30, 2023 · Backend Development

Diagnosing and Optimizing JVM Memory Issues in a Core Service

This article details the identification, analysis, and resolution of JVM memory problems in a core music metadata service, covering GC tuning, large‑object handling, fault‑tolerance strategies, custom Dubbo codec monitoring, and non‑intrusive memory object tracking to improve performance and stability.

Dubbo · JVM · Memory Optimization
0 likes · 14 min read
DeWu Technology
Aug 28, 2023 · Operations

Real-time Data Warehouse Business-Side Chaos Engineering Practice

The article describes how a real‑time data warehouse supporting ad‑delivery metrics adopts both technical and business‑side chaos‑engineering, using red‑blue team drills to inject faults, monitor indicator anomalies, and refine response procedures, thereby enhancing early risk detection, system resilience, and overall data stability for the advertising platform.

Data Quality · Data Warehousing · Ops
0 likes · 16 min read
JD Retail Technology
Aug 24, 2023 · Operations

High‑Availability Strategies for E‑commerce Large‑Scale Promotion Systems

This article outlines a comprehensive framework for preparing e‑commerce platforms for major sales events, covering the history of promotions, business models, system chain segmentation, stability goals, strategic planning, tactical measures, growth promotion, and reference resources to ensure high availability and reliable user experience.

e‑commerce · high availability · large‑scale promotion
0 likes · 19 min read
Sohu Tech Products
Aug 23, 2023 · Operations

Implementing Global Pulsar Client Monitoring with a SkyWalking Plugin

To give the business team a global, application‑level view of Pulsar performance, the team built a SkyWalking Java‑Agent plugin that automatically collects producer and consumer metrics from the Pulsar client, exposing latency, backlog and failure counts via Prometheus without modifying the client code.

Metrics · Prometheus · Pulsar
0 likes · 7 min read
Efficient Ops
Aug 23, 2023 · Operations

How to Diagnose High Load with Low CPU on Linux: Tools & Tips

This guide explains how to analyze Linux load situations—whether CPU and load are both high or CPU is low while load remains high—by using commands like top, vmstat, iostat, sar, and jstack, and provides practical troubleshooting steps for common I/O‑related issues.

CPU · Load · Operations
0 likes · 11 min read
dbaplus Community
Aug 22, 2023 · Operations

Designing a Multi‑Cloud Intelligent Monitoring Platform at Huolala: Architecture, Practices, and Future Directions

This article details Huolala's one‑stop monitoring platform called Monitor, covering its multi‑cloud architecture, data collection pipelines, real‑time business monitoring, unified alarm handling, and future AI‑driven enhancements, while sharing concrete metrics, incident case studies, and practical implementation steps for large‑scale observability.

GPT · Operations · cloud-native
0 likes · 19 min read
Efficient Ops
Aug 22, 2023 · Operations

Persisting Prometheus Alertmanager Alerts with Alertsnitch, MySQL, and Grafana

This article explains how Prometheus stores alerts only as time‑series data, why that limits historical queries, and provides a complete open‑source solution using Alertmanager, Alertsnitch, MySQL, and Grafana to persist, query, and visualize alerts in production environments.

Alert Persistence · Alertmanager · Grafana
0 likes · 10 min read
Huolala Tech
Aug 18, 2023 · Operations

Beyond System Metrics: Building Effective Business Monitoring for Pricing Services

The article explains why, given unpredictable software behavior, traditional system‑level monitoring often misses critical business issues, especially in complex pricing services, and presents a comprehensive approach that combines result (black‑box) and process (white‑box) monitoring, practical metrics, and actionable recommendations to improve observability and reduce operational risk.

Operations · business metrics · monitoring
0 likes · 14 min read
dbaplus Community
Aug 14, 2023 · Operations

Designing Business‑Focused Monitoring for Banking Systems: Metrics, Alerts, and Implementation Challenges

The article outlines a practical framework for business‑level monitoring in banking systems, describing three evolution stages, key metrics such as transaction success rates and volume spikes, concrete alert rules, and the technical challenges of data collection, standardization, and massive parameter management.

Alerting · Metrics · Operations
0 likes · 14 min read
DeWu Technology
Aug 14, 2023 · Operations

Capital Loss Prevention Practices and Technical System

Dewu’s capital‑loss prevention framework embeds risk assessment and technical safeguards—such as idempotency, distributed consistency, and active‑active multi‑region design—into architecture, organizes three defensive lines (development, QA, SRE), and employs real‑time, near‑real‑time, and offline verification plus regular drills, while advancing automated analysis and intelligent scaling.

Data Consistency · SRE · financial loss prevention
0 likes · 10 min read
Full-Stack DevOps & Kubernetes
Aug 10, 2023 · Operations

How Kubernetes Powers Modern DevOps Automation and Operations

By integrating Kubernetes with DevOps practices, teams can automate deployment pipelines, achieve dynamic resource allocation, centralize monitoring with tools like Prometheus and Grafana, and treat infrastructure as code, resulting in faster, higher-quality software delivery and improved collaboration between development and operations.

DevOps · Infrastructure as Code · Kubernetes
0 likes · 7 min read
Ctrip Technology
Aug 3, 2023 · Operations

Intelligent Anomaly Detection for Ctrip Operations: LSTM Forecasting, Trend Analysis, Adaptive Thresholds, and Periodic Anomaly Filtering

The article describes Ctrip's AIOps approach to improving alert quality by combining statistical methods and machine‑learning models such as LSTM, trend analysis, adaptive threshold calculation, and dynamic‑time‑warping based periodic anomaly detection, achieving significant gains in precision and fault‑recall rates.

LSTM · Time Series · adaptive threshold
0 likes · 12 min read
HelloTech
Aug 1, 2023 · Cloud Native

Elastic Scaling Practices in Cloud‑Native Kubernetes Environments

To overcome native HPA limits and business‑specific constraints in a fully containerized, cloud‑native Kubernetes environment, we implemented a dual‑threshold water‑level and scheduled scaling engine, hybrid‑cloud ClusterAutoScale, mixed‑deployment resource prioritization, and comprehensive Prometheus‑based observability, achieving higher utilization, lower costs, and a roadmap toward deeper optimization and AIOps.

Auto Scaling · Cloud Native · Kubernetes
0 likes · 10 min read
Spring Full-Stack Practical Cases
Jul 27, 2023 · Backend Development

How to Set Up and Secure Spring Boot Admin Server & Client with Dynamic Logging

This guide walks through setting up a Spring Boot Admin server and client, adding security, configuring logging, displaying client IPs, and dynamically adjusting log levels via the SBA UI, providing complete Maven dependencies, Java configuration classes, and YAML settings for a secure, observable Spring Boot ecosystem.

Spring Boot · java · logging
0 likes · 9 min read
Open Source Linux
Jul 27, 2023 · Operations

17 Essential Linux Ops Tricks to Boost Your Productivity

This article compiles seventeen practical Linux administration techniques—from batch file handling and directory checks to log analysis, disk monitoring, firewall rules, and network capture—each illustrated with ready‑to‑run shell commands and concise explanations for sysadmins.

Ops · Shell · Sysadmin
0 likes · 8 min read
Tech Architecture Stories
Jul 23, 2023 · Backend Development

Beyond Scale: Rethinking Architecture Boundaries for Massive Services

This article reflects on years of designing large‑scale backend systems at Tencent, discussing how to define clear architecture boundaries, ensure high availability, integrate diverse technologies, and use observability and monitoring to continuously evolve and improve massive service architectures.

Distributed Systems · System Design · architecture
0 likes · 25 min read
Liangxu Linux
Jul 22, 2023 · Operations

17 Essential Linux Sysadmin Commands to Boost Productivity

This article compiles 17 practical Linux operation tricks—from file searching and batch extraction to disk monitoring, log analysis, and firewall scripting—providing sysadmins with ready-to-use command snippets that can streamline daily tasks and potentially earn a raise.

Bash · Scripting · Sysadmin
0 likes · 8 min read
Test Development Learning Exchange
Jul 18, 2023 · Operations

Common System Performance Issues and Their Solutions

This article enumerates typical system performance problems, such as slow response time, insufficient throughput, resource bottlenecks, database slowness, memory leaks, platform differences, network latency, security overhead, and scalability limits, and provides practical optimization and mitigation strategies for each.

Scalability · Systems · monitoring
0 likes · 7 min read
Test Development Learning Exchange
Jul 17, 2023 · Operations

Comprehensive Guide to Performance Testing Parameters, Metrics, and Tool Selection

This article explains key performance testing parameters such as concurrent users, TPS, response time, virtual users, and data volume, outlines essential monitoring metrics, details preparation steps and simple API testing procedures, and compares popular load‑testing tools like JMeter, Locust, and LoadRunner.

Performance Testing · Response Time · monitoring
0 likes · 12 min read
21CTO
Jul 17, 2023 · Big Data

How WeChat Cut Query Latency from Seconds to 100 ms with Druid Optimizations

This case study explains how the WeChat multi‑dimensional monitoring platform identified performance bottlenecks in its Druid‑based data layer, analyzed user query patterns, and applied sub‑query splitting, Redis caching, and segment size reductions to achieve over 85% cache‑hit rates and bring average query latency down to around 100 ms.

Big Data · Druid · caching
0 likes · 13 min read
Liangxu Linux
Jul 16, 2023 · Operations

Essential Ops Checklist: Prevent Data Loss, Secure Servers, and Optimize Performance

This article compiles practical operations guidelines covering safe testing, rigorous confirmation before commands, limiting multi‑person access, mandatory backups, careful use of destructive commands, SSH hardening, firewall rules, fine‑grained permissions, continuous monitoring, performance tuning steps, and a disciplined mindset to avoid costly incidents.

Backup · monitoring · performance tuning
0 likes · 10 min read
Qunar Tech Salon
Jul 12, 2023 · Operations

Design and Implementation of Qunar's Root Cause Analysis System for Microservice Fault Diagnosis

This article describes Qunar's comprehensive root cause analysis platform, detailing its background, data-driven fault categorization, and architecture (including trace, runtime, middleware, and event analysis modules), and demonstrates its high accuracy and practical impact on reducing incident resolution times across microservices.

DevOps · Microservices · Operations
0 likes · 20 min read
Didi Tech
Jul 11, 2023 · Operations

DevOps Practices and Challenges at Didi Ride‑Hailing: From Development to Operations

Didi’s ride‑hailing R&D team addresses efficiency and stability challenges of a large micro‑service ecosystem by unifying a Go stack, common framework, and data models, using eBPF traffic recording for automated regression testing, and applying AIOps alert filtering, knowledge‑graph root‑cause analysis, and a localization robot for rapid fault recovery, while targeting full CI/CD automation with static analysis, service‑mesh observability, and chaos engineering.

CloudNative · Microservices · aiops
0 likes · 22 min read
JD Retail Technology
Jul 11, 2023 · Operations

Technical Strategies for Ensuring System Stability During the 618 Promotion

The article analyzes the importance of the 618 sales event, identifies factors that threaten system stability such as traffic spikes, massive data, complex scenarios, long delivery chains and low tolerance, and proposes comprehensive application, storage, and operational measures—including unitization, monitoring, logging, fast‑fail, rate‑limiting, degradation, database and cache designs, and emergency processes—to guarantee reliable service during the promotion.

Scalability · high availability · large‑scale promotion
0 likes · 14 min read
Beijing SF i-TECH City Technology Team
Jul 10, 2023 · Mobile Development

Mobile Application Quality System – Standard Operating Procedure (SOP)

This document outlines a comprehensive Standard Operating Procedure for building and maintaining a mobile application quality system, covering background, pre‑emptive planning, coding standards, branch management, code review, AI‑assisted tools, monitoring, issue handling, and continuous improvement to ensure stable, high‑quality mobile products.

AI tools · Mobile · SOP
0 likes · 27 min read
DataFunTalk
Jul 9, 2023 · Operations

Building High‑Performance Observability Data Pipelines with Vector and Honghu

This article explains the concepts and importance of observability, introduces the Vector data‑pipeline tool and its architecture, demonstrates how to configure sources, transforms and sinks, and shows how to integrate Vector with the Honghu platform to build a complete, real‑time monitoring solution for modern distributed systems.

Big Data · Honghu · Vector
0 likes · 33 min read
Liangxu Linux
Jul 9, 2023 · Backend Development

From Monolith to Microservices: A Practical Evolution Blueprint

This article walks through the step‑by‑step transformation of a simple online supermarket from a single‑node monolith to a fully fledged microservice architecture, highlighting the motivations, common pitfalls, component choices, monitoring, tracing, logging, resilience patterns, testing strategies, and the trade‑offs of frameworks versus service mesh.

Backend Architecture · Distributed Tracing · Microservices
0 likes · 24 min read
DevOps Cloud Academy
Jul 9, 2023 · Cloud Native

Designing Scalable Kubernetes Applications: Best Practices

This article outlines comprehensive best‑practice guidelines for building Kubernetes applications, covering scalability design, containerization, pod scope, configuration management, health probes, deployments, service discovery, storage, monitoring, security, and CI/CD integration to achieve robust, highly available workloads.

ConfigMap · Kubernetes · ci/cd
0 likes · 9 min read
Open Source Linux
Jul 4, 2023 · Operations

Master Redis Monitoring, Migration, and Cluster Management with Prometheus and CacheCloud

This guide walks through essential Redis operations, covering real‑time monitoring with the INFO command and Prometheus‑compatible exporters, data migration using Redis‑shake, consistency verification via Redis‑full‑check, and comprehensive cluster management with CacheCloud, providing practical tools for reliable Redis administration.

Data Migration · Operations · Prometheus
0 likes · 11 min read
ITPUB
Jun 30, 2023 · Operations

How Tencent Search Supercharged Reliability: Inside Its Stability Governance Playbook

This article details Tencent Search’s end‑to‑end stability engineering framework, covering a layered reliability architecture, disaster‑recovery mechanisms, fast detection and monitoring, emergency response acceleration, pre‑release interception, automated defense, and collaborative governance that together improve MTTD and MTTR by an order of magnitude.

Reliability · automation · disaster-recovery
0 likes · 30 min read
Open Source Linux
Jun 30, 2023 · Cloud Native

Essential Kubernetes Tools to Boost Your DevOps Workflow

This article reviews a curated set of open‑source Kubernetes tools—including Helm, Flagger, Kubewatch, Gitkube, kube‑state‑metrics, Kamus, Untrak, Scope, Dashboard, Kops, cAdvisor, Kubespray, K9s, Kubetail, PowerfulSeal, and Popeye—that enhance management, security, monitoring, and deployment within DevOps pipelines.

cloud-native · monitoring · open-source
0 likes · 11 min read
dbaplus Community
Jun 28, 2023 · Operations

Identify and Fix System Performance Bottlenecks: Key Metrics and Optimization

The article outlines common system performance bottlenecks such as CPU, memory, disk I/O, network, exceptions, and databases, explains how to measure response time, TPS, and resource utilization, and provides a step‑by‑step bottom‑up and top‑down approach for testing, diagnosing, and optimizing Java‑based services.

bottleneck · monitoring · optimization
0 likes · 11 min read
Top Architect
Jun 27, 2023 · Databases

Redis Performance Degradation: Root Causes and Optimization Techniques

This article explains how to benchmark Redis latency, identify common reasons for slowdowns such as high‑complexity commands, big keys, concentrated expirations, memory limits, fork overhead, swap usage, and CPU binding, and provides detailed configuration and operational steps to monitor and resolve each issue.

AOF · Latency · Memory
0 likes · 34 min read
Programmer DD
Jun 26, 2023 · Operations

What’s New in Grafana 10? Explore Correlations, Scenes, and Powerful New Panels

Grafana 10 introduces a suite of enhancements—including Correlations for cross‑data‑source linking, the Scenes front‑end library for building stunning dashboards, new Canvas, Trends, and Datagrid panels, CSV drag‑and‑drop support, sub‑folder organization, and improved data‑source selection—aimed at boosting analysis, collaboration, and efficiency for monitoring teams.

Dashboard · Grafana · New Features
0 likes · 7 min read
Architect
Jun 23, 2023 · Big Data

Optimizing Query Performance in WeChat's Multi‑Dimensional Monitoring Platform

This article details how the WeChat multi‑dimensional monitoring platform reduced average query latency from over 1000 ms to around 100 ms by analyzing user query patterns, redesigning the Druid data layer, splitting sub‑queries, introducing Redis caching, and employing sub‑dimension tables, achieving cache hit rates above 85%.

Druid · WeChat · monitoring
0 likes · 13 min read
MaGe Linux Operations
Jun 22, 2023 · Cloud Native

Essential Open‑Source Kubernetes Tools to Supercharge Your DevOps

This article surveys a curated collection of open‑source Kubernetes utilities—including Helm, Flagger, Kubewatch, Gitkube, kube‑state‑metrics, Kamus, Untrak, Scope, Dashboard, Kops, cAdvisor, Kubespray, K9s, Kubetail, PowerfulSeal and Popeye—detailing their roles in deployment, monitoring, security, and cluster management for modern DevOps workflows.

Kubernetes · Tooling · monitoring
0 likes · 15 min read
Open Source Linux
Jun 21, 2023 · Cloud Native

From Monolith to Microservices: A Real‑World Journey and Lessons Learned

An online supermarket startup evolves its simple monolithic website into a fully distributed microservice architecture. The article details each transformation stage, the challenges encountered (code duplication, database bottlenecks, deployment complexity), and solutions such as service decomposition, monitoring, tracing, circuit breaking, and service mesh.

Microservices · Service Mesh · circuit breaker
0 likes · 23 min read
Baidu Geek Talk
Jun 19, 2023 · Operations

How Baidu’s Tianyan Log Service Overcomes ELK’s Scaling and Performance Limits

This article examines the challenges of logging in distributed services, compares the traditional ELK stack with Baidu's Tianyan solution, details Tianyan's architecture (including Ingest, Store, Consumer, Elastic Agent, Fleet, APM, Beats, and Disruptor‑based high‑throughput pipelines), and covers resource isolation, dynamic cleanup, and best‑practice recommendations for building a scalable, low‑latency log platform.

Distributed Systems · Elastic Stack · Log Management
0 likes · 26 min read
vivo Internet Technology
Jun 14, 2023 · Backend Development

Stability Practices for Vivo Account System: Service Governance, Data Architecture, and Monitoring

Vivo’s account platform, serving 270 million users and over 100 billion daily requests, achieves high‑performance stability through disciplined service splitting, hierarchical dependency control, layered caching and sharding strategies, and comprehensive multi‑layer monitoring that together ensure scalability, availability, and rapid fault diagnosis.

Backend · caching · database
0 likes · 24 min read
JD Cloud Developers
Jun 14, 2023 · Operations

How to Ensure System Stability During Mega Sales Events like 618

This article examines the technical and operational challenges of the 618 shopping festival, presenting data‑driven insights and detailed strategies—including modular deployment, monitoring, logging, fast‑failure, rate limiting, database and cache optimizations, and emergency response plans—to help teams maintain system stability under massive traffic spikes.

Operations · Scalability · large‑scale promotion
0 likes · 13 min read
NetEase Cloud Music Tech Team
Jun 12, 2023 · Frontend Development

Design and Architecture of Corona: NetEase Cloud Music Multi‑Platform Front‑End Monitoring System

Corona is NetEase Cloud Music’s unified, cross‑platform front‑end monitoring system that ingests logs from Web, React Native, Node.js, Android, iOS, Flutter and Windows CEF, enriches them, routes them through real‑time anomaly and performance pipelines, stores them in HBase, and offers customizable alerts, de‑obfuscation, AI‑assisted analysis, and extensible reporting to ensure rapid fault detection and remediation across the organization.

architecture · frontend · logging
0 likes · 17 min read
DevOps Operations Practice
Jun 11, 2023 · Operations

Practical Linux Administration Tools for System Monitoring and Management

This article presents a curated list of useful Linux command‑line tools—including Nethogs, IOZone, IOTop, IPtraf, IFTop, HTop, NMON, MultiTail, Fail2ban, Tmux, Agedu, NMap and Httperf—along with installation commands and brief usage notes to help system administrators monitor performance, security and resources effectively.

Linux · monitoring · tools
0 likes · 12 min read
Tencent Cloud Developer
Jun 8, 2023 · Operations

Stability Governance in Tencent Search: Architecture, Incident Management, and Automation

The article outlines Tencent Search’s stability governance, detailing a multi‑layered availability architecture, disaster‑recovery mechanisms, precise monitoring, rapid emergency workflows, pre‑release interception, extensive automation, and a collaborative governance model that together enhance system resilience, incident detection, and swift remediation.

availability architecture · incident response · monitoring
0 likes · 28 min read
Efficient Ops
Jun 7, 2023 · Artificial Intelligence

How Guangdong Mobile Scaled AIOps: From Manual Ops to Intelligent Automation

This article details the evolution of Guangdong Mobile's IT systems and operations, explains its four‑domain architecture, chronicles the AIOps adoption timeline, showcases intelligent anomaly detection, change assessment, fault diagnosis, and operation robots, and shares practical promotion methods and a future outlook for AI‑driven IT operations.

Artificial Intelligence · Fault Diagnosis · IT Operations
0 likes · 19 min read
JD Tech
Jun 7, 2023 · Operations

Practical Guide to Achieving High Availability in Software Delivery

This article explains the concept of high availability, outlines the challenges of collaborative delivery, architectural design, coding practices, secure release, and deployment operations, and provides concrete steps, process standards, emergency plans, and self‑check tools to ensure reliable, fault‑tolerant software systems.

Collaboration · Deployment · architecture
0 likes · 13 min read
Tencent Cloud Developer
May 31, 2023 · Big Data

Performance Optimization of WeChat's Multi‑Dimensional Monitoring Platform

Having observed that most queries were time‑series queries for data older than a day, the WeChat monitoring team split large Druid queries into per‑day/hour sub‑queries and introduced a multi‑granularity Redis cache and sub‑dimension tables. These changes raised cache hit rates above 85 % and cut average latency from over 1000 ms to about 140 ms, while reducing Druid load to roughly 10 % of its original volume.

Druid · WeChat · caching
0 likes · 13 min read
Architecture Digest
May 27, 2023 · Databases

Comprehensive MySQL Monitoring Using Built‑in SHOW Commands

This article explains how to collect extensive MySQL performance metrics—including connections, buffer pool statistics, lock information, SQL status, statement counts, throughput, server configuration, and slow‑query analysis—using only MySQL's native SHOW commands, providing practical commands, calculations, and optimization tips for effective database monitoring.

Performance Schema · monitoring · slow-query
0 likes · 10 min read
NetEase Media Technology Team
May 23, 2023 · Cloud Native

How NetEase Media Scaled Flink with Kubernetes: Architecture, Optimizations, and Lessons Learned

This article details NetEase Media's migration of most Flink jobs to a self‑built real‑time platform on Kubernetes, covering the benefits of K8s isolation, the chosen native deployment mode, performance‑critical optimizations, monitoring, resource‑recommendation, and future directions for cloud‑native streaming workloads.

Cloud Native · Flink · Kubernetes
0 likes · 20 min read
DataFunSummit
May 18, 2023 · Databases

Building Graph Applications with TuGraph: Scenarios, Deployment, Modeling, Data Import, Development, Monitoring, and Integration

This guide walks through using the TuGraph graph database to design and deploy graph applications, covering real‑world scenarios, database selection, built‑in datasets, Docker/CentOS/Ubuntu deployment, model design, data import, debugging, operational monitoring, and integration with services or direct RESTful APIs.

Deployment · Graph Database · RESTful API
0 likes · 11 min read
Architect
May 16, 2023 · Operations

Stability Engineering Practices for the DuoliXiong Local Service Platform

This article outlines the stability engineering approach for Baidu's DuoliXiong local service platform, detailing business challenges, architectural design, development standards, code review, deployment processes, monitoring, and consistency solutions, and presents practical implementations such as automated scaling, fault tolerance, and final consistency mechanisms.

Microservices · monitoring · stability engineering
0 likes · 13 min read
Efficient Ops
May 15, 2023 · Operations

Master Linux Performance Troubleshooting in the First 60 Seconds

This article shows how Netflix's performance engineering team uses ten essential Linux commands—such as uptime, vmstat, mpstat, iostat, and top—to quickly assess system load, resource saturation, and errors within the first minute of investigation, following the USE method.

Command-line · monitoring · performance
0 likes · 18 min read
DeWu Technology
May 15, 2023 · Frontend Development

Design and Implementation of a Front-End Inspection Platform for Performance and Stability

The team built a Node‑based front‑end inspection platform that boosted page‑stability testing speed from 0.4 to 4 pages per second, achieved high usage, execution and alarm accuracy, solved process‑exit issues by switching to fork, integrated six inspection components, cut regression testing of 100 pages from 60 to 10 minutes, and lowered CPU use to 20 % per pod, with plans to broaden coverage.

Scalability · automation · frontend
0 likes · 9 min read
Efficient Ops
May 8, 2023 · Operations

How Intelligent Ops Transforms Container Cloud Management at Scale

This article summarizes a speaker’s insights from GOPS 2023 on the challenges of large‑scale container cloud operations and presents a comprehensive intelligent‑ops framework—including health scoring, automated pod anomaly detection, smart scaling, and multi‑center disaster recovery—to improve visibility, efficiency, and reliability in Kubernetes environments.

CloudNative · IntelligentOps · Kubernetes
0 likes · 18 min read
Architecture Breakthrough
May 8, 2023 · Backend Development

Designing a Robust Batch Processing Module: Key Architecture Insights

This article outlines the essential architectural considerations for building a production‑ready batch processing module, covering design principles, task scheduling, parallelism, error handling, resource management, data‑layer concerns, deployment strategies, and monitoring practices.

Backend Architecture · Batch Processing · Scalability
0 likes · 10 min read
MaGe Linux Operations
May 7, 2023 · Operations

Beyond top: Powerful Linux CLI Monitoring Tools You Should Try

This guide introduces several interactive command‑line utilities—htop, atop, nmon, vtop, bashtop, gtop and glances—explaining their features, key shortcuts, and installation steps so Linux administrators can monitor CPU, memory, disk and network usage more effectively than with the classic top command.

Bashtop · CLI · Glances
0 likes · 9 min read
Liangxu Linux
May 7, 2023 · Operations

Essential Linux Ops Tools: Monitoring, Performance, and Security Utilities

This guide introduces a collection of practical Linux operation tools—including Nethogs, IOZone, IOTop, IPtraf, iftop, HTop, NMON, MultiTail, Fail2ban, Tmux, Agedu, NMap, and Httperf—detailing their purpose, installation commands, usage examples, and key options for system administrators.

Linux · monitoring · performance
0 likes · 12 min read