Tagged articles
2179 articles
Alibaba Cloud Developer
Oct 15, 2021 · Operations

How Unified Observability Transforms Quality Management in Cloud‑Native Environments

This article explores the challenges of quality monitoring in cloud‑native DevOps pipelines, outlines pain points of massive heterogeneous logs and alerts, and presents a unified observability platform that enables data consolidation, AI‑driven intelligent inspection, and smart alert management to improve system reliability.

AI · Alerting · Data Unification
0 likes · 17 min read

IT Architects Alliance
Oct 13, 2021 · Backend Development

Understanding Microservices Architecture: Core Concepts, Benefits, and Implementation Practices

This article provides a comprehensive overview of microservices architecture, covering its definition, key characteristics, advantages and drawbacks, suitable organizational contexts, core components such as service discovery, gateways, configuration centers, monitoring, circuit breaking, as well as containerization and orchestration technologies.

Backend Architecture · Cloud Native · Microservices
0 likes · 16 min read

Rare Earth Juejin Tech Community
Oct 12, 2021 · Frontend Development

Frontend Monitoring Platform: Data Collection and Reporting Techniques

This article explains the data collection and reporting component of a complete frontend monitoring platform, detailing performance metrics such as FP, FCP, LCP, CLS, and providing practical JavaScript code examples for measuring, observing, and reporting these metrics, along with error and behavior monitoring techniques.

error tracking · frontend · monitoring
0 likes · 28 min read

Open Source Linux
Oct 11, 2021 · Operations

10 Essential Ops Principles Every Engineer Should Follow

This article shares ten practical operations guidelines—from avoiding duplicated work and embracing mistakes to emphasizing monitoring, backup roles, clear division of labor, and continuous improvement—aimed at boosting reliability, efficiency, and team cohesion for both engineers and managers.

Operations · Reliability · best practices
0 likes · 10 min read

Java Architect Essentials
Oct 10, 2021 · Operations

Guide to Using Nginx‑GUI for Visual Configuration, Performance Monitoring and Log Management

This article introduces Nginx‑GUI, explains its requirements and current implementation for configuration and performance monitoring, provides step‑by‑step installation and configuration instructions with code snippets, and lists the features already realized and the remaining challenges such as log analysis and traffic statistics.

Configuration · GUI · Linux
0 likes · 4 min read

DevOps Cloud Academy
Oct 9, 2021 · Cloud Native

Serverless Application DevOps: Latest Practices and Implementation Guide

This article presents a comprehensive overview of serverless application DevOps, covering the definition, benefits, common use cases, development workflow, container image deployment, CI/CD pipelines with AWS SAM, security strategies, monitoring tools, and real‑world examples such as Coca‑Cola.

AWS Lambda · Cloud Native · DevOps
0 likes · 13 min read

Tencent Cloud Developer
Oct 8, 2021 · Operations

Unveiling Kafka’s Controller: Architecture, Election, and Monitoring Deep Dive

This article provides a comprehensive technical analysis of Kafka’s Controller component, covering its background, core responsibilities, data storage, election process, version‑specific improvements, monitoring techniques, and key source‑code excerpts to help engineers understand and manage Kafka clusters effectively.

Cluster Management · Controller · Distributed Systems
0 likes · 27 min read

Programmer DD
Oct 5, 2021 · Operations

Essential DevOps Toolchain: 13 Must‑Have Tool Categories Explained

This article outlines the core technology categories and specific tools—planning, issue tracking, source control, build, testing, CI/CD, configuration management, cloud platforms, container orchestration, monitoring, communication, and knowledge sharing—that together enable teams to implement DevOps practices effectively and deliver value sustainably.

Configuration Management · DevOps · continuous integration
0 likes · 30 min read

ByteFE
Sep 30, 2021 · Frontend Development

A Practical Guide to Chrome Performance Tools and the Performance API

This article introduces Chrome's built‑in Performance panel, explains how to use the W3C Performance API for custom metric collection, compares third‑party auditing tools, and demonstrates a real‑world optimization case to help front‑end developers diagnose and improve page load speed.

API · Chrome · monitoring
0 likes · 16 min read

Top Architect
Sep 30, 2021 · Backend Development

Spring Boot Actuator: Quick Start, Endpoint Overview, and Security Integration

This article introduces Spring Boot Actuator, explains how to create a demo project with Maven or Gradle, details the most important built‑in endpoints such as /health, /metrics, /loggers, /info, /beans, /heapdump, /threaddump and /shutdown, and shows how to secure them with Spring Security, providing configuration snippets and code examples.

Actuator · Endpoints · Spring Boot
0 likes · 14 min read

Open Source Linux
Sep 27, 2021 · Operations

Step-by-Step Guide to Installing Zabbix 5 on CentOS 7

This article provides a comprehensive, hands‑on tutorial for installing and configuring Zabbix 5 on CentOS 7, covering system overview, key terminology, disabling SELinux and firewalls, setting up repositories, installing server, agent, frontend, MariaDB, database initialization, configuration tweaks, and final web‑UI setup.

CentOS · Installation · Operations
0 likes · 9 min read

dbaplus Community
Sep 27, 2021 · Operations

6 Powerful Alternatives to Prometheus for Kubernetes Monitoring

Monitoring ensures Kubernetes applications run smoothly, and while Prometheus is a popular open‑source solution, this article examines six viable alternatives—Grafana, cAdvisor, Fluentd, Jaeger, Telepresence, and Zabbix—detailing their key features, strengths, and use‑cases for effective cluster observability.

Fluentd · Grafana · Kubernetes
0 likes · 10 min read

Efficient Ops
Sep 26, 2021 · Cloud Native

How to Stabilize Your Kubernetes Clusters: CI/CD, Monitoring, Logging, and Docs

This article analyzes why our Kubernetes clusters were constantly unstable—citing an erratic release process, missing monitoring, logging, documentation, and unclear request routing—and presents a comprehensive solution that includes a Kubernetes‑centric CI/CD pipeline, federated monitoring, centralized logging, a documentation hub, and integrated traffic management.

Cloud Native · DevOps · ci/cd
0 likes · 8 min read

Liangxu Linux
Sep 19, 2021 · Operations

Master Linux System Info with Inxi: Install, Configure, and Use

This guide explains what the lightweight inxi utility does, how to install it via package managers or source, and demonstrates its various options for displaying system, hardware, network, disk, memory, weather, and color‑customized information on Linux.

Linux · System Information · inxi
0 likes · 6 min read

Baidu Intelligent Testing
Sep 16, 2021 · Operations

Baidu Game Microservice Monitoring Practice: System Design and Evolution

This article describes Baidu's game microservice monitoring practice, detailing the initial challenges, system design, risk control, intelligent monitoring, multi‑dimensional visualization, smart alerting, and efficient fault localization, illustrating how a systematic approach improves detection speed, coverage, and issue resolution for large‑scale online games.

Alerting · Game Development · monitoring
0 likes · 12 min read

Ops Development Stories
Sep 16, 2021 · Cloud Native

Master Kubernetes: A Step‑by‑Step Learning Roadmap for Beginners

This guide walks beginners through a structured learning path for Kubernetes, covering fundamentals, core components, key objects, controllers, storage, networking, resource management, security, cluster operations, backup, logging, monitoring, DevOps practices, and deeper topics like architecture, source code, and operator development.

Backup · Cloud Native · DevOps
0 likes · 16 min read

Efficient Ops
Sep 14, 2021 · Cloud Native

Master Kubernetes: A Step‑by‑Step Learning Roadmap for Beginners

This comprehensive guide walks beginners through Kubernetes fundamentals, core components, key objects, storage, networking, resource management, security, cluster operations, backup, logging, monitoring, DevOps practices, and deep‑dive techniques, providing a clear learning path and practical tips for effective use.

Cloud Native · DevOps · container orchestration
0 likes · 16 min read

dbaplus Community
Sep 13, 2021 · Operations

How to Stabilize a Failing Kubernetes Cluster: CI/CD, Monitoring, Logging, and Docs

This article analyzes why a company's Kubernetes clusters were constantly on the brink of failure and presents a comprehensive solution covering CI/CD pipeline reconstruction, federated monitoring with Prometheus, centralized logging via Elasticsearch, documentation centralization, and clarified request routing to achieve high reliability.

Kubernetes · ci/cd · cluster stability
0 likes · 9 min read

WeChat Client Technology Team
Sep 8, 2021 · Mobile Development

Uncovering Hidden Android Thread Pitfalls: Memory Leaks, Monitoring, and Hook Solutions

This article explores obscure Android thread issues—including uncontrolled thread creation, stack memory leaks, and the impact of thread‑priority settings—while presenting monitoring techniques, a pthread hook implementation, and performance considerations to help developers detect and resolve thread‑related crashes.

Android · Hook · Memory Management
0 likes · 15 min read

Efficient Ops
Sep 5, 2021 · Operations

Why Prometheus’s TSDB Makes Monitoring Scalable: A Deep Dive

This article explains how Prometheus’s time‑series database handles massive monitoring data, illustrates practical query examples, and shows why its storage engine and pre‑computation features enable efficient, high‑performance observability for large‑scale services.

Prometheus · TSDB · Time Series Database
0 likes · 8 min read

ITFLY8 Architecture Home
Sep 5, 2021 · Cloud Native

From Rookie to Cloud‑Native Architect: Building an Enterprise Kubernetes Cluster

Over the past year, the author chronicles a hands‑on journey from a fresh graduate to a cloud‑native specialist, detailing the design and implementation of an enterprise‑grade Kubernetes architecture—including multi‑cluster logging, CI/CD pipelines, Istio service mesh, monitoring, and private‑deployment strategies—while sharing practical lessons learned.

Cloud Native · Kubernetes · Service Mesh
0 likes · 13 min read

DeWu Technology
Sep 3, 2021 · Operations

Live Streaming Service Monitoring and Alert Attribution Practice

The article outlines a systematic approach for quickly attributing live‑streaming service alerts—combining consolidated knowledge, log and trace analysis, and a decision‑tree workflow—to pinpoint root causes such as resource limits or mesh overload, illustrated by a real RT‑jitter case and emphasizing deep architectural understanding.

alert attribution · monitoring · troubleshooting
0 likes · 8 min read

Top Architect
Sep 2, 2021 · Cloud Native

Designing a Stable Backend Architecture: CI/CD, Federated Monitoring, Logging, Documentation, and Traffic Management on Kubernetes

The article analyzes why a company's clusters were unstable—unstable release process, missing monitoring and logging, insufficient documentation, and unclear request routing—and proposes a comprehensive solution built around Kubernetes‑centric CI/CD, a federated Prometheus monitoring platform, Elasticsearch logging, centralized documentation, and Kong/Istio traffic management.

Backend Architecture · Cloud Native · Documentation
0 likes · 9 min read

Java Architect Essentials
Aug 30, 2021 · Databases

How to Monitor and Optimize Redis Performance

This article explains how to use Redis INFO commands to track memory usage, command processing, latency, key eviction and fragmentation, and provides practical tips such as adjusting maxmemory, using hash structures, pipelines, and slowlog to diagnose and improve Redis performance.

Latency · Ops · memory
0 likes · 23 min read

High Availability Architecture
Aug 30, 2021 · Backend Development

Hulk: A Go‑Based Web Service Framework for Short‑Video Backend Development

The article introduces Hulk, a Go service development framework created by the short‑video R&D team to replace PHP monoliths, outlines its background, design principles, component hierarchy, comparison with GDP2, and demonstrates how its built‑in monitoring, configuration, and tooling improve code quality, development speed, and SRE efficiency across Baidu’s short‑video services.

Backend · DevOps · Framework
0 likes · 17 min read

Ops Development Stories
Aug 27, 2021 · Operations

Inside Prometheus Alerting Rules: How They’re Managed and Executed

This article explains Prometheus' custom Rule system, detailing the structure and components of alerting rules, the rule manager's loading and updating process, group scheduling, evaluation cycles, and the logic for generating, updating, and sending alerts, enabling advanced monitoring extensions.

Alerting Rules · Go · Prometheus
0 likes · 21 min read

Open Source Linux
Aug 24, 2021 · Operations

Why Prometheus Became the Leading Cloud‑Native Monitoring Solution

This article explains how Prometheus evolved from a project started at SoundCloud, modeled on Google's internal Borgmon monitoring system, into a CNCF‑graduated, top‑ranked time‑series database and full‑stack monitoring ecosystem, detailing its history, core features, architecture, and the roles of its components such as Exporters, Pushgateway, Service Discovery, and Alertmanager.

Prometheus · Time Series Database · cloud-native
0 likes · 19 min read

dbaplus Community
Aug 22, 2021 · Operations

Master Elasticsearch Performance: Memory, CPU, Shards, and Cluster Tuning

This guide presents practical best‑practice configurations for Elasticsearch clusters in production, covering JVM heap sizing, CPU thread‑pool tuning, optimal shard counts, replica strategies, hot‑warm node architecture, node role settings, common troubleshooting tips, cache handling, refresh intervals, and essential monitoring APIs.

Cluster · Elasticsearch · Shards
0 likes · 14 min read

Youzan Coder
Aug 19, 2021 · Mobile Development

Thread Pool Isolation and Monitoring Design for Mobile Applications

The design separates the original I/O pool into dedicated network, I/O, and polling thread pools, adds comprehensive monitoring of task duration and frequency, enforces unified polling rules, and automatically tunes pool parameters, resulting in a 76% reduction in UI lag and easier troubleshooting.

Polling · RxJava · mobile performance
0 likes · 12 min read

Baidu Geek Talk
Aug 18, 2021 · Backend Development

Hulk: A Go Web Service Framework for Short‑Video Backend Development

Hulk is a Go‑based web service framework created by the short‑video R&D team to replace a PHP monolith, extending the unreleased GDP2 platform with business‑specific wrappers, a four‑layer architecture, and integrated monitoring, tracing, and deployment tools that dramatically boost development speed, runtime performance, and SRE efficiency for high‑traffic short‑video services.

Backend · DevOps · Framework
0 likes · 18 min read

Tencent Cloud Developer
Aug 17, 2021 · Backend Development

Design and Implementation of a Calculation DSL and Engine

The article presents a domain‑specific language that mimics Excel formulas, a stack‑based parser and recursive engine for evaluating calculations, and a multi‑layer architecture—including a dynamic priority scheduler—to efficiently resolve field dependencies, improve maintainability, and enable monitoring across large data systems.

Calculation Engine · DSL · backend-development
0 likes · 11 min read

MaGe Linux Operations
Aug 14, 2021 · Operations

Boost System Reliability: 4 Proven Practices to Master Observability

This article explains why observability is essential for DevOps, outlines four key practices—including production‑environment monitoring, structured logging, a DevOps‑focused culture, and pre‑deployment observability with remote debugging—to help teams detect, diagnose, and prevent issues throughout the software lifecycle.

Culture · DevOps · ci/cd
0 likes · 9 min read

Cloud Native Technology Community
Aug 13, 2021 · Cloud Native

Sysdig 2021 Container Security and Usage Report – Top Open‑Source Solutions, Metrics, and Kubernetes Trends

The Sysdig 2021 report analyzes container usage across thousands of customers, highlighting the most popular open‑source services, the rise of Go and Prometheus, container density and image size trends, alert strategies, and detailed Kubernetes adoption patterns in cloud‑native environments.

Containers · Kubernetes · cloud-native
0 likes · 12 min read

Volcano Engine Developer Services
Aug 11, 2021 · Big Data

How Volcengine Solves Big Data Quality Challenges with a Unified Stream‑Batch Platform

Volcengine’s Data Quality Platform bridges the gap between data validation and resource‑intensive computation in large‑scale environments, offering unified stream‑batch monitoring, data exploration, comparison, and alerting across Hive, ClickHouse, Kafka, and more, while addressing scalability, latency, and resource optimization challenges.

Big Data · Data Quality · monitoring
0 likes · 19 min read

Top Architect
Aug 10, 2021 · Operations

Building and Using an ELK Real‑Time Log Analysis Platform

This tutorial explains how to set up a real‑time ELK log analysis platform, covering the architecture of Elasticsearch, Logstash and Kibana, detailed installation commands, configuration for Spring Boot and Nginx logs, and how to run the stack continuously with Supervisor.

ELK · Elasticsearch · Kibana
0 likes · 18 min read

Xianyu Technology
Jul 29, 2021 · Mobile Development

How Xianyu Tackles Android ANR: Monitoring, Diagnosis, and Optimization Strategies

This article explains how Xianyu identifies, monitors, and resolves Android ANR issues by analyzing root causes, implementing SIGQUIT‑based detection, inspecting thread stacks, and applying concrete optimizations such as SharedPreferences replacement, network broadcast caching, and delayed component registration, ultimately cutting ANR rates by more than half.

ANR · Android · Mobile Development
0 likes · 11 min read

Full-Stack Internet Architecture
Jul 28, 2021 · Operations

Common Open‑Source Tools for MySQL Operations and Maintenance

This article introduces a curated list of open‑source MySQL operational tools—including online DDL changers, backup and restore utilities, load‑testing frameworks, flashback solutions, slow‑query analyzers, replication consistency checkers, audit platforms, and graphical clients—explaining their principles, usage scenarios, and visual references.

Backup · Operations · Replication
0 likes · 8 min read

IT Architects Alliance
Jul 25, 2021 · Backend Development

Comprehensive Guide to Building a Backend Technology Stack for Startups

This article outlines a complete backend technology stack for startups, covering language choices, core components, processes, systemization, and detailed selections for project management, DNS, load balancing, CDN, RPC frameworks, service discovery, databases, messaging, logging, monitoring, configuration, deployment, and operational best practices.

Backend · CICD · DevOps
0 likes · 28 min read

Architects' Tech Alliance
Jul 24, 2021 · Backend Development

How to Build a Scalable Backend Stack for Startups: Languages, Components, and Best Practices

This guide outlines a comprehensive backend technology stack for startups, covering language choices, core components, development processes, infrastructure services, database options, monitoring, CI/CD, and operational best practices to help teams design, select, and implement a reliable server-side architecture.

Backend · Operations · Technology Stack
0 likes · 31 min read

GrowingIO Tech Team
Jul 22, 2021 · Databases

How to Diagnose and Fix Common HBase RegionServer Crashes

This article examines frequent HBase RegionServer failures caused by long GC pauses, oversized scans, and HDFS decommissioning, outlines step‑by‑step troubleshooting procedures—including log searches, GC tuning, scan size limits, and monitoring strategies—and provides practical solutions to prevent and resolve these issues.

HBase · RegionServer · gc
0 likes · 14 min read

Tencent Cloud Developer
Jul 22, 2021 · Operations

Observability in Serverless Environments: Monitoring, Logging, Distributed Tracing, and Best Practices

In this talk, Gal Bashan explains how serverless architectures complicate observability and why metrics, logs, and especially distributed tracing with tools like OpenTelemetry, Jaeger, or commercial platforms are essential for gaining end-to-end visibility, automating instrumentation, and maintaining reliable, business-focused services across cloud providers.

Cloud Native · Distributed Tracing · Serverless
0 likes · 12 min read

Ops Development Stories
Jul 22, 2021 · Operations

How to Diagnose Linux Server Performance Issues in 60 Seconds with 10 Essential Commands

Learn to quickly pinpoint Linux server bottlenecks by running ten commands within a minute (uptime, dmesg, vmstat, mpstat, pidstat, iostat, free, two sar invocations, and top), interpreting their outputs using the USE method to assess utilization, saturation, and errors across CPU, memory, disk, and network resources.

Linux · System Administration · USE method
0 likes · 20 min read

21CTO
Jul 21, 2021 · Backend Development

How Our Reactive API Gateway Powers Microservices with RxNetty

This article outlines the design and implementation of a high‑performance, reactive API gateway built on RxNetty, detailing its overall architecture, request routing, conditional routing, API management, rate‑limiting, circuit breaking, security policies, monitoring, tracing, and future enhancements within a microservices ecosystem.

Microservices · RxNetty · api-gateway
0 likes · 12 min read

ITPUB
Jul 21, 2021 · Backend Development

How Our Reactive API Gateway Handles Routing, Rate Limiting, and Security in Microservices

This article explains the overall architecture of a reactive API gateway built on RxNetty, detailing its request dispatch, conditional routing for gray releases, API management, rate‑limiting and circuit‑breaking, security policies, and integrated monitoring and tracing within a microservices ecosystem.

Microservices · api-gateway · backend-development
0 likes · 13 min read

Youzan Coder
Jul 19, 2021 · Operations

How We Built a Robust Search Middle Platform: From Pain Points to Full‑Scale Quality Assurance

This article examines the challenges faced by a search middle platform—such as inaccurate impact assessment, unstable underlying clusters, and missing process standards—and details a comprehensive quality‑assurance strategy that includes baseline test suites, stability practices, performance testing, emergency drills, and systematic monitoring to ensure reliable search services.

Backend · Operations · Performance Testing
0 likes · 13 min read

Yuewen Technology
Jul 16, 2021 · Operations

Mastering Log Aggregation: From LogID Generation to Powerful Analysis Tools

This article explores the challenges of log aggregation in micro‑service architectures, introduces a globally unique log identifier (logid) with its required properties, compares various logid generation schemes, and presents end‑to‑end solutions for log distribution, aggregation, and analysis using custom tools such as ylog and watcher.

Distributed Systems · log aggregation · log analysis
0 likes · 26 min read

High Availability Architecture
Jul 15, 2021 · Operations

Baidu Game Microservice Monitoring Practice and System Design

This article describes Baidu's comprehensive approach to monitoring game microservices, covering the background, initial monitoring tools, evolution of the monitoring system, systematic design for risk control, intelligent detection, alarm optimization, efficient fault localization, and future outlook for high‑availability architecture.

Baidu · Game Development · Microservices
0 likes · 13 min read

Baidu Geek Talk
Jul 14, 2021 · Operations

How Baidu Built a Robust Microservice Monitoring System for Game Services

This article details Baidu's comprehensive microservice monitoring practice for its game platform, covering the initial fragmented setup, systematic redesign across risk control, intelligent monitoring, smart alerting, and rapid fault localization, and presents the resulting monitoring architecture, visualizations, and future improvement goals.

Alerting · Baidu · Microservices
0 likes · 14 min read

DevOps
Jul 12, 2021 · Operations

The First Four Chaos Experiments to Run on Apache Kafka

This article explains how to use chaos engineering with Gremlin to design, execute, and analyze four experiments that test Kafka broker load, message loss, split‑brain scenarios, and ZooKeeper outages, helping improve the reliability and resilience of Kafka deployments.

Distributed Systems · Gremlin · Kafka
0 likes · 18 min read

Xianyu Technology
Jul 9, 2021 · Backend Development

Backend Architecture and Stability for Xianyu Local Services

The article describes Xianyu’s local services architecture, tackling rapid supplier onboarding, heterogeneous quality, and stability by reusing core platform capabilities, defining merchant, audit, and independent business domains, employing high‑concurrency rate limiting, idempotent retries, unified exception handling, status‑change logging, and proactive monitoring with alerts and reporting.

Data Consistency · System Design · monitoring
0 likes · 7 min read

Selected Java Interview Questions
Jul 7, 2021 · Operations

Redis Monitoring Metrics and Commands Guide

This article provides a comprehensive overview of Redis monitoring metrics—including performance, memory, basic activity, persistence, and error indicators—along with recommended monitoring tools, configuration settings, and command-line examples for gathering and interpreting these metrics in production environments.

Metrics · Operations · database
0 likes · 7 min read

Efficient Ops
Jul 5, 2021 · Operations

10 Essential Practices to Prevent DBA and Ops Disasters

Learn ten practical strategies—from safe change rollbacks and cautious destructive commands to robust backups, clear prompts, vigilant monitoring, and disciplined handovers—that help DBAs and operations engineers avoid costly system failures and maintain reliable production environments.

Backup · Operations · Oracle
0 likes · 6 min read

Full-Stack Internet Architecture
Jul 5, 2021 · Databases

Integrating Alibaba Druid Connection Pool with Spring Boot: Configuration and Monitoring Guide

This article provides a comprehensive guide on integrating the Alibaba Druid JDBC connection pool into a Spring Boot application, covering its components, powerful monitoring features, password encryption, SQL parsing, Maven and YAML configuration, filter setup, and how to access the Druid monitoring console.

Configuration · Database Connection Pool · Druid
0 likes · 11 min read

Ops Development Stories
Jun 30, 2021 · Cloud Native

Mastering Kubernetes: Essential Node & Pod Practices for Stable, Secure Deployments

This article outlines essential Kubernetes operational practices—including node maintenance, kernel upgrades, Docker and kubelet tuning, pod resource limits, scheduling strategies, health probes, logging standards, and monitoring setups—to ensure applications run reliably, securely, and efficiently in production environments.

Cloud Native · Kubernetes · Node Management
0 likes · 18 min read
Architects' Tech Alliance
Jun 28, 2021 · Backend Development

Understanding the Essence of Architecture and Weibo's Large‑Scale System Design

This article explores the fundamental concepts of software architecture, illustrates scaling challenges with examples like Uber and Weibo, and details multi‑tier designs, caching strategies, service decomposition, monitoring, and operational practices for building and maintaining high‑performance, billion‑user backend systems.

Backend · Scalability · caching
0 likes · 20 min read
DataFunTalk
Jun 27, 2021 · Big Data

Practical Experience in Operating NetEase's Big Data Platform: Architecture, EasyOps, Monitoring, and Optimization

This presentation by NetEase senior SRE Jin Chuan details the current state of NetEase's big data platform, introduces the internally built EasyOps management system, explains a generic Ansible‑based operation framework, describes Prometheus/Grafana monitoring and alerting, and shares practical lessons on network, storage, and cloud migration for large‑scale Hadoop services.

Ansible · Prometheus · SRE
0 likes · 10 min read
Java Interview Crash Guide
Jun 26, 2021 · Backend Development

Essential Linux and Java Tools for Fast Troubleshooting and Performance Tuning

This guide compiles a comprehensive set of Linux commands and Java diagnostic utilities—including tail, grep, awk, find, tsar, btrace, Greys, Arthas, and JProfiler—offering practical examples and code snippets to help developers quickly identify and resolve performance and stability issues in production environments.

java · monitoring · tools
0 likes · 16 min read
Architecture Digest
Jun 22, 2021 · Operations

Netflix’s Telltale: An Intelligent Monitoring and Alerting System for Application Health

The article details Netflix’s internally built Telltale monitoring platform, explaining its motivation, key features such as multi‑dimensional health assessment, smart alerting, event management, deployment monitoring, and continuous optimization, and how it improves operational efficiency for over a hundred production services.

Alerting · Netflix · Telltale
0 likes · 12 min read
Code Ape Tech Column
Jun 19, 2021 · Operations

Master Prometheus: From Installation to Advanced Monitoring with Grafana

This comprehensive guide walks you through Prometheus' origins, core features, installation methods, configuration files, PromQL basics, exporter setup, Grafana integration, alerting with Alertmanager, and advanced topics like service discovery, providing a complete roadmap for building a production‑grade monitoring system.

Alertmanager · Docker · Grafana
0 likes · 34 min read
Alibaba Cloud Native
Jun 16, 2021 · Backend Development

How to Build a Scalable Distributed Message Governance Platform for High Availability

This article shares Haro's practical experience in designing and operating a distributed message governance platform that unifies RocketMQ, Kafka, and other middleware, covering metrics, monitoring, alerting, scenario‑based controls, and high‑availability strategies to keep microservices reliable under sudden traffic spikes.

Microservices · RocketMQ · monitoring
0 likes · 14 min read
Efficient Ops
Jun 15, 2021 · Operations

Mastering IT Monitoring: Strategies, Challenges, and Best Practices

This article explores the fundamentals of IT monitoring, examines common challenges such as scalability, reliability, and alert fatigue, compares four implementation approaches—from open‑source to fully custom solutions—and presents practical techniques like alert convergence, suppression, and automation to build a robust, adaptable monitoring platform.

Alert Management · Operations · Scalability
0 likes · 19 min read
Liangxu Linux
Jun 14, 2021 · Operations

7 Essential Everyday Shell Scripts for Linux System Administration

This article presents seven practical Bash scripts that help Linux administrators quickly gather system status, back up MySQL databases, monitor services, scan network hosts, manage user passwords, and verify MySQL replication, each accompanied by clear code examples and usage instructions.

Backup · Shell · Sysadmin
0 likes · 10 min read
58 Tech
Jun 11, 2021 · Frontend Development

Beidou Frontend Monitoring System: Architecture, Challenges, and Solutions

The article details the design, architecture, and operational challenges of the Beidou frontend monitoring platform at 58 Group, covering SDK management, behavior trace logging, front‑back link integration, performance optimizations, minute‑level alerting, and permission management.

Alerting · architecture · frontend
0 likes · 22 min read
Top Architect
Jun 9, 2021 · Operations

Configuring a Perfect JVM GC Log Printing Strategy

This guide explains how to configure comprehensive JVM garbage-collection logging—including basic GC details, object age distribution, heap snapshots, pause times, safepoint statistics, and reference processing—while using timestamped filenames and JVM log rotation to avoid overwriting and manage file size effectively.

JVM · gc · java
0 likes · 12 min read
Efficient Ops
Jun 8, 2021 · Operations

How Red‑Blue Drills Boost Securities Ops: From Capacity Testing to Full‑Scale Automation

Lin Ying, a senior test manager at Guoxin Securities, shares insights from his GOPS 2021 talk on the securities industry's digital transformation, current IT challenges, and a comprehensive red‑blue exercise strategy that combines full‑link load testing, automated workflows, and proactive monitoring to ensure system stability during market peaks.

DevOps · Operations · capacity testing
0 likes · 13 min read