Tagged articles
Java Architect Essentials
Aug 26, 2020 · Backend Development

A Comprehensive Guide to Evolving a Monolithic Online Store into a Robust Microservice Architecture

This article walks through the transformation of a simple online supermarket from a monolithic design to a fully fledged microservice system, explaining the motivations, architectural changes, component selection, common pitfalls, and best‑practice solutions such as service decomposition, database sharding, monitoring, tracing, service mesh, resilience patterns, and testing strategies.

Microservices · Resilience · architecture
0 likes · 22 min read

Architecture Digest
Aug 25, 2020 · Operations

Best Practices and Advanced Topics for Prometheus Monitoring in Kubernetes

This article provides a comprehensive guide on using Prometheus for Kubernetes monitoring, covering fundamental principles, exporter selection, Grafana dashboard creation, memory and storage optimization, high‑availability designs, query performance, cardinality management, and integration with alerting and logging systems.

Exporters · Grafana · Kubernetes
0 likes · 33 min read

Aikesheng Open Source Community
Aug 24, 2020 · Operations

Prometheus Data Query Basics and Practical Usage Guide

This article introduces Prometheus' query language PromQL, explains instant and range vector selectors, label matching, offset handling, storage design, common functions and aggregation operators, and provides practical advice for efficient querying and avoiding performance issues.

Operations · PromQL · Prometheus
0 likes · 13 min read

58 Tech
Aug 19, 2020 · Backend Development

Design and Implementation of a Testing Quality System for the 58.com SSP Advertising Platform

The article details the architecture of 58.com’s SSP advertising platform, identifies three key reliability challenges—data consistency, interface regression, and storage synchronization—and presents a three‑layer testing quality system comprising web‑layer validation, service‑layer automated testing, and data‑layer monitoring with concrete tools and future improvement plans.

SSP · advertising platform · automation
0 likes · 14 min read

Open Source Linux
Aug 17, 2020 · Operations

Step-by-Step Guide to Install and Configure Zabbix on CentOS 7

This tutorial walks you through installing Zabbix on CentOS 7: disabling SELinux and the firewall as a prerequisite, adding repositories, installing the server, web, and database components, editing configuration files, securing MariaDB, starting services, and completing the web‑based setup with language customization.

CentOS · Installation · Linux
0 likes · 7 min read

Full-Stack DevOps & Kubernetes
Aug 16, 2020 · Cloud Native

How to Configure Alertmanager, Add WeChat Alerts, and Enable Automatic Service Discovery in Kubernetes

This guide walks through modifying Alertmanager to use a NodePort service, decoding and editing its secret to add custom receivers and a WeChat template, recreating the secret, and extending Prometheus Operator with additional scrape configs for automatic service discovery, including RBAC adjustments and verification steps.

Kubernetes · RBAC · ServiceDiscovery
0 likes · 10 min read

Tencent Cloud Developer
Aug 12, 2020 · Databases

How Autonomous Databases Evolve: From Stone Age to AI‑Driven Self‑Healing

This article traces the evolution of database autonomy from manual, knowledge‑driven operations through tool‑assisted and expert‑level stages to cloud‑native intelligent services, and details Tencent's DBbrain platform, its architecture, performance‑optimization, security, monitoring, cost‑based analysis, and future self‑healing capabilities.

AI Ops · Cloud Databases · DBbrain
0 likes · 29 min read

Java Architect Essentials
Aug 11, 2020 · Operations

Four Essential Linux Monitoring Tools for Operations Engineers

This article introduces four widely used Linux monitoring tools—iotop, htop, IPTraf, and Monit—explaining their features, usage scenarios, and how they help operations engineers diagnose performance issues without a GUI, including real‑time I/O tracking, visual CPU/memory graphs, network traffic analysis, and flexible alerting.

IPTraf · Linux · Monit
0 likes · 7 min read

MaGe Linux Operations
Aug 8, 2020 · Operations

Step-by-Step Guide to Installing and Configuring Zabbix on CentOS 7

This tutorial walks you through disabling SELinux and the firewall, adding Zabbix and EPEL repositories, installing Zabbix server, web, and database components, configuring files, securing MariaDB, starting services, and completing the web‑based setup to get a fully functional monitoring system.

CentOS · Installation · Open-source
0 likes · 7 min read

dbaplus Community
Aug 3, 2020 · Operations

How iQIYI Built a Full‑Link Automated Monitoring Platform for Microservices

iQIYI’s tech product team designed a unified full‑link automated monitoring platform that integrates link, metric, and log collection with deep analysis, enhancing fault localization, performance insight, and scalability across microservices, while addressing limitations of existing tools like ELK, Prometheus, and Dapper.

Metrics · full‑link · log collection
0 likes · 15 min read

Xianyu Technology
Jul 28, 2020 · Operations

ShenTan: Automated Fault Localization System for Online Services

ShenTan is an automated fault‑localization platform for online services that quickly (under five seconds) pinpoints server‑side issues with developer‑level accuracy by aggregating real‑time metrics, applying a decision‑tree model enriched by expert knowledge and dynamic thresholds, and presenting results through an integrated alert and visualization system, while planning broader endpoint coverage and multi‑tenant support.

Big Data · Fault Localization · Operations
0 likes · 12 min read

Top Architect
Jul 27, 2020 · Operations

10 Practical Tips to Boost Web Application Performance Up to 10× with NGINX

This article presents ten actionable recommendations—including reverse‑proxy deployment, load balancing, caching, compression, SSL/TLS tuning, HTTP/2 adoption, software upgrades, Linux and web‑server tuning, and real‑time monitoring—to dramatically improve web application performance, often achieving tenfold speed gains.

Nginx · Web Performance · caching
0 likes · 22 min read

WecTeam
Jul 23, 2020 · Backend Development

How We Reduced WebMonitor Latency from Minutes to Seconds – Architecture & Performance Secrets

This article chronicles the evolution of the WebMonitor front‑end monitoring system, detailing its three‑tier stack, data pipeline upgrades from raw disk sampling to HDFS and Elasticsearch, extensive collector‑side optimizations, Jetty thread and timeout tuning, and the resulting performance gains that lowered response times from minutes to sub‑second levels.

Jetty · data pipeline · java
0 likes · 15 min read

dbaplus Community
Jul 20, 2020 · Operations

How to Build Reliable Monitoring for Low‑Frequency Financial Services

After two years transitioning from e‑commerce to finance, the team shares practical monitoring strategies for low‑frequency financial services, contrasting e‑commerce traffic‑based methods with finance‑specific challenges, and detailing point‑based metrics, hourly success‑rate alerts, aspect‑oriented exception handling, white‑list filtering, and Sentinel‑based circuit breaking.

Alerting · Aspect Oriented Programming · Circuit Breaking
0 likes · 16 min read

Liangxu Linux
Jul 19, 2020 · Operations

How to Diagnose Linux Performance Issues with Flame Graphs and System Tools

This guide explains how to systematically analyze Linux performance problems—including CPU, memory, disk I/O, network, and load—using 5W2H methodology, built‑in monitoring commands, perf, flame‑graph visualizations, and a real‑world Nginx case study to pinpoint and resolve bottlenecks.

flamegraph · monitoring · performance
0 likes · 19 min read

Qunhe Technology Quality Tech
Jul 17, 2020 · Operations

How We Built a Robust Monitoring System for Construction Drawing Production

This article describes how our team designed and implemented a comprehensive online monitoring system for construction drawing generation, covering business background, technical architecture analysis, metric definition, monitoring methods, and the resulting dashboards that improve quality, stability, and rapid issue resolution.

Metrics · Operations · construction drawing
0 likes · 10 min read

Full-Stack Internet Architecture
Jul 12, 2020 · Operations

Monitoring Practices for Low‑Frequency Financial Services: Lessons from E‑commerce and Reliable Alerting Techniques

This article shares practical monitoring strategies for financial services with low‑frequency operations, contrasting e‑commerce monitoring methods, outlining the challenges of financial monitoring, and presenting reliable solutions such as success‑rate alerts, aspect‑oriented exception handling with whitelists, and circuit‑breaker degradation using Sentinel.

Alerting · Aspect Oriented Programming · Financial Services
0 likes · 14 min read

Full-Stack DevOps & Kubernetes
Jul 9, 2020 · Cloud Native

Deploy and Manage Prometheus Operator on Kubernetes: A Step‑by‑Step Guide

This article explains what the Prometheus Operator is, how it extends Kubernetes with custom resources, lists the CRDs it provides, and walks through a complete deployment—including cloning the source, creating a monitoring namespace, applying RBAC, installing the operator, creating a Prometheus instance, configuring ServiceMonitor, and troubleshooting common permission errors—using concrete YAML manifests and kubectl commands.

Kubernetes · Prometheus Operator · RBAC
0 likes · 18 min read

HaoDF Tech Team
Jul 8, 2020 · Operations

How We Rebuilt Our Monitoring System into a Scalable Alert Service

After two months of intensive development, the team launched a new monitoring and alerting platform that transforms a legacy system into a service‑oriented solution, addressing pain points such as inflexible escalation, noisy alerts, and poor ownership while introducing phone alerts, automated escalation, Prometheus integration, and a unified rule engine.

Alerting · DevOps · Prometheus
0 likes · 16 min read

ITPUB
Jul 7, 2020 · Operations

Top 2020 DevOps Tools: A Complete Guide to Building Your CI/CD Stack

This article categorizes the most popular 2020 DevOps tools across development, testing, deployment, runtime, and collaboration, explains why each tool leads its class, lists key advantages and competitors, and offers a practical checklist for assembling a full CI/CD pipeline.

Collaboration · DevOps · automation
0 likes · 24 min read

ITPUB
Jul 5, 2020 · Operations

2020’s Best DevOps Tools by Category – From CI/CD to Collaboration

This article categorises the most popular 2020 DevOps tools—development/build, automated testing, deployment, runtime, and collaboration—explains why each tool topped its class, lists key advantages, and compares notable competitors to help teams build a complete CI/CD pipeline.

Collaboration · automation · monitoring
0 likes · 27 min read

dbaplus Community
Jul 2, 2020 · Information Security

How 58 Daojia Secures Data in the DT Era: Threats, Practices, and Lessons

This article summarizes Liu Huan's presentation on data security in the DT era, covering the current security landscape, internal and external threats to enterprise data, and 58 Daojia's practical approaches to data discovery, classification, authentication, monitoring, and incident response.

DT era · data security · enterprise security
0 likes · 14 min read

Full-Stack DevOps & Kubernetes
Jul 1, 2020 · Cloud Native

How to Install and Configure mysql_exporter on a Kubernetes Master Node

This guide walks through downloading the mysql_exporter package, extracting it on a Kubernetes master, installing the binary, creating a dedicated MySQL user with proper permissions, configuring a password‑less client file, launching the exporter, and updating Prometheus via kubectl so MySQL metrics are exposed on port 9104.

Cloud Native · DevOps · Kubernetes
0 likes · 4 min read

Top Architect
Jul 1, 2020 · Backend Development

Understanding Microservices Architecture: Concepts, Benefits, and Key Components

Microservices, introduced in 2012 and popularized by Martin Fowler, decompose applications into small, independent services that communicate via lightweight protocols, enabling modular development, flexible technology choices, independent deployment, and improved scalability, while also introducing challenges such as distributed data consistency, testing complexity, and operational overhead.

Backend Architecture · Configuration Management · Microservices
0 likes · 16 min read

dbaplus Community
Jun 28, 2020 · Databases

How to Build a Visual MongoDB Slow Query Dashboard with PHP

This guide explains how to set up a PHP‑based web platform that collects MongoDB slow‑query logs via remote profiling, stores them in MySQL, and visualizes the data, including installation of required PHP extensions, database preparation, configuration, cron scheduling, and enabling profiling on MongoDB.

MongoDB · PHP · monitoring
0 likes · 7 min read

Qunar Tech Salon
Jun 23, 2020 · Operations

A Simple Gray Release Solution for High‑Concurrency Flight Ticket Systems

This article presents a lightweight gray release approach for complex flight ticket services, comparing traditional hardware and soft‑routing isolation methods, describing the authors' traffic‑based gray identification, business‑focused monitoring, implementation details, and automated safeguards to enable safe incremental deployments.

Backend · Deployment · Operations
0 likes · 8 min read

Aikesheng Open Source Community
Jun 22, 2020 · Operations

Introduction to the Prometheus Data Collection Process

This article explains the complete Prometheus data collection workflow, covering key concepts such as targets, samples, and meta labels, detailing the relabeling steps, configuration options, example use‑cases, and the final scrape and storage phases for effective monitoring.

Configuration · Prometheus · data collection
0 likes · 8 min read

JD Retail Technology
Jun 17, 2020 · Operations

How JD’s Data Platforms Scaled for the 618 Mega‑Sale: Operations, Stress‑Testing, and Dual‑Stream Architecture

The article details JD’s data product teams’ systematic preparation for the 618 shopping festival, covering pressure estimation, capacity expansion, stress testing, emergency downgrade strategies, dual‑data‑center isolation, high‑fidelity end‑to‑end testing, and continuous monitoring to ensure stable, real‑time data services during massive traffic spikes.

Big Data · Data Platform · JD.com
0 likes · 10 min read

Xianyu Technology
Jun 17, 2020 · Backend Development

Lottery System Risk Management and SDK Integration

Xianyu mitigated lottery‑related financial loss by centralizing rights management, decoupling UI from business logic, and providing a unified SDK with simple draw APIs, while adding real‑time log backflow, comprehensive accounting and monitoring, cutting configuration time by over 50 % and eliminating UI‑only risk.

Backend · Lottery System · SDK
0 likes · 10 min read

Laravel Tech Community
Jun 16, 2020 · Mobile Development

Kuaishou’s APM Platform and Mobile Performance Optimization: Insights from Yang Kai

In a mobile‑first world where limited device resources and unstable networks threaten user retention, Kuaishou’s performance team built an APM monitoring platform and applied systematic memory, startup, and jank optimizations that cut startup time by 40%, reduced package size by 23 MB, and significantly improved key product metrics.

APM · Kuaishou · Memory Management
0 likes · 9 min read

Liangxu Linux
Jun 13, 2020 · Operations

Mastering Monitoring: From Basics to Advanced Zabbix Practices

This comprehensive guide explains why monitoring is essential for operations, outlines monitoring goals and methods, reviews a wide range of open‑source tools, details a Zabbix‑based workflow, enumerates key metrics across hardware, system, application, network, security and business layers, and offers practical alerting and interview tips.

Alerting · Operations · Zabbix
0 likes · 21 min read

JD Retail Technology
Jun 10, 2020 · Operations

Logistics R&D Preparation for the 618 Promotion: System Readiness, Stress Testing, and Real‑Time Monitoring

The logistics R&D team spent 62 days preparing for the 618 promotion by analyzing core processes, applying stress tests, implementing fault‑tolerant architectures, planning capacity, and deploying real‑time monitoring tools to ensure system stability and performance under peak traffic.

Operations · Performance Testing · System Design
0 likes · 7 min read

Manbang Technology Team
Jun 8, 2020 · Cloud Native

Design and Implementation of a Zookeeper Operator for Kubernetes

This article outlines the design, functional requirements, CRD definition, architecture, deployment, scaling, monitoring, fault‑tolerance, and upgrade strategies of a Zookeeper operator on Kubernetes, including code examples, service configurations, and integration with Prometheus and OAM standards.

CRD · Cloud Native · Kubernetes
0 likes · 18 min read

Efficient Ops
Jun 3, 2020 · Operations

Understanding Kubernetes vs VM Monitoring: CPU, Memory, Disk & Network

This article compares monitoring metrics for CPU, memory, disk, and network between traditional KVM-based servers and Kubernetes pods, explaining why their indicators differ, how resource isolation works, and what key metrics users should watch to diagnose performance bottlenecks.

CPU · Kubernetes · memory
0 likes · 11 min read

iQIYI Technical Product Team
May 29, 2020 · Big Data

iQiyi's Full-Link Automated Monitoring Platform: Design and Implementation

iQiyi’s full‑link automated monitoring platform unifies tracing, metric and log collection with deep offline and real‑time analysis, delivering a DAG‑based call graph, near‑real‑time ingestion of tens of millions of logs, multi‑dimensional alerts and rapid root‑cause diagnosis that cut error‑lookup time by over 50 % and now serves as a core component of the company’s microservice reference architecture.

Big Data · Metrics · architecture
0 likes · 12 min read

FunTester
May 26, 2020 · Fundamentals

Understanding Load Testing: Key Strategies and Best Practices

This article clarifies common misconceptions about load testing, defines it within performance testing, and provides practical strategies for test volume, load generators, scripting, think time, ramp-up/down, monitoring, diagnosis, and data analysis to ensure reliable performance assessments.

Software Testing · Test Strategy · monitoring
0 likes · 11 min read

dbaplus Community
May 25, 2020 · Operations

Scaling CAT Monitoring at Ctrip: Thread Model, Client Computation & Memory Tweaks

This article details how Ctrip optimized the CAT monitoring system—covering its large‑scale deployment, thread‑model redesign, offloading calculations to clients, double‑buffered reporting, and string handling improvements—to dramatically cut CPU usage, GC pressure, and memory consumption while handling billions of messages daily.

Distributed Systems · Thread Model · gc
0 likes · 25 min read

Top Architect
May 21, 2020 · Backend Development

Comprehensive Guide to Java Application Performance Optimization and Diagnosis

This article provides an in‑depth overview of Java application performance optimization, covering a four‑layer model (application, database, framework, JVM), on‑site and post‑mortem analysis methods, OS and JVM diagnostic tools, common code and GC issues, database deadlock handling, and practical tuning recommendations.

Database Tuning · JVM · diagnostics
0 likes · 23 min read

Efficient Ops
May 20, 2020 · Operations

How to Build a Sustainable CMDB: Three Essential Phases for Reliable Operations

This article explains how to design, implement, and maintain a robust Configuration Management Database (CMDB) by focusing on simple modeling, establishing data closure loops, and efficiently handling existing inventory, while leveraging Kafka, Flink, Elasticsearch, and Neo4j for fast querying and topology visualization.

CMDB · Configuration Management · automation
0 likes · 19 min read

Efficient Ops
May 19, 2020 · Cloud Native

Mastering Prometheus on Kubernetes: Practical Tips, Exporter Guide, and Capacity Planning

This article explores the history and principles of Prometheus monitoring, offers guidance on version selection, highlights its limitations, details common Kubernetes exporters, shows Grafana dashboard setups, and provides in‑depth strategies for exporter aggregation, golden metrics, multi‑cluster scraping, GPU monitoring, timezone handling, memory optimization, capacity planning, and rate calculations.

Grafana · Kubernetes · Prometheus
0 likes · 19 min read

HomeTech
May 14, 2020 · Cloud Native

Design and Implementation of the Next‑Generation Cloud‑Native Monitoring System at Autohome

The article describes Autohome's third‑generation cloud‑native monitoring platform, detailing its background, strategic goals for R&D efficiency, mobile‑first design, Prometheus‑based architecture with multi‑replica storage and InfluxDB remote storage, its operational impact, and future directions such as AI‑driven noise reduction.

Containers · cloud-native · monitoring
0 likes · 7 min read

Programmer DD
May 12, 2020 · Operations

Boost RabbitMQ Reliability: Proven Strategies for Producers, Consumers, and Ops

This comprehensive guide explains how to enhance RabbitMQ reliability by covering confirmation mechanisms, producer and consumer configurations, queue mirroring, alerting, monitoring metrics, and health‑check commands, providing actionable steps for developers and operations teams to ensure stable message delivery.

Message Queue · Operations · RabbitMQ
0 likes · 22 min read

MaGe Linux Operations
May 10, 2020 · Databases

How to Build a Complete MySQL Monitoring Dashboard with Prometheus and Grafana

This guide walks through deploying mysqld_exporter, configuring Prometheus and Grafana, and monitoring essential MySQL metrics such as replication health, query throughput, slow‑query counts, connection usage, and InnoDB buffer‑pool statistics, while also showing how to set up alert rules for proactive database operations.

Alerting · Exporters · Grafana
0 likes · 15 min read

ITPUB
May 3, 2020 · Operations

Mastering IT Monitoring: Goals, Methods, Tools, and Best Practices

This comprehensive guide explains why monitoring is essential for reliable operations, outlines clear monitoring objectives, walks through practical monitoring methods, compares popular open‑source tools, details a Zabbix‑based workflow, and lists key hardware, system, application, network, security, API, performance, and business metrics to track.

IT infrastructure · Operations · Zabbix
0 likes · 19 min read

Laravel Tech Community
May 2, 2020 · Operations

Comprehensive MySQL and Linux Operations Interview Guide

This guide compiles essential MySQL security steps, master‑slave replication principles, backup scripts, a Linux boot overview, common service ports, virus mitigation, monitoring tools, nginx optimization, InnoDB lock troubleshooting, replication lag reduction, high‑availability components, data migration utilities, and automated configuration management techniques for operations engineers.

Linux · Operations · automation
0 likes · 13 min read

Liangxu Linux
Apr 29, 2020 · Operations

How to Build a Complete Monitoring System: Goals, Methods, Tools & Best Practices

This guide explains why monitoring is essential for the entire operations lifecycle, outlines key monitoring objectives, describes practical methods and workflows, reviews a range of open‑source tools (including Zabbix, MRTG, Ganglia, Nagios, Smokeping, OpenTSDB), and details metric categories such as hardware, system, application, network, log, security, API, performance and business monitoring.

Alerting · Metrics · Zabbix
0 likes · 22 min read

vivo Internet Technology
Apr 29, 2020 · Cloud Native

Prometheus Architecture and Design Principles: A Deep Dive into Cloud-Native Monitoring

Prometheus, a CNCF‑graduated, cloud‑native monitoring system, combines pull‑based target discovery, a label‑rich time‑series data model, and four core metric types (gauge, counter, histogram, and summary) to provide near‑real‑time visibility, short‑term retention, alerting via Alertmanager, and integration with Grafana and remote storage for scalable observability.

Alertmanager · CNCF · DevOps
0 likes · 11 min read

Qunhe Technology Quality Tech
Apr 29, 2020 · Operations

How Our Team Built a Stable SIT Environment: Lessons in Test Environment Governance

This article documents the step‑by‑step practices of a six‑person test‑environment availability team that unified middleware, streamlined deployment pipelines, piloted business usage, introduced monitoring and recovery mechanisms, and created a comprehensive SIT environment handbook to improve integration testing stability and operational efficiency.

Deployment · Operations · SIT
0 likes · 19 min read

UCloud Tech
Apr 28, 2020 · Cloud Native

How We Built a Highly Available Kubernetes Platform for Multi‑Cluster Deployments

This article explains why Kubernetes was chosen, describes the overall architecture, high‑availability master design, multi‑IDC cluster deployment, logging, monitoring, service exposure, image building, lifecycle hooks, CI/CD, multi‑cluster management, encountered challenges, and future plans for operators and automated scaling.

Kubernetes · Multi-Cluster · ci/cd
0 likes · 11 min read

dbaplus Community
Apr 22, 2020 · Operations

How 58 Daojia Built a Cloud‑Native Ops Platform to Streamline Migration and Cut Costs

This article recounts 58 Daojia’s four‑year journey from migrating its IDC infrastructure to public cloud, the challenges encountered, and how the team designed and evolved a multi‑generation operations platform that centralizes asset, cost, domain, and monitoring management, ultimately improving efficiency and reducing expenses.

Cost Management · asset management · cloud migration
0 likes · 14 min read

21CTO
Apr 16, 2020 · Backend Development

How JD’s API Gateway Handles Tens of Millions of Concurrent Requests

This article explains how JD Retail built a high‑performance, secure, and observable API gateway that supports massive traffic, implements asynchronous processing for high concurrency, provides fine‑grained traffic control, gray‑release capabilities, and automated operations to serve native, web, and mini‑program clients.

api-gateway · automation · gray release
0 likes · 10 min read

FunTester
Apr 14, 2020 · Operations

Spot Performance Problems Without Writing a Single Line of Code

Experienced developers can often identify performance bottlenecks simply by reviewing code implementations, configuration settings such as timeouts, intervals, database and Redis parameters, as well as service monitoring data, container and JVM configurations, allowing them to avoid unnecessary test scripts and code changes.

Configuration · DevOps · Operations
0 likes · 2 min read

Cloud Native Technology Community
Apr 8, 2020 · Operations

Decoding Thanos Architecture: From Query to Compact for Scalable Monitoring

This article provides a detailed analysis of Thanos' architecture, explaining each core component—Query, Sidecar, Store Gateway, Ruler, Compact, and the upcoming Receiver—how they enable global view, high availability, and long‑term storage for distributed Prometheus deployments, and discusses design trade‑offs and optimization strategies.

Cloud Native · Long‑term Storage · Prometheus
0 likes · 12 min read
Ops Development Stories
Apr 8, 2020 · Operations

Deploy Zabbix Monitoring with Docker and Docker‑Compose on CentOS

This guide walks through preparing a CentOS 7 host, installing Docker, configuring a Zabbix server and MySQL containers, and optionally using docker‑compose to set up Zabbix components, including the web UI and agent, with detailed commands and volume mappings for persistent monitoring.

CentOS · Docker · Docker Compose
0 likes · 18 min read
DevOps
Apr 8, 2020 · Operations

Bilibili DevOps Case Study: Culture, Community, User‑Driven Demand Management, High‑Performance Microservices, and Data Operations

This article presents a comprehensive DevOps case study of Bilibili, covering its cultural background, community ecosystem, user‑centric demand management, migration to high‑performance microservices, and the implementation of logging, monitoring, and real‑time data platforms to support rapid, reliable delivery.

Bilibili · Data Platform · DevOps
0 likes · 17 min read
Efficient Ops
Apr 6, 2020 · Databases

How to Build a MySQL Monitoring Platform with Prometheus and Grafana

This article walks through setting up a production‑grade MySQL monitoring solution using Prometheus and Grafana, covering exporter installation, MySQL user configuration, systemd service setup, Prometheus job definition, key MySQL performance metrics, and basic alerting rules.

Grafana · Metrics · Prometheus
0 likes · 15 min read
ITFLY8 Architecture Home
Apr 5, 2020 · Backend Development

Master Spring Boot Actuator: Quick Start, Key Endpoints, and Security

This tutorial walks through what Spring Boot Actuator is, how to quickly create a demo project, configure endpoint exposure, explore essential endpoints such as health, metrics, loggers, and shutdown, and secure them with Spring Security, providing code snippets and configuration examples.

Actuator · Endpoints · Spring Boot
0 likes · 14 min read
360 Quality & Efficiency
Apr 3, 2020 · Operations

Prometheus Monitoring System: Concepts, Architecture, and Hands‑On Deployment with Node Exporter and Grafana

This article introduces the core concepts and architecture of the open‑source Prometheus monitoring system, explains its data model and metric types, and provides a step‑by‑step guide to install a Prometheus server, collect host metrics with Node Exporter, and visualize them using Grafana.

Grafana · Metrics · Prometheus
0 likes · 10 min read
Efficient Ops
Apr 1, 2020 · Operations

How to Use Nagios for Business-Level Service Monitoring: A Step-by-Step Guide

This article explains why traditional server and service monitoring (e.g., Zabbix) may miss business outages, then walks through setting up Nagios on Debian to monitor web page URLs, API health checks, and related services, including configuration files, plugins, and a desktop alert tool, Nagstamon.

Linux · Nagios · Ops
0 likes · 18 min read
Alibaba Terminal Technology
Apr 1, 2020 · Frontend Development

How to Build a Robust Frontend Safety Production System for High‑Reliability Web Apps

This article explains the concept of frontend safety production, outlines its evolution from basic monitoring to a systematic, cloud‑enabled framework, and details the core capabilities—pre‑change CI checks, gray‑release gating, and real‑time monitoring—required to ensure high‑quality, risk‑free frontend deployments.

CI · automation · frontend
0 likes · 12 min read
Java Captain
Apr 1, 2020 · Operations

Comprehensive Guide to Online Environment Deployment and Operations Practices

This article provides a thorough overview of planning, provisioning, and managing online production environments—including user sizing, bandwidth estimation, database design, OS versus container deployment, middleware selection, security, monitoring, SSH shortcuts, file transfer tools, automation scripts, Docker setup, and log viewing techniques—aimed at giving developers a complete operational perspective.

Deployment · Docker · Operations
0 likes · 16 min read
FunTester
Mar 31, 2020 · Operations

Interface Performance Testing – Tools, Scripts, and Guides

This article compiles a comprehensive list of resources—including tools, scripts, and tutorials—for conducting interface performance testing on Linux and other platforms, covering topics such as netdata localization, timewatch utility, load testing strategies, JVM heap dumps, and visualizing test data.

API · Linux · monitoring
0 likes · 6 min read
Continuous Delivery 2.0
Mar 30, 2020 · Operations

Dynamic Runtime Configuration Management at Facebook: Use Cases and Tooling

The article explains how Facebook manages dynamic runtime configuration for millions of services—covering feature gating, experiments, traffic control, topology balancing, monitoring, machine‑learning model updates, and internal behavior—using a suite of tools such as Configerator, Gatekeeper, Package Vessel, Sitevars, and MobileConfig.

AB testing · cloud operations · configuration-management
0 likes · 8 min read
Efficient Ops
Mar 26, 2020 · Operations

Why SRE Exists and How It Solves Reliability Challenges

This article explains why Site Reliability Engineering (SRE) emerged, outlines its core responsibilities, required skill set, and how it addresses reliability challenges through decoupling, SLO‑driven monitoring, and scenario‑based drills, while highlighting key observations and focus areas for modern operations teams.

SLO · SRE · monitoring
0 likes · 13 min read
Ops Development Stories
Mar 26, 2020 · Operations

How to Auto‑Discover and Monitor Redis Ports with Zabbix

This guide explains how to use Zabbix's auto‑discovery feature to automatically find Redis instances on a server, create shell or Python scripts for port detection, configure Zabbix agent keys, set up server‑side templates, discovery rules, item prototypes, graphs, and triggers, and finally apply the template to monitored hosts.

Auto-discovery · Python · Shell
0 likes · 9 min read
Didi Tech
Mar 21, 2020 · Operations

Why Didi’s Nightingale Is Redefining Cloud‑Native Monitoring

Nightingale, Didi’s open‑source enterprise monitoring platform, builds on Open‑Falcon but adds a hierarchical object tree, in‑memory indexing, Gorilla‑compressed time‑series storage, a hybrid push‑pull alert engine, built‑in log monitoring, and a unified monapi module, delivering scalable, cloud‑native observability for both container and bare‑metal workloads.

Cloud Native · Open-Falcon · architecture
0 likes · 10 min read
Open Source Linux
Mar 19, 2020 · Operations

Essential Ops Playbook: Avoid Costly Mistakes in Server Management

This guide shares practical Linux server operation rules, emphasizing thorough testing, careful use of destructive commands, strict access control, regular backups, security hardening, continuous monitoring, and disciplined performance tuning to prevent costly outages and data loss.

Backup · monitoring · performance tuning
0 likes · 13 min read