Tagged articles

2179 articles

Mar 8, 2020 · Operations

Prometheus vs Zabbix: Install, Configure & Visualize with Grafana

This article compares Prometheus with Zabbix, walks through downloading and installing Prometheus, explains the key sections of prometheus.yml, shows how to add a node_exporter for machine metrics, and demonstrates integrating Grafana to create rich monitoring dashboards.

Grafana · Linux · Prometheus

11 min read

Didi Tech

Mar 5, 2020 · R&D Management

Lean Development Practices and DevOps Implementation at Didi: Coding, Testing, Monitoring, and Ecosystem

At Didi, lean‑production ideas are woven into DevOps by establishing coding standards with SemVer and the NUWA framework, introducing traffic‑recording replay and a sim‑sidecar for realistic testing, extending monitoring with fine‑grained metrics, and unifying these practices into an ecosystem that cuts waste, speeds releases, and boosts overall software quality.

Framework · lean development · monitoring

7 min read

Efficient Ops

Mar 4, 2020 · Operations

Master Zabbix: From Installation to Advanced Custom Monitoring

This guide explains why monitoring is essential, describes the concept of availability "X nines," walks through Zabbix installation, web interface setup, host and template configuration, custom monitoring, alerting with OneAlert, visualization, distributed monitoring, SNMP integration, and provides practical command examples for managing large server fleets.

Linux · Zabbix · automation

20 min read

Tencent IMWeb Frontend Team

Mar 4, 2020 · Frontend Development

How Tencent Classroom’s Front‑End Team Survived Pandemic Traffic Surges

During the COVID‑19 pandemic, Tencent Classroom’s front‑end team faced unprecedented traffic spikes, forcing rapid decisions on domain stability, video streaming, data platforms, messaging, monitoring, and deployment pipelines, while sharing lessons on scaling, resilience, and collaborative development under extreme pressure.

Deployment · Tencent Classroom · Video Streaming

13 min read

Programmer DD

Mar 4, 2020 · Frontend Development

Customize Grafana Themes Without Rebuilding the Source Code

This guide walks you through a step‑by‑step method to add and switch custom Grafana themes using the Boom Theme panel plugin and ready‑made theme packs from GitHub, enabling theme changes across dashboards without modifying Grafana's source code.

Grafana · Theme Customization · frontend development

5 min read

Wukong Talks Architecture

Mar 3, 2020 · Databases

Using Druid DataSource in Spring Boot: Configuration, Monitoring, and Troubleshooting

This article explains what Druid is, how to add the Druid dependency, configure it in Spring Boot's application.yml, set up monitoring with a custom DruidConfig class, and resolve common errors such as property binding failures and login issues.

Configuration · Database Connection Pool · Druid

7 min read

Beike Product & Technology

Feb 27, 2020 · Big Data

Real‑Time Computing with Apache Flink at Beike Zhaofang: Hermes Platform Overview and Future Plans

This article presents the evolution, architecture, and operational metrics of Beike Zhaofang's Hermes real‑time computing platform built on Apache Flink, detailing its business scale, SQL editors, task growth, monitoring, use cases, and future development directions.

Apache Flink · Big Data · Real-time Streaming

10 min read

Ops Development Stories

Feb 20, 2020 · Operations

Monitor OPNsense with Zabbix: Complete Template Installation Guide

This guide walks through downloading the pfSense Zabbix template, installing the os‑zabbix‑agent plugin on OPNsense, configuring custom agent parameters, testing connectivity, and setting up the host and template in Zabbix Server to monitor OPNsense metrics.

Agent · OPNsense · Zabbix

4 min read

Qunar Tech Salon

Feb 20, 2020 · Operations

Design and Implementation of Business‑Driven Monitoring Systems at JD Cloud

This article explains why monitoring is essential for operations, outlines the four‑layer monitoring standard (infrastructure, liveness, performance, business), breaks down functional modules and data flows, and showcases JD Cloud's practical design, alarm‑convergence project, and future AI‑driven observability directions.

JD Cloud · Operations · alert convergence

12 min read

Product Technology Team

Feb 19, 2020 · Frontend Development

How Zhenkun Built a Unified Frontend Tech Stack for Rapid Scaling

This article details how Zhenkun's frontend team responded to fast business growth by unifying their tech stack—introducing a private npm registry, a custom CLI scaffolding tool, Node.js backend, mock services, standardized webpack builds, DevOps automation, static resource delivery, monitoring, visual editors, UI component libraries, and automated testing—to boost development efficiency and maintainability across multiple locations.

DevOps · automation · frontend

15 min read

Didi Tech

Feb 18, 2020 · Operations

Didi's National Carpool Day: Technical Insights into Stability Assurance

Didi's National Carpool Day on Dec 3, 2019 attracted 3.1M passengers. Stability was ensured through six pillars: an organized task force, capacity forecasting with rapid container scaling, comprehensive monitoring with a fire‑fighting map, a robust contingency platform, strict process standards, and coordinated third‑party preparation.

Carpool Day · Didi · Operations

13 min read

Alibaba Cloud Developer

Feb 18, 2020 · Cloud Native

Why Do Your Apps Crash? Alibaba’s High‑Availability Architecture Playbook

This article explains why online applications experience crashes during traffic spikes, outlines the complexity of modern cloud‑based service architectures, and shares Alibaba engineers’ practical notes on high‑availability design, capacity planning, full‑link stress testing, monitoring, traffic control, routine inspections, and chaos‑engineering drills using tools such as AHAS, PTS, Sentinel and Advisor.

Alibaba Cloud · capacity planning · chaos engineering

12 min read

Efficient Ops

Feb 17, 2020 · Operations

How Top IT Ops Teams Ensure Seamless Large-Scale Business Events

This article outlines how Ping An’s IT operations team systematically prepares for high‑traffic business events—detailing service assessment, architecture mapping, configuration audits, monitoring design, capacity planning, stress testing, and coordinated incident response—to guarantee reliability and performance under massive concurrent loads.

IT Operations · capacity planning · incident response

15 min read

Alibaba Cloud Developer

Feb 17, 2020 · Operations

How Hema Achieved Zero‑Failure Smart Scheduling: Lessons in System Stability

This article details Hema's approach to guaranteeing system stability for its offline and delivery operations, covering the complete smart‑dispatch architecture, exhaustive dependency analysis, database and middleware safeguards, monitoring strategies, gray‑release practices, testing methods, and emergency response procedures that together enabled a year of zero failures.

Backend Architecture · Database Optimization · Microservices

24 min read

ITPUB

Feb 10, 2020 · Operations

Essential Linux and Java Debugging Commands for Rapid Issue Diagnosis

This guide compiles a practical collection of Linux command‑line tricks and Java troubleshooting tools—such as tail, grep, awk, find, tsar, btrace, Greys, jstack, jmap and more—complete with usage examples, code snippets and visual outputs to help engineers quickly diagnose and resolve production problems.

debugging · monitoring · tools

17 min read

Architects' Tech Alliance

Feb 4, 2020 · Backend Development

Microservice Architecture Evolution: From Monolith to Service Mesh

This article walks through the transformation of an online supermarket from a simple monolithic website to a fully fledged microservice architecture, highlighting the motivations, design decisions, common pitfalls, and essential components such as monitoring, tracing, logging, gateways, service discovery, circuit breaking, testing strategies, and service mesh adoption.

Deployment · Microservices · Service Mesh

22 min read

Big Data Technology Architecture

Jan 31, 2020 · Big Data

Practical Experience with HBase at NetEase: Architecture, Core Use Cases, HBCK & RIT Troubleshooting, and Diagnosis Strategies

This article summarizes NetEase Hangzhou Research Institute expert Fan Xinxin's presentation on HBase, covering its role in the big‑data ecosystem, core production scenarios, RIT and HBCK troubleshooting techniques, and systematic monitoring and log‑analysis methods for diagnosing HBase issues.

HBCK · HBase · RIT

11 min read

Java Backend Technology

Jan 23, 2020 · Backend Development

Master Spring Boot Actuator: Real‑Time Monitoring, Metrics, and Dynamic Log Levels

This tutorial walks you through using Spring Boot Actuator to monitor microservice applications, covering quick setup, essential endpoints such as health, metrics, loggers, and shutdown, customizing health indicators, dynamically changing log levels at runtime, and securing actuator endpoints with Spring Security.

Actuator · Metrics · Microservices

14 min read

dbaplus Community

Jan 22, 2020 · Backend Development

How to Simulate 100 Billion WeChat Red‑Packet Requests on a Single Server

This article details a practical experiment that reproduces the load of 100 billion WeChat red‑packet (shake‑and‑grab) requests by simulating 1 million concurrent users on a single machine, achieving peak QPS of 60 k and demonstrating the architectural choices, hardware setup, and monitoring techniques required for such high‑throughput backend systems.

Go · Load Testing · QPS

18 min read

Alibaba Cloud Native

Jan 22, 2020 · Backend Development

Mastering Microservices: RPC, Service Discovery, Config, Scheduling & More

This comprehensive guide explains the benefits of microservices and walks through core building blocks such as RPC, service discovery, configuration management, task scheduling, distributed locking, unified monitoring, caching strategies, message queues, distributed transactions, CAP theory, seckill handling, Docker isolation, and modern CI/CD deployment pipelines.

Backend · Configuration Management · Microservices

24 min read

Top Architect

Jan 21, 2020 · Operations

Comprehensive Guide to Java Application Performance Optimization and Troubleshooting

This article provides a detailed, step‑by‑step guide for diagnosing and fixing performance problems in Java applications, covering code‑level pitfalls, CPU and memory analysis, disk and network I/O bottlenecks, and a collection of practical command‑line tools for rapid troubleshooting.

JVM · java · monitoring

21 min read

Architect's Tech Stack

Jan 17, 2020 · Backend Development

Spring Boot Actuator: Quick Start, Key Endpoints, Monitoring and Security Integration

This article walks through using Spring Boot Actuator to monitor micro‑service applications, covering quick project setup, essential endpoints such as health, metrics, loggers and shutdown, custom health indicator implementation, dynamic log level changes, and securing actuator endpoints with Spring Security.

Endpoints · java · monitoring

13 min read

JD Retail Technology

Jan 16, 2020 · Backend Development

Architecture and Key Technologies of a Scalable Message Push Platform

The document outlines the design, key components, data flow, and operational strategies of a large‑scale message push platform, detailing its architecture, request handling, long‑connection management, retry mechanisms, data statistics, monitoring, and future expansion plans.

Backend Architecture · Data Analytics · Long Connections

15 min read

DevOps Cloud Academy

Jan 16, 2020 · Cloud Native

Deploying Prometheus, Grafana, and Node Exporter on Kubernetes Using YAML Manifests

This guide walks through deploying node‑exporter, Prometheus, and Grafana on a Kubernetes cluster with YAML manifests, configuring services, RBAC, and Grafana dashboards to monitor cluster metrics, and includes verification steps and code examples.

Cloud Native · DevOps · Grafana

7 min read

Architecture Digest

Jan 14, 2020 · Backend Development

Microservice Architecture Evolution: From Monolith to Service Mesh

This article walks through the evolution of an online supermarket from a simple monolithic website to a fully split microservice system, highlighting the motivations, architectural changes, common pitfalls, and practical solutions such as monitoring, tracing, service discovery, circuit breaking, testing, and the eventual adoption of a service mesh.

Microservices · Service Mesh · architecture

22 min read

Architecture Digest

Jan 12, 2020 · Backend Development

Understanding Microservices Architecture: Concepts, Benefits, and Core Components

This article explains the fundamentals of microservices architecture, detailing its definition, core principles such as small independent services and lightweight communication, the advantages and drawbacks, suitable organizational contexts, and the essential technical components like service discovery, gateways, configuration centers, monitoring, circuit breaking, and container orchestration.

Microservices · architecture · gateway

15 min read

JD Retail Technology

Jan 8, 2020 · Operations

Comprehensive Guide to E‑commerce Promotion Traffic Management and System Preparation

This article explains how e‑commerce promotions differ from offline sales by offering lower participation thresholds and flexible discount tactics, outlines methods for estimating and handling traffic spikes, and provides detailed strategies for system capacity planning, load testing, monitoring, and incident response to ensure stable large‑scale promotional events.

Load Testing · capacity planning · e‑commerce

23 min read

360 Tech Engineering

Jan 7, 2020 · Operations

Introduction to Prometheus and Grafana for Monitoring and Alerting

This article provides a comprehensive overview of using Prometheus and Grafana for metric collection, storage, querying with PromQL, visualization, and alerting, including exporter integration, metric types, high‑availability setups, and practical examples for modern microservice architectures.

Grafana · Metrics · Prometheus

10 min read

Aikesheng Open Source Community

Jan 6, 2020 · Databases

Introduction to the DBLE Management Console and Reload Command

This article introduces the DBLE management console, explains its dual role in administration and monitoring, demonstrates how the reload command hot‑applies configuration changes, and provides guidance on using select/show commands for database inspection.

Configuration · DBLE · Database Management

3 min read

Aikesheng Open Source Community

Jan 2, 2020 · Operations

Monitoring Alibaba Cloud RDS with Prometheus, Grafana, and Custom Exporters

This guide explains how to monitor Alibaba Cloud RDS instances by deploying Prometheus and Grafana, using the official mysqld_exporter, a custom aliyun-exporter, rebuilding Docker images, configuring supervisor and Prometheus service discovery, and automating the entire workflow while noting limitations.

Alibaba Cloud · Docker · Exporter

8 min read

Efficient Ops

Dec 29, 2019 · Operations

Master Linux Performance: Tools & Flame Graphs for Fast Issue Diagnosis

This article presents a comprehensive guide to Linux performance analysis, covering CPU, memory, disk I/O, network, system load, flame‑graph techniques, and a real‑world Nginx case study, enabling engineers to quickly locate and resolve bottlenecks.

CPU profiling · Linux · System optimization

19 min read

Tencent Cloud Developer

Dec 27, 2019 · Cloud Computing

Tencent Classroom Video Migration to Tencent Cloud: Architecture, Implementation, and Lessons Learned

Tencent Classroom migrated roughly four million videos (about 1,500 TB) to Tencent Cloud in a two‑phase rollout that integrated cloud upload, transcoding, encrypted HLS playback with anti‑leech and DRM, added AI‑based content moderation, resolved SDK and multi‑region issues, and built a custom mini‑program player, ultimately boosting upload success rates, playback reliability, and security.

DRM · HLS encryption · Tencent Cloud

13 min read

Qunar Tech Salon

Dec 27, 2019 · Operations

Qunar Ticket Test‑Environment Governance and Automated Monitoring Framework

This article describes Qunar Ticket’s comprehensive test‑environment governance framework, including the “Mirror‑Inspect” monitoring service, configuration and data synchronization strategies, and automated allocation management, highlighting how these practices reduced environment‑related project delays from up to 20% to below 8%.

Configuration Management · Operations · monitoring

11 min read

Ops Development Stories

Dec 26, 2019 · Operations

How to Integrate ELK with Zabbix for Real‑Time Log Alerting

This guide explains how to combine ELK (Elasticsearch, Logstash, Kibana) with Zabbix using the logstash-output-zabbix plugin, configure Logstash pipelines to filter error keywords, and set up Zabbix templates and triggers for instant log‑based alerts.

Alerting · ELK · Log Management

15 min read

dbaplus Community

Dec 25, 2019 · Backend Development

How NetEase Cloud Music Built a Custom High‑Availability Message Queue on RocketMQ

This article details NetEase Cloud Music's journey from evaluating RabbitMQ, Kafka, and RocketMQ to designing a fully controllable, high‑availability message queue with failover, tracing, monitoring, and numerous custom extensions that now serve hundreds of services and billions of messages daily.

Distributed Systems · Message Queue · RocketMQ

15 min read

Aikesheng Open Source Community

Dec 25, 2019 · Operations

Deploying Thanos for Unified Prometheus Monitoring and Long‑Term Storage

This guide explains the background, key features, architecture, and step‑by‑step deployment of Thanos—including Sidecar, Store, Query, Compact, Bucket, Rule, and Check components—to provide a unified, high‑availability Prometheus monitoring view with unlimited historical data storage using object storage.

Cloud Native · Deployment · Long‑term Storage

9 min read

HomeTech

Dec 25, 2019 · Operations

Automation in Brand Advertising Testing and Monitoring to Enhance Efficiency and Quality

This project addresses challenges in brand advertising testing by implementing automated testing, monitoring, and data construction solutions, significantly improving efficiency, reducing manual effort, and enhancing product quality through real-time issue detection and resolution.

Operations · automation · data construction

5 min read

360 Tech Engineering

Dec 23, 2019 · Cloud Native

Using Thanos and Prometheus for Scalable Monitoring in OpenStack and Ceph Clusters

The article explains how Thanos combined with Prometheus provides a cloud‑native, highly available solution for long‑term metric storage and fast querying to address the exponential growth of monitoring data in large OpenStack and Ceph deployments.

Cloud Native · OpenStack · Prometheus

7 min read

Ops Development Stories

Dec 23, 2019 · Operations

How to Send Zabbix Alerts with Images to DingTalk via Python

This guide explains how to extract an item ID from Zabbix alerts, capture the corresponding chart image, upload it to a public server, format the alert as markdown, and deliver it through a DingTalk robot webhook using a Python script.

Alert · DingTalk · Python

8 min read

Efficient Ops

Dec 22, 2019 · Operations

How Baidu’s Noah Monitoring System Tackles AIOps Challenges at Scale

This article examines Baidu’s Noah monitoring and alarm platform, detailing its end‑to‑end fault‑handling workflow, the three‑component architecture, and the practical challenges of deploying AIOps—such as long algorithm iteration cycles, complex alarm management, and alarm storms—while highlighting scalability and commercial considerations.

Alarm Management · Operations · aiops

15 min read

Alibaba Cloud Developer

Dec 20, 2019 · Operations

How We Traced a 48‑Hour Memory Leak in a Distributed Coordination Service

This article details a step‑by‑step investigation of repeated follower process alerts in a Paxos‑based distributed coordination service, revealing a Java GC pause‑induced memory leak in the front‑end Proxy and describing the rapid mitigation actions taken to restore system stability.

Distributed Systems · incident response · java-gc

12 min read

Efficient Ops

Dec 19, 2019 · Operations

AIOps in Banking: Veteran’s Secrets to Smarter Operations

In this interview, veteran Bank of China software center analyst Yuan Chunliang shares two decades of experience, detailing how the bank’s shift to distributed core banking systems sparked the development of AIOps practices such as no‑threshold intelligent monitoring, multi‑indicator analytics, and AI‑driven ticket automation to boost operational efficiency and reduce risk.

Banking Technology · IT Operations · aiops

14 min read

Programmer DD

Dec 19, 2019 · Backend Development

Why Microservices Matter: Core Principles, Benefits, and Architecture Explained

This article introduces the fundamental concepts of microservices, covering their definition, advantages, design principles, core components such as service discovery, gateways, configuration centers, monitoring, circuit breaking, and container orchestration, while also discussing suitable organizational structures and practical implementation details.

Microservices · container orchestration · gateway

21 min read

Sohu Tech Products

Dec 18, 2019 · Backend Development

Node.js Performance Optimization: Common Techniques, Key Metrics, and Bottlenecks

This article answers a developer's question about Node.js performance optimization by outlining major optimization areas, listing practical techniques such as using streams, clustering, and load balancing, and describing typical bottlenecks and essential performance metrics to monitor.

Backend · monitoring · nodejs

3 min read

MaGe Linux Operations

Dec 18, 2019 · Operations

Mastering Modern IT Operations: Roles, Practices, and Evolution

This article outlines the comprehensive responsibilities and evolution of IT operations, covering system, application, database, security, and platform management, detailing tasks such as infrastructure building, monitoring, optimization, automation, and the shift from manual processes to self‑scheduling systems.

IT Operations · Infrastructure · System Administration

20 min read

dbaplus Community

Dec 17, 2019 · Artificial Intelligence

How to Build a Scalable Intelligent Dispatch System for 400K Daily Orders

This article walks through the evolution of a ride‑hailing platform’s dispatch system—from a single‑database prototype to a data‑driven, AI‑powered architecture—detailing architectural choices, big‑data pipelines, model training, real‑time scheduling strategies, and monitoring practices for handling 400,000 daily orders.

AI · Dispatch · Ride Hailing

11 min read

360 Tech Engineering

Dec 17, 2019 · Backend Development

Diagnosing Java Memory Leaks: JVM GC Roots, Monitoring, and Code Fixes

This article explains how Java memory leaks can occur despite automatic garbage collection, describes JVM GC‑Root analysis, outlines practical monitoring with Spring Boot Actuator, Prometheus, and Grafana, and provides step‑by‑step debugging commands and code adjustments to locate and fix the leak.

Garbage Collection · JVM · Spring Boot

10 min read

360 Zhihui Cloud Developer

Dec 17, 2019 · Operations

How Thanos + Prometheus Solve Large‑Scale OpenStack Monitoring Challenges

This article explains how the Thanos and Prometheus combination provides long‑term, highly available monitoring for massive OpenStack and Ceph clusters, detailing its features, architecture, key components, practical deployment issues, and the operational problems it resolves.

Ceph · OpenStack · Prometheus

8 min read

WecTeam

Dec 17, 2019 · Frontend Development

How JD Optimized Its WeChat Shopping Homepage for Lightning‑Fast Performance

By combining server‑side rendering, critical‑render‑path tuning, resource minification, image format upgrades, and RAIL‑based multi‑dimensional monitoring, JD dramatically reduced its WeChat shopping homepage’s first‑screen load time, offering a practical roadmap for front‑end performance optimization.

RAIL model · critical render path · frontend

17 min read

360 Quality & Efficiency

Dec 13, 2019 · Operations

Using Zabbix to Monitor Service Ports and Configure Email Alerts

This article explains how to use Zabbix for simple service‑port monitoring, covering installation, host and item creation, trigger and graph setup, and email notification configuration, providing a practical guide for developers who need lightweight operational monitoring without writing custom code.

Email Notification · Operations · Service Port

8 min read

Ops Development Stories

Dec 7, 2019 · Operations

Automate Zabbix Monitoring: Fetch Host Metrics and Export to CSV with Python

This guide demonstrates how to use Zabbix's API with Python to retrieve host information, item IDs, historical and trend data, process the metrics, and automatically write them into an Excel/CSV file, enabling scheduled monitoring reports.

API · CSV · Python

8 min read

Ctrip Technology

Dec 5, 2019 · Backend Development

Node.js Engineering Practices at Ctrip: From Zero to One, Best Practices and Operations

This article details how Ctrip builds, deploys, tests, releases, and operates Node.js applications—including engineering processes, core middleware, Docker-based deployment, multi‑process communication, monitoring, and full‑link tracing—while sharing practical lessons learned from real‑world production use.

DevOps · Docker · Engineering

14 min read

360 Tech Engineering

Dec 5, 2019 · Databases

Design and Implementation of a High‑Availability InfluxDB Cluster at 360

This article introduces the fundamentals of time‑series databases, explains why InfluxDB was chosen, describes the TSM storage engine and shard concepts, outlines the internal 360 InfluxDB‑HA architecture, compares its performance with a single node, and provides integration and future‑development guidelines.

Cluster Architecture · InfluxDB · monitoring

8 min read

Meitu Technology

Dec 4, 2019 · Backend Development

Design and Implementation of lmstfy: A Redis‑Based Task Queue Service

lmstfy is a stateless, Redis‑backed task‑queue service from Meitu that provides delayed execution, automatic retries, priority handling, expiration, and a RESTful HTTP API, while supporting horizontal scaling via namespace‑based token routing, rich Prometheus metrics, and future disk‑based storage extensions.

Distributed Systems · Task Queue · backend service

15 min read

Java High-Performance Architecture

Dec 2, 2019 · Databases

How Redis Sentinel Ensures Automatic Failover and High Availability

Redis Sentinel provides an automated high‑availability solution for Redis by monitoring master health, broadcasting SDOWN/ODOWN messages, electing a new master based on priority, offset and runid, and allowing clients to discover the current master via sentinel commands, all explained with configuration examples and diagrams.

Configuration · high availability · monitoring

6 min read

Ops Development Stories

Nov 30, 2019 · Databases

How to Deploy Zabbix 4.4 with TimescaleDB on CentOS 7 – Step‑by‑Step Guide

This guide walks through installing Zabbix 4.4.0 on CentOS 7, configuring PostgreSQL, adding the TimescaleDB time‑series extension, setting up the Zabbix database, and tuning Linux, Nginx, and PHP so the monitoring platform runs smoothly with high‑performance time‑series storage.

CentOS · Linux · TimescaleDB

0 likes · 11 min read


Efficient Ops

Nov 28, 2019 · Operations

Master Modern IT Operations: Skill Maps, ELK Architectures & Big Data Monitoring

This article explores the evolving landscape of IT operations, detailing role specializations, comprehensive skill maps for system, web, big data, and container ops, and compares three ELK logging architectures while emphasizing a data‑driven approach to monitoring and incident response.

Big Data · ELK · IT Operations

0 likes · 11 min read


dbaplus Community

Nov 27, 2019 · Operations

Scaling Ele.me’s Monitoring: From StatsD to a Unified LinDB‑Powered Platform

This article recounts Huang Jie’s presentation on the evolution of Ele.me’s monitoring system, detailing its three development stages, the challenges faced, the layered monitoring architecture, the design of a unified platform supporting both PC and mobile, and the underlying LinDB time‑series database.

EMonitor · LinDB · System Design

0 likes · 10 min read


MaGe Linux Operations

Nov 26, 2019 · Operations

Master Prometheus: From Basics to Advanced Configuration and Alerts

This article introduces Prometheus, an open‑source monitoring system, explains its core components such as server, exporters, and Alertmanager, provides step‑by‑step installation and configuration instructions, demonstrates alert rule setup, and shows integration with tools like Grafana, Telegraf, Spring Boot and Canal.

Alertmanager · DevOps · Grafana

0 likes · 10 min read


Huajiao Technology

Nov 26, 2019 · Backend Development

How Pepperbus Unifies Asynchronous Task Management Across Diverse Tech Stacks

This article details the design, requirements, architecture, and operational dashboard of Pepperbus, a unified bus system that standardizes asynchronous task handling for PHP, Java, and Go services at Huajiao, highlighting its storage plug‑in model, Redis‑based protocol, and monitoring capabilities.

Asynchronous · Dashboard · PHP

0 likes · 8 min read


dbaplus Community

Nov 25, 2019 · Operations

From Manual Ops to AI‑Powered Monitoring: Scaling Weibo Ads Infrastructure

This article outlines how the Weibo advertising team evolved its operations from hand‑crafted scripts to a fully automated, AI‑enhanced platform, covering service governance, multi‑datacenter deployment, a custom automation system (Kunkka), effective alerting, full‑link tracing, and a massive metric monitoring solution built on big‑data technologies.

DevOps · aiops · monitoring

0 likes · 15 min read


DevOps Coach

Nov 24, 2019 · Cloud Native

Mastering Observability in Cloud‑Native Apps with Elastic Stack: A Four‑Step Guide

This article explains how cloud‑native applications can achieve full observability using the Elastic Stack by outlining the four essential steps—health checks, metrics, logs, and tracing—while discussing the underlying challenges, implementation patterns, and practical recommendations for reliable monitoring.

APM · cloud-native · elastic-stack

0 likes · 14 min read


Programmer DD

Nov 23, 2019 · Operations

Essential Checklist for Rapid Server Troubleshooting

This guide walks you through a systematic, step‑by‑step process for diagnosing and resolving poor‑performance or failure incidents on Linux servers, covering everything from gathering context and checking who is logged in to inspecting processes, network services, hardware, I/O, logs, cron jobs and application‑level diagnostics.

Linux · Operations · monitoring

0 likes · 11 min read


21CTO

Nov 15, 2019 · Operations

How SRE Designs Highly Available Software Systems at Scale

This article presents Google SRE expert Ramón Medrano Llamas’s comprehensive guide on designing, operating, and maintaining large‑scale, highly available software systems, covering SRE fundamentals, daily workflows, scalability strategies, fault‑tolerant architecture, monitoring, and operational best practices.

SRE · Scalable Systems · fault tolerance

0 likes · 13 min read


UCloud Tech

Nov 14, 2019 · Cloud Native

How LeXin Medical Streamlined Kubernetes with UCloud UK8S: A Migration Case Study

This article details LeXin Medical's journey from a manually built Kubernetes cluster to the UCloud UK8S platform, covering the challenges of self‑hosting, the tools and processes used for migration, and the resulting improvements in logging, monitoring, CI/CD, and overall operational efficiency.

Cloud Native · DevOps · Kubernetes

0 likes · 10 min read


Huajiao Technology

Nov 12, 2019 · Operations

How to Build a Scalable API Automation Framework for Search Services

This article explains the design, core features, implementation details, and real‑world deployment of the Auto_ApiTest tool for automating API testing in a large‑scale search platform, covering data management, configuration, code examples, CI integration, monitoring, and measurable outcomes.

API testing · Python · automation

0 likes · 17 min read


dbaplus Community

Nov 11, 2019 · Operations

How EMonitor Outperforms CAT: A Deep Dive into Meituan’s Monitoring Evolution

This article compares Ele.me's in‑house EMonitor with the open‑source CAT platform, outlines their core monitoring models, sampling pipelines, custom metrics and integration capabilities, and traces the evolution of monitoring stages from log‑based to intelligent root‑cause analysis.

CAT · Distributed Systems · EMonitor

0 likes · 16 min read


MaGe Linux Operations

Nov 10, 2019 · Operations

100 Essential Linux Ops Articles Curated by a Tech‑First Education Hub

The MaGe Linux Operations (马哥Linux运维) public account, built on a technology‑first philosophy, shares high‑quality, non‑clickbait content and has compiled the 100 most‑read Linux operations articles from the past three years, offering a comprehensive resource for sysadmins and DevOps engineers.

DevOps · automation · monitoring

0 likes · 8 min read


Qunhe Technology Quality Tech

Nov 9, 2019 · Operations

How We Cut BIM Drawing Failures from 0.01% to 0.0005% with Automated Monitoring

The BIM construction‑drawing team built an automated monitoring and validation tool using Spring Boot, REST‑Assured and JIRA APIs, turning a tedious manual bug‑fix workflow into a streamlined process that reduced the online drawing‑failure rate from 0.01% to 0.0005%.

BIM · Jira · Operations

0 likes · 5 min read


DataFunTalk

Nov 7, 2019 · Big Data

Real-Time Computing Engine at Beike: Architecture, Practices, and Future Plans

This article details Beike's real‑time computing engine, covering its background, streaming platform built on Spark Streaming and Flink, data ingestion via Kafka, metadata handling, SQL‑based task development, monitoring, storage solutions, and future roadmap for resource management and AI‑enhanced monitoring.

Big Data · Flink · Kafka

0 likes · 14 min read


Ops Development Stories

Nov 6, 2019 · Operations

How to Send Zabbix 4.2 Alerts with Embedded Images via Email and WeChat Using Python

This guide shows how to use Python 2.7 to extend Zabbix 4.2 alerts by attaching the current graph image to email and WeChat notifications, covering environment setup, script details, Zabbix media type configuration, and testing the final result.

Python · WeChat · Zabbix

0 likes · 16 min read


NetEase Game Operations Platform

Nov 2, 2019 · Operations

Understanding Linux CPU Usage, Scheduling, and Performance Monitoring

This article explains how Linux reports CPU usage with tools like top, the meaning of the fields in /proc/stat, how utilization percentages are calculated, the concepts of run queues, load average, context switching, multi‑core scheduling, and how to use perf and taskset for deeper performance analysis.

CPU · Linux · Ops

0 likes · 15 min read


360 Quality & Efficiency

Nov 1, 2019 · Mobile Development

Using uiautomator1.0 for Android Automation: Shell Context, PackageManager, Database, Activity & Process Monitoring, and Chinese Input Support

This article demonstrates how to leverage uiautomator1.0 for Android automation by creating a shell‑based Context, accessing PackageManager, managing SQLite databases, monitoring app activities and processes, and implementing Chinese text input through AccessibilityNodeInfo.

Android · automation · database

0 likes · 4 min read


System Architect Go

Oct 30, 2019 · Databases

InfluxDB Monitoring, Backup, and Restore Guide

This article explains InfluxDB's built‑in monitoring system, internal measurements, useful commands, HTTP endpoints, and provides detailed instructions for performing full backups and restores, including configuration tweaks, command syntax, and important considerations about formats and data ranges.

Backup · InfluxDB · Restore

0 likes · 5 min read


Tencent Cloud Developer

Oct 25, 2019 · Backend Development

High-Concurrency Practices for Tencent Video Front-End Node.js Services

Tencent Video’s front‑end Node.js services achieve massive concurrency stability through a layered architecture that combines GSLB‑directed CDN, TGW, Nginx, and clustered workers, reinforced by process guardians, three‑tier disaster‑recovery fallbacks, multi‑level caching with lock mechanisms, and comprehensive logging and alerting.

Availability · Node.js · high concurrency

0 likes · 11 min read


Ctrip Technology

Oct 17, 2019 · Backend Development

CDubbo: Ctrip’s Customized Dubbo Framework – Architecture, Governance, Monitoring, and Extensions

This article describes how Ctrip introduced a customized Dubbo framework called CDubbo, covering the motivations for adopting Dubbo, the initial implementation of service governance and monitoring, and subsequent extensions such as callback enhancement, serialization support, circuit‑breaking, testing tools, and a bastion testing gateway.

Dubbo · Microservices · RPC

0 likes · 13 min read


dbaplus Community

Oct 16, 2019 · Operations

How to Cut Alert Noise: Practical SRE Strategies for Ops Teams

This article shares concrete SRE‑inspired techniques—duty‑roster scheduling, tiered alert handling, automation safeguards, dashboard focus on top‑3 alerts, time‑based filtering, and systematic code review—to dramatically reduce daily alarm volume while keeping on‑call teams motivated and effective.

On-Call · SRE · alert optimization

0 likes · 15 min read


Alibaba Cloud Infrastructure

Oct 16, 2019 · Operations

Intelligent Operations for Large-Scale Cloud Infrastructure: Insights from Alibaba and Intel at the 2019 Hangzhou Cloud Expo

At the 2019 Hangzhou Cloud Expo, Alibaba and Intel experts presented a series of intelligent operation solutions for large‑scale cloud infrastructure—including automated server repair, network change verification, application operation brain, monitoring advancements, power‑optimization, and data‑center management—demonstrating how AI‑driven techniques improve stability, cost, and efficiency.

Intelligent Operations · automation · cloud computing

0 likes · 7 min read


dbaplus Community

Oct 15, 2019 · Big Data

How to Build Real‑Time Data Pipelines for E‑Commerce Promotions

This article examines the surge in real‑time data demands for e‑commerce promotions, outlines how to collect, compute, and deliver streaming data, compares batch and stream processing, lists typical use cases, and discusses the challenges of building scalable, low‑latency pipelines.

Data Streaming · Real-Time · monitoring

0 likes · 11 min read


Efficient Ops

Oct 14, 2019 · Operations

How AIOps Transforms IT Operations: Real-World Architecture and Lessons

This article shares a practical case study of implementing AIOps in an online‑education company, covering the background pain points of massive monitoring data, the designed architecture with real‑time processing and machine‑learning pipelines, and the challenges and opportunities of intelligent operations.

Big Data · IT Operations · aiops

0 likes · 14 min read


Ops Development Stories

Oct 11, 2019 · Cloud Native

Deploy a Complete Prometheus Monitoring Stack on Kubernetes (Step‑by‑Step)

This guide walks through the architecture of Prometheus, the key Kubernetes monitoring metrics, and step‑by‑step instructions to deploy Prometheus, Grafana, and Alertmanager on a K8s cluster, configure RBAC, set up ConfigMaps, expose services, import dashboards, and test alert notifications via email.

Alertmanager · DevOps · Grafana

0 likes · 27 min read


37 Interactive Technology Team

Sep 27, 2019 · Operations

Centralized Management of Cron Jobs: Challenges and Solutions

The article outlines how a company built a centralized cron‑job platform—using Python’s crontab library, SaltStack deployment, ELK log aggregation, and automated email alerts—to integrate existing tasks, provide reliable CRUD operations, enable fast log querying, and detect failures, cutting operational overhead while managing thousands of scheduled jobs across multiple servers.

Log Management · Operations · Python

0 likes · 8 min read


DevOps Cloud Academy

Sep 27, 2019 · Cloud Native

Configuring Prometheus Operator ServiceMonitor on OpenShift after Migrating from Mesos+Marathon

This article explains how to migrate a Mesos+Marathon environment to OpenShift and configure Prometheus Operator ServiceMonitor resources, including service creation, ServiceMonitor definition, and verification steps, with full YAML examples and screenshots of the monitoring UI.

Cloud Native · Kubernetes · OpenShift

0 likes · 6 min read


GF Securities FinTech

Sep 23, 2019 · Backend Development

Why Our Team Switched from Node.js to Go: Lessons in Backend Engineering

This article details how a high‑traffic trading app migrated from Node.js to Go, outlining Go's advantages, drawbacks, and the team's engineering practices—including environment management, dependency handling, efficiency tools, standardized libraries, testing, monitoring, and distributed tracing—to achieve robust, high‑performance backend services.

Backend Engineering · Go · ci/cd

0 likes · 16 min read


Architecture Digest

Sep 23, 2019 · Operations

Improving Application Availability: Practices, Monitoring, and Fault‑Tolerance in a Large‑Scale Payment System

The article describes how a high‑traffic payment platform achieves 99.999% availability by avoiding single points of failure, applying fail‑fast principles, implementing resource limits, building real‑time monitoring and alerting, and automating fault detection, routing, and recovery to ensure continuous 7×24 operation.

backend operations · fault tolerance · high availability

0 likes · 23 min read


Programmer DD

Sep 20, 2019 · Operations

Master Prometheus: Key Features, Architecture, and Query Essentials

This article introduces Prometheus, an open‑source cloud‑native monitoring and alerting system, covering its main characteristics, core components, architecture diagram, typical use cases, query language syntax, built‑in functions, time‑series types, and practical tips for reliable operation.

Alerting · Operations · PromQL

0 likes · 9 min read


HomeTech

Sep 19, 2019 · Industry Insights

How Autohome Scaled Its 818 Global Car Night to Millions of QPS: A Technical Deep Dive

The article details how Autohome tackled a severe market downturn by launching the 818 Global Car Night, describing the background, massive technical challenges, infrastructure scaling, high‑availability architecture, full‑link stress testing, monitoring, security measures, and the lessons learned for future large‑scale online events.

Performance Testing · Scalability · cloud computing

0 likes · 30 min read


Java Captain

Sep 19, 2019 · Backend Development

A Comprehensive Overview of Microservice Architecture and Its Evolution

This article presents a detailed, step‑by‑step illustration of microservice architecture, covering its motivations, component breakdown, migration from monoliths, common pitfalls, monitoring, tracing, logging, gateway, service discovery, resilience patterns, testing strategies, frameworks, and the emerging service‑mesh approach.

Service Mesh · fault tolerance · monitoring

0 likes · 23 min read


Architects' Tech Alliance

Sep 17, 2019 · Backend Development

Microservice Architecture Evolution: From Monolith to Service Mesh and Best Practices

This article walks through the transition of an online supermarket from a simple monolithic web application to a fully fledged microservice architecture, highlighting the challenges, design decisions, component choices, monitoring, tracing, testing, and operational practices needed for a robust, scalable system.

Deployment · Microservices · architecture

0 likes · 24 min read


dbaplus Community

Sep 16, 2019 · Operations

How to Build Effective Monitoring for Microservices: Logs, Tracing, and Metrics Explained

This article explains the three main monitoring approaches—log collection, distributed tracing, and metric gathering—in microservice architectures, outlines the layered monitoring model, lists key system, application, and user metrics, and reviews popular open‑source time‑series monitoring tools such as Prometheus, OpenTSDB, and InfluxDB.

Metrics · Microservices · Prometheus

0 likes · 10 min read


Ops Development Stories

Sep 10, 2019 · Operations

How to Configure Zabbix Monitoring for Windows Server with NAT and iptables

This guide walks through setting up a Zabbix server on ESXi, enabling NAT and port forwarding with iptables, installing the Zabbix agent on Windows Server 2012, and creating Windows‑specific monitoring items such as IIS process status and performance counters.

Network NAT · Performance Counters · Zabbix

0 likes · 6 min read


FunTester

Sep 8, 2019 · Backend Development

How to Add Real‑Time Alert Notifications for API Test Failures in Java

This article explains how to detect server‑induced empty JSON responses during API automation, integrate the free AlertOver service for instant failure alerts, and provides complete Java code for a robust getHttpResponse method and an AlertOver utility class to send system, function, business, and reminder messages.

API testing · Backend · alert notification

0 likes · 9 min read


DevOps Cloud Academy

Sep 6, 2019 · Operations

Step-by-Step Installation and Configuration of Prometheus, Alertmanager, Node Exporter, and Grafana for Monitoring and Alerting

This guide walks through downloading, installing, configuring, and verifying Prometheus, Alertmanager, Node Exporter, and Grafana on a Linux server, including service setup, YAML configuration files, and a simple test to trigger and receive an alert via email.

Alertmanager · Grafana · Installation

0 likes · 6 min read


360 Tech Engineering

Sep 6, 2019 · Operations

StackStorm-Based ChatOps Solution for Automated Monitoring Alert Self‑Healing

This article introduces a StackStorm‑driven ChatOps framework that consolidates monitoring alerts, applies rule‑based root‑cause analysis, and automatically executes self‑healing actions, outlining its architecture, components, workflow definitions, and practical deployment results within an enterprise operations environment.

ChatOps · Operations Automation · StackStorm

0 likes · 6 min read


Aotu Lab

Sep 6, 2019 · Frontend Development

How We Revamped Our Homepage with TypeScript, Webpack, and Accessibility Enhancements

The article details a comprehensive homepage redesign that introduced strict TypeScript type checking, migrated to a customized Webpack build, added Nightwatch.js automated tests, upgraded monitoring with BadJS and performance metrics, implemented skeleton screens, and improved accessibility for visually impaired users.

Automated Testing · Frontend Optimization · TypeScript

0 likes · 16 min read


DevOps Cloud Academy

Sep 5, 2019 · Operations

An Overview of the Prometheus Monitoring System

Prometheus, an open‑source monitoring and alerting toolkit originally developed at SoundCloud and now a CNCF project, offers a multidimensional data model, a flexible query language (PromQL), pull‑based metric collection, four metric types (counter, gauge, histogram, summary), local and remote storage, and service discovery, and it integrates with Grafana for visualization.

Cloud Native · Metrics · Operations

0 likes · 8 min read


Liangxu Linux

Sep 4, 2019 · Operations

Automate Linux Memory & Swap Monitoring with Email Alerts

This guide walks through installing the msmtp email client, configuring mutt, using the free command to capture memory and swap statistics, writing Bash scripts to log and email the data, and scheduling the tasks with cron so alerts are sent when swap usage exceeds 80%.

Email · System Administration · monitoring

0 likes · 8 min read


MaGe Linux Operations

Sep 4, 2019 · Operations

Essential Linux Ops Tools: From Nethogs to Fail2ban with Installation Guides

This article presents a curated collection of practical Linux operation tools—including Nethogs, IOZone, IOTop, IPtraf, IFTop, HTop, NMON, MultiTail, Fail2ban, Tmux, Agedu, NMap, and Httperf—along with download links, installation commands, usage tips, and illustrative screenshots to help system administrators enhance monitoring, performance testing, and security.

monitoring

0 likes · 13 min read


Youzan Coder

Sep 4, 2019 · Cloud Native

How Youzan Built a Highly Available Kubernetes Platform for Massive E‑commerce

This article explains why Youzan chose Kubernetes, describes their multi‑IDC, multi‑cluster architecture with high‑availability master components, logging and monitoring solutions, custom service exposure, image building process, lifecycle hooks, continuous delivery pipeline, operational challenges faced, and future plans such as operators and auto‑scaling.

Kubernetes · Multi-Cluster · ci/cd

0 likes · 11 min read


Big Data Technology Architecture

Sep 4, 2019 · Big Data

Deploying a Telegraf + InfluxDB + Grafana Monitoring Platform

This article walks through the complete deployment of a time‑series monitoring solution built from Telegraf, InfluxDB, and Grafana (often called the TIG stack), covering installation and configuration of each component to collect, store, and visualize key metrics such as CPU, memory, network, and disk I/O for a big‑data platform.

Grafana · InfluxDB · Time Series

0 likes · 9 min read
