Tagged articles
2179 articles
Efficient Ops
Mar 8, 2020 · Operations

Prometheus vs Zabbix: Install, Configure & Visualize with Grafana

This article compares Prometheus with Zabbix, walks through downloading and installing Prometheus, explains the key sections of prometheus.yml, shows how to add a node_exporter for machine metrics, and demonstrates integrating Grafana to create rich monitoring dashboards.

Grafana, Linux, Prometheus
0 likes · 11 min read
Didi Tech
Mar 5, 2020 · R&D Management

Lean Development Practices and DevOps Implementation at Didi: Coding, Testing, Monitoring, and Ecosystem

At Didi, lean‑production ideas are woven into DevOps by establishing coding standards with SemVer and the NUWA framework, introducing traffic‑recording replay and a sim‑sidecar for realistic testing, extending monitoring with fine‑grained metrics, and unifying these practices into an ecosystem that cuts waste, speeds releases, and boosts overall software quality.

Framework, lean development, monitoring
0 likes · 7 min read
Efficient Ops
Mar 4, 2020 · Operations

Master Zabbix: From Installation to Advanced Custom Monitoring

This guide explains why monitoring is essential, describes the concept of availability "X nines," walks through Zabbix installation, web interface setup, host and template configuration, custom monitoring, alerting with OneAlert, visualization, distributed monitoring, SNMP integration, and provides practical command examples for managing large server fleets.

Linux, Zabbix, automation
0 likes · 20 min read
Tencent IMWeb Frontend Team
Mar 4, 2020 · Frontend Development

How Tencent Classroom’s Front‑End Team Survived Pandemic Traffic Surges

During the COVID‑19 pandemic, Tencent Classroom’s front‑end team faced unprecedented traffic spikes, forcing rapid decisions on domain stability, video streaming, data platforms, messaging, monitoring, and deployment pipelines, while sharing lessons on scaling, resilience, and collaborative development under extreme pressure.

Deployment, Tencent Classroom, Video Streaming
0 likes · 13 min read
Programmer DD
Mar 4, 2020 · Frontend Development

Customize Grafana Themes Without Rebuilding the Source Code

This guide walks you through a step‑by‑step method to add and switch custom Grafana themes using the Boom Theme panel plugin and ready‑made theme packs from GitHub, enabling theme changes across dashboards without modifying Grafana's source code.

Grafana, Theme Customization, frontend development
0 likes · 5 min read
Qunar Tech Salon
Feb 20, 2020 · Operations

Design and Implementation of Business‑Driven Monitoring Systems at JD Cloud

This article explains why monitoring is essential for operations, outlines the four‑layer monitoring standard (infrastructure, liveness, performance, business), breaks down functional modules and data flows, and showcases JD Cloud's practical design, alarm‑convergence project, and future AI‑driven observability directions.

JD Cloud, Operations, alert convergence
0 likes · 12 min read
Product Technology Team
Feb 19, 2020 · Frontend Development

How Zhenkun Built a Unified Frontend Tech Stack for Rapid Scaling

This article details how Zhenkun's frontend team responded to fast business growth by unifying their tech stack—introducing a private npm registry, a custom CLI scaffolding tool, Node.js backend, mock services, standardized webpack builds, DevOps automation, static resource delivery, monitoring, visual editors, UI component libraries, and automated testing—to boost development efficiency and maintainability across multiple locations.

DevOps, automation, frontend
0 likes · 15 min read
Didi Tech
Feb 18, 2020 · Operations

Didi's National Carpool Day: Technical Insights into Stability Assurance

Didi's National Carpool Day on Dec 3, 2019 attracted 3.1 million passengers. Stability was ensured through six pillars: an organized task force, capacity forecasting with rapid container scaling, comprehensive monitoring with a fire‑fighting map, a robust contingency platform, strict process standards, and coordinated third‑party preparation.

Carpool Day, Didi, Operations
0 likes · 13 min read
Alibaba Cloud Developer
Feb 18, 2020 · Cloud Native

Why Do Your Apps Crash? Alibaba’s High‑Availability Architecture Playbook

This article explains why online applications experience crashes during traffic spikes, outlines the complexity of modern cloud‑based service architectures, and shares Alibaba engineers’ practical notes on high‑availability design, capacity planning, full‑link stress testing, monitoring, traffic control, routine inspections, and chaos‑engineering drills using tools such as AHAS, PTS, Sentinel and Advisor.

Alibaba Cloud, capacity planning, chaos engineering
0 likes · 12 min read
Efficient Ops
Feb 17, 2020 · Operations

How Top IT Ops Teams Ensure Seamless Large-Scale Business Events

This article outlines how Ping An’s IT operations team systematically prepares for high‑traffic business events—detailing service assessment, architecture mapping, configuration audits, monitoring design, capacity planning, stress testing, and coordinated incident response—to guarantee reliability and performance under massive concurrent loads.

IT Operations, capacity planning, incident response
0 likes · 15 min read
Alibaba Cloud Developer
Feb 17, 2020 · Operations

How Hema Achieved Zero‑Failure Smart Scheduling: Lessons in System Stability

This article details Hema's approach to guaranteeing system stability for its offline and delivery operations, covering the complete smart‑dispatch architecture, exhaustive dependency analysis, database and middleware safeguards, monitoring strategies, gray‑release practices, testing methods, and emergency response procedures that together enabled a year of zero failures.

Backend Architecture, Database Optimization, Microservices
0 likes · 24 min read
ITPUB
Feb 10, 2020 · Operations

Essential Linux and Java Debugging Commands for Rapid Issue Diagnosis

This guide compiles a practical collection of Linux command‑line tricks and Java troubleshooting tools—such as tail, grep, awk, find, tsar, btrace, Greys, jstack, jmap and more—complete with usage examples, code snippets and visual outputs to help engineers quickly diagnose and resolve production problems.

debugging, monitoring, tools
0 likes · 17 min read
Architects' Tech Alliance
Feb 4, 2020 · Backend Development

Microservice Architecture Evolution: From Monolith to Service Mesh

This article walks through the transformation of an online supermarket from a simple monolithic website to a fully fledged microservice architecture, highlighting the motivations, design decisions, common pitfalls, and essential components such as monitoring, tracing, logging, gateways, service discovery, circuit breaking, testing strategies, and service mesh adoption.

Deployment, Microservices, Service Mesh
0 likes · 22 min read
Big Data Technology Architecture
Jan 31, 2020 · Big Data

Practical Experience with HBase at NetEase: Architecture, Core Use Cases, HBCK & RIT Troubleshooting, and Diagnosis Strategies

This article summarizes NetEase Hangzhou Research Institute expert Fan Xinxin's presentation on HBase, covering its role in the big‑data ecosystem, core production scenarios, RIT and HBCK troubleshooting techniques, and systematic monitoring and log‑analysis methods for diagnosing HBase issues.

HBCK, HBase, RIT
0 likes · 11 min read
Java Backend Technology
Jan 23, 2020 · Backend Development

Master Spring Boot Actuator: Real‑Time Monitoring, Metrics, and Dynamic Log Levels

This tutorial walks you through using Spring Boot Actuator to monitor microservice applications, covering quick setup, essential endpoints such as health, metrics, loggers, and shutdown, customizing health indicators, dynamically changing log levels at runtime, and securing actuator endpoints with Spring Security.

Actuator, Metrics, Microservices
0 likes · 14 min read
dbaplus Community
Jan 22, 2020 · Backend Development

How to Simulate 100 Billion WeChat Red‑Packet Requests on a Single Server

This article details a practical experiment that reproduces the load of 100 billion WeChat red‑packet (shake‑and‑grab) requests by simulating 1 million concurrent users on a single machine, achieving peak QPS of 60 k and demonstrating the architectural choices, hardware setup, and monitoring techniques required for such high‑throughput backend systems.

Go, Load Testing, QPS
0 likes · 18 min read
Alibaba Cloud Native
Jan 22, 2020 · Backend Development

Mastering Microservices: RPC, Service Discovery, Config, Scheduling & More

This comprehensive guide explains the benefits of microservices and walks through core building blocks such as RPC, service discovery, configuration management, task scheduling, distributed locking, unified monitoring, caching strategies, message queues, distributed transactions, CAP theory, seckill handling, Docker isolation, and modern CI/CD deployment pipelines.

Backend, Configuration Management, Microservices
0 likes · 24 min read
JD Retail Technology
Jan 16, 2020 · Backend Development

Architecture and Key Technologies of a Scalable Message Push Platform

The document outlines the design, key components, data flow, and operational strategies of a large‑scale message push platform, detailing its architecture, request handling, long‑connection management, retry mechanisms, data statistics, monitoring, and future expansion plans.

Backend Architecture, Data Analytics, Long Connections
0 likes · 15 min read
Architecture Digest
Jan 14, 2020 · Backend Development

Microservice Architecture Evolution: From Monolith to Service Mesh

This article walks through the evolution of an online supermarket from a simple monolithic website to a fully split microservice system, highlighting the motivations, architectural changes, common pitfalls, and practical solutions such as monitoring, tracing, service discovery, circuit breaking, testing, and the eventual adoption of a service mesh.

Microservices, Service Mesh, architecture
0 likes · 22 min read
Architecture Digest
Jan 12, 2020 · Backend Development

Understanding Microservices Architecture: Concepts, Benefits, and Core Components

This article explains the fundamentals of microservices architecture, detailing its definition, core principles such as small independent services and lightweight communication, the advantages and drawbacks, suitable organizational contexts, and the essential technical components like service discovery, gateways, configuration centers, monitoring, circuit breaking, and container orchestration.

Microservices, architecture, gateway
0 likes · 15 min read
JD Retail Technology
Jan 8, 2020 · Operations

Comprehensive Guide to E‑commerce Promotion Traffic Management and System Preparation

This article explains how e‑commerce promotions differ from offline sales by offering lower participation thresholds and flexible discount tactics, outlines methods for estimating and handling traffic spikes, and provides detailed strategies for system capacity planning, load testing, monitoring, and incident response to ensure stable large‑scale promotional events.

Load Testing, capacity planning, e‑commerce
0 likes · 23 min read
360 Tech Engineering
Jan 7, 2020 · Operations

Introduction to Prometheus and Grafana for Monitoring and Alerting

This article provides a comprehensive overview of using Prometheus and Grafana for metric collection, storage, querying with PromQL, visualization, and alerting, including exporter integration, metric types, high‑availability setups, and practical examples for modern microservice architectures.

Grafana, Metrics, Prometheus
0 likes · 10 min read
Efficient Ops
Dec 29, 2019 · Operations

Master Linux Performance: Tools & Flame Graphs for Fast Issue Diagnosis

This article presents a comprehensive guide to Linux performance analysis, covering CPU, memory, disk I/O, network, system load, flame‑graph techniques, and a real‑world Nginx case study, enabling engineers to quickly locate and resolve bottlenecks.

CPU profiling, Linux, System optimization
0 likes · 19 min read
Tencent Cloud Developer
Dec 27, 2019 · Cloud Computing

Tencent Classroom Video Migration to Tencent Cloud: Architecture, Implementation, and Lessons Learned

Tencent Classroom migrated roughly four million videos (about 1,500 TB) to Tencent Cloud in a two‑phase rollout that integrated cloud upload, transcoding, encrypted HLS playback with anti‑leech and DRM, added AI‑based content moderation, resolved SDK and multi‑region issues, and built a custom mini‑program player, ultimately boosting upload success rates, playback reliability, and security.

DRM, HLS encryption, Tencent Cloud
0 likes · 13 min read
Qunar Tech Salon
Dec 27, 2019 · Operations

Qunar Ticket Test‑Environment Governance and Automated Monitoring Framework

This article describes Qunar Ticket’s comprehensive test‑environment governance framework, including the “Mirror‑Inspect” monitoring service, configuration and data synchronization strategies, and automated allocation management, highlighting how these practices reduced environment‑related project delays from up to 20% to below 8%.

Configuration Management, Operations, monitoring
0 likes · 11 min read
Aikesheng Open Source Community
Dec 25, 2019 · Operations

Deploying Thanos for Unified Prometheus Monitoring and Long‑Term Storage

This guide explains the background, key features, architecture, and step‑by‑step deployment of Thanos—including Sidecar, Store, Query, Compact, Bucket, Rule, and Check components—to provide a unified, high‑availability Prometheus monitoring view with unlimited historical data storage using object storage.

Cloud Native, Deployment, Long‑term Storage
0 likes · 9 min read
Efficient Ops
Dec 22, 2019 · Operations

How Baidu’s Noah Monitoring System Tackles AIOps Challenges at Scale

This article examines Baidu’s Noah monitoring and alarm platform, detailing its end‑to‑end fault‑handling workflow, the three‑component architecture, and the practical challenges of deploying AIOps—such as long algorithm iteration cycles, complex alarm management, and alarm storms—while highlighting scalability and commercial considerations.

Alarm Management, Operations, aiops
0 likes · 15 min read
Efficient Ops
Dec 19, 2019 · Operations

AIOps in Banking: Veteran’s Secrets to Smarter Operations

In this interview, veteran Bank of China software center analyst Yuan Chunliang shares two decades of experience, detailing how the bank’s shift to distributed core banking systems sparked the development of AIOps practices such as no‑threshold intelligent monitoring, multi‑indicator analytics, and AI‑driven ticket automation to boost operational efficiency and reduce risk.

Banking Technology, IT Operations, aiops
0 likes · 14 min read
Programmer DD
Dec 19, 2019 · Backend Development

Why Microservices Matter: Core Principles, Benefits, and Architecture Explained

This article introduces the fundamental concepts of microservices, covering their definition, advantages, design principles, core components such as service discovery, gateways, configuration centers, monitoring, circuit breaking, and container orchestration, while also discussing suitable organizational structures and practical implementation details.

Microservices, container orchestration, gateway
0 likes · 21 min read
MaGe Linux Operations
Dec 18, 2019 · Operations

Mastering Modern IT Operations: Roles, Practices, and Evolution

This article outlines the comprehensive responsibilities and evolution of IT operations, covering system, application, database, security, and platform management, detailing tasks such as infrastructure building, monitoring, optimization, automation, and the shift from manual processes to self‑scheduling systems.

IT Operations, Infrastructure, System Administration
0 likes · 20 min read
dbaplus Community
Dec 17, 2019 · Artificial Intelligence

How to Build a Scalable Intelligent Dispatch System for 400K Daily Orders

This article walks through the evolution of a ride‑hailing platform’s dispatch system—from a single‑database prototype to a data‑driven, AI‑powered architecture—detailing architectural choices, big‑data pipelines, model training, real‑time scheduling strategies, and monitoring practices for handling 400,000 daily orders.

AI, Dispatch, Ride Hailing
0 likes · 11 min read
360 Tech Engineering
Dec 17, 2019 · Backend Development

Diagnosing Java Memory Leaks: JVM GC Roots, Monitoring, and Code Fixes

This article explains how Java memory leaks can occur despite automatic garbage collection, describes JVM GC‑Root analysis, outlines practical monitoring with Spring Boot Actuator, Prometheus, and Grafana, and provides step‑by‑step debugging commands and code adjustments to locate and fix the leak.

Garbage Collection, JVM, Spring Boot
0 likes · 10 min read
WecTeam
Dec 17, 2019 · Frontend Development

How JD Optimized Its WeChat Shopping Homepage for Lightning‑Fast Performance

By combining server‑side rendering, critical‑render‑path tuning, resource minification, image format upgrades, and RAIL‑based multi‑dimensional monitoring, JD dramatically reduced its WeChat shopping homepage’s first‑screen load time, offering a practical roadmap for front‑end performance optimization.

RAIL model, critical render path, frontend
0 likes · 17 min read
360 Quality & Efficiency
Dec 13, 2019 · Operations

Using Zabbix to Monitor Service Ports and Configure Email Alerts

This article explains how to use Zabbix for simple service‑port monitoring, covering installation, host and item creation, trigger and graph setup, and email notification configuration, providing a practical guide for developers who need lightweight operational monitoring without writing custom code.

Email Notification, Operations, Service Port
0 likes · 8 min read
360 Tech Engineering
Dec 5, 2019 · Databases

Design and Implementation of a High‑Availability InfluxDB Cluster at 360

This article introduces the fundamentals of time‑series databases, explains why InfluxDB was chosen, describes the TSM storage engine and shard concepts, outlines the internal 360 InfluxDB‑HA architecture, compares its performance with a single node, and provides integration and future‑development guidelines.

Cluster Architecture, InfluxDB, monitoring
0 likes · 8 min read
Meitu Technology
Dec 4, 2019 · Backend Development

Design and Implementation of lmstfy: A Redis‑Based Task Queue Service

lmstfy is a stateless, Redis‑backed task‑queue service from Meitu that provides delayed execution, automatic retries, priority handling, expiration, and a RESTful HTTP API, while supporting horizontal scaling via namespace‑based token routing, rich Prometheus metrics, and future disk‑based storage extensions.

Distributed Systems, Task Queue, backend service
0 likes · 15 min read
Java High-Performance Architecture
Dec 2, 2019 · Databases

How Redis Sentinel Ensures Automatic Failover and High Availability

Redis Sentinel provides an automated high‑availability solution for Redis by monitoring master health, broadcasting SDOWN/ODOWN messages, electing a new master based on priority, offset and runid, and allowing clients to discover the current master via sentinel commands, all explained with configuration examples and diagrams.

Configuration, high availability, monitoring
0 likes · 6 min read
MaGe Linux Operations
Nov 26, 2019 · Operations

Master Prometheus: From Basics to Advanced Configuration and Alerts

This article introduces Prometheus, an open‑source monitoring system, explains its core components such as server, exporters, and Alertmanager, provides step‑by‑step installation and configuration instructions, demonstrates alert rule setup, and shows integration with tools like Grafana, Telegraf, Spring Boot and Canal.

Alertmanager, DevOps, Grafana
0 likes · 10 min read
Huajiao Technology
Nov 26, 2019 · Backend Development

How Pepperbus Unifies Asynchronous Task Management Across Diverse Tech Stacks

This article details the design, requirements, architecture, and operational dashboard of Pepperbus, a unified bus system that standardizes asynchronous task handling for PHP, Java, and Go services at Huajiao, highlighting its storage plug‑in model, Redis‑based protocol, and monitoring capabilities.

Asynchronous, Dashboard, PHP
0 likes · 8 min read
dbaplus Community
Nov 25, 2019 · Operations

From Manual Ops to AI‑Powered Monitoring: Scaling Weibo Ads Infrastructure

This article outlines how the Weibo advertising team evolved its operations from hand‑crafted scripts to a fully automated, AI‑enhanced platform, covering service governance, multi‑datacenter deployment, a custom automation system (Kunkka), effective alerting, full‑link tracing, and a massive metric monitoring solution built on big‑data technologies.

DevOps, aiops, monitoring
0 likes · 15 min read
DevOps Coach
Nov 24, 2019 · Cloud Native

Mastering Observability in Cloud‑Native Apps with Elastic Stack: A Four‑Step Guide

This article explains how cloud‑native applications can achieve full observability using the Elastic Stack by outlining the four essential steps—health checks, metrics, logs, and tracing—while discussing the underlying challenges, implementation patterns, and practical recommendations for reliable monitoring.

APM, cloud-native, elastic-stack
0 likes · 14 min read
Programmer DD
Nov 23, 2019 · Operations

Essential Checklist for Rapid Server Troubleshooting

This guide walks you through a systematic, step‑by‑step process for diagnosing and resolving poor‑performance or failure incidents on Linux servers, covering everything from gathering context and checking who is logged in to inspecting processes, network services, hardware, I/O, logs, cron jobs and application‑level diagnostics.

Linux, Operations, monitoring
0 likes · 11 min read
21CTO
Nov 15, 2019 · Operations

How SRE Designs Highly Available Software Systems at Scale

This article presents Google SRE expert Ramón Medrano Llamas’s comprehensive guide on designing, operating, and maintaining large‑scale, highly available software systems, covering SRE fundamentals, daily workflows, scalability strategies, fault‑tolerant architecture, monitoring, and operational best practices.

SRE, Scalable Systems, fault tolerance
0 likes · 13 min read
UCloud Tech
Nov 14, 2019 · Cloud Native

How LeXin Medical Streamlined Kubernetes with UCloud UK8S: A Migration Case Study

This article details LeXin Medical's journey from a manually built Kubernetes cluster to the UCloud UK8S platform, covering the challenges of self‑hosting, the tools and processes used for migration, and the resulting improvements in logging, monitoring, CI/CD, and overall operational efficiency.

Cloud Native, DevOps, Kubernetes
0 likes · 10 min read
Huajiao Technology
Nov 12, 2019 · Operations

How to Build a Scalable API Automation Framework for Search Services

This article explains the design, core features, implementation details, and real‑world deployment of the Auto_ApiTest tool for automating API testing in a large‑scale search platform, covering data management, configuration, code examples, CI integration, monitoring, and measurable outcomes.

API testing, Python, automation
0 likes · 17 min read
DataFunTalk
Nov 7, 2019 · Big Data

Real-Time Computing Engine at Beike: Architecture, Practices, and Future Plans

This article details Beike's real‑time computing engine, covering its background, streaming platform built on Spark Streaming and Flink, data ingestion via Kafka, metadata handling, SQL‑based task development, monitoring, storage solutions, and future roadmap for resource management and AI‑enhanced monitoring.

Big Data, Flink, Kafka
0 likes · 14 min read
360 Quality & Efficiency
Nov 1, 2019 · Mobile Development

Using uiautomator1.0 for Android Automation: Shell Context, PackageManager, Database, Activity & Process Monitoring, and Chinese Input Support

This article demonstrates how to leverage uiautomator1.0 for Android automation by creating a shell‑based Context, accessing PackageManager, managing SQLite databases, monitoring app activities and processes, and implementing Chinese text input through AccessibilityNodeInfo.

Android, automation, database
0 likes · 4 min read
System Architect Go
Oct 30, 2019 · Databases

InfluxDB Monitoring, Backup, and Restore Guide

This article explains InfluxDB's built‑in monitoring system, internal measurements, useful commands, HTTP endpoints, and provides detailed instructions for performing full backups and restores, including configuration tweaks, command syntax, and important considerations about formats and data ranges.

Backup, InfluxDB, Restore
0 likes · 5 min read
Tencent Cloud Developer
Oct 25, 2019 · Backend Development

High-Concurrency Practices for Tencent Video Front-End Node.js Services

Tencent Video’s front‑end Node.js services achieve massive concurrency stability through a layered architecture that combines GSLB‑directed CDN, TGW, Nginx, and clustered workers, reinforced by process guardians, three‑tier disaster‑recovery fallbacks, multi‑level caching with lock mechanisms, and comprehensive logging and alerting.

Availability, Node.js, high concurrency
0 likes · 11 min read
Ctrip Technology
Oct 17, 2019 · Backend Development

CDubbo: Ctrip’s Customized Dubbo Framework – Architecture, Governance, Monitoring, and Extensions

This article describes how Ctrip introduced a customized Dubbo framework called CDubbo, covering the motivations for adopting Dubbo, the initial implementation of service governance and monitoring, and subsequent extensions such as callback enhancement, serialization support, circuit‑breaking, testing tools, and a bastion testing gateway.

Dubbo, Microservices, RPC
0 likes · 13 min read
dbaplus Community
Oct 16, 2019 · Operations

How to Cut Alert Noise: Practical SRE Strategies for Ops Teams

This article shares concrete SRE‑inspired techniques—duty‑roster scheduling, tiered alert handling, automation safeguards, dashboard focus on top‑3 alerts, time‑based filtering, and systematic code review—to dramatically reduce daily alarm volume while keeping on‑call teams motivated and effective.

On-Call, SRE, alert optimization
0 likes · 15 min read
Alibaba Cloud Infrastructure
Oct 16, 2019 · Operations

Intelligent Operations for Large-Scale Cloud Infrastructure: Insights from Alibaba and Intel at the 2019 Hangzhou Cloud Expo

At the 2019 Hangzhou Cloud Expo, Alibaba and Intel experts presented a series of intelligent operation solutions for large‑scale cloud infrastructure—including automated server repair, network change verification, application operation brain, monitoring advancements, power‑optimization, and data‑center management—demonstrating how AI‑driven techniques improve stability, cost, and efficiency.

Intelligent Operations, automation, cloud computing
0 likes · 7 min read
dbaplus Community
Oct 15, 2019 · Big Data

How to Build Real‑Time Data Pipelines for E‑Commerce Promotions

This article examines the surge in real‑time data demands for e‑commerce promotions, outlines how to collect, compute, and deliver streaming data, compares batch and stream processing, lists typical use cases, and discusses the challenges of building scalable, low‑latency pipelines.

Data Streaming · Real-Time · monitoring
0 likes · 11 min read
Efficient Ops
Oct 14, 2019 · Operations

How AIOps Transforms IT Operations: Real-World Architecture and Lessons

This article shares a practical case study of implementing AIOps in an online‑education company, covering the background pain points of massive monitoring data, the designed architecture with real‑time processing and machine‑learning pipelines, and the challenges and opportunities of intelligent operations.

Big Data · IT Operations · aiops
0 likes · 14 min read
37 Interactive Technology Team
Sep 27, 2019 · Operations

Centralized Management of Cron Jobs: Challenges and Solutions

The article outlines how a company built a centralized cron‑job platform—using Python’s crontab library, SaltStack deployment, ELK log aggregation, and automated email alerts—to integrate existing tasks, provide reliable CRUD operations, enable fast log querying, and detect failures, cutting operational overhead while managing thousands of scheduled jobs across multiple servers.

Log Management · Operations · Python
0 likes · 8 min read
GF Securities FinTech
Sep 23, 2019 · Backend Development

Why Our Team Switched from Node.js to Go: Lessons in Backend Engineering

This article details how a high‑traffic trading app migrated from Node.js to Go, outlining Go's advantages, drawbacks, and the team's engineering practices—including environment management, dependency handling, efficiency tools, standardized libraries, testing, monitoring, and distributed tracing—to achieve robust, high‑performance backend services.

Backend Engineering · Go · ci/cd
0 likes · 16 min read
Architecture Digest
Sep 23, 2019 · Operations

Improving Application Availability: Practices, Monitoring, and Fault‑Tolerance in a Large‑Scale Payment System

The article describes how a high‑traffic payment platform achieves 99.999% availability by avoiding single points of failure, applying fail‑fast principles, implementing resource limits, building real‑time monitoring and alerting, and automating fault detection, routing, and recovery to ensure continuous 24/7 operation.

backend operations · fault tolerance · high availability
0 likes · 23 min read
Programmer DD
Sep 20, 2019 · Operations

Master Prometheus: Key Features, Architecture, and Query Essentials

This article introduces Prometheus, an open‑source cloud‑native monitoring and alerting system, covering its main characteristics, core components, architecture diagram, typical use cases, query language syntax, built‑in functions, time‑series types, and practical tips for reliable operation.

Alerting · Operations · PromQL
0 likes · 9 min read
HomeTech
Sep 19, 2019 · Industry Insights

How Autohome Scaled Its 818 Global Car Night to Millions of QPS: A Technical Deep Dive

The article details how Autohome tackled a severe market downturn by launching the 818 Global Car Night, describing the background, massive technical challenges, infrastructure scaling, high‑availability architecture, full‑link stress testing, monitoring, security measures, and the lessons learned for future large‑scale online events.

Performance Testing · Scalability · cloud computing
0 likes · 30 min read
Java Captain
Sep 19, 2019 · Backend Development

A Comprehensive Overview of Microservice Architecture and Its Evolution

This article presents a detailed, step‑by‑step illustration of microservice architecture, covering its motivations, component breakdown, migration from monoliths, common pitfalls, monitoring, tracing, logging, gateway, service discovery, resilience patterns, testing strategies, frameworks, and the emerging service‑mesh approach.

Service Mesh · fault tolerance · monitoring
0 likes · 23 min read
Architects' Tech Alliance
Sep 17, 2019 · Backend Development

Microservice Architecture Evolution: From Monolith to Service Mesh and Best Practices

This article walks through the transition of an online supermarket from a simple monolithic web application to a fully fledged microservice architecture, highlighting the challenges, design decisions, component choices, monitoring, tracing, testing, and operational practices needed for a robust, scalable system.

Deployment · Microservices · architecture
0 likes · 24 min read
dbaplus Community
Sep 16, 2019 · Operations

How to Build Effective Monitoring for Microservices: Logs, Tracing, and Metrics Explained

This article explains the three main monitoring approaches—log collection, distributed tracing, and metric gathering—in microservice architectures, outlines the layered monitoring model, lists key system, application, and user metrics, and reviews popular open‑source time‑series monitoring tools such as Prometheus, OpenTSDB, and InfluxDB.

Metrics · Microservices · Prometheus
0 likes · 10 min read
FunTester
Sep 8, 2019 · Backend Development

How to Add Real‑Time Alert Notifications for API Test Failures in Java

This article explains how to detect server‑induced empty JSON responses during API automation, integrate the free AlertOver service for instant failure alerts, and provides complete Java code for a robust getHttpResponse method and an AlertOver utility class to send system, function, business, and reminder messages.

API testing · Backend · alert notification
0 likes · 9 min read
360 Tech Engineering
Sep 6, 2019 · Operations

StackStorm-Based ChatOps Solution for Automated Monitoring Alert Self‑Healing

This article introduces a StackStorm‑driven ChatOps framework that consolidates monitoring alerts, applies rule‑based root‑cause analysis, and automatically executes self‑healing actions, outlining its architecture, components, workflow definitions, and practical deployment results within an enterprise operations environment.

ChatOps · Operations Automation · StackStorm
0 likes · 6 min read
Aotu Lab
Sep 6, 2019 · Frontend Development

How We Revamped Our Homepage with TypeScript, Webpack, and Accessibility Enhancements

The article details a comprehensive homepage redesign that introduced strict TypeScript type checking, migrated to a customized Webpack build, added Nightwatch.js automated tests, upgraded monitoring with BadJS and performance metrics, implemented skeleton screens, and improved accessibility for visually impaired users.

Automated Testing · Frontend Optimization · TypeScript
0 likes · 16 min read
DevOps Cloud Academy
Sep 5, 2019 · Operations

An Overview of the Prometheus Monitoring System

Prometheus is an open‑source monitoring and alerting toolkit originally developed at SoundCloud and now a CNCF project. It offers a multidimensional data model, a flexible query language, pull‑based data collection, four core metric types (counter, gauge, summary, histogram), local and remote storage, and service discovery, and it integrates with Grafana for visualization.

Cloud Native · Metrics · Operations
0 likes · 8 min read
Liangxu Linux
Sep 4, 2019 · Operations

Automate Linux Memory & Swap Monitoring with Email Alerts

This guide walks through installing the msmtp email client, configuring mutt, using the free command to capture memory and swap statistics, writing Bash scripts to log and email the data, and scheduling the tasks with cron so alerts are sent when swap usage exceeds 80%.

Email · System Administration · monitoring
0 likes · 8 min read
MaGe Linux Operations
Sep 4, 2019 · Operations

Essential Linux Ops Tools: From Nethogs to Fail2ban with Installation Guides

This article presents a curated collection of practical Linux operation tools—including Nethogs, IOZone, IOTop, IPtraf, IFTop, HTop, NMON, MultiTail, Fail2ban, Tmux, Agedu, NMap, and Httperf—along with download links, installation commands, usage tips, and illustrative screenshots to help system administrators enhance monitoring, performance testing, and security.

monitoring
0 likes · 13 min read
Youzan Coder
Sep 4, 2019 · Cloud Native

How Youzan Built a Highly Available Kubernetes Platform for Massive E‑commerce

This article explains why Youzan chose Kubernetes, describes their multi‑IDC, multi‑cluster architecture with high‑availability master components, logging and monitoring solutions, custom service exposure, image building process, lifecycle hooks, continuous delivery pipeline, operational challenges faced, and future plans such as operators and auto‑scaling.

Kubernetes · Multi-Cluster · ci/cd
0 likes · 11 min read