Topic: monitoring

Collection size: 1642 articles
vivo Internet Technology
Dec 18, 2024 · Big Data

Kafka Streams: Architecture, Configuration, and Monitoring Use Cases

Kafka Streams is a client library for low‑latency, fault‑tolerant real‑time processing of Kafka data, built on configurable topologies, time semantics, and state stores. The article explains its architecture and essential configurations, walks through a monitoring‑focused ETL example, and covers performance tuning and strategies for handling partition skew.

Big Data, ETL, Java
25 min read
Architecture Digest
Jun 19, 2024 · Operations

Log Collection Solution: Filebeat + Graylog – Configuration and Deployment Guide

This article provides a comprehensive guide on building a unified log collection system using Filebeat and Graylog, covering the tools' concepts, configuration files, component functions, Docker deployment, and practical code examples for efficient log monitoring across multiple environments.

Docker, ELK, Filebeat
14 min read
Linux Ops Smart Journey
Jun 11, 2025 · Cloud Native

Master Cloud‑Native Monitoring: Deploy Prometheus Operator with Helm

This guide explains why traditional monitoring falls short in cloud‑native environments and shows step‑by‑step how to install and configure the Prometheus Operator on Kubernetes using Helm, including custom image settings, storage configuration, and verification of the deployed services.

Cloud Native, Helm, Kubernetes
7 min read
Tencent Cloud Developer
Mar 19, 2025 · Cloud Native

Kubernetes Monitoring: Why It’s Needed, Core Components, and Metric Exposure

Monitoring Kubernetes is essential for detecting resource contention, component failures, and network issues. It involves tracking core component metrics (API server latency, etcd write times, scheduler delays), node‑level CPU, memory, disk, and network statistics, pod health, and custom application metrics exposed via Prometheus exporters.

Cloud Native, Exporters, Kubernetes
23 min read
Meitu Technology
Jun 12, 2019 · Cloud Computing

Meitu's Cloud-Based Image Beautification and Large-Scale Video Processing Architecture

Meitu replaced on-device beautification and video processing with a cloud-native architecture that routes requests by region and uses a dedicated upload SDK for detailed monitoring. It combines edge computing, a configuration-driven plug-in framework, and Kubernetes-based elastic scaling to deliver fast, reliable, globally distributed image and video services.

Edge Computing, Meitu, cloud computing
12 min read
OPPO Kernel Craftsman
Aug 30, 2024 · Cloud Native

Middleware Containerization and Cloud‑Native Transformation at OPPO

OPPO transformed its sprawling, manually‑provisioned middleware clusters into a cloud‑native, containerized platform by building custom Kubernetes controllers, IP‑preserving StatefulSets, resource‑isolated containers, automated monitoring and self‑healing workflows, enabling rapid provisioning, efficient utilization, fault‑tolerant scaling and future serverless and service‑mesh integration.

Cloud Native, Containerization, Fault Tolerance
20 min read
JD Retail Technology
Nov 13, 2024 · R&D Management

Guidelines for New Project Managers: Initiation, Planning, Execution, and Monitoring

This article shares practical advice for novice project managers, covering the four process groups—initiation, planning, execution, and monitoring—through real‑world examples, stakeholder identification, risk handling, change control, and communication techniques to help them deliver value and grow their teams.

execution, initiation, monitoring
25 min read
Beike Product & Technology
Jul 31, 2020 · Mobile Development

Design and Evolution of a Mobile Live‑Streaming Platform at Beike

This article describes how Beike built, refined, and scaled a mobile live‑streaming platform—detailing early challenges, architectural pain points of version 1.0, and the systematic improvements introduced in version 2.0 such as clear boundaries, functional aggregation, layered platform design, dynamic configuration, monitoring, and zero‑cost integration to support diverse business scenarios.

SDK, dynamic-configuration, live streaming
11 min read
Beijing SF i-TECH City Technology Team
Jul 10, 2023 · Mobile Development

Mobile Application Quality System – Standard Operating Procedure (SOP)

This document outlines a comprehensive Standard Operating Procedure for building and maintaining a mobile application quality system, covering background, pre‑emptive planning, coding standards, branch management, code review, AI‑assisted tools, monitoring, issue handling, and continuous improvement to ensure stable, high‑quality mobile products.

AI tools, SOP, code management
27 min read
Refining Core Development Skills
Oct 19, 2020 · Operations

Linux Network Packet Monitoring and Tuning: Tools, RingBuffer, Interrupts, and SoftIRQ Optimization

This article explains how to monitor and tune Linux network packet reception using tools such as ethtool, ifconfig, and procfs, covering RingBuffer inspection, hardware and soft interrupt analysis, multi‑queue configuration, interrupt coalescing, and GRO settings to improve throughput and reduce packet loss.

Kernel, Linux, Network
17 min read
JD Tech
Mar 13, 2025 · Operations

Ensuring Stability of the Double 11 Supply‑Chain Dashboard: Full‑Link Process, Risk Points, and Technical Safeguards

This article details how JD Logistics guarantees the stability of its Double 11 supply‑chain dashboard by mapping the entire data‑flow, identifying risk points across ingestion, processing, storage, service, and monitoring layers, and applying targeted technical and organizational safeguards.

Big Data, dashboard, monitoring
10 min read
Ctrip Technology
Sep 23, 2024 · Frontend Development

Intelligent Alert Attribution System for Ctrip Hotel Frontend: Design, Implementation, and Outcomes

This article details the design and deployment of an intelligent alert attribution system for Ctrip Hotel's front‑end, describing the background challenges, the unified data pool, weighted alert rules, three attribution algorithms, achieved improvements in accuracy and troubleshooting speed, and future enhancement plans.

Alert, Attribution, Data Pipeline
18 min read
Ctrip Technology
Oct 10, 2018 · Operations

Design and Implementation of Ctrip's Fourth-Generation Full-Link Performance Testing System

This article outlines the evolution of Ctrip’s performance testing approaches across three generations, analyzes their limitations, and presents the design, architecture, data construction, request tracing, monitoring, and operational considerations of the fourth-generation full‑link testing platform, including case studies and future outlook.

Capacity Planning, full-link testing, load testing
14 min read
58 Tech
Nov 27, 2024 · Operations

Building an Observability System for Cloud Authentication: Practices, Metrics, and Lessons Learned

This article details how 58 Group’s cloud authentication service introduced an observability framework—optimizing logs, employing distributed tracing, defining SLO/SLA metrics, and implementing burn‑rate alerts—to improve fault detection, reduce false alarms, and achieve faster root‑cause analysis across the system.

Distributed Tracing, Error Budget, SLO
16 min read
IT Architects Alliance
Jan 6, 2025 · Operations

Ensuring High Reliability in Distributed Systems: Redundancy, Fault Detection, Replication, and Resilience Strategies

The article explores how distributed systems achieve high reliability through redundant design, precise fault detection and recovery, data replication and synchronization, coordinated fault tolerance and load balancing, distributed transaction handling, comprehensive monitoring, elastic scaling, security safeguards, and robust disaster‑recovery planning.

Distributed Systems, Fault Tolerance, monitoring
18 min read
Java Architect Essentials
Oct 27, 2024 · Operations

Integrating Prometheus with Spring Boot for Real‑time Monitoring and Grafana Visualization

This article explains how to use Prometheus together with Spring Boot Actuator and Micrometer to collect, expose, and visualize application metrics, including step‑by‑step dependency configuration, YAML settings, Docker deployment of Prometheus and Grafana, and adding custom metrics for comprehensive monitoring.

Actuator, Grafana, Micrometer
10 min read
Code Ape Tech Column
Jul 26, 2024 · Operations

Bash Scripts for File Consistency Checks, Log Monitoring, and System Automation

This article presents a comprehensive collection of Bash scripts that perform tasks such as verifying file consistency across servers, scheduled log cleaning, network traffic monitoring, numeric analysis in files, automated FTP downloads, interactive number games, Nginx 502 detection, variable assignments, bulk file renaming, IP address validation, and various system administration operations.

Automation, System Administration, bash
24 min read
FunTester
Mar 2, 2025 · Operations

Common Fault Propagation Patterns and Prevention Strategies in Distributed Systems

The article examines typical fault propagation scenarios such as avalanche effects, cascading failures, resource exhaustion, data pollution, and dependency cycles in distributed systems, and outlines proactive measures like rate limiting, circuit breaking, isolation, monitoring, and chaos engineering to prevent small issues from escalating into large-scale outages.

Chaos Engineering, Circuit Breaker, Distributed Systems
11 min read
Architect
Dec 31, 2024 · Operations

Integrating Prometheus with Spring Boot and Visualizing Metrics Using Grafana

This guide explains how to monitor a Spring Boot application using Prometheus, configure Spring Boot Actuator, run Prometheus (including Docker deployment), set up Grafana for visualizing metrics, and create custom metrics with Micrometer, providing step‑by‑step instructions and code examples.

Actuator, Docker, Grafana
10 min read
Architect
Dec 27, 2024 · Big Data

Fault Self‑Healing System for Large‑Scale Big Data Clusters

This article describes the design, architecture, and technical implementation of BMR's fault self‑healing platform, which automatically collects data, analyzes failures, defines decision rules, and executes safe recovery workflows to improve reliability and efficiency of massive, heterogeneous big‑data environments.

Automation, Big Data, cluster management
16 min read