monitoring | BestHub

Collection size: 1767 articles

Didi Tech

Aug 9, 2023 · Backend Development

Upgrading Didi Elasticsearch to JDK 17 with ZGC: Challenges, Solutions, and Performance Gains

Didi upgraded its self‑developed Elasticsearch from JDK 11/G1 to JDK 17, adopting ZGC for latency‑critical clusters and a tuned G1 for throughput‑oriented ones. The upgrade eliminated long GC pauses, reduced query latency by up to 96%, cut CPU usage, and dramatically improved stability across multiple production clusters.

Backend · Elasticsearch · GC Optimization

0 likes · 14 min read

Didi Tech

Jul 11, 2023 · Operations

DevOps Practices and Challenges at Didi Ride‑Hailing: From Development to Operations

Didi’s ride‑hailing R&D team addresses efficiency and stability challenges of a large micro‑service ecosystem by unifying a Go stack, common framework, and data models, using eBPF traffic recording for automated regression testing, and applying AIOps alert filtering, knowledge‑graph root‑cause analysis, and a localization robot for rapid fault recovery, while targeting full CI/CD automation with static analysis, service‑mesh observability, and chaos engineering.

AIOps · CloudNative · DevOps

0 likes · 22 min read

Didi Tech

Jan 14, 2021 · Cloud Computing

Design and Implementation of Didi's Logi‑KafkaManager Multi‑tenant Kafka Cloud Platform

Didi’s Logi‑KafkaManager is a multi‑tenant Kafka cloud platform that consolidates dozens of clusters into a secure, isolated gateway‑driven service offering intuitive web‑based topic management, real‑time metrics visualization, automated diagnostics, quota governance and safe scaling, delivering high internal satisfaction and enterprise commercialization.

Big Data · Cloud platform · Data Security

0 likes · 17 min read

Didi Tech

Jun 3, 2020 · Backend Development

Stability Guidelines and Anti‑Patterns for Backend Services

Drawing on five years of incident reviews, the article defines a comprehensive stability framework for backend services—mandating timeout hierarchies, weak dependencies, service-discovery integration, staged gray releases, robust monitoring, capacity planning, and strict change management—while cataloguing common anti-patterns such as over-aggressive circuit breaking, static retries, improper timeouts, tight coupling, and insufficient isolation, and urging regular rehearsal of these practices.

backend stability · deployment best practices · incident management

0 likes · 21 min read

Didi Tech

Feb 18, 2020 · Backend Development

Didi Ride‑Sharing Dispatch Engine: Architecture, Challenges, and Stability Measures for Carpool Day

During Didi’s 2019 Carpool Day promotion, a surge of up to 6.6 times normal matching traffic forced a redesign of its dispatch engine, introducing near‑time assignment, filtered logic moves, configurable timeouts, extensive stress testing, monitoring, and rapid on‑call procedures that cut downstream pressure by over half.

System Architecture · capacity planning · carpool

0 likes · 11 min read

Didi Tech

Feb 18, 2020 · Operations

Didi's National Carpool Day: Technical Insights into Stability Assurance

Didi's National Carpool Day on Dec 3, 2019 attracted 3.1 million passengers. Stability was ensured through six pillars: an organized task force, capacity forecasting with rapid container scaling, comprehensive monitoring with a fire‑fighting map, a robust contingency platform, strict process standards, and coordinated third‑party preparation.

Carpool Day · Didi · Operations

0 likes · 13 min read

Didi Tech

Jan 7, 2019 · Operations

Data‑Driven Risk Quantification Platform for SRE at Didi

Didi’s data‑driven Risk Quantification Platform assigns numeric Change Credit and Monitoring Health scores to deployments, alerts and core services, turning operational best‑practice adoption into a competitive game that has raised scores, cut incident rates despite higher change volume, and paved the way for broader risk management across the organization.

Risk Quantification · SRE · data-driven operations

0 likes · 9 min read

iQIYI Technical Product Team

May 24, 2024 · Operations

High Availability and Disaster Recovery Practices of iQIYI's Video Relay Service (VRS)

iQIYI’s Video Relay Service ensures uninterrupted video playback by employing a two‑region, three‑center hybrid cloud architecture, multi‑layer storage, cross‑AZ retry mechanisms, protective rate‑limiting and degradation paths, layered monitoring, and rigorous stress‑testing and chaos engineering to achieve high availability and disaster recovery.

Backend Architecture · Cloud Native · Disaster Recovery

0 likes · 18 min read

iQIYI Technical Product Team

May 12, 2023 · Operations

Performance Troubleshooting and Optimization of Prometheus Monitoring Queries

The article explains that high metric cardinality in Prometheus causes long query times and timeouts, and demonstrates how using recording rules to pre‑compute aggregates dramatically reduces cardinality and latency, while recommending scrape interval tuning and metric design best practices to keep charts responsive.

Performance Tuning · Prometheus · Query Optimization

0 likes · 10 min read

iQIYI Technical Product Team

Mar 12, 2021 · Operations

Implementation and Practice of LEDAO‑CAT Monitoring System for iQIYI Content Platform

To meet the LEDAO platform’s need for rapid anomaly detection, full‑stack observability, and reliable alerting across more than 100 microservices, iQIYI evaluated OpenFalcon, Prometheus and CAT, and selected CAT. The team deployed separate mainland and overseas clusters and added configurable access, health checks and integrated alert channels, enabling five‑minute service onboarding, near‑zero‑intrusion instrumentation, and real‑time business‑level monitoring.

Alerting · DevOps · Observability

0 likes · 12 min read

iQIYI Technical Product Team

Nov 13, 2020 · Operations

Building and Optimizing a Consul‑Based Service Registry for iQIYI's Microservice Platform

iQIYI’s Consul‑based service registry, tightly integrated with its QAE container platform and API gateway, suffered a multi‑DC outage caused by network jitter and a metrics‑library lock‑contention bug, which was resolved by upgrading Go, go‑metrics, and Raft, adding extensive monitoring, redundant DC registration, and dedicated per‑gateway Consul clusters to ensure continued stability and scalability.

Consul · Microservices · Operations

0 likes · 17 min read

iQIYI Technical Product Team

Sep 18, 2020 · Operations

Full-Chain Load Testing Practices for iQIYI Payment System

iQIYI’s payment team built a full‑chain load‑testing framework that isolates data, mocks dependencies, constructs realistic multi‑service traffic, and executes protected tests to expose bottlenecks, guide scaling and optimizations, and ultimately ensure reliable payment services during traffic spikes, while planning a unified automation platform.

Payment System · Performance engineering · capacity planning

0 likes · 13 min read

iQIYI Technical Product Team

May 29, 2020 · Big Data

iQiyi's Full-Link Automated Monitoring Platform: Design and Implementation

iQIYI’s full‑link automated monitoring platform unifies tracing, metric and log collection with deep offline and real‑time analysis, delivering a DAG‑based call graph, near‑real‑time ingestion of tens of millions of logs, multi‑dimensional alerts and rapid root‑cause diagnosis that cut error‑lookup time by over 50% and now serves as a core component of the company’s microservice reference architecture.

Architecture · Metrics · Tracing

0 likes · 12 min read

iQIYI Technical Product Team

Apr 17, 2020 · Mobile Development

Building iQIYI's Mobile Middle Platform: Architecture, Decoupling, and SaaS Enablement

iQIYI’s Mobile Middle Platform decouples its multiple apps into a reusable, SaaS‑enabled architecture that centralizes services through the QMAS portal, provides ready‑made scaffolding and cross‑platform frameworks, and ensures high‑availability via comprehensive monitoring and a custom network foundation, dramatically accelerating development and unifying user experience.

CI/CD · Component Decoupling · High Availability

0 likes · 13 min read

iQIYI Technical Product Team

Apr 26, 2019 · Operations

Design and Implementation of iQIYI CDN Inspection System

iQIYI built a three‑component CDN Inspection System that automatically generates tasks, centrally processes and analyzes results, and runs edge measurements to monitor millions of hybrid CDN servers in real time, detecting configuration errors, file mismatches and traffic anomalies, enabling proactive remediation and 100% local coverage.

CDN · Operations · cloud computing

0 likes · 11 min read

iQIYI Technical Product Team

Mar 15, 2019 · Cloud Computing

Design and Architecture of QLive Large‑Scale Live Streaming Service

The QLive service powers iQIYI’s massive live‑streaming events—such as the Spring Festival Gala—by combining vertical and horizontal scaling, a three‑layer architecture with dual data‑center isolation, multi‑level caching, circuit‑breaker/degradation controls, and a Flume‑Kafka‑Hive monitoring pipeline to sustain over 400k QPS and 99.9999% availability.

Live Streaming · caching · fault tolerance

0 likes · 9 min read

37 Interactive Technology Team

Feb 8, 2024 · Operations

What Are Kubernetes Events and How to Collect Them

Kubernetes events record state changes such as pod scheduling, image pulling, and failures. They can be inspected via kubectl but are retained for only one hour by default, so tools like kube-eventer or kubernetes-event-exporter collect them for long‑term analysis, enabling monitoring of Warning types, failure reasons, and visualization through Grafana dashboards.

Cloud Native · Events · Grafana

0 likes · 9 min read

37 Interactive Technology Team

May 25, 2018 · Operations

Optimization and Redesign of Open-Falcon Monitoring System for the 37 Monitoring Platform

The project redesigns the Open‑Falcon monitoring system for the 37 platform by integrating it with the existing CMDB, adding distributed‑lock high‑availability for judge and alarm modules, optimizing cross‑region agent data transmission, fixing timezone inconsistencies, and enabling redundant query/graph services, thereby unifying disparate monitoring tools into a scalable, reliable solution.

Architecture · CMDB · High Availability

0 likes · 11 min read

HelloTech

Jan 31, 2023 · Operations

Stability Assurance Practices for Large‑Scale Promotional Events

The article outlines a comprehensive stability‑assurance framework for large‑scale promotional events—detailing planning, capacity and pressure‑test rehearsals, strict change‑freeze, internal gray releases, coordinated on‑call response, thorough link and capacity analysis, monitoring, emergency procedures, cross‑team collaboration, external partner coordination, and post‑event review to ensure resilient system performance.

Large-Scale Events · capacity planning · change control

0 likes · 17 min read

Bilibili Tech

Aug 9, 2024 · Operations

Design and Optimization of Monitoring 2.0 Architecture with VictoriaMetrics and Flink

The new Monitoring 2.0 architecture separates collection, compute and storage, adopts VictoriaMetrics for compact time‑series storage and a zone‑based scheduler, introduces push‑based ingestion, uses Flink for real‑time pre‑aggregation and automatic PromQL rewrite, delivering ten‑fold query speedups, sub‑300 ms p90 latency, and dramatically higher write and query throughput.

Flink · Metrics · Observability

0 likes · 29 min read
