How We Reduced WebMonitor Latency from Minutes to Seconds – Architecture & Performance Secrets

This article chronicles the evolution of the WebMonitor front‑end monitoring system, detailing its three‑tier stack, data pipeline upgrades from raw disk sampling to HDFS and Elasticsearch, extensive collector‑side optimizations, Jetty thread and timeout tuning, and the resulting performance gains that lowered response times from minutes to sub‑second levels.

WecTeam

Introduction

WebMonitor is a front‑end monitoring system that collects reports from WeChat Mini‑Programs, H5 pages, the JD Joy App, and some PC pages. During high‑traffic events such as the early‑2020 mask‑buying rush, the service suffered minute‑level outages, prompting a two‑year overhaul of its collector side.

WebMonitor Basic Architecture

The system, called the WebMonitor Stack, consists of three layers: Collector, Manager, and Data. Data flows through these layers in that order.

Collector Layer

The collector sits behind Nginx and receives front‑end reports. It is built from two Flume agents.

The first Flume agent parses incoming requests, extracts business (biz) and badJs data, and forwards the parsed data to Athena and UMP before passing it downstream.

The second Flume agent persists data to HDFS and Kafka, enabling queries via Impala+HDFS or Elasticsearch (ES).
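As an illustration of the parsing stage, a custom Flume interceptor could classify each incoming report before routing it. This is a minimal sketch, not the project's actual code; the errMsg field and the reportType header are assumptions.

import java.nio.charset.StandardCharsets;
import java.util.List;
import org.apache.flume.Context;
import org.apache.flume.Event;
import org.apache.flume.interceptor.Interceptor;

// Illustrative only: tags each report as "biz" or "badJs" via an event header
// so downstream sinks can route it.
public class ReportTypeInterceptor implements Interceptor {

    @Override
    public void initialize() { }

    @Override
    public Event intercept(Event event) {
        String body = new String(event.getBody(), StandardCharsets.UTF_8);
        // Assumed convention: error reports carry an errMsg field.
        String type = body.contains("\"errMsg\"") ? "badJs" : "biz";
        event.getHeaders().put("reportType", type);
        return event;
    }

    @Override
    public List<Event> intercept(List<Event> events) {
        for (Event e : events) {
            intercept(e);
        }
        return events;
    }

    @Override
    public void close() { }

    public static class Builder implements Interceptor.Builder {
        @Override
        public Interceptor build() {
            return new ReportTypeInterceptor();
        }

        @Override
        public void configure(Context context) { }
    }
}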

Manager Layer

The manager provides an “Old School” UI for maintaining biz/badJs point‑info, querying errors, and delivering configuration updates to collectors.

Data Layer

Data storage has evolved through three eras to handle >4 TB daily volume:

Black‑iron era : heavy sampling (less than 1 % of reports retained) stored on local disk and aggregated onto a single node via rsync – no indexing, no parallelism.

Silver era : data written to HDFS and queried in parallel with Impala – still no indexing.

Golden era : data streamed to Kafka and then indexed in ES – parallel, indexed queries (see the query sketch after the latency figures below).

Query latency improved dramatically:

Black‑iron: 1–2 min

Silver: 30 s

Golden: <5 s
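To make the golden-era query path concrete, here is a minimal sketch of a badJs lookup using the Elasticsearch Java high-level REST client. The endpoint, the index name webmonitor-badjs, and the field names are assumptions for illustration, not the project's actual schema.

import org.apache.http.HttpHost;
import org.elasticsearch.action.search.SearchRequest;
import org.elasticsearch.action.search.SearchResponse;
import org.elasticsearch.client.RequestOptions;
import org.elasticsearch.client.RestClient;
import org.elasticsearch.client.RestHighLevelClient;
import org.elasticsearch.index.query.QueryBuilders;
import org.elasticsearch.search.builder.SearchSourceBuilder;

public class BadJsQuery {
    public static void main(String[] args) throws Exception {
        // Hypothetical ES endpoint; the real cluster address is not in the article.
        RestHighLevelClient client = new RestHighLevelClient(
                RestClient.builder(new HttpHost("es.example.internal", 9200, "http")));

        // A term query against an inverted index is what turns the former
        // full scan (Impala over HDFS) into a sub-5-second lookup.
        SearchSourceBuilder source = new SearchSourceBuilder()
                .query(QueryBuilders.termQuery("bizId", "jd-joy-app"))
                .size(100);

        SearchRequest request = new SearchRequest("webmonitor-badjs").source(source);
        SearchResponse response = client.search(request, RequestOptions.DEFAULT);
        System.out.println("hits: " + response.getHits().getTotalHits());

        client.close();
    }
}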

Collector‑Side Architecture Optimization

The old collector flow processed four steps sequentially: receive → parse (compute) → report (IO) → store (high IO). The synchronous IO caused severe bottlenecks under massive load.

The new flow introduces three key changes:

Discard high‑IO operations in the synchronous path, e.g., avoid disk writes during critical request handling.

Decouple monitoring upload from data persistence via an in‑memory Event Bus Channel, making the three stages (receive, upload, persist) fully asynchronous – see the sketch after this list.

Persist data to a distributed file system (HDFS) instead of a single machine, eliminating the need for periodic rsync.
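A minimal sketch of the decoupling idea, using a plain Java BlockingQueue in place of the collector's actual Flume channel; the class and method names are illustrative.

import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

public class AsyncCollectorSketch {
    // Bounded in-memory channel between the receive stage and the slow stages.
    private final BlockingQueue<String> channel = new ArrayBlockingQueue<>(100_000);
    private final ExecutorService workers = Executors.newFixedThreadPool(4);

    // Stage 1: request handling only parses and enqueues – no disk or network IO.
    public void receive(String rawReport) {
        // offer() drops the report when the channel is full instead of blocking
        // the request thread; a real system would count these drops.
        channel.offer(rawReport);
    }

    // Stages 2 and 3: monitoring upload and persistence run on background threads.
    public void start() {
        workers.submit(() -> {
            while (!Thread.currentThread().isInterrupted()) {
                try {
                    String report = channel.take();
                    uploadMetrics(report);   // e.g. aggregated monitoring report
                    persist(report);         // e.g. batch write toward HDFS/Kafka
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                }
            }
        });
    }

    private void uploadMetrics(String report) { /* placeholder */ }
    private void persist(String report) { /* placeholder */ }
}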

Performance comparison (average latency, TP99, max):

Old flow: 2.1 ms / 300 ms / 2000 ms

New flow: 0.7 ms / 200 ms / 2000 ms

Monitoring Upload Optimization

With a daily peak of 3 million QPM, the original UMP upload suffered from lack of aggregation for failed methods and identical keys, creating a bottleneck.

An initial attempt moved the upload entirely to a downstream Kafka consumer, but this introduced a long dependency chain and left resources underutilized.

The final design adds a Sink after the collector’s channel, performing aggregation of both success and failure reports within the collector process, thus eliminating per‑request IO spikes.
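A rough sketch of the in-process aggregation idea: counters keyed by monitoring key accumulate in memory and are flushed on a fixed interval, so upload cost no longer scales with request volume. The class name, key format, and 10-second flush interval are assumptions; the real sink reports through UMP's own API.

import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.LongAdder;

public class AggregatingMonitorSink {
    private final ConcurrentHashMap<String, LongAdder> counters = new ConcurrentHashMap<>();
    private final ScheduledExecutorService flusher = Executors.newSingleThreadScheduledExecutor();

    public AggregatingMonitorSink() {
        // One upload per key per interval, regardless of how many requests arrived.
        flusher.scheduleAtFixedRate(this::flush, 10, 10, TimeUnit.SECONDS);
    }

    // Called for every event taken from the channel; success and failure share the path.
    public void record(String key, boolean success) {
        String bucket = key + (success ? "|ok" : "|fail");
        counters.computeIfAbsent(bucket, k -> new LongAdder()).increment();
    }

    private void flush() {
        for (Map.Entry<String, LongAdder> e : counters.entrySet()) {
            long count = e.getValue().sumThenReset();
            if (count > 0) {
                // Placeholder for the actual monitoring upload call.
                System.out.println("upload " + e.getKey() + " count=" + count);
            }
        }
    }
}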

Jetty Server Thread Optimization

Jetty serves as a lightweight Java servlet container for WebMonitor. Its default thread pool (max 200, min 8) is suited for typical services that perform external calls, but WebMonitor is a “zero‑external‑call” HTTP service.

Adjustments made:

Reduce worker threads to 32 (or fewer on an 8‑core machine) to protect both the service and upstream Nginx under extreme load.

Configure connect threads via a thread pool to avoid a single point of failure.

Jetty’s default constructors show these values (a tuned configuration is sketched below):

public QueuedThreadPool() {
    this(200);
}

public QueuedThreadPool(@Name("maxThreads") int maxThreads) {
    this(maxThreads, Math.min(8, maxThreads));
}
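A minimal sketch of a tuned embedded Jetty setup applying the adjustments above; the acceptor/selector counts and the port are illustrative, since the article only states the 32-thread ceiling.

import org.eclipse.jetty.server.Server;
import org.eclipse.jetty.server.ServerConnector;
import org.eclipse.jetty.util.thread.QueuedThreadPool;

public class TunedJettyServer {
    public static void main(String[] args) throws Exception {
        // Far below the 200-thread default: the service makes no external calls,
        // so a small pool protects both the JVM and the upstream Nginx.
        QueuedThreadPool threadPool = new QueuedThreadPool(32, 8);

        Server server = new Server(threadPool);

        // Explicit acceptor/selector counts rather than a single accept thread.
        ServerConnector connector = new ServerConnector(server, 2, 4);
        connector.setPort(8080);
        server.addConnector(connector);

        server.start();
        server.join();
    }
}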

Jetty Server Timeout Optimization

Three timeout strategies were evaluated:

No Timeout: minimal memory load but high connection churn.

Short Timeout (100 ms): protects Nginx from slow services but harms connection reuse.

Long Timeout (8000 ms, matching the Nginx keep‑alive): best for high availability, and chosen as the final setting (a small sketch follows).
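The chosen strategy maps to a single connector setting. A small sketch continuing the server setup above; the 8000 ms value is the article's, aligned with Nginx's keepalive_timeout.

import org.eclipse.jetty.server.ServerConnector;

class TimeoutTuning {
    // Apply the long-timeout strategy to the connector built in the earlier sketch.
    static void applyLongTimeout(ServerConnector connector) {
        // 8000 ms matches the upstream Nginx keep-alive, so connections are
        // reused for many requests instead of being torn down and re-established.
        connector.setIdleTimeout(8000);
    }
}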

Unexpected Optimization: CPU Core Detection in Docker

The common Java call Runtime.getRuntime().availableProcessors() returned the host’s core count instead of the container’s limit in older JDK versions, leading to over‑provisioned thread pools.

JDK 1.8.0_190 fixed this detection, ensuring thread pools match the true CPU resources inside containers.
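A quick check illustrates the problem and why thread pools sized from this value went wrong in containers; the 4-CPU limit mentioned in the comment is an arbitrary example.

import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

public class CpuBudgetCheck {
    public static void main(String[] args) {
        // Older JDK 8 builds report the host's physical core count here even when
        // the container is limited to, say, 4 CPUs; container-aware builds report
        // the cgroup limit instead.
        int cores = Runtime.getRuntime().availableProcessors();
        System.out.println("availableProcessors = " + cores);

        // Any pool sized from this value inherits the same mistake on old JDKs.
        ExecutorService pool = Executors.newFixedThreadPool(cores);
        pool.shutdown();
    }
}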

Final Performance Results

After all optimizations:

Old flow: 30 k QPM, Avg 2.1 ms, TP99 300 ms, Max 2000 ms

New flow: 41 k QPM, Avg 0.7 ms, TP99 200 ms, Max 2000 ms

Optimized overall: 81 k QPM, Avg 0.8 ms, TP99 3 ms, Max 25 ms

Java Backend Engineer Takeaways

80 % of performance issues stem from GC; focus on the root causes.

Over‑provisioned threads hurt performance; size thread pools wisely.

Never apply generic recommendations blindly; tailor solutions to your workload.

Use tools like Arthas and hardware monitors to pinpoint bottlenecks.

Republication Notice

This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contact admin@besthub.dev and we will review it promptly.

Tags: Java, Monitoring, Data Pipeline, Jetty
Written by WecTeam

WecTeam (维C团) is the front‑end technology team of JD.com’s Jingxi business unit, focusing on front‑end engineering, web performance optimization, mini‑program and app development, serverless, multi‑platform reuse, and visual building.
