How Ele.me’s Rider App Achieves End‑to‑End Business Availability Monitoring
This article details Ele.me Logistics' mobile‑app monitoring architecture—E‑Monitor, TimeBomb, Dogger, and EDW—explaining how each layer collects, visualizes, and analyzes business‑level availability data, and showcases a real‑world debugging case that leveraged the stack to resolve an HTTP/2 connectivity bug.
E‑Monitor: Global Business Monitoring
Client‑side business monitoring is required because backend API metrics cannot fully reflect user experience. A request flow consists of data preparation, network transmission, callbacks, parsing, and rendering; any failure impacts the user. Ele.me uses the Skynet SDK for reliable data collection and upload, storing metrics in LinDB—a high‑write, high‑query time‑series database. The E‑Monitor visualization layer provides real‑time dashboards and threshold‑based alerts.
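The threshold-based alerting described above can be sketched in a few lines. This is a minimal illustration, not E-Monitor's implementation: the class SuccessRateAlert, its method names, and the sample counts are hypothetical, and a real deployment would read the per-window counts from the time-series store rather than from arrays.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper; not part of Skynet, LinDB, or E-Monitor.
final class SuccessRateAlert {
    /** Returns the success rate as a percentage, or 100.0 when there were no requests. */
    static double successRate(long succeeded, long total) {
        if (total <= 0) return 100.0;
        return 100.0 * succeeded / total;
    }

    /** Returns the indexes of the time windows whose success rate is below the threshold. */
    static List<Integer> windowsBelow(long[] succeeded, long[] total, double thresholdPercent) {
        List<Integer> alerts = new ArrayList<>();
        for (int i = 0; i < total.length; i++) {
            if (successRate(succeeded[i], total[i]) < thresholdPercent) alerts.add(i);
        }
        return alerts;
    }

    public static void main(String[] args) {
        // Made-up counts for three consecutive windows.
        long[] ok = {9990, 9869, 10000};
        long[] all = {10000, 10000, 10000};
        System.out.println(windowsBelow(ok, all, 99.0)); // prints [1]
    }
}
```

Treating an empty window as 100% avoids false alerts during quiet periods; whether that is the right choice depends on the business metric.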
TimeBomb: Exception Event Monitoring
TimeBomb supplements global business monitoring by tracking abnormal events, that is, operations that fail to meet a configured time or count threshold. Parameters such as retry counts, intervals, and sampling rates can be configured remotely. Typical use cases include repeated login failures, failed delivery confirmations, and location errors. The UI shows an anomaly curve per tag and detailed logs for a selected node.
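One way to picture a count threshold is a sliding-window counter per tag. The sketch below is an assumption about how such a check could work, not TimeBomb's actual API: the class RepeatDetector, the tag name, and the parameter values are all hypothetical.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch; the article does not show TimeBomb's real interface.
final class RepeatDetector {
    private final int maxCount;
    private final long windowMillis;
    private final Map<String, Deque<Long>> recent = new HashMap<>();

    RepeatDetector(int maxCount, long windowMillis) {
        this.maxCount = maxCount;
        this.windowMillis = windowMillis;
    }

    /** Records one occurrence; returns true once the tag has occurred maxCount times within the window. */
    boolean record(String tag, long nowMillis) {
        Deque<Long> times = recent.computeIfAbsent(tag, k -> new ArrayDeque<>());
        times.addLast(nowMillis);
        // Drop occurrences that have fallen out of the sliding window.
        while (!times.isEmpty() && nowMillis - times.peekFirst() > windowMillis) {
            times.removeFirst();
        }
        return times.size() >= maxCount;
    }

    public static void main(String[] args) {
        RepeatDetector detector = new RepeatDetector(3, 60_000);
        System.out.println(detector.record("login_failed", 0));      // false
        System.out.println(detector.record("login_failed", 10_000)); // false
        System.out.println(detector.record("login_failed", 20_000)); // true
    }
}
```

In a remotely configurable setup, maxCount and windowMillis would come from the server-side configuration rather than from constructor literals.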
Dogger: Single‑Point Log Monitoring
Dogger consists of the Trojan log‑upload SDK and the Dogger‑Service log‑analysis backend. Trojan injects AOP‑based instrumentation, uses mmap for fast file writes, and compresses logs with gzip (up to 50× reduction). It captures user actions, page lifecycles, network requests, battery, memory, and thread information.
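The combination of mmap writes and gzip compression can be illustrated with the JDK alone. This is a rough sketch under stated assumptions, not Trojan's code: the class MmapLogSketch, the log line format, and the buffer capacity are hypothetical, and the compression ratio depends entirely on how repetitive the logs are.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;
import java.util.Arrays;
import java.util.zip.GZIPOutputStream;

// Hypothetical sketch; not the Trojan SDK.
final class MmapLogSketch {
    /** Writes lines into a memory-mapped temp file, then returns the gzip-compressed content. */
    static byte[] writeAndCompress(int capacity, String... lines) {
        try {
            Path file = Files.createTempFile("rider-log", ".txt");
            file.toFile().deleteOnExit();
            try (FileChannel ch = FileChannel.open(file,
                    StandardOpenOption.READ, StandardOpenOption.WRITE)) {
                // Mapping in READ_WRITE mode grows the file to the requested capacity.
                MappedByteBuffer buf = ch.map(FileChannel.MapMode.READ_WRITE, 0, capacity);
                for (String line : lines) {
                    byte[] bytes = (line + "\n").getBytes(StandardCharsets.UTF_8);
                    if (bytes.length > buf.remaining()) break; // a real logger would roll over to a new file
                    buf.put(bytes); // a memory copy; the OS writes the pages back to disk
                }
                buf.force();
                buf.flip(); // limit the buffer to the bytes actually written
                byte[] raw = new byte[buf.remaining()];
                buf.get(raw);
                return gzip(raw);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** Compresses data with gzip before upload. */
    static byte[] gzip(byte[] data) throws IOException {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        try (GZIPOutputStream gz = new GZIPOutputStream(out)) {
            gz.write(data);
        }
        return out.toByteArray();
    }

    public static void main(String[] args) {
        String[] lines = new String[500];
        Arrays.fill(lines, "12:00:00|network|GET /order/list 200");
        System.out.println("compressed bytes: " + writeAndCompress(64 * 1024, lines).length);
    }
}
```

Writing through a mapped buffer avoids a system call per log line, which is the usual reason mobile loggers prefer mmap over buffered file streams.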
Log parsing is handled by Dogger‑Service, which offers three modules:
ActionChart: visualizes daily rider page transitions, battery, memory, network switches, location frequency, and request rates.
Origin: fast file parsing, time‑based search, tag filtering, and keyword highlighting to locate precise log entries.
Statistics: tag‑based data mining for metrics such as battery, network, traffic, stalls, requests, lifecycle, memory, and location.
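The time-based search and tag filtering offered by the Origin module can be approximated as follows. The line format "HH:mm:ss|tag|message" and the class OriginFilterSketch are assumptions made for this sketch; the article does not describe Dogger's actual log format, and keyword highlighting is left out.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical sketch; the real Dogger-Service parser is not shown in the article.
final class OriginFilterSketch {
    /** Keeps lines of the assumed form "HH:mm:ss|tag|message" that carry the tag and fall within [from, to]. */
    static List<String> filter(List<String> lines, String tag, String from, String to) {
        List<String> hits = new ArrayList<>();
        for (String line : lines) {
            String[] parts = line.split("\\|", 3);
            if (parts.length < 3) continue; // skip malformed lines
            String time = parts[0];
            // Zero-padded HH:mm:ss strings compare lexicographically in time order.
            boolean inRange = time.compareTo(from) >= 0 && time.compareTo(to) <= 0;
            if (inRange && parts[1].equals(tag)) hits.add(line);
        }
        return hits;
    }

    public static void main(String[] args) {
        List<String> logs = Arrays.asList(
                "08:59:59|network|wifi -> 4g",
                "09:00:05|location|fix lost",
                "09:10:00|network|4g -> wifi");
        System.out.println(filter(logs, "network", "09:00:00", "10:00:00"));
        // prints [09:10:00|network|4g -> wifi]
    }
}
```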
EDW: Offline Data Warehouse
The offline reporting layer (EDW) complements real‑time dashboards by providing fine‑grained, day‑long analyses. It aggregates per‑order health, failure ratios, and complex conditional metrics, enabling a “god‑view” of the entire business line. Ele.me’s self‑built EDW platform supports instant queries, data extraction, computation, and monitoring for traffic, location quality, multi‑device usage, offline delivery, push quality, and order anomalies.
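The kind of aggregation such a report performs, a failure ratio plus a breakdown by cause, can be sketched as below. The class FailureBreakdown and the sample cause names are hypothetical; an actual warehouse job would run over a full day of uploaded records rather than an in-memory list.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch of an offline aggregation; not the EDW platform's code.
final class FailureBreakdown {
    /** Returns the share of failed requests in [0, 1], or 0 when there were no requests. */
    static double failureRatio(long failed, long total) {
        return total <= 0 ? 0.0 : (double) failed / total;
    }

    /** Counts how many failures were recorded for each cause. */
    static Map<String, Integer> countByCause(List<String> causes) {
        Map<String, Integer> counts = new HashMap<>();
        for (String cause : causes) counts.merge(cause, 1, Integer::sum);
        return counts;
    }

    /** Returns the cause with the highest count, or null when there are no failures. */
    static String dominant(Map<String, Integer> counts) {
        String best = null;
        int bestCount = -1;
        for (Map.Entry<String, Integer> e : counts.entrySet()) {
            if (e.getValue() > bestCount) {
                best = e.getKey();
                bestCount = e.getValue();
            }
        }
        return best;
    }

    public static void main(String[] args) {
        // Made-up sample of failure causes.
        List<String> causes = Arrays.asList("IOException", "UnknownHostException",
                "IOException", "SocketTimeoutException", "IOException");
        System.out.println(dominant(countByCause(causes))); // prints IOException
    }
}
```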
Practical Case Study: Debugging a Network‑Layer Issue
Step 1: Grafana showed the Android rider order request success rate dropping to 98.69% (target >99%). DNS failures were observed in weak‑network scenarios, but these were considered expected. The failing request IDs were absent from the backend trace system. Because the Skynet network interceptor sits at the very end of the interceptor chain, the failures it recorded must have occurred before the requests reached the server.
Step 2: EDW aggregation identified IOException as the dominant failure cause.
Step 3: The team upgraded OkHttp to 3.11 and added an EventListener to capture request lifecycle events. The logs showed that some requests, even under good network conditions, threw errors or timed out after the responseHeadersStart callback. A source code review uncovered a bug in OkHttp’s HTTP/2 connection‑pool reuse logic.

    private void establishProtocol(ConnectionSpecSelector connectionSpecSelector,
            int pingIntervalMillis, Call call, EventListener eventListener) throws IOException {
        // Cleartext connection: there is no TLS handshake to negotiate the protocol.
        if (route.address().sslSocketFactory() == null) {
            if (route.address().protocols().contains(Protocol.H2_PRIOR_KNOWLEDGE)) {
                // HTTP/2 over cleartext, agreed on in advance ("prior knowledge").
                socket = rawSocket;
                protocol = Protocol.H2_PRIOR_KNOWLEDGE;
                startHttp2(pingIntervalMillis);
                return;
            }

            socket = rawSocket;
            protocol = Protocol.HTTP_1_1;
            return;
        }

        // TLS connection: the handshake determines the protocol.
        eventListener.secureConnectStart(call);
        connectTls(connectionSpecSelector);
        eventListener.secureConnectEnd(call, handshake);

        if (protocol == Protocol.HTTP_2) {
            startHttp2(pingIntervalMillis);
        }
    }

The bug stemmed from noNewStreams being false for HTTP/2, which prevented the connection from being removed from the pool:

    boolean connectionBecameIdle(RealConnection connection) {
        if (connection.noNewStreams || maxIdleConnections == 0) {
            // The connection may not carry new streams, or idle pooling is disabled: evict it.
            connections.remove(connection);
            return true;
        } else {
            // Keep the connection in the pool for reuse.
            return false;
        }
    }

The backend's switch to the SoPush service, which uses HTTP/2, coincided with the observed anomaly window. OkHttp versions before 3.10 lacked robust HTTP/2 ping validation, and even 3.11 could still exhibit the problem. The fix was to force the Android rider app to use HTTP/1.1; HTTP/2 was re-enabled later, after the corporate network library had been integrated. Once the updated internal build was released, the success‑rate curve returned to normal.
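To make the failure mode concrete, the pooling decision can be reproduced with a toy model. The classes below are a deliberately simplified illustration and are not OkHttp's: PoolModel, Conn, and the healthy flag are hypothetical names that mirror only the branch structure of connectionBecameIdle quoted above.

```java
import java.util.ArrayList;
import java.util.List;

// Simplified model for illustration only; these classes are NOT OkHttp's.
final class PoolModel {
    static final class Conn {
        final String protocol;
        boolean noNewStreams;   // true when the connection must not carry further requests
        boolean healthy = true; // the pooling decision below never looks at this flag
        Conn(String protocol) { this.protocol = protocol; }
    }

    final List<Conn> connections = new ArrayList<>();
    final int maxIdleConnections;

    PoolModel(int maxIdleConnections) { this.maxIdleConnections = maxIdleConnections; }

    /** Same branch structure as the connectionBecameIdle snippet quoted in the article. */
    boolean connectionBecameIdle(Conn c) {
        if (c.noNewStreams || maxIdleConnections == 0) {
            connections.remove(c);
            return true;  // evicted from the pool
        }
        return false;     // kept in the pool for reuse
    }

    public static void main(String[] args) {
        PoolModel pool = new PoolModel(5);
        Conn h2 = new Conn("h2");
        pool.connections.add(h2);

        h2.healthy = false; // the link breaks, but noNewStreams is still false
        System.out.println(pool.connectionBecameIdle(h2)); // false: the broken connection stays pooled

        h2.noNewStreams = true; // once flagged, the connection is evicted
        System.out.println(pool.connectionBecameIdle(h2)); // true
    }
}
```

In this model a broken connection is reused until something sets noNewStreams, which matches the symptom of requests failing under otherwise good network conditions.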
Conclusion
Ele.me Logistics’ multi‑layer monitoring system—E‑Monitor, TimeBomb, Dogger, and EDW—provides comprehensive visibility from global dashboards to per‑rider logs, enabling rapid detection, root‑cause analysis, and remediation of client‑side availability issues. Ongoing challenges include data redundancy, monitoring‑induced performance overhead, and continuous iteration of the monitoring stack.
Trojan SDK open‑source: https://github.com/eleme/Trojan
dbaplus Community