How Ele.me’s Rider App Achieves End‑to‑End Business Availability Monitoring

This article details Ele.me Logistics' mobile‑app monitoring architecture—E‑Monitor, TimeBomb, Dogger, and EDW—explaining how each layer collects, visualizes, and analyzes business‑level availability data, and showcases a real‑world debugging case that leveraged the stack to resolve an HTTP/2 connectivity bug.

dbaplus Community

E‑Monitor: Global Business Monitoring

Client‑side business monitoring is necessary because backend API metrics alone cannot fully reflect the user experience. A request passes through data preparation, network transmission, callbacks, parsing, and rendering, and a failure at any of these stages affects the user. Ele.me uses the Skynet SDK for reliable data collection and upload, and stores the metrics in LinDB, a time‑series database built for heavy write and query loads. The E‑Monitor visualization layer provides real‑time dashboards and threshold‑based alerts.
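A threshold‑based alert of this kind can be sketched as a sliding‑window success‑rate check. This is only an illustration, not E‑Monitor's or Skynet's implementation: the class name `SuccessRateWindow`, the window size, and the 99% threshold are all assumptions.

```java
import java.util.ArrayDeque;
import java.util.Deque;

/** Hypothetical sliding-window success-rate tracker (not part of Skynet or E-Monitor). */
final class SuccessRateWindow {
    private final int capacity;       // number of most recent requests considered
    private final double threshold;   // alert when the rate falls below this value
    private final Deque<Boolean> outcomes = new ArrayDeque<>();
    private int successes = 0;

    SuccessRateWindow(int capacity, double threshold) {
        if (capacity <= 0) {
            throw new IllegalArgumentException("capacity must be positive");
        }
        this.capacity = capacity;
        this.threshold = threshold;
    }

    /** Records one request outcome, evicting the oldest once the window is full. */
    void record(boolean success) {
        if (outcomes.size() == capacity) {
            boolean evicted = outcomes.removeFirst();
            if (evicted) {
                successes--;
            }
        }
        outcomes.addLast(success);
        if (success) {
            successes++;
        }
    }

    /** Success rate over the current window; 1.0 when nothing has been recorded. */
    double rate() {
        return outcomes.isEmpty() ? 1.0 : (double) successes / outcomes.size();
    }

    /** True only when the window is full and the rate is below the threshold. */
    boolean shouldAlert() {
        return outcomes.size() == capacity && rate() < threshold;
    }

    public static void main(String[] args) {
        SuccessRateWindow window = new SuccessRateWindow(100, 0.99);
        for (int i = 0; i < 98; i++) {
            window.record(true);
        }
        window.record(false);
        window.record(false);
        System.out.println("rate=" + window.rate() + " alert=" + window.shouldAlert());
    }
}
```

Waiting for a full window before alerting avoids false alarms from the first few requests after startup; a production system would more likely aggregate on the server side per time bucket.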

[Figure: E‑Monitor architecture diagram]

TimeBomb: Exception Event Monitoring

TimeBomb supplements global business monitoring by tracking abnormal events, meaning events that violate a configured time or count threshold. Parameters such as retry counts, intervals, and sampling rates can be configured remotely. Typical use cases include repeated login failures, repeated delivery‑confirmation failures, and location errors. The UI shows anomaly curves per tag and detailed logs for selected nodes.
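One way to picture a count‑within‑a‑time‑window rule is the sketch below. It is not TimeBomb's code: the class `ThresholdEventMonitor`, its method names, and the exact semantics (more than `maxCount` failures inside `windowMillis`) are assumptions made for illustration. Timestamps are passed in explicitly so the behaviour is deterministic.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.HashMap;
import java.util.Map;

/** Hypothetical count-within-window detector; names and semantics are assumptions. */
final class ThresholdEventMonitor {
    private int maxCount;        // failures tolerated inside one window
    private long windowMillis;   // length of the window in milliseconds
    private final Map<String, Deque<Long>> failuresByTag = new HashMap<>();

    ThresholdEventMonitor(int maxCount, long windowMillis) {
        configure(maxCount, windowMillis);
    }

    /** Stands in for a remotely pushed configuration update. */
    void configure(int maxCount, long windowMillis) {
        this.maxCount = maxCount;
        this.windowMillis = windowMillis;
    }

    /**
     * Records a failure for the tag and returns true when the tag now has
     * more than maxCount failures inside the window ending at timestampMillis.
     */
    boolean recordFailure(String tag, long timestampMillis) {
        Deque<Long> times = failuresByTag.computeIfAbsent(tag, k -> new ArrayDeque<>());
        times.addLast(timestampMillis);
        // Drop failures that have fallen out of the window.
        while (!times.isEmpty() && timestampMillis - times.peekFirst() > windowMillis) {
            times.removeFirst();
        }
        return times.size() > maxCount;
    }

    /** A success clears the tag's history, for example once a login finally succeeds. */
    void recordSuccess(String tag) {
        failuresByTag.remove(tag);
    }

    public static void main(String[] args) {
        ThresholdEventMonitor monitor = new ThresholdEventMonitor(2, 60_000);
        System.out.println(monitor.recordFailure("login", 0));
        System.out.println(monitor.recordFailure("login", 10_000));
        System.out.println(monitor.recordFailure("login", 20_000)); // third failure in the window
    }
}
```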

[Figure: TimeBomb alert strategy]

Dogger: Single‑Point Log Monitoring

Dogger consists of the Trojan log‑upload SDK and the Dogger‑Service log‑analysis backend. Trojan injects AOP‑based instrumentation, uses mmap for fast file writes, and compresses logs with gzip (up to 50× reduction). It captures user actions, page lifecycles, network requests, battery, memory, and thread information.
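The mmap‑then‑gzip idea can be illustrated on a plain JVM with `FileChannel.map` and `GZIPOutputStream`. This is a sketch under stated assumptions, not Trojan's implementation: the class `MmapLogSketch`, the fixed region size, and the log line format are hypothetical, and the compression ratio it prints depends entirely on how repetitive the input is.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.charset.StandardCharsets;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.nio.file.StandardOpenOption;
import java.util.Arrays;
import java.util.zip.GZIPOutputStream;

/** Hypothetical sketch of mmap-backed log writing followed by gzip; not Trojan's code. */
final class MmapLogSketch {

    /** Writes lines through a memory-mapped region and returns the bytes actually written. */
    static byte[] writeWithMmap(Path file, int capacity, String... lines) {
        try (FileChannel channel = FileChannel.open(file, StandardOpenOption.CREATE,
                StandardOpenOption.READ, StandardOpenOption.WRITE)) {
            MappedByteBuffer buffer = channel.map(FileChannel.MapMode.READ_WRITE, 0, capacity);
            for (String line : lines) {
                byte[] bytes = (line + "\n").getBytes(StandardCharsets.UTF_8);
                if (bytes.length > buffer.remaining()) {
                    break; // region full; a real SDK would rotate to a new file here
                }
                buffer.put(bytes); // a memory copy, with no write() call per line
            }
            buffer.force(); // ask the OS to flush the mapped pages to disk
            byte[] written = new byte[buffer.position()];
            buffer.flip();
            buffer.get(written);
            return written;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** Compresses a finished log segment, as would be done before upload. */
    static byte[] gzip(byte[] raw) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        try (GZIPOutputStream gz = new GZIPOutputStream(out)) {
            gz.write(raw);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return out.toByteArray();
    }

    public static void main(String[] args) {
        Path file = Paths.get(System.getProperty("java.io.tmpdir"), "mmap-log-sketch.log");
        String[] lines = new String[500];
        Arrays.fill(lines, "tag=network event=request status=200");
        byte[] raw = writeWithMmap(file, 64 * 1024, lines);
        byte[] packed = gzip(raw);
        System.out.println(raw.length + " bytes raw, " + packed.length + " bytes gzipped");
        file.toFile().deleteOnExit();
    }
}
```

Writing through a mapped region means each log call is a memory copy rather than a system call, and the operating system can still persist the mapped pages if the process dies, which is the usual motivation for mmap in mobile logging.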

[Figure: Trojan architecture]

Log parsing is handled by Dogger‑Service, which offers three modules:

ActionChart: visualizes daily rider page transitions, battery, memory, network switches, location frequency, and request rates.

Origin: fast file parsing, time‑based search, tag filtering, and keyword highlighting to locate precise log entries.

Statistics: tag‑based data mining for metrics such as battery, network, traffic, stalls, requests, lifecycle, memory, and location.
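The kind of lookup the Origin module offers (time range, tag filter, keyword match) can be sketched as a simple filter over parsed lines. The `epochMillis|tag|message` line format and the class `LogSearchSketch` are assumptions for illustration; the article does not describe Trojan's actual log format.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Hypothetical log search; the "epochMillis|tag|message" line format is an assumption. */
final class LogSearchSketch {

    /** Returns lines inside [fromMillis, toMillis] that match the tag and keyword (null = any). */
    static List<String> search(List<String> lines, long fromMillis, long toMillis,
                               String tag, String keyword) {
        List<String> hits = new ArrayList<>();
        for (String line : lines) {
            String[] parts = line.split("\\|", 3);
            if (parts.length < 3) {
                continue; // skip malformed lines
            }
            long time;
            try {
                time = Long.parseLong(parts[0]);
            } catch (NumberFormatException e) {
                continue; // skip lines without a numeric timestamp
            }
            boolean inRange = time >= fromMillis && time <= toMillis;
            boolean tagOk = tag == null || tag.equals(parts[1]);
            boolean keywordOk = keyword == null || parts[2].contains(keyword);
            if (inRange && tagOk && keywordOk) {
                hits.add(line);
            }
        }
        return hits;
    }

    public static void main(String[] args) {
        List<String> log = Arrays.asList(
            "1000|network|GET /orders 200",
            "2000|network|GET /orders IOException",
            "3000|battery|level=41");
        System.out.println(search(log, 1500, 3000, "network", "IOException"));
    }
}
```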

[Figure: Statistics example]

EDW: Offline Data Warehouse

The offline reporting layer (EDW) complements real‑time dashboards by providing fine‑grained, day‑long analyses. It aggregates per‑order health, failure ratios, and complex conditional metrics, enabling a “god‑view” of the entire business line. Ele.me’s self‑built EDW platform supports instant queries, data extraction, computation, and monitoring for traffic, location quality, multi‑device usage, offline delivery, push quality, and order anomalies.

[Figure: Traffic report example]

Practical Case Study: Debugging a Network‑Layer Issue

Step 1: Grafana showed the Android rider app's order‑request success rate dropping to 98.69% (target: above 99%). DNS failures appeared in weak‑network scenarios, but these were considered normal. The failing request IDs were absent from the backend trace system, and inspection showed that the Skynet network interceptor sits at the very end of the interceptor chain. Together, these findings indicated that the requests were failing on the client's network layer before ever reaching the server.

Step 2: EDW aggregation identified IOException as the dominant failure cause.
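The grouping behind a finding like this can be sketched in a few lines. EDW is an offline warehouse, so the real aggregation presumably runs as a warehouse query; the in‑memory Java version below, including the class `FailureCauseSketch` and the sample cause names, is purely illustrative.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.Optional;
import java.util.TreeMap;
import java.util.stream.Collectors;

/** Hypothetical in-memory stand-in for a "failures grouped by cause" report. */
final class FailureCauseSketch {

    /** Counts how many failed requests were attributed to each cause. */
    static Map<String, Long> countByCause(List<String> causes) {
        return causes.stream()
            .collect(Collectors.groupingBy(c -> c, TreeMap::new, Collectors.counting()));
    }

    /** Returns the most frequent cause, or empty when there were no failures. */
    static Optional<String> dominantCause(List<String> causes) {
        return countByCause(causes).entrySet().stream()
            .max(Map.Entry.comparingByValue())
            .map(Map.Entry::getKey);
    }

    public static void main(String[] args) {
        List<String> causes = Arrays.asList(
            "IOException", "SocketTimeoutException", "IOException",
            "UnknownHostException", "IOException");
        System.out.println(countByCause(causes));
        System.out.println(dominantCause(causes).orElse("none"));
    }
}
```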

Step 3: The team upgraded OkHttp to 3.11 and added an EventListener to capture request lifecycle events. The logs showed that some requests threw errors or timed out after responseHeadersStart, even under good network conditions. A review of the source code uncovered a bug in OkHttp’s HTTP/2 connection‑pool reuse logic.
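Once lifecycle events are being logged, locating the failing stage amounts to finding the last event recorded before each failure. The sketch below assumes each call's events were captured as an ordered list of OkHttp `EventListener` callback names; the recorder itself (an `EventListener` subclass) is not shown, and the class `LifecycleTraceSketch` is hypothetical.

```java
import java.util.Arrays;
import java.util.Collection;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

/** Hypothetical analysis of per-call event traces; the trace format is an assumption. */
final class LifecycleTraceSketch {

    /** Returns the event recorded just before "callFailed", or null if the call did not fail. */
    static String lastEventBeforeFailure(List<String> events) {
        int index = events.indexOf("callFailed");
        return index <= 0 ? null : events.get(index - 1);
    }

    /** Counts, across many calls, which stage immediately preceded each failure. */
    static Map<String, Integer> failureStages(Collection<List<String>> calls) {
        Map<String, Integer> counts = new TreeMap<>();
        for (List<String> events : calls) {
            String stage = lastEventBeforeFailure(events);
            if (stage != null) {
                counts.merge(stage, 1, Integer::sum);
            }
        }
        return counts;
    }

    public static void main(String[] args) {
        List<List<String>> calls = Arrays.asList(
            Arrays.asList("callStart", "connectionAcquired", "requestHeadersStart",
                "requestHeadersEnd", "responseHeadersStart", "callFailed"),
            Arrays.asList("callStart", "dnsStart", "callFailed"),
            Arrays.asList("callStart", "connectionAcquired", "requestHeadersStart",
                "requestHeadersEnd", "responseHeadersStart", "responseHeadersEnd", "callEnd"));
        System.out.println(failureStages(calls));
    }
}
```

A cluster of failures right after `responseHeadersStart` on an already acquired connection points at a reused connection that no longer delivers responses, rather than at DNS or connection setup.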

private void establishProtocol(ConnectionSpecSelector connectionSpecSelector,
    int pingIntervalMillis, Call call, EventListener eventListener) throws IOException {
  if (route.address().sslSocketFactory() == null) {
    if (route.address().protocols().contains(Protocol.H2_PRIOR_KNOWLEDGE)) {
      socket = rawSocket;
      protocol = Protocol.H2_PRIOR_KNOWLEDGE;
      startHttp2(pingIntervalMillis);
      return;
    }
    socket = rawSocket;
    protocol = Protocol.HTTP_1_1;
    return;
  }
  eventListener.secureConnectStart(call);
  connectTls(connectionSpecSelector);
  eventListener.secureConnectEnd(call, handshake);
  if (protocol == Protocol.HTTP_2) {
    startHttp2(pingIntervalMillis);
  }
}

The bug stemmed from noNewStreams being false for HTTP/2, preventing the connection from being removed from the pool:

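boolean connectionBecameIdle(RealConnection connection) {
  if (connection.noNewStreams || maxIdleConnections == 0) {
    // The connection is marked unusable (or pooling is disabled): evict it.
    connections.remove(connection);
    return true;
  } else {
    // Otherwise the connection stays in the pool and remains eligible for reuse.
    return false;
  }
}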

The backend had switched to the SoPush service, which uses HTTP/2, and the timing of that switch matched the observed anomaly window. OkHttp versions prior to 3.10 lacked robust HTTP/2 ping validation, and even 3.11 could still exhibit the problem. The fix was to force the Android rider app to use HTTP/1.1; HTTP/2 was re‑enabled later, after the corporate network library had been integrated. Once the updated internal build was released, the success‑rate curve returned to normal.
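The article does not show how HTTP/1.1 was forced. With OkHttp, one common way is to restrict the client's protocol list through `OkHttpClient.Builder.protocols`; the snippet below is a sketch of that approach and requires the OkHttp dependency on the classpath. The class name `Http1OnlyClient` is hypothetical.

```java
import java.util.Collections;
import okhttp3.OkHttpClient;
import okhttp3.Protocol;

/** Hypothetical factory; restricting protocols is one way to disable HTTP/2 in OkHttp. */
final class Http1OnlyClient {

    /** Builds a client that offers only HTTP/1.1 during protocol negotiation. */
    static OkHttpClient create() {
        return new OkHttpClient.Builder()
            .protocols(Collections.singletonList(Protocol.HTTP_1_1))
            .build();
    }
}
```

Because the client no longer offers h2 during the TLS handshake, the server falls back to HTTP/1.1 and the HTTP/2 connection‑reuse path is never exercised.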

Conclusion

Ele.me Logistics’ multi‑layer monitoring system—E‑Monitor, TimeBomb, Dogger, and EDW—provides comprehensive visibility from global dashboards to per‑rider logs, enabling rapid detection, root‑cause analysis, and remediation of client‑side availability issues. Ongoing challenges include data redundancy, monitoring‑induced performance overhead, and continuous iteration of the monitoring stack.

The Trojan SDK is open source: https://github.com/eleme/Trojan
