
Design and Implementation of a Low-Latency App Exception Monitoring Platform Using Spark Streaming, Kafka, and Elasticsearch

The paper presents a production‑grade, low‑cost mobile‑app exception monitoring platform built on Spark Streaming, Kafka, and Elasticsearch that achieves high availability through exactly‑once processing and checkpointing, minute‑level latency by decoupling raw and symbolized logs, high throughput via reservoir sampling, and dynamic scalability without code changes.

Meituan Technology Team

The article describes a production‑grade exception monitoring platform for mobile apps, built on open‑source components to achieve low cost, high availability, low latency, high performance, and easy extensibility.

Low Cost : Small teams often use third‑party services, but medium‑to‑large teams prefer an in‑house solution. By leveraging existing open‑source projects (Spark, Kafka, Elasticsearch) the platform was built with minimal custom code (under 700 lines).

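High Availability : The combination of Spark Streaming and Kafka provides an "Exactly-Once" guarantee for the results of exception processing: each exception appears in the output once and only once. Checkpointing lets the job resume from the last successful state after a crash. With the Direct Kafka approach, Spark does not commit consumer offsets to ZooKeeper automatically, so the job manages offsets itself and stores them in ZooKeeper.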
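
One common way to get exactly-once results is to combine replayable input with idempotent writes and to commit the offset only after the write succeeds. The sketch below illustrates that idea in plain Python. It is not the platform's code: `FakeSink` and `FakeOffsetStore` are hypothetical in-memory stand-ins for Elasticsearch and ZooKeeper, and the list `log` stands in for one Kafka partition.

```python
from dataclasses import dataclass, field


@dataclass
class FakeSink:
    """Hypothetical stand-in for Elasticsearch: writes are keyed by document id."""
    docs: dict = field(default_factory=dict)

    def upsert(self, doc_id, doc):
        # Writing the same id twice overwrites, so a replayed batch adds no duplicates.
        self.docs[doc_id] = doc


@dataclass
class FakeOffsetStore:
    """Hypothetical stand-in for ZooKeeper: remembers the next offset to read."""
    committed: dict = field(default_factory=dict)

    def load(self, partition):
        return self.committed.get(partition, 0)

    def commit(self, partition, next_offset):
        self.committed[partition] = next_offset


def process_batch(log, partition, batch_size, sink, offsets, crash_before_commit=False):
    """Process one micro-batch: write results first, commit the offset last."""
    start = offsets.load(partition)
    end = min(start + batch_size, len(log))
    for offset in range(start, end):
        # The id is derived from the source position, so it is stable across replays.
        doc_id = f"{partition}-{offset}"
        sink.upsert(doc_id, {"message": log[offset]})
    if crash_before_commit:
        raise RuntimeError("simulated crash after write, before offset commit")
    offsets.commit(partition, end)


log = ["crash-a", "crash-b", "crash-c", "crash-d"]
sink, offsets = FakeSink(), FakeOffsetStore()

try:
    process_batch(log, 0, 2, sink, offsets, crash_before_commit=True)
except RuntimeError:
    pass  # the job restarts and re-reads from the last committed offset

while offsets.load(0) < len(log):
    process_batch(log, 0, 2, sink, offsets)

print(len(sink.docs))  # 4: the replayed first batch did not create duplicates
```

The first batch is written twice (once before the simulated crash, once after the restart), yet the sink ends up with one document per input record because both writes target the same ids.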

Low Latency : To keep end-to-end delay at the minute level, the system stores raw (unsymbolized) crash logs first, then overwrites them with symbolized data once it becomes available. Most query dimensions do not depend on symbolization, so they are available without waiting for it.

Input Challenges : iOS crash logs are binary and need symbolization, which is resource‑intensive. The platform decouples symbolized and unsymbolized streams, processing the latter quickly while awaiting the former.
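
The write-then-overwrite pattern can be sketched as two writes to the same document id. This is a minimal illustration under assumptions: the dictionary `index` stands in for an Elasticsearch index, and the function names, field names, and sample values are hypothetical.

```python
index = {}  # hypothetical stand-in for an Elasticsearch index, keyed by document id


def write_raw(crash_id, fields):
    """Index the unsymbolized record right away so most dimensions are queryable."""
    index[crash_id] = {**fields, "symbolized": False}


def write_symbolized(crash_id, fields, stack):
    """Overwrite the same document once the slower symbolization step finishes."""
    index[crash_id] = {**fields, "stack": stack, "symbolized": True}


# Illustrative record: the stack holds raw addresses until it is symbolized.
raw = {"app_version": "1.0.0", "os": "iOS", "stack": "0x0001a2b3 0x0001a2f0"}

write_raw("crash-1", raw)
before = dict(index["crash-1"])  # what a query sees in the first minutes

write_symbolized("crash-1", raw, "-[HypotheticalViewController viewDidLoad]")
after = index["crash-1"]  # same document id, now with a readable stack
```

Because both writes share one id, the index holds a single document per crash, and dimensions such as `app_version` can be queried before symbolization completes.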

Output Challenges : Elasticsearch write throughput is limited to roughly 10k rows/s. The solution applies reservoir sampling, with the sample size set by the ES write limit, and runs a batch job to back-fill the records that sampling skipped, which keeps throughput steady.
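
Reservoir sampling picks a fixed-size uniform sample from a stream of unknown length in a single pass. The sketch below uses the classic Algorithm R; whether the platform uses this exact variant is not stated in the source, and the batch size and cap in the usage lines are illustrative assumptions.

```python
import random


def reservoir_sample(stream, k, rng=None):
    """Return up to k items chosen uniformly at random from an iterable (Algorithm R)."""
    if rng is None:
        rng = random.Random()
    reservoir = []
    for i, item in enumerate(stream):
        if i < k:
            # Fill the reservoir with the first k items.
            reservoir.append(item)
        else:
            # Item i replaces a random slot with probability k / (i + 1).
            j = rng.randint(0, i)  # inclusive on both ends
            if j < k:
                reservoir[j] = item
    return reservoir


# Assumed numbers: a burst of 50,000 records, capped at 10,000 writes per batch.
batch = range(50_000)
sample = reservoir_sample(batch, 10_000, random.Random(42))
print(len(sample))  # 10000
```

When a batch is smaller than the cap, every record is kept, so sampling only takes effect during bursts that exceed the write limit.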

High Performance : Queries over both detailed records and aggregations must run on near-real-time data and return within seconds. Various OLAP options were evaluated; Elasticsearch was chosen for its ability to handle both detailed and aggregated queries efficiently.
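
To make the two query shapes concrete, the sketch below builds request bodies in the Elasticsearch query DSL as plain Python dictionaries. The helper names and the field names (`timestamp`, `app_version`, `exception_type`) are hypothetical; the source does not list the actual schema or queries.

```python
import json


def detail_query(field, value, size=20):
    """Body for a detail query: the most recent records matching one dimension."""
    return {
        "query": {"bool": {"filter": [{"term": {field: value}}]}},
        "sort": [{"timestamp": {"order": "desc"}}],
        "size": size,
    }


def aggregation_query(field, value, group_by, top=10):
    """Body for an aggregated query: record counts grouped by another dimension."""
    return {
        "size": 0,  # return only the aggregation buckets, no individual hits
        "query": {"bool": {"filter": [{"term": {field: value}}]}},
        "aggs": {"by_" + group_by: {"terms": {"field": group_by, "size": top}}},
    }


detail = detail_query("app_version", "1.0.0")
aggregated = aggregation_query("app_version", "1.0.0", "exception_type")
print(json.dumps(aggregated, indent=2))
```

Both bodies filter on the same dimension; the first returns documents, the second returns only per-group counts.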

Scalability : Dynamic dimension expansion allows new log fields (e.g., city) to be queried without code changes. This is achieved with Elasticsearch dynamic templates, as shown below:

{
  "mappings": {
    "es_type_name": {
      "dynamic_templates": [
        {
          "template_1": {
            "match": "*log*",
            "match_mapping_type": "string",
            "mapping": { "type": "string" }
          }
        },
        {
          "template_2": {
            "match": "*",
            "match_mapping_type": "string",
            "mapping": { "type": "string", "index": "not_analyzed" }
          }
        }
      ]
    }
  }
}
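
The effect of the two templates above can be emulated in a few lines. This is a simplified sketch, not Elasticsearch's implementation: it assumes templates are tried in order with the first match winning, uses Python's `fnmatchcase` to approximate the `match` glob, and treats any Python `str` as a "string" mapping type. The function name and sample field names are hypothetical.

```python
from fnmatch import fnmatchcase

# (template name, field-name pattern, mapping), in the same order as the JSON above.
DYNAMIC_TEMPLATES = [
    ("template_1", "*log*", {"type": "string"}),
    ("template_2", "*", {"type": "string", "index": "not_analyzed"}),
]


def mapping_for(field_name, value):
    """Return (template name, mapping) for a new field, or None if no template applies."""
    if not isinstance(value, str):
        return None  # both templates only target string values
    for name, pattern, mapping in DYNAMIC_TEMPLATES:
        if fnmatchcase(field_name, pattern):
            return name, mapping  # first matching template wins
    return None


print(mapping_for("crash_log", "Fatal exception in main thread"))
print(mapping_for("city", "Beijing"))
```

A field whose name contains `log` is mapped as an analyzed string for full-text search, while any other new string field, such as `city`, is stored as `not_analyzed` so it can be filtered and aggregated by exact value.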

Resources : The production environment runs Kafka 0.8.2.0, Spark 1.5.2, and Elasticsearch 2.1.1 on a distributed, scalable cluster provided by Meituan‑Dianping.
