How Meituan’s Logan Real‑Time Log System Boosts Debugging Across Mobile, Web, and IoT
This article details the design, architecture, and implementation of Meituan's Logan real‑time logging platform. It covers the workflow, the multi‑terminal collection SDK, ingestion, Flink‑based processing, the consumption layer, stability measures, and the future roadmap, and shows how the platform improves fault diagnosis and system reliability.
Background
1.1 Logan Overview
Logan is Meituan’s unified log service for terminals, supporting mobile apps, web, mini‑programs, IoT, etc. It provides log collection, storage, upload, query and analysis, helping developers locate issues and improve fault‑diagnosis efficiency. It is also one of the earliest open‑source large‑scale front‑end log systems with high write performance, security and loss‑less guarantees.
1.2 Logan Workflow
A simplified workflow diagram (see Figure 1) shows three main steps: active log reporting, log decryption & parsing, and log query & retrieval.
Active reporting : terminals upload logs via HTTPS to Logan’s receiving service, which stores raw log files in object storage.
Decryption & parsing : when developers request logs, the encrypted files are downloaded, decrypted, parsed and delivered to the log storage system.
Query & retrieval : the platform supports filtering by log type, tags, process, keywords, time, and visualisation of specific log types.
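The query step can be pictured as a set of optional filters applied to parsed log entries. The following is a minimal sketch only: the `LogEntry` fields and the `query_logs` function are hypothetical names chosen for illustration, since the article does not describe Logan's actual schema or query API.

```python
from dataclasses import dataclass, field
from datetime import datetime
from typing import Iterable, List, Optional, Set


@dataclass
class LogEntry:
    # Hypothetical record shape; the real Logan schema is not given in the article.
    timestamp: datetime
    log_type: str
    process: str
    message: str
    tags: Set[str] = field(default_factory=set)


def query_logs(
    entries: Iterable[LogEntry],
    log_type: Optional[str] = None,
    tags: Optional[Set[str]] = None,
    process: Optional[str] = None,
    keyword: Optional[str] = None,
    start: Optional[datetime] = None,
    end: Optional[datetime] = None,
) -> List[LogEntry]:
    """Return the entries that match every filter that was supplied."""
    result = []
    for e in entries:
        if log_type is not None and e.log_type != log_type:
            continue
        if tags and not tags.issubset(e.tags):  # entry must carry all requested tags
            continue
        if process is not None and e.process != process:
            continue
        if keyword is not None and keyword.lower() not in e.message.lower():
            continue
        if start is not None and e.timestamp < start:
            continue
        if end is not None and e.timestamp >= end:  # end bound is exclusive
            continue
        result.append(e)
    return result
```

Calling `query_logs(entries, log_type="network", keyword="timeout")` would keep only network logs whose message mentions a timeout, ignoring case. A production platform would push these filters down to the storage engine instead of scanning in memory.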
1.3 Why Real‑Time Logs?
As business complexity grows, the “local storage + active reporting” model shows limitations: delayed reporting in web/mini‑programs, lack of real‑time analysis and alerts, and missing end‑to‑end traceability across multiple terminals.
Reporting limited in some scenarios : users may leave a web page before they can report logs, causing missed troubleshooting windows.
No real‑time analysis or alerting : customers request monitoring of abnormal logs with instant alerts.
No full‑link tracing : logs are scattered across systems, making manual correlation cumbersome.
To address these pain points, Logan Real‑Time Log aims to provide a unified, high‑performance real‑time logging service.
1.4 What Is Logan Real‑Time Log?
It is a solution for mobile apps, web, mini‑programs and IoT, offering high scalability, performance and reliability. Capabilities include log collection, upload, processing, consumption, delivery, query and analysis.
Design and Implementation
2.1 Overall Architecture
The architecture consists of five layers: Collection, Ingestion, Processing, Consumption, and Log Platform.
Collection layer : gathers, encrypts, compresses, aggregates and reports logs from terminals.
Ingestion layer : provides log reporting APIs, receives data and forwards it to the processing layer.
Processing layer : decrypts, splits, enriches and cleans log data.
Consumption layer : filters, formats and delivers logs.
Log platform : offers query, analysis, configuration and alerting.
2.2 Collection SDK
The SDK runs on multiple terminals (WeChat, MMP, Web, MRN). Core logic is shared; platform‑specific code is isolated. Key modules:
Configuration management : fetches and refreshes reporting limits, sampling rates, feature switches, supports gray releases.
Encryption : uses ECDH + AES; Web uses native browser crypto API, other platforms use pure JS.
Storage management : local disk cache prevents log loss when upload fails.
Queue management : groups logs; discards excess in weak‑network or high‑volume scenarios to avoid memory bloat.
When the SDK initializes, it creates Logger, Encryptor, Storage instances, pulls configuration, checks for previously cached logs and attempts re‑upload. Normal log writes are encrypted and added to the current report group, which is flushed on time, size or navigation triggers (see Figure 5).
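The grouping, flushing, and discard behaviour described above can be sketched as follows. This is an illustration under stated assumptions, not the Logan SDK's code: the class name `ReportQueue`, its parameters, and the default limits are all hypothetical, and the real SDK is written in JavaScript and also persists failed uploads to local storage.

```python
import time
from typing import Callable, List, Optional


class ReportQueue:
    """Buffers already-encrypted log lines into a group and flushes on size or age.

    All names and default limits are illustrative assumptions,
    not the Logan SDK's actual API or configuration.
    """

    def __init__(
        self,
        upload: Callable[[List[str]], bool],
        max_group_size: int = 20,
        max_group_age_s: float = 5.0,
        max_pending_groups: int = 10,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self._upload = upload
        self._max_group_size = max_group_size
        self._max_group_age_s = max_group_age_s
        self._max_pending_groups = max_pending_groups
        self._clock = clock
        self._current: List[str] = []
        self._opened_at: Optional[float] = None
        self._pending: List[List[str]] = []  # groups waiting for a successful upload
        self.dropped = 0  # log lines discarded to bound memory

    def write(self, line: str) -> None:
        if not self._current:
            self._opened_at = self._clock()
        self._current.append(line)
        if len(self._current) >= self._max_group_size:
            self.flush()  # size trigger

    def tick(self) -> None:
        """Call periodically; flushes once the open group is old enough."""
        if self._current and self._clock() - self._opened_at >= self._max_group_age_s:
            self.flush()  # time trigger

    def flush(self) -> None:
        """Close the open group and try to upload; also usable as a navigation hook."""
        if self._current:
            self._pending.append(self._current)
            self._current = []
            self._opened_at = None
        # Discard the oldest groups when the backlog exceeds the cap.
        while len(self._pending) > self._max_pending_groups:
            self.dropped += len(self._pending.pop(0))
        # Drain the backlog in order; stop at the first failed upload.
        while self._pending and self._upload(self._pending[0]):
            self._pending.pop(0)
```

Dropping the oldest groups first is one possible policy; which logs the real SDK discards under weak-network or high-volume conditions is not stated in the article.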
2.3 Data Ingestion Layer
The ingestion layer must accept reports over the public network, handle high concurrency, keep latency at the minute level, and deliver data to Kafka. Meituan’s unified log collection channel satisfies these requirements and forwards logs to a Kafka topic.
2.4 Data Processing Layer
Three candidates were evaluated: Java monolith, Storm, and Flink. Flink was chosen for its lower latency and higher throughput.
Flink is an industry‑leading stream processing engine with high throughput, low latency, strong reliability and accurate results through exactly‑once state consistency.
Logs from the ingestion layer are written to a summary topic, then processed by Flink jobs that parse JSON, decrypt content, split by service dimension, and apply custom enrichment before routing to downstream topics (see Figure 7).
Metadata parsing : convert raw logs to JSON.
Content decryption : an asymmetric (ECDH) key exchange yields the symmetric AES key, which is then used to decrypt the log content.
Service‑level splitting : route logs to business‑specific topics.
Custom enrichment : apply user‑defined templates to generate new topics.
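The four stages above can be sketched in plain Python to show the data flow. This is not Flink code and not Meituan's implementation: the field names `service` and `content`, the `logan.*` topic names, and the dead-letter topic are hypothetical, and `placeholder_decrypt` uses base64 decoding purely as a stand-in for the real ECDH + AES decryption.

```python
import base64
import json
from collections import defaultdict
from typing import Callable, Dict, List, Optional

Record = Dict[str, object]


def parse_metadata(raw: str) -> Optional[Record]:
    """Stage 1: turn a raw line into a dict; malformed lines yield None."""
    try:
        record = json.loads(raw)
    except json.JSONDecodeError:
        return None
    return record if isinstance(record, dict) else None


def placeholder_decrypt(ciphertext: str) -> str:
    """Stage 2 stand-in: base64 decoding, NOT real decryption."""
    return base64.b64decode(ciphertext).decode("utf-8")


def process(
    raw_lines: List[str],
    decrypt: Callable[[str], str] = placeholder_decrypt,
    enrich: Optional[Callable[[Record], Record]] = None,
) -> Dict[str, List[Record]]:
    """Run the stages and group the output by a hypothetical per-service topic."""
    topics: Dict[str, List[Record]] = defaultdict(list)
    for raw in raw_lines:
        record = parse_metadata(raw)
        if record is None or "service" not in record or "content" not in record:
            # Unparseable or incomplete input goes to an assumed dead-letter topic.
            topics["logan.dead_letter"].append({"raw": raw})
            continue
        record["content"] = decrypt(str(record["content"]))
        if enrich is not None:
            record = enrich(record)  # Stage 4: optional user-defined enrichment
        topics[f"logan.{record['service']}"].append(record)  # Stage 3: split by service
    return dict(topics)
```

In a real Flink job each stage would be an operator on a stream read from the summary topic, and the per-service lists here would be Kafka sinks.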
2.5 Data Consumption Layer
Beyond basic log query, higher‑order needs include metric monitoring, end‑to‑end tracing and offline analysis. Standardised logs are delivered to Kafka topics, where third‑party systems can consume them.
Full‑link tracing : combine front‑end and back‑end logs for end‑to‑end visibility.
Metric aggregation & alerting : treat logs as a real‑time data stream for monitoring.
Offline analysis : export logs to Hive for long‑term storage and batch analytics.
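As an illustration of treating logs as a data stream for alerting, the sketch below counts error-level events in fixed time windows and reports the windows that reach a threshold. The function name, the `(timestamp, level)` event shape, and the default window and threshold are assumptions made for this example; the article does not specify Logan's alerting rules.

```python
from collections import Counter
from typing import Iterable, List, Tuple


def error_alerts(
    events: Iterable[Tuple[float, str]],
    window_s: int = 60,
    threshold: int = 3,
) -> List[Tuple[int, int]]:
    """Count "error" events per fixed window; return windows at or above threshold.

    events: (unix_timestamp_seconds, level) pairs in any order.
    Returns (window_start_seconds, error_count) pairs sorted by window start.
    """
    counts: Counter = Counter()
    for ts, level in events:
        if level == "error":
            window_start = int(ts // window_s) * window_s
            counts[window_start] += 1
    return sorted((w, c) for w, c in counts.items() if c >= threshold)
```

A streaming consumer would keep these counts incrementally per window instead of recomputing over a list, but the bucketing rule is the same.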
2.6 Log Platform
The platform provides multi‑dimensional search (user ID, tags, keywords) and uses Elasticsearch for storage, with an abstraction layer to support other engines.
Elasticsearch is a distributed open‑source search and analytics engine with low cost of entry, high scalability and near‑real‑time capabilities, suitable for large‑scale log search.
Stability Guarantees
3.1 Core Monitoring
Key SLA metrics include availability, success rate, latency, and throughput (see Table 2).
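Two of these metrics, success rate and latency, can be computed from raw samples as shown in this minimal sketch. The function names and the nearest-rank percentile method are choices made for illustration; the article does not state how Logan defines or calculates its SLA figures.

```python
from typing import Sequence


def success_rate(successes: int, total: int) -> float:
    """Fraction of uploads that succeeded; defined as 1.0 when there was no traffic."""
    return 1.0 if total == 0 else successes / total


def percentile(latencies_ms: Sequence[float], p: int) -> float:
    """Nearest-rank percentile for an integer p between 1 and 100."""
    if not latencies_ms:
        raise ValueError("no samples")
    if not 1 <= p <= 100:
        raise ValueError("p must be between 1 and 100")
    ordered = sorted(latencies_ms)
    rank = (p * len(ordered) + 99) // 100  # ceil(p * n / 100) using integers only
    return ordered[rank - 1]
```

For example, `percentile(samples, 99)` gives a P99 latency that can be compared against an alert threshold.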
3.2 Blue‑Green Deployment
Blue‑Green deployment runs two identical jobs; after the new job stabilises, traffic switches to it, ensuring seamless updates and avoiding log‑consumption delays.
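One way to decide that the new job has stabilised is to require its consumer lag to stay under a limit for several consecutive checks before switching. The sketch below encodes that rule; the criterion itself, the function name, and the default values are assumptions for illustration, as the article does not say how stability is judged.

```python
from typing import Iterable


def should_switch(
    lag_samples: Iterable[int],
    max_lag: int = 1000,
    stable_checks: int = 3,
) -> bool:
    """True once the new job's lag stays <= max_lag for `stable_checks` consecutive samples."""
    streak = 0
    for lag in lag_samples:
        streak = streak + 1 if lag <= max_lag else 0  # any spike resets the streak
        if streak >= stable_checks:
            return True
    return False
```

Until the function returns true, the old job keeps serving downstream consumers, which is what avoids a gap in log consumption during an upgrade.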
Achievements
By Q3 2022, more than 20 business systems (e.g., Meituan mini‑program, merchant selection, catering SaaS) had adopted Logan Real‑Time Log. Reported results: average complaint‑resolution time fell from 10 minutes to under 3 minutes; internal testing saved 10‑15 minutes per issue; and over 1,000 compliance issues were uncovered across 300+ pages and 500+ APIs.
Future Plans
Feature completion : support more terminal types, add log cleaning, statistics, alerts, and full‑link tracing.
Performance boost : target millions of QPS and a 99.9% upload success rate.
Stability enhancement : implement rate‑limiting, circuit‑breaker, and emergency response mechanisms.
21CTO
