How Meituan Optimized High‑Traffic Backend Performance: Real‑World Strategies and Case Studies

This article shares Meituan's practical performance-optimization techniques, including code analysis, database tuning, caching strategies, asynchronous processing, JVM adjustments, multithreading, and monitoring. They are illustrated with real case studies, one of which cut a job's runtime from about 40 minutes to about 15 minutes.

21CTO

Why Meituan's Performance Optimization Matters

Meituan is the largest O2O platform in China. Despite handling massive concurrency and traffic, its app is praised for its user experience. The author, Xiaoming, aims to provide actionable optimization schemes and reference cases so that others can avoid starting from scratch.

Goals of the Sharing

Provide practical, reusable performance‑optimization solutions with concrete examples.

Broaden perspective beyond performance, offering common thinking patterns and selection criteria for solutions.

Common Performance‑Optimization Strategy Categories

Code Analysis First

Many engineers jump straight to caching, async, or JVM tuning, but the first step should be analyzing the code to locate bottlenecks. Simple code issues—excessive loops, unnecessary condition checks, duplicated logic—can often be fixed directly.

Database Optimization

Database tuning can be divided into three parts:

SQL tuning: use slow‑query logs, EXPLAIN, profiling tools, and MySQL index principles.

Architecture-level tuning: read/write splitting, multi-slave load balancing, and horizontal/vertical sharding. Consider these when monitoring alerts (e.g., from Zabbix) indicate that the database itself is the bottleneck.

Connection‑pool tuning: adjust pool parameters based on monitoring data and business load, iteratively testing to find optimal settings.

Cache Strategies

A cache can be local (HashMap, ConcurrentHashMap, Ehcache, Guava Cache) or a shared service (Redis, Tair, Memcached). Choose a local cache for small, infrequently updated data; otherwise use a cache service. Typical scenarios:

Repeated queries of the same data within a short period with infrequent updates – use local cache.

High‑concurrency hot‑data queries that overload the DB – use a cache service.

Selection considerations:

Small, stable data sets → local cache (Ehcache for eviction policies, HashMap for simplicity, ConcurrentHashMap for concurrency).

Larger or dynamic data → cache service; prefer Tair for operational ease, fall back to Redis when Tair lacks needed features.
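
As a rough illustration of the local-cache end of this trade-off, the sketch below builds a small LRU cache on top of `LinkedHashMap`. The class name `LruCache`, the capacity, and the sample keys are assumptions made for the example and do not come from the original article; a library such as Ehcache or Guava Cache would normally supply the eviction policy.

```java
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper, not from the original article: a tiny LRU local cache.
class LruCache<K, V> extends LinkedHashMap<K, V> {
    private final int capacity;

    LruCache(int capacity) {
        super(16, 0.75f, true); // accessOrder = true: the eldest entry is the least recently used
        this.capacity = capacity;
    }

    @Override
    protected boolean removeEldestEntry(Map.Entry<K, V> eldest) {
        return size() > capacity; // evict the least recently used entry once over capacity
    }

    // LinkedHashMap is not thread-safe, so wrap it before sharing it across threads.
    static <K, V> Map<K, V> synchronizedCache(int capacity) {
        return Collections.synchronizedMap(new LruCache<K, V>(capacity));
    }

    public static void main(String[] args) {
        Map<String, String> cache = synchronizedCache(2);
        cache.put("city:1", "Beijing");
        cache.put("city:2", "Shanghai");
        cache.get("city:1");              // touch city:1 so that city:2 becomes the eldest
        cache.put("city:3", "Guangzhou"); // goes over capacity and evicts city:2
        System.out.println(cache.keySet());
    }
}
```

A plain `ConcurrentHashMap` is simpler when the data set is bounded by nature; the eviction hook only matters when the key space can grow.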

Cache Update Timing and Reliability

Two strategies are used for POI cache updates: real‑time updates via message consumption and a fallback 5‑minute expiration that reloads from DB if the cache misses. This dual approach ensures reliability.
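
The dual approach can be sketched in-process as follows. This is a minimal sketch under stated assumptions: `PoiCache`, `onPoiChanged`, the `dbLoader` function, and the injectable clock are all hypothetical names, and a `ConcurrentHashMap` stands in for the real cache service.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Function;
import java.util.function.LongSupplier;

// Hypothetical sketch: message-driven refresh plus a 5-minute expiration fallback.
class PoiCache {
    private static final long TTL_MILLIS = 5 * 60 * 1000L;

    private static final class Entry {
        final String value;
        final long expiresAt;
        Entry(String value, long expiresAt) { this.value = value; this.expiresAt = expiresAt; }
    }

    private final Map<Long, Entry> entries = new ConcurrentHashMap<>();
    private final Function<Long, String> dbLoader; // stands in for the DB read
    private final LongSupplier clock;              // injectable so that expiry can be tested

    PoiCache(Function<Long, String> dbLoader, LongSupplier clock) {
        this.dbLoader = dbLoader;
        this.clock = clock;
    }

    // Path 1: called by the message consumer whenever a POI changes.
    void onPoiChanged(long poiId, String newValue) {
        entries.put(poiId, new Entry(newValue, clock.getAsLong() + TTL_MILLIS));
    }

    // Path 2: a read reloads from the DB when the entry is missing or older than the TTL.
    String get(long poiId) {
        long now = clock.getAsLong();
        Entry e = entries.get(poiId);
        if (e != null && e.expiresAt > now) {
            return e.value;
        }
        String fresh = dbLoader.apply(poiId);
        if (fresh == null) {
            entries.remove(poiId);
            return null;
        }
        entries.put(poiId, new Entry(fresh, now + TTL_MILLIS));
        return fresh;
    }

    public static void main(String[] args) {
        PoiCache cache = new PoiCache(id -> "poi-" + id + "-from-db", System::currentTimeMillis);
        System.out.println(cache.get(42L));
        cache.onPoiChanged(42L, "poi-42-updated");
        System.out.println(cache.get(42L));
    }
}
```

If an update message is lost, a stale value survives for at most one TTL window before the expiration path replaces it, which is the reliability argument made above.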

Cache Miss (“Cache Penetration”) Handling

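When a hot key expires, many concurrent requests can miss the cache at the same moment and hit the DB together. A mutex (e.g., Redis SETNX) ensures that only one request loads the value from the DB while the others wait briefly and retry. This hot-key scenario is also commonly called cache breakdown or a cache stampede.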

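// `redis`, `db`, and `expire_secs` are placeholders, as in the original snippet.
// setnx(key, value, ttlSeconds) stands for an atomic set-if-absent with expiry
// (SET key value NX EX ttl in Redis) and is assumed to return 1 on success.
public String get(String key) throws InterruptedException {
    String mutexKey = key + "_mutex";
    while (true) {
        String value = redis.get(key);
        if (value != null) {
            return value; // cache hit
        }
        // cache miss: only the request that wins the mutex loads from the DB
        if (redis.setnx(mutexKey, "1", 3 * 60) == 1) { // mutex expires after 3 minutes
            try {
                value = db.get(key);
                if (value != null) {
                    redis.set(key, value, expire_secs);
                }
                return value;
            } finally {
                redis.del(mutexKey); // release the mutex even if the DB load fails
            }
        }
        // another request is loading from the DB; wait briefly, then retry
        Thread.sleep(50);
    }
}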
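
The same idea can be tried inside a single JVM without Redis. The following is a minimal sketch, not the article's implementation: `SingleFlightCache`, `countLoads`, and the simulated 100 ms load are illustrative assumptions. Concurrent readers of a missing key share one in-flight load through a `CompletableFuture` instead of sleeping and retrying.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.function.Function;

// Hypothetical in-process analogue of the mutex pattern.
class SingleFlightCache {
    private final ConcurrentMap<String, String> cache = new ConcurrentHashMap<>();
    private final ConcurrentMap<String, CompletableFuture<String>> inFlight = new ConcurrentHashMap<>();
    private final Function<String, String> loader; // stands in for db.get; must not return null

    SingleFlightCache(Function<String, String> loader) {
        this.loader = loader;
    }

    String get(String key) {
        String cached = cache.get(key);
        if (cached != null) {
            return cached;
        }
        CompletableFuture<String> mine = new CompletableFuture<>();
        CompletableFuture<String> other = inFlight.putIfAbsent(key, mine);
        if (other != null) {
            return other.join(); // another thread is loading; wait for its result
        }
        try {
            String value = cache.get(key); // re-check: a load may have finished just before we registered
            if (value == null) {
                value = loader.apply(key);
                cache.put(key, value);
            }
            mine.complete(value);
            return value;
        } catch (RuntimeException e) {
            mine.completeExceptionally(e);
            throw e;
        } finally {
            inFlight.remove(key);
        }
    }

    // Demo helper: fires `threads` concurrent reads of one key and returns how many loads ran.
    static int countLoads(int threads) {
        AtomicInteger loads = new AtomicInteger();
        SingleFlightCache c = new SingleFlightCache(k -> {
            loads.incrementAndGet();
            try {
                Thread.sleep(100); // simulate a slow DB query
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
            return "value-of-" + k;
        });
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        try {
            Callable<String> read = () -> c.get("hot-key");
            List<Future<String>> results = new ArrayList<>();
            for (int i = 0; i < threads; i++) {
                results.add(pool.submit(read));
            }
            for (Future<String> f : results) {
                f.get();
            }
            return loads.get();
        } catch (InterruptedException | ExecutionException e) {
            throw new IllegalStateException(e);
        } finally {
            pool.shutdown();
        }
    }

    public static void main(String[] args) {
        System.out.println("DB loads for 20 concurrent readers: " + countLoads(20));
    }
}
```

This only coordinates threads within one process. Across several application nodes a shared lock such as the Redis-based mutex is still needed.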

Asynchronous Processing

Tasks that are not immediately needed by the user can be processed asynchronously, reducing response time, preventing thread‑pool exhaustion, and avoiding CPU overload. Common approaches:

Spawn a separate thread or use a thread pool to handle the task after the response is returned.

Use a BlockingQueue for massive data batches, processing them in bulk.

Leverage a message‑queue (MQ) so downstream systems handle the work reliably.
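
The BlockingQueue approach might look roughly like the sketch below. `BatchConsumer`, `nextBatch`, and `batchSizes` are hypothetical names, and the batch size and timeout are arbitrary: request threads enqueue and return immediately, while a background worker drains the queue in bulk.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.TimeUnit;

// Hypothetical sketch of bulk processing behind a BlockingQueue.
class BatchConsumer {
    // Waits up to timeoutMillis for a first item, then drains whatever else is queued, up to maxBatch.
    static <T> List<T> nextBatch(BlockingQueue<T> queue, int maxBatch, long timeoutMillis)
            throws InterruptedException {
        List<T> batch = new ArrayList<>(maxBatch);
        T first = queue.poll(timeoutMillis, TimeUnit.MILLISECONDS);
        if (first == null) {
            return batch; // nothing arrived within the timeout
        }
        batch.add(first);
        queue.drainTo(batch, maxBatch - 1);
        return batch;
    }

    // Demo helper: enqueues `items` records, then reports the size of each batch a worker would handle.
    static List<Integer> batchSizes(int items, int maxBatch) {
        BlockingQueue<Integer> queue = new LinkedBlockingQueue<>();
        for (int i = 0; i < items; i++) {
            queue.add(i); // a request thread would enqueue here and return to the user at once
        }
        List<Integer> sizes = new ArrayList<>();
        try {
            List<Integer> batch = nextBatch(queue, maxBatch, 10);
            while (!batch.isEmpty()) {
                sizes.add(batch.size()); // a real worker would write the whole batch in one call
                batch = nextBatch(queue, maxBatch, 10);
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        return sizes;
    }

    public static void main(String[] args) {
        System.out.println(batchSizes(250, 100));
    }
}
```

An in-memory queue loses its contents if the process dies, which is one reason the MQ option exists for work that must not be dropped.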

NoSQL as a Cache‑Like Store

When data does not require relational features, has high write frequency, and does not need strong consistency, NoSQL (e.g., HBase) can replace MySQL to offload write pressure, such as storing massive exception logs.

JVM Tuning

Monitor GC time, GC count, memory usage per generation, CPU load, and thread count via monitoring systems (or custom agents). Adjust young‑generation size, GC thresholds, and heap ratios based on observed metrics. Use tools like jmap and MAT to locate memory leaks.
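
For teams that write a custom agent, the standard `java.lang.management` API exposes most of these numbers. The sketch below is one possible starting point; the class name `JvmSnapshot` and the output format are assumptions, and a real agent would ship the values to the monitoring system rather than print them.

```java
import java.lang.management.GarbageCollectorMXBean;
import java.lang.management.ManagementFactory;
import java.lang.management.MemoryPoolMXBean;
import java.lang.management.MemoryUsage;

// Hypothetical sketch of a metrics probe built on the JDK management beans.
class JvmSnapshot {
    static String describe() {
        StringBuilder sb = new StringBuilder();
        for (GarbageCollectorMXBean gc : ManagementFactory.getGarbageCollectorMXBeans()) {
            // Both values are cumulative since JVM start and may be -1 if the collector does not report them.
            sb.append("gc ").append(gc.getName())
              .append(": count=").append(gc.getCollectionCount())
              .append(", timeMs=").append(gc.getCollectionTime()).append('\n');
        }
        for (MemoryPoolMXBean pool : ManagementFactory.getMemoryPoolMXBeans()) {
            MemoryUsage usage = pool.getUsage();
            if (usage != null) { // null when the pool is no longer valid
                sb.append("pool ").append(pool.getName())
                  .append(": usedBytes=").append(usage.getUsed()).append('\n');
            }
        }
        MemoryUsage heap = ManagementFactory.getMemoryMXBean().getHeapMemoryUsage();
        sb.append("heap usedBytes=").append(heap.getUsed())
          .append(", committedBytes=").append(heap.getCommitted()).append('\n');
        sb.append("liveThreads=").append(ManagementFactory.getThreadMXBean().getThreadCount()).append('\n');
        // The load average is negative on platforms where it is not available.
        sb.append("systemLoadAverage=")
          .append(ManagementFactory.getOperatingSystemMXBean().getSystemLoadAverage());
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(describe());
    }
}
```

Sampling this periodically and diffing the cumulative GC counters gives GC count and GC time per interval, which is usually what dashboards plot.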

Multithreading and Distributed Execution

Use multithreading for CPU‑bound tasks on a single machine, employing thread pools for performance and flow control. When a single machine cannot meet demand, adopt a distributed scheduler‑executor architecture with RPC, heartbeats, and possibly a cluster framework.
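
A minimal single-machine sketch of "thread pool for performance and flow control" follows. The workload (a sum of squares), the queue length of 16, and the class name `ParallelSum` are illustrative assumptions. The bounded queue plus `CallerRunsPolicy` makes the submitting thread run a task itself when the pool is saturated, which slows submission down instead of letting the backlog grow without limit.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.Future;
import java.util.concurrent.ThreadPoolExecutor;
import java.util.concurrent.TimeUnit;

// Hypothetical sketch: split a CPU-bound job into chunks and run them on a bounded pool.
class ParallelSum {
    static long sumOfSquares(int n, int chunkSize) {
        int threads = Runtime.getRuntime().availableProcessors();
        ThreadPoolExecutor pool = new ThreadPoolExecutor(
                threads, threads, 0L, TimeUnit.MILLISECONDS,
                new ArrayBlockingQueue<>(16),               // bounded queue: at most 16 waiting tasks
                new ThreadPoolExecutor.CallerRunsPolicy()); // when full, the caller runs the task itself
        try {
            List<Future<Long>> parts = new ArrayList<>();
            for (int start = 1; start <= n; start += chunkSize) {
                final int from = start;
                final int to = Math.min(n, start + chunkSize - 1);
                Callable<Long> task = () -> {
                    long sum = 0;
                    for (long i = from; i <= to; i++) {
                        sum += i * i;
                    }
                    return sum;
                };
                parts.add(pool.submit(task));
            }
            long total = 0;
            for (Future<Long> part : parts) {
                total += part.get(); // wait for each chunk and combine the partial results
            }
            return total;
        } catch (InterruptedException | ExecutionException e) {
            throw new IllegalStateException(e);
        } finally {
            pool.shutdown();
        }
    }

    public static void main(String[] args) {
        System.out.println(sumOfSquares(1000, 64));
    }
}
```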

Metrics, Monitoring, and Alerting System

Although not a direct optimization, a robust metrics system is essential for locating problems and measuring improvement. Typical metrics include:

Interface QPS, response time, call volume (per node and per service cluster).

Node‑level CPU, load, memory, network traffic; plus service‑specific metrics for databases, caches, etc.

Data collection is usually asynchronous, sending to Flume or directly to a monitoring server. Processing can be batch (MapReduce/Hive) or real‑time (Storm/Spark), with results stored in MySQL or HBase and visualized via dashboards.
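
On the collection side, the per-interface counters can be kept cheaply in memory before being shipped out. The sketch below assumes a hypothetical `InterfaceMetrics` class with made-up interface names; a real collector would flush these values asynchronously at the end of each reporting window.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.LongAdder;

// Hypothetical sketch of in-process call counting and latency aggregation.
class InterfaceMetrics {
    private static final class Stat {
        final LongAdder calls = new LongAdder();
        final LongAdder totalMillis = new LongAdder();
    }

    private final Map<String, Stat> stats = new ConcurrentHashMap<>();

    // Called once per request, typically from a filter or interceptor.
    void record(String api, long elapsedMillis) {
        Stat s = stats.computeIfAbsent(api, k -> new Stat());
        s.calls.increment();
        s.totalMillis.add(elapsedMillis);
    }

    long calls(String api) {
        Stat s = stats.get(api);
        return s == null ? 0 : s.calls.sum();
    }

    double avgMillis(String api) {
        Stat s = stats.get(api);
        long n = s == null ? 0 : s.calls.sum();
        return n == 0 ? 0.0 : (double) s.totalMillis.sum() / n;
    }

    // QPS over a reporting window, given that the counters cover exactly that window.
    double qps(String api, long windowSeconds) {
        return (double) calls(api) / windowSeconds;
    }

    public static void main(String[] args) {
        InterfaceMetrics m = new InterfaceMetrics();
        m.record("/poi/detail", 12);
        m.record("/poi/detail", 30);
        System.out.println(m.calls("/poi/detail") + " calls, avg " + m.avgMillis("/poi/detail") + " ms");
    }
}
```

`LongAdder` is used rather than `AtomicLong` because it is designed for counters that many threads update and few threads read.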

Real‑World Case Studies

Case 1: Merchant‑Control‑Area Refresh Job

Goal: Refresh merchant‑to‑control‑area relationships hourly, keeping runtime under 20 minutes.

Original flow fetched all merchant delivery ranges and control areas, then performed nested loops to intersect ranges, deduplicate merchant IDs, batch‑load merchants, and update relationships. Runtime was ~40 minutes.

First‑phase optimization introduced an R‑tree spatial index to quickly find intersecting delivery ranges, reducing runtime to <20 minutes.
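
To show why a spatial index removes the nested loops, here is a deliberately simplified sketch. It swaps in a uniform grid index rather than an R-tree, and it treats delivery ranges and control areas as axis-aligned bounding rectangles; both simplifications, and every name in the code, are assumptions made for illustration and not the article's implementation.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

// Hypothetical sketch: a uniform grid index (not an R-tree) over bounding rectangles.
class GridIndex {
    static final class Rect {
        final double minX, minY, maxX, maxY;
        Rect(double minX, double minY, double maxX, double maxY) {
            this.minX = minX; this.minY = minY; this.maxX = maxX; this.maxY = maxY;
        }
        boolean intersects(Rect o) {
            return minX <= o.maxX && o.minX <= maxX && minY <= o.maxY && o.minY <= maxY;
        }
    }

    private final double cellSize;
    private final Map<Long, List<Integer>> cells = new HashMap<>();
    private final Map<Integer, Rect> rects = new HashMap<>();

    GridIndex(double cellSize) {
        this.cellSize = cellSize;
    }

    private long cellOf(double v) {
        return (long) Math.floor(v / cellSize);
    }

    // Packs two cell coordinates into one key; assumes each fits in 32 bits.
    private static long cellKey(long cx, long cy) {
        return (cx << 32) ^ (cy & 0xffffffffL);
    }

    // Registers the rectangle in every grid cell it overlaps.
    void insert(int id, Rect r) {
        rects.put(id, r);
        for (long cx = cellOf(r.minX); cx <= cellOf(r.maxX); cx++) {
            for (long cy = cellOf(r.minY); cy <= cellOf(r.maxY); cy++) {
                cells.computeIfAbsent(cellKey(cx, cy), k -> new ArrayList<>()).add(id);
            }
        }
    }

    // Looks only at cells the query area overlaps, then applies the exact test; the set removes duplicates.
    Set<Integer> query(Rect area) {
        Set<Integer> hits = new TreeSet<>();
        for (long cx = cellOf(area.minX); cx <= cellOf(area.maxX); cx++) {
            for (long cy = cellOf(area.minY); cy <= cellOf(area.maxY); cy++) {
                List<Integer> candidates = cells.get(cellKey(cx, cy));
                if (candidates == null) {
                    continue;
                }
                for (int id : candidates) {
                    if (rects.get(id).intersects(area)) {
                        hits.add(id);
                    }
                }
            }
        }
        return hits;
    }

    public static void main(String[] args) {
        GridIndex index = new GridIndex(1.0);
        index.insert(1, new Rect(0, 0, 1, 1));
        index.insert(2, new Rect(5, 5, 6, 6));
        index.insert(3, new Rect(0.5, 0.5, 5.5, 5.5));
        System.out.println(index.query(new Rect(0.9, 0.9, 1.2, 1.2)));
    }
}
```

Each control area then examines only the delivery ranges registered in nearby cells instead of every range in the data set. An R-tree achieves the same pruning while adapting to unevenly distributed shapes.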

Second‑phase optimization replaced DB batch fetches with cache mget calls and refined conditional updates, cutting runtime further to ~15 minutes.

Case 2: POI Cache Design and Implementation

Problem: Rapid growth of POI read traffic in Q4 2014 caused DB overload.

Solution: Use Tair as a cache service. Initial design combined MQ‑driven updates with a 5‑minute expiration fallback. Later, Databus replaced expiration, providing real‑time cache invalidation and eliminating Tair‑disk fallback latency.

Result: DB read traffic dropped dramatically, response times improved, and cache consistency was maintained.

Case 3: Backend Operations Dashboard Performance

Multiple pages (welcome, organization‑tree, order‑building) suffered from high latency due to excessive small‑SQL queries and heavy data processing.

Solutions included batch API calls, asynchronous RPC, pre‑computing results cached in Redis, merging many small SQLs into larger ones, and local caching of reference data.
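
Merging many small SQL statements into larger ones usually means replacing one query per id with one parameterized `IN` query per chunk of ids. The sketch below illustrates that step only; the table `org_node`, its columns, and the chunk size are hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

// Hypothetical sketch: build one batched statement instead of N single-row lookups.
class BatchQuery {
    // Returns a statement with one "?" placeholder per id; values are bound separately.
    static String selectByIds(int idCount) {
        if (idCount <= 0) {
            throw new IllegalArgumentException("idCount must be positive");
        }
        String placeholders = String.join(", ", Collections.nCopies(idCount, "?"));
        return "SELECT id, name FROM org_node WHERE id IN (" + placeholders + ")";
    }

    // Splits the ids into chunks so that each IN list stays bounded.
    static <T> List<List<T>> chunks(List<T> ids, int chunkSize) {
        List<List<T>> out = new ArrayList<>();
        for (int i = 0; i < ids.size(); i += chunkSize) {
            out.add(new ArrayList<>(ids.subList(i, Math.min(ids.size(), i + chunkSize))));
        }
        return out;
    }

    public static void main(String[] args) {
        List<Integer> ids = Arrays.asList(11, 12, 13, 14, 15);
        for (List<Integer> chunk : chunks(ids, 2)) {
            System.out.println(selectByIds(chunk.size()) + " <- " + chunk);
        }
    }
}
```

Five ids with a chunk size of two produce three round trips instead of five; with realistic chunk sizes the reduction is much larger.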

After deployment, each page showed a significant latency reduction, which the original article demonstrates with before-and-after charts.

Author: Xiaoming From: Meituan Technical Team
Republication Notice

This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contact admin@besthub.dev and we will review it promptly.
