
How Gaode Maps Built a Real‑Time Lakehouse for Billion‑Scale Trajectory Data

This article details Gaode Maps' end‑to‑end lakehouse solution for massive, high‑frequency trajectory data, covering the challenges of real‑time visibility, query performance, and storage cost, and explaining how a hot‑warm‑cold tiering architecture built on Apache Flink, Paimon, StarRocks, Redis and Lindorm delivers millisecond‑level queries while cutting storage expenses.

Alibaba Cloud Big Data AI Platform

Background and Core Challenges

Gaode Maps collects billions of vehicle trajectory points that must be visible in sub‑second latency, support millions of writes per second during peak periods, and be retained for long‑term historical analysis. The data exhibits three key characteristics: high real‑time freshness, massive concurrent access, and long‑term storage cost pressure.

Access‑Span Analysis

Analysis of daily user queries shows that 0–1 day data accounts for ~67% of accesses (hot data), data older than 60 days for ~16% (cold data), and 1–60 day data for the remaining ~17% (warm data). Although cold data is accessed least frequently, its volume is large and cost‑sensitive.

Tiered Storage Architecture

Hot‑A (Redis): Stores the most recent day of trajectory points as key‑value pairs (user + point) for ultra‑low‑latency queries such as real‑time location lookups and “cat‑and‑mouse” games.

Hot‑B (Lindorm): Holds days 1–3 of data, organized by user + time‑slice + trajectory‑segment, supporting near‑real‑time queries and internal investigations.

Warm (Apache Paimon + StarRocks): Persists 3–60 day data. Paimon provides upserts and deletion‑vector support; StarRocks offers sub‑100 ms query response.

Cold (Apache Paimon + StarRocks): Stores data older than 60 days with additional compression and aggregation to minimize storage cost.
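The tier boundaries above can be sketched as a simple age-based routing rule. This is an illustrative model only; the tier names and the `tier_for` helper are assumptions for the sketch, not Gaode's actual routing code.

```python
from datetime import datetime, timedelta, timezone

# Tier boundaries taken from the article's access-span analysis.
TIERS = [
    (timedelta(days=1),  "hot-a (Redis)"),
    (timedelta(days=3),  "hot-b (Lindorm)"),
    (timedelta(days=60), "warm (Paimon + StarRocks)"),
]

def tier_for(point_time: datetime, now: datetime) -> str:
    """Pick the storage tier for a trajectory point based on its age."""
    age = now - point_time
    for boundary, tier in TIERS:
        if age <= boundary:
            return tier
    return "cold (Paimon + StarRocks)"

now = datetime(2024, 6, 1, tzinfo=timezone.utc)
tier_for(now - timedelta(hours=5), now)   # a point from today -> hot-a
tier_for(now - timedelta(days=90), now)   # three-month-old point -> cold
```

In production this decision sits inside the Flink pipeline, so each record is routed to its sink as it arrives rather than migrated later.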

Data Processing Pipeline

Apache Flink consumes raw trajectory streams, enriches them, and routes records to the appropriate storage tier based on the access‑span metric. All historical data (≥3 days) is written to a unified Paimon table whose primary key is the trajectory ID, enabling efficient upserts and deletion‑vector handling. Partitioning is performed on the trajectory start date to avoid cross‑day writes and to enable effective partition pruning.
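The two table-design choices in the paragraph above — partitioning on the trajectory's start date and keying on the trajectory ID — can be sketched as follows. The function and the dict-based table are hypothetical stand-ins for the Paimon table, not the production schema.

```python
from datetime import datetime, timezone

def paimon_partition(start_ms: int) -> str:
    """Partition value derived from the trajectory's *start* date.
    A trajectory that crosses midnight still lands in exactly one
    partition, so there are no cross-day writes, and date predicates
    can prune whole partitions at query time."""
    start = datetime.fromtimestamp(start_ms / 1000, tz=timezone.utc)
    return start.strftime("%Y%m%d")

# Toy stand-in for the unified Paimon table: primary key = trajectory ID.
table: dict[str, dict] = {}

def upsert(record: dict) -> None:
    # Writing the same trajectory ID again overwrites in place, which is
    # the behavior a Paimon primary-key table gives on upsert: replays
    # and late corrections do not accumulate duplicate rows.
    table[record["trajectory_id"]] = record
```

Keying on the trajectory ID is what makes deletion vectors effective downstream: a correction is a single-key overwrite rather than a scan-and-rewrite of the day's files.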

Compression Technique

A customized Polyline encoding compresses latitude, longitude, time, speed, direction, and altitude. The algorithm quantizes coordinates, encodes deltas, and uses variable‑length ASCII encoding, achieving a 43–50% size reduction (47% on average). This reduces storage cost by millions of dollars annually.

Performance and Cost Validation

Benchmarks demonstrate that the Flink + Paimon pipeline sustains write throughput in the millions of records per second, while StarRocks delivers sub‑second point‑lookup latency on a trillion‑scale dataset. Parameter tuning (32 MB file‑block size, 4 GB manifest cache, disabled thread‑pool serialization) further improves stability and query speed.

Optimization Practices

Storage compression: Polyline encoding of trajectory fields.

File management: Periodic small‑file merging and orphan‑file cleanup via the data‑lake portal.

Query optimization: Date‑based partitioning, partition pruning, and deletion vectors to skip stale rows.

Parameter tuning: Reduced Paimon block size, increased manifest cache, and disabled thread‑pool serialization for high QPS.

Stability isolation: Separate Flink and StarRocks clusters for consumer‑facing (C‑end) traffic and internal investigation workloads.
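The deletion-vector technique in the list above is worth unpacking: instead of rewriting an immutable data file when a row is superseded, the engine records the dead row's position and filters it out at scan time. The class below is a toy model of that idea (a Python set standing in for Paimon's compressed bitmap), not Paimon's actual implementation.

```python
class DataFile:
    """Toy model of a data file with an attached deletion vector."""

    def __init__(self, rows):
        self.rows = list(rows)   # the file itself is never rewritten
        self.deleted = set()     # stand-in for a compressed position bitmap

    def delete_row(self, pos: int) -> None:
        # A delete or upsert only flips a bit; the expensive file
        # rewrite is deferred to the next compaction.
        self.deleted.add(pos)

    def scan(self):
        # Readers skip stale rows directly, with no merge-on-read cost.
        return [row for pos, row in enumerate(self.rows) if pos not in self.deleted]

f = DataFile(["p1", "p2", "p3"])
f.delete_row(1)
f.scan()  # the superseded row is invisible to queries
```

This is why deletion vectors pair well with the high-QPS lookup path: point queries stay cheap even while upserts land continuously.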

Overall Architecture Diagram

Architecture diagram

Future Directions

The lakehouse framework will be extended to other Gaode services, integrating additional log streams and enriching downstream BI and AI pipelines. Feature extraction across multiple data sources will feed AI agents for intelligent business assistance.

Tags: Apache Flink, StarRocks, Lakehouse, Data Tiering, Trajectory Data, Apache Paimon
Written by

Alibaba Cloud Big Data AI Platform

The Alibaba Cloud Big Data AI Platform builds on Alibaba’s leading cloud infrastructure, big‑data and AI engineering capabilities, scenario algorithms, and extensive industry experience to offer enterprises and developers a one‑stop, cloud‑native big‑data and AI capability suite. It boosts AI development efficiency, enables large‑scale AI deployment across industries, and drives business value.
