How Gaode Maps Built a Real‑Time Lakehouse for Billion‑Scale Trajectory Data
This article details Gaode Maps' end‑to‑end lakehouse solution for high‑frequency, high‑volume trajectory data. It covers the challenges of real‑time visibility, multi‑scenario queries, storage cost, and data silos, and describes the layered storage architecture, performance validation, and future expansion plans.
Background and Challenges
Gaode Maps processes massive trajectory data characterized by high real‑time requirements, high concurrency, and long‑term storage. Typical user scenarios include the "Footprint Map" that records each navigation trip, the "Work Map" showing when and where a trip started and ended, and the "Cat‑and‑Mouse" game where users share live locations.
Key challenges identified are:
Real‑time visibility: Ingest rates reach millions of records per second, with peaks during holidays.
Complex multi‑scenario queries: Both offline analytics and online services demand low‑latency responses.
High storage cost for historical data: Continuous growth without effective tiering leads to cost pressure.
Data silos and business dependencies: Over 20 downstream services depend on the trajectory pipeline, creating architectural complexity.
Unified Link Optimization
To address these challenges, the team designed a unified data processing and storage framework with four core goals:
Unified data processing: Standardize pipelines across business lines.
General storage and query service: Provide a common trajectory store and query API.
Cost reduction: Optimize resource usage and operational overhead.
No performance compromise: Preserve real‑time and query performance under the unified architecture.
Data Layering Design
Based on the analysis of daily access span, trajectory data is divided into three layers:
Hot data (0‑1 day): Stored in Redis for sub‑second latency, supporting real‑time position queries and high‑frequency interactions.
Warm data (1‑60 days): Stored in Lindorm, organized by user + time‑slice + trajectory segment, delivering sub‑100 ms responses.
Cold data (>60 days): Stored in Apache Paimon + StarRocks, compressed with Polyline encoding (≈43‑50 % reduction) and aggregated by trajectory ID to minimize storage.
The warm and cold analytical paths converge on the same Apache Paimon + StarRocks stack, with Flink providing near‑real‑time writes and StarRocks serving high‑performance OLAP queries.
Performance and Storage Validation
Extensive benchmarks demonstrated that the solution meets the required performance:
Flink + Paimon achieves the necessary write throughput for near‑real‑time ingestion.
StarRocks point‑query tests on a trillion‑scale dataset satisfy latency targets.
Parameter tuning further improved efficiency:
Reduced file-block-size from 128 MB to 32 MB for finer data pruning.
Disabled thread‑pool serialization to avoid bottlenecks under high QPS.
Increased manifest cache to 4 GB to improve metadata hit rates.
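Expressed as table options, the tuning above might look like the following sketch. The option keys here are illustrative placeholders only; the exact configuration names depend on the Paimon and StarRocks versions in use:

```properties
# Illustrative only -- option keys are placeholders, not verified
# Paimon/StarRocks configuration names.

# Smaller data-file blocks allow finer-grained pruning at query time.
file.block-size = 32MB

# A larger manifest cache raises the metadata hit rate under high QPS.
manifest.cache-size = 4GB
```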
Optimization Techniques
Additional optimizations include:
Trajectory compression using Google Polyline encoding, achieving roughly 47 % overall storage savings across billions of records.
Alake portal storage governance to merge small files and clean orphan files.
Partition pruning based on trajectory start date to avoid cross‑day scans.
Enabling Deletion Vector (DV) in Paimon and StarRocks native DV for faster reads.
Physical isolation of consumer‑facing (C‑end) query clusters from internal analytics clusters to protect the SLA.
Overall Architecture
The end‑to‑end pipeline consists of three layers: data processing (Flink), storage (Redis, Lindorm, Paimon + Pangu), and interface (StarRocks query service). Flink consumes raw trajectory streams, enriches them with planning data, writes to the appropriate tier based on access span, and uses Paimon's Partial Update engine to materialize full trip records. The unified query layer serves both real‑time and historical queries with consistent APIs.
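The partial-update materialization described above can be approximated in plain Python: rows arriving at different times but sharing a primary key are merged, with later non-null fields overwriting earlier ones. This mirrors the semantics of Paimon's partial-update merge engine in miniature; the field names (`trip_id`, `start`, `end`) are hypothetical:

```python
def partial_update_merge(rows: list[dict]) -> dict[str, dict]:
    """Merge partial rows per primary key, keeping the latest non-null
    value of each field -- a sketch of Paimon's partial-update merge engine."""
    merged: dict[str, dict] = {}
    for row in rows:
        acc = merged.setdefault(row["trip_id"], {})
        for field, value in row.items():
            if value is not None:   # null fields never overwrite existing values
                acc[field] = value
    return merged
```

For example, a trip-start event and a later trip-end event, each carrying only part of the record, materialize into one full trip row without a read-modify-write cycle in the application.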
Future Plans
Gaode aims to extend the stream‑batch integration to other log‑type data, deepen collaboration with BI and AI teams, and explore feature mining for user behavior to power AI agents. The lakehouse architecture will continue to evolve as a unified foundation for both analytics and real‑time services.
StarRocks
StarRocks is an open‑source project under the Linux Foundation, focused on building a high‑performance, scalable analytical database that enables enterprises to adopt an efficient, unified lakehouse architecture. It is widely used across industries worldwide, helping companies strengthen their data analytics capabilities.