
Engineering Practices and Performance Optimizations of Apache Druid for Real‑Time OLAP at Shopee

Shopee’s engineering team scaled a 100‑node Apache Druid cluster for real‑time OLAP by redesigning the Coordinator load‑balancing algorithm, adding incremental metadata pulls, introducing a segment‑merged result cache, and building exact‑count and flexible sliding‑window operators, while planning cloud‑native deployment.

Shopee Tech Team

Apache Druid is a high‑performance, open‑source, real‑time analytics database designed for low‑latency, interactive queries over large, typically time‑series, datasets. This article shares the engineering practices and performance‑tuning techniques applied to support Shopee’s core real‑time OLAP analytics workloads.

1. Druid Cluster in Shopee

A single large‑scale cluster of over 100 physical nodes serves as the downstream store for various core business data pipelines. Batch and streaming jobs write segments into Druid, and downstream services perform real‑time OLAP queries.

2. Technical Optimizations

2.1 Coordinator load‑balancing algorithm

Problem: As the number of dataSources grew, the number of segments increased to millions, causing the Coordinator’s metadata management to become a bottleneck. A serial sub‑task that balances segment loading on historical nodes took more than 10 minutes, leading to segment hand‑off timeouts and query jitter.

Analysis: The reservoir‑sampling implementation sampled only one element per call from a 5‑million‑segment list, requiring 2 000 passes per balancing cycle.

Solution: Implement a batch‑sampling reservoir algorithm that traverses the segment list once and selects 2 000 elements, reducing the sub‑task execution time to ~300 ms.
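The single‑pass, k‑element variant can be sketched as follows (a minimal Python illustration of the idea, not the Coordinator's actual Java code):

```python
import random

def reservoir_sample(items, k):
    """Select k items uniformly at random in a single pass (Algorithm R).

    This replaces k independent one-element samples (k full traversals
    of the segment list) with a single traversal, which is the essence
    of the Coordinator balancing fix described above.
    """
    reservoir = []
    for i, item in enumerate(items):
        if i < k:
            # Fill the reservoir with the first k items.
            reservoir.append(item)
        else:
            # Every later item survives with probability k / (i + 1).
            j = random.randrange(i + 1)
            if j < k:
                reservoir[j] = item
    return reservoir

# e.g. pick 2 000 candidate segments out of 5 million in one pass:
# candidates = reservoir_sample(all_segments, 2000)
```

Because the list is now traversed once instead of 2 000 times, the cost drops from O(k·n) to O(n), matching the reported reduction from minutes to hundreds of milliseconds.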

Benchmark result – The batch‑sampling algorithm outperforms the original approach (see image in the original article).

2.2 Incremental metadata management

Problem: The Coordinator periodically pulls the entire segment table (≈5 million rows) from MySQL, which becomes slow and resource‑intensive as the cluster grows.

Solution: Introduce an incremental pull that only fetches newly added segments within a recent time window, merging them into the existing in‑memory snapshot. Full pulls are still performed at a longer interval for consistency.

Full‑pull SQL:

SELECT payload FROM druid_segments WHERE used=true;

Incremental‑pull SQL:

-- To ensure efficient execution, an index is created on the new filter condition
SELECT payload FROM druid_segments WHERE used=true AND created_date > :created_date;

Configuration:

# Pull metadata added in the last 5 minutes
druid.manager.segments.pollLatestPeriod=PT5M
# Full pull every 15 minutes
druid.manager.segments.fullyPollDuration=PT15M

Result: Metadata pull and deserialization latency dropped dramatically, reducing pressure on the metadata DB and improving data visibility for users.
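The polling scheme can be sketched in Python (class and method names are illustrative, not Druid's): cheap incremental pulls merge newly created rows into the in‑memory snapshot, while a periodic full pull rebuilds it for consistency.

```python
from datetime import datetime

class SegmentPoller:
    """Illustrative model of the incremental metadata pull.

    `table` stands in for the druid_segments table: a list of
    (segment_id, payload, created_date) rows.
    """

    def __init__(self, table):
        self.table = table
        self.snapshot = {}               # segment_id -> payload
        self.watermark = datetime.min    # highest created_date seen

    def full_poll(self):
        # SELECT payload FROM druid_segments WHERE used=true
        self.snapshot = {sid: payload for sid, payload, _ in self.table}
        if self.table:
            self.watermark = max(cd for _, _, cd in self.table)

    def incremental_poll(self):
        # ... AND created_date > :created_date (the indexed filter column)
        delta = [(sid, payload, cd) for sid, payload, cd in self.table
                 if cd > self.watermark]
        for sid, payload, cd in delta:
            self.snapshot[sid] = payload          # merge into snapshot
            self.watermark = max(self.watermark, cd)
        return len(delta)
```

The full poll is still needed at the longer interval because the delta query only sees insertions; rows that were deleted or flipped to `used=false` are reconciled by the full rebuild.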

2.3 Broker result‑cache optimization

Problem: Existing cache mechanisms (segment‑level intermediate cache and result cache) have clear limitations. The result cache cannot be used with the group‑by v2 engine, and segment‑level cache adds serialization overhead on the Broker.

Solution: Implement a new "segment merged result cache" on the Broker that stores the merged intermediate results of historical segments. This cache works for all four usage scenarios (group‑by v2, historical only, hybrid, large‑scale segment results).

Configuration:

druid.broker.cache.useSegmentMergedResultCache=true
druid.broker.cache.populateSegmentMergedResultCache=true

Benchmark shows roughly 50 % reduction in query latency after enabling the new cache.
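The idea can be modeled in a few lines of Python (an illustrative sketch, not the actual Broker code): the cache key covers both the query and the set of historical segments it touched, and the value is the already merged intermediate result, so a repeat query skips both the per‑segment fetch and the per‑segment merge.

```python
import hashlib
import json

class SegmentMergedResultCache:
    """Toy model of a Broker-side cache of merged historical results.

    The key must include the segment set: if the set of historical
    segments backing an interval changes (compaction, hand-off), the
    old entry no longer matches and is simply not hit.
    """

    def __init__(self):
        self.store = {}

    def _key(self, query, segment_ids):
        blob = json.dumps({"q": query, "segs": sorted(segment_ids)},
                          sort_keys=True).encode()
        return hashlib.sha256(blob).hexdigest()

    def get(self, query, segment_ids):
        # Hit: the merged historical result can be combined directly
        # with real-time results, with no per-segment work.
        return self.store.get(self._key(query, segment_ids))

    def put(self, query, segment_ids, merged_result):
        self.store[self._key(query, segment_ids)] = merged_result
```

Caching the post‑merge result is what makes the approach compatible with group‑by v2 and avoids the per‑segment serialization overhead of the existing segment‑level cache.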

3. Custom Feature Development

3.1 Integer‑exact deduplication operator

Business need: precise order‑count and UV metrics, where the existing approximate deduplication operators introduce error. The team implemented a bitmap‑based exact‑count aggregator on top of RoaringBitmap, supporting 32‑bit and 64‑bit integer fields.

Aggregator JSON API (use Bitmap32ExactCountBuild against the raw integer column at ingestion, and Bitmap32ExactCountMerge against the pre‑aggregated metric at query time):

{
  "type": "Bitmap32ExactCountBuild",
  "name": "exactCountMetric",
  "fieldName": "userId"
}

SQL usage:

SELECT "dim", Bitmap32_EXACT_COUNT("exactCountMetric") FROM "ds_name" WHERE "__time" >= CURRENT_TIMESTAMP - INTERVAL '1' DAY GROUP BY "dim"

Limitations: High memory consumption for large data volumes; future work includes better compression and memory‑estimation techniques.
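The operator's build/merge/count semantics can be modeled in a few lines of Python (a plain integer serves as the bitset here; the real aggregator uses RoaringBitmap for compression):

```python
class Bitmap32ExactCount:
    """Toy model of the bitmap-based exact-count aggregator.

    Build sets one bit per observed integer ID, merge ORs two bitmaps,
    and the final result is the number of set bits -- an exact distinct
    count, unlike HLL-style approximations.
    """

    def __init__(self):
        self.bits = 0

    def add(self, value):
        # "Build" phase: ingest a raw integer ID.
        self.bits |= 1 << value

    def merge(self, other):
        # "Merge" phase: combine partial aggregates from segments.
        self.bits |= other.bits

    def count(self):
        # Exact cardinality = population count of the bitmap.
        return bin(self.bits).count("1")
```

The memory limitation noted above falls out of this model directly: the bitmap's footprint grows with the range and density of the IDs, which is why compression and memory estimation are listed as future work.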

3.2 Flexible sliding‑window functions

Problem: Druid only supports fixed‑size windows; many use‑cases require trailing aggregates (e.g., 7‑day UV).

Solution: Added a generic "default" averager that can apply a sliding window to any native aggregator, including hyperUnique (HLL) and other complex types.

Averager JSON:

{
  "aggregations": [
    {
      "type": "hyperUnique",
      "name": "deltaDayUniqueUsers",
      "fieldName": "uniq_user"
    }
  ],
  "averagers": [
    {
      "name": "trailing7DayUniqueUsers",
      "fieldName": "deltaDayUniqueUsers",
      "type": "default",
      "buckets": 7
    }
  ]
}

SQL usage:

SELECT TIME_FLOOR(__time, 'PT1H'), dim, MA_TRAILING_AGGREGATE_DEFAULT(DS_HLL(user), 7) FROM ds_name WHERE __time >= '2021-06-27T00:00:00.000Z' AND __time < '2021-06-28T00:00:00.000Z' GROUP BY 1, 2
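The combine step behind the "default" averager can be sketched generically (a Python illustration; sets stand in for HLL sketches, with union as their merge function and set size as their finalizer):

```python
from collections import deque

def trailing_aggregate(bucketed_values, window, merge, finalize):
    """Apply a trailing window of `window` buckets to any aggregator.

    `bucketed_values` holds one intermediate aggregate per time bucket
    (e.g. one HLL sketch per day). For each bucket, the last `window`
    intermediates are combined with the aggregator's own merge function
    and then finalized -- so any mergeable type works, not just sums.
    """
    out = []
    buf = deque(maxlen=window)  # keeps only the trailing `window` buckets
    for value in bucketed_values:
        buf.append(value)
        acc = None
        for item in buf:
            acc = item if acc is None else merge(acc, item)
        out.append(finalize(acc))
    return out

# Trailing 2-day unique users over 3 daily "sketches" (sets here):
# trailing_aggregate([{1, 2}, {2, 3}, {4}], 2, lambda a, b: a | b, len)
```

Delegating to the aggregator's merge function is what lets the same averager drive a 7‑day UV over hyperUnique sketches, where simply averaging per‑bucket finalized counts would double‑count returning users.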

Both the bitmap operator and the sliding‑window averager have been contributed to the Apache Druid community via pull requests.

4. Future Architecture Evolution

The team is exploring cloud‑native deployment patterns for Druid to further improve stability, cost efficiency, and scalability.

References – Links to Apache Druid documentation, related pull requests, and issue discussions are provided in the original article.
