Practical Experience and Optimization of Apache Druid for Real‑Time OLAP at iQIYI
This article describes how iQIYI evaluated various OLAP engines, selected Apache Druid for real‑time analytics, detailed its architecture, identified performance bottlene‑cks in Coordinator, Overlord and indexing, applied configuration and resource‑allocation optimizations, and built a user‑friendly RAP platform to democratize real‑time data analysis.
In recent years, big‑data technologies have become pervasive, but the growing volume and timeliness requirements of business data create challenges for real‑time analysis. iQIYI’s big‑data service team evaluated mainstream OLAP engines and chose Apache Druid as the core solution for its low‑latency, multi‑dimensional query capabilities.
Why Druid? Compared with offline tools (Hive, Impala, Kylin) that suffer from poor freshness, and with other real‑time options (Elasticsearch, OpenTSDB, Spark/Flink, Kudu) that have scalability or latency issues, Druid offers minute‑level query visibility, second‑level query latency, flexible dimensions, instant index configuration changes, and exactly‑once semantics in the newer KIS mode.
Druid Architecture consists of MiddleManager (real‑time indexing), Historical (segment storage and query), Broker (query routing), Overlord (task management) and Coordinator (load balancing). The platform now runs on hundreds of nodes, ingesting billions of events daily with sub‑second query latency.
Performance bottlenecks and fixes :
Coordinator: load‑balancing was O(N²) because each segment’s cost was computed against every historical node. Switching to druid.coordinator.balancer.strategy = CachingCostBalancerStrategy reduced scheduling time by 10,000×.
Overlord: API latency caused task failures due to a small DB connection pool. Increasing the pool with druid.metadata.storage.connector.dbcp.maxTotal = 64 improved API response, though CPU pressure required tuning druidBeam.overlordPollPeriod and eventually moving to KIS mode.
Indexing cost: default index tasks used 3 cores per task, wasting resources for small workloads. Introducing “Tiny” nodes (1 core / 3 GB) for 80 % of low‑volume tasks doubled slot capacity without adding hardware.
RAP (Realtime Analysis Platform) was built to hide Druid’s complexity. It provides wizard‑driven data ingestion, JSON‑free query definition, automatic dashboard generation, sub‑minute end‑to‑end latency, second‑level query response, and seamless dimension changes.
The platform is already deployed in iQIYI’s membership, recommendation and BI systems, supporting thousands of real‑time reports and alerts. Future work includes integrating the Pilot intelligent SQL engine, enhancing operational tooling, adding offline Parquet indexing, and supporting JOINs.
For more technical details, see the referenced iQIYI big‑data real‑time analysis articles.
DataFunTalk
Dedicated to sharing and discussing big data and AI technology applications, aiming to empower a million data scientists. Regularly hosts live tech talks and curates articles on big data, recommendation/search algorithms, advertising algorithms, NLP, intelligent risk control, autonomous driving, and machine learning/deep learning.
How this landed with the community
Was this worth your time?
0 Comments
Thoughtful readers leave field notes, pushback, and hard-won operational detail here.