
iQIYI's Big Data Architecture Evolution and Adoption of Druid

iQIYI upgraded its big‑data stack by adopting Druid as the core engine for free‑time (arbitrary‑range) queries, with ElasticSearch serving pre‑computed fixed‑time queries. The team overcame early API, security and scaling challenges through monthly segment granularity, parallel sub‑queries, Redis caching and cluster failover, cutting typical query latency from over two seconds to about 150 ms and reaching a 99.9 % service success rate.

iQIYI Technical Product Team

iQIYI currently employs a range of big‑data technologies, including Druid, Impala, Kudu, Kylin, Presto and ElasticSearch, and upgrades them as new versions are released.

Because of the massive daily data volume (hundreds of millions of records) and the need to support both fixed‑time and free‑time queries (often spanning up to two years), iQIYI adopts a combined architecture of Druid for free‑time queries and ElasticSearch for fixed‑time queries. Both systems maintain two copies of data for high availability.

During the use of Druid and ElasticSearch, several challenges were encountered:

Druid only provides JSON‑based APIs for ingestion and querying, resulting in a steep learning curve.

Early Druid versions had weak data‑security features.

High QPS and long‑time‑span aggregation queries put pressure on the cluster.

ElasticSearch is not ideal for large‑scale aggregation and its RESTful API also has a high learning cost.

Business isolation within a shared Druid cluster is difficult.
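To illustrate the JSON‑based API behind the first challenge, here is a minimal sketch of a Druid native timeseries query body; the datasource and column names are hypothetical placeholders, not iQIYI's actual schema.

```python
import json

# Minimal Druid native timeseries query. "video_metrics" and "play_count"
# are illustrative names, not a real iQIYI datasource.
query = {
    "queryType": "timeseries",
    "dataSource": "video_metrics",
    "granularity": "day",
    "intervals": ["2018-01-01/2018-01-08"],
    "aggregations": [
        {"type": "longSum", "name": "plays", "fieldName": "play_count"}
    ],
}

# This JSON string would be POSTed to the Broker's query endpoint.
body = json.dumps(query)
```

Every query type, filter and aggregator has its own JSON structure, which is why teams used to SQL face a learning curve.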

Initially, the data service for iQIYI Hao (iQIYI's content‑creator platform) relied on Kylin. As video‑detail data grew, Kylin's cube build time and query latency increased dramatically, often causing time‑outs and requiring extensive manual intervention.

After evaluating Impala + Kudu, ElasticSearch and Druid, iQIYI selected Druid as the core platform because of its ability to handle ultra‑large data scales, millisecond‑level latency, high concurrency and stability. Fixed‑time queries are pre‑computed and stored in ElasticSearch, while free‑time queries are served by Druid.

To improve Druid performance for long‑range queries, the segment granularity was changed from daily to monthly for each video, reducing the number of segments scanned. When a query spans more than six months, the time range is split into separate month‑level and day‑level sub‑queries that run in parallel.
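The month/day splitting strategy can be sketched as follows; this is an illustrative reconstruction of the idea, not iQIYI's actual code.

```python
from datetime import date, timedelta

def split_intervals(start: date, end: date):
    """Split [start, end) into whole-month intervals plus leading/trailing
    day-level remainders, mirroring the month + day sub-query strategy.
    Returns (month_intervals, day_intervals) as lists of (start, end) pairs.
    Each list can then be queried in parallel against month- or day-grain
    segments."""
    months, days = [], []
    cur = start
    # Leading partial month is handled by a day-level sub-query.
    if cur.day != 1:
        nxt = (cur.replace(day=1) + timedelta(days=32)).replace(day=1)
        days.append((cur, min(nxt, end)))
        cur = min(nxt, end)
    # Whole months go to month-level sub-queries; a trailing partial
    # month falls back to a day-level sub-query.
    while cur < end:
        nxt = (cur.replace(day=1) + timedelta(days=32)).replace(day=1)
        if nxt <= end:
            months.append((cur, nxt))
            cur = nxt
        else:
            days.append((cur, end))
            cur = end
    return months, days
```

Running the month‑level and day‑level sub‑queries in parallel lets the month‑grain segments absorb most of the scan work on long ranges.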

These optimizations reduced a typical user query time from over 2 seconds to about 150 ms, and the success rate of the service rose to 99.9 % thanks to additional Redis caching and Hystrix‑based failover to a standby Druid cluster.
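The cache‑plus‑failover pattern behind the 99.9 % figure can be sketched like this; the dict stands in for Redis and the two callables stand in for queries against the primary and standby Druid clusters, so none of the names below are iQIYI's actual implementation.

```python
cache = {}  # stand-in for Redis

def query_with_failover(key, primary, standby):
    """Serve from cache when possible; otherwise try the primary cluster,
    then fall back to the standby cluster (Hystrix-style), caching any
    successful result."""
    if key in cache:
        return cache[key]
    for backend in (primary, standby):
        try:
            result = backend(key)
            cache[key] = result
            return result
        except Exception:
            continue  # backend failed; fall through to the next cluster
    raise RuntimeError("both Druid clusters unavailable")
```

The cache absorbs repeated hot queries, while the fallback keeps the service answering even when the primary cluster is degraded.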

Druid’s architecture consists of five node types:

MiddleManager – ingests data and creates segment files.

Historical – loads segment files for query serving.

Coordinator – balances data load across Historical nodes and manages data lifecycle.

Overlord – balances ingestion load.

Broker – receives client queries, forwards them to MiddleManager/Historical nodes, merges results and returns them.

External dependencies include Metadata storage (typically MySQL), Zookeeper for coordination, and Deep Storage (usually HDFS) for segment files.

The query flow is: a client sends a request to the Broker, which discovers relevant segments via Zookeeper, dispatches the request to the appropriate Historical and real‑time nodes, collects their results, merges them and returns the final answer.
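The Broker's final merge step can be sketched as below; the row shape ({"timestamp": …, "plays": …}) is an illustrative assumption, not Druid's internal representation.

```python
from collections import defaultdict

def merge_results(partials):
    """Combine partial per-day results returned by different Historical /
    real-time nodes into one sorted answer, summing values that share a
    timestamp -- a toy version of the Broker's merge."""
    merged = defaultdict(int)
    for rows in partials:
        for row in rows:
            merged[row["timestamp"]] += row["plays"]
    return [{"timestamp": t, "plays": v} for t, v in sorted(merged.items())]
```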

Druid achieves fast queries through three main mechanisms:

In‑memory query processing.

Result caching at the Broker level.

Segment storage format: columnar layout, bitmap indexes, and RoaringBitmap compression.

Bitmap indexes enable rapid filtering. For example, to find records of basketball player Curry, separate bitmaps are built for each dimension value; an AND operation on the “name” and “profession” bitmaps instantly yields the matching rows. RoaringBitmap further compresses the sparse bitmaps.
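The Curry example can be made concrete with a toy bitmap index; here each bitmap is a plain Python int used as a bit set (real Druid uses compressed RoaringBitmaps), and the rows are invented for illustration.

```python
# Four toy rows; names and professions are illustrative only.
rows = [
    {"name": "Curry", "profession": "basketball"},
    {"name": "James", "profession": "basketball"},
    {"name": "Curry", "profession": "singer"},
    {"name": "Messi", "profession": "football"},
]

# Build one bitmap (an int used as a bit set) per (dimension, value) pair:
# bit i is set when row i has that value.
bitmaps = {}
for i, row in enumerate(rows):
    for dim, val in row.items():
        bitmaps[(dim, val)] = bitmaps.get((dim, val), 0) | (1 << i)

# Filter name = Curry AND profession = basketball with one bitwise AND.
hits = bitmaps[("name", "Curry")] & bitmaps[("profession", "basketball")]
matching = [i for i in range(len(rows)) if hits >> i & 1]  # only row 0 matches
```

The AND touches only two small bitmaps instead of scanning every row, which is what makes this filtering fast at scale.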

Future work includes expanding SQL support (currently limited, no JOINs), rolling up older data to coarser granularities to improve long‑range query performance, and isolating different business functions within the shared Druid cluster.
