Druid OLAP Platform Practice at YouZan: Architecture, Features, and Challenges
YouZan adopted MetaMarket’s Druid OLAP platform—featuring millisecond‑level interactive queries, high availability, horizontal scalability, and rich SQL/API query types—by configuring simple ingestion tasks that automatically manage real‑time and batch data, tiered hot/cold storage, and monitoring, while still facing ingestion limits, lack of joins, and occasional latency spikes.
This article introduces Druid, a high-performance OLAP (Online Analysis Processing) data storage and analysis system developed by MetaMarket, now incubating at Apache Foundation. Druid is designed for massive datasets and offers sub-second query response times.
Druid's Key Features:
Interactive Query: Millisecond-level query latency after events are created, thanks to columnar storage and optimized query execution that reads only necessary data
High Availability: Uses HDFS/S3 for deep storage with segments replicated across 2 Historical nodes; supports multi-replica data ingestion
Horizontal Scalability: All deployment components can scale horizontally to improve data ingestion and query performance
Parallel Processing: Enables parallel query processing across the entire cluster
Rich Query Capabilities: Supports Scan, TopN, GroupBy, Approximate queries via both API and SQL
Why YouZan Uses Druid: As a SaaS company with massive real-time and offline data, YouZan previously used SparkStreaming/Storm for OLAP scenarios, requiring separate real-time tasks and carefully designed storage. With Druid, developers only need to fill in a data ingestion configuration specifying dimensions and metrics. A real-time task can be created in about 10 minutes.
Druid Architecture (Lambda Architecture):
Coordinator: Manages cluster segments and load balancing
Overlord: Accepts tasks, coordinates task distribution, and collects task status
MiddleManager: Receives indexing tasks from Overlord and creates Peon instances to execute them
Broker: Receives query requests from clients and forwards them to Historical and MiddleManager nodes
Historical: Loads non-real-time window segments
Router: Optional API gateway for improved concurrent query capabilities
YouZan OLAP Platform Features:
DataSource Management
Tranquility Configuration and Instance Management: Automatically deploys Tranquility instances based on traffic and server resources
Data Compensation: Solves late-arriving data issues through offline batch processing
Druid SQL Query Integration
Monitoring and Alerting: Includes basic monitoring (service, cluster status, machine info) and business monitoring (real-time tasks, TPS, latency, RT/QPS)
Hot/Cold Data Separation: Using Historical Tier grouping and load rules - recent 30-day data on SSD (hot tier), 180-day data on SATA (default tier), older data dropped from Historical but retained in HDFS.
Challenges and Future Work:
Data Ingestion: Tranquility has limited ingestion methods and lacks monitoring capabilities
Druid doesn't support JOIN queries across multiple DataSources
Hourly query RT spikes when new Index tasks are created
Historical data automatic Roll-Up for storage optimization
Youzan Coder
Official Youzan tech channel, delivering technical insights and occasional daily updates from the Youzan tech team.
How this landed with the community
Was this worth your time?
0 Comments
Thoughtful readers leave field notes, pushback, and hard-won operational detail here.