How Hulu Supercharges OLAP Queries with CarbonData: Real‑World Optimizations
This article describes Hulu's real-world OLAP query optimization. It covers the fundamentals of OLAP, a comparison of row- and column-based storage formats, and the indexing mechanisms of Parquet, ORC, and CarbonData. It then details the schema, shuffle, block-size, speculation, and GC tuning techniques that enabled CarbonData to dramatically accelerate wide-table queries on SparkSQL.
1. Background
Hulu is a North American video-streaming company. Its OLAP queries analyze large volumes of user-behavior data (logins, viewing, purchases) to guide market expansion, improve the user experience, and grow ad revenue.
OLAP query optimization starts with segmenting the workload by scenario, then tuning the storage format and query engine for each segment. This article shares Hulu's experience with CarbonData and SparkSQL.
2. OLAP Basics
Basic concepts:
Dimension : a descriptive attribute used to slice and filter data, typically appearing in WHERE and GROUP BY clauses.
Measure : a numeric value that quantifies the fact, usually aggregated (SUM, AVG, etc.).
Fact : a record of a business event; fact tables store measures keyed by dimensions.
Example: a flight statistics table with measures such as ticket price and satisfaction, and dimensions such as flight time and passenger ID.
Common data models: star schema and snowflake schema.
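To make the star schema concrete, the flight example above could be modeled as one fact table referencing two dimension tables (all table and column names here are illustrative, not from the original article):

```sql
-- Fact table: one row per ticket sale, holding the measures
CREATE TABLE flight_fact (
  flight_id    INT,             -- foreign key into flight_dim
  passenger_id INT,             -- foreign key into passenger_dim
  ticket_price DECIMAL(10, 2),  -- measure
  satisfaction INT              -- measure (e.g., a 1-5 rating)
);

-- Dimension tables: descriptive attributes used for filtering
CREATE TABLE flight_dim (
  flight_id   INT,
  flight_time TIMESTAMP,
  origin      STRING,
  destination STRING
);

CREATE TABLE passenger_dim (
  passenger_id INT,
  region       STRING
);
```

A snowflake schema would further normalize the dimension tables, for example splitting `region` out of `passenger_dim` into its own table.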
3. Main Big‑Data File Formats
3.1 Storage Layout
3.1.1 Row‑based storage
Traditional databases store whole rows together, which suits ACID transactions but yields low compression ratios and high I/O for OLAP queries that read only a few columns.
3.1.2 Column‑based storage
Columnar formats such as Parquet and ORC store data column-wise, with three main advantages:
Higher compression because each unit stores same‑type values.
Reduced I/O when only a few columns are accessed.
Support for nested structures, effectively pre‑joined.
Columnar read cost grows roughly linearly with the number of requested columns; when a query selects very many columns, a columnar format can become slower than a row-based one. CarbonData aims to mitigate exactly this case.
3.2 Indexes
Most formats add indexes at various levels to filter blocks.
3.2.1 Parquet
Parquet stores min/max statistics at the row-group level and supports predicate push-down, skipping row groups that cannot match a filter.
3.2.2 ORC
Provides min/max indexes at file, stripe, and row‑group levels.
3.2.3 CarbonData
CarbonData offers file-level and blocklet-level min/max indexes plus row-level inverted indexes. The three levels take effect at different stages: file pruning on the driver, blocklet pruning in executor tasks, and row filtering during the actual read, giving finer-grained filtering than Parquet or ORC.
CarbonData also sorts data on a multi-dimension key (MDK) spanning several columns, and encodes dimension values with global dictionaries whose decoding is deferred until the values are actually needed (lazy conversion).
4. Typical Hulu OLAP Scenarios
4.1 Relational structure
Queries such as "viewing-time trend of a TV series per day" join multiple tables and rely on indexes; data volume is moderate, and the workload needs flexible queries and transactional support. This scenario is served by row- or column-based storage (e.g., ORC) with engines such as Hive, Presto, or Impala.
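A trend query of this kind might look as follows in SparkSQL; the schema and names are illustrative, not Hulu's actual tables:

```sql
-- Daily viewing-time trend for one TV series
SELECT watch_date,
       SUM(watch_seconds) AS total_watch_time
FROM   user_watch_fact f
JOIN   series_dim s ON f.series_id = s.series_id
WHERE  s.series_name = 'Some Series'
GROUP  BY watch_date
ORDER  BY watch_date;
```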
4.2 Nested structure
Used for user segmentation; many repeated features are stored in nested columns. Hulu built the Nesto engine to enable filtering on repeated columns, effectively performing pre‑join and improving performance.
4.3 Wide‑table queries
Feature‑rich queries select hundreds of columns; they benefit from strong indexing and a hybrid of column‑ and row‑based storage, for which CarbonData is employed.
5. CarbonData Optimizations at Hulu
5.1 Table Optimizations
5.1.1 Schema
The CREATE TABLE statement determines the column order of the MDK index. By default, every non-numeric column is treated as a dimension and included in the MDK; numeric columns can be promoted to dimensions with the DICTIONARY_INCLUDE table property, and columns can be kept out of the global dictionary with DICTIONARY_EXCLUDE.
CREATE TABLE IF NOT EXISTS Table1 (
  a Int, b Int, c String, d Int, e Int
) STORED BY 'org.apache.carbondata.format'
TBLPROPERTIES ("DICTIONARY_INCLUDE"="a,b,c,d,e")
Dimension columns should be ordered by increasing distinct-value count (cardinality) or by decreasing filter frequency, which keeps the MDK index small and effective.
5.1.2 Shuffle
Each load creates a segment of CarbonData files. Uneven data distribution can cause some blocklets to contain many target rows, leading to straggler tasks. Repartitioning the source data before load evens out blocklet sizes.
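One way to perform that repartition is to shuffle the source rows on the most frequently filtered column during the load itself; in SparkSQL a DISTRIBUTE BY clause triggers the shuffle (table and column names are illustrative):

```sql
-- Shuffle source rows by user_id before the load, so each blocklet
-- holds a narrow value range and the min/max index stays selective
INSERT INTO table1
SELECT *
FROM   source_table
DISTRIBUTE BY user_id;
```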
5.1.3 Block/Blocklet Size
The default blocklet size is 120,000 rows, and a new block is started once the accumulated data exceeds 90% of the 1 GB block size. Smaller block/blocklet sizes make min/max filtering more selective, but increase the file count and per-file I/O overhead.
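The block size can be lowered per table at creation time through CarbonData's TABLE_BLOCKSIZE property, specified in MB; the value below is only an example to experiment with, not a recommendation from the original benchmark:

```sql
CREATE TABLE IF NOT EXISTS Table2 (
  a Int, b Int, c String
) STORED BY 'org.apache.carbondata.format'
TBLPROPERTIES ("TABLE_BLOCKSIZE"="512")  -- 512 MB instead of the 1 GB default
```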
5.2 Query Engine Optimizations
5.2.1 Speculation
Enabling Spark speculation re-executes slow tasks on other nodes, cutting tail latency. Because only a few tasks are outliers, the speculation parameters can be set aggressively.
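The relevant settings live in spark-defaults.conf or can be passed with --conf. The values below are one possible aggressive configuration (the article does not give Hulu's exact numbers), shown as a starting point:

```
spark.speculation            true
# Consider speculation once 50% of the tasks in a stage have finished
# (Spark's default quantile is 0.75)
spark.speculation.quantile   0.5
# Re-launch a task running at least 1.2x slower than the median
# (Spark's default multiplier is 1.5)
spark.speculation.multiplier 1.2
```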
5.2.2 Garbage Collection
Size the young generation at roughly 512 MB to 1 GB and trigger old-generation collection at about 50% heap occupancy. Prefer G1GC over CMS, which performs better on large heaps.
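These GC guidelines translate into executor JVM options along the following lines; exact values must be tuned to the executor heap, so this is a sketch rather than the configuration used in the article:

```
# G1 variant (preferred): let G1 size the generations itself, and start
# concurrent marking when the whole heap reaches 50% occupancy
spark.executor.extraJavaOptions  -XX:+UseG1GC -XX:InitiatingHeapOccupancyPercent=50

# CMS variant: fix the young generation at 512 MB and start
# old-generation collection at 50% occupancy
# spark.executor.extraJavaOptions  -XX:+UseConcMarkSweepGC -Xmn512m \
#   -XX:CMSInitiatingOccupancyFraction=50 -XX:+UseCMSInitiatingOccupancyOnly
```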
5.3 Tuning Experiments
Test environment: 500 GB raw data, ~300 columns (including nested), >300 M rows; queries select 100–200 columns, filter >95 % of rows; cluster with 100 executors, 5 vCores each, 10 GB memory per executor.
Results show CarbonData + SparkSQL outperforms Parquet + Presto/Impala due to fine‑grained blocklet filtering and speculation. Further parameter tweaks yield diminishing returns.
6. References
https://en.wikipedia.org/wiki/Online_analytical_processing
https://cwiki.apache.org/confluence/display/CARBONDATA/CarbonData+Home
https://github.com/apache/incubator-carbondata
https://orc.apache.org/docs/spec-index.html
Hulu Beijing
