How Hulu Supercharges OLAP Queries with CarbonData: Real‑World Optimizations
This article describes Hulu's real-world OLAP query optimization. It covers the fundamentals of OLAP, a comparison of row- and column-based storage formats, and the indexing mechanisms of Parquet, ORC, and CarbonData. It then details the schema, shuffle, block-size, speculation, and GC tuning techniques that enabled CarbonData to dramatically accelerate wide-table queries on SparkSQL.
1. Background
Hulu is a North American video-streaming company. Its OLAP queries analyze large volumes of user-behavior data (logins, viewing, purchases) to guide market expansion, improve the user experience, and grow ad revenue.
OLAP query optimization starts with segmenting the workload by scenario, then tuning the storage format and query engine for each segment. This article shares Hulu's experience with CarbonData and SparkSQL.
2. OLAP Basics
Basic concepts:
Dimension : a descriptive attribute used to slice and filter data, typically appearing in WHERE and GROUP BY clauses.
Measure : a numeric value that quantifies the fact, usually aggregated (SUM, AVG, etc.).
Fact : a record of a business event; fact tables store measures keyed by dimensions.
Example: a flight statistics table with measures such as ticket price and satisfaction, and dimensions such as flight time and passenger ID.
Common data models: star schema and snowflake schema.
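To make the star schema concrete, the flight example above could be modeled as one fact table referencing two dimension tables (all table and column names here are illustrative, not from the original article):

```sql
-- Fact table: one row per ticket sale, holding the measures
CREATE TABLE flight_fact (
  flight_id    INT,             -- foreign key into flight_dim
  passenger_id INT,             -- foreign key into passenger_dim
  ticket_price DECIMAL(10, 2),  -- measure
  satisfaction INT              -- measure (e.g., a 1-5 rating)
);

-- Dimension tables: descriptive attributes used for filtering
CREATE TABLE flight_dim (
  flight_id   INT,
  flight_time TIMESTAMP,
  origin      STRING,
  destination STRING
);

CREATE TABLE passenger_dim (
  passenger_id INT,
  region       STRING
);
```

A snowflake schema would further normalize the dimension tables, for example splitting `region` out of `passenger_dim` into its own table.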
3. Main Big‑Data File Formats
3.1 Storage Layout
3.1.1 Row‑based storage
Traditional databases store whole rows together, which suits ACID transactions but yields low compression ratios and high I/O for OLAP queries that read only a few columns.
3.1.2 Column‑based storage
Columnar formats such as Parquet and ORC store data column-wise, with three main advantages:
Higher compression because each unit stores same‑type values.
Reduced I/O when only a few columns are accessed.
Support for nested structures, effectively pre‑joined.
Columnar read cost grows roughly linearly with the number of requested columns; when a query selects very many columns, a columnar format can become slower than a row-based one. CarbonData aims to mitigate exactly this case.
3.2 Indexes
Most formats add indexes at various levels to filter blocks.
3.2.1 Parquet
Parquet stores min/max statistics at the row-group level and supports predicate push-down, skipping row groups that cannot match a filter.
3.2.2 ORC
Provides min/max indexes at file, stripe, and row‑group levels.
3.2.3 CarbonData
CarbonData offers file-level and blocklet-level min/max indexes plus row-level inverted indexes. The three levels take effect at different stages: file pruning on the driver, blocklet pruning in executor tasks, and row filtering during the actual read, giving finer-grained filtering than Parquet or ORC.
CarbonData also sorts data on a multi-dimension key (MDK) spanning several columns, and encodes dimension values with global dictionaries whose decoding is deferred until the values are actually needed (lazy conversion).
4. Typical Hulu OLAP Scenarios
4.1 Relational structure
Queries such as "viewing-time trend of a TV series per day" join multiple tables and rely on indexes; data volume is moderate, and the workload needs flexible queries and transactional support. This scenario is served by row- or column-based storage (e.g., ORC) with engines such as Hive, Presto, or Impala.
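A trend query of this kind might look as follows in SparkSQL; the schema and names are illustrative, not Hulu's actual tables:

```sql
-- Daily viewing-time trend for one TV series
SELECT watch_date,
       SUM(watch_seconds) AS total_watch_time
FROM   user_watch_fact f
JOIN   series_dim s ON f.series_id = s.series_id
WHERE  s.series_name = 'Some Series'
GROUP  BY watch_date
ORDER  BY watch_date;
```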
4.2 Nested structure
Used for user segmentation; many repeated features are stored in nested columns. Hulu built the Nesto engine to enable filtering on repeated columns, effectively performing pre‑join and improving performance.
4.3 Wide‑table queries
Feature‑rich queries select hundreds of columns; they benefit from strong indexing and a hybrid of column‑ and row‑based storage, for which CarbonData is employed.
5. CarbonData Optimizations at Hulu
5.1 Table Optimizations
5.1.1 Schema
The CREATE TABLE statement determines the column order of the MDK index. By default, every non-numeric column is treated as a dimension and included in the MDK; numeric columns can be promoted to dimensions with the DICTIONARY_INCLUDE table property, and columns can be kept out of the global dictionary with DICTIONARY_EXCLUDE.
CREATE TABLE IF NOT EXISTS Table1 (
  a Int, b Int, c String, d Int, e Int
) STORED BY 'org.apache.carbondata.format'
TBLPROPERTIES ("DICTIONARY_INCLUDE"="a,b,c,d,e")
Dimension columns should be ordered by increasing distinct-value count (cardinality) or by decreasing filter frequency, which keeps the MDK index small and effective.
5.1.2 Shuffle
Each load creates a segment of CarbonData files. Uneven data distribution can cause some blocklets to contain many target rows, leading to straggler tasks. Repartitioning the source data before load evens out blocklet sizes.
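One way to perform that repartition is to shuffle the source rows on the most frequently filtered column during the load itself; in SparkSQL a DISTRIBUTE BY clause triggers the shuffle (table and column names are illustrative):

```sql
-- Shuffle source rows by user_id before the load, so each blocklet
-- holds a narrow value range and the min/max index stays selective
INSERT INTO table1
SELECT *
FROM   source_table
DISTRIBUTE BY user_id;
```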
5.1.3 Block/Blocklet Size
The default blocklet size is 120,000 rows, and a new block is started once the accumulated data exceeds 90% of the 1 GB block size. Smaller block/blocklet sizes make min/max filtering more selective, but increase the file count and per-file I/O overhead.
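The block size can be lowered per table at creation time through CarbonData's TABLE_BLOCKSIZE property, specified in MB; the value below is only an example to experiment with, not a recommendation from the original benchmark:

```sql
CREATE TABLE IF NOT EXISTS Table2 (
  a Int, b Int, c String
) STORED BY 'org.apache.carbondata.format'
TBLPROPERTIES ("TABLE_BLOCKSIZE"="512")  -- 512 MB instead of the 1 GB default
```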
5.2 Query Engine Optimizations
5.2.1 Speculation
Enabling Spark speculation re-executes slow tasks on other nodes, cutting tail latency. Because only a few tasks are outliers, the speculation parameters can be set aggressively.
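The relevant settings live in spark-defaults.conf or can be passed with --conf. The values below are one possible aggressive configuration (the article does not give Hulu's exact numbers), shown as a starting point:

```
spark.speculation            true
# Consider speculation once 50% of the tasks in a stage have finished
# (Spark's default quantile is 0.75)
spark.speculation.quantile   0.5
# Re-launch a task running at least 1.2x slower than the median
# (Spark's default multiplier is 1.5)
spark.speculation.multiplier 1.2
```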
5.2.2 Garbage Collection
Size the young generation at roughly 512 MB to 1 GB and trigger old-generation collection at about 50% heap occupancy. Prefer G1GC over CMS, which performs better on large heaps.
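These GC guidelines translate into executor JVM options along the following lines; exact values must be tuned to the executor heap, so this is a sketch rather than the configuration used in the article:

```
# G1 variant (preferred): let G1 size the generations itself, and start
# concurrent marking when the whole heap reaches 50% occupancy
spark.executor.extraJavaOptions  -XX:+UseG1GC -XX:InitiatingHeapOccupancyPercent=50

# CMS variant: fix the young generation at 512 MB and start
# old-generation collection at 50% occupancy
# spark.executor.extraJavaOptions  -XX:+UseConcMarkSweepGC -Xmn512m \
#   -XX:CMSInitiatingOccupancyFraction=50 -XX:+UseCMSInitiatingOccupancyOnly
```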
5.3 Tuning Experiments
Test environment: 500 GB raw data, ~300 columns (including nested), >300 M rows; queries select 100–200 columns, filter >95 % of rows; cluster with 100 executors, 5 vCores each, 10 GB memory per executor.
Results show CarbonData + SparkSQL outperforms Parquet + Presto/Impala due to fine‑grained blocklet filtering and speculation. Further parameter tweaks yield diminishing returns.
6. References
https://en.wikipedia.org/wiki/Online_analytical_processing
https://cwiki.apache.org/confluence/display/CARBONDATA/CarbonData+Home
https://github.com/apache/incubator-carbondata
https://orc.apache.org/docs/spec-index.html
Hulu Beijing
