How Hulu’s Nesto Engine Delivers Near‑Real‑Time OLAP on TB‑Scale Data
This article introduces Hulu's in‑house OLAP engine Nesto, detailing its near‑real‑time data ingestion, nested data model, TB‑level storage using HBase and Parquet, MPP query execution, custom predicate library, and the overall architecture that enables sub‑second ad‑hoc queries for user analytics.
1. Project Background
Nesto originated from Hulu's user analytics team, which needed a user‑centric, analytical product capable of ad‑hoc interactive queries and data export for operations, product, and third‑party data companies.
2. Data Platform Pipeline
The core asset is a pipeline that integrates data from multiple internal teams into HBase, stores it centrally, and provides a UI portal where users describe requirements that are saved in a metadata DB. A Spark job periodically scans HBase, performs batch computation, and serves valuable data such as user‑tag services.
Typical use case: export new users who watched "Game of Thrones" S7E7 more than five times in January 2018, including username and email, for marketing emails.
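A request like this would reach the engine through Nesto's query DSL. The article does not document the DSL's syntax, so the following JSON is purely an illustrative sketch of how such a filter-plus-projection query might look (every field and operator name here is an assumption):

```json
{
  "table": "RegUsersAttributes",
  "select": ["username", "email"],
  "filter": {
    "and": [
      {"field": "signup_date",  "op": "between", "value": ["2018-01-01", "2018-01-31"]},
      {"field": "watch.show",   "op": "=",       "value": "Game of Thrones S7E7"},
      {"field": "watch.count",  "op": ">",       "value": 5}
    ]
  }
}
```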
3. Storage Model
HBase stores user‑dimension data as a wide‑table KV model, offering horizontal scalability, strong random read/write, high‑throughput bulk import, and seamless integration with the Hadoop/Spark stack. Each HBase row contains raw attributes, nested behaviors, birth date, age (derived), and timestamps, with nested watch behavior stored using multi‑version columns.
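The multi-version column trick can be illustrated with a small self-contained model: each (row, qualifier) pair keeps a version-ordered set of values, so a single "watch" column can hold one cell per watch event. This is a toy sketch of HBase's cell-versioning semantics, not the real client API, and the qualifier naming is an assumption:

```java
import java.util.*;

// Toy model of HBase multi-version cells: one row, many qualifiers,
// each qualifier holding several timestamped versions (newest first).
public class MultiVersionRow {
    private final Map<String, TreeMap<Long, String>> cells = new HashMap<>();

    public void put(String qualifier, long version, String value) {
        cells.computeIfAbsent(qualifier, q -> new TreeMap<>(Comparator.reverseOrder()))
             .put(version, value);
    }

    /** Return up to maxVersions values, newest first (like HBase's Get#setMaxVersions). */
    public List<String> get(String qualifier, int maxVersions) {
        List<String> out = new ArrayList<>();
        for (String v : cells.getOrDefault(qualifier, new TreeMap<>()).values()) {
            if (out.size() == maxVersions) break;
            out.add(v);
        }
        return out;
    }
}
```

In real HBase the same effect is achieved by writing each watch event with an explicit cell timestamp, so nested behavior accumulates under one qualifier without widening the row schema.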
4. Nesto Genesis
To meet OLAP needs, Nesto supports filter, projection, aggregation, and custom UDFs, provides ad‑hoc queries with second‑level latency for regular columns and sub‑hundred‑second latency for nested columns, and aims for hour‑level data ingestion latency.
5. Technical Foundations
Nested, non‑relational storage model.
Columnar storage using Apache Parquet (an open-source implementation of the nested column-striping model from Google's Dremel paper) for efficient I/O and compression.
Massively Parallel Processing (MPP) architecture inspired by Presto.
Thrift‑based RPC with NIO, Reactor pattern, and non‑blocking I/O.
Distributed configuration; high availability via YARN and ZooKeeper.
Near-real-time data ingestion inspired by Google's Mesa.
Supporting components: MySQL for task tracking, Hadoop/HDFS for data storage, Java implementation.
6. Nesto Storage Model Details
Logically, Nesto appears as a nested flat table; physically it uses Parquet files, each split into row groups. A row group holds one column chunk per column, and each column chunk is further divided into pages encoded with techniques like run-length, dictionary, or delta encoding.
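The page-level encodings are what make columnar storage compress so well: a column with long runs of repeated values collapses to a handful of (value, count) pairs. A toy run-length encoder conveys the idea (a self-contained sketch, not Parquet's actual wire format):

```java
import java.util.*;

public class RunLengthEncoder {
    /** Encode a column of values as (value, runLength) pairs. */
    public static List<Map.Entry<String, Integer>> encode(List<String> column) {
        List<Map.Entry<String, Integer>> runs = new ArrayList<>();
        for (String value : column) {
            Map.Entry<String, Integer> last =
                runs.isEmpty() ? null : runs.get(runs.size() - 1);
            if (last != null && last.getKey().equals(value)) {
                runs.set(runs.size() - 1, Map.entry(value, last.getValue() + 1));
            } else {
                runs.add(Map.entry(value, 1));
            }
        }
        return runs;
    }
}
```

A sorted or low-cardinality column (country codes, boolean flags) shrinks dramatically under this scheme, which is why Parquet pairs run-length encoding with dictionary encoding for string columns.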
Table schemas are managed via a distributed configuration center; an example schema for RegUsersAttributes is stored as a Proto schema.
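The article does not reproduce the RegUsersAttributes schema, but a nested Proto schema for such a table might look roughly like the following (all field names and numbers are illustrative assumptions):

```proto
syntax = "proto2";

message RegUsersAttributes {
  optional string user_id    = 1;
  optional string username   = 2;
  optional string email      = 3;
  optional string birth_date = 4;
  // Repeated nested messages map naturally onto Parquet's
  // Dremel-style repetition/definition levels.
  repeated WatchBehavior watch_behaviors = 5;
}

message WatchBehavior {
  optional string show_id     = 1;
  optional int64  watched_at  = 2;
  optional int32  watch_count = 3;
}
```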
7. Overall Architecture
Nesto consists of a query engine and a data‑processing infrastructure.
Query Engine Components
Nesto-portal: Web UI for submitting queries and downloading results.
Nesto-cli: Command-line interactive query client.
State Store: ZooKeeper-based cluster management for high availability.
Coordinator: Receives client requests, parses the DSL (JSON or SQL++), generates a simple execution plan by pushing filters down to workers, splits Parquet files into sub-tasks, schedules them, aggregates results, and streams data back to the client.
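The split-and-schedule step can be sketched as assigning one sub-task per (file, row group) to workers round-robin. This is a simplified model under the assumption of uniform workers; a real scheduler would also weigh data locality and load, which are omitted here:

```java
import java.util.*;

public class SubTaskScheduler {
    public record SubTask(String file, int rowGroup) {}

    /** Assign each (file, rowGroup) pair to a worker round-robin. */
    public static Map<String, List<SubTask>> schedule(Map<String, Integer> rowGroupsPerFile,
                                                      List<String> workers) {
        Map<String, List<SubTask>> plan = new LinkedHashMap<>();
        workers.forEach(w -> plan.put(w, new ArrayList<>()));
        int next = 0;
        for (var file : rowGroupsPerFile.entrySet()) {
            for (int rg = 0; rg < file.getValue(); rg++) {
                String worker = workers.get(next++ % workers.size());
                plan.get(worker).add(new SubTask(file.getKey(), rg));
            }
        }
        return plan;
    }
}
```

Splitting at row-group granularity is a natural fit for Parquet, since each row group is independently readable and carries its own column statistics.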
Worker: Executes sub-tasks, reads Parquet via the Parquet API, applies the predicate library for filter push-down, performs pre-aggregation and limit push-down, and streams data chunks back to the coordinator.
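A worker's scan loop combines all three optimizations: rows failing the predicate are dropped early, surviving rows are folded into partial aggregates, and the scan stops once the limit is satisfied. The sketch below is a conceptual model over materialized rows; Nesto's predicate library actually operates on encoded Parquet pages:

```java
import java.util.*;
import java.util.function.Predicate;

public class WorkerScan {
    /** Filter rows, pre-aggregate a count per group key, and stop early
     *  once `limit` distinct keys have been seen (limit push-down). */
    public static Map<String, Long> scan(List<Map<String, String>> rows,
                                         Predicate<Map<String, String>> filter,
                                         String groupKey, int limit) {
        Map<String, Long> partial = new LinkedHashMap<>();
        for (Map<String, String> row : rows) {
            if (!filter.test(row)) continue;                 // predicate push-down
            String key = row.get(groupKey);
            if (!partial.containsKey(key) && partial.size() == limit) break; // limit push-down
            partial.merge(key, 1L, Long::sum);               // pre-aggregation
        }
        return partial;  // streamed back for the coordinator to merge
    }
}
```

Because each worker ships only partial aggregates rather than raw rows, the coordinator's final merge scales with the number of groups, not the number of scanned rows.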
Data‑Processing Infrastructure
Supports near-real-time data ingestion using a Base-Cumulative-Delta model (after Google's Mesa). Base files are exported via heavy ETL jobs; incremental changes are captured by HBase coprocessors, published to Kafka, and processed hourly into delta files. A compactor merges deltas into larger files to reduce random I/O during queries.
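The compaction step reduces, at its core, to applying ordered deltas over a base snapshot with the newest value winning per row key. A toy model of that Mesa-style merge, with the row key simplified to a string:

```java
import java.util.*;

public class DeltaCompactor {
    /** Merge hourly delta files onto a base snapshot; later deltas win. */
    public static Map<String, String> compact(Map<String, String> base,
                                              List<Map<String, String>> deltasInOrder) {
        Map<String, String> merged = new HashMap<>(base);
        for (Map<String, String> delta : deltasInOrder) {
            merged.putAll(delta); // last write wins per row key
        }
        return merged;
    }
}
```

Without periodic compaction a query would have to consult the base plus every outstanding delta file, which is exactly the random-I/O cost the compactor is there to amortize.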
8. Deployment
Nesto is deployed on YARN for high‑availability and resource management, with support for standalone and pseudo‑distributed modes for testing.
9. Summary and Future Plans
Currently, Nesto serves real‑time user data OLAP queries for Hulu, running on 60 servers with Parquet files up to 3‑4 TB. Future enhancements include code‑generated predicate libraries, hot‑cold data separation, intelligent table routing, improved security, multi‑tenant support, and broader nested SQL capabilities.
Hulu Beijing
