Big Data 10 min read

How Fanatics Scaled to PB‑Level Data with StarRocks & Apache Iceberg Lakehouse

Fanatics unified its fragmented data stack by building a StarRocks‑powered Lakehouse on Apache Iceberg, replacing Redshift, Snowflake, Athena, and Druid, which cut costs by up to 95%, delivered sub‑second dashboard queries on petabyte‑scale data, and enabled real‑time and historical analytics on a single platform.

StarRocks
StarRocks
StarRocks
How Fanatics Scaled to PB‑Level Data with StarRocks & Apache Iceberg Lakehouse

Original Architecture and Challenges

Fanatics operated a fragmented data stack:

Data warehouse: Amazon Redshift and Snowflake

Data lake: S3 with Athena queries (interactive latency >30 seconds)

Real‑time analytics layer: Apache Druid for dashboards

Problems included slow Athena queries, inability to join across warehouses, Druid’s limited join and primary‑key update capabilities, redundant data copies, high storage/transfer costs, and a lack of unified data governance.

Technology Selection and Evaluation

After evaluating several OLAP engines (ClickHouse, etc.), Fanatics selected StarRocks because it offers:

High‑performance query execution

Native integration with AWS Glue Catalog

Asynchronous Materialized Views (AMV) that automatically rewrite queries

Next‑Generation Lakehouse Architecture

The new stack consolidates storage, compute, and metadata:

Storage layer: Amazon S3 as the source of truth, using Apache Iceberg open table format.

Compute layer: StarRocks as the high‑performance query engine.

Metadata management: Unified catalog via AWS Glue.

Transformation & visualization: dbt Core for data modeling; Superset and a Bedrock‑based Text‑to‑SQL agent for self‑service analytics.

Data Import Modes and Table Design

Fanatics supports multiple StarRocks ingestion methods (batch load, stream load, CDC) and optimizes table schemas based on workload characteristics:

Primary‑Key tables: Enable fast point queries and real‑time updates.

Duplicate‑Key tables: Optimized for high‑throughput append‑only workloads.

Tablet size: Tuned to ~1 GB per tablet to balance I/O and memory usage.

Tenant‑aware partitioning: Prevents data skew and reduces the number of small files.

Architecture diagram
Architecture diagram

Performance Optimization

Asynchronous Materialized Views (AMV): Fine‑grained refresh windows and partition alignment automatically rewrite queries, eliminating manual SQL rewrites.

Use of EXPLAIN and ANALYZE to verify partition pruning and predicate push‑down.

Continuous tuning of Colocate Join and Shuffle Join to minimize inter‑node I/O.

Multi‑level aggregation for high‑cardinality metrics.

Cache and Storage Strategy

Most queries read Iceberg tables directly; hot data blocks are cached locally via StarRocks Data Cache.

AMVs are introduced only when direct table scans cannot meet latency SLAs.

Current deployment runs StarRocks on EC2 instances with EBS (compute‑storage integrated). Future roadmap includes a disaggregated compute‑storage architecture for elastic scaling.

Benefits

Replaced Apache Druid for real‑time dashboards, achieving sub‑second latency while supporting joins, primary‑key updates, and point queries on a single platform.

Snowflake usage dropped by ~95 % and related costs fell by ~90 %.

Ad‑hoc query latency on Iceberg tables is ~10× faster than Athena.

A single StarRocks cluster now serves thousands of tables and materialized views, handling petabyte‑scale Iceberg data.

Dashboard provisioning time reduced from days/weeks to minutes.

Future Plans

Deepen AI integration for conversational analytics across datasets.

Expand Streamlit adoption for rapid lightweight data‑app delivery.

Implement touchless onboarding of new databases and tables to increase automation.

Strengthen disaster‑recovery with comprehensive metadata backup and restore.

Standardize BI tools on StarRocks as the unified analytics backbone.

StarRocks + Apache Iceberg

The Fanatics case demonstrates that combining an open table format (Iceberg) with a high‑performance query engine (StarRocks) enables large enterprises to simplify architecture, dramatically cut costs, and accelerate data‑driven insights.

Performance chart
Performance chart
Performance optimizationStarRocksApache Icebergdata architecturelakehouseFanatics
StarRocks
Written by

StarRocks

StarRocks is an open‑source project under the Linux Foundation, focused on building a high‑performance, scalable analytical database that enables enterprises to create an efficient, unified lake‑house paradigm. It is widely used across many industries worldwide, helping numerous companies enhance their data analytics capabilities.

0 followers
Reader feedback

How this landed with the community

Sign in to like

Rate this article

Was this worth your time?

Sign in to rate
Discussion

0 Comments

Thoughtful readers leave field notes, pushback, and hard-won operational detail here.