Building a Scalable OLAP Platform at SF Express: StarRocks Evaluation and Lessons
SF Express’s data engineering team details how they migrated from a mixed‑component OLAP stack to a unified StarRocks platform, describing the evaluation criteria, performance‑critical design choices, import and query optimizations, and future roadmap for a high‑availability, low‑cost big‑data analytics solution.
Background and Existing OLAP Stack
SF Express’s technology unit, established in 2009, built a comprehensive big‑data ecosystem covering data collection, storage, analysis, machine learning, and visualization. Historically the OLAP layer relied on Elasticsearch (v5.4, later 7.6 with customizations), ClickHouse, Presto, and Kylin, each serving specific workloads such as log search, high‑throughput order processing, Hive queries, and pilot finance projects.
Current Pain Points and Component Selection Challenges
Multiple versions and frameworks coexist, making component upgrades risky and complex.
Users often choose components without deep understanding, leading to misuse (e.g., Elasticsearch for heavy aggregations).
Operational difficulty varies across components, requiring specialized knowledge.
Selection Principles for a New OLAP Engine
Core capabilities must be strong with no obvious shortcomings.
High engineering quality of delivered versions.
Rapid response to production issues.
Strong extensibility and long‑term development potential.
Low operational overhead.
Evaluation Methodology
The team ran side‑by‑side tests of ClickHouse, Presto, Apache Doris, and StarRocks using a standard benchmark suite. Scenarios included a 10‑billion‑row workload with joins, typical SQL from a 100‑billion‑row order‑level use case, and key capabilities such as bulk data import, large‑table joins, and failover.
Why StarRocks Was Chosen
StarRocks demonstrated superior performance and stability, offered fast technical support, and provided a comprehensive operations management system. Consequently, it was selected as the foundation for a one‑stop big‑data analysis platform.
StarRocks Application Practice at SF Express
Overall Goals
The aim is to make StarRocks the core of a unified analytics platform, handling three data streams: real‑time ingestion, batch ETL via Flink/Spark connectors, and external sources (Hadoop, Elasticsearch, MySQL) via external tables.
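For the external‑source path, one illustrative option is a StarRocks MySQL external table, which lets a source table be queried and joined in place. This is a minimal sketch, not SF Express's actual configuration; the host, credentials, database, and columns below are placeholders.

```sql
-- Hypothetical MySQL external table: exposes a source-side dimension table
-- to StarRocks so it can be joined directly with internal tables.
CREATE EXTERNAL TABLE mysql_dim_customer (
    customer_id   BIGINT,
    customer_name VARCHAR(128),
    region        VARCHAR(64)
)
ENGINE = mysql
PROPERTIES (
    "host"     = "mysql-dim-host",   -- placeholder host
    "port"     = "3306",
    "user"     = "reader",
    "password" = "******",
    "database" = "dim",
    "table"    = "customer"
);
```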
Data Ingestion Design
Partition tables by date and bucket by order number; align Kafka partitions with StarRocks BE nodes and import task parallelism.
Use replace_if_not_null for partial field updates and specify JSON paths for each column to avoid import failures.
Separate frequently updated columns into a “private” table and less‑changed columns into a “public” table to improve import throughput.
Upgrade hardware: increase disk count from 6 to 12 (SSD planned) and CPU cores from 40 to 80 to boost QPS.
Handle cross‑machine and cross‑disk replica balancing; recent StarRocks releases resolve machine‑level balancing, with disk‑level balancing slated for future versions.
Mitigate BE write pauses caused by excessive data‑version counts by lengthening the Kafka consumption (batch) interval and reducing partition/replica counts; a Routine Load sketch follows this list.
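To make the ingestion levers above concrete, here is a minimal Routine Load sketch. The job name, topic, columns, and JSON paths are hypothetical; what it illustrates is concurrency roughly matched to Kafka partition and BE parallelism, a wider batch interval to keep per‑tablet version counts down, and explicit jsonpaths per column so unexpected source fields do not fail the import.

```sql
-- Hypothetical Routine Load job feeding the frequently updated "private" order table.
CREATE ROUTINE LOAD order_private_load ON order_private
COLUMNS(order_no, dt, order_status, update_time)
PROPERTIES (
    "desired_concurrent_number" = "8",   -- roughly matched to Kafka partitions / BE nodes
    "max_batch_interval"        = "30",  -- longer batches => fewer data versions per tablet
    "format"                    = "json",
    "jsonpaths"                 = "[\"$.orderNo\",\"$.dt\",\"$.orderStatus\",\"$.updateTime\"]"
)
FROM KAFKA (
    "kafka_broker_list" = "broker1:9092,broker2:9092",  -- placeholder brokers
    "kafka_topic"       = "order_private_topic",
    "property.group.id" = "starrocks_order_private"
);
```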
Query Optimization
Place frequently filtered fields in the key (sort) columns.
Enable Bloom filter indexes for faster lookups.
Reorder joins manually (CBO disabled) and push down join predicates by adding redundant fields to the ON clause.
Introduce a view that unifies the two physical tables, preserving the original schema so BI queries migrate seamlessly (a sketch follows this list).
Collaborate with the BI platform to limit query parallelism and cache hot data, improving stability.
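A minimal sketch of the unifying view, assuming hypothetical order_private / order_public table and column names: BI queries keep hitting one wide schema, the redundant dt column appears in the ON clause so the date predicate is pushed down to both sides, and (with the CBO disabled at the time) the join order is fixed by how the tables are written.

```sql
-- Hypothetical view stitching the "private" (frequently updated) and "public"
-- (rarely updated) order tables back into the original wide schema.
CREATE VIEW order_wide AS
SELECT p.order_no,
       p.dt,
       p.order_status,
       p.update_time,
       q.sender_city,
       q.receiver_city,
       q.product_type
FROM order_private AS p
JOIN order_public  AS q
  ON p.order_no = q.order_no
 AND p.dt       = q.dt;   -- redundant field in the ON clause enables predicate push-down
```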
Architecture and High Availability
Data from multiple business systems is processed by Flink, written to a new Kafka topic, and consumed by StarRocks via Routine Load, achieving exactly‑once semantics. The deployment spans two data centers in a dual‑write, dual‑active configuration, with a JDBC load balancer in front for BI tools and business applications.
Table Design Highlights
Aggregated table model supporting both detail tables and materialized views.
Two tables split by update frequency to increase parallel import tasks.
Date‑based partitions and order‑number bucketing.
Partial updates via replace_if_not_null.
Key‑column placement for low‑frequency fields and redundant columns for join push‑down.
Collocate join between the two tables for efficient joins.
Dynamic date partitions for data expiration.
Bloom filter indexes on frequently queried predicate columns (a DDL sketch follows this list).
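Pulling these choices together, a simplified DDL sketch for the frequently updated "private" table could look like the following. Column names, bucket count, and the 90‑day retention window are illustrative assumptions, not the production schema.

```sql
-- Hypothetical simplified DDL for the frequently updated "private" order table.
CREATE TABLE order_private (
    dt            DATE        NOT NULL,            -- partition column, first key column
    order_no      VARCHAR(64) NOT NULL,            -- bucketing / join key
    order_status  VARCHAR(32) REPLACE_IF_NOT_NULL, -- partial update: NULL fields keep old values
    update_time   DATETIME    REPLACE_IF_NOT_NULL
)
ENGINE = OLAP
AGGREGATE KEY(dt, order_no)
PARTITION BY RANGE(dt) (
    PARTITION p20240101 VALUES LESS THAN ("2024-01-02")  -- seed partition; the rest are created dynamically
)
DISTRIBUTED BY HASH(order_no) BUCKETS 32
PROPERTIES (
    "replication_num"             = "3",
    "colocate_with"               = "order_group",  -- same group as the "public" table for colocate joins
    "bloom_filter_columns"        = "order_no",     -- speeds up point lookups on the query predicate
    "dynamic_partition.enable"    = "true",         -- date partitions created and expired automatically
    "dynamic_partition.time_unit" = "DAY",
    "dynamic_partition.start"     = "-90",          -- keep roughly 90 days of data
    "dynamic_partition.end"       = "3",
    "dynamic_partition.prefix"    = "p",
    "dynamic_partition.buckets"   = "32"
);
```

The companion "public" table would share the same bucketing column, bucket count, and colocation group so the two tables can be joined with a colocate join.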
Results and Impact
After migration, StarRocks handles the same workload with roughly one third of the previous resource consumption, while meeting the Double‑11 peak requirement of writing a 2K‑row wide table at 80K TPS. The platform now supports real‑time, batch, and external data sources with high availability and low latency.
Future Plans and Community Co‑building
Stop onboarding new business workloads onto ClickHouse.
Scale StarRocks adoption across more business lines in the coming year, with budget approvals already in place.
Deepen integration with cloud data‑warehouse projects.
Collaborate with the StarRocks community to contribute code, improve serverless management, enhance operational tools, expand scenario‑specific data models, and support more database engines.