
Optimization of A/B Test Metric Computation Using Spark and ClickHouse

This article details the design and multi-stage optimization of an A/B testing metric system: its product architecture, Spark-based computation engine, ClickHouse OLAP layer, cumulative-calculation improvements, and batch-processing techniques that kept end-to-end processing time at a stable 2–3 hours even as the workload grew to hundreds of experiments and metrics.

TAL Education Technology

Introduction

A/B testing is a data‑driven method that splits traffic to run multiple product versions simultaneously, records user behavior, and compares metrics to support scientific product decisions.
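Traffic splitting is typically implemented as deterministic hash-based bucketing, so each user lands in the same variant on every visit while assignments stay independent across experiments. The article does not show an implementation; a minimal sketch, with illustrative function and variant names, might look like this:

```python
import hashlib

def assign_variant(user_id: str, experiment_id: str,
                   variants=("control", "treatment")) -> str:
    """Deterministically bucket a user into a variant for one experiment.

    Hashing (experiment_id, user_id) together keeps each user's assignment
    stable while remaining independent across experiments.
    """
    digest = hashlib.md5(f"{experiment_id}:{user_id}".encode()).hexdigest()
    return variants[int(digest, 16) % len(variants)]
```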

Metric Product Design

The metric system uses a registration approach where users define metrics with SQL formulas and optional custom dimensions; the analysis layer provides both pre‑computed and on‑demand multi‑dimensional queries.
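A registration model of this kind could be sketched as follows; the `MetricDefinition` fields and the ClickHouse-style formula are illustrative assumptions, not the platform's actual schema:

```python
from dataclasses import dataclass, field

@dataclass
class MetricDefinition:
    name: str
    metric_type: str                      # "rate" or "mean"
    sql: str                              # aggregation formula the engine runs
    dimensions: list = field(default_factory=list)  # optional custom dimensions

registry: dict = {}

def register_metric(defn: MetricDefinition) -> None:
    """Validate and store a user-defined metric for later computation."""
    if defn.metric_type not in ("rate", "mean"):
        raise ValueError(f"unsupported metric type: {defn.metric_type}")
    registry[defn.name] = defn
```

A user registering a purchase-rate metric would then supply something like `MetricDefinition("purchase_rate", "rate", "countIf(event = 'purchase') / uniqExact(user_id)", ["grade"])`, where the SQL formula is a hypothetical example.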

Metric Technical Architecture

The platform employs Spark as the core computation engine for its performance and maturity, and ClickHouse as the OLAP engine for fast multi‑dimensional analysis of detailed data.

Initially, processing 10+ experiments and 50+ metrics took 2–3 hours; within six months the workload had grown to 10 experiments running in parallel on a 100-core cluster, prompting optimization.

Stage 1: Engine and Architecture Optimization

Adopted Spark for batch jobs and ClickHouse for analytical queries, enabling parallel execution of multiple experiments while keeping metric calculations within each experiment serial.
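The scheduling described here, with experiments in parallel but metrics serial within each experiment, can be sketched with a thread pool; `compute_metric` is a placeholder for submitting a Spark job, and all names are illustrative:

```python
from concurrent.futures import ThreadPoolExecutor

def compute_metric(exp_id: str, metric: str) -> float:
    # Placeholder: in the real system this would submit a Spark job.
    return 0.0

def run_experiment(exp_id, metrics):
    # Metrics within one experiment run serially, in order.
    return exp_id, {m: compute_metric(exp_id, m) for m in metrics}

def run_all(experiments: dict, max_parallel: int = 10) -> dict:
    # Experiments run in parallel, up to max_parallel at a time.
    with ThreadPoolExecutor(max_workers=max_parallel) as pool:
        return dict(pool.map(lambda item: run_experiment(*item),
                             experiments.items()))
```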

Stage 2: Cumulative Calculation Model Optimization

Replaced the original model that scanned all historical data for each cumulative metric with a new model that builds daily aggregates incrementally, improving performance and accuracy.
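The incremental model boils down to merging mergeable partial aggregates (sums and counts) day by day instead of rescanning all history for each cumulative metric. A minimal sketch, with an illustrative state layout:

```python
def empty_state() -> dict:
    return {"sum": 0.0, "count": 0}

def merge_day(state: dict, daily: dict) -> dict:
    """Fold one day's partial aggregate into the running cumulative state."""
    return {"sum": state["sum"] + daily["sum"],
            "count": state["count"] + daily["count"]}

def cumulative_mean(state: dict) -> float:
    return state["sum"] / state["count"] if state["count"] else 0.0
```

Because each day only merges two small states, the cost per day stays constant instead of growing with the length of the experiment.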

Stage 3: Rate Metric Batch Optimization

Implemented batch processing for rate metrics that share the same SQL definition across experiments, reducing total runtime to about 5 hours for 150+ experiments and 600+ metrics.
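The batching idea is to group metric instances by their shared SQL text, run each distinct query once, and fan the per-experiment results out to every experiment that shares it. A hedged sketch, where `execute_sql` stands in for a Spark/ClickHouse query run:

```python
from collections import defaultdict

def run_batched(metric_defs, execute_sql):
    """metric_defs: iterable of (experiment_id, metric_name, sql) triples.

    Each distinct SQL text is executed exactly once; its per-experiment
    rows are fanned out to every (experiment, metric) that shares it.
    """
    groups = defaultdict(list)
    for exp_id, metric, sql in metric_defs:
        groups[sql].append((exp_id, metric))

    results = {}
    for sql, members in groups.items():
        rows = execute_sql(sql)  # one scan serves all sharing experiments
        for exp_id, metric in members:
            results[(exp_id, metric)] = rows.get(exp_id)
    return results
```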

Stage 4: Mean Metric Batch Optimization

For complex mean metrics, introduced Spark checkpointing and a hybrid Spark‑ClickHouse workflow that caches intermediate detail data, achieving further speed‑ups despite increased complexity.
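Spark checkpointing materializes an intermediate dataset so downstream metric computations reuse it rather than recomputing the full lineage each time. The effect can be illustrated with a simple keyed cache; this is a toy stand-in, not Spark's API:

```python
class StageCache:
    """Toy stand-in for checkpointed/cached intermediate detail data."""

    def __init__(self):
        self._store = {}
        self.misses = 0  # how many times the expensive stage actually ran

    def get_or_compute(self, key, compute):
        if key not in self._store:
            self.misses += 1
            self._store[key] = compute()  # expensive stage runs only once
        return self._store[key]
```

Every mean metric that depends on the same detail dataset then pays the extraction cost once, which is the trade the hybrid Spark-ClickHouse workflow makes: more moving parts in exchange for far less recomputation.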

Conclusion

After the four‑stage optimization, the system now handles over 150 experiments and 600 metrics with a stable processing time of 2–3 hours, demonstrating scalable and controllable performance as the workload grows.

Data Engineering · Big Data · ClickHouse · A/B testing · Spark · Metric Optimization
Written by

TAL Education Technology

TAL Education is a technology-driven education company committed to the mission of 'making education better through love and technology'. The TAL technology team has always been dedicated to educational technology research and innovation. This is the external platform of the TAL technology team, sharing weekly curated technical articles and recruitment information.
