
How Ray + DuckDB Cut 9B-Row Attribution Queries from 40s to 15s

When attribution analysis on over 9 billion rows slowed to more than 40 seconds and threatened cluster stability, Ctrip's smart attribution team rebuilt the architecture with Ray and DuckDB, achieving sub-15-second query times, a 160 % performance gain, and complete resource isolation.

Ctrip Technology

Background

Rapid digital transformation has increased data volumes from terabytes to petabytes, requiring near‑real‑time analytics. Traditional batch‑oriented stacks such as Hadoop and Spark struggle with the scale, high‑frequency query demands, and complex analytical workloads like Jensen‑Shannon divergence‑based attribution.

Technology Selection

Ray was chosen for its fine‑grained dynamic scheduling, zero‑copy object store, and support for actors, tasks, and data‑parallel patterns. DuckDB was selected as an embedded OLAP engine because of its vectorized execution, columnar storage, adaptive query optimizer, and zero‑management deployment model.

Solution Overview

The monolithic “heavy compute” model on ClickHouse was replaced by a set of lightweight tasks. Ray provides the distributed orchestration layer, while each task runs a DuckDB instance locally on a worker node, processing a partition of the data in parallel.
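
A minimal sketch of this pattern in Python follows; the partition paths, column names, and query are illustrative placeholders, not Ctrip's actual schema:

```python
import duckdb
import ray

ray.init()  # connect to the running cluster, or start a local one for testing

@ray.remote(num_cpus=2)
def query_partition(parquet_path):
    # Each lightweight task runs its own in-process DuckDB instance
    # against a single Parquet slice and returns a small partial aggregate.
    con = duckdb.connect()
    return con.execute(
        f"""
        SELECT channel, count(*) AS orders, sum(order_amount) AS gmv
        FROM read_parquet('{parquet_path}')
        GROUP BY channel
        """
    ).df()

# Hypothetical partition list; in practice it comes from the execution tree.
partitions = [f"cache/dt=2024-05-01/part-{i:03d}.parquet" for i in range(8)]
partials = ray.get([query_partition.remote(p) for p in partitions])
```

Because each task aggregates locally inside DuckDB, only small result DataFrames travel back through Ray's object store rather than raw rows.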

Architecture Details

Access Layer: Unified entry point for analysts and data scientists, exposing model training and inference services.

Compute Layer: Ray cluster (head + workers) handles task scheduling, fault tolerance, and object storage.

Infrastructure Layer: Kubernetes with the KubeRay operator automates deployment, scaling, and lifecycle management.

Storage Layer: Parquet files stored in a Ceph-backed S3-compatible store; workers cache needed partitions locally.

Key deployment modules include the KubeRay operator, a head‑node StatefulSet for stable metadata, and a worker‑node Deployment that auto‑scales based on Ray task‑queue length.

[Figure: System architecture diagram]

[Figure: Ray-DuckDB deployment topology]

Attribution Workflow

Data is extracted daily into Parquet files and stored in S3. When a user triggers an attribution request, the front‑end builds an execution tree, which Ray splits into dozens or hundreds of micro‑tasks. Each task loads its relevant Parquet slice, creates DuckDB views for push‑down filtering, and executes vectorized SQL. Results are aggregated on the driver node with minimal data transfer.
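
The per-task view step can be pictured roughly as follows; the slice path, columns, and filter value are assumptions for illustration:

```python
import duckdb

con = duckdb.connect()

# A view over this task's Parquet slice: DuckDB pushes the WHERE filter and
# column projection down to the Parquet scan, so only the needed row groups
# and columns are read from disk.
con.execute("""
    CREATE VIEW attribution_slice AS
    SELECT * FROM read_parquet('cache/dt=2024-05-01/part-*.parquet')
""")

partial = con.execute("""
    SELECT channel, sum(order_amount) AS gmv, count(*) AS orders
    FROM attribution_slice
    WHERE city = 'SHA'
    GROUP BY channel
""").df()
```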

Results

Query latency for a 9‑billion‑row attribution dropped from >40 seconds to <15 seconds, a >160 % speedup.

Resource contention with the ClickHouse cluster was eliminated, improving overall platform stability.

Ray’s elastic scaling allowed on‑demand resource usage, reducing cost while handling peak loads.

On Huawei Kunpeng ARM servers, Ray + DuckDB outperformed comparable x86 deployments by ~25 % under similar resource caps.

Deployment Details

KubeRay Operator: A RayCluster custom resource declares the head and worker pod specs, image versions, and replica counts. The head node runs as a StatefulSet to preserve GCS metadata; workers run as a Deployment with a Horizontal Pod Autoscaler that scales out when the Ray task queue exceeds a threshold (e.g., >50 pending tasks) and scales back in after idle periods (e.g., 5 minutes).
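
As a rough sketch, a RayCluster resource of this shape could be created through the Kubernetes Python client; all names, namespaces, images, and sizes below are illustrative assumptions rather than the production configuration, and the exact CRD version depends on the installed KubeRay release:

```python
from kubernetes import client, config

config.load_kube_config()  # or load_incluster_config() when running inside the cluster

ray_cluster = {
    "apiVersion": "ray.io/v1",  # adjust to the installed KubeRay CRD version
    "kind": "RayCluster",
    "metadata": {"name": "attribution-ray", "namespace": "analytics"},
    "spec": {
        "headGroupSpec": {
            "rayStartParams": {"dashboard-host": "0.0.0.0"},
            "template": {"spec": {"containers": [{
                "name": "ray-head",
                "image": "rayproject/ray:2.9.0",
                "resources": {"limits": {"cpu": "4", "memory": "16Gi"}},
            }]}},
        },
        "workerGroupSpecs": [{
            "groupName": "duckdb-workers",
            "replicas": 4,
            "minReplicas": 2,
            "maxReplicas": 20,
            "rayStartParams": {},
            "template": {"spec": {"containers": [{
                "name": "ray-worker",
                "image": "rayproject/ray:2.9.0",  # in practice, an image with DuckDB pre-installed
                "resources": {"limits": {"cpu": "8", "memory": "32Gi"}},
            }]}},
        }],
    },
}

client.CustomObjectsApi().create_namespaced_custom_object(
    group="ray.io", version="v1", namespace="analytics",
    plural="rayclusters", body=ray_cluster,
)
```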

DuckDB Integration: DuckDB is packaged as a Python dependency inside the Ray worker image. Each Ray task/actor initializes an in-process DuckDB instance, pulls the required Parquet partition from Ceph S3, caches it locally, and executes the query using DuckDB's vectorized engine.
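
A simplified version of that fetch-and-cache step, assuming boto3 against the Ceph S3 gateway (bucket, key, endpoint, and cache directory are placeholders):

```python
import os
import boto3
import duckdb

CACHE_DIR = "/tmp/parquet-cache"

def query_cached_partition(bucket, key, sql_template):
    # Pull the Parquet partition from the Ceph S3 gateway once and reuse the
    # local copy for later tasks scheduled onto the same worker.
    local_path = os.path.join(CACHE_DIR, key.replace("/", "_"))
    if not os.path.exists(local_path):
        os.makedirs(CACHE_DIR, exist_ok=True)
        s3 = boto3.client("s3", endpoint_url="http://ceph-rgw.internal:7480")
        s3.download_file(bucket, key, local_path)
    return duckdb.connect().execute(sql_template.format(path=local_path)).df()

# Example usage with placeholder bucket/key and a simple count query:
# df = query_cached_partition(
#     "attribution", "dt=2024-05-01/part-000.parquet",
#     "SELECT count(*) AS n FROM read_parquet('{path}')",
# )
```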

Network & Access: Head and worker pods communicate via the Kubernetes pod network. External access to the Ray dashboard and client ports is exposed through a Kubernetes Service and a load balancer.

Data Preparation

ETL pipelines export cleaned attribution data to columnar Parquet files daily. Files are partitioned by business‑relevant keys (e.g., date) to enable partition pruning. RowGroup sizes are tuned to balance I/O efficiency and memory usage.
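
A daily export step might look roughly like the following sketch; the columns, output path, and row-group size are assumptions to be tuned per workload:

```python
import os
import pandas as pd
import pyarrow as pa
import pyarrow.parquet as pq

# Toy stand-in for one day's cleaned attribution output from the ETL job.
df = pd.DataFrame({
    "channel": ["app", "web", "app"],
    "city": ["SHA", "BJS", "SHA"],
    "order_amount": [120.0, 80.5, 43.2],
})

# One directory per date so queries can prune whole partitions, plus an
# explicit row-group size to balance scan I/O against worker memory.
out_dir = "attribution/dt=2024-05-01"
os.makedirs(out_dir, exist_ok=True)
pq.write_table(
    pa.Table.from_pandas(df),
    os.path.join(out_dir, "part-000.parquet"),
    row_group_size=512_000,
)
```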

Task Decomposition & Execution

When a request arrives, the application assembles an execution tree from the user‑specified dimensions. Ray parses the high‑level SQL, identifies join/union keys, and splits the work into independent micro‑tasks. The Ray scheduler assigns tasks to workers based on current CPU/memory availability. Each worker loads its Parquet slice, creates DuckDB views to push down filters, and runs the vectorized query locally.

After all micro‑tasks finish, Ray aggregates the small intermediate results on the driver node using simple Python logic.
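
That final merge can be as simple as re-aggregating the partial sums by key, for example (column names assumed from the earlier sketches):

```python
import pandas as pd

def merge_partials(partials):
    # Partial sums/counts from each partition are tiny compared with the raw
    # scan, so a plain pandas re-aggregation on the driver is enough.
    combined = pd.concat(partials, ignore_index=True)
    return combined.groupby("channel", as_index=False)[["gmv", "orders"]].sum()

# merged = merge_partials(ray.get(result_refs))  # result_refs from the micro-tasks
```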

Hardware Benchmark

Configuration:

Kunpeng ARM: 2 nodes × 160 logical cores (80 physical cores each).

x86: 5 nodes × 64 logical cores (32 physical cores each).

Performance (dimension attribution task):

ClickHouse: 67 s.

Ray on x86: 33 s (CPU util 64 %).

Ray on ARM: 25 s (CPU util 57 %).

Future Directions

Planned work includes evaluating the distributed DuckDB project smallpond and deeper integration of machine‑learning models (e.g., XGBoost on Ray) into the attribution pipeline to further automate decision‑support workflows.

Tags: Performance Optimization, Big Data, Kubernetes, Distributed Computing, Ray, Attribution Analysis, DuckDB
Written by Ctrip Technology

Official Ctrip Technology account, sharing and discussing growth.
