
How ByteHouse Redefines ELT for Cloud‑Native Data Warehousing

This article explains how ByteHouse, a cloud‑native data warehouse, shifts traditional ETL to ELT, simplifying data pipelines and improving scalability. It covers advanced features such as stage‑by‑stage scheduling, adaptive resource management, and asynchronous execution, along with the roadmap for big‑data workloads.


Introduction to ETL/ELT

When discussing data warehouses, Extract‑Transform‑Load (ETL) and Extract‑Load‑Transform (ELT) are indispensable topics. Traditional ETL relies on an external processing system, which incurs high maintenance costs.

ByteHouse, a cloud‑native data warehouse from Volcano Engine, now fully supports ELT, allowing users to import data and transform it inside ByteHouse with custom SQL, eliminating the need for separate ETL systems.

ByteHouse in ByteDance

ByteDance's data volume has grown rapidly since 2017, prompting the adoption of ClickHouse for real‑time analytics. From 2018 to 2019, its ClickHouse usage expanded to BI, A/B testing, and model estimation. In 2020 ByteHouse was launched internally, and it opened to external users in 2021. By March 2022 it operated 18,000 nodes, with single clusters of up to 2,400 nodes.

Product Forms

Enterprise edition: PaaS, fully managed, tenant‑dedicated resources.

Data‑warehouse edition: SaaS, users create tables, import data, and query without operational overhead.

Small‑scale users can start with the data‑warehouse edition and migrate to the enterprise edition as scale and operational demands increase.

Application Scenarios

Data Insight: A one‑stop platform for self‑service analysis of trillion‑level data, delivering results via portals, dashboards, and management cockpits.

Growth Analysis: User behavior analysis covering data collection, behavior analysis, user profiling, and content analysis, with intelligent applications for anomaly detection, attribution, and ad strategy.

One‑Stop KPI Platform: Dongchedi, for example, uses ByteHouse to achieve sub‑second analytics for its marketing and sales models.

ETL vs ELT

ETL extracts, transforms, and loads data before it reaches the warehouse, requiring pre‑modeling. ELT loads raw data first and performs most transformations during analysis, offering flexibility and becoming the norm for big data.

Industry Solutions

Pre‑computation: Tools like Kylin pre‑aggregate data into cubes.

Streaming‑Batch Fusion: Engines such as Flink and RisingWave aggregate data in memory and write the results to storage.

Lake‑Warehouse Fusion: Combine data‑lake and data‑warehouse capabilities.

ETL in ByteHouse

ByteHouse’s ELT pipeline requires overall scalability, reliability, performance, and observability. It supports horizontal scaling, fault‑tolerant job scheduling, efficient multi‑core utilization, and integration with various tools.

Storage Serviceization

ETL results are first stored as Parquet files and served through a storage service; they are then converted into Part files and deleted after use, giving external engines unified access to ByteHouse data. For example, Spark can read a ByteHouse table through this service:

```scala
import org.apache.spark.sql.SparkSession

// Session with the CNCH auto-convert extension enabled
val spark = SparkSession.builder().appName("CNCH-Reader")
  .config("spark.sql.extensions", "CnchAutoConvertExtension")
  .enableHiveSupport().getOrCreate()

// Read Part files directly via the "CnchPart" data source...
val df = spark.read.format("CnchPart").options(Map("table" -> "cnch_db.c1")).load()

// ...or, with the extension, query the table with plain SQL
val df2 = spark.sql("select * from cnch_db.c1")
```

Overall Process

ClickHouse executes queries in two stages: the coordinator distributes sub‑queries to shards, then aggregates the results. ByteHouse improves on this by inserting exchange operators, splitting the plan into stages, and scheduling those stages separately, which reduces coordinator bottlenecks and worker out‑of‑memory (OOM) failures.
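As a minimal sketch, stage‑by‑stage scheduling can be modeled as a tiny dependency scheduler: the plan is split at exchange boundaries into stages, and a stage is dispatched only once every stage it reads from has finished. The `Stage` class and `schedule` function here are illustrative assumptions, not ByteHouse's actual classes.

```python
from dataclasses import dataclass, field

@dataclass
class Stage:
    """One fragment of a query plan, split at an exchange boundary."""
    stage_id: int
    depends_on: list = field(default_factory=list)  # upstream stage ids

def schedule(stages):
    """Run stages bottom-up: a stage starts once all of its inputs are done."""
    done, order = set(), []
    pending = {s.stage_id: s for s in stages}
    while pending:
        ready = [s for s in pending.values() if all(d in done for d in s.depends_on)]
        if not ready:
            raise RuntimeError("cycle in stage graph")
        for s in ready:
            order.append(s.stage_id)  # here the stage would be sent to workers
            done.add(s.stage_id)
            del pending[s.stage_id]
    return order

# Two scan stages (1 and 2) feed a final aggregation stage (0) via exchanges.
plan = [Stage(0, [1, 2]), Stage(1), Stage(2)]
print(schedule(plan))  # [1, 2, 0] -- leaves first, root last
```

Because each stage is dispatched on its own, a failed or memory‑heavy stage can be retried or resourced independently instead of sinking the whole query.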

Adaptive Scheduler

Worker health metrics (CPU, memory, query count) guide dynamic worker selection and concurrency control, mitigating load imbalance.
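A sketch of that selection logic, assuming a simple weighted score over the reported metrics (the weights, metric names, and concurrency limit below are illustrative, not ByteHouse's actual formula):

```python
def pick_worker(workers, max_queries=10):
    """Pick the healthiest worker from reported health metrics.

    `workers` maps a worker name to a dict with `cpu` and `mem`
    utilization (0..1) and a running `queries` count.
    """
    def load(m):
        # Lower is better: blend CPU, memory, and running-query count.
        return 0.4 * m["cpu"] + 0.4 * m["mem"] + 0.2 * m["queries"] / max_queries

    # Workers at their concurrency limit are not eligible at all.
    eligible = {w: m for w, m in workers.items() if m["queries"] < max_queries}
    if not eligible:
        return None  # caller should queue the query instead of dispatching
    return min(eligible, key=lambda w: load(eligible[w]))

metrics = {
    "worker-a": {"cpu": 0.9, "mem": 0.8, "queries": 9},
    "worker-b": {"cpu": 0.2, "mem": 0.3, "queries": 2},
}
print(pick_worker(metrics))  # worker-b
```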

Query Queue

A server‑side manager checks cluster resources before dispatching queries; if resources are insufficient, queries wait, preventing overload and crashes.
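The admission pattern can be sketched as follows; the `QueryQueue` class and its single `max_running` threshold are simplifying assumptions standing in for the real resource checks.

```python
from collections import deque

class QueryQueue:
    """Admit a query only when resources allow; otherwise make it wait."""

    def __init__(self, max_running=8):
        self.max_running = max_running  # stand-in for a real resource check
        self.running = 0
        self.waiting = deque()

    def submit(self, query):
        if self.running < self.max_running:
            self.running += 1
            return "running"            # dispatched immediately
        self.waiting.append(query)      # held back instead of overloading workers
        return "queued"

    def finish(self):
        self.running -= 1
        if self.waiting:                # promote the oldest waiting query
            self.waiting.popleft()
            self.running += 1

q = QueryQueue(max_running=1)
print(q.submit("q1"))  # running
print(q.submit("q2"))  # queued
q.finish()             # q1 done, q2 is promoted and starts
```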

Async Execution

Long‑running ELT tasks can run asynchronously by setting enable_async_query = 1. The query executes in a background thread and immediately returns an async query ID; clients poll that ID for status, so connections are not blocked for the duration of the job.
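The client‑facing pattern looks roughly like this sketch; the function names, the in‑memory status table, and the fake "execution" are assumptions for illustration, not the actual ByteHouse protocol.

```python
import threading
import time
import uuid

_results = {}  # async query id -> {"status": ..., "rows": ...}

def submit_async(sql):
    """Run `sql` in a background thread and return an async query id."""
    qid = uuid.uuid4().hex
    _results[qid] = {"status": "running", "rows": None}

    def run():
        rows = len(sql)  # stand-in for real query execution
        _results[qid] = {"status": "finished", "rows": rows}

    threading.Thread(target=run, daemon=True).start()
    return qid           # the connection is free again immediately

def poll(qid):
    return _results[qid]["status"]

qid = submit_async("INSERT INTO dst SELECT * FROM src")
while poll(qid) == "running":  # client polls instead of holding a connection
    time.sleep(0.01)
print(poll(qid))  # finished
```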

Future Roadmap

Import Optimization: Spark part writer execution inside the domain, fine‑grained transaction handling, and lock improvements.

Fault Recovery: Operator spill, exchange spill, stage‑level retries, query state persistence, and remote shuffle service support.

Resource Management: User‑specified compute resources, dynamic estimation and reservation, on‑demand resource allocation, and finer isolation.

Ecosystem: Support for more ETL orchestration tools (dbt, Airflow, Kettle, Dolphin, SeaTunnel) and data‑lake formats (Hudi, Iceberg) with JNI readers.

Overall, ByteHouse aims to provide a highly scalable, efficient, and flexible ELT platform for modern big‑data analytics.

Tags: Data Warehouse, SQL Optimization, ByteHouse, ELT
Written by

Volcano Engine Developer Services

The Volcano Engine Developer Community, Volcano Engine's TOD community, connects the platform with developers: it offers cutting‑edge technical content and diverse events, nurtures a vibrant developer culture, and helps build an open‑source ecosystem.
