
Databend: A Cloud‑Native Modern Data Warehouse Architecture

This article explains how Databend, a cloud‑native OLAP data warehouse, addresses modern data‑warehouse challenges by separating storage and compute, providing elastic scaling, multi‑cloud support, and efficient query planning and execution to deliver low‑cost, on‑demand analytics.

DataFunTalk

Databend’s name originates from the concept of “Time Bend” in relativity, reflecting its goal of reshaping how users perceive and extract value from data.

Traditional data‑warehouse architectures struggle to meet current demands, while modern cloud‑native warehouses aim to eliminate hardware management, software configuration, and resource provisioning complexities, offering elastic, pay‑as‑you‑go scaling.

Key requirements of a modern cloud‑native warehouse include: no hardware management, no software configuration, no performance degradation as data grows, elastic scaling within seconds, and paying only for the resources actually used.

Databend satisfies these needs through a fully cloud‑native design that separates state, storage, and compute, delivering lower cost, ease of use, and consumption‑based billing.

Databend Architecture

The system consists of four layers:

Data Ingestion: supports SQL sources such as MySQL and ClickHouse, allowing easy integration with existing ecosystems.

Meta Service: a multi‑tenant, transactional key‑value store that holds metadata and authentication information and acts as a namespace registry.

Compute Layer: each compute node parses SQL, generates logical and physical plans, and executes queries with its own cache and indexes, forming a scalable cluster.

Storage Layer: relies on cloud shared storage (e.g., S3, Azure Blob) and uses columnar formats like Parquet with MinMax and sparse indexes.

Meta Service Details

Meta Service stores schema information and user authentication data and acts as a lightweight namespace service. The actual table data lives outside this layer, in the storage layer.
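To make the idea of a multi‑tenant, transactional key‑value store more concrete, here is a minimal in‑memory sketch. It is not Databend's implementation: the `MetaStore` class, the `tenant/kind/name` key layout, the version‑based compare‑and‑swap, and the column type labels are all assumptions chosen for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class MetaStore:
    """Toy in-memory stand-in for a multi-tenant, transactional key-value store."""
    _data: dict = field(default_factory=dict)   # key -> (version, value)

    @staticmethod
    def key(tenant: str, kind: str, name: str) -> str:
        # Hypothetical key layout: the tenant prefix keeps namespaces isolated.
        return f"{tenant}/{kind}/{name}"

    def get(self, key: str):
        # Returns (version, value); version 0 means the key is absent.
        return self._data.get(key, (0, None))

    def put_if_version(self, key: str, expected_version: int, value) -> bool:
        # Compare-and-swap: the write succeeds only if the key has not
        # changed since the caller last read it.
        version, _ = self.get(key)
        if version != expected_version:
            return False
        self._data[key] = (version + 1, value)
        return True

    def list_names(self, tenant: str, kind: str) -> list:
        # Namespace lookup: everything registered under one tenant and kind.
        prefix = f"{tenant}/{kind}/"
        return sorted(k[len(prefix):] for k in self._data if k.startswith(prefix))


meta = MetaStore()
k = MetaStore.key("tenant_a", "table", "class")
created = meta.put_if_version(k, 0, {"columns": {"age": "Int32", "city": "String"}})
duplicate = meta.put_if_version(k, 0, {"columns": {}})   # stale version, rejected
meta.put_if_version(MetaStore.key("tenant_b", "table", "orders"), 0, {"columns": {}})
```

Only schema‑like metadata passes through this store; the rows of `class` would live in the storage layer, which is what keeps the service lightweight.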

Query Planning and Execution

The planner parses SQL into a logical plan, optimizes it, and transforms it into a physical pipeline: a directed acyclic graph of processing nodes connected by channels.

Example execution plan:

EXPLAIN PIPELINE SELECT avg(age) FROM class WHERE age > 13 GROUP BY city;

The physical pipeline shows parallelism (e.g., 8 CPUs) for filtering, partial aggregation, and final merging, emphasizing connectivity and parallel execution.
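The shape of that pipeline (parallel filter and partial aggregation, followed by a final merge) can be sketched in a few lines of Python. This is a simplified model, not Databend's executor: the sample rows, the function names, and the use of a thread pool as the parallel lanes are all assumptions.

```python
from collections import defaultdict
from concurrent.futures import ThreadPoolExecutor

# Hypothetical rows of the `class` table: (city, age).
ROWS = [
    ("Beijing", 12), ("Beijing", 14), ("Beijing", 16),
    ("Shanghai", 15), ("Shanghai", 13), ("Shanghai", 17),
    ("Shenzhen", 18), ("Shenzhen", 11),
]


def filter_and_partial_aggregate(chunk):
    """One parallel lane: WHERE age > 13, then per-city [sum, count]."""
    partial = defaultdict(lambda: [0, 0])
    for city, age in chunk:
        if age > 13:
            partial[city][0] += age
            partial[city][1] += 1
    return dict(partial)


def final_merge(partials):
    """Final stage: combine partial states and finish avg = sum / count."""
    merged = defaultdict(lambda: [0, 0])
    for partial in partials:
        for city, (total, count) in partial.items():
            merged[city][0] += total
            merged[city][1] += count
    return {city: total / count for city, (total, count) in merged.items()}


def run_pipeline(rows, lanes=8):
    # Round-robin split so each lane processes its own slice of the input.
    chunks = [rows[i::lanes] for i in range(lanes)]
    with ThreadPoolExecutor(max_workers=lanes) as pool:
        partials = list(pool.map(filter_and_partial_aggregate, chunks))
    return final_merge(partials)


result = run_pipeline(ROWS)
```

Keeping `[sum, count]` as the partial state, rather than a per‑lane average, is what lets the final stage merge lanes correctly regardless of how rows were split.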

Task Scheduling

Databend uses a pull‑based scheduler with work‑stealing: each node and CPU fetches tasks from a global queue, stealing work from others when idle, ensuring high resource utilization.
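A pull‑based scheduler with work‑stealing can be modeled as follows. This is a deterministic, single‑threaded simulation meant only to show the order of preference (local work, then the global queue, then stealing). The `Worker` class, the round‑based loop, and the "steal from the most loaded peer" policy are assumptions, not Databend's actual scheduler.

```python
from collections import deque


class Worker:
    def __init__(self, name):
        self.name = name
        self.local = deque()   # tasks this worker already owns
        self.done = []         # task ids this worker executed

    def next_task(self, global_queue, peers):
        # 1. Prefer local work.
        if self.local:
            return self.local.popleft()
        # 2. Otherwise pull from the shared global queue.
        if global_queue:
            return global_queue.popleft()
        # 3. Otherwise steal from the back of the most loaded peer.
        victim = max(peers, key=lambda w: len(w.local), default=None)
        if victim is not None and victim.local:
            return victim.local.pop()
        return None


def run(workers, global_queue):
    # Round-based simulation: every worker tries to take one task per round.
    while True:
        progressed = False
        for worker in workers:
            peers = [w for w in workers if w is not worker]
            task = worker.next_task(global_queue, peers)
            if task is not None:
                worker.done.append(task)
                progressed = True
        if not progressed:
            return


workers = [Worker("cpu0"), Worker("cpu1"), Worker("cpu2")]
workers[0].local.extend(["t0", "t1", "t2", "t3"])   # cpu0 starts overloaded
shared = deque(["g0", "g1"])
run(workers, shared)
```

In this run the idle workers first drain the global queue and then take tasks from the overloaded `cpu0`, so the six tasks end up evenly spread instead of being executed serially by one worker.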

Cluster Communication

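Instead of plain gRPC with generic message serialization, Databend adopts Arrow Flight RPC. Arrow Flight is itself built on gRPC, but it transfers data as Arrow columnar buffers in their in‑memory layout, which avoids costly serialization/deserialization.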
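The benefit of shipping columnar buffers can be illustrated with the Python standard library alone. The snippet below does not use the Arrow Flight API. It only contrasts a row‑oriented payload, which has to be encoded and parsed value by value, with a columnar payload whose bytes the receiver can reinterpret directly. The `ages` column is a made‑up example.

```python
import json
from array import array

# One column stored as a contiguous buffer of signed 64-bit integers.
ages = array("q", [14, 16, 15, 17, 18])

# Row-oriented path: every value is encoded to text and parsed back.
row_payload = json.dumps([{"age": a} for a in ages]).encode()
ages_from_rows = [row["age"] for row in json.loads(row_payload)]

# Columnar path: the buffer's raw bytes serve as the wire format.
column_payload = ages.tobytes()

# The receiver reinterprets the bytes as int64 values without parsing them.
ages_view = memoryview(column_payload).cast("q")
```

The columnar payload here is exactly `len(ages) * ages.itemsize` bytes and needs no per‑value decoding, which is the property the article attributes to Arrow Flight.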

Storage Layer Optimizations

Columnar storage with Parquet format.

MinMax and sparse indexes to prune data early.

Unified multi‑cloud view enabling transparent cross‑cloud storage access.

Automatic data clustering to accelerate hot‑data queries.

When a hot data range is frequently accessed, Databend merges the relevant files so that fewer overlapping files have to be read for that range (a lower read depth), which improves query speed.
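The interaction between MinMax pruning and clustering can be sketched as below. The `DataFile` type, the sample key values, and the "sort and split into fixed‑size files" rewrite strategy are assumptions for illustration. The article only states that relevant files are merged.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DataFile:
    name: str
    keys: tuple          # values of the clustering column stored in this file

    @property
    def min_key(self):
        return min(self.keys)

    @property
    def max_key(self):
        return max(self.keys)


def prune(files, lo, hi):
    """MinMax pruning: skip files whose [min, max] cannot overlap [lo, hi]."""
    return [f for f in files if f.max_key >= lo and f.min_key <= hi]


def recluster(files, rows_per_file):
    """Merge files, sort by key, and rewrite them as non-overlapping files."""
    keys = sorted(k for f in files for k in f.keys)
    return [
        DataFile(f"merged_{i // rows_per_file}", tuple(keys[i:i + rows_per_file]))
        for i in range(0, len(keys), rows_per_file)
    ]


files = [
    DataFile("f1", (1, 35, 69)),
    DataFile("f2", (10, 38, 60)),
    DataFile("f3", (20, 31, 70)),
    DataFile("f4", (100, 120, 150)),
]

hot_lo, hot_hi = 30, 40                      # frequently queried key range
before = prune(files, hot_lo, hot_hi)        # three wide, overlapping files
untouched = [f for f in files if f not in before]
after = prune(recluster(before, rows_per_file=3) + untouched, hot_lo, hot_hi)
```

Before reclustering, the hot range overlaps three files because each one spans a wide key interval. After the rewrite, the same range falls inside a single file, so the MinMax index prunes the rest.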

Future Plans

Databend, open‑sourced in March 2021, is in early development with an upcoming Alpha release. The roadmap includes continued performance and elasticity improvements, multi‑cloud deployment, and a move toward a serverless model where compute nodes are launched on demand and billed per usage.

For more details, visit the GitHub repository: https://github.com/datafuselabs/databend

Join the DataFunTalk community for further discussions and updates.
