Databend: A Cloud‑Native Modern Data Warehouse Architecture
This article explains how Databend, a cloud‑native OLAP data warehouse, addresses modern data‑warehouse challenges by separating storage and compute, providing elastic scaling, multi‑cloud support, and efficient query planning and execution to deliver low‑cost, on‑demand analytics.
Databend’s name originates from the concept of “Time Bend” in relativity, reflecting its goal of reshaping how users perceive and extract value from data.
Traditional data‑warehouse architectures struggle to meet current demands, while modern cloud‑native warehouses aim to eliminate hardware management, software configuration, and resource provisioning complexities, offering elastic, pay‑as‑you‑go scaling.
Key requirements of a modern cloud‑native warehouse include: no hardware management, no software configuration, no performance degradation with data growth, second‑level elastic scaling, and pay‑only‑for‑used resources.
Databend satisfies these needs through a fully cloud‑native design that separates state, storage, and compute, delivering lower cost, ease of use, and consumption‑based billing.
Databend Architecture
The system consists of four layers:
Data Ingestion: supports SQL sources such as MySQL and ClickHouse, allowing easy integration with existing ecosystems.
Meta Service: a multi‑tenant, transactional key‑value store that holds metadata, authentication information, and acts as a namespace registry.
Compute Layer: each compute node parses SQL, generates logical and physical plans, and executes queries with its own cache and indexes, forming a scalable cluster.
Storage Layer: relies on cloud shared storage (e.g., S3, Azure Blob) and uses columnar formats like Parquet with MinMax and sparse indexes.
Meta Service Details
Meta Service stores schema information, user authentication, and acts as a lightweight namespace service, while actual data remains outside this layer.
Query Planning and Execution
Logical planning parses SQL into a logical plan, which is then optimized and transformed into a physical pipeline represented as a directed acyclic graph of compute nodes and channels.
Example execution plan:
explain pipeline SELECT avg(age) FROM class WHERE age > 13 GROUP BY cityThe physical pipeline shows parallelism (e.g., 8 CPUs) for filtering, partial aggregation, and final merging, emphasizing connectivity and parallel execution.
Task Scheduling
Databend uses a pull‑based scheduler with work‑stealing: each node and CPU fetches tasks from a global queue, stealing work from others when idle, ensuring high resource utilization.
Cluster Communication
Instead of traditional gRPC, Databend adopts Arrow Flight RPC to avoid costly serialization/deserialization.
Storage Layer Optimizations
Columnar storage with Parquet format.
MinMax and sparse indexes to prune data early.
Unified multi‑cloud view enabling transparent cross‑cloud storage access.
Automatic data clustering to accelerate hot‑data queries.
When a hot data range is frequently accessed, Databend merges relevant files to reduce read depth and improve query speed.
Future Plans
Databend, open‑sourced in March 2021, is in early development with an upcoming Alpha release. The roadmap includes continued performance and elasticity improvements, multi‑cloud deployment, and a move toward a serverless model where compute nodes are launched on demand and billed per usage.
For more details, visit the GitHub repository: https://github.com/datafuselabs/databend
Join the DataFunTalk community for further discussions and updates.
DataFunTalk
Dedicated to sharing and discussing big data and AI technology applications, aiming to empower a million data scientists. Regularly hosts live tech talks and curates articles on big data, recommendation/search algorithms, advertising algorithms, NLP, intelligent risk control, autonomous driving, and machine learning/deep learning.
How this landed with the community
Was this worth your time?
0 Comments
Thoughtful readers leave field notes, pushback, and hard-won operational detail here.