An Overview of RisingWave: Design, Architecture, and Use Cases of an Open‑Source Distributed Streaming SQL Database
RisingWave is an open‑source distributed streaming SQL database that uses SQL to define tables and materialized views, offering low‑latency incremental query processing, a scalable architecture with separate compute and storage, robust consistency guarantees, and real‑time analytics demonstrated through several practical use cases.
RisingWave is an open‑source distributed streaming SQL database from RisingWave Labs. It provides a SQL interface for defining tables, materialized views, and other streaming computation tasks, with the goal of simplifying technology selection and easing long‑term system evolution.
What is a Streaming Database?
Traditional databases process queries on demand, while streaming databases continuously maintain query results as data arrives, dramatically reducing query latency for complex joins and aggregations.
Core Concepts
Streaming databases treat a stream as the change log of a regular table (stream‑table duality). RisingWave supports true tables stored physically, allowing standard PostgreSQL‑compatible DML (INSERT, UPDATE, DELETE). Tables can have optional connectors (e.g., Debezium for MySQL) to ingest upstream change logs.
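As a sketch of the two table flavors (table names, topic, and broker address are hypothetical, and exact connector options vary by RisingWave version):

```sql
-- A plain table with physical storage, accepting standard DML:
CREATE TABLE orders (
    order_id   BIGINT PRIMARY KEY,
    user_id    BIGINT,
    amount     DECIMAL,
    created_at TIMESTAMP
);
INSERT INTO orders VALUES (1, 42, 99.50, '2023-01-01 00:00:00');

-- A table fed by an upstream change log, here a hypothetical Kafka
-- topic carrying Debezium-formatted CDC events from MySQL:
CREATE TABLE users (
    user_id BIGINT PRIMARY KEY,
    name    VARCHAR
) WITH (
    connector = 'kafka',
    topic = 'mysql.mydb.users',
    properties.bootstrap.server = 'broker:9092'
) FORMAT DEBEZIUM ENCODE JSON;
```

The second table applies the upstream INSERT/UPDATE/DELETE events to its own storage, so downstream views see it exactly as they would a locally written table.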
Materialized views are defined via SQL and are dynamically updated as their base tables change. They support full SQL features such as joins, GROUP BY, window functions, and are used in benchmarks like TPC‑H and Nexmark.
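For example (hypothetical `orders` and `users` base tables), a view that joins and aggregates is declared once and then maintained incrementally as base data changes:

```sql
CREATE MATERIALIZED VIEW user_revenue AS
SELECT u.user_id,
       COUNT(*)      AS order_cnt,
       SUM(o.amount) AS total_amount
FROM orders o
JOIN users u ON o.user_id = u.user_id
GROUP BY u.user_id;

-- Subsequent reads return up-to-date results with no recomputation:
SELECT * FROM user_revenue WHERE total_amount > 1000;
```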
Architecture Design
The system consists of scalable Frontend nodes (handling SQL parsing, planning, and serving) and Compute nodes (executing distributed streaming pipelines). A separate Connector node handles external systems such as Kafka or JDBC sources via RPC.
State is stored in a shared object store (e.g., AWS S3, MinIO, HDFS). Compute nodes treat their in‑memory state as a cache; on restart they fetch missing data from the object store, enabling fast recovery.
Compaction is offloaded to a dedicated cluster, allowing cheap resources (spot instances or FaaS) to handle LSM‑tree maintenance without impacting query latency.
The SQL workflow runs parser → AST → logical plan → optimizer → physical plan → fragmentation → scheduler → distributed execution on Compute nodes. The optimizer applies rules such as join reordering and two‑phase aggregation, and chooses appropriate physical operators (e.g., hash join).
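To see what the optimizer produces, EXPLAIN on a view definition prints the streaming plan without creating anything (hypothetical table `t`; the plan's textual format varies across versions, and a GROUP BY typically appears as a two‑phase streaming aggregation):

```sql
-- Show the distributed streaming plan for a candidate view:
EXPLAIN CREATE MATERIALIZED VIEW mv AS
SELECT k, COUNT(*) AS cnt
FROM t
GROUP BY k;
```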
Execution is built on async Rust, leveraging zero‑cost async for IO‑heavy streaming workloads. Operators form a DAG; each actor combines multiple operators and pipelines data across partitions.
RisingWave uses a high‑frequency (default 1 s) Chandy‑Lamport checkpointing algorithm, providing exactly‑once semantics and serving read requests from consistent snapshots.
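The checkpoint interval is a cluster‑wide setting; in recent RisingWave versions it is exposed as the `barrier_interval_ms` system parameter (the parameter name and default may differ in your version):

```sql
-- List system parameters, then change the checkpoint (barrier)
-- interval from the 1000 ms default to 2 s:
SHOW PARAMETERS;
ALTER SYSTEM SET barrier_interval_ms = 2000;
```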
Features and Performance
Separation of storage and compute with cache‑miss recovery for fast node restarts.
PostgreSQL‑compatible SQL backed by a full‑featured optimizer, making migration from Flink SQL or other SQL‑based pipelines straightforward.
Multi‑way join support with join‑ordering optimizations to keep latency low.
Strong consistency guarantees via snapshot reads, atomic writes, backfill for materialized view creation, and frequent checkpointing.
Benchmarks against Flink 1.16 on Nexmark queries show performance ranging from comparable to up to 10× faster, especially for state‑heavy workloads.
Use‑Case Highlights
Real‑time analytics with complex joins and aggregations displayed on live dashboards.
Monitoring and alerting pipelines that ingest events from message queues and emit alerts to downstream systems.
ETL scenarios where RisingWave performs multi‑way joins before forwarding results to data warehouses or lakes.
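A minimal monitoring‑style pipeline under these assumptions (hypothetical topics, broker address, and threshold; connector options vary by version) could look like:

```sql
-- Ingest raw events from a message queue:
CREATE SOURCE sensor_events (
    device_id   VARCHAR,
    temperature DOUBLE,
    ts          TIMESTAMP
) WITH (
    connector = 'kafka',
    topic = 'sensor-events',
    properties.bootstrap.server = 'broker:9092'
) FORMAT PLAIN ENCODE JSON;

-- Apply an SQL-defined rule:
CREATE MATERIALIZED VIEW overheating AS
SELECT device_id, temperature, ts
FROM sensor_events
WHERE temperature > 90;

-- Emit matches to a downstream system:
CREATE SINK overheating_alerts FROM overheating
WITH (
    connector = 'kafka',
    topic = 'alerts',
    properties.bootstrap.server = 'broker:9092'
) FORMAT PLAIN ENCODE JSON;
```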
Case #1: Replacing a Flink + columnar‑store stack in a warehouse‑logistics system with RisingWave, reducing latency and operational cost.
Case #2: A Web3 monitoring application that ingests blockchain events via Kinesis, applies SQL‑defined rules, and pushes alerts to Kafka, with optional UDF Server (Python/Java) for complex logic.
Q&A
Q1: Differences from Materialize – Materialize runs entirely in memory on a single node and lacks checkpointing; RisingWave is distributed, uses async Rust, and checkpoints frequently.
Q2: Remote write overhead – Writes occur only during checkpoints (default 1 s), producing small SST files to S3; reads may cause cache misses but most workloads exhibit good temporal locality.
Q3: Ad‑hoc query performance – Not the primary focus; RisingWave is row‑store oriented for streaming, so ad‑hoc queries are acceptable but not as fast as columnar OLAP systems.
Q4: Indexes on materialized views – Supported; indexes are essentially materialized views themselves.
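As a sketch (view and column names are hypothetical), such an index uses ordinary DDL, and per the answer above is maintained incrementally like any other view:

```sql
-- Index a non-key column of an existing materialized view mv;
-- point lookups on user_id can then avoid a full scan.
CREATE INDEX idx_mv_user_id ON mv (user_id);
```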
Q5‑Q6: UDF Server design – Allows heavy, stateful functions (e.g., ML models) to run outside the kernel, mitigating stability risks and enabling batch RPC calls.
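For illustration, an external UDF is registered by pointing DDL at a running UDF server; the function name and address below are hypothetical, and the exact DDL has changed across RisingWave versions:

```sql
-- Register a function served by an external Python/Java UDF server;
-- the kernel invokes it over (batched) RPC, keeping heavy or stateful
-- logic out of the database process.
CREATE FUNCTION risk_score(VARCHAR) RETURNS DOUBLE
AS risk_score USING LINK 'http://localhost:8815';
```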
Q7: Sink binding – Default 1 s writes for fast sinks like MySQL; configurable batching for higher‑latency sinks such as Delta Lake.
Q8: Comparison with distributed in‑memory databases – Those target OLTP workloads; RisingWave focuses on real‑time materialized view maintenance for streaming analytics.
Overall, RisingWave combines the familiarity of SQL with the performance characteristics of modern streaming systems, offering a robust platform for real‑time data processing.
DataFunSummit