In‑Depth Analysis of Rockset’s Cloud‑Native Real‑Time Analytics Architecture

This article examines Rockset’s cloud‑native real‑time analytics database, detailing its document‑oriented data model, RocksDB‑Cloud storage engine, compute‑storage separation, sharding, converged indexing, query processing pipeline, and the implications of OpenAI’s recent acquisition for the broader database ecosystem.

AntData

The author, Liu Jiacai, a technical expert at Ant Group and core developer of Apache HoraeDB, introduces Rockset—a leading real‑time analytics database that OpenAI recently acquired.

Rockset is positioned as a cloud‑native search and analytics database offering real‑time indexing and full‑featured SQL on JSON, time‑series, geospatial, and vector data, with a typical usage pattern illustrated in the original whitepaper.

Unlike traditional relational databases, Rockset stores data in a schemaless document model, allowing flexible ingestion of semi‑structured formats (JSON, Avro, Parquet) while still supporting SQL queries and joins.

Its cloud-native architecture rests on three ideas. Shared storage (e.g., S3) separates compute from storage. Durability is decoupled from performance: inexpensive object storage provides durability, while faster local storage serves reads. Storage is tiered, with hot data on SSD and cold data in object storage.
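
The tiering idea can be illustrated with a toy read-through cache. This is only a sketch under stated assumptions: `TieredStore`, its LRU eviction policy, and the in-memory dictionaries standing in for SSD and object storage are all hypothetical and are not taken from Rockset or RocksDB-Cloud.

```python
from collections import OrderedDict


class TieredStore:
    """Toy two-tier store: a small hot tier (SSD stand-in) over a durable cold tier."""

    def __init__(self, hot_capacity: int):
        self.hot_capacity = hot_capacity
        self.hot = OrderedDict()   # stand-in for local SSD, kept in LRU order
        self.cold = {}             # stand-in for object storage such as S3
        self.cold_reads = 0        # how often a read fell through to the cold tier

    def put(self, key, value):
        # Durability comes from the cold tier, so every write lands there.
        self.cold[key] = value
        self._cache(key, value)

    def get(self, key):
        if key in self.hot:
            self.hot.move_to_end(key)   # mark as most recently used
            return self.hot[key]
        self.cold_reads += 1
        value = self.cold[key]          # raises KeyError if the key was never written
        self._cache(key, value)
        return value

    def _cache(self, key, value):
        self.hot[key] = value
        self.hot.move_to_end(key)
        while len(self.hot) > self.hot_capacity:
            self.hot.popitem(last=False)  # evict the least recently used entry


store = TieredStore(hot_capacity=2)
for name in ("a", "b", "c"):
    store.put(name, name.upper())
```

Losing the hot tier in this sketch loses no data, only read speed, which is the point of decoupling durability from performance.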

RocksDB-Cloud, an embedded persistent storage engine derived from RocksDB, provides zero-copy clones for replication, automatic upload of SST files to S3, and offloading of compaction work to stateless servers, which together enable elastic scaling.

Rockset further separates compute from storage and even compute from compute: write‑path and query‑path modules run independently, allowing multiple compute nodes to read the same shared data with minimal latency.

Data is horizontally partitioned into "microshards"; each microshard maps to a RocksDB instance, facilitating parallel scans and replication for high availability.
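
A minimal sketch of how documents might be routed to microshards and how microshards might be placed on nodes. The hash function, the shard count of 64, the node names, and the round-robin placement are illustrative assumptions, not Rockset's actual scheme.

```python
import hashlib

NUM_MICROSHARDS = 64  # hypothetical; assumed fixed for the lifetime of a collection


def microshard_for(doc_id: str) -> int:
    """Map a document id to a microshard with a stable hash."""
    digest = hashlib.sha256(doc_id.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % NUM_MICROSHARDS


def assign_microshards(nodes, replicas=2):
    """Round-robin placement: each microshard gets `replicas` distinct nodes.

    Assumes replicas <= len(nodes); otherwise replicas would share a node.
    """
    placement = {}
    for shard in range(NUM_MICROSHARDS):
        placement[shard] = [nodes[(shard + r) % len(nodes)] for r in range(replicas)]
    return placement


nodes = ["leaf-0", "leaf-1", "leaf-2", "leaf-3"]  # hypothetical node names
placement = assign_microshards(nodes)
shard = microshard_for("user-42")
owners = placement[shard]
```

Because the unit of placement is the microshard rather than the document, rebalancing in this sketch means moving whole microshards between nodes without rehashing any document.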

The system employs a converged index that combines row, column, and search indexes, automatically selecting the optimal index for each query. Example SQL queries demonstrate how the optimizer chooses column‑store or search indexes based on query predicates.
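
One way to picture a converged index is as three sets of key-value entries written for every field of every document. The sketch below assumes a simple key layout (`R` for row, `C` for column, `S` for search) and flat documents; the layout, the `index_entries` and `scan` helpers, and the sample data are hypothetical and only illustrate the idea.

```python
def index_entries(doc_id, doc):
    """Expand one flat document into row, column, and search index entries."""
    entries = {}
    for field, value in doc.items():
        entries[("R", doc_id, field)] = value        # row index: a document's fields sit together
        entries[("C", field, doc_id)] = value        # column index: a field's values sit together
        entries[("S", field, value, doc_id)] = None  # search index: (field, value) leads to doc ids
    return entries


def scan(index, prefix):
    """Return entries whose key starts with `prefix`, mimicking a prefix scan on a sorted store."""
    return {k: v for k, v in index.items() if k[:len(prefix)] == prefix}


docs = {
    "d1": {"keyword": "rockset", "locale": "en"},
    "d2": {"keyword": "rocksdb", "locale": "en"},
    "d3": {"keyword": "rockset", "locale": "de"},
}
index = {}
for doc_id, doc in docs.items():
    index.update(index_entries(doc_id, doc))

# Aggregation over one field: read only the column entries for "keyword".
keywords = list(scan(index, ("C", "keyword")).values())

# Selective equality predicates: intersect two search-index lookups.
hits_keyword = {k[3] for k in scan(index, ("S", "keyword", "rockset"))}
hits_locale = {k[3] for k in scan(index, ("S", "locale", "en"))}
matches = hits_keyword & hits_locale

# Fetching a whole document: read its row entries.
row_d1 = scan(index, ("R", "d1"))
```

The trade-off this sketch makes visible is write amplification: every field is written three times so that each query shape can read a contiguous key range.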

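Real-time indexing is achieved through a covering index with delta and dictionary (Zstandard) compression, plus Bloom filters sized at 10 bits per key. At that size the false-positive rate is roughly 1%, so about 99% of lookups for keys that are not present can skip disk I/O.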
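
The 99% figure can be sanity-checked with the standard Bloom filter approximation, assuming that "10-bit" means 10 bits per key and that the number of hash functions is chosen near the optimum. The function names below are illustrative.

```python
import math


def bloom_false_positive_rate(bits_per_key: float, num_hashes: int) -> float:
    """Standard approximation: (1 - e^(-k / bits_per_key)) ** k."""
    return (1.0 - math.exp(-num_hashes / bits_per_key)) ** num_hashes


def optimal_num_hashes(bits_per_key: float) -> int:
    """The false-positive rate is minimized near k = bits_per_key * ln 2."""
    return max(1, round(bits_per_key * math.log(2)))


k = optimal_num_hashes(10)
fp_rate = bloom_false_positive_rate(10, k)
print(f"hashes={k}, false-positive rate={fp_rate:.4f}")
```

Under these assumptions the false-positive rate comes out a little below 1%, which is consistent with skipping about 99% of reads for keys that do not exist. Lookups for keys that do exist still have to read the data.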

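-- Aggregation over a single field: a candidate for the column index.
SELECT keyword,
    count(*) c
FROM search_logs
GROUP BY keyword
ORDER BY c DESC;

-- Selective equality predicates: a candidate for the search index.
SELECT *
FROM search_logs
WHERE keyword = 'rockset' AND locale = 'en';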

Query execution follows three classic stages—planning, optimization, and execution—handled by aggregators that route sub‑queries to appropriate data partitions and combine results before returning them to the API server.
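
The fan-out and merge step can be sketched for the GROUP BY keyword query shown earlier. This is a simplified illustration, not Rockset's implementation: the partitions are in-memory lists, a thread pool stands in for remote calls, and `leaf_partial_counts` and `aggregate` are hypothetical names.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor


def leaf_partial_counts(rows):
    """Runs against one partition: count keywords locally."""
    return Counter(row["keyword"] for row in rows)


def aggregate(partitions, limit=None):
    """Fan the sub-query out to every partition, then merge the partial counts."""
    with ThreadPoolExecutor() as pool:
        partials = list(pool.map(leaf_partial_counts, partitions))
    total = Counter()
    for partial in partials:
        total.update(partial)  # Counter.update adds counts rather than replacing them
    # ORDER BY c DESC, with keyword as a tiebreaker so the output is deterministic.
    ranked = sorted(total.items(), key=lambda kv: (-kv[1], kv[0]))
    return ranked if limit is None else ranked[:limit]


partitions = [
    [{"keyword": "rockset"}, {"keyword": "rocksdb"}],
    [{"keyword": "rockset"}, {"keyword": "s3"}],
    [{"keyword": "rockset"}, {"keyword": "rocksdb"}],
]
result = aggregate(partitions)
```

Counts can be merged this way because addition is associative, so each partition ships only its partial counts rather than raw rows.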

The article concludes that Rockset’s cloud‑native design reflects current best practices for distributed systems, offering cost‑effective compute‑storage separation, though few projects have successfully adapted RocksDB for such environments.

Finally, the author reflects on the broader impact of OpenAI’s acquisition, suggesting that integrating structured database capabilities with vector retrieval (RAG) may become a dominant approach for large‑language‑model memory.

Tags: cloud native, real-time analytics, RocksDB, converged indexing, Rockset
Written by AntData

Ant Data leverages Ant Group's leading technological innovation in big data, databases, and multimedia, with years of industry practice. Through long-term technology planning and continuous innovation, we strive to build world-class data technology and products.
