
Dolphin: Alibaba's Hyper‑Converged Multi‑Modal Big Data Engine Overview

Dolphin, Alibaba’s hyper‑converged multi‑modal big‑data engine, unifies OLAP, AI, streaming, and batch workloads on a decoupled compute‑storage MPP foundation. It offers a Dolphin SQL layer, advanced bitmap/GroupTable/AFile indexes, intelligent materialization, and one‑write‑multiple‑read storage, cutting storage costs by more than 70% while delivering sub‑millisecond queries on trillion‑row datasets.

Alimama Tech

Background – To improve usability and reduce cost, big‑data technology is moving toward serverless, integrated, and intelligent architectures. Alibaba’s self‑developed hyper‑converged multi‑modal engine, Dolphin, originated to solve performance issues in general OLAP for large‑scale audience‑targeting scenarios and has evolved over five years to cover OLAP, AI, Streaming, and Batch workloads.

Core Engine Capabilities – Dolphin is built on an MPP foundation and provides five key modules: a self‑developed compute‑storage decoupled engine, a Dolphin SQL engine, an Index Build engine, intelligent computing, and one‑write‑multiple‑read storage. The engine supports bitmap, GroupTable, and AFile indexes, vector‑based recall, model inference, and high‑performance real‑time writes.

Dolphin SQL Engine – Translates business SQL into physical execution SQL, offering translation, plan optimization, load balancing, materialization, and federated query capabilities. It includes Dolphin JDBC, a fastsql‑based parser, a translation framework with rule‑based CBO, and a scheduler that dispatches operators to appropriate execution engines.
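The translation-and-dispatch flow can be sketched as a small rule-based router. This is an illustrative assumption, not Dolphin's actual code: the rules, the `predict(...)` UDF, and the engine names stand in for the real translation framework and scheduler described above.

```python
# Hypothetical sketch of rule-based SQL routing; all names are assumptions,
# not Dolphin internals.
from dataclasses import dataclass


@dataclass
class QueryPlan:
    sql: str
    engine: str  # which execution engine the scheduler picked


def route_query(business_sql: str) -> QueryPlan:
    """Translate business SQL into a physical plan and pick an engine.

    Illustrative stand-ins for the optimizer's rules: model-scoring
    calls go to the AI engine, bitmap-friendly set-membership filters
    go to the OLAP engine, everything else to a generic MPP engine.
    """
    sql = business_sql.strip().rstrip(";")
    lowered = sql.lower()
    if "predict(" in lowered:           # hypothetical model-scoring UDF
        return QueryPlan(sql, engine="ai")
    if "audience_id in" in lowered:     # set membership -> bitmap index path
        return QueryPlan(sql, engine="olap-bitmap")
    return QueryPlan(sql, engine="mpp")
```

In a real system the routing decision would come from cost estimates and table statistics rather than string matching, but the shape is the same: one logical SQL in, one engine-specific physical plan out.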

Index Build Engine – Provides bitmap, GroupTable, and AFile indexes to accelerate massive audience‑targeting queries, achieving up to 50× data compression and 15× performance gains for bitmap, 60% storage savings and 30× speedup for GroupTable, and 10‑100× acceleration for AFile‑based joins.
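To see why bitmap indexes pay off for audience targeting, consider this toy sketch (illustrative only, not Dolphin's on-disk format): each audience tag maps to a bitmap with one bit per user id, so combining tags becomes cheap bitwise arithmetic instead of row scans.

```python
# Toy bitmap index for audience targeting (an assumption for illustration,
# not Dolphin's actual index format). Python ints serve as bitsets.
class BitmapIndex:
    def __init__(self):
        self.tags = {}  # tag name -> int used as a bitset

    def add(self, tag: str, user_id: int) -> None:
        # Set the bit at position `user_id` in this tag's bitmap.
        self.tags[tag] = self.tags.get(tag, 0) | (1 << user_id)

    def intersect(self, *tags: str) -> list[int]:
        # AND the bitmaps together, then enumerate the surviving user ids.
        bits = self.tags.get(tags[0], 0)
        for t in tags[1:]:
            bits &= self.tags.get(t, 0)
        return [i for i in range(bits.bit_length()) if bits >> i & 1]


idx = BitmapIndex()
for uid in (1, 3, 5, 7):
    idx.add("clicked_ad", uid)
for uid in (3, 4, 5):
    idx.add("age_18_25", uid)
print(idx.intersect("clicked_ad", "age_18_25"))  # [3, 5]
```

Production bitmap indexes add run-length or roaring-style compression on top of this idea, which is where compression ratios like the 50× cited above come from on sparse or clustered id sets.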

Intelligent Computing – Implements smart materialization and automatic index selection. Materialization leverages statistical analysis and machine‑learning models to pre‑compute hot queries, while heuristic algorithms choose optimal partition columns and indexes based on table statistics and query history.
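A minimal sketch of the heuristic idea, under stated assumptions: pick the partition column that appears most often in equality filters across the query history. The history format, the crude regex, and the column names are illustrative; Dolphin's actual selection uses richer table statistics and learned models.

```python
# Hedged sketch of heuristic partition-column selection; the predicate
# extraction and history format are assumptions for this example.
from collections import Counter
import re


def pick_partition_column(query_history: list[str]) -> str:
    """Return the column most frequently used in `col = value` filters."""
    freq = Counter()
    for sql in query_history:
        # Crude predicate extraction: any `col = ...` pattern.
        for col in re.findall(r"(\w+)\s*=\s*", sql.lower()):
            freq[col] += 1
    column, _ = freq.most_common(1)[0]
    return column


history = [
    "SELECT * FROM ads WHERE dt = '2024-01-01' AND campaign = 7",
    "SELECT count(*) FROM ads WHERE dt = '2024-01-02'",
    "SELECT * FROM ads WHERE dt = '2024-01-03' AND user = 42",
]
print(pick_partition_column(history))  # dt
```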

One‑Write‑Multiple‑Read – Combines DBFS (high‑performance SSD storage) and HDFS (cold storage) to reduce storage costs by >70% while maintaining query performance through column pruning, caching, and push‑down optimizations.
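A simplified tiering policy makes the cost trade-off concrete. The DBFS/HDFS tier names follow the article, but the age-based policy and its threshold are assumptions for illustration, not Dolphin's actual placement algorithm.

```python
# Illustrative hot/cold tiering policy for one-write-multiple-read storage.
# The 7-day threshold is an assumed example value.
from datetime import date, timedelta


def storage_tier(partition_date: date, today: date, hot_days: int = 7) -> str:
    """Recent partitions stay on SSD-backed DBFS; older ones move to HDFS."""
    return "DBFS" if today - partition_date <= timedelta(days=hot_days) else "HDFS"


today = date(2024, 6, 30)
print(storage_tier(date(2024, 6, 28), today))  # DBFS
print(storage_tier(date(2024, 1, 1), today))   # HDFS
```

Because data is written once and then served from whichever tier it lives on, readers see one logical table while the bulk of historical partitions sit on cheap HDFS capacity.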

Domain Capabilities

• OLAP : Supports sub‑millisecond, high‑QPS queries on trillion‑row datasets using bitmap and GroupTable indexes, as well as MergeTree tables for reporting and attribution.

• AI Service : Provides a unified SQL layer that abstracts downstream systems, enabling data scientists to perform preprocessing, vector recall, and model scoring entirely with SQL.

• Streaming : Offers a DB‑for‑Streaming experience where users develop real‑time jobs via SQL without dealing with Flink or storage details, supporting transparent, low‑latency feature computation.

• Batch : Delivers domain‑specific batch processing (e.g., vector recall, batch scoring) with a productized UI that reduces development time to minutes and improves performance and stability.
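To illustrate the "everything via SQL" experience for the AI and batch cases above, here is a minimal sketch of vector recall as it might sit behind a SQL function call. The `vector_recall` name and table layout are hypothetical; Dolphin's real recall runs over indexed vector storage, not an in-memory scan.

```python
# Minimal cosine-similarity top-k recall (an illustrative assumption, not
# Dolphin's implementation). A SQL layer could expose this as a UDF.
import math


def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)


def vector_recall(table, query_vec, top_k=2):
    """Return ids of the top_k rows whose vectors are most similar."""
    scored = sorted(table, key=lambda row: cosine(row["vec"], query_vec),
                    reverse=True)
    return [row["id"] for row in scored[:top_k]]


items = [
    {"id": "a", "vec": [1.0, 0.0]},
    {"id": "b", "vec": [0.9, 0.1]},
    {"id": "c", "vec": [0.0, 1.0]},
]
print(vector_recall(items, [1.0, 0.05]))  # ['a', 'b']
```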

Conclusion – Dolphin exemplifies the trend toward hyper‑converged, serverless, and intelligent big‑data platforms, allowing engineers and data scientists to handle any scale of data processing, real‑time jobs, and algorithmic workloads using a unified SQL interface.

Tags: AI, Streaming, Big Data, OLAP, SQL engine, Indexing
Written by Alimama Tech

Official Alimama tech channel, showcasing all of Alimama's technical innovations.