
How Apache Doris Enables Cloud‑Native Real‑Time Data Warehousing for Log Analytics

Based on a DTCC2022 presentation, this article explains Apache Doris's high‑performance MPP architecture, its cloud‑native extensions in SelectDB, and how they solve large‑scale log storage and analysis with superior write throughput, storage efficiency, and interactive query speed.

ITPUB

Apache Doris Overview

Apache Doris is a high‑performance, real‑time analytical database built on an MPP architecture, offering sub‑second query responses on massive data sets. Since graduating from the Apache Incubator in June 2022, it has attracted over 400 contributors and is used in more than 1,000 production environments worldwide, both at major Chinese internet companies such as Baidu, Meituan, Xiaomi, JD, ByteDance, and Tencent and at organizations in many traditional industries.

Doris supports both high‑concurrency point queries and high‑throughput complex analytics, and its MySQL‑compatible protocol enables seamless integration with existing tools.

Typical Log Storage and Analysis Scenario

Log data requires massive write throughput, low storage cost, and fast interactive queries with full‑text search and time‑based sorting. Traditional solutions fall into two categories: inverted‑index systems like Elasticsearch (ES) and metadata‑index or no‑index architectures like Loki. ES provides fast query performance but limited write throughput and higher storage cost, while Loki offers higher write throughput and lower storage cost but slower queries.

Log Scenario Solution

SelectDB, the commercial cloud‑native version of Apache Doris, builds a log‑analysis solution by leveraging Doris's high‑performance vectorized engine, SelectDB's compute‑storage separation, lightweight inverted index, and time‑series management.

Data ingestion uses Logstash’s http output plugin to write logs into SelectDB. Downstream, Grafana (through its MySQL data source) and Superset provide visual dashboards, and BI tools can query the data over the MySQL protocol.
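The HTTP write path can be sketched in a few lines. The article does not say which endpoint the Logstash plugin targets, so this sketch assumes Apache Doris's Stream Load HTTP interface (a PUT to `/api/{db}/{table}/_stream_load`); a SelectDB cloud deployment may expose a different endpoint. The host, port, database, table, and helper names are hypothetical. The code only groups records into micro-batches and builds the request; sending it and handling redirects are left out.

```python
import base64
import json
import uuid
from urllib.request import Request


def batch_logs(records, max_rows=1000):
    """Group log records into micro-batches of at most max_rows rows."""
    batch = []
    for record in records:
        batch.append(record)
        if len(batch) >= max_rows:
            yield batch
            batch = []
    if batch:
        yield batch


def build_stream_load_request(host, db, table, user, password, batch):
    """Build (but do not send) an HTTP PUT carrying one micro-batch.

    Assumption: the target is a Doris-style Stream Load endpoint.
    """
    # One JSON object per line, matching the read_json_by_line header below.
    body = "\n".join(json.dumps(r) for r in batch).encode("utf-8")
    token = base64.b64encode(f"{user}:{password}".encode("utf-8")).decode("ascii")
    headers = {
        "Authorization": f"Basic {token}",
        "Expect": "100-continue",
        "format": "json",
        "read_json_by_line": "true",
        # A unique label per batch lets the server reject an accidental duplicate load.
        "label": f"logs-{uuid.uuid4()}",
    }
    url = f"http://{host}/api/{db}/{table}/_stream_load"
    return Request(url, data=body, headers=headers, method="PUT")
```

Batching on the client is what keeps the number of small writes down: each request carries many rows, so the server has fewer tiny files to merge later.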

Performance tests on a 3‑node 16‑core, 64 GB cluster using the ES official benchmark dataset (32 GB, about 247 million rows) show that SelectDB achieves 4.2× higher write speed than ES, uses only one‑fifth of ES’s storage space, and delivers 2× faster query performance.

Key Technology Breakdown

1. MPP Query and Vectorized Engine – Columnar memory layout and SIMD instructions improve cache hit rates, delivering 5‑10× speedups in wide‑table aggregations.

2. Multi‑Operator Optimization & Optimizer – Adaptive two‑stage aggregation, runtime filter push‑down, and the Nereids optimizer (supporting CBO and RBO) enhance join reorder and predicate push‑down, yielding 2‑10× performance gains.

3. Lightweight Inverted Index – Integrated index supports fast text, numeric, and date searches with bitmap‑based structures, and columnar storage with ZSTD compression achieves >5× higher compression than gzip.

4. Compute‑Storage Separation Cloud‑Native Architecture – Object storage for data, shared cache for writes, elastic scaling, and workload isolation reduce storage cost to one‑fifth and unit cost to one‑third.

5. High‑Throughput Real‑Time Writes – Client‑side micro‑batch writes combined with server‑side compaction achieve GB/s write throughput with low write amplification.

6. Fast Interactive Queries – Partition‑pruned time‑range scans, inverted‑index full‑text search, and TopN algorithms enable billion‑log queries with sub‑second response times.
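The three steps named in item 6 (partition pruning by time range, inverted‑index keyword lookup, and TopN selection) can be illustrated with a toy in‑memory model. This is a minimal sketch, not Doris's implementation: plain Python sets stand in for compressed bitmaps, one‑day partitions are an assumption, and the class and method names are hypothetical.

```python
import heapq
from collections import defaultdict

SECONDS_PER_DAY = 86400


class ToyLogIndex:
    """In-memory stand-in for a day-partitioned log table with an inverted index."""

    def __init__(self):
        self.rows = []                        # row id -> (timestamp, message)
        self.partitions = defaultdict(list)   # day bucket -> row ids
        self.postings = defaultdict(set)      # token -> row ids containing it

    def add(self, ts, message):
        row_id = len(self.rows)
        self.rows.append((ts, message))
        self.partitions[ts // SECONDS_PER_DAY].append(row_id)
        for token in message.lower().split():
            self.postings[token].add(row_id)

    def search(self, keyword, start_ts, end_ts, limit):
        """Return the `limit` newest rows in [start_ts, end_ts] containing keyword."""
        # 1. Partition pruning: only touch day buckets that overlap the time range.
        days = range(start_ts // SECONDS_PER_DAY, end_ts // SECONDS_PER_DAY + 1)
        candidates = {rid for day in days for rid in self.partitions.get(day, [])}
        # 2. Inverted-index lookup: intersect with the keyword's posting set.
        candidates &= self.postings.get(keyword.lower(), set())
        # 3. Exact time filter inside the surviving partitions.
        hits = (self.rows[rid] for rid in candidates
                if start_ts <= self.rows[rid][0] <= end_ts)
        # 4. TopN: keep only `limit` rows in a bounded heap instead of sorting all hits.
        return heapq.nlargest(limit, hits, key=lambda row: row[0])


index = ToyLogIndex()
index.add(10, "GET /index 200")
index.add(20, "GET /login error 500")
index.add(SECONDS_PER_DAY + 5, "POST /pay error 502")
index.add(3 * SECONDS_PER_DAY, "POST /pay error 504")
latest_errors = index.search("error", 0, 2 * SECONDS_PER_DAY, limit=1)
```

The point of the ordering is that each step shrinks the candidate set before the next one runs, so the final TopN only has to rank rows that already match both the time range and the keyword.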

Open Source Contribution

The technologies discussed here (lightweight inverted index, TopN optimization, and time‑series compaction) have been contributed back to the Apache Doris community and are slated for release in Doris 2.0 (Q1 2023). A preview of Doris 2.0 will be available in February for community testing.

[Figures: Key Technology Diagram; Performance Test Results; Architecture Overview; Storage Cost Model; Write Throughput; Query Optimization]