Industry Insights 15 min read

How Paimon + Dolphin Transform Alibaba’s Brand Data Warehouse for Real‑Time Insights

This article analyzes the challenges of Alibaba Mama's brand advertising data warehouse built on a Lambda architecture, introduces Apache Paimon lake storage and Dolphin OLAP engine as a unified lakehouse solution, details implementation steps, performance gains, and business benefits across multiple advertising scenarios.

Alimama Tech
Alimama Tech
Alimama Tech
How Paimon + Dolphin Transform Alibaba’s Brand Data Warehouse for Real‑Time Insights

Background

Alibaba Mama’s brand advertising logs (delivery, exposure, click, etc.) were stored in a Lambda‑based data warehouse. The architecture required separate offline and real‑time jobs, duplicated storage, and high compute consumption, leading to low development efficiency, high storage cost, and heavy resource usage.

Paimon Lake Storage

Key Concepts

Schema : Table metadata, including field definitions and configuration parameters.

Snapshot : Captures the table state at a specific point in time; the latest snapshot provides the most recent data.

Manifest : Records file additions and deletions for each snapshot.

Partition : Same partition semantics as ODPS, used to separate data.

Bucket : Divides data within a partition (or non‑partitioned table) into buckets for efficient read/write.

Data Files : Files are grouped by partition and bucket; Paimon supports ORC, Parquet, and Avro formats.

Why Choose Paimon?

Paimon is incubated from the Apache Flink community and integrates tightly with Flink, Spark, and various OLAP engines. It supports multiple storage back‑ends (e.g., Pangu, OSS) and offers streaming‑write and batch‑write, primary‑key updates, aggregation tables, and partial‑update tables, delivering high read/write performance with low storage cost.

Dolphin OLAP Engine

Dolphin is an OLAP query engine built on top of Paimon, exposing a high‑performance query interface (HSF) for low‑latency analytics.

Technical Enhancements

Approximate UV Calculation

Exact UV counting is resource‑intensive. Paimon provides a theta_sketch aggregation, and Dolphin adds an hll_sketch function for approximate UV.

CREATE TABLE IF NOT EXISTS `brand_lakehouse`.`brand`.`uv_table` (
  `order_id` STRING NOT NULL,
  `uv` BYTES,
  PRIMARY KEY (order_id) NOT ENFORCED
) WITH (
  'fields.uv.aggregate-function' = 'hll_sketch',
  'merge-engine' = 'aggregation'
);
INSERT INTO `brand_lakehouse`.`brand`.`uv_table`
SELECT t.pid AS pid,
       HllSketchFunction(CAST(t.aid AS BIGINT)) AS uv
FROM brand_ob_midlog,
     LATERAL TABLE(BrandOBMidLogCommonParser(content)) AS t;
SELECT hll_sketch_get_estimate(
       hll_sketch_in(CAST(encode(uv,'base64') AS cstring))
) FROM uv_table
WHERE pid = 'xxx';

SST Format Support

For high‑QPS point‑lookup scenarios, Dolphin can create Paimon tables in SST format.

CREATE TABLE `brand_lakehouse`.`brand`.`brand_aid_behavior_table` (
  aid STRING,
  behavior STRING,
  dt STRING,
  PRIMARY KEY (aid, dt) NOT ENFORCED
) WITH (
  'file.format' = 'sst',
  'bucket' = '100',
  'merge-engine' = 'deduplicate',
  'write-mode' = 'change-log',
  'snapshot.time-retained' = '8h'
);

INSERT INTO `brand_lakehouse`.`brand`.`brand_aid_behavior_table`
SELECT aid, behavior,
       DATE_FORMAT(now(),'yyyyMMdd') AS dt,
       DATE_FORMAT(now(),'HH') AS hh
FROM brand_midlog;

Query Performance Optimization

Dolphin’s HSF interface delivers lower response time and higher QPS compared with Flink, ODPS, Hologres, and StarRocks.

Brand Data Pipeline Upgrade

Data Ingestion

Paimon supports two write modes:

change‑log : Used for primary‑key tables; supports insert, update, delete, and deduplication via merge-engine.

append‑only : Used for high‑volume, low‑quality‑requirement tables; only inserts are allowed.

Primary‑key table example

CREATE TABLE `alimm-brand-lakehouse`.`brand`.`brand_imp_acc_log` (
  pkey STRING,
  ...,
  dt STRING,
  hh STRING,
  PRIMARY KEY (dt, hh, pkey) NOT ENFORCED
) PARTITIONED BY (dt, hh) WITH (
  'bucket' = '100',
  'write-mode' = 'change-log',
  'merge-engine' = 'first-row',
  'maxcompute.life-cycle' = '180',
  'bucket-key' = 'pkey',
  'changelog-producer' = 'lookup',
  'consumer.expiration-time' = '24h',
  'snapshot.num-retained.min' = '3',
  'snapshot.num-retained.max' = '5'
);

INSERT INTO `alimm-brand-lakehouse`.`brand`.`brand_imp_acc_log`
SELECT pkey, ...,
       DATE_FORMAT(now(),'yyyyMMdd') AS dt,
       DATE_FORMAT(now(),'HH') AS hh
FROM brand_imp_acc_log,
     LATERAL TABLE(BrandLogParserUDTF(content)) AS T(...);

Append‑only table example

CREATE TABLE `alimm-brand-lakehouse`.`brand`.`brand_sn_trace_log` (
  campaign_id BIGINT,
  ...,
  dt STRING,
  hh STRING
) PARTITIONED BY (dt, hh) WITH (
  'maxcompute.life-cycle' = '30',
  'bucket' = '1000',
  'bucket-key' = 'session_id',
  'consumer.expiration-time' = '24h',
  'write-mode' = 'append-only',
  'snapshot.time-retained' = '24h',
  'partition.expiration-time' = '30d',
  'partition.timestamp-formatter' = 'yyyyMMdd',
  'snapshot.num-retained.min' = '3',
  'snapshot.num-retained.max' = '5'
);

INSERT INTO `alimm-brand-lakehouse`.`brand`.`brand_sn_trace_log`
SELECT campaign_id, ...,
       DATE_FORMAT(now(),'yyyyMMdd') AS dt,
       DATE_FORMAT(now(),'HH') AS hh
FROM brand_sn_ht_trace_log,
     LATERAL TABLE(BrandSNParseUDTF(content)) AS T(...);

Scenario 1 – Real‑Time Brand Fulfilment Report

Data freshness improved from 1 hour to 8 minutes; query latency reduced by ~50% after Dolphin optimization.

Scenario 2 – Multi‑Goal Advertising Control

The unified lakehouse cut development and resource costs by >50%. Dolphin increased QPS from 50 to 700 and reduced response time from 1 s to 100 ms (≈10× improvement).

Scenario 3 – Real‑Time Feature Production

Streaming ingestion of exposure, click, and reach logs into Paimon enables one‑day window feature generation. Partial‑update tables join on aid to provide complete feature sets for downstream models, improving CTR by >2%.

Benefits

Unified stream‑batch development reduces development and operation effort.

Data freshness upgraded from hourly to minute‑level.

Storage and compute costs reduced by >60% while supporting high‑QPS analytical workloads.

Manpower requirements decreased by ~50%.

big dataData WarehouseOLAPPaimonDolphinlakehouse
Alimama Tech
Written by

Alimama Tech

Official Alimama tech channel, showcasing all of Alimama's technical innovations.

0 followers
Reader feedback

How this landed with the community

Sign in to like

Rate this article

Was this worth your time?

Sign in to rate
Discussion

0 Comments

Thoughtful readers leave field notes, pushback, and hard-won operational detail here.