
WeChat Experiment Platform: Architecture Design and Iceberg Lakehouse Optimization

The WeChat Experiment Platform migrated its pipeline of more than 60,000 metrics, 200,000 cores, and 30+ PB of data to an Iceberg-based lakehouse. By leveraging three-layer metadata, fine-grained partitioning, MERGE INTO writes, time-travel snapshots, and skew-handling UDFs, the migration cut core time by 69%, saved ~100 PB of storage, and reduced latency by up to 70%.


The WeChat Experiment Platform provides experimental support for all internal WeChat business scenarios (Video Channels, Live, Search, Official Accounts, etc.) and offers various experiment types such as A/B, MAB, BO, interleaving, client-side, social-network, and bilateral experiments.

It currently handles more than 60,000 business metrics, making it a major consumer of compute and storage resources. The platform’s scale includes:

Big‑data scenario compute resources: total cores > 200,000, storage > 30 PB

OLAP scenario compute resources: total cores > 20,000, storage > 5 PB

To ensure cost-effectiveness and stability, most compute resources are consolidated on the Gemini cloud-native big-data platform, with Gaia (Tencent Cloud Native Big-Data Platform) serving as a backup. Storage relies entirely on Gaia.

Why Iceberg? Iceberg, Hudi, and Delta Lake emerged around the same time, but the team chose Iceberg for several reasons:

Strong internal support from the data‑analysis team.

Workloads are primarily offline or near‑real‑time (≈5 min latency), and Iceberg’s advanced features meet these needs.

Iceberg’s pluggable design supports multiple file formats (Parquet/ORC/Avro) and storage types (Object/File storage) and integrates with various compute engines (Spark, Flink, Hive, Trino, Impala, etc.).

Iceberg’s three‑layer metadata architecture alleviates the Hive metastore bottleneck. By moving metadata to HDFS, the platform avoids the single‑point RDBMS limitation that caused Hive to fail when metadata volume grew to millions of partitions.

Schema Optimization: For hit tables, the platform partitions first by day and then by experiment ID, dramatically reducing I/O and shuffle costs. Iceberg's independent metadata allows this fine-grained partitioning without overloading the metastore.
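To make the pruning benefit concrete, here is a toy pure-Python model of Iceberg's manifest-level file planning (the class names and structures are hypothetical, not the real Iceberg types): each manifest records partition values per data file, so a query on one day and one experiment ID touches only the matching files, with no metastore lookup.

```python
# Illustrative sketch of Iceberg-style metadata pruning; names are
# hypothetical, not the actual Iceberg classes.
from dataclasses import dataclass

@dataclass
class DataFile:
    path: str
    partition: dict  # e.g. {"ds": "2022-12-07", "exptid": 42}

@dataclass
class Manifest:
    data_files: list

@dataclass
class Snapshot:
    snapshot_id: int
    manifests: list  # the "manifest list" layer of the metadata tree

def plan_files(snapshot, predicate):
    """Prune at the file level using partition values stored in manifests."""
    return [f for m in snapshot.manifests
              for f in m.data_files
              if predicate(f.partition)]

snap = Snapshot(1, [
    Manifest([DataFile("a.parquet", {"ds": "2022-12-07", "exptid": 42}),
              DataFile("b.parquet", {"ds": "2022-12-07", "exptid": 7})]),
    Manifest([DataFile("c.parquet", {"ds": "2022-12-06", "exptid": 42})]),
])
hits = plan_files(snap, lambda p: p["ds"] == "2022-12-07" and p["exptid"] == 42)
# Only a.parquet survives the day + experiment-ID filter.
```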

MERGE INTO vs. INSERT OVERWRITE: unlike INSERT OVERWRITE, which rewrites entire partitions, Iceberg's MERGE INTO rewrites only the affected data files, improving write efficiency. Example:

MERGE INTO iceberg_catalog.mmexpt_lakehouse.mmexpt_cumu_finder t
USING (
  SELECT first_hit_ds, uin, exptid, groupid, bucketsrc_hit
  FROM iceberg_catalog.mmexpt_lakehouse.mmexpt_daily_finder
) s
ON t.uin = s.uin AND t.groupid = s.groupid
WHEN MATCHED AND s.first_hit_ds < t.first_hit_ds THEN
  UPDATE SET t.first_hit_ds = s.first_hit_ds
WHEN NOT MATCHED THEN
  INSERT (first_hit_ds, uin, exptid, groupid, bucketsrc_hit, ext_int, ext_string)
  VALUES (s.first_hit_ds, s.uin, s.exptid, s.groupid, s.bucketsrc_hit, NULL, NULL);
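The first-hit upsert semantics of the statement above can be sketched in plain Python (an illustrative model of the merge logic, not how Iceberg executes it):

```python
# Hypothetical model of the MERGE INTO first-hit upsert.
# target maps (uin, groupid) -> first_hit_ds; source rows carry first_hit_ds.
def merge_first_hit(target, source_rows):
    for row in source_rows:
        key = (row["uin"], row["groupid"])
        if key in target:
            # WHEN MATCHED: keep the earliest hit date for the user/group pair.
            if row["first_hit_ds"] < target[key]:
                target[key] = row["first_hit_ds"]
        else:
            # WHEN NOT MATCHED: insert the newly seen user/group pair.
            target[key] = row["first_hit_ds"]
    return target

t = {(1, "a"): "2022-12-05"}
s = [{"uin": 1, "groupid": "a", "first_hit_ds": "2022-12-04"},
     {"uin": 2, "groupid": "a", "first_hit_ds": "2022-12-07"}]
merge_first_hit(t, s)
# t -> {(1, "a"): "2022-12-04", (2, "a"): "2022-12-07"}
```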

Time-travel snapshots enable querying a table's historical state as of a given timestamp:

-- time travel to 2022-12-07 01:21:00
SELECT * FROM mmexpt_lakehouse.table TIMESTAMP AS OF '2022-12-07 01:21:00';

Iceberg’s snapshot mechanism also supports version rollback:

-- roll the table back to an ancestor snapshot in its history
CALL catalog_name.system.rollback_to_snapshot('db.sample', 1);
-- or set the current snapshot pointer directly
CALL catalog_name.system.set_current_snapshot('db.sample', 1);

Handling Data Skew: A UDF hashes experiment ID and user ID to create a secondary bucket, then repartitions and sorts within partitions:

import scala.collection.mutable.ListBuffer
import org.apache.spark.sql.functions.{col, udf}

// Salt hot experiment IDs: users in large experiments are spread across
// 50 sub-buckets; small experiments keep a single bucket per exptid.
val bucketIdHashUdf = udf((exptid: Long, uin: Long) => {
  val maxExptIds: ListBuffer[Long] = maxExptIdsBroadCast.value
  if (maxExptIds.contains(exptid)) {
    exptid.toString + "_" + ((uin.hashCode() & Integer.MAX_VALUE) % 50)
  } else {
    exptid.toString
  }
})
val icebergDf = df
  .withColumn("bucket_id", bucketIdHashUdf(col("exptid"), col("uin")))
  .repartition(partitionNum, col("ds"), col("bucket_id"))  // even out task sizes
  .sortWithinPartitions("ds", "exptid")                    // cluster rows for write
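The salting scheme can be mirrored in plain Python to show how one hot experiment ID fans out across 50 sub-buckets while cold IDs keep a single bucket (the hot-ID set and bucket count here are hypothetical):

```python
# Pure-Python mirror of the salting logic in the Scala UDF above.
HOT_EXPT_IDS = {1001}    # hypothetical "large experiment" IDs
NUM_SALT_BUCKETS = 50

def bucket_id(exptid: int, uin: int) -> str:
    if exptid in HOT_EXPT_IDS:
        # Salt hot keys: append a stable hash of the user ID mod 50.
        return f"{exptid}_{(hash(uin) & 0x7FFFFFFF) % NUM_SALT_BUCKETS}"
    return str(exptid)

# A hot experiment's users land in 50 distinct buckets...
buckets = {bucket_id(1001, uin) for uin in range(10_000)}
assert len(buckets) == NUM_SALT_BUCKETS
# ...while a cold experiment stays in a single bucket.
assert bucket_id(7, 123) == "7"
```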

The table property write.distribution-mode is set to range to control how data is shuffled before writing, and write.target-file-size-bytes is tuned to avoid a small-file explosion in batch jobs.
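The effect of write.target-file-size-bytes can be checked with back-of-the-envelope arithmetic (the partition size and targets below are hypothetical, not the platform's actual settings):

```python
# Rough file-count estimate for one written partition: data volume divided
# by the configured target file size, rounded up.
import math

def expected_file_count(partition_bytes: int, target_file_size_bytes: int) -> int:
    return math.ceil(partition_bytes / target_file_size_bytes)

# A 64 GiB daily partition with Iceberg's default 512 MiB target -> 128 files;
# an 8 MiB target would shatter it into 8192 small files.
assert expected_file_count(64 * 2**30, 512 * 2**20) == 128
assert expected_file_count(64 * 2**30, 8 * 2**20) == 8192
```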

Performance Gains : After migrating ~20 PB of historical data to Iceberg and adopting the new pipeline, the platform achieved:

Core‑time reduction of 69.4% (≈200,000 core‑hours saved per day)

Storage savings of ~100 PB

Estimated power savings of ~3 kW

Task p99 latency reduced by 70% and average latency by 60%

Additional optimizations include hardware acceleration (RAID‑0 SSD arrays increasing I/O throughput fourfold) and exploring StarRocks + Iceberg for tighter lake‑warehouse integration.

The article concludes with a call for feedback on lake‑house advantages and mentions a related patent (CN 2023010065).

Tags: big data, data warehouse, Spark, Iceberg, lakehouse, MERGE INTO, metric computation, time travel
Written by

Tencent Cloud Developer

Official Tencent Cloud community account that brings together developers, shares practical tech insights, and fosters an influential tech exchange community.
