Optimizing Full Partition Tables with Zipper Tables, Hudi+Flink CDC, and Data Warehouse Strategies
Facing server‑hardware constraints, Bilibili’s data platform replaced wasteful full‑partition tables with a zipper‑table approach—preserving change history while cutting storage from petabytes to terabytes—and complemented it with Hudi + Flink CDC for near‑real‑time updates, dramatically lowering I/O, compute usage and latency.
Since 2020, the global semiconductor shortage has constrained server-hardware procurement for internet companies. Bilibili, which relies heavily on enterprise-grade servers for data computation and storage, has seen demand for storage and compute grow with user activity, leaving capacity insufficient on both fronts.
The data platform team classified the main data types into four categories: log data, business data, manually produced data, and metadata. Log data (client and server logs) is incremental and immutable once written, while business data originates from transactional databases and requires frequent updates. Manually produced data and metadata are comparatively small in volume and resource demand.
To handle business data, the team traditionally used a full-partition (T-1) table pattern, in which each daily partition stores the complete dataset as of the previous day. This works well when resources are abundant, but under tight resource conditions it wastes storage on duplicate copies of unchanged data and risks losing history when early partitions are deleted.
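A back-of-the-envelope comparison shows why the duplication hurts. The numbers below are illustrative assumptions, not Bilibili's figures: a T-1 full-partition table re-stores the whole snapshot every day, while a zipper table stores one base copy plus only the rows that changed.

```python
# Illustrative storage comparison (made-up numbers, not Bilibili's):
# a T-1 full-partition table keeps one full snapshot per daily partition,
# a zipper table keeps one base copy plus only changed rows per day.

def full_partition_storage(snapshot_tb, retention_days):
    # One complete copy of the dataset per retained partition.
    return snapshot_tb * retention_days

def zipper_storage(snapshot_tb, retention_days, daily_change_ratio):
    # One base copy, plus a daily increment proportional to the change rate.
    return snapshot_tb * (1 + retention_days * daily_change_ratio)

print(full_partition_storage(10, 365))   # 3650 TB for one year of a 10 TB table
print(zipper_storage(10, 365, 0.001))    # ~13.65 TB for the same history
```

With a 0.1% daily change rate, a year of history shrinks by more than two orders of magnitude, which is the same shape of saving the article reports (1.20 PB to 5.06 TB).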
Solution 1 – Traditional Full Partition Table (offline mode)
Data is refreshed in an offline pipeline: the T-2 full partition is loaded, then T-1 incremental data is applied on top to produce the latest T-1 full partition. This method offers high performance and throughput but low timeliness, since data is always at least a day old.
SQL pseudo‑code:
insert overwrite table target_table partition (dt = yesterday)
select
    column_list
from (
    select
        column_list,
        row_number() over (
            partition by primary_key
            order by update_field desc
        ) as rn
    from (
        select column_list
        from incremental_table
        where dt = yesterday
        union all
        select column_list
        from full_table
        where dt = day_before
    ) n
) t
where rn = 1

Solution 2 – Real‑time Data Lake with Hudi + Flink CDC
By ingesting the initial full snapshot offline into a Hudi table and then continuously applying MySQL binlog changes via Flink CDC, the system achieves near‑real‑time consistency (minute‑level latency). Benchmarks show that for daily new/changed records below 10 million, Hudi + Flink CDC can merge data efficiently, though full historical versioning still needs community improvements.
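Conceptually, the Hudi table behaves like a keyed snapshot that Flink CDC keeps current by replaying binlog change events in order. A minimal Python sketch of that upsert/delete semantics follows; the event shapes and row contents are illustrative assumptions, not Bilibili's actual schema:

```python
# Sketch of CDC apply semantics: a table keyed by primary key, kept
# current by replaying ordered binlog-style change events.
# Event tuples ("insert"/"update"/"delete", key, row) are an assumption
# for illustration; real binlog records carry more fields.

def apply_binlog(snapshot: dict, events: list) -> dict:
    """Apply ordered change events to a keyed snapshot."""
    for op, key, row in events:
        if op in ("insert", "update"):
            snapshot[key] = row          # upsert: last write wins
        elif op == "delete":
            snapshot.pop(key, None)      # drop the record if present
    return snapshot

# Initial full snapshot loaded offline, then near-real-time changes applied.
table = {1: {"nickname": "alice"}, 2: {"nickname": "bob"}}
events = [
    ("update", 1, {"nickname": "alice_v2"}),
    ("delete", 2, None),
    ("insert", 3, {"nickname": "carol"}),
]
table = apply_binlog(table, events)
print(table)  # {1: {'nickname': 'alice_v2'}, 3: {'nickname': 'carol'}}
```

In the real pipeline, Hudi's upsert handles the "last write wins" merge on the primary key, which is why the approach stays efficient while daily change volume remains below the ~10 million-record threshold mentioned above.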
Solution 3 – Zipper Table (link table) for Full Partition Optimization
The zipper table records the full history of each record, enabling seamless migration from full partition tables and supporting low‑timeliness offline workloads while preserving all change versions. This approach aligns with the data‑warehouse concept of a subject‑oriented, integrated, time‑variant collection.
Definition: a zipper table stores a start date and an end date for each record version, where an end date of '9999-12-31' denotes the current version.
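The start/end dates make point-in-time lookups straightforward: the version of a record valid on day d is the row whose [start_date, end_date] interval contains d. A small Python sketch, with made-up user rows for illustration:

```python
# Point-in-time lookup on a zipper table: the valid version on a given
# date is the row whose [start_date, end_date] interval contains it.
# The rows below are illustrative, not real data.
OPEN_END = "9999-12-31"  # sentinel end date marking the current version

rows = [
    {"id": 1, "nickname": "foo", "start_date": "2023-01-01", "end_date": "2023-03-14"},
    {"id": 1, "nickname": "bar", "start_date": "2023-03-15", "end_date": OPEN_END},
]

def as_of(rows, record_id, day):
    """Return the version of record_id valid on `day` (ISO date string).

    ISO date strings compare correctly with plain string comparison.
    """
    for r in rows:
        if r["id"] == record_id and r["start_date"] <= day <= r["end_date"]:
            return r
    return None

print(as_of(rows, 1, "2023-02-01")["nickname"])  # foo
print(as_of(rows, 1, "2023-06-01")["nickname"])  # bar (current version)
```

This is the time-variant property in practice: every historical version stays queryable without storing a full daily copy.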
Example SQL to generate a zipper table:
INSERT OVERWRITE TABLE zipper_table
SELECT
    n1.id,
    n1.nickname,
    n1.start_date,
    CASE
        WHEN n1.end_date = '9999-12-31' AND n2.id IS NOT NULL
            THEN date_sub(business_date, 1)  -- close out the old current version
        ELSE n1.end_date
    END AS end_date
FROM zipper_table n1
LEFT OUTER JOIN (
    SELECT id
    FROM user_table
    WHERE yesterday_new_register OR yesterday_nickname_change
) n2
ON n1.id = n2.id
UNION ALL
SELECT
    id,
    nickname,
    business_date AS start_date,
    '9999-12-31' AS end_date
FROM user_table
WHERE yesterday_new_register OR yesterday_nickname_change

Further optimization is achieved via a MapReduce workflow that reads the existing zipper table and the incremental data only once, avoiding extra sorting steps and reducing both storage and compute consumption.
In summary, converting full partition tables to zipper tables cuts storage dramatically (from 1.20 PB to 5.06 TB in one test case) and reduces I/O and compute resource usage by about 17%. Future work includes making zipper-table generation idempotent, automating coverage and duplication testing, and applying the optimization selectively where it pays off.
Bilibili Tech
Provides introductions and tutorials on Bilibili-related technologies.