Roundtable Discussion on Data Lake Technology Maturity and Governance Practices
Experts from Kuaishou, Ping An Insurance, and other companies (including a speaker formerly at Tencent) discuss data lake maturity, column‑level governance, resource management for unstructured data, and automated optimization techniques such as Iceberg small‑file merging, highlighting how these advances improve data quality and business decision‑making.
This excerpt is from the roundtable summary document of the Data Lake Technology Maturity Curve release, featuring experts from Kuaishou, former Tencent, Ping An Insurance and other companies.
Host Jin Guowei notes that whether it is a lake, warehouse, or lakehouse, the ultimate goal is to solve business problems; building massive data assets inevitably leads to the need for governance once a certain scale is reached.
He explains that data lake governance shifts focus from table‑level to column‑level operations, citing the need to manage column additions and deletions that affect downstream data, and invites Tang Langfei to share his view.
Tang Langfei emphasizes resource management, stressing the growing importance of unstructured or semi‑structured data in modern AI‑driven projects and the necessity of governance mechanisms for such data.
He describes how the lakehouse model unifies management of unstructured and structured data, making it easier to handle lifecycle and schema changes compared with traditional warehouses.
He also points out that traditional databases mainly handle relational data, whereas data lakes provide clearer field lineage and enable indexing, aiding data definition governance and content repair.
In summary, he identifies resource management and clear lineage as two key directions for improving data quality and supporting reliable business decisions.
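The lineage point above can be made concrete with a small sketch. The following Python is purely illustrative (the class, column names, and tables are hypothetical, not any speaker's actual system): it models column‑level lineage as a directed graph and answers the governance question raised earlier, namely which downstream columns break if an upstream column is deleted.

```python
from collections import defaultdict


class ColumnLineage:
    """Toy column-level lineage graph. Columns are identified as
    'table.column' strings; all names below are illustrative."""

    def __init__(self):
        # upstream column -> set of columns directly derived from it
        self._downstream = defaultdict(set)

    def add_edge(self, upstream: str, downstream: str) -> None:
        """Record that `downstream` is computed from `upstream`."""
        self._downstream[upstream].add(downstream)

    def impact_of_dropping(self, column: str) -> set:
        """Return every column affected, transitively, if `column`
        were deleted: the check a governance tool would run before
        approving a schema change."""
        affected, stack = set(), [column]
        while stack:
            for child in self._downstream.get(stack.pop(), ()):
                if child not in affected:
                    affected.add(child)
                    stack.append(child)
        return affected


lineage = ColumnLineage()
lineage.add_edge("ods.orders.amount", "dwd.orders.amount_cny")
lineage.add_edge("dwd.orders.amount_cny", "ads.daily_gmv.gmv")

# Dropping the source column would break both derived columns.
print(lineage.impact_of_dropping("ods.orders.amount"))
```

Real systems derive these edges automatically from SQL parsing or query logs rather than manual registration, but the impact query is the same graph traversal.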
Host Jin thanks Tang and invites Shao Saisei to share further thoughts.
Shao Saisei highlights two aspects of governance: first, tracking the lifecycle of data assets at column or table level to ensure quality and consistency; second, system‑level or technical governance, which is less visible in traditional warehouses but essential for data lakes.
He describes Tencent's internal automatic optimization system for Iceberg tables: it merges small files, adapts to incremental data changes, collects query metrics, builds indexes or pre‑sorts hot fields, and maintains table statistics, so users can query efficiently without worrying about file layout.
He concludes that such automation is vital for data lake governance; without it, issues like small‑file proliferation in Iceberg or Hudi degrade write and read performance, making system‑level optimization a core governance focus.
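To illustrate the small‑file problem mentioned above, here is a minimal sketch of the planning step behind such compaction: greedily bin‑packing small data files into merge groups of roughly a target size. This is not Iceberg's actual implementation (the function name, thresholds, and file list are assumptions for illustration); in practice one would invoke Iceberg's built‑in `rewrite_data_files` maintenance action rather than hand‑roll this logic.

```python
def plan_compaction(file_sizes, target_bytes=128 * 1024 * 1024,
                    small_threshold=None):
    """Greedily group small data files into merge groups, mimicking
    (in spirit) the planning step of a small-file compaction job.

    `file_sizes` is a list of (path, size_bytes) tuples; each returned
    group of paths would become one rewrite task producing one file
    of roughly `target_bytes`.
    """
    if small_threshold is None:
        # Only files well under the target size are worth rewriting.
        small_threshold = target_bytes // 2

    small = [(p, s) for p, s in file_sizes if s < small_threshold]

    groups, current, current_size = [], [], 0
    for path, size in sorted(small, key=lambda x: x[1]):
        if current and current_size + size > target_bytes:
            groups.append(current)
            current, current_size = [], 0
        current.append(path)
        current_size += size
    if current:
        groups.append(current)
    return groups


MB = 1024 * 1024
files = [("a.parquet", 10 * MB), ("b.parquet", 20 * MB),
         ("c.parquet", 200 * MB), ("d.parquet", 30 * MB)]

# The large file is left alone; the three small ones merge into one group.
print(plan_compaction(files, target_bytes=64 * MB))
```

Running this continuously in the background, as the speaker describes, is what keeps write‑heavy Iceberg or Hudi tables from accumulating thousands of tiny files that slow down both reads and metadata operations.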
Finally, the discussion notes that DataFun is launching a Data Lake Practical Workshop that provides detailed guidance on data lake governance, inviting readers to scan the QR code for more information.
DataFunSummit
Official account of the DataFun community, dedicated to sharing big data and AI industry summit news and speaker talks, with regular downloadable resource packs.