Interview with Shopee Data Engineer Deng Lin on Lakehouse Architecture and Big Data Trends
During a pre‑GIAC interview, Shopee data engineer Deng Lin discusses the evolution of data lakes and warehouses, lakehouse integration, big‑data technology choices, real‑time processing with Flink and Kafka, and offers career advice for newcomers to the big‑data field.
Ahead of the 8th GIAC Global Internet Architecture Conference in Shenzhen, High Availability Architecture interviewed Deng Lin, a senior data engineer from Shopee, about lakehouse technology and big‑data trends.
Deng Lin has worked on big-data platforms since 2012. He described building offline scheduling systems, data ingestion pipelines, and a Flink-based real-time computation platform, a period in which he witnessed Hadoop's rise and the shift toward stream-batch convergence.
He explained that a data lake expands the boundaries and variety of data compared to traditional data warehouses, complementing them by providing timeliness, diverse content, and richer business scenarios.
Traditional data-warehouse stacks focus on structured data queried through SQL and rely on databases or Hive, whereas data-lake stacks add stream processing, ACID support, and time semantics, bringing data processing closer to business needs.
A lakehouse is not a simple merger of lake and warehouse; it is a deep integration of data management and data applications that addresses pain points of classic warehouses such as insufficient timeliness, wasted storage, and missing ACID semantics.
By leveraging lake technologies, organizations can build unified stream‑batch warehouses, reduce storage costs with incremental processing semantics, and meet varied latency requirements.
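To make "incremental processing semantics" concrete, here is a minimal sketch in plain Python. It assumes a table exposes an ordered timeline of commits and that a job remembers the last commit it consumed; the `Commit` and `IncrementalSum` names are hypothetical and do not correspond to any lake component's API.

```python
from dataclasses import dataclass, field


@dataclass
class Commit:
    ts: int     # commit time; assumed to increase along the timeline
    rows: list  # rows written by this commit, as (user, amount) pairs


@dataclass
class IncrementalSum:
    """Keeps per-user totals and reads only commits newer than its checkpoint."""
    checkpoint: int = 0
    totals: dict = field(default_factory=dict)

    def pull(self, timeline):
        rows_read = 0
        for commit in timeline:
            if commit.ts <= self.checkpoint:
                continue  # already folded into totals by an earlier run
            for user, amount in commit.rows:
                self.totals[user] = self.totals.get(user, 0) + amount
                rows_read += 1
            self.checkpoint = commit.ts
        return rows_read


timeline = [Commit(1, [("a", 10), ("b", 5)]), Commit(2, [("a", 3)])]
job = IncrementalSum()
first = job.pull(timeline)               # first run reads every commit
timeline.append(Commit(3, [("b", 7)]))
second = job.pull(timeline)              # second run reads only commit 3
```

The second run touches one row instead of rescanning all four. The same job can be scheduled every few seconds or once a day, which is one way to read the "varied latency requirements" point.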
According to Deng, big-data technologies have dramatically improved data timeliness, moving reporting from hour-level to second-level latency; enriched data through change data capture (CDC); and enabled new use cases such as machine learning and complex event processing.
For technology selection, he advises matching tools to business scale and latency needs: Hadoop + Hive for GB-scale workloads, Spark for TB-scale batch processing, and Kafka + Flink for millisecond-level real-time processing. Storm, he notes, is now less active.
When real-time and batch workloads coexist, lake components such as Apache Hudi can provide transactional, indexed storage that unifies stream and batch processing; Shopee chose Hudi for its data-lake implementation.
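The value of indexed storage shared by batch and streaming writers can be illustrated with a toy upsert. This is a sketch in plain Python, not Hudi's API: the `KeyedTable` class, the `id` record key, and the `ts` ordering field are all assumptions chosen for illustration.

```python
class KeyedTable:
    """Toy keyed table: one live row per record key, located through an index."""

    def __init__(self):
        self.rows = []   # stored rows
        self.index = {}  # record key -> position in self.rows

    def upsert(self, batch, key="id", ordering="ts"):
        inserted = updated = 0
        for rec in batch:
            pos = self.index.get(rec[key])
            if pos is None:
                # Unknown key: append the row and register it in the index.
                self.index[rec[key]] = len(self.rows)
                self.rows.append(dict(rec))
                inserted += 1
            elif rec[ordering] >= self.rows[pos][ordering]:
                # Known key with a newer (or equal) version: overwrite in place.
                self.rows[pos] = dict(rec)
                updated += 1
            # Otherwise the record is a late, older version and is ignored.
        return inserted, updated


table = KeyedTable()
# A batch load and a later streaming micro-batch write to the same table.
batch_result = table.upsert([
    {"id": 1, "ts": 100, "status": "created"},
    {"id": 2, "ts": 100, "status": "created"},
])
stream_result = table.upsert([
    {"id": 1, "ts": 105, "status": "paid"},
    {"id": 1, "ts": 101, "status": "pending"},  # arrives late, older than ts=105
])
```

Because both writers go through the same key index, the table ends up with a single current row per key regardless of which path delivered the update.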
Newcomers to big data should first master the fundamentals (Java, Scala, SQL, networking, databases) and then gain hands-on experience with core components such as Spark, Flink, Kafka, and HDFS. Deeper expertise comes from studying component architectures and the theory behind them, such as the global snapshot algorithm underlying Flink's checkpointing.
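As a taste of that theory, the following is a toy model of barrier alignment, the idea behind Flink's aligned checkpoints, which derive from the Chandy-Lamport distributed snapshot algorithm. It is a simplified sketch in plain Python, not Flink code: `SummingOperator` and the `BARRIER` marker are hypothetical, and real checkpoints also persist state and forward barriers downstream.

```python
from collections import deque

BARRIER = "BARRIER"  # hypothetical marker separating pre- and post-snapshot records


class SummingOperator:
    """Sums integers from several input channels and snapshots at aligned barriers."""

    def __init__(self, num_inputs=2):
        self.num_inputs = num_inputs
        self.total = 0
        self.blocked = set()     # channels whose barrier has already arrived
        self.buffered = deque()  # items held back while waiting for alignment
        self.snapshots = []

    def receive(self, channel, item):
        if channel in self.blocked:
            # Everything after a channel's barrier belongs to the next epoch.
            self.buffered.append((channel, item))
            return
        if item == BARRIER:
            self.blocked.add(channel)
            if len(self.blocked) == self.num_inputs:
                # All barriers seen: state reflects exactly the pre-barrier records.
                self.snapshots.append(self.total)
                self.blocked.clear()
                pending, self.buffered = self.buffered, deque()
                for held_channel, held_item in pending:
                    self.receive(held_channel, held_item)  # replay held-back items
        else:
            self.total += item


op = SummingOperator()
events = [(0, 1), (1, 10), (0, BARRIER), (0, 2), (1, 20), (1, BARRIER), (1, 30)]
for channel, item in events:
    op.receive(channel, item)
```

The snapshot records 1 + 10 + 20 and excludes the value 2, which arrived on channel 0 after that channel's barrier; the held-back value is applied only once alignment completes.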
He also advises graduates to pursue long‑term interests rather than short‑term trends, continuously improve their skills on the job, and integrate domain knowledge with practical requirements.
Finally, Deng expressed enthusiasm for GIAC’s focus on infrastructure and data‑intelligence platform evolution, hoping the conference will spark new ideas for platform construction and become a successful knowledge‑sharing event.
High Availability Architecture
Official account for High Availability Architecture.
