Tagged articles

real-time ETL

9 articles · Page 1 of 1

Jan 8, 2023 · Big Data

ByteDance Event‑Tracking Data Cost Governance Practices

This article describes ByteDance's comprehensive approach to managing the massive volume of event‑tracking (埋点) data, detailing the background, cost‑reduction strategies, experience review, future plans, and a Q&A session that together illustrate how systematic data governance can dramatically cut storage and processing expenses.

Big DataByteDanceData Governance

0 likes · 18 min read

ByteDance Event‑Tracking Data Cost Governance Practices

ITPUB

Oct 15, 2022 · Big Data

Flink & Apache Hudi: Design, Practices, and Roadmap for Streaming Data Lakes

This talk introduces the evolution of data lakes, outlines Apache Hudi’s core features, details the Flink‑Hudi integration architecture—including write pipelines, small‑file handling, and read strategies—covers real‑world use cases such as near‑real‑time DB ingestion, OLAP, and ETL, and previews upcoming Hudi roadmap items.

Apache HudiBig DataData Lake

0 likes · 21 min read

Flink & Apache Hudi: Design, Practices, and Roadmap for Streaming Data Lakes

Volcano Engine Developer Services

Aug 15, 2022 · Big Data

How ByteDance Scales Event Tracking: Inside a Billion‑Events‑Per‑Second Data Pipeline

This article explains how ByteDance’s event‑tracking (埋点) data flow handles billions of events per second using Flink‑based real‑time ETL, dynamic rule engines, data sharding, and multi‑datacenter disaster‑recovery to ensure stability, low latency, and cost‑effective processing for diverse downstream services.

Big DataFlinkdata pipeline

0 likes · 16 min read

How ByteDance Scales Event Tracking: Inside a Billion‑Events‑Per‑Second Data Pipeline

Didi Tech

Jul 1, 2021 · Big Data

Full-Chain Traffic Data Detection in DiDi's Omega Platform

DiDi’s Omega platform provides an end‑to‑end traffic‑data pipeline—from SDK collection through real‑time and offline ETL to storage and analysis—augmented by a detection service that measures loss, duplication and accuracy, achieving sub‑1% SDK loss, integrity tagging, comprehensive monitoring dashboards, and includes a senior data‑engineer hiring call.

Data QualityOmega Platformdata pipeline

0 likes · 9 min read

Full-Chain Traffic Data Detection in DiDi's Omega Platform

Big Data Technology & Architecture

Mar 30, 2021 · Big Data

Implementing Real-Time Data Ingestion with Delta Lake on EMR: Architecture, Challenges, and Solutions

This article describes how Soul's data engineering team replaced nightly batch ETL with real-time Delta Lake ingestion on EMR, detailing the motivations, comparative analysis of Delta, Hudi, Iceberg, the implementation architecture, encountered issues such as data skew and schema evolution, and the solutions adopted to improve performance and reliability.

Data LakeData SkewDelta Lake

0 likes · 13 min read

Implementing Real-Time Data Ingestion with Delta Lake on EMR: Architecture, Challenges, and Solutions

Big Data Technology Architecture

Mar 2, 2021 · Big Data

Implementing Real-Time Log Ingestion with Delta Lake on EMR: Architecture, Challenges, and Solutions

This article describes how a data engineering team replaced nightly batch ETL with a Delta Lake‑based real‑time log ingestion pipeline on EMR, detailing the motivations, architecture, implementation steps, encountered issues such as data skew and schema evolution, and the practical solutions they applied to achieve low‑latency, reliable data delivery.

Delta LakeHiveSpark

0 likes · 14 min read

Implementing Real-Time Log Ingestion with Delta Lake on EMR: Architecture, Challenges, and Solutions

Big Data Technology & Architecture

Dec 9, 2019 · Big Data

Building a Real‑Time ETL Pipeline with Apache Flink: Kafka to HDFS with Exactly‑Once Guarantees

This article explains how to develop a real‑time ETL application using Apache Flink that reads events from Kafka, partitions them by event time into HDFS directories, and achieves exactly‑once processing through checkpointing, custom bucket assigners, and proper state backend configuration.

Apache FlinkBig DataExactly-once

0 likes · 11 min read

Building a Real‑Time ETL Pipeline with Apache Flink: Kafka to HDFS with Exactly‑Once Guarantees

Big Data Technology & Architecture

Sep 19, 2019 · Big Data

Building a Real‑Time ETL Pipeline with Apache Flink and Ensuring Exactly‑once Semantics

This article demonstrates how to develop a real‑time ETL job using Apache Flink, covering project setup, Kafka as a source, custom bucket assigners for HDFS, checkpointing, savepoints, and deployment on YARN to achieve exactly‑once processing guarantees.

Apache FlinkBig DataExactly-once

0 likes · 11 min read

Building a Real‑Time ETL Pipeline with Apache Flink and Ensuring Exactly‑once Semantics

DataFunTalk

Aug 1, 2019 · Big Data

Streaming Data Platform Practices and Challenges at Beike Real Estate

This article presents an in‑depth overview of Beike's four‑layer streaming data platform, covering the foundational infrastructure, capability aggregation, data content, and output layers, as well as the challenges of metadata management, real‑time processing, and productization through the Ark and Tianyan systems.

Ark platformBeikeTianyan

0 likes · 14 min read

Streaming Data Platform Practices and Challenges at Beike Real Estate