Tagged articles
9 articles
DataFunTalk
Jan 8, 2023 · Big Data

ByteDance Event‑Tracking Data Cost Governance Practices

This article describes how ByteDance governs the cost of its massive volume of event‑tracking (埋点) data. It covers the background, cost‑reduction strategies, lessons learned, future plans, and a Q&A session, showing how systematic data governance can cut storage and processing costs.

Big Data · ByteDance · Data Governance
0 likes · 18 min read
ITPUB
Oct 15, 2022 · Big Data

Flink & Apache Hudi: Design, Practices, and Roadmap for Streaming Data Lakes

This talk introduces the evolution of data lakes and outlines Apache Hudi’s core features. It details the Flink‑Hudi integration architecture (write pipelines, small‑file handling, and read strategies), covers real‑world use cases such as near‑real‑time DB ingestion, OLAP, and ETL, and previews upcoming Hudi roadmap items.

Apache Hudi · Big Data · Data Lake
0 likes · 21 min read
Volcano Engine Developer Services
Aug 15, 2022 · Big Data

How ByteDance Scales Event Tracking: Inside a Billion‑Events‑Per‑Second Data Pipeline

This article explains how ByteDance’s event‑tracking (埋点) data flow handles billions of events per second. It uses Flink‑based real‑time ETL, dynamic rule engines, data sharding, and multi‑datacenter disaster recovery to provide stable, low‑latency, cost‑effective processing for diverse downstream services.

Big Data · Flink · Scalability
0 likes · 16 min read
Didi Tech
Jul 1, 2021 · Big Data

Full-Chain Traffic Data Detection in DiDi's Omega Platform

DiDi’s Omega platform provides an end‑to‑end traffic‑data pipeline, from SDK collection through real‑time and offline ETL to storage and analysis. A detection service measures loss, duplication, and accuracy, bringing SDK loss below 1% and adding integrity tagging and comprehensive monitoring dashboards. The article closes with a hiring call for senior data engineers.

Data Quality · Omega Platform · data pipeline
0 likes · 9 min read
Big Data Technology & Architecture
Mar 30, 2021 · Big Data

Implementing Real-Time Data Ingestion with Delta Lake on EMR: Architecture, Challenges, and Solutions

This article describes how Soul's data engineering team replaced nightly batch ETL with real-time Delta Lake ingestion on EMR. It covers the motivations, a comparison of Delta Lake, Hudi, and Iceberg, the implementation architecture, issues encountered such as data skew and schema evolution, and the solutions adopted to improve performance and reliability.

Data Lake · Data Skew · Delta Lake
0 likes · 13 min read
Big Data Technology Architecture
Mar 2, 2021 · Big Data

Implementing Real-Time Log Ingestion with Delta Lake on EMR: Architecture, Challenges, and Solutions

This article describes how a data engineering team replaced nightly batch ETL with a Delta Lake‑based real‑time log ingestion pipeline on EMR, detailing the motivations, architecture, implementation steps, encountered issues such as data skew and schema evolution, and the practical solutions they applied to achieve low‑latency, reliable data delivery.

Delta Lake · Hive · Spark
0 likes · 14 min read
Big Data Technology & Architecture
Dec 9, 2019 · Big Data

Building a Real‑Time ETL Pipeline with Apache Flink: Kafka to HDFS with Exactly‑Once Guarantees

This article explains how to develop a real‑time ETL application using Apache Flink that reads events from Kafka, partitions them by event time into HDFS directories, and achieves exactly‑once processing through checkpointing, custom bucket assigners, and proper state backend configuration.

Apache Flink · Big Data · Exactly-Once
0 likes · 11 min read
DataFunTalk
Aug 1, 2019 · Big Data

Streaming Data Platform Practices and Challenges at Beike Real Estate

This article presents an in‑depth overview of Beike's four‑layer streaming data platform, covering the foundational infrastructure, capability aggregation, data content, and output layers, as well as the challenges of metadata management, real‑time processing, and productization through the Ark and Tianyan systems.

Ark platform · Beike · Tianyan
0 likes · 14 min read