
Accelerating Hive Daily Tables with Flink: A SmartNews Case Study

This article describes how SmartNews integrated Flink into its Airflow‑driven Hive batch pipeline to cut the actions table generation latency from three hours to about thirty‑four minutes, detailing the technical challenges, design decisions, and production results.

DataFunTalk

Jul 26, 2021

SmartNews operates a large data stack based on Airflow, Hive, S3 and EMR, and the growing volume of daily Hive tables caused processing times to exceed three hours, impacting downstream users such as data scientists and product managers.

To address this, the Speedy Batch project targeted the actions table, which is generated each day from mobile app logs and serves as the source for many other tables. The goal was to reduce the actions table latency from three hours to thirty minutes while keeping the change completely transparent to downstream users and without degrading performance.

Initial attempts to speed up the existing Hive job, either by adding resources or by pre‑aggregating hourly, ran into S3 IOPS limits and still left a 2.5‑hour latency. The team therefore switched to a streaming approach using Flink, a framework in which it already had strong internal expertise and whose Hive integration had recently improved.

Key technical challenges included:

Maintaining the RCFile format required by the existing Hive table, which meant Flink had to output RCFile files in‑place.

Preventing a proliferation of small files across checkpoints, which would hurt read performance.

Providing downstream jobs with a reliable signal that a partition is fully ready, despite the dynamic and numerous action partitions.

Ensuring exactly‑once processing despite using S3 event notifications (at‑least‑once) and multiple Flink jobs.
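The article does not say how duplicates were eliminated, so the following is only a minimal sketch of one common way to turn at‑least‑once notifications into effectively‑once processing: remember the identity of every object already seen and drop repeats. The `S3Event` fields and the `NotificationDeduplicator` class are hypothetical names used for illustration, not part of the SmartNews code or of any Flink API.

```python
from dataclasses import dataclass, field
from typing import Iterable, Iterator, Set, Tuple


@dataclass(frozen=True)
class S3Event:
    """Hypothetical, simplified view of an S3 'object created' notification."""
    bucket: str
    key: str
    etag: str


@dataclass
class NotificationDeduplicator:
    """Drops notifications for objects that have already been seen.

    Assumption: in a real streaming job this set would live in checkpointed
    state so that it survives restarts; here it is a plain in-memory set.
    """
    _seen: Set[Tuple[str, str, str]] = field(default_factory=set)

    def filter_new(self, events: Iterable[S3Event]) -> Iterator[S3Event]:
        for event in events:
            identity = (event.bucket, event.key, event.etag)
            if identity in self._seen:
                continue  # duplicate delivery of the same notification
            self._seen.add(identity)
            yield event
```

In a sketch like this, deduplication only helps if the set of seen identities is restored together with the job's other state after a failure; otherwise a restart could reprocess files that were already written.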

The final solution consists of two Flink jobs. The first job streams raw JSON logs from S3 (detected via S3 event notifications forwarded to Kinesis) and writes them in row‑format JSON using a rolling‑policy‑controlled sink, leveraging S3 multipart upload (MPU) to create large parts that are later merged on the S3 side. The second job watches for completed JSON files, converts them to RCFile format, and writes the results to the Hive location.
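To illustrate how a rolling policy keeps the number of output files under control, here is a small model of the decision it makes: close the current file once it is large enough or has been open long enough. This is a sketch of the general idea only. The thresholds, the `RollingPolicy` class and the `plan_files` helper are assumptions made for illustration, and the real job would rely on the policy support built into Flink's file sink rather than code like this.

```python
from dataclasses import dataclass
from typing import Iterable, List, Optional, Tuple


@dataclass
class RollingPolicy:
    max_part_bytes: int      # roll once the open file has reached this size
    rollover_seconds: float  # roll once the file has been open this long

    def should_roll(self, bytes_written: int, opened_at: float, now: float) -> bool:
        too_big = bytes_written >= self.max_part_bytes
        too_old = (now - opened_at) >= self.rollover_seconds
        return too_big or too_old


def plan_files(records: Iterable[Tuple[int, float]], policy: RollingPolicy) -> List[int]:
    """Group (size_bytes, arrival_time) records into files; return file sizes."""
    files: List[int] = []
    current_bytes = 0
    opened_at: Optional[float] = None
    for size, ts in records:
        # The policy is checked before each write, so a file can end up
        # slightly larger than max_part_bytes.
        if opened_at is not None and policy.should_roll(current_bytes, opened_at, ts):
            files.append(current_bytes)
            current_bytes, opened_at = 0, None
        if opened_at is None:
            opened_at = ts  # a new file is opened by its first record
        current_bytes += size
    if opened_at is not None:
        files.append(current_bytes)  # flush the last open file
    return files
```

Raising either threshold produces fewer, larger files at the cost of data becoming visible later, which is the trade‑off behind the small‑files challenge listed above.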

To make downstream consumption transparent, a custom StreamingFileWriter emits partitionCreated and partitionInactive signals; a custom PartitionCommitter watches these signals and, when all partitions for a day are inactive, writes a _SUCCESS file at s3://hivebucket/actions/dt=2021-05-29/_SUCCESS. Airflow can then safely trigger downstream jobs.
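The bookkeeping behind such a committer can be sketched as follows. This is a simplified model, not the SmartNews implementation: the `DailySuccessTracker` class and the `action=` sub‑partition key are hypothetical, and the sketch only returns the marker path instead of writing to S3. A production committer would presumably also need to confirm that the day has actually ended before committing, which this sketch leaves out.

```python
from collections import defaultdict
from typing import Dict, Optional, Set


class DailySuccessTracker:
    """Tracks active partitions per day and reports when a day is complete."""

    def __init__(self, table_location: str) -> None:
        self.table_location = table_location.rstrip("/")
        self._active: Dict[str, Set[str]] = defaultdict(set)
        self._seen_days: Set[str] = set()
        self._committed_days: Set[str] = set()

    @staticmethod
    def _day_of(partition: str) -> str:
        # "dt=2021-05-29/action=click" -> "dt=2021-05-29"
        return partition.split("/", 1)[0]

    def partition_created(self, partition: str) -> None:
        day = self._day_of(partition)
        self._seen_days.add(day)
        self._active[day].add(partition)

    def partition_inactive(self, partition: str) -> Optional[str]:
        """Return the _SUCCESS path when the day's last partition goes inactive."""
        day = self._day_of(partition)
        self._active[day].discard(partition)
        day_done = day in self._seen_days and not self._active[day]
        if day_done and day not in self._committed_days:
            self._committed_days.add(day)  # commit each day at most once
            return f"{self.table_location}/{day}/_SUCCESS"
        return None
```

For example, after two partitions of `dt=2021-05-29` are created, the first `partition_inactive` call returns `None` and the second returns the marker path under `s3://hivebucket/actions`, which is the point at which an Airflow sensor could let downstream jobs proceed.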

Production results show the end‑to‑end latency stabilized around 34 minutes (including a 15‑minute wait for late files), with a modest 50% increase in file count that does not noticeably affect downstream performance. Users reported no regressions, confirming the transparency of the change.

Future work includes extending the same streaming‑batch hybrid pattern to other Hive tables, exploring hour‑level granularity, and investigating a unified data‑lake architecture to further converge the technology stack.
