Kafka Real-time Data Archiving to Hive: Flink SQL and DataStream Implementation Solutions
The article explains how to archive Kafka real-time data to Hive using either Flink SQL, which makes it quick to create partitioned ORC tables but requires careful timezone handling at partition commit, or the Flink DataStream API for more complex pipelines, and offers best-practice guidance on data quality, system complexity, security, and performance.
This article discusses how to archive Kafka real-time data to Hive for historical data analysis and troubleshooting. Kafka retains Topic data only for a limited window (e.g., 24/36/48 hours), so once data expires, issues become hard to trace unless the data has been archived.
Solution 1: Flink SQL Writing to Hive
The implementation has three steps: constructing a HiveCatalog, creating the target Hive tables, and streaming real-time data into them. The code demonstrates setting up the Flink execution environment, registering the HiveCatalog, creating partitioned tables in ORC format, and enabling auto-compaction to mitigate the small-file problem. A key pitfall is timezone handling in PartitionTimeCommitTrigger when using EventTime: the trigger interprets the extracted partition time as UTC rather than the configured timezone (e.g., GMT+8), so the watermark never appears to pass the partition time and partition commits fail or lag by the offset. The fix is a custom TimeZoneTableFunction that applies the correct timezone conversion.
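As a sketch of the Flink SQL route (database, table, and column names are illustrative, and the table properties assume the Flink Hive connector's partition-commit options), the DDL and the continuous insert might look like:

```sql
-- Illustrative only: ods.kafka_event_archive and its columns are hypothetical.
SET table.sql-dialect=hive;

CREATE TABLE ods.kafka_event_archive (
  event_id STRING,
  payload  STRING
) PARTITIONED BY (dt STRING, hr STRING)
STORED AS ORC
TBLPROPERTIES (
  -- map the partition columns back to a timestamp for commit triggering
  'partition.time-extractor.timestamp-pattern' = '$dt $hr:00:00',
  'sink.partition-commit.trigger'              = 'partition-time',
  'sink.partition-commit.delay'                = '1 h',
  'sink.partition-commit.policy.kind'          = 'metastore,success-file',
  -- merge the small files produced by streaming writes
  'auto-compaction'                            = 'true'
);

SET table.sql-dialect=default;

-- Continuously archive from a Kafka-backed source table (assumed to exist).
INSERT INTO ods.kafka_event_archive
SELECT event_id,
       payload,
       DATE_FORMAT(event_time, 'yyyy-MM-dd'),
       DATE_FORMAT(event_time, 'HH')
FROM kafka_source;
```

With `sink.partition-commit.trigger` set to `partition-time`, a partition is committed once the watermark passes the extracted partition time plus the delay, which is exactly where the timezone interpretation matters.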
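The timezone pitfall can be seen with plain date arithmetic, independent of Flink. The sketch below (the partition value is made up) shows why a partition time written from GMT+8 event time but parsed as UTC keeps the commit waiting an extra 8 hours:

```python
from datetime import datetime, timezone, timedelta

# Hypothetical partition value derived from GMT+8 event time.
partition_value = "2024-06-01 12:00:00"
naive = datetime.strptime(partition_value, "%Y-%m-%d %H:%M:%S")

# What the trigger effectively does: treat the value as UTC.
as_utc = naive.replace(tzinfo=timezone.utc)
# What the pipeline intended: the value is GMT+8 local time.
as_gmt8 = naive.replace(tzinfo=timezone(timedelta(hours=8)))

# The UTC reading refers to an instant 8 hours later, so the watermark
# must advance 8 extra hours before the partition can commit.
lag_hours = (as_utc - as_gmt8).total_seconds() / 3600
print(lag_hours)  # 8.0
```

If the commit delay plus this 8-hour skew exceeds the job's expectations, partitions appear to never commit, which matches the failure mode described above.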
Solution 2: Flink DataStream Writing to Hive
This approach is suitable for complex business scenarios. The flow includes: consuming Kafka Topic data, preprocessing with MapReduce tasks, storing to HDFS, and loading to Hive tables. Implementation steps involve using FlinkKafkaConsumer to read data, BucketingSink to write to HDFS with minute-level partitioning, generating preprocessing strategies for 5-minute data windows, and using Hive LOAD DATA to import processed files.
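The final load step from the preprocessed HDFS files might look like the following HiveQL (the path, table name, and partition values are illustrative):

```sql
-- Illustrative: move preprocessed files from their HDFS staging
-- directory into the target Hive partition.
LOAD DATA INPATH '/warehouse/staging/events/dt=2024-06-01/hr=12'
INTO TABLE ods.kafka_event_archive
PARTITION (dt = '2024-06-01', hr = '12');
```

`LOAD DATA INPATH` moves (rather than copies) the files into the partition's warehouse directory, so it is cheap even for large preprocessed batches.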
Choosing Between Flink SQL and DataStream
Flink SQL offers faster development and better maintainability for standard data processing tasks, while Flink DataStream provides more customization for complex requirements. The choice depends on specific business needs.
Best Practices Summary:
Data Quality: Monitor and optimize to avoid data duplication, loss, or errors
System Complexity: Manage multiple components including Kafka, Flink, and Hive
Security: Implement proper Hive permission controls
Performance: Optimize data processing and query engines
vivo Internet Technology