Real-Time Data Verification: Building a Log Comparison Solution with Flink, Elasticsearch, and Hive

This article explains how to design and implement a real-time data verification framework: Flink generates wide tables, the detailed records are stored in Elasticsearch or in HDFS (queried through Hive), and those records are cross-checked against offline data so that the metrics shown on dashboards and reported to stakeholders can be trusted.

Big Data Technology & Architecture

Oct 22, 2019

Teams that build real-time products often struggle to show that displayed metrics such as PV (page views), UV (unique visitors), and GMV (gross merchandise volume) are accurate. This article introduces a real-time data verification solution that gives managers and peers evidence that the numbers are correct.

1. Background

Practitioners often wonder whether the statistics they compute in real time are reliable, especially when different teams report conflicting numbers for the same KPI.

2. Real‑time Data Calculation Flow

The typical pipeline receives logs or messages from Kafka, processes them with Flink, and writes the final results to Redis for dashboards and large‑screen displays.
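As a rough illustration of what this pipeline keeps, the sketch below folds a handful of events into PV, UV, and GMV totals. It is plain Python rather than Flink code, the event fields (`type`, `user_id`, `amount_cents`) are hypothetical, and the returned dictionary only stands in for the values a real job would write to Redis.

```python
def aggregate(events):
    """Fold a sequence of events into PV, UV, and GMV totals."""
    pv = 0
    gmv = 0
    visitors = set()
    for event in events:
        if event["type"] == "page_view":
            pv += 1
            visitors.add(event["user_id"])
        elif event["type"] == "order":
            gmv += event["amount_cents"]
    return {"pv": pv, "uv": len(visitors), "gmv": gmv}

# Made-up sample events; field names are assumptions for illustration.
events = [
    {"type": "page_view", "user_id": "u1"},
    {"type": "page_view", "user_id": "u1"},
    {"type": "page_view", "user_id": "u2"},
    {"type": "order", "user_id": "u1", "amount_cents": 799900},
]

dashboard = aggregate(events)  # stand-in for the aggregates stored in Redis
print(dashboard)
```

Only the three totals survive this step. The individual events are gone, so a disputed number cannot be traced back to the records that produced it.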

However, Redis holds only the final results, so there is no direct way to confirm that the stored metrics are correct. This gap is what motivates a verifiable log-comparison approach.

3. Log Verification Solution

Take the case where the GMV figures from the real-time and offline sources differ. The solution is to persist the detailed wide-table data that Flink generates in one of two alternative sinks:

Write the wide‑table to Elasticsearch, enabling fine‑grained queries for cross‑checking.

Write the wide‑table to HDFS and query it via Hive, which offers lower learning cost and easy SQL‑based comparison with offline tables.
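A minimal sketch of what one wide-table record might look like before it reaches either sink is shown here. It is plain Python, not Flink code; the field names, the `PRODUCTS` lookup, and the helper functions are hypothetical, and an in-memory buffer stands in for the Elasticsearch or HDFS sink.

```python
import io
import json

# Hypothetical product dimension; a real job would use a lookup or a join.
PRODUCTS = {"p100": {"product_name": "iPhone X", "category": "phones"}}

def to_wide_row(order, products):
    """Flatten an order event plus its product attributes into one record."""
    dim = products.get(order["product_id"], {})
    return {
        "order_id": order["order_id"],
        "user_id": order["user_id"],
        "product_id": order["product_id"],
        "product_name": dim.get("product_name"),
        "category": dim.get("category"),
        "amount_cents": order["amount_cents"],
        "is_presale": order.get("is_presale", False),
        "event_time": order["event_time"],
    }

def write_detail(rows, sink):
    """Write one JSON document per line to a file-like sink."""
    for row in rows:
        sink.write(json.dumps(row, sort_keys=True) + "\n")

# Made-up sample order for illustration.
orders = [{"order_id": "o2", "user_id": "u1", "product_id": "p100",
           "amount_cents": 799900, "is_presale": True,
           "event_time": "2019-10-22T10:00:00"}]

sink = io.StringIO()  # stand-in for the Elasticsearch or HDFS sink
write_detail((to_wide_row(o, PRODUCTS) for o in orders), sink)
print(sink.getvalue(), end="")
```

Keeping attributes such as `is_presale` on every row is what later makes it possible to explain a mismatch rather than merely detect it.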

Storing detailed records in Elasticsearch or HDFS lets developers compare results at the transaction level (for example, the number of iPhone X purchases) against the figures from other data providers, and quickly identify discrepancies such as differing order counts or missing pre-sale orders.
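For the Elasticsearch option, a check such as "how many iPhone X orders were recorded on a given day, and for how much" could be expressed as a search body like the one sketched below. The field names are hypothetical, and the sketch assumes `product_name` is mapped as a keyword field and `event_time` as a date field. It only builds the request body and does not contact a cluster.

```python
import json

def build_check_query(product_name, day):
    """Build a search body that counts orders and sums GMV for one product and day."""
    return {
        "size": 0,  # return aggregations only, no individual hits
        "query": {
            "bool": {
                "filter": [
                    {"term": {"product_name": product_name}},
                    # Date math: from the start of `day` up to one day later.
                    {"range": {"event_time": {"gte": day, "lt": day + "||+1d"}}},
                ]
            }
        },
        "aggs": {
            "order_count": {"value_count": {"field": "order_id"}},
            "gmv": {"sum": {"field": "amount_cents"}},
        },
    }

body = build_check_query("iPhone X", "2019-10-22")
print(json.dumps(body, indent=2))
```

The nesting of filters and aggregations hints at why this route has a steeper learning curve than plain SQL.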

Using Hive/HDFS requires only basic SQL knowledge, avoiding the complexity of Elasticsearch aggregations.

Hive tables can be joined with offline datasets, making it straightforward to spot data gaps or mismatches.
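The kind of join described here can be sketched with an in-memory SQLite database standing in for Hive. The table names, columns, and rows are made up for illustration; against Hive the same `LEFT JOIN ... IS NULL` pattern would run on the real-time detail table and the offline table.

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # SQLite stands in for Hive in this sketch
conn.executescript("""
CREATE TABLE realtime_orders (
    order_id TEXT, product_name TEXT, amount_cents INTEGER, is_presale INTEGER);
CREATE TABLE offline_orders (
    order_id TEXT, product_name TEXT, amount_cents INTEGER);

-- Made-up sample rows: the offline table lacks the pre-sale order o2.
INSERT INTO realtime_orders VALUES
    ('o1', 'iPhone X', 799900, 0),
    ('o2', 'iPhone X', 799900, 1),
    ('o3', 'iPhone X', 799900, 0);
INSERT INTO offline_orders VALUES
    ('o1', 'iPhone X', 799900),
    ('o3', 'iPhone X', 799900);
""")

# Orders present in the real-time detail table but absent from the offline table.
missing = conn.execute("""
    SELECT r.order_id, r.is_presale
    FROM realtime_orders r
    LEFT JOIN offline_orders o ON r.order_id = o.order_id
    WHERE o.order_id IS NULL
    ORDER BY r.order_id
""").fetchall()

print(missing)  # [('o2', 1)]
```

In this sample the only unmatched row carries `is_presale = 1`, which points directly at the kind of gap the comparison is meant to reveal.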

In the case presented, the developer found that the offline team had excluded pre-sale orders. Once the report was corrected, the developer was commended by management.

4. Summary

Real-time analytics provide up-to-date metrics, but their credibility depends on being able to validate the results against other data sources. Retaining detailed transaction data in Elasticsearch or Hive makes precise, record-level comparisons possible (SQL-driven in the Hive case) and builds confidence in the numbers presented to stakeholders.

Written by

Big Data Technology & Architecture

Wang Zhiwu, a big data expert, dedicated to sharing big data technology.
