Real-Time Data Verification: Building a Log Comparison Solution with Flink, Elasticsearch, and Hive
This article explains how to design and implement a real‑time data verification framework: Flink generates wide tables, and the detailed records are stored either in Elasticsearch or in HDFS (queried through Hive) so they can be cross‑checked against offline data, giving dashboards and stakeholders metrics they can trust.
For real‑time product or development teams, ensuring that displayed metrics such as PV, UV, and GMV are accurate is a common challenge; this article introduces a comprehensive real‑time data verification solution that helps convince managers and peers of data correctness.
1. Background
Practitioners often wonder whether the statistics they compute in real time are reliable, especially when different teams report conflicting numbers for the same KPI.
2. Real‑time Data Calculation Flow
The typical pipeline receives logs or messages from Kafka, processes them with Flink, and writes the final results to Redis for dashboards and large‑screen displays.
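However, Redis holds only the final aggregated values, so a metric read from it cannot be traced back to the records that produced it. Its correctness therefore remains uncertain, which is why a verifiable log‑comparison approach is needed.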
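To illustrate the gap, here is a minimal plain-Python sketch (not Flink code) of the kind of aggregation such a pipeline performs. The `Event` type, its fields, and the sample values are hypothetical and only stand in for real log records.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Event:
    user_id: str
    kind: str            # "view" or "order" (hypothetical event types)
    amount: float = 0.0  # order value; only meaningful for "order" events


def aggregate(events):
    """Collapse detailed events into the summary a dashboard would read."""
    return {
        "pv": sum(1 for e in events if e.kind == "view"),
        "uv": len({e.user_id for e in events}),
        "gmv": sum(e.amount for e in events if e.kind == "order"),
    }


events = [
    Event("u1", "view"),
    Event("u2", "view"),
    Event("u1", "view"),
    Event("u1", "order", 100.0),
    Event("u2", "order", 25.5),
]

summary = aggregate(events)
print(summary)
```

Once only `summary` is kept, nothing in it says which orders contributed to `gmv`, so a disagreement with another team's number cannot be explained from the summary alone.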
3. Log Verification Solution
Using the example of differing GMV figures from real‑time and offline sources, the article proposes storing the detailed wide‑table data generated by Flink in two alternative sinks:
Write the wide‑table to Elasticsearch, enabling fine‑grained queries for cross‑checking.
Write the wide‑table to HDFS and query it via Hive, which offers lower learning cost and easy SQL‑based comparison with offline tables.
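As a rough sketch of what the two sinks receive, the snippet below renders one wide‑table row both as an Elasticsearch bulk‑API body and as a delimited text line that a Hive text table could read. The column names, the `orders_wide` index name, and the sample row are assumptions made for illustration; a real job would emit these through Flink's sink connectors rather than hand‑built strings.

```python
import json

# Hypothetical wide-table row; the article does not specify the schema.
wide_row = {
    "order_id": "o-1001",
    "user_id": "u-42",
    "product": "iPhone X",
    "amount": 999.0,
    "is_presale": 1,  # 1 = pre-sale order, 0 = regular order
    "event_time": "2018-11-11T00:00:05",
}

HIVE_COLUMNS = ["order_id", "user_id", "product", "amount", "is_presale", "event_time"]


def to_es_bulk(rows, index):
    """Render rows as bulk-API NDJSON: one action line, then one document line."""
    lines = []
    for row in rows:
        lines.append(json.dumps({"index": {"_index": index, "_id": row["order_id"]}}))
        lines.append(json.dumps(row))
    return "\n".join(lines) + "\n"  # a bulk request body must end with a newline


def to_hive_line(row, sep="\x01"):
    # Ctrl-A (\x01) is the default field delimiter for Hive text tables.
    return sep.join(str(row[col]) for col in HIVE_COLUMNS)


bulk_body = to_es_bulk([wide_row], "orders_wide")
hive_line = to_hive_line(wide_row)
print(bulk_body)
print(hive_line.replace("\x01", "|"))  # show the delimiter visibly
```

Using the order ID as the document `_id` means a replayed record overwrites the earlier copy instead of duplicating it, which keeps later comparisons from being skewed by retries.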
Storing detailed records in Elasticsearch or HDFS allows developers to compare at the level of individual transactions (e.g., the number of iPhone X purchases) against other data providers, quickly identifying discrepancies such as differing order counts or missing pre‑sale orders.
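A comparison of this kind can be sketched in plain Python once both sides expose detailed rows. The row layout, function names, and sample data below are hypothetical; they simply show per‑product counts being diffed and the missing order IDs being pulled out.

```python
from collections import Counter


def count_by_product(rows):
    """Count orders per product."""
    return Counter(r["product"] for r in rows)


def diff_counts(realtime_rows, other_rows):
    """Return {product: (realtime_count, other_count)} where the counts differ."""
    rt = count_by_product(realtime_rows)
    other = count_by_product(other_rows)
    return {
        p: (rt[p], other[p])  # a Counter returns 0 for a missing product
        for p in sorted(set(rt) | set(other))
        if rt[p] != other[p]
    }


def missing_order_ids(realtime_rows, other_rows):
    """Order IDs present in the real-time detail but absent from the other source."""
    return sorted({r["order_id"] for r in realtime_rows} - {r["order_id"] for r in other_rows})


realtime = [
    {"order_id": "o1", "product": "iPhone X", "is_presale": 0},
    {"order_id": "o2", "product": "iPhone X", "is_presale": 0},
    {"order_id": "o3", "product": "iPhone X", "is_presale": 1},
    {"order_id": "o4", "product": "Case", "is_presale": 0},
]
other = [r for r in realtime if r["is_presale"] == 0]  # the other source drops pre-sale orders

print(diff_counts(realtime, other))
print(missing_order_ids(realtime, other))
```

Here the count diff points at the product whose numbers disagree, and the missing IDs show that the gap consists of pre‑sale orders.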
Using Hive/HDFS requires only basic SQL knowledge, avoiding the complexity of Elasticsearch aggregations.
Hive tables can be joined with offline datasets, making it straightforward to spot data gaps or mismatches.
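The join itself is ordinary SQL. The sketch below uses Python's built‑in `sqlite3` module as a local stand‑in for Hive, so the query shape can be tried without a cluster; the table names `rt_orders_wide` and `offline_orders`, their columns, and the rows are all assumptions. A LEFT JOIN with an IS NULL filter is standard enough that it should carry over to Hive with at most minor changes.

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # sqlite3 stands in for Hive here
conn.executescript("""
CREATE TABLE rt_orders_wide (order_id TEXT PRIMARY KEY, amount REAL, is_presale INTEGER);
CREATE TABLE offline_orders (order_id TEXT PRIMARY KEY, amount REAL);
INSERT INTO rt_orders_wide VALUES ('o1', 100.0, 0), ('o2', 50.0, 0), ('o3', 25.0, 1);
INSERT INTO offline_orders VALUES ('o1', 100.0), ('o2', 50.0);
""")

# Orders that the real-time wide table has but the offline table lacks.
MISSING_IN_OFFLINE = """
SELECT rt.order_id, rt.amount, rt.is_presale
FROM rt_orders_wide rt
LEFT JOIN offline_orders o ON rt.order_id = o.order_id
WHERE o.order_id IS NULL
ORDER BY rt.order_id
"""

# Difference between the two GMV totals.
GMV_GAP = """
SELECT (SELECT SUM(amount) FROM rt_orders_wide)
     - (SELECT SUM(amount) FROM offline_orders)
"""

missing = conn.execute(MISSING_IN_OFFLINE).fetchall()
gmv_gap = conn.execute(GMV_GAP).fetchone()[0]
print(missing)
print(gmv_gap)
```

In this toy data the single missing row is flagged `is_presale = 1` and its amount equals the GMV gap, which is the kind of evidence that turns "the numbers differ" into "the offline table excludes pre‑sale orders".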
In the presented case, the developer discovered that the offline team had excluded pre‑sale orders, corrected the report, and received commendation from management.
4. Summary
Real‑time analytics provide up‑to‑date metrics, but their credibility hinges on the ability to validate results against other data sources. Retaining detailed transaction data in Elasticsearch or Hive enables precise, record‑level comparisons (plain SQL in the Hive case) that build confidence in the numbers presented to stakeholders.
Big Data Technology & Architecture
Wang Zhiwu, a big data expert, dedicated to sharing big data technology.
