How We Fixed Minute‑Level Data Rollbacks by Replacing Impala with Apache Doris
Facing mysterious minute‑level data rollbacks caused by Impala's metadata cache, a team migrated from a T+1 Hive‑Impala stack to Apache Doris, achieving real‑time consistency, higher performance, simplified ETL, and reduced operational complexity across their points‑based loyalty system.
Background: Architecture Evolution under Real‑time Demand
In data‑driven businesses, real‑time insight is critical. A core points‑and‑level system originally used a T+1 offline Hive workflow, processing the previous day's data at night. As the business grew, the one‑day delay could no longer meet the operations team's need for minute‑level updates.
Problem: Mysterious Data Rollback
After switching to a minute‑level schedule with Impala as the query engine, users reported that points sometimes dropped abruptly and recovered a few minutes later, eroding trust and harming decision‑making.
Investigation Process
The team traced the issue to the computation stage. The minute‑level task flow is:
Data write → REFRESH metadata → Points aggregation query
Deep analysis revealed that Impala's metadata caching mechanism does not align with high‑frequency updates, leading to inconsistent query results.
Technical Analysis
Impala caches metadata in memory to speed up queries, but this introduces three consistency risks:
Metadata perception delay: writing to HDFS and Impala seeing the update are separate steps.
REFRESH non‑atomicity: a "vacuum period" exists where metadata is partially refreshed.
Query timing risk: queries that run during the refresh may read an incomplete data view.
Scenario Simulation
Timeline:
T0: cache 10 files → query result 100 points ✓
T1: write 11th file, execute REFRESH
T1.1: clear old cache (10 files lost)
T1.2: query arrives, only partial metadata (5 files) → result 50 points ✗ (rollback)
T2: REFRESH completes, cache now contains 11 files
T3: next query → result 110 points ✓
In low‑frequency workloads this flaw is hidden, but under minute‑level updates it becomes a fatal defect for monotonic point accumulation.
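The timeline above can be replayed with a small toy model. This is only a sketch of the scenario as the article describes it, not of Impala's actual internals: the `NonAtomicCache` class, its `refresh_steps` generator, and the figure of 10 points per file are all assumptions made for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class NonAtomicCache:
    """Hypothetical metadata cache whose refresh clears first, then reloads file by file."""
    files: dict[str, int] = field(default_factory=dict)  # file name -> points stored in it

    def query_total(self) -> int:
        # A query only sees the files currently listed in the cache.
        return sum(self.files.values())

    def refresh_steps(self, storage: dict[str, int]):
        # Step 1: drop the old listing (the "vacuum period" begins).
        self.files = {}
        yield
        # Step 2: re-register files one at a time; a query may run between steps.
        for name, points in storage.items():
            self.files[name] = points
            yield


storage = {f"file_{i:02d}": 10 for i in range(1, 11)}  # 10 files, 10 points each (assumed)
cache = NonAtomicCache(dict(storage))
t0 = cache.query_total()        # T0: all 10 files cached

storage["file_11"] = 10         # T1: the 11th file is written
refresh = cache.refresh_steps(storage)
for _ in range(6):              # T1.1 clear, then 5 files re-registered
    next(refresh)
t1_2 = cache.query_total()      # T1.2: query arrives mid-refresh

for _ in refresh:               # T2: refresh runs to completion
    pass
t3 = cache.query_total()        # T3: next query

print(t0, t1_2, t3)  # 100 50 110
```

The middle value is the "rollback": nothing was lost in storage, but the query ran while the listing was half rebuilt.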
Solution: Adopt Apache Doris as Next‑Gen Real‑time Data Warehouse
Technology Selection Criteria
Data is queryable immediately after it is written.
Guarantee query consistency.
Support high‑frequency real‑time updates.
Easy to operate and scale.
After multiple rounds of research and benchmark testing, the team selected Apache Doris.
Doris Architecture Advantages
Unified metadata management: eliminates the need for REFRESH; metadata updates are atomic with data writes.
MVCC snapshot consistency: each query sees a consistent version of the data, solving the "half‑read" problem.
Native real‑time upserts: the Unique Key model supports UPSERT on primary keys, simplifying ETL.
Compute‑storage separation: Doris 3.x decouples compute and storage, providing elastic scaling, cost‑effective storage (e.g., cloud object stores), and high availability through multi‑replica storage.
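To make the snapshot and upsert ideas in the list above concrete, here is a minimal conceptual sketch. It does not reflect how Doris is implemented; `VersionedTable` and its methods are hypothetical names, and keeping a full copy per version is a simplification chosen only to keep the example short.

```python
import threading


class VersionedTable:
    """Hypothetical unique-key table: each write batch publishes a new immutable version."""

    def __init__(self):
        self._versions = [{}]          # version 0 is the empty table
        self._lock = threading.Lock()  # serializes writers

    def upsert(self, rows: dict[str, int]) -> int:
        # Build the next version off to the side, then publish it in one step,
        # so readers see either the old version or the new one, never a mix.
        with self._lock:
            new_version = dict(self._versions[-1])
            new_version.update(rows)   # existing key -> row replaced, new key -> row inserted
            self._versions.append(new_version)
            return len(self._versions) - 1

    def snapshot(self) -> dict[str, int]:
        # A query pins the latest published version for its whole lifetime.
        return self._versions[-1]


table = VersionedTable()
table.upsert({"user_1": 100, "user_2": 40})
pinned = table.snapshot()                    # a query starts here
table.upsert({"user_1": 110, "user_3": 5})   # a write batch lands while it runs

print(sum(pinned.values()))            # 140
print(sum(table.snapshot().values()))  # 155
```

The running query keeps its pinned total while the next query sees the new one; at no point does a reader observe a partially applied batch, which is the property the rollback scenario lacked.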
Cluster Deployment Plan
The solution was deployed as an independent 3FE + 3BE architecture using compute‑storage separation.
[Figure: Cluster node plan]
[Figure: Deployment diagram]
[Figure: Cluster monitoring]
Results: Significant Benefits from Architecture Upgrade
Problem Elimination and Performance Gains
Business Value
User experience improved – real‑time accurate points increase trust.
Operational efficiency optimized – minute‑level feedback enables fine‑grained actions.
Development cost reduced – ETL simplified, development productivity up 40%.
Operations complexity lowered – removed complex metadata coordination.
Total cost of ownership decreased thanks to compute‑storage separation.
Conclusion and Outlook
Key Takeaways
Technology must match the scenario: Impala + Hive suits offline analysis, not high‑frequency updates.
Metadata consistency issues can stay hidden in low‑frequency workloads.
Smooth migration relies on thorough validation mechanisms.
Compute‑storage separation delivers elasticity and cost benefits.
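The article does not describe the validation mechanisms it relied on. As one possible shape for such checks, the sketch below flags drops in a series that should only grow and diffs per-user totals between the old and new systems. The function names, sample timestamps, and user IDs are all hypothetical.

```python
def find_rollbacks(samples: list[tuple[str, int]]) -> list[tuple[str, int, int]]:
    """Return (timestamp, previous, current) wherever a monotonic value dropped."""
    drops = []
    for (_, prev), (ts, cur) in zip(samples, samples[1:]):
        if cur < prev:
            drops.append((ts, prev, cur))
    return drops


def diff_totals(old: dict[str, int], new: dict[str, int]) -> dict[str, tuple]:
    """Return users whose totals differ between two systems (None means missing)."""
    return {
        user: (old.get(user), new.get(user))
        for user in sorted(old.keys() | new.keys())
        if old.get(user) != new.get(user)
    }


# Successive query results for one user, sampled once a minute (made-up values).
samples = [("10:00", 100), ("10:01", 50), ("10:02", 110)]
rollbacks = find_rollbacks(samples)
print(rollbacks)  # [('10:01', 100, 50)]

# Totals read from the old stack and the new one at the same cut-off (made-up values).
old_totals = {"user_1": 110, "user_2": 40}
new_totals = {"user_1": 110, "user_2": 45, "user_3": 5}
mismatches = diff_totals(old_totals, new_totals)
print(mismatches)  # {'user_2': (40, 45), 'user_3': (None, 5)}
```

Running both checks side by side during a parallel-run period would surface both transient rollbacks and persistent discrepancies before cut-over.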
Future Plans
Expand Doris to more real‑time business scenarios.
Deeply optimize performance using Doris advanced features.
Standardize best‑practice guidelines for real‑time data‑warehouse construction.
Explore new capabilities such as vectorized engine and lake‑house integration.
The upgrade shifted the data service model from "data is available" to "data is reliable in real time", providing a solid foundation for continuous business innovation.
WeiLi Technology Team
Practicing data-driven principles and believing technology can change the world.
