How We Fixed Minute‑Level Data Rollbacks by Replacing Impala with Apache Doris

Facing mysterious minute‑level data rollbacks caused by Impala's metadata cache, a team migrated from a T+1 Hive‑Impala stack to Apache Doris, achieving real‑time consistency, higher performance, simplified ETL, and reduced operational complexity across their points‑based loyalty system.

WeiLi Technology Team

Jan 28, 2026

Background: Architecture Evolution under Real‑time Demand

In data‑driven businesses, real‑time insight is critical. The team's core points‑and‑level system originally ran on a T+1 offline Hive workflow that processed the previous day's data overnight. As the business grew, the one‑day delay could no longer meet the operations team's need for minute‑level updates.

Problem: Mysterious Data Rollback

After switching to a minute‑level schedule with Impala as the query engine, users reported that points sometimes dropped abruptly and recovered a few minutes later, eroding trust and harming decision‑making.

Investigation Process

The team traced the issue to the computation stage. The minute‑level task flow is:

Data write → REFRESH metadata → Points aggregation query

Further analysis showed that Impala's metadata caching mechanism is a poor fit for high‑frequency updates: under minute‑level writes it can return inconsistent query results.

Technical Analysis

Impala caches metadata in memory to speed up queries, but this introduces three consistency risks:

Metadata visibility delay: writing files to HDFS and Impala becoming aware of them are separate steps.

REFRESH non‑atomicity: during a refresh there is a "vacuum period" in which the cached metadata is only partially rebuilt.

Query timing risk: a query that runs during the refresh may read an incomplete view of the data.

Scenario Simulation

Timeline:
T0: cache 10 files → query result 100 points ✓
T1: write 11th file, execute REFRESH
T1.1: clear old cache (10 files lost)
T1.2: query arrives, only partial metadata (5 files) → result 50 points ✗ (rollback)
T2: REFRESH completes, cache now contains 11 files
T3: next query → result 110 points ✓
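
The timeline above can be reproduced with a small toy model. This is only a sketch of the behavior the team describes, not Impala's actual implementation: the `NonAtomicCache` class, its methods, and the figure of 10 points per file (implied by the timeline) are assumptions made for illustration.

```python
from dataclasses import dataclass, field

POINTS_PER_FILE = 10  # assumption: the timeline implies 10 points per data file


@dataclass
class NonAtomicCache:
    """Toy metadata cache whose refresh clears entries, then reloads them one by one."""
    files: list = field(default_factory=list)

    def query_points(self):
        # A query only sees the files currently listed in the cache.
        return len(self.files) * POINTS_PER_FILE

    def refresh_steps(self, files_on_disk):
        """Generator that pauses after each partial step so a query can interleave."""
        self.files = []                 # T1.1: old cache entries are dropped
        yield
        for name in files_on_disk:      # metadata is reloaded file by file
            self.files.append(name)
            yield


def run_timeline():
    disk = [f"file_{i}" for i in range(1, 11)]
    cache = NonAtomicCache(files=list(disk))
    results = {"T0": cache.query_points()}

    disk.append("file_11")                   # T1: the 11th file is written
    refresh = cache.refresh_steps(disk)
    for _ in range(6):                       # cache cleared, then 5 files reloaded
        next(refresh)
    results["T1.2"] = cache.query_points()   # the query lands mid-refresh

    for _ in refresh:                        # T2: the refresh runs to completion
        pass
    results["T3"] = cache.query_points()
    return results


print(run_timeline())  # {'T0': 100, 'T1.2': 50, 'T3': 110}
```

The dip at T1.2 comes from a reader observing the cache between the clear step and the end of the reload, so the data on disk is never wrong, only the view of it.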

In low‑frequency workloads this flaw is hidden, but under minute‑level updates it becomes a fatal defect for monotonic point accumulation.
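
Because points are expected to accumulate monotonically, a rollback of this kind can be detected mechanically. The following is a minimal sketch, not something the source describes: `find_rollbacks` and its input format are hypothetical, and it assumes totals should never fall below their previous high‑water mark.

```python
def find_rollbacks(samples):
    """Return (timestamp, high_water_mark, observed_total) for every dip.

    `samples` is an iterable of (timestamp, total_points) pairs in time order.
    Assumes totals only ever grow, so any value below the running maximum
    is treated as a rollback.
    """
    rollbacks = []
    high_water_mark = None
    for timestamp, total in samples:
        if high_water_mark is not None and total < high_water_mark:
            rollbacks.append((timestamp, high_water_mark, total))
        else:
            high_water_mark = total
    return rollbacks


observed = [("T0", 100), ("T1.2", 50), ("T3", 110)]
print(find_rollbacks(observed))  # [('T1.2', 100, 50)]
```

A check like this would not apply as written if the business allows points to be spent or revoked, since legitimate decreases would then be flagged.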

Solution: Adopt Apache Doris as Next‑Gen Real‑time Data Warehouse

Technology Selection Criteria

Data is queryable as soon as it is written.

Query results are consistent.

High‑frequency real‑time updates are supported.

The system is easy to operate and scale.

After multiple rounds of research and benchmark testing, the team selected Apache Doris.

Doris Architecture Advantages

Unified metadata management: eliminates the need for REFRESH; metadata updates are atomic with data writes.

MVCC snapshot consistency: each query sees a consistent version of the data, solving the "half‑read" problem.

Native real‑time upserts: the Unique Key model supports UPSERT on primary keys, simplifying ETL.

Compute‑storage separation: Doris 3.x decouples compute and storage, providing elastic scaling, cost‑effective storage (e.g., cloud object stores), and high availability through multi‑replica storage.
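
The combination of atomic publication, snapshot reads, and key‑based upserts can be illustrated with a toy in‑memory table. This is a conceptual sketch only and does not reflect how Doris is implemented internally; `VersionedTable`, `upsert`, `snapshot`, and `total_points` are hypothetical names chosen for the example.

```python
import threading


class VersionedTable:
    """Toy key-value table in which each write publishes a new immutable version."""

    def __init__(self):
        self._lock = threading.Lock()
        self._versions = [{}]           # version 0 is the empty table

    def upsert(self, rows):
        """Insert new keys or overwrite existing ones, then publish in one step."""
        with self._lock:
            new_version = dict(self._versions[-1])   # copy-on-write
            new_version.update(rows)                 # same key: value replaced
            self._versions.append(new_version)       # the only visible switch
            return len(self._versions) - 1

    def snapshot(self):
        """Pin the latest published version; later writes never change it."""
        with self._lock:
            return self._versions[-1]


def total_points(snapshot):
    return sum(snapshot.values())


table = VersionedTable()
table.upsert({"user_a": 60, "user_b": 40})
pinned = table.snapshot()            # a query starts and pins its version
table.upsert({"user_a": 70})         # a concurrent write replaces user_a's row

print(total_points(pinned))            # 100
print(total_points(table.snapshot()))  # 110
```

A reader sees either the version before a write or the version after it, never a partially applied one, which is the property missing from the refresh timeline described earlier.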

Cluster Deployment Plan

The solution was deployed as an independent cluster of three Frontend (FE) and three Backend (BE) nodes, running in compute‑storage separation mode.

[Figures not reproduced: cluster node plan, deployment diagram, cluster monitoring]

Results: Significant Benefits from Architecture Upgrade

Problem Elimination and Performance Gains

Business Value

User experience improved – real‑time accurate points increase trust.

Operational efficiency optimized – minute‑level feedback enables fine‑grained actions.

Development cost reduced – ETL simplified, development productivity up 40%.

Operations complexity lowered – removed complex metadata coordination.

Total cost of ownership decreased thanks to compute‑storage separation.

Conclusion and Outlook

Key Takeaways

Technology must match the scenario: Impala + Hive suits offline analysis, not high‑frequency updates.

Metadata consistency issues can stay hidden in low‑frequency workloads.

Smooth migration relies on thorough validation mechanisms.

Compute‑storage separation delivers elasticity and cost benefits.

Future Plans

Expand Doris to more real‑time business scenarios.

Further optimize performance using Doris's advanced features.

Standardize best‑practice guidelines for real‑time data‑warehouse construction.

Explore new capabilities such as the vectorized engine and lakehouse integration.

The upgrade shifted the data service model from "data is available" to "data is reliably real‑time", providing a solid foundation for continuous business innovation.
