How We Turned a Microservice Finance System into a Scalable Big‑Data Warehouse
This article details the evolution of a fast-growing e-commerce finance platform from a microservice architecture plagued by data inconsistency, low processing efficiency, and scalability limits to a robust, distributed big-data warehouse built on SparkSQL, layered data models, and optimized scheduling, achieving roughly ten-fold performance gains and a near-zero failure rate.
Introduction
The rapid growth of the Zhuanzhuan e-commerce platform exposed critical limitations in its original microservice-based finance system. The "pyramid" architecture collected upstream data and generated financial metrics, but growing data volume caused metric-traceability problems, low processing efficiency, and poor stability.
Phase 1 – Microservice Architecture Challenges
1.1 Initial Architecture Design
1.2 Storage Design
Business characteristics:
1. Fields are highly heterogeneous across services; with no common column set, only JSON can unify them.
2. Processed financial data must be queryable in real time, yet JSON hinders structured queries.
3. Each new service would otherwise require a new table, and large volumes would require sharding.
4. Source data arrives via APIs, cloud-window extraction, or manual entry.
-- Storage solution
CREATE TABLE source_data (
id BIGINT PRIMARY KEY, -- globally unique record id
payload JSON, -- heterogeneous source fields unified as JSON
created_at TIMESTAMP -- ingestion time
);
-- Binlog listener syncs source_data to Elasticsearch for multidimensional queries.
1.3 Scheduling Model Design
Offline processing of the various modules uses the distributed scheduler xxl-job. Each module's tasks are pinned to a single machine (a simplified processing model), while different modules run in parallel threads.
1.4 Processing Model Design
Data is joined in‑memory via RPC, results are written to dwd_financial, aggregated into dws_financial, and later synced to Hive for downstream analysis.
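As a minimal sketch of that rollup step, assuming hypothetical columns (biz_line, amount) on dwd_financial since the actual schemas are not shown, the aggregation into dws_financial would look roughly like this:
-- Hypothetical rollup from the detail layer into the summary layer;
-- biz_line and amount are assumed columns, not the actual schema.
INSERT INTO dws_financial
SELECT biz_line,
dt,
SUM(amount) AS total_amount,
COUNT(*) AS order_cnt
FROM dwd_financial
WHERE dt = '${biz_date}'
GROUP BY biz_line, dt;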
Phase 2 – Architecture Evolution Considerations
2.1 Core Problem Analysis
Data integrity: metrics are scattered across microservice databases, forming data islands, and the RPC calls that fetch dimensions and measures fail with a timeout rate of up to 5%.
Scheduling & sync reliability: ES sync status is invisible, so calculations can run against unsynced data, and xxl-job and cloud-window extraction are not coordinated.
Scalability bottlenecks: single-node processing keeps end-to-end latency at roughly 6 hours, adding new metrics requires code changes and redeployment, and resources cannot scale elastically.
2.2 Solution Ideas
Avoid RPC by pre‑loading dimension data into a centralized hub.
Decouple analytical workloads from OLTP services; keep microservices focused on transaction processing.
Adopt a unified scheduling ecosystem (e.g., cloud‑window dependency management).
Introduce parallel distributed computation with Spark to overcome single‑node limits.
2.3 Decision Matrix
Three paths were evaluated:
Enhance existing microservice stack – low effort but cannot solve fundamental analysis bottlenecks.
Introduce a big‑data stack – higher learning curve but provides professional analytical capabilities for medium‑to‑large data volumes.
Adopt a commercial solution – quick to launch but costly and inflexible.
Given daily offline processing of tens of billions of rows and a three‑month transformation window, the team selected the big‑data stack.
Phase 3 – New Architecture Design
3.1 Data Model Design
3.1.1 Layered Architecture
ODS layer: stores raw data snapshots.
DW layer: normalized, cleaned operational data (facts).
DIM layer: dimension tables (product, city, channel, etc.).
DM layer: data marts for department-specific OLAP analysis.
ADS layer: application-oriented data services.
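To make the flow between layers concrete, here is a hypothetical cleansing step from ODS into the DW layer; the source table ods_order_raw and its columns are assumptions for illustration:
-- Hypothetical ODS-to-DW cleansing; ods_order_raw and its columns are illustrative.
INSERT OVERWRITE TABLE dwd_order_detail PARTITION (dt = '${biz_date}')
SELECT order_id,
user_id,
CAST(amount AS DECIMAL(18, 2)) AS amount, -- normalize numeric types
to_date(create_time) AS order_date -- standardize timestamps
FROM ods_order_raw
WHERE dt = '${biz_date}'
AND order_id IS NOT NULL; -- drop malformed rows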
3.1.2 Dimensional Modeling
Process: business process → grain definition → dimension selection → fact table design.
-- Shared dimension table
CREATE TABLE dim_time (
date_key INT PRIMARY KEY, -- surrogate key, e.g. 20240101
full_date DATE,
day_of_week TINYINT,
month TINYINT,
quarter TINYINT,
year SMALLINT
);
-- Order fact table
CREATE TABLE fact_orders (
order_id BIGINT,
date_key INT REFERENCES dim_time (date_key), -- foreign key to the shared time dimension
-- other fields ...
);
-- Inventory fact table
CREATE TABLE fact_inventory (
sku_id BIGINT,
date_key INT REFERENCES dim_time (date_key), -- foreign key to the shared time dimension
-- other fields ...
);
3.1.3 Scheduling System
Unified cloud‑window task dependency management for parent‑child relationships.
Critical‑path monitoring and automatic retries.
3.2 Big‑Data Technology Selection
Compute engine: SparkSQL was chosen for its in-memory batch processing, ANSI-SQL compatibility, and ability to handle >10 PB of data with sub-second to minute latency.
OLAP engine: StarRocks was selected for its superior single-table performance, multi-table join capability, and high concurrency, outperforming Doris and ClickHouse in the evaluated scenarios.
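For illustration only (not the team's actual DDL), a serving-layer table in StarRocks could be declared as follows; the table name, schema, and bucketing are assumptions:
-- Hypothetical StarRocks serving table; all names and properties are illustrative.
CREATE TABLE ads_finance_daily (
dt DATE NOT NULL, -- business date
biz_line VARCHAR(64) NOT NULL, -- business line / department
revenue DECIMAL(18, 2) -- daily revenue
) ENGINE = OLAP
DUPLICATE KEY (dt, biz_line) -- detail (duplicate-key) table model
DISTRIBUTED BY HASH (biz_line) BUCKETS 10; -- spread load across tablets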
3.3 Data Warehouse Architecture Diagram
3.4 Data Processing Examples
Replace Java service calls with SparkSQL:
-- Basic order amount (single table)
SELECT order_id, original_amount, shipping_fee,
original_amount + shipping_fee AS total_amount
FROM dwd_order_detail
WHERE dt = '${biz_date}';
-- Multi-dimensional join (replaces Java RPC)
SELECT o.order_id,
CASE WHEN u.vip_level = 'PLATINUM' THEN o.original_amount * 0.9 ELSE o.original_amount END AS vip_adjusted_amount,
COALESCE(c.coupon_amount, 0) AS coupon_deduction,
(o.original_amount + o.shipping_fee - COALESCE(c.coupon_amount,0)) AS final_amount
FROM dwd_order_detail o
LEFT JOIN dim_user u ON o.user_id = u.user_id AND u.dt = '${biz_date}'
LEFT JOIN dim_coupon c ON o.coupon_id = c.coupon_id AND c.dt = '${biz_date}'
WHERE o.dt = '${biz_date}';
-- Window function example (average of the last 3 orders)
SELECT order_id, user_id, amount,
AVG(amount) OVER (PARTITION BY user_id ORDER BY create_time ROWS BETWEEN 2 PRECEDING AND CURRENT ROW) AS moving_avg_amount
FROM dwd_order_detail
WHERE dt BETWEEN date_sub('${biz_date}', 30) AND '${biz_date}';
3.5 Issues Encountered and Optimizations
Data consistency: different departments used different timestamps (order creation vs. payment success), causing metric gaps. The solution was to unify the statistical granularity and add daily diff-rate validation jobs, as sketched below.
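A sketch of what such a diff-rate validation could look like; the table names and the 0.1% threshold are assumptions for illustration:
-- Hypothetical daily check: compare the same metric computed by order-creation
-- time vs. payment-success time and surface days whose deviation is too large.
SELECT a.dt,
a.total_amount AS created_amount,
b.total_amount AS paid_amount,
ABS(a.total_amount - b.total_amount) / a.total_amount AS diff_rate
FROM dws_amount_by_create_time a
JOIN dws_amount_by_pay_time b ON a.dt = b.dt
WHERE a.dt = '${biz_date}'
AND ABS(a.total_amount - b.total_amount) / a.total_amount > 0.001; -- alert threshold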
Data skew: a hot SKU (e.g., SKU = 888) caused a single reducer to become a bottleneck. The fix salted the skewed key with a random suffix and replicated the matching dimension rows 100 times, then coalesced the results.
-- Add a random suffix to the skewed key
WITH skewed_data AS (
SELECT order_id,
-- Salt the hot key with a random 0-99 suffix so its rows spread across reducers
CASE WHEN sku_id = '888' THEN CONCAT(sku_id, '_', CAST(FLOOR(RAND() * 100) AS INT)) ELSE sku_id END AS skewed_sku_id,
quantity
FROM fact_orders
),
expanded_dim AS (
-- Replicate the hot SKU's dimension row 100 times, one per possible suffix
SELECT sku_id, sku_name, pos
FROM dim_sku LATERAL VIEW EXPLODE(SEQUENCE(0, 99)) t AS pos
WHERE sku_id = '888'
UNION ALL
-- Non-skewed SKUs keep a single row with no suffix
SELECT sku_id, sku_name, NULL AS pos
FROM dim_sku WHERE sku_id != '888'
)
SELECT a.order_id,
COALESCE(b.sku_name, c.sku_name) AS sku_name,
SUM(a.quantity) AS total_qty
FROM skewed_data a
-- Salted hot-key rows match the replicated dimension rows
LEFT JOIN expanded_dim b ON a.skewed_sku_id = CONCAT(b.sku_id, '_', b.pos) AND b.sku_id = '888'
-- Non-skewed keys join the dimension table directly
LEFT JOIN dim_sku c ON a.skewed_sku_id = c.sku_id
GROUP BY a.order_id, COALESCE(b.sku_name, c.sku_name);
After optimization, job duration dropped from 3 hours to 25 minutes.
3.6 Architecture Comparison Results
RPC stability: eliminated, failure rate ≈ 0%.
Task reliability: automated scheduling raised success rate to 99.8%, reducing O&M effort by 75%.
Data accuracy: unified processing cut metric deviation from as much as 10% to below 0.1% (over 99.9% accuracy).
Processing capacity: distributed computation increased throughput 5× and cut latency from 6 h to minutes.
Re‑run efficiency: insert‑overwrite reduced re‑run time from >4 h to 30 min (≈ 87.5% improvement).
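The insert-overwrite pattern behind that re-run gain, sketched with assumed table names; overwriting the partition replaces any partial or stale output wholesale, making re-runs idempotent:
-- Hypothetical idempotent re-run: the whole partition is replaced,
-- so a failed or repeated run never leaves duplicate rows behind.
INSERT OVERWRITE TABLE dws_financial PARTITION (dt = '${biz_date}')
SELECT biz_line,
SUM(amount) AS total_amount
FROM dwd_financial
WHERE dt = '${biz_date}'
GROUP BY biz_line;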
Future Outlook
Adopt a Lambda architecture to combine batch and streaming for real‑time reporting.
Upgrade the stack with Flink for stream processing and Kudu for low‑latency analytics.
Apply microservice resilience patterns (e.g., Sentinel) to the real‑time pipeline for error‑rate monitoring and proactive alerts.
Expand financial data coverage across the entire group to enable predictive analytics.
Conclusion
The evolution from a microservice‑centric finance system to a big‑data‑driven architecture demonstrates that there is no universally "best" architecture—only the one that fits the current business stage. Transitioning required careful problem analysis, systematic redesign, and extensive performance tuning, providing a reference for teams facing similar scaling challenges.
Zhuanzhuan Tech
A platform for Zhuanzhuan R&D and industry peers to learn and exchange technology, regularly sharing frontline experience and cutting‑edge topics. We welcome practical discussions and sharing; contact waterystone with any questions.
