Upgrading Hangzhou Bank Consumer Finance Big Data Platform with Apache Doris 1.2: Architecture, Performance Gains, and Integration
This article describes how Hangzhou Bank Consumer Finance modernized its big-data platform by introducing Apache Doris 1.2 in place of the original Greenplum + CDH architecture. It covers how Multi-Catalog unified the data sources, how query latency fell to the second level while storage and compute costs dropped, and how the integration workflow with DolphinScheduler, SeaTunnel, and Spark is set up.
With rapid business growth and expanding data volumes, Hangzhou Bank Consumer Finance (Hangzhou Yinxiaojin) ran into performance bottlenecks in its original big-data platform built on Greenplum and CDH. To meet stricter requirements for real-time data and more complex queries, the company introduced Apache Doris 1.2 in October 2022, using its Multi-Catalog feature to federate ES, Hive, and Greenplum data sources behind a unified query gateway.
Business Requirements
Alerting: real‑time monitoring of loan‑process user counts and amounts.
Analysis: ad‑hoc queries and temporary data extraction for credit‑approval, disbursement, etc.
Dashboard: real‑time operation cockpit and T+1 business dashboards.
Modeling: multi‑dimensional variable modeling to improve credit‑approval models.
Data Architecture 1.0
The initial platform combined Greenplum, Hive, ES, and MongoDB. Data from core systems flowed through CloudCanal to Greenplum for BI, to Kafka → Flink → ES for real‑time variables, and to Hive for batch analytics. This architecture satisfied basic needs but suffered from wide tables, massive data size (hundreds of TB), unstable interfaces, and minute‑level query latency.
Problems Identified
Extremely wide tables (>1,000 columns) degraded Hive query performance.
Data volume >100 TB made maintenance costly.
Batch‑derived variables pushed to ES caused GC spikes and unstable APIs.
Technical Selection
After evaluating ClickHouse and Apache Doris, the team chose Doris 1.2 for its high‑performance wide‑table queries, simple data‑ingestion framework, rich ecosystem (Spark/Doris and Flink/Doris connectors), and active community support.
Data Architecture 2.0
Doris became the core component. Multi‑Catalog unified ES, Hive, and Greenplum under a single query layer, while Spark‑Doris‑Connector enabled bidirectional data flow between Hive and Doris. This redesign reduced query latency from minutes to seconds (or milliseconds) across all four business scenarios.
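As a rough illustration of how Multi-Catalog exposes Hive tables to Doris, the sketch below builds a `CREATE CATALOG` statement for a Hive Metastore catalog and a query that addresses a Hive table by its three-part `catalog.database.table` name. The catalog name `hive_prod`, the metastore URI, and the helper functions are hypothetical; the table `ods.demo_tbl` and its `dt` partition column are taken from the job configuration in this article. Catalog properties for ES and Greenplum sources differ between Doris releases, so they are left out here.

```python
import re
from datetime import date

_IDENTIFIER = re.compile(r"[A-Za-z_][A-Za-z0-9_]*")


def _check_identifier(name: str) -> str:
    """Reject anything that is not a plain SQL identifier."""
    if not _IDENTIFIER.fullmatch(name):
        raise ValueError(f"unsafe identifier: {name!r}")
    return name


def hive_catalog_ddl(catalog: str, metastore_uri: str) -> str:
    """Build a CREATE CATALOG statement for a Hive Metastore (hms) catalog."""
    if "'" in metastore_uri:
        raise ValueError("metastore URI must not contain quotes")
    return (
        f"CREATE CATALOG {_check_identifier(catalog)} PROPERTIES (\n"
        f"    'type' = 'hms',\n"
        f"    'hive.metastore.uris' = '{metastore_uri}'\n"
        f");"
    )


def qualified_name(catalog: str, database: str, table: str) -> str:
    """Return the three-part name used to reach a table in an external catalog."""
    return ".".join(_check_identifier(p) for p in (catalog, database, table))


def partition_count_sql(catalog: str, database: str, table: str, dt: date) -> str:
    """Count the rows of one daily partition, read through the external catalog."""
    target = qualified_name(catalog, database, table)
    return f"SELECT COUNT(*) FROM {target} WHERE dt = '{dt.isoformat()}'"


if __name__ == "__main__":
    # Hypothetical catalog name and metastore address, for illustration only.
    print(hive_catalog_ddl("hive_prod", "thrift://metastore.example:9083"))
    print(partition_count_sql("hive_prod", "ods", "demo_tbl", date(2023, 3, 9)))
```

The generated statements could be sent to a Doris Front-End with any MySQL-protocol client. String formatting is used here only for readability, which is why the helpers validate identifiers before interpolating them.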
Production Cluster Monitoring
The production cluster runs 4 Front-Ends and 8 Back-Ends, with 64 CPU cores, 256 GB of memory, and 4 TB of storage per node. Because queries run efficiently, the cluster is smaller than the 10 nodes originally planned.
Data Integration Solution
To automate data sync from Hive to Doris, the team combined DolphinScheduler, SeaTunnel, and Spark on YARN. The configuration file used for the job is shown below:
env {
  spark.app.name = "hive2doris-template"
  spark.executor.instances = 10
  spark.executor.cores = 5
  spark.executor.memory = "20g"
}
spark {
  spark.sql.catalogImplementation = "hive"
}
source {
  hive {
    pre_sql = "select * from ods.demo_tbl where dt='2023-03-09'"
    result_table_name = "ods_demo_tbl"
  }
}
transform {
}
sink {
  doris {
    fenodes = "192.168.0.10:8030,192.168.0.11:8030,192.168.0.12:8030,192.168.0.13:8030"
    user = root
    password = "XXX"
    database = ods
    table = ods_demo_tbl
    batch_size = 500000
    max_retries = 1
    interval = 10000
    doris.column_separator = "\t"
  }
}
After deployment, storage consumption dropped from 810 GB to 270 GB (about one third of the original), and data synchronization time fell to roughly 40 minutes. Query performance also improved dramatically: a Hive GROUP BY query that previously took 162 seconds ran in 58.5 seconds through Doris with the Hive Catalog, and the same query against a native Doris table finished in 0.828 seconds.
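The job configuration shown earlier hard-codes the partition date `2023-03-09` in `pre_sql`. A scheduled daily job would normally need that value filled in per run. The sketch below is one possible way to do this, assuming the scheduler hands the business date to a small Python step that renders the `source` section before the job is submitted. The function `render_source` and the template placeholders are hypothetical and not part of SeaTunnel or DolphinScheduler.

```python
from datetime import date
from string import Template

# Template for the `source` section only; the ${...} placeholders are
# filled in by string.Template, not by SeaTunnel itself.
SOURCE_TEMPLATE = Template("""\
source {
  hive {
    pre_sql = "select * from ${table} where dt='${dt}'"
    result_table_name = "${result_table}"
  }
}
""")


def render_source(table: str, result_table: str, dt: date) -> str:
    """Render the source section for one daily partition."""
    return SOURCE_TEMPLATE.substitute(
        table=table,
        result_table=result_table,
        dt=dt.isoformat(),  # ISO format matches the dt='YYYY-MM-DD' partition value
    )


if __name__ == "__main__":
    # Reproduces the source section from the article for 2023-03-09.
    print(render_source("ods.demo_tbl", "ods_demo_tbl", date(2023, 3, 9)))
```

Passing a `date` object rather than a raw string keeps malformed values out of the generated SQL, since `isoformat()` can only produce a `YYYY-MM-DD` string.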
Benefits of the New Architecture
A single, unified outlet for all data sources, with significantly faster queries.
Separation of Hive batch jobs from real‑time queries, improving cluster resource utilization.
Stabilized data interfaces and faster write throughput by moving writes from ES to Doris.
Rich ecosystem and low migration cost thanks to Spark‑Doris connector.
Online horizontal scaling makes cluster expansion seamless and operations simple.
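To put the reported measurements in proportion, the short calculation below derives the storage and query-time ratios from the figures given in this article (810 GB and 270 GB; 162 s, 58.5 s, and 0.828 s). The variable names are illustrative, and the ratios apply only to the single benchmark query described, not to the workload as a whole.

```python
# Figures as reported in the article.
STORAGE_BEFORE_GB = 810.0
STORAGE_AFTER_GB = 270.0
HIVE_SECONDS = 162.0          # GROUP BY run directly on Hive
HIVE_CATALOG_SECONDS = 58.5   # same query through Doris + Hive Catalog
NATIVE_DORIS_SECONDS = 0.828  # same query on a native Doris table


def ratio(before: float, after: float) -> float:
    """How many times smaller or faster `after` is compared with `before`."""
    return before / after


storage_reduction = ratio(STORAGE_BEFORE_GB, STORAGE_AFTER_GB)
catalog_speedup = ratio(HIVE_SECONDS, HIVE_CATALOG_SECONDS)
native_speedup = ratio(HIVE_SECONDS, NATIVE_DORIS_SECONDS)

if __name__ == "__main__":
    print(f"storage reduction:    {storage_reduction:.1f}x")  # 3.0x
    print(f"Hive Catalog speedup: {catalog_speedup:.1f}x")    # 2.8x
    print(f"native Doris speedup: {native_speedup:.1f}x")     # 195.7x
```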
Future Plans (Architecture 3.0)
Ingest all real‑time data directly into Doris via Flink, eliminating ES.
Build a real‑time raw layer in Doris and sync MySQL and MongoDB data via Flink‑CDC.
Migrate most Hive + Spark batch jobs to Doris for higher efficiency.
Consolidate all query and analysis workloads on Doris to create a unified query gateway.
Introduce visual operation tools for stable cluster expansion.
Conclusion
Apache Doris 1.2, with its Multi‑Catalog and high‑performance OLAP engine, has transformed the big‑data platform of Hangzhou Bank Consumer Finance, delivering second‑level query latency, reduced storage/computation costs, and a more maintainable architecture. The team thanks the Doris community and SelectDB for their support and recommends Doris to other enterprises seeking similar upgrades.
DataFunTalk
Dedicated to sharing and discussing big data and AI technology applications, aiming to empower a million data scientists. Regularly hosts live tech talks and curates articles on big data, recommendation/search algorithms, advertising algorithms, NLP, intelligent risk control, autonomous driving, and machine learning/deep learning.
