Optimizing a Real‑Time Data Warehouse with Apache Doris at 360 DataTech
Facing stricter security, accuracy, and latency requirements, 360 DataTech rebuilt its real‑time data warehouse on Apache Doris, chosen for its high write performance, SQL compatibility, low operational complexity, and active community. This article covers the architecture, ingestion, query acceleration, monitoring, troubleshooting, and future plans.
360 DataTech needed stronger data security, higher accuracy, and better real‑time performance, which prompted a redesign of its real‑time data warehouse. After evaluating several options, the team selected Apache Doris 1.1.2 for its high write performance, SQL compatibility, strong join capabilities, low operational complexity, active community, and commercial friendliness.
The Doris cluster was integrated atop the existing Hive warehouse, using Broker Load for data ingestion and dynamic partitioning to handle various table types (pda, pdi, a, s). Duplicate and Unique models were chosen based on update patterns, and automatic schema sync tools were developed.
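To illustrate how the model choice and dynamic partitioning fit together, the Python sketch below builds a Doris CREATE TABLE statement. It is not the team's actual schema sync tool. The TableSpec class, the has_updates flag, the 30-day retention, and the bucket count are assumptions made for this example; the dynamic_partition.* keys are standard Doris table properties.

```python
from dataclasses import dataclass


@dataclass
class TableSpec:
    """Hypothetical description of a Hive table to mirror in Doris."""
    name: str
    columns: dict             # column name -> Doris type, key columns first
    key_columns: list         # leading columns that form the key
    partition_column: str     # date column (must be one of the key columns)
    distribution_column: str  # hash column (must be one of the key columns)
    has_updates: bool         # True if rows are rewritten by key after loading
    buckets: int = 10         # assumed default, tune per table


def build_create_table(spec: TableSpec) -> str:
    """Return a CREATE TABLE statement with dynamic partitioning enabled."""
    # Unique replaces rows that share a key; Duplicate keeps every loaded row.
    model = "UNIQUE" if spec.has_updates else "DUPLICATE"
    cols = ",\n  ".join(f"`{c}` {t}" for c, t in spec.columns.items())
    keys = ", ".join(f"`{k}`" for k in spec.key_columns)
    props = {
        "dynamic_partition.enable": "true",
        "dynamic_partition.time_unit": "DAY",
        "dynamic_partition.start": "-30",  # assumed retention of 30 days
        "dynamic_partition.end": "3",      # pre-create 3 future partitions
        "dynamic_partition.prefix": "p",
        "dynamic_partition.buckets": str(spec.buckets),
    }
    prop_sql = ",\n  ".join(f'"{k}" = "{v}"' for k, v in props.items())
    return (
        f"CREATE TABLE `{spec.name}` (\n  {cols}\n)\n"
        f"{model} KEY({keys})\n"
        f"PARTITION BY RANGE(`{spec.partition_column}`) ()\n"
        f"DISTRIBUTED BY HASH(`{spec.distribution_column}`) "
        f"BUCKETS {spec.buckets}\n"
        f"PROPERTIES (\n  {prop_sql}\n);"
    )
```

Calling build_create_table with has_updates=True yields a UNIQUE KEY table, while has_updates=False yields a DUPLICATE KEY table; in both cases the empty partition list is filled in by Doris as the dynamic partition scheduler runs.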
To accelerate ad‑hoc queries, a Doris acceleration layer was added, routing queries to Doris when tables are synchronized and meet size criteria, otherwise falling back to Presto, Spark, or Hive. This reduced typical query latency from minutes to under five seconds.
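The routing rule described above can be sketched roughly as follows. This is a minimal illustration, not the production router: the TableState class, the choose_engine function, the catalog dictionary, and the 500 GiB size limit are all hypothetical, since the source does not state the actual size criteria or how the fallback engine is picked.

```python
from dataclasses import dataclass
from typing import Dict, Iterable

# Assumed size limit; the source only says tables must "meet size criteria".
MAX_DORIS_TABLE_BYTES = 500 * 1024**3


@dataclass(frozen=True)
class TableState:
    """Hypothetical sync metadata kept for each Hive table."""
    synced_to_doris: bool
    size_bytes: int


def choose_engine(
    tables: Iterable[str],
    catalog: Dict[str, TableState],
    fallback: str = "presto",
) -> str:
    """Pick Doris only if every referenced table is synced and small enough."""
    tables = list(tables)
    if not tables:
        # Nothing to check against the catalog, so stay on the fallback engine.
        return fallback
    for table in tables:
        state = catalog.get(table)
        if state is None or not state.synced_to_doris:
            return fallback
        if state.size_bytes > MAX_DORIS_TABLE_BYTES:
            return fallback
    return "doris"
```

A single unsynchronized or oversized table sends the whole query to the fallback engine, which keeps results consistent at the cost of missing some acceleration opportunities.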
Slow‑query and slow‑import profiling was enabled, exposing detailed OLAP_SCAN_NODE and EXCHANGE_NODE metrics. Issues such as insufficient bucket numbers, ORC schema changes, empty ORC files, and HDFS path timeouts were identified and resolved through configuration tweaks (e.g., broker_timeout_ms, max_broker_concurrency, default_load_parallelism).
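For the bucket-count issue in particular, a simple sizing helper can make the choice less ad hoc. The sketch below assumes a target of about 2 GiB per tablet, which sits inside the commonly cited Doris guideline of roughly 1 GB to 10 GB per tablet. The function name, the target size, and the upper bound of 128 are assumptions for illustration and do not come from the source.

```python
import math


def suggest_buckets(
    partition_bytes: int,
    target_tablet_bytes: int = 2 * 1024**3,  # assumed target: ~2 GiB per tablet
    min_buckets: int = 1,
    max_buckets: int = 128,                  # assumed cap to limit tablet count
) -> int:
    """Estimate a bucket count so each tablet lands near the target size."""
    if partition_bytes < 0:
        raise ValueError("partition_bytes must be non-negative")
    if target_tablet_bytes <= 0:
        raise ValueError("target_tablet_bytes must be positive")
    needed = math.ceil(partition_bytes / target_tablet_bytes)
    # Clamp so tiny partitions still get one bucket and huge ones stay bounded.
    return max(min_buckets, min(max_buckets, needed))
```

Under these assumptions a 20 GiB partition gets 10 buckets, and an empty partition still gets one.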
Monitoring combines host‑level metrics via Open‑Falcon, cluster metrics via Prometheus/Grafana, and log‑based alerts. Audit logs are collected using the built‑in auditloader plugin, facilitating root‑cause analysis of BE crashes and SQL performance bottlenecks.
A custom Replicator plugin provides cross‑region disaster recovery by replaying DDL/DML to a standby cluster, with a validator ensuring data consistency.
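The source does not describe how the validator works. One plausible approach, shown here only as a sketch, is to compare a per-table fingerprint (row count plus a checksum) between the primary and standby clusters. The TableFingerprint class and the diff_clusters function are hypothetical names, and how the fingerprints are collected from each cluster is left out.

```python
from dataclasses import dataclass
from typing import Dict


@dataclass(frozen=True)
class TableFingerprint:
    """Hypothetical per-table summary gathered from one cluster."""
    row_count: int
    checksum: int


def diff_clusters(
    primary: Dict[str, TableFingerprint],
    standby: Dict[str, TableFingerprint],
) -> Dict[str, str]:
    """Return {table: reason} for every table that differs between clusters."""
    problems: Dict[str, str] = {}
    for table, fp in primary.items():
        other = standby.get(table)
        if other is None:
            problems[table] = "missing on standby"
        elif other.row_count != fp.row_count:
            problems[table] = f"row count {fp.row_count} != {other.row_count}"
        elif other.checksum != fp.checksum:
            problems[table] = "checksum mismatch"
    # Tables that exist only on the standby usually mean a DROP was not replayed.
    for table in standby.keys() - primary.keys():
        problems[table] = "missing on primary"
    return problems
```

An empty result means the two clusters agree on every table under this check; any entry points to a table where replay should be investigated.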
Since its production launch in July 2022, the Doris service has grown to support hundreds of tables, dozens of terabytes of data, and thousands of daily sync jobs, delivering fast BI reporting and stable operations with strong community support.
Future work includes expanding Doris usage to more scenarios, leveraging materialized views and query caching, and building automated cluster health diagnostics.
DataFunTalk
Dedicated to sharing and discussing big data and AI technology applications, aiming to empower a million data scientists. Regularly hosts live tech talks and curates articles on big data, recommendation/search algorithms, advertising algorithms, NLP, intelligent risk control, autonomous driving, and machine learning/deep learning.
