JD's OLAP Architecture: Design, Challenges, and Solutions
This article explains how JD builds its OLAP platform, covering data ingestion, storage, querying, and management. It describes the diverse data sources, the split between real‑time and offline processing, and the key technical challenges (scalability, consistency, fault tolerance) with their solutions, and closes with future optimization plans.
Guest Speaker : Li Yang, Senior R&D Engineer at JD.
Overview : The talk introduces JD's end‑to‑end OLAP construction, starting from business demand scenarios, analyzing existing problems, proposing solutions, and outlining the evolution of the OLAP system.
Demand Scenarios
1. JD Data Ingress
① Business Data – Orders : JD's e‑commerce platform generates order data, which is analyzed from multiple dimensions such as store, product category, and conversion rates.
② Behavioral Data – Clicks and Searches : User click and search actions are combined with order information for funnel analysis and conversion rate calculation.
③ Advertising and Recommendations : Based on order behavior, targeted ads and recommendations are delivered and measured.
④ Monitoring Metrics : Operational metrics, alongside user behavior data, are also monitored.
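The funnel analysis and conversion rate calculation mentioned for behavioral data can be illustrated with a minimal sketch. The event names (`search`, `click`, `order`), the `(user_id, event_type)` record shape, and the function names are assumptions made for illustration, not JD's actual schema; the sketch also ignores event ordering in time, which a production funnel would normally enforce.

```python
from collections import defaultdict

# Hypothetical funnel definition; the step names are illustrative assumptions.
FUNNEL_STEPS = ["search", "click", "order"]


def funnel_counts(events, steps=FUNNEL_STEPS):
    """Count distinct users who performed every step up to and including each step."""
    seen = defaultdict(set)  # user_id -> set of event types observed
    for user_id, event_type in events:
        seen[user_id].add(event_type)
    counts = []
    for i in range(len(steps)):
        required = set(steps[: i + 1])
        counts.append(sum(1 for observed in seen.values() if required <= observed))
    return counts


def conversion_rates(counts):
    """Step-to-step conversion rate; 0.0 when the previous step has no users."""
    return [(curr / prev if prev else 0.0) for prev, curr in zip(counts, counts[1:])]


events = [
    ("u1", "search"), ("u1", "click"), ("u1", "order"),
    ("u2", "search"), ("u2", "click"),
    ("u3", "search"),
    ("u4", "click"),  # clicked without searching, so excluded from step 2
]
counts = funnel_counts(events)
rates = conversion_rates(counts)
```

Here `counts` holds the number of users remaining at each funnel step and `rates` the share of users carried from one step to the next.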
2. JD Data Egress
Data export is divided into offline and real‑time streams.
Offline: weekly/monthly reports, financial statements, and machine‑learning training data.
Real‑time: interactive queries for analysts, real‑time dashboards for promotions, and dynamic resource adjustments.
Key Issues and Solutions
1. Write
Challenge: data sources are diverse (files, HDFS, Kafka/MQ) and arrive in many formats (CSV, TSV, JSON, Avro, Parquet, binlog). Solution: a unified import service that abstracts the source types, so users configure an import through a visual UI by selecting the topic, target, format, and field types.
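As a rough sketch of how a unified import service might hide format differences behind one job description, the example below dispatches on a declared format and casts fields to the configured types. The `ImportJob` fields, the `PARSERS` table, and `run_import` are hypothetical names chosen for illustration; only three text formats are covered, and reading from Kafka or HDFS is left out.

```python
import csv
import io
import json
from dataclasses import dataclass


@dataclass
class ImportJob:
    """Hypothetical job description, mirroring what a user would pick in the UI."""
    source: str        # e.g. a topic name or file path (treated as opaque here)
    target_table: str
    fmt: str           # "csv", "tsv", or "json" (one JSON object per line)
    field_types: dict  # column name -> callable used to cast the raw value


def _parse_delimited(text, delimiter):
    # The first line is assumed to be a header row.
    return list(csv.DictReader(io.StringIO(text), delimiter=delimiter))


def _parse_json_lines(text):
    return [json.loads(line) for line in text.splitlines() if line.strip()]


# One parser per supported format; adding a format means adding one entry.
PARSERS = {
    "csv": lambda text: _parse_delimited(text, ","),
    "tsv": lambda text: _parse_delimited(text, "\t"),
    "json": _parse_json_lines,
}


def run_import(job, raw_text):
    """Parse raw_text according to job.fmt and cast each configured field."""
    if job.fmt not in PARSERS:
        raise ValueError(f"unsupported format: {job.fmt}")
    rows = PARSERS[job.fmt](raw_text)
    return [
        {name: cast(row[name]) for name, cast in job.field_types.items()}
        for row in rows
    ]


types = {"order_id": int, "amount": float}
csv_job = ImportJob("orders_topic", "orders", "csv", types)
json_job = ImportJob("orders_topic", "orders", "json", types)
csv_rows = run_import(csv_job, "order_id,amount\n1,9.5\n2,3\n")
json_rows = run_import(json_job, '{"order_id": "1", "amount": 9.5}\n')
```

Both jobs yield rows with the same typed shape, which is the point of the abstraction: downstream loading code does not need to know which format the data arrived in.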
2. Timeliness
Real‑time data requires immediate computation; offline data can be batch‑processed. Solution: physically isolate real‑time and offline clusters to avoid interference and allocate resources appropriately.
3. Updates and Deletions
Updates are handled by overwriting records (e.g., order status changes); deletions use partition drops or versioned data replacement.
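The two mechanisms above, overwriting a record with a newer version and deleting data by dropping a whole partition, can be sketched with a toy in-memory table. The `PartitionedTable` class and its method names are hypothetical and stand in for what the storage engine does internally; this is not how Doris or ClickHouse implement it.

```python
class PartitionedTable:
    """Toy table: partition -> {key: (version, row)}. Illustrative only."""

    def __init__(self):
        self._partitions = {}

    def upsert(self, partition, key, version, row):
        """Overwrite the row for key unless a newer version is already stored."""
        rows = self._partitions.setdefault(partition, {})
        current = rows.get(key)
        if current is None or version >= current[0]:
            rows[key] = (version, row)

    def drop_partition(self, partition):
        """Delete all rows of a partition in one step (no per-row deletes)."""
        self._partitions.pop(partition, None)

    def scan(self):
        """Return the current rows, ordered by partition name."""
        return [
            row
            for partition in sorted(self._partitions)
            for _, row in self._partitions[partition].values()
        ]


table = PartitionedTable()
table.upsert("day1", "order1", 1, {"order": "order1", "status": "created"})
table.upsert("day1", "order1", 2, {"order": "order1", "status": "paid"})
table.upsert("day1", "order1", 1, {"order": "order1", "status": "created"})  # stale, ignored
table.upsert("day2", "order2", 1, {"order": "order2", "status": "created"})
before_drop = table.scan()
table.drop_partition("day1")
after_drop = table.scan()
```

An order status change becomes a higher-version write to the same key, while expiring old data is a single partition drop rather than many row deletions.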
4. High Throughput
Solution: equip real‑time clusters with 10 GbE networking and SSDs, and offline clusters with HDDs.
5. Storage
Challenges: petabyte‑scale data cannot be stored on a single node. Solution: distributed storage with columnar format, compression (e.g., Snappy), and multi‑replica fault tolerance.
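To show why a columnar layout pairs well with compression, the sketch below pivots rows into per-column blocks and compresses each block independently, so a query touching one column only decompresses that column. It uses `zlib` from the Python standard library as a stand-in for Snappy, and JSON as the value encoding; both are assumptions made to keep the example self-contained, and the function names are hypothetical.

```python
import json
import zlib


def encode_columns(rows, columns):
    """Pivot rows into columns and compress each column as its own block."""
    blocks = {}
    for name in columns:
        values = [row[name] for row in rows]
        blocks[name] = zlib.compress(json.dumps(values).encode("utf-8"))
    return blocks


def read_column(blocks, name):
    """Decompress and decode a single column without touching the others."""
    return json.loads(zlib.decompress(blocks[name]).decode("utf-8"))


# Sample data: a low-cardinality column such as status compresses very well.
rows = [{"order_id": i, "status": "paid"} for i in range(1000)]
blocks = encode_columns(rows, ["order_id", "status"])
statuses = read_column(blocks, "status")
raw_status_size = len(json.dumps([row["status"] for row in rows]))
compressed_status_size = len(blocks["status"])
```

Values in the same column tend to resemble each other, which is why `compressed_status_size` comes out far smaller than `raw_status_size` in this sample.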
6. Consistency
Solution: distributed coordination (e.g., ZooKeeper) combined with local transaction mechanisms to ensure data consistency.
7. Read
Techniques: partitioning by time, pre‑aggregation, indexing (hash, B‑tree, range, inverted), materialized views.
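Two of the read techniques listed above, time partitioning and pre-aggregation, can be sketched as follows. The row fields (`day`, `store`, `amount`), the sample dates, and the function names are illustrative assumptions; a real engine maintains partitions and rollups on disk rather than in dictionaries.

```python
from collections import defaultdict
from datetime import date


def partition_by_day(rows):
    """Group rows into one partition per day."""
    partitions = defaultdict(list)
    for row in rows:
        partitions[row["day"]].append(row)
    return dict(partitions)


def prune_partitions(partitions, start, end):
    """Return only the partition keys a query for [start, end] must read."""
    return sorted(day for day in partitions if start <= day <= end)


def pre_aggregate(partitions):
    """Build a rollup (day, store) -> total amount, computed once at load time."""
    rollup = defaultdict(float)
    for day, rows in partitions.items():
        for row in rows:
            rollup[(day, row["store"])] += row["amount"]
    return dict(rollup)


def store_total(rollup, store, start, end):
    """Answer a per-store total from the rollup instead of scanning raw rows."""
    return sum(
        amount
        for (day, s), amount in rollup.items()
        if s == store and start <= day <= end
    )


rows = [
    {"day": date(2024, 1, 1), "store": "A", "amount": 10.0},
    {"day": date(2024, 1, 1), "store": "B", "amount": 4.0},
    {"day": date(2024, 1, 2), "store": "A", "amount": 5.0},
    {"day": date(2024, 1, 3), "store": "A", "amount": 7.0},
]
partitions = partition_by_day(rows)
needed = prune_partitions(partitions, date(2024, 1, 1), date(2024, 1, 2))
rollup = pre_aggregate(partitions)
total_a = store_total(rollup, "A", date(2024, 1, 1), date(2024, 1, 2))
```

Pruning limits how many partitions a query reads, and the rollup limits how many rows it touches inside them; the two techniques are complementary.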
8. Usability
Solution: support JDBC/ODBC and standard SQL, provide a graphical interface for analysts without database expertise.
9. QPS
Solution: partition cache, result cache, multi‑replica deployment, and scaling hardware.
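A result cache of the kind listed above can be sketched as a small LRU map keyed by query text. The `ResultCache` class is hypothetical and deliberately simple: it keys on the stripped SQL string and must be invalidated explicitly, whereas a production cache would also need to account for data changes in the underlying partitions.

```python
from collections import OrderedDict


class ResultCache:
    """A small LRU cache keyed by query text (illustrative only)."""

    def __init__(self, capacity):
        self.capacity = capacity
        self._entries = OrderedDict()  # query text -> cached result
        self.hits = 0
        self.misses = 0

    def get_or_compute(self, sql, compute):
        """Return the cached result for sql, or run compute(sql) and cache it."""
        key = sql.strip()
        if key in self._entries:
            self._entries.move_to_end(key)  # mark as most recently used
            self.hits += 1
            return self._entries[key]
        self.misses += 1
        result = compute(sql)
        self._entries[key] = result
        if len(self._entries) > self.capacity:
            self._entries.popitem(last=False)  # evict the least recently used entry
        return result

    def invalidate(self):
        """Drop all cached results, e.g. after new data is loaded."""
        self._entries.clear()


backend_calls = []


def run_on_backend(sql):
    # Stand-in for an expensive query against the OLAP engine.
    backend_calls.append(sql)
    return f"result of {sql.strip()}"


cache = ResultCache(capacity=2)
first = cache.get_or_compute("SELECT 1", run_on_backend)
second = cache.get_or_compute("SELECT 1 ", run_on_backend)  # served from cache
```

Repeated dashboard queries are the typical beneficiary: the second identical request never reaches the backend, which is how caching raises the QPS a fixed cluster can sustain.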
10. Management
Current issues: manual disk replacement and data rebalancing are time‑consuming. Solutions: automated monitoring and alerting, blacklisting of faulty nodes so they are taken out of service, and scripted node replacement, which reduces downtime from hours to minutes.
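The blacklisting and rebalancing steps can be sketched as two small functions: one flags nodes whose heartbeat is overdue, the other moves shards off blacklisted nodes onto the least-loaded healthy ones. The heartbeat-timeout rule, the shard-to-node mapping, and the function names are assumptions for illustration and do not describe JD's actual tooling.

```python
def find_dead_nodes(last_heartbeat, now, timeout_s):
    """Return nodes whose last heartbeat is older than timeout_s seconds."""
    return {node for node, ts in last_heartbeat.items() if now - ts > timeout_s}


def reassign_shards(assignment, nodes, blacklist):
    """Move shards hosted on blacklisted nodes to the least-loaded healthy node.

    assignment maps shard -> node; a new mapping is returned and the input
    is left unchanged.
    """
    healthy = [node for node in nodes if node not in blacklist]
    if not healthy:
        raise RuntimeError("no healthy nodes left")
    load = {node: 0 for node in healthy}
    for node in assignment.values():
        if node in load:
            load[node] += 1
    new_assignment = dict(assignment)
    for shard in sorted(assignment):
        if assignment[shard] not in load:
            # Ties are broken by node name so the result is deterministic.
            target = min(healthy, key=lambda node: (load[node], node))
            new_assignment[shard] = target
            load[target] += 1
    return new_assignment


nodes = ["a", "b", "c"]
last_heartbeat = {"a": 100, "b": 40, "c": 95}  # seconds since some epoch
blacklist = find_dead_nodes(last_heartbeat, now=100, timeout_s=30)
assignment = {"s1": "a", "s2": "b", "s3": "b", "s4": "c"}
rebalanced = reassign_shards(assignment, nodes, blacklist)
```

Running such logic from a monitor, rather than by hand, is the kind of automation that shortens recovery: the failed node is excluded and its shards are redistributed without an operator walking through each step.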
Evolution of JD's OLAP
1.0 Era: Small order data, handled by relational databases (Oracle/MySQL).
2.0 Era: Added logistics, supply‑chain, customer service, payment; data grew to TB/PB, prompting offline warehouses using Hive and Spark.
3.0 Era: Real‑time queries introduced; unified OLAP service using Doris and ClickHouse to serve both batch and streaming workloads.
Future Plans
Management Platform Optimization:
Dynamic scaling of ClickHouse nodes.
Intelligent operations to automate node up/down and data balancing.
Real‑time cache enhancements in Doris.
Smart index management that auto‑creates indexes based on query patterns.
Q&A Highlights:
JD has not used Druid due to limited SQL support and unsuitability for rapidly changing order data.
ClickHouse excels in single‑table queries; Doris performs better on large joins and offers higher QPS in some scenarios.
Operational cost is lower for Doris because of automatic node scaling.
Data updates are simpler in Doris (overwrite) than in ClickHouse, which relies on multiple table engines.
Automatic engine selection is not yet implemented; users currently choose manually.
Data ingestion to ClickHouse follows two paths: legacy user‑managed pipelines or a unified OLAP service built on the platform.
End of presentation – thank you for listening.
DataFunTalk
Dedicated to sharing and discussing big data and AI technology applications, aiming to empower a million data scientists. Regularly hosts live tech talks and curates articles on big data, recommendation/search algorithms, advertising algorithms, NLP, intelligent risk control, autonomous driving, and machine learning/deep learning.
