
Druid Principles and Their Application in Insurance Data Analytics

This article summarizes a presentation by Ping An Insurance data engineers on Druid’s architecture, core concepts, node roles, tuning strategies, and real-world deployment for insurance analytics, illustrating how Druid enables sub‑second, high‑cardinality OLAP queries and supports both real‑time and batch processing.

DataFunTalk

The article is based on a talk by Ping An Insurance data engineers Li Kaibo and Guan Zhihua at the DataFunTalk big data technology salon, covering the principles of Druid and its practical use in insurance data analysis.

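The Druid discussed here is the open-source analytics data store, not Alibaba's Druid JDBC connection pool of the same name. The speakers describe it as a column-oriented, in-memory MOLAP database with a distributed, multi-node MMDB (main-memory database) architecture. It pre-aggregates data at ingestion time, is organized around time-series data, and integrates with systems such as Kafka, MySQL, and HDFS through plugins.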

The authors compare Druid with alternatives such as Spark and Elasticsearch and explain that they chose Druid for three reasons: sub-second response times, its handling of high-cardinality dimensions, and its fit with a Lambda architecture that serves both real-time and batch data.

Key architectural components are introduced: Broker (query node with REST interface), Historical nodes (offline storage and query), Realtime nodes (real‑time ingestion), Coordinator (load balancing and segment management), and external dependencies such as ZooKeeper, MySQL metadata store, and Deep Storage (HDFS/S3/local disk).
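To make the Broker's REST interface concrete, here is a minimal Python sketch that builds a native Druid timeseries query and the HTTP request that would carry it. The host name, the datasource `policy_events`, and the metric `premium` are hypothetical and do not come from the talk. The `/druid/v2` path and port 8082 are Druid's defaults for the Broker. The request is only constructed, not sent.

```python
import json
from urllib.request import Request

# Hypothetical host for illustration; only the /druid/v2 path and the
# default Broker port 8082 come from Druid itself.
BROKER_URL = "http://broker.example.internal:8082/druid/v2"


def build_timeseries_request(datasource, metric, interval):
    """Build (but do not send) a native timeseries query for the Broker."""
    query = {
        "queryType": "timeseries",
        "dataSource": datasource,
        "granularity": "day",
        "intervals": [interval],  # ISO-8601 interval: start/end
        "aggregations": [
            {"type": "doubleSum", "name": f"total_{metric}", "fieldName": metric}
        ],
    }
    return Request(
        BROKER_URL,
        data=json.dumps(query).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


request = build_timeseries_request(
    "policy_events", "premium", "2024-01-01/2024-02-01"
)
# urllib.request.urlopen(request) would send it to a live Broker.
```

The Broker fans such a query out to the Historical and Realtime nodes that hold the relevant segments and merges their partial results before responding.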

Operational details include segment design (timestamp‑based partitioning), memory and disk considerations, recommended memory allocations for Broker (20‑30 GB), and strategies for query optimization such as pushing aggregations to ingestion time and minimizing group‑by operations.
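The idea of pushing aggregation to ingestion time can be sketched in plain Python. This is not Druid code and not the speakers' implementation. It only illustrates the principle: rows that share a time bucket and the same dimension values collapse into one stored row, so queries scan fewer rows. The field names (`timestamp`, `product`, `region`, `premium`) are hypothetical.

```python
from collections import defaultdict
from datetime import datetime


def rollup(rows, dimensions, metric):
    """Pre-aggregate raw rows into one row per (hour bucket, dimension values)."""
    totals = defaultdict(lambda: {"count": 0, "sum": 0.0})
    for row in rows:
        ts = datetime.fromisoformat(row["timestamp"])
        # Truncate the timestamp to the hour, the chosen rollup granularity.
        bucket = ts.replace(minute=0, second=0, microsecond=0).isoformat()
        key = (bucket, tuple(row[d] for d in dimensions))
        totals[key]["count"] += 1
        totals[key]["sum"] += row[metric]

    result = []
    for (bucket, dim_values), agg in sorted(totals.items()):
        out = {"timestamp": bucket, **dict(zip(dimensions, dim_values))}
        out["count"] = agg["count"]
        out[f"{metric}_sum"] = agg["sum"]
        result.append(out)
    return result


raw = [
    {"timestamp": "2024-01-01T10:05:00", "product": "auto", "region": "SZ", "premium": 100.0},
    {"timestamp": "2024-01-01T10:40:00", "product": "auto", "region": "SZ", "premium": 50.0},
    {"timestamp": "2024-01-01T11:10:00", "product": "auto", "region": "SZ", "premium": 25.0},
]
rolled = rollup(raw, ["product", "region"], "premium")
```

Here the two rows from the 10:00 hour merge into a single row with a count of 2, while the 11:00 row stays separate. The trade-off is that detail finer than the rollup granularity is no longer queryable.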

Real‑world deployment in the BDAS system is described: migration from Cognos to Druid began in May, with full integration by December, supporting dozens of data sources, billions of rows, and achieving average query latency under 2 seconds for thousands of concurrent users.

Several use cases are presented: top‑N queries for overview dashboards, cross‑tab analysis with multi‑threaded execution, in‑Druid metric calculations versus pre‑computing in Hive, dimension merging/hiding, and row‑level access control based on department codes.
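One way row-level access control by department code could work is to rewrite each query before it reaches the Broker, adding a filter on the user's permitted departments. The talk summary does not say how the team implemented it, so the sketch below is an assumption. The dimension name `dept_code`, the datasource, and the helper name are hypothetical. The `in` and `and` filter shapes and the `topN` query fields follow Druid's native query format.

```python
import copy


def restrict_to_departments(query, dept_codes, dimension="dept_code"):
    """Return a copy of a native Druid query limited to the given departments."""
    if not dept_codes:
        # Refuse to build a query for a user with no permitted departments.
        raise ValueError("user has no permitted department codes")

    access_filter = {
        "type": "in",
        "dimension": dimension,
        "values": sorted(dept_codes),
    }
    restricted = copy.deepcopy(query)
    existing = restricted.get("filter")
    if existing is None:
        restricted["filter"] = access_filter
    else:
        # Keep the caller's filter and AND it with the access filter.
        restricted["filter"] = {"type": "and", "fields": [existing, access_filter]}
    return restricted


top_products = {
    "queryType": "topN",
    "dataSource": "policy_events",
    "dimension": "product",
    "metric": "total_premium",
    "threshold": 10,
    "granularity": "all",
    "intervals": ["2024-01-01/2024-02-01"],
    "aggregations": [
        {"type": "doubleSum", "name": "total_premium", "fieldName": "premium"}
    ],
}
secured = restrict_to_departments(top_products, {"D200", "D100"})
```

Applying the restriction in one place, rather than in each dashboard, means a top-N overview and a cross-tab report are limited to the same rows for the same user.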

The article concludes with a recommended architecture emphasizing multiple Broker nodes, two Coordinators, sufficient Realtime nodes, and tiered storage to separate hot and cold data, aiming for high availability, scalability, and fault tolerance.
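Hot and cold separation is usually expressed in Druid as Coordinator load rules, which are evaluated in order so that the first matching rule decides where a segment is loaded. The sketch below builds such a rule list in Python. The tier name `hot`, the one-month period, and the replica counts are illustrative assumptions, not values from the talk. `_default_tier` is the tier Historical nodes belong to when none is configured.

```python
import json


def tiered_load_rules(hot_period="P1M", hot_replicas=2, cold_replicas=1):
    """Build ordered load rules: recent segments on the hot tier, the rest on the default tier."""
    return [
        # Segments within the most recent period are loaded on the "hot" tier
        # (a hypothetical tier name for Historical nodes with faster hardware).
        {
            "type": "loadByPeriod",
            "period": hot_period,
            "tieredReplicants": {"hot": hot_replicas},
        },
        # Everything older falls through to this rule and stays on the default tier.
        {
            "type": "loadForever",
            "tieredReplicants": {"_default_tier": cold_replicas},
        },
    ]


rules = tiered_load_rules()
rules_json = json.dumps(rules, indent=2)
```

Because order matters, the period-based rule has to come before the catch-all rule. Reversing them would send every segment to the default tier.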

