Elasticsearch and Big Data: Architecture, Use Cases, and Advantages
This article explains what Elasticsearch is, how it solves database acceleration, log observability, and data analysis problems, details its core components and underlying engine features, compares its strengths and weaknesses, and presents classic application scenarios and a real‑world case study integrating Elasticsearch with Flink for large‑scale log analytics.
Speaker: Yan Xumian, Alibaba Cloud Solution Architect
Editor: Cai Liping, TRS
Platform: DataFunTalk
01 Elasticsearch and Big Data
1. What is Elasticsearch?
Elasticsearch (ES) is a leading search and analytics engine. At the time of the talk it ranked No. 8 overall in the DB-Engines popularity ranking and No. 1 among search engines. Together with Beats, Logstash, and Kibana, it forms the Elastic Stack, which serves as a big-data platform.
The four core products are:
Beats – data collection and delivery.
Logstash – data transport and ETL preprocessing.
Elasticsearch – storage, computation, query and analysis.
Kibana – visualization, reporting and analysis tools.
On top of these products, the Elastic community offers solutions such as Elastic Enterprise Search, Elastic Observability, and Elastic Security, and versions with an enhanced kernel are available in the cloud.
2. Problems solved by Elasticsearch
Database acceleration: Acts as a secondary index to speed up complex queries, reducing load on the primary database.
Full-log observability: With the ELK stack, logs from different sources are analyzed in Kibana on a single timeline to troubleshoot application issues.
Data analysis: Leverages structured and unstructured query capabilities as a powerful analytics engine.
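As a rough illustration of combining unstructured and structured conditions in one request, the sketch below builds an Elasticsearch query DSL body as a plain Python dictionary. The field names (`message`, `service`, `host`, `@timestamp`) and the helper `build_log_query` are hypothetical, and the sketch assumes `message` is mapped as a text field while `service` and `host` are keyword fields.

```python
import json


def build_log_query(keyword, service, since="now-15m", top_hosts=5):
    """Build a search body: full-text match + exact filters + a terms aggregation."""
    return {
        "size": 20,
        "query": {
            "bool": {
                # Unstructured part: scored full-text match on the log message.
                "must": [{"match": {"message": keyword}}],
                # Structured part: exact-value and range filters (not scored).
                "filter": [
                    {"term": {"service": service}},
                    {"range": {"@timestamp": {"gte": since}}},
                ],
            }
        },
        # Analytics part: count matching documents per host.
        "aggs": {"by_host": {"terms": {"field": "host", "size": top_hosts}}},
    }


body = build_log_query("timeout", "billing")
print(json.dumps(body, indent=2))
```

The resulting JSON is what would be sent as the body of a search request; the same body serves both the retrieval and the analysis use case because the hits and the aggregation come back in one response.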
3. How Elasticsearch solves these problems
Inverted index: Built on Lucene, it provides full-text search and query capabilities, but on its own it is poorly suited to sorting or aggregating on a field.
DocValues storage: Columnar storage for non-text fields, enabling efficient query, sort, and aggregation.
BKD tree: An index structure optimized for range queries on numeric fields.
Aggregation framework & FieldData: Bucket-based aggregations, with FieldData making text fields available for aggregation.
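The division of labor between the inverted index and columnar storage can be sketched with a toy model. This is not how Lucene stores data internally; it is a minimal illustration, with made-up documents, of why term lookup uses one structure and sorting or aggregation uses another.

```python
from collections import defaultdict

# Hypothetical sample documents.
docs = {
    1: {"message": "disk full on node a", "latency_ms": 120},
    2: {"message": "disk latency high", "latency_ms": 480},
    3: {"message": "node a restarted", "latency_ms": 35},
}

# Inverted index: term -> set of IDs of documents containing the term.
inverted = defaultdict(set)
for doc_id, doc in docs.items():
    for term in doc["message"].lower().split():
        inverted[term].add(doc_id)

# Doc-values-style column: document ID -> value, stored per field.
latency_column = {doc_id: doc["latency_ms"] for doc_id, doc in docs.items()}


def search(*terms):
    """Return IDs of documents that contain every given term."""
    postings = [inverted.get(term, set()) for term in terms]
    return set.intersection(*postings) if postings else set()


hits = search("disk")
# Sorting and aggregating read the column, not the inverted index.
sorted_hits = sorted(hits, key=lambda doc_id: latency_column[doc_id], reverse=True)
avg_latency = sum(latency_column[doc_id] for doc_id in hits) / len(hits)
```

Finding documents by term is a dictionary lookup in `inverted`, while answering "what is the latency of document 2" from the inverted index alone would require scanning every term, which is why a per-field column is kept alongside it.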
4. Selection advantages and disadvantages
Advantages
Supports almost all data types, both structured and unstructured.
Near‑real‑time search and analysis.
Native distributed architecture scales to petabytes and beyond.
Flexible query DSL, with support for multiple programming languages.
Disadvantages
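Limited support for multi-table joins, so data is usually denormalized, which leads to redundancy.
Near-real-time rather than real-time visibility: newly written data becomes searchable only after a refresh, and frequent refreshes hurt write performance.
No transaction support, so it is unsuitable for strict financial use cases.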
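The visibility trade-off in the list above can be illustrated with a toy model. `ToyShard` is a hypothetical class, not Elasticsearch code: it only mimics the idea that written documents sit in a buffer and become searchable after a refresh, which in Elasticsearch is governed by the `index.refresh_interval` setting.

```python
class ToyShard:
    """Toy model of near-real-time visibility: writes are buffered until refresh()."""

    def __init__(self):
        self._buffer = []      # indexed but not yet searchable
        self._searchable = []  # visible to search
        self.refresh_count = 0

    def index(self, doc):
        """Accept a write; the document is not searchable yet."""
        self._buffer.append(doc)

    def refresh(self):
        """Publish buffered documents. In a real engine this step costs I/O and CPU."""
        self._searchable.extend(self._buffer)
        self._buffer.clear()
        self.refresh_count += 1

    def search(self, keyword):
        """Return searchable documents that contain the keyword."""
        return [doc for doc in self._searchable if keyword in doc]


shard = ToyShard()
shard.index("order 1001 paid")
before = shard.search("1001")  # nothing visible before the refresh
shard.refresh()
after = shard.search("1001")   # visible after the refresh
```

Refreshing after every write would make data visible immediately but multiply the refresh cost, while refreshing rarely favors write throughput at the expense of freshness.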
5. Open‑source big‑data support
The limitations of ES can be mitigated by integrating other open‑source tools such as Fluentd, Flume, Kafka, Apache Flink for real‑time computation, and Hadoop for offline processing, all of which are compatible with ES.
6. Elasticsearch combined with real‑time computing
Flink provides dozens of connectors covering databases, message queues, and OLAP engines, allowing SQL‑based real‑time data integration. In the cloud, CDC (Change Data Capture) can directly read MySQL binlogs for shorter, cheaper, and more stable pipelines.
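To make the CDC idea concrete, the sketch below folds a list of change events from two tables into denormalized documents ready to be written to ES. It is plain Python, not Flink: the event format, the table names `users` and `orders`, and the helper `apply_changes` are all hypothetical, and a real Flink job would perform the join incrementally on a stream rather than over a finished list.

```python
def apply_changes(events):
    """Fold change events into denormalized order documents keyed by order ID."""
    users = {}   # latest user row by user_id
    orders = {}  # latest order row by order_id
    for event in events:
        table, op, row = event["table"], event["op"], event["row"]
        target, key = (users, "user_id") if table == "users" else (orders, "order_id")
        if op == "delete":
            target.pop(row[key], None)
        else:  # insert or update: keep only the latest version of the row
            target[row[key]] = row
    # Join each order with its user so that no join is needed at query time.
    return {
        order_id: {**order, "user_name": users.get(order["user_id"], {}).get("name")}
        for order_id, order in orders.items()
    }


events = [
    {"table": "users", "op": "insert", "row": {"user_id": 7, "name": "Li"}},
    {"table": "orders", "op": "insert", "row": {"order_id": 1, "user_id": 7, "amount": 30}},
    {"table": "orders", "op": "update", "row": {"order_id": 1, "user_id": 7, "amount": 45}},
    {"table": "orders", "op": "insert", "row": {"order_id": 2, "user_id": 9, "amount": 12}},
]
order_docs = apply_changes(events)
```

Keying the output by order ID means a later change to the same order overwrites the earlier document, which is the usual way to keep a search index consistent with its source database.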
02 Classic Applications and Case Analysis
1. Applicable scenarios
Online education – homework grading, course search, teacher management.
Gaming – community forums, tag systems, player metric analysis.
Internet entertainment – content search, recommendation recall, payment behavior analysis.
Industrial internet – geo‑search, enterprise information retrieval, sentiment analysis.
Transaction platforms – product search, order retrieval, marketing analytics.
Automotive & autonomous driving – comprehensive search, test‑track labeling, safety and commercial analysis.
ES also supports full‑stack operations monitoring, including log queries, metric monitoring, and application performance monitoring.
2. Case Study: IT Infrastructure & Micro‑Operations System
(1) Pain points
Rapidly increasing business scenarios (vehicle‑based micro‑services, charging‑pile data, member services) cause fast‑growing log volume and resource cost challenges.
Multiple data sources require table joins and data stitching.
Future log scale expected to exceed petabytes, demanding low‑cost storage, fast retrieval, and on‑demand analysis.
Legacy systems need quick integration with new cloud/on‑prem components while keeping architecture flexible and open.
(2) Solution
Build a log platform and query system based on ES and Flink.
(3) Value points
Low migration cost: ES/Flink are widely adopted; Alibaba Cloud managed services are fully compatible with open‑source versions, enabling a one‑week rollout.
High stability: High‑availability deployment; three months of production without incidents, far better than competitor‑built on‑prem ES.
Performance optimization: Proprietary high‑performance kernel improves write throughput tenfold; same VM specs handle previously impossible workloads.
Low storage cost: A tiered storage engine reduces expenses. For example, storing 1 PB of logs in OSS costs ¥126,000 per month, plus ¥30,000 for search and aggregation that return within seconds, which saves ¥209,000 compared with self-hosted ELK.
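The storage-cost figures can be checked with a few lines of arithmetic. The sketch reads the three amounts as ¥126,000, ¥30,000, and ¥209,000 (taking "W" as 万, i.e. 10,000). The implied self-hosted cost is not stated in the talk: it is derived here under the assumption that the saving is monthly and is measured against the sum of the two managed-service items.

```python
# Figures from the talk, in yuan per month.
oss_storage = 126_000          # 1 PB of logs stored in OSS
search_and_aggregation = 30_000
stated_saving = 209_000        # versus self-hosted ELK

managed_total = oss_storage + search_and_aggregation

# Assumption: saving = self-hosted cost - managed total (not stated in the source).
implied_self_hosted = managed_total + stated_saving
saving_ratio = stated_saving / implied_self_hosted

print(f"managed total: {managed_total}")
print(f"implied self-hosted cost: {implied_self_hosted}")
print(f"saving ratio: {saving_ratio:.1%}")
```

Under that assumption the managed setup costs a little under half of the implied self-hosted figure; if the saving was measured differently, only the first line of the result still holds.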
(4) Architecture analysis
The overall architecture consists of data collection, data aggregation & transmission, storage/index/compute, and visualization.
① Data collection: Sources include data‑collection platforms (vehicle and road‑base data), application services (container data via Beats), and databases (RDS).
② Data aggregation & transmission: Kafka buffers logs; Logstash consumes Kafka, performs lightweight processing, and writes to ES. Flink reads binlogs for real‑time table joins and feeds results into ES.
③ Storage / Index / Compute: Alibaba Cloud ES provides read‑write separation and hot‑cold tiered storage.
④ Visualization: Kibana offers log query, analysis, and dashboards for operations and business developers.
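The "lightweight processing, then write to ES" step in the pipeline above can be sketched as follows. This is a stand-in for what Logstash does, not its implementation: the log line layout, the regular expression, the sample lines, and the index name `app-logs` are assumptions. The output follows the newline-delimited JSON format of the Elasticsearch bulk API, where each document is preceded by an action line.

```python
import json
import re

# Assumed line layout: "<timestamp> <LEVEL> <service> <free-text message>".
LOG_PATTERN = re.compile(
    r"^(?P<ts>\S+) (?P<level>[A-Z]+) (?P<service>\S+) (?P<message>.*)$"
)


def parse_line(line):
    """Turn one raw log line into a structured document, or None if malformed."""
    match = LOG_PATTERN.match(line.strip())
    return match.groupdict() if match else None


def to_bulk_payload(lines, index_name):
    """Build a bulk request body: one action line plus one source line per document."""
    out = []
    for line in lines:
        doc = parse_line(line)
        if doc is None:
            continue  # skip lines that do not match the expected layout
        out.append(json.dumps({"index": {"_index": index_name}}))
        out.append(json.dumps(doc))
    # The bulk body must end with a newline.
    return "\n".join(out) + "\n" if out else ""


raw_lines = [
    "2024-01-01T00:00:00Z ERROR charging-pile connector timeout",
    "garbage",
    "2024-01-01T00:00:01Z INFO member-service login ok",
]
payload = to_bulk_payload(raw_lines, "app-logs")
print(payload)
```

Batching documents into one bulk body rather than sending them one by one is what lets a consumer reading from Kafka keep up with high log volumes.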
(5) Selection advantage – Elasticsearch
Write performance: Read‑write separation avoids resource contention; peak write throughput up to 200 MB/s; write‑related compute cost reduced by 60%.
Storage cost: Optimized media lowers cost by 70% versus high‑performance cloud disks; pay‑as‑you‑go cold storage without pre‑purchasing capacity; three‑replica mechanism ensures data safety; SATA‑disk query performance improves by 100%.
(6) Selection advantage – Real‑time Computing (Flink)
Performance: up to twice the speed of the open‑source version.
Stability: early fixes for engine defects ahead of the community.
Enterprise capabilities: CDC supports more data sources, whole‑database sync, and schema change sync.
Thank you for listening.
