Elasticsearch Deep Dive: Features, Mapping & Zero‑Downtime Reindexing
This article provides a comprehensive overview of Elasticsearch, covering its distributed architecture, key features such as JSON RESTful APIs and multi‑tenant support, core functionalities like full‑text search and aggregations, comparisons with Solr, advanced mapping techniques, various query DSLs, suggestion mechanisms, and practical zero‑downtime reindexing strategies.
Overview
Elasticsearch returns search results within seconds, in near real time. It is deployed as a distributed cluster that scales out easily and can handle petabyte‑scale data. Results are sorted by relevance score, so the most relevant documents come first.
Features
Easy installation: No other dependencies; after download, a cluster can be set up by modifying a few parameters.
JSON: Input and output are JSON, and no schema has to be defined up front.
RESTful: Almost all operations (indexing, querying, configuration) are accessible via HTTP.
Distributed: Nodes are peers; adding nodes automatically balances load.
Multi‑tenant: Separate indices can be created for different purposes, allowing simultaneous operations.
Supports massive data: Can scale to petabyte‑level structured and unstructured data with near‑real‑time processing.
Functions
Distributed search engine: Elasticsearch automatically distributes massive data across multiple servers for storage and retrieval.
Full‑text search: Provides fuzzy search, relevance ranking, highlighting, etc.
Data analysis engine (aggregations): Example: community site user login statistics, feature usage over the past week or month.
Near‑real‑time processing of massive data: Distributed architecture enables large‑scale storage and retrieval.
Scenarios
Search scenarios: Person lookup, device lookup, in‑app search, order search.
Log analysis: Classic ELK stack (Elasticsearch/Logstash/Kibana) for log collection, storage, and analysis.
Data alert platforms: Example: community group‑buy alerts when price drops below a threshold, triggering notifications.
Business BI systems: Analyze regional user spending, generate reports, predict hot‑selling products, and provide targeted recommendations using Elasticsearch for analysis and Kibana for visualization.
Comparison
1) Solr uses Zookeeper for distributed management, while Elasticsearch has built‑in coordination.
2) Solr offers more comprehensive features out of the box; Elasticsearch focuses on core functions with many advanced features provided by third‑party plugins.
3) Solr performs better in traditional search use cases, whereas Elasticsearch excels in real‑time search.
At the time of writing, the mainstream version is Elasticsearch 7.x (latest 7.8). Improvements in 7.x include a bundled JDK, the upgrade to Lucene 8 with faster top‑k retrieval, and a memory circuit breaker that helps avoid OutOfMemory errors.
Basic Concepts
IK Analyzer
IKAnalyzer is an open‑source lightweight Chinese tokenizer written in Java. Version 3.0 is a standalone component that can be used with Lucene and provides default optimizations.
Features of IK Analyzer 3.0:
Uses a forward‑iterating finest‑granularity segmentation algorithm with processing speed of 600 k characters/second.
Uses multiple sub‑processors in its analysis mode, handling English letters (IP addresses, email addresses, URLs), numbers (dates, Chinese quantity words, Roman numerals, scientific notation), and Chinese words (names, places).
Supports custom dictionary for personal term optimization, reducing memory usage.
Provides IKQueryParser for Lucene full‑text search optimization and disambiguation.
Combines tokens to greatly improve Lucene hit rate.
Extended dictionary: ext_dict
Stopword dictionary: stop_dict
Synonym dictionary: same_dict
Index (Database‑like)
Settings
Define index settings such as number of shards and replicas.
Mapping (Schema‑like)
Field data types
Analyzer types
Whether to store the field or create an index
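As a rough illustration of how settings and a mapping fit together, the sketch below builds the JSON body of a create‑index request as a plain Python dict. The index name book, the field names, and the shard and replica counts are assumptions chosen for the example. The ik_max_word and ik_smart analyzers only exist when the IK plugin is installed, and the typeless mappings layout assumes Elasticsearch 7.x.

```python
import json


def build_index_body(shards: int = 3, replicas: int = 1) -> dict:
    """Return a create-index request body with settings and a mapping."""
    return {
        "settings": {
            "number_of_shards": shards,      # fixed once the index is created
            "number_of_replicas": replicas,  # can be changed on a live index
        },
        "mappings": {
            "properties": {
                # Full-text field, analyzed with the IK plugin (assumed installed)
                "title": {
                    "type": "text",
                    "analyzer": "ik_max_word",
                    "search_analyzer": "ik_smart",
                },
                # Exact-match field, not analyzed
                "isbn": {"type": "keyword"},
                "price": {"type": "double"},
                # Returned in _source but not searchable
                "cover_url": {"type": "keyword", "index": False},
            }
        },
    }


body = build_index_body()
print(json.dumps(body, indent=2))  # would be sent as: PUT /book
```

The body is only printed here; sending it to a cluster is left to whichever HTTP or Elasticsearch client is in use.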
Document (Data)
Full updates use PUT.
Partial updates use POST.
Advanced Features
Advanced Mapping
Geo‑point data type
Geo‑point represents a location on Earth using latitude and longitude, useful for distance calculations and region queries. The field type must be declared as geo_point.
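A minimal sketch of the two pieces involved: a mapping that declares a geo_point field, and a geo_distance filter that uses it. The field names name and location, the coordinates, and the 5km default radius are assumptions made for the example.

```python
# Mapping body: the location field must be declared as geo_point explicitly,
# because dynamic mapping will not infer it from a lat/lon object.
GEO_MAPPING = {
    "mappings": {
        "properties": {
            "name": {"type": "text"},
            "location": {"type": "geo_point"},
        }
    }
}


def geo_distance_query(lat: float, lon: float, distance: str = "5km") -> dict:
    """Return a search body matching documents within `distance` of a point."""
    return {
        "query": {
            "bool": {
                "filter": {
                    "geo_distance": {
                        "distance": distance,
                        "location": {"lat": lat, "lon": lon},
                    }
                }
            }
        }
    }


nearby = geo_distance_query(39.9042, 116.4074, "2km")
print(nearby)
```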
Dynamic Mapping
Dynamic mapping automatically determines field data types and adds new fields to the mapping.
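To make the idea concrete, the following sketch imitates the default type inference in plain Python. It is a simplified model, not Elasticsearch code: it ignores date detection and numeric detection, and it reports strings as text even though Elasticsearch also adds a keyword sub‑field by default. The function names are hypothetical.

```python
def infer_dynamic_type(value):
    """Approximate the field type dynamic mapping would pick for a JSON value."""
    if isinstance(value, bool):   # check bool first: bool is a subclass of int
        return "boolean"
    if isinstance(value, int):
        return "long"
    if isinstance(value, float):
        return "float"
    if isinstance(value, str):
        return "text"             # simplified: no date detection, no keyword sub-field
    if isinstance(value, dict):
        return "object"
    if isinstance(value, list):
        # Arrays take the type of their first non-null element
        for item in value:
            if item is not None:
                return infer_dynamic_type(item)
        return None
    return None                   # null values add no field


def infer_mapping(doc: dict) -> dict:
    """Return {field: inferred_type} for the top-level fields of a document."""
    mapping = {}
    for field, value in doc.items():
        field_type = infer_dynamic_type(value)
        if field_type is not None:
            mapping[field] = field_type
    return mapping


sample = {"name": "Alice", "age": 30, "score": 4.5, "active": True,
          "tags": ["a", "b"], "address": {"city": "Hangzhou"}, "nickname": None}
inferred = infer_mapping(sample)
print(inferred)
```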
Advanced DSL
Match all query
Full‑text queries
Match query
Match phrase query
Query string
Multi‑match query
Term‑level queries
Term
Terms
Range
Prefix
Wildcard
Regexp
Fuzzy
Compound queries
Sorting (sort), pagination (from and size), highlighting (highlight), bulk operations (_bulk)
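The query types above are usually combined in one request body. The sketch below shows one possible combination: a bool compound query with a match clause, term and range filters, plus sorting, pagination, and highlighting. The field names title, category, and price are assumptions made for the example.

```python
def build_search_body(keyword: str, category: str, min_price: float,
                      max_price: float, page: int = 1, page_size: int = 10) -> dict:
    """Return a search body combining full-text, term-level and compound queries."""
    return {
        "query": {
            "bool": {
                # Full-text clause: analyzed and scored
                "must": [{"match": {"title": keyword}}],
                # Term-level clauses: exact matches, no scoring
                "filter": [
                    {"term": {"category": category}},
                    {"range": {"price": {"gte": min_price, "lte": max_price}}},
                ],
            }
        },
        "sort": [{"price": {"order": "asc"}}, "_score"],
        "from": (page - 1) * page_size,  # offset of the first hit
        "size": page_size,               # hits per page
        "highlight": {"fields": {"title": {}}},
    }


search_body = build_search_body("elasticsearch", "tech", 10, 100, page=3)
print(search_body)
```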
Aggregations
Aggregations compute metrics (max, min, sum, avg, etc.) over a query result set. Bucket aggregations group the matching documents (similar to SQL GROUP BY), and metrics can then be computed per bucket.
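As a sketch of a bucket aggregation with nested metrics, the body below groups the past week's usage events by feature and computes the average and maximum duration per group. The field names login_time, feature, and duration_ms are assumptions, and feature is assumed to be a keyword field.

```python
def build_usage_stats_body(days: int = 7, top_n: int = 10) -> dict:
    """Return a search body that buckets recent events by feature."""
    return {
        "size": 0,  # only aggregation results are needed, no hits
        "query": {"range": {"login_time": {"gte": f"now-{days}d/d"}}},
        "aggs": {
            "by_feature": {
                # Bucket aggregation: one bucket per distinct feature value
                "terms": {"field": "feature", "size": top_n},
                # Metric aggregations computed inside each bucket
                "aggs": {
                    "avg_duration": {"avg": {"field": "duration_ms"}},
                    "max_duration": {"max": {"field": "duration_ms"}},
                },
            }
        },
    }


stats_body = build_usage_stats_body()
print(stats_body)
```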
Intelligent Suggestions
Term Suggester
Phrase Suggester
Completion Suggester
Context Suggester
If Completion Suggester returns zero matches, try Phrase Suggester; if still no match, fall back to Term Suggester. Precision ranking: Completion > Phrase > Term; recall ranking is the opposite. Completion Suggester is the fastest; use it when it meets business needs.
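The fallback order can be sketched as a small loop over suggesters, tried from most precise to highest recall. The three suggester functions below are hypothetical stand‑ins with hard‑coded data. In a real system each would call the corresponding Elasticsearch suggester and return its options.

```python
from typing import Callable, List, Sequence, Tuple

Suggester = Callable[[str], List[str]]


def suggest_with_fallback(text: str,
                          suggesters: Sequence[Tuple[str, Suggester]]
                          ) -> Tuple[str, List[str]]:
    """Try each suggester in order and return the first non-empty result."""
    for name, suggest in suggesters:
        options = suggest(text)
        if options:
            return name, options
    return "none", []


# Hypothetical stand-ins for calls to the real suggesters
def completion(text: str) -> List[str]:
    return [w for w in ["elasticsearch", "elastic stack"] if w.startswith(text)]


def phrase(text: str) -> List[str]:
    return ["elastic search"] if text == "elastc search" else []


def term(text: str) -> List[str]:
    return ["elastic"] if text == "elastc" else []


chain = [("completion", completion), ("phrase", phrase), ("term", term)]
print(suggest_with_fallback("elas", chain))
print(suggest_with_fallback("elastc", chain))
```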
Practical Optimizations
Write Optimizations
Set replica count to 0 during initial bulk load, then restore after writing.
Enable auto‑generated IDs to avoid existence checks.
Use appropriate analyzers: avoid binary type; use different analyzers for title and text to improve speed.
Disable scoring and increase index refresh interval.
Batch multiple index operations.
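Several of the write optimizations above can be sketched as request bodies. The example shows index settings to apply before and after a bulk load, and a helper that batches documents into the newline‑delimited format the bulk endpoint expects, omitting _id so that IDs are auto‑generated. The index name book, the 30s refresh interval, and the restored values are assumptions made for the example.

```python
import json
from typing import Iterable

# Body for PUT /book/_settings before the initial load
BULK_LOAD_SETTINGS = {"index": {"number_of_replicas": 0, "refresh_interval": "30s"}}
# Body for PUT /book/_settings once the load has finished (assumed target values)
RESTORE_SETTINGS = {"index": {"number_of_replicas": 1, "refresh_interval": "1s"}}


def to_bulk_ndjson(index: str, docs: Iterable[dict]) -> str:
    """Serialize documents as a bulk request body with auto-generated IDs."""
    lines = []
    for doc in docs:
        # Action line without _id, so Elasticsearch generates one
        lines.append(json.dumps({"index": {"_index": index}}))
        lines.append(json.dumps(doc))
    # The bulk body must end with a newline
    return "\n".join(lines) + "\n"


payload = to_bulk_ndjson("book", [{"title": "A"}, {"title": "B"}])
print(payload)
```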
Read Optimizations
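Use filter context instead of query context where relevance does not matter: filter clauses skip scoring and their results can be cached. Place them in the filter section of a bool query.
Split indices by time (day, month, or year) and query only the indices that cover the requested range.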
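Both read optimizations can be sketched in a few lines: a helper that lists only the daily indices covering a date range, and a query that puts every clause in filter context. The logs-YYYY.MM.DD naming pattern and the field names status and created_at are assumptions made for the example.

```python
from datetime import date, timedelta
from typing import List


def daily_indices(prefix: str, start: date, end: date) -> List[str]:
    """Return the names of the daily indices covering [start, end]."""
    days = (end - start).days
    return [f"{prefix}-{start + timedelta(days=i):%Y.%m.%d}" for i in range(days + 1)]


def filter_only_query(status: str, since: str) -> dict:
    """Return a search body whose clauses all run in filter context (no scoring)."""
    return {
        "query": {
            "bool": {
                "filter": [
                    {"term": {"status": status}},
                    {"range": {"created_at": {"gte": since}}},
                ]
            }
        }
    }


targets = daily_indices("logs", date(2020, 7, 30), date(2020, 8, 1))
print(",".join(targets))  # comma-separated index list for the search URL
print(filter_only_query("error", "2020-07-30"))
```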
Zero‑Downtime Reindexing Strategies
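External data import via MQ: An operator sends a trigger message through the MQ console or CLI. A microservice consumer receives it and starts the import: it queries the DB for the total row count, splits the work into pages, and publishes one message per page to MQ. Consumers then load each page, assemble the JSON documents, and use bulk to index them into the new cluster.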
Scroll + bulk + alias: Create new index book_new with desired mapping and settings; use Scroll API to retrieve data in batches; bulk‑load into book_new; switch alias book_alias to new index without code changes.
Reindex API: The Reindex API wraps scroll and bulk so that an index can be rebuilt without external tools. It has been available since Elasticsearch 2.3, so 6.3.1+ clusters support it.
Manual effort and flexibility: custom MQ import > scroll + bulk > Reindex API. Stability and reliability: custom MQ import < scroll + bulk < Reindex API.
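The last two strategies end with the same step, switching the alias. The sketch below builds the request bodies for a reindex and for an atomic alias switch. The names book_new and book_alias come from the text above. The old index name book is an assumption.

```python
def reindex_body(source: str, dest: str) -> dict:
    """Body for POST /_reindex: copy every document from source to dest."""
    return {"source": {"index": source}, "dest": {"index": dest}}


def alias_swap_body(alias: str, old_index: str, new_index: str) -> dict:
    """Body for POST /_aliases: move the alias in one atomic request."""
    return {
        "actions": [
            {"remove": {"index": old_index, "alias": alias}},
            {"add": {"index": new_index, "alias": alias}},
        ]
    }


copy_request = reindex_body("book", "book_new")
swap_request = alias_swap_body("book_alias", "book", "book_new")
print(copy_request)
print(swap_request)
```

Because both alias actions travel in one request, clients that query book_alias never see a moment where the alias points at no index, which is what makes the switch possible without code changes.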
Deep Paging Performance Solution
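Using from + size for deep pagination over massive result sets (e.g., sending announcements to all users in a province) is impractical: every shard has to collect and sort from + size hits, and by default Elasticsearch rejects requests where from + size exceeds 10,000 (index.max_result_window). Alternatives such as the Scroll API or search_after are needed.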
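One common alternative is search_after, which continues from the sort values of the last hit instead of skipping an offset. The sketch below pages through all hits with a generator. The search callable is a stand‑in for a real client call, fake_search simulates it in memory, and the unique sort field user_id is an assumption.

```python
from typing import Callable, Iterator, List


def iter_all_hits(search: Callable[[dict], List[dict]],
                  page_size: int = 1000) -> Iterator[dict]:
    """Yield every hit by following search_after from page to page."""
    body = {
        "size": page_size,
        "query": {"match_all": {}},
        "sort": [{"user_id": "asc"}],  # must be a unique, stable sort key
    }
    while True:
        hits = search(body)
        if not hits:
            return
        yield from hits
        # Continue after the sort values of the last hit on this page
        body["search_after"] = hits[-1]["sort"]


def fake_search(body: dict) -> List[dict]:
    """In-memory stand-in for a search call, for demonstration only."""
    users = [{"_source": {"user_id": i}, "sort": [i]} for i in range(1, 8)]
    after = body.get("search_after", [0])[0]
    return [u for u in users if u["sort"][0] > after][: body["size"]]


ids = [hit["_source"]["user_id"] for hit in iter_all_hits(fake_search, page_size=3)]
print(ids)
```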
Alibaba Cloud Developer
Alibaba's official tech channel, featuring all of its technology innovations.