Elasticsearch Deep Dive: Features, Mapping & Zero‑Downtime Reindexing

This article provides a comprehensive overview of Elasticsearch, covering its distributed architecture, key features such as JSON RESTful APIs and multi‑tenant support, core functionalities like full‑text search and aggregations, comparisons with Solr, advanced mapping techniques, various query DSLs, suggestion mechanisms, and practical zero‑downtime reindexing strategies.

Alibaba Cloud Developer

Mar 4, 2024

Overview

Elasticsearch returns search results within seconds. Its cluster is a distributed deployment that scales out easily and can handle petabyte‑scale data. Results are sorted by relevance score, so the most relevant matches come first.

Features

Easy installation: No other dependencies; after download, a cluster can be set up by modifying a few parameters.

JSON: Input and output are JSON, so documents can be indexed without defining a schema up front.

RESTful: Almost all operations (indexing, querying, configuration) are accessible via HTTP.

Distributed: Nodes are peers; adding nodes automatically balances load.

Multi‑tenant: Separate indices can be created for different purposes, allowing simultaneous operations.

Supports massive data: Can scale to petabyte‑level structured and unstructured data with near‑real‑time processing.

Functions

Distributed search engine: Elasticsearch automatically distributes massive data across multiple servers for storage and retrieval.

Full‑text search: Provides fuzzy search, relevance ranking, highlighting, etc.

Data analysis engine (aggregations): Example: community site user login statistics, feature usage over the past week or month.

Near‑real‑time processing of massive data: Distributed architecture enables large‑scale storage and retrieval.

Scenarios

Search scenarios: Person lookup, device lookup, in‑app search, order search.

Log analysis: Classic ELK stack (Elasticsearch/Logstash/Kibana) for log collection, storage, and analysis.

Data alert platforms: Example: community group‑buy alerts when price drops below a threshold, triggering notifications.

Business BI systems: Analyze regional user spending, generate reports, predict hot‑selling products, and provide targeted recommendations using Elasticsearch for analysis and Kibana for visualization.

Comparison

1) Solr uses Zookeeper for distributed management, while Elasticsearch has built‑in coordination.

2) Solr offers more comprehensive features out of the box; Elasticsearch focuses on core functions with many advanced features provided by third‑party plugins.

3) Solr performs better in traditional search use cases, whereas Elasticsearch excels in real‑time search.

The mainstream version discussed here is Elasticsearch 7.x (7.8 is the latest cited). Improvements in 7.x include a bundled JDK, an upgrade to Lucene 8 that improves top‑K query performance, and a memory circuit breaker that helps avoid out‑of‑memory (OOM) errors.

Basic Concepts

IK Analyzer

IKAnalyzer is an open‑source lightweight Chinese tokenizer written in Java. Version 3.0 is a standalone component that can be used with Lucene and provides default optimizations.

Features of IK Analyzer 3.0:

Uses a forward‑iterating, finest‑granularity segmentation algorithm with a processing speed of about 600,000 characters per second.

Multi‑processor analysis mode supporting English letters (IP, Email, URL), numbers (dates, Chinese quantity words, Roman numerals, scientific notation), and Chinese words (names, places).

Supports user‑defined dictionaries for custom terms, with dictionary storage optimized to reduce memory usage.

Provides IKQueryParser for Lucene full‑text search optimization and disambiguation.

Combines tokens to greatly improve Lucene hit rate.

Extended dictionary: ext_dict

Stopword dictionary: stop_dict

Synonym dictionary: same_dict

Index (Database‑like)

Settings

Define index settings such as number of shards and replicas.

Mapping (Schema‑like)

Field data types

Analyzer types

Whether to store the field or create an index
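As a rough sketch of how settings and mapping fit together, the body of a create‑index request can be assembled as a plain dictionary. The field names, shard and replica counts, and the `ik_max_word` analyzer (which assumes the IK plugin is installed) are illustrative assumptions, not values from the original article.

```python
import json


def build_index_body(shards: int, replicas: int) -> dict:
    """Return a create-index request body, e.g. for PUT /book (7.x, typeless mapping)."""
    return {
        # Settings: shard and replica counts for the index.
        "settings": {
            "number_of_shards": shards,
            "number_of_replicas": replicas,
        },
        # Mapping: field types, analyzers, and whether a field is indexed.
        "mappings": {
            "properties": {
                # Analyzed full-text field (hypothetical; needs the IK plugin).
                "title": {"type": "text", "analyzer": "ik_max_word"},
                # Exact-value field, not analyzed.
                "author": {"type": "keyword"},
                # Kept in _source but not searchable.
                "cover_url": {"type": "keyword", "index": False},
                "price": {"type": "double"},
            }
        },
    }


body = build_index_body(shards=3, replicas=1)
print(json.dumps(body, indent=2))
```

The number of primary shards is fixed when the index is created, which is one reason a mapping or settings change often leads to the reindexing strategies discussed later.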

Document (Data)

Full updates use PUT. Partial updates use POST.
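The difference between the two kinds of update can be sketched by building the HTTP requests as data. This assumes the typeless 7.x endpoints (`/<index>/_doc/<id>` and `/<index>/_update/<id>`); the index name `book` and the document fields are hypothetical.

```python
import json
from typing import Tuple

Request = Tuple[str, str, dict]  # (HTTP method, path, JSON body)


def full_update(index: str, doc_id: str, document: dict) -> Request:
    """PUT replaces the stored document with the body that is sent."""
    return ("PUT", f"/{index}/_doc/{doc_id}", document)


def partial_update(index: str, doc_id: str, changes: dict) -> Request:
    """POST to _update merges only the given fields into the stored document."""
    return ("POST", f"/{index}/_update/{doc_id}", {"doc": changes})


put_req = full_update("book", "1", {"title": "ES in practice", "price": 42.0})
post_req = partial_update("book", "1", {"price": 39.9})

for method, path, body in (put_req, post_req):
    print(method, path, json.dumps(body))
```

With a full update, any field left out of the body is lost, so partial updates are usually the safer choice when only one or two fields change.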

Advanced Features

Advanced Mapping

Geo‑point data type

Geo‑point represents a location on Earth using latitude and longitude, useful for distance calculations and region queries. The field type must be declared as geo_point .
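A minimal sketch of a geo_point mapping and a distance filter follows. The field name `location`, the coordinates, and the radius are assumptions chosen for illustration.

```python
import json

# Mapping fragment: the field must be declared as geo_point explicitly.
geo_mapping = {
    "mappings": {
        "properties": {
            "location": {"type": "geo_point"}
        }
    }
}


def geo_distance_query(lat: float, lon: float, distance: str) -> dict:
    """Return a search body matching documents within `distance` of a point."""
    return {
        "query": {
            "bool": {
                "filter": {
                    "geo_distance": {
                        "distance": distance,  # e.g. "5km"
                        "location": {"lat": lat, "lon": lon},
                    }
                }
            }
        }
    }


nearby = geo_distance_query(lat=30.27, lon=120.15, distance="5km")
print(json.dumps(nearby, indent=2))
```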

Dynamic Mapping

Dynamic mapping automatically determines field data types and adds new fields to the mapping.
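The default detection rules can be approximated in a few lines. This is a simplified sketch, not Elasticsearch's implementation: the date check below is a rough stand‑in for the built‑in date detection, and the function name `infer_type` is hypothetical.

```python
import re
from typing import Any, Optional

# Rough stand-in for date detection: YYYY-MM-DD with an optional time part.
_DATE_RE = re.compile(r"^\d{4}-\d{2}-\d{2}(T\d{2}:\d{2}:\d{2})?$")


def infer_type(value: Any) -> Optional[str]:
    """Approximate the field type dynamic mapping would pick for a JSON value."""
    if value is None:
        return None  # null values do not add a field
    if isinstance(value, bool):  # check bool before int: bool is an int subclass
        return "boolean"
    if isinstance(value, int):
        return "long"
    if isinstance(value, float):
        return "float"
    if isinstance(value, dict):
        return "object"
    if isinstance(value, list):
        # Arrays take the type of their first non-null element.
        for item in value:
            if item is not None:
                return infer_type(item)
        return None
    if isinstance(value, str):
        # Strings become date if they look like one, otherwise text
        # (Elasticsearch also adds a keyword sub-field by default).
        return "date" if _DATE_RE.match(value) else "text"
    raise TypeError(f"unsupported JSON value: {value!r}")


doc = {"name": "router-01", "online": True, "ports": [None, 8], "seen": "2024-03-04"}
inferred = {field: infer_type(v) for field, v in doc.items()}
print(inferred)
```

Because the first document decides the type, an unexpected first value (for example an integer where decimals arrive later) can lock in a mapping that later needs a reindex to fix.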

Advanced DSL

Match all query

Full‑text queries

Match query

Match phrase query

Query string

Multi‑match query

Term‑level queries

Term

Terms

Range

Prefix

Wildcard

Regexp

Fuzzy

Compound queries
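To make the query families above concrete, here is a small sketch that builds a few of them as dictionaries and combines them in a compound bool query. The helper names and the fields (`title`, `author`, `price`, `status`) are hypothetical.

```python
import json
from typing import Iterable


def match(field: str, text: str) -> dict:
    """Full-text query: the text is analyzed before matching."""
    return {"match": {field: text}}


def match_phrase(field: str, phrase: str) -> dict:
    """Full-text query: terms must appear in the given order."""
    return {"match_phrase": {field: phrase}}


def multi_match(text: str, fields: Iterable[str]) -> dict:
    """Full-text query over several fields."""
    return {"multi_match": {"query": text, "fields": list(fields)}}


def term(field: str, value) -> dict:
    """Term-level query: exact value, no analysis."""
    return {"term": {field: value}}


def range_(field: str, **bounds) -> dict:
    """Term-level query: bounds such as gte=10, lte=50."""
    return {"range": {field: bounds}}


def bool_query(must=(), filters=(), must_not=()) -> dict:
    """Compound query combining other queries."""
    clauses = {}
    if must:
        clauses["must"] = list(must)
    if filters:
        clauses["filter"] = list(filters)
    if must_not:
        clauses["must_not"] = list(must_not)
    return {"bool": clauses}


body = {
    "query": bool_query(
        must=[match("title", "elasticsearch reindex")],
        filters=[term("author", "alice"), range_("price", gte=10, lte=50)],
        must_not=[term("status", "draft")],
    )
}
print(json.dumps(body, indent=2))
```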

Sorting (sort), pagination (from, size), highlighting (highlight), bulk operations (bulk)
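These options can be sketched as one search body plus a bulk payload. The field names and the index name `book` are assumptions; the bulk format shown is the newline‑delimited JSON expected by the `_bulk` endpoint, where each action line is followed by its document and the payload ends with a newline.

```python
import json
from typing import Dict


def search_body(text: str, page: int, page_size: int) -> dict:
    """Search body with sorting, pagination, and highlighting."""
    return {
        "query": {"match": {"title": text}},
        "sort": [{"price": {"order": "asc"}}],
        "from": (page - 1) * page_size,  # offset of the first hit
        "size": page_size,               # hits per page
        "highlight": {"fields": {"title": {}}},
    }


def bulk_payload(index: str, docs: Dict[str, dict]) -> str:
    """Build an NDJSON body for POST /_bulk that indexes each document."""
    lines = []
    for doc_id, doc in docs.items():
        lines.append(json.dumps({"index": {"_index": index, "_id": doc_id}}))
        lines.append(json.dumps(doc))
    return "\n".join(lines) + "\n"  # the trailing newline is required


page_three = search_body("elasticsearch", page=3, page_size=10)
payload = bulk_payload("book", {"1": {"title": "A"}, "2": {"title": "B"}})
print(json.dumps(page_three))
print(payload, end="")
```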

Aggregations

Aggregations compute metrics (max, min, sum, avg, etc.) over a query result set. Bucket aggregations group the matching documents (similar to SQL GROUP BY), and metrics can then be computed within each bucket.
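As an illustration of a bucket aggregation with a nested metric, the sketch below groups recent usage events by feature and averages a duration per group. The field names (`timestamp`, `feature`, `duration_ms`) and the function name are hypothetical.

```python
import json


def usage_by_feature(days: int, top_n: int = 10) -> dict:
    """Search body: group events from the last `days` days by feature."""
    return {
        "size": 0,  # only aggregation results are needed, not the hits
        "query": {"range": {"timestamp": {"gte": f"now-{days}d/d"}}},
        "aggs": {
            "by_feature": {
                # Bucket aggregation: one bucket per distinct feature value.
                "terms": {"field": "feature", "size": top_n},
                # Metric aggregation computed inside each bucket.
                "aggs": {"avg_duration": {"avg": {"field": "duration_ms"}}},
            }
        },
    }


weekly = usage_by_feature(days=7)
print(json.dumps(weekly, indent=2))
```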

Intelligent Suggestions

Term Suggester

Phrase Suggester

Completion Suggester

Context Suggester

If Completion Suggester returns zero matches, try Phrase Suggester; if still no match, fall back to Term Suggester. Precision ranking: Completion > Phrase > Term; recall ranking is the opposite. Completion Suggester is the fastest; use it when it meets business needs.
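The fallback order described above can be sketched as a simple chain. The three stub functions stand in for real calls to the completion, phrase, and term suggesters; they and their canned results are hypothetical.

```python
from typing import Callable, List, Sequence

Suggester = Callable[[str], List[str]]


def suggest_with_fallback(text: str, suggesters: Sequence[Suggester]) -> List[str]:
    """Try each suggester in order and return the first non-empty result."""
    for suggester in suggesters:
        results = suggester(text)
        if results:
            return results
    return []


# Hypothetical stubs; a real system would query Elasticsearch in each one.
def completion_stub(text: str) -> List[str]:
    return ["elasticsearch"] if "elasticsearch".startswith(text) else []


def phrase_stub(text: str) -> List[str]:
    return []


def term_stub(text: str) -> List[str]:
    return ["elasticsearch"] if text == "elasticsaerch" else []


chain = [completion_stub, phrase_stub, term_stub]  # most precise first
print(suggest_with_fallback("elast", chain))
print(suggest_with_fallback("elasticsaerch", chain))
```

Ordering the chain from most precise to highest recall means the cheaper, stricter suggester answers most requests and the broader ones only run when it finds nothing.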

Practical Optimizations

Write Optimizations

Set replica count to 0 during initial bulk load, then restore after writing.

Enable auto‑generated IDs to avoid existence checks.

Use appropriate analyzers: avoid binary type; use different analyzers for title and text to improve speed.

Disable scoring where relevance ranking is not needed, and increase the index refresh interval.

Batch multiple index operations.
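Two of the write optimizations above map directly to index settings. The sketch below builds the bodies for `PUT /<index>/_settings` before and after a bulk load; the specific values (30s during the load, one replica and 1s afterwards) are assumptions to adjust per cluster.

```python
import json


def bulk_load_settings(refresh_interval: str = "30s") -> dict:
    """Settings to apply before an initial bulk load."""
    return {
        "index": {
            "number_of_replicas": 0,               # no replica writes during the load
            "refresh_interval": refresh_interval,  # refresh less often than the 1s default
        }
    }


def restore_settings(replicas: int = 1, refresh_interval: str = "1s") -> dict:
    """Settings to apply once the load has finished."""
    return {
        "index": {
            "number_of_replicas": replicas,
            "refresh_interval": refresh_interval,
        }
    }


before = bulk_load_settings()
after = restore_settings(replicas=1)
print("before load:", json.dumps(before))
print("after load: ", json.dumps(after))
```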

Read Optimizations

Use filter context instead of query context where relevance scores are not needed: filters skip scoring and their results can be cached. Combine filters inside a bool query.

Split data into time‑based indices (by day, month, or year) and query only the indices that cover the requested period.
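Both read optimizations can be sketched together: a bool query that uses only filter clauses, and a helper that lists the daily indices covering a date range. The `logs-YYYY.MM.DD` naming pattern and the field names are assumptions.

```python
from datetime import date, timedelta
from typing import List


def filtered_query(status: str, min_price: float) -> dict:
    """Bool query with filter clauses only, so no relevance score is computed."""
    return {
        "query": {
            "bool": {
                "filter": [
                    {"term": {"status": status}},
                    {"range": {"price": {"gte": min_price}}},
                ]
            }
        }
    }


def daily_indices(prefix: str, start: date, end: date) -> List[str]:
    """Names of the daily indices covering start..end inclusive."""
    days = (end - start).days
    return [
        f"{prefix}-{(start + timedelta(days=i)).strftime('%Y.%m.%d')}"
        for i in range(days + 1)
    ]


targets = daily_indices("logs", date(2024, 3, 1), date(2024, 3, 3))
# The names can be joined with commas in the search path, e.g. /a,b,c/_search.
print("/" + ",".join(targets) + "/_search")
print(filtered_query("paid", 10.0))
```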

Zero‑Downtime Reindexing Strategies

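External data import via MQ: An operator sends a message through the MQ console or CLI, and a microservice consumer receives it and triggers the Elasticsearch import. The microservice queries the database for the total row count, splits the work into pages, and sends the page information to the MQ. Consumers then assemble the JSON documents for each page and use bulk to index them into the new cluster.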

Scroll + bulk + alias: Create new index book_new with desired mapping and settings; use Scroll API to retrieve data in batches; bulk‑load into book_new; switch alias book_alias to new index without code changes.

Reindex API: Elasticsearch provides a built‑in Reindex API (available since 2.3, so it is present in 6.3.1 and later) that wraps scroll and bulk to rebuild indices without external tools.

Manual involvement and flexibility: custom import > scroll+bulk > reindex. Stability and reliability: custom import < scroll+bulk < reindex.
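The last two strategies end with the same alias switch, which can be sketched as request bodies. `book_new` and `book_alias` come from the text above; the old index name `book_old` is a hypothetical placeholder.

```python
import json


def reindex_body(source_index: str, dest_index: str) -> dict:
    """Body for POST /_reindex: copy documents from source to destination."""
    return {
        "source": {"index": source_index},
        "dest": {"index": dest_index},
    }


def alias_swap_body(alias: str, old_index: str, new_index: str) -> dict:
    """Body for POST /_aliases: move the alias in one atomic request."""
    return {
        "actions": [
            {"remove": {"index": old_index, "alias": alias}},
            {"add": {"index": new_index, "alias": alias}},
        ]
    }


copy_req = reindex_body("book_old", "book_new")
swap_req = alias_swap_body("book_alias", "book_old", "book_new")
print("POST /_reindex", json.dumps(copy_req))
print("POST /_aliases", json.dumps(swap_req))
```

Because both actions are applied in a single request, clients that search through `book_alias` never see a moment when the alias points at no index, which is what makes the switch zero‑downtime.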

Deep Paging Performance Solution

Using from + size for massive pagination (e.g., sending announcements to all users in a province) is impractical; alternative approaches are needed.
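One commonly used alternative is `search_after`, sketched below; the Scroll API is another. By default `from + size` cannot exceed `index.max_result_window` (10,000), whereas `search_after` continues from the sort values of the last hit on the previous page. The `province` and `user_id` fields are assumptions, and `user_id` is assumed to be unique so the sort order is stable.

```python
import json
from typing import Optional, Sequence


def next_page_body(province: str, page_size: int,
                   last_sort_values: Optional[Sequence] = None) -> dict:
    """Search body for the next page, resuming after the last hit seen."""
    body = {
        "size": page_size,
        "query": {"bool": {"filter": [{"term": {"province": province}}]}},
        # A deterministic sort with a unique tiebreaker is required.
        "sort": [{"user_id": "asc"}],
    }
    if last_sort_values is not None:
        # Copy the "sort" array from the last hit of the previous page.
        body["search_after"] = list(last_sort_values)
    return body


first_page = next_page_body("zhejiang", page_size=1000)
second_page = next_page_body("zhejiang", page_size=1000, last_sort_values=["u-000999"])
print(json.dumps(first_page))
print(json.dumps(second_page))
```

Each request costs roughly the same regardless of how deep the page is, because no shard has to collect and discard the preceding hits.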
