
Introduction to Elasticsearch: Core Concepts, Query Types, Pagination, and Data Synchronization

This article provides a comprehensive overview of Elasticsearch, covering its distributed storage architecture, core data model concepts, analysis and query capabilities, practical next‑token pagination techniques, join strategies, and various data synchronization methods for integrating Elasticsearch with other systems.

High Availability Architecture

Elasticsearch (ES) is a distributed storage and search engine; well-known deployments include search on Wikipedia and GitHub. This article introduces its core concepts: nodes, clusters, shards, and replicas, along with the data‑model elements index, type, and document.

It explains the analysis capabilities of ES, covering the inverted index, analyzers, tokenization, normalization, and filtering, and discusses the limitations of built‑in analyzers for Chinese text.
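To make the analysis pipeline concrete, here is a minimal sketch of a toy inverted index in Python. It is not how ES implements analysis internally. The stop-word list, the regex tokenizer, and the sample documents are all assumptions chosen for illustration.

```python
import re
from collections import defaultdict

# Hypothetical stop-word list; real analyzers ship their own.
STOP_WORDS = {"a", "an", "the", "is", "of", "and"}


def analyze(text):
    """Tokenize on ASCII letters and digits, lowercase, drop stop words."""
    tokens = re.findall(r"[A-Za-z0-9]+", text)             # tokenization
    normalized = [t.lower() for t in tokens]               # normalization
    return [t for t in normalized if t not in STOP_WORDS]  # filtering


def build_inverted_index(docs):
    """Map each term to the sorted list of document ids that contain it."""
    postings = defaultdict(set)
    for doc_id, text in docs.items():
        for term in analyze(text):
            postings[term].add(doc_id)
    return {term: sorted(ids) for term, ids in postings.items()}


# Sample documents are made up for this sketch.
docs = {
    1: "The quick brown fox",
    2: "Quick search is the goal of Elasticsearch",
    3: "A brown dog",
}
index = build_inverted_index(docs)
print(index["quick"])
print(index["brown"])
```

The regex here only recognizes ASCII letters and digits, so it yields no tokens at all for Chinese text. That is a limitation of the toy, but it hints at why languages without whitespace word boundaries need a dedicated tokenizer.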

The article then describes the major query types supported by ES, from word‑level queries such as term and fuzzy to full‑text queries such as match and match_phrase. It details the structure of the Bool query (must, should, must_not, filter) and explains relevance scoring in terms of TF‑IDF and field length (ES 5.0 and later default to BM25, which builds on the same signals).
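As a rough illustration of the Bool query structure, the sketch below assembles a request body as a Python dictionary. The helper `build_bool_query` and the field names (`description`, `name`, `status`, `category`) are hypothetical; only the four clause names come from the Query DSL.

```python
import json


def build_bool_query(must=None, should=None, must_not=None, filter_=None):
    """Assemble a bool query body, omitting clauses that were not given."""
    clauses = {
        "must": must,
        "should": should,
        "must_not": must_not,
        "filter": filter_,
    }
    return {"query": {"bool": {k: v for k, v in clauses.items() if v}}}


# Field names below are hypothetical.
body = build_bool_query(
    # must: has to match, contributes to the relevance score
    must=[{"match": {"description": "distributed search"}}],
    # should: optional, raises the score when it matches
    should=[{"match_phrase": {"name": "high availability"}}],
    # must_not: excludes matching documents, not scored
    must_not=[{"term": {"status": "deleted"}}],
    # filter: has to match, not scored
    filter_=[{"term": {"category": "middleware"}}],
)
print(json.dumps(body, indent=2))
```

Putting exact-match conditions under filter rather than must keeps them out of scoring, which is usually what is wanted for status or category constraints.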

For practical pagination, it presents the sort + search_after (nextToken) approach with example DSL, showing how to construct the request and use the returned cursor for subsequent pages:

GET /service_version_index/service_version_type/_search
{
  "size": 100,
  "sort": [
    {"gmt_modified": "desc"},
    {"score": "desc"},
    {"id": "desc"}
  ],
  ...
}

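Example of the cursor returned by ES: the sort array of the last hit, holding one value per sort key in the same order as the request (the values shown are illustrative):

{
  "sort": [1614561419000, 4.5, "6FxZJXgBE6QbUWetnarH"]
}

Using the cursor for the next page (sort and query must stay the same as in the first request):

GET /service_version_index/service_version_type/_search
{
  "size": 100,
  "sort": [
    {"gmt_modified": "desc"},
    {"score": "desc"},
    {"id": "desc"}
  ],
  "query": { ... },
  "search_after": [1614561419000, 4.5, "6FxZJXgBE6QbUWetnarH"]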
}
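The cursor handling can be sketched on the client side as follows. This is a minimal sketch that builds request bodies only and does not call ES. The helper name `next_page_request` and the sample hits are assumptions, and the example uses two sort keys instead of three to stay short.

```python
import copy


def next_page_request(base_request, hits):
    """Return the request body for the following page, or None if there is none.

    base_request: dict with "size", "sort" and optionally "query".
    hits: the list found under response["hits"]["hits"] for the current page.
    """
    if not hits:
        return None  # an empty page means the result set is exhausted
    cursor = hits[-1]["sort"]  # sort values of the last hit on this page
    if len(cursor) != len(base_request["sort"]):
        raise ValueError("cursor must hold one value per sort key")
    request = copy.deepcopy(base_request)  # leave the first-page request untouched
    request["search_after"] = cursor
    return request


base = {
    "size": 2,
    "sort": [{"gmt_modified": "desc"}, {"id": "desc"}],
}
# Hypothetical hits, shaped like the entries of an ES search response.
page = [
    {"_id": "b", "sort": [1614561419000, "b"]},
    {"_id": "a", "sort": [1614561300000, "a"]},
]
print(next_page_request(base, page))
```

Including a unique field such as the document id as the last sort key acts as a tiebreaker, so that hits with identical timestamps are neither skipped nor repeated across pages.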

The article also covers strategies for implementing joins in ES, including parent‑child documents, service‑side joins, and the use of wide tables, comparing wide versus narrow table designs.
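The merge step behind a service‑side join, or behind building a wide table at write time, can be sketched as a plain in-memory join. The function name and the sample rows are hypothetical; the field names `service_id` and `score` are borrowed from the SQL view shown later in the article.

```python
def join_to_wide_docs(versions, services):
    """Attach each service's score to its versions, one wide document per version."""
    services_by_id = {s["service_id"]: s for s in services}
    wide_docs = []
    for version in versions:
        service = services_by_id.get(version["service_id"])
        if service is None:
            continue  # inner-join semantics: skip versions with no matching service
        doc = dict(version)  # copy so the input rows are not modified
        doc["score"] = service["score"]
        wide_docs.append(doc)
    return wide_docs


# Sample rows are made up for this sketch.
versions = [
    {"id": 1, "service_id": 10, "version": "1.0"},
    {"id": 2, "service_id": 10, "version": "1.1"},
    {"id": 3, "service_id": 99, "version": "0.1"},
]
services = [{"service_id": 10, "score": 4.5}]
print(join_to_wide_docs(versions, services))
```

Done at query time, this costs extra round trips on every search; done at write time, it duplicates the service fields into every version document, which is the usual trade-off between narrow and wide tables.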

Finally, data synchronization methods are discussed, ranging from manual writes and Alibaba Cloud DTS to Logstash and view‑based ETL. An example of creating a SQL view for feeding ES is provided:

CREATE VIEW my_view AS
SELECT sv.*, s.score, sc.category
FROM service_version sv
JOIN service s ON sv.service_id = s.service_id
JOIN service_category sc ON s.service_id = sc.service_id;
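Once rows are read from such a view, they still have to be turned into ES write requests. The sketch below renders rows as a newline-delimited body for the `_bulk` endpoint. How the rows are fetched is left out, and the `id` column, the sample rows, and the helper name are assumptions.

```python
import json


def rows_to_bulk_ndjson(rows, index_name, id_field):
    """Render rows as the newline-delimited body the _bulk endpoint expects."""
    lines = []
    for row in rows:
        # Action line: index this document under the row's primary key.
        action = {"index": {"_index": index_name, "_id": str(row[id_field])}}
        lines.append(json.dumps(action))
        # Source line: the document itself.
        lines.append(json.dumps(row))
    return "\n".join(lines) + "\n"  # the bulk body must end with a newline


# Hypothetical rows, as they might come back from SELECT * FROM my_view.
rows = [
    {"id": 1, "service_id": 10, "score": 4.5, "category": "middleware"},
    {"id": 2, "service_id": 11, "score": 3.0, "category": "storage"},
]
payload = rows_to_bulk_ndjson(rows, "service_version_index", "id")
print(payload)
```

Using the source row's primary key as `_id` makes repeated syncs overwrite the same document instead of creating duplicates.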

Additional references and resources are listed for further reading.

Big Data, search engine, Elasticsearch, pagination, Data Synchronization, Distributed Storage, Query DSL
Written by High Availability Architecture
