Designing and Implementing Elasticsearch for Large‑Scale Data Search and Storage

This article details the business background, technical advantages, architecture, indexing mechanisms, clustering, data synchronization strategies, API design, and performance monitoring of Elasticsearch, illustrating how it replaces costly SQL LIKE queries with a scalable, high‑performance search solution for massive user activity data.

Architecture Digest

Jan 15, 2022

The author begins by describing a scenario in which a company’s SQL Server holds tens of millions of records for user reading history and product search. Searches over these tables rely on SQL LIKE queries that degrade into full‑table scans, and the database offers limited room to scale, so query performance becomes a bottleneck.

Elasticsearch is introduced as a NoSQL, document‑oriented search engine that offers horizontal scalability, fast full‑text search, and rich aggregation capabilities, making it suitable for the company’s high‑frequency read/write workloads.

Key advantages of Elasticsearch are presented in a table format, highlighting horizontal scaling, shard‑based parallelism, near‑real‑time indexing, and high availability through replica shards.

The article explains the core indexing structures: inverted index for full‑text search and doc values for efficient aggregations, noting their memory and build‑time trade‑offs.
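
To make the two structures concrete, here is a toy in-memory sketch in C#. It is only an illustration under simplifying assumptions: Lucene's real segments are immutable, compressed, and stored on disk, and the names `ToyIndex`, `Add`, `Search`, and `CountByCategory` are hypothetical, not part of Elasticsearch or of the author's code.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Toy model only; not how Lucene physically stores either structure.
public sealed class ToyIndex
{
    // Inverted index: term -> ids of the documents that contain it.
    private readonly Dictionary<string, SortedSet<int>> _postings = new();

    // Doc values: document id -> field value, a column read during aggregations.
    private readonly Dictionary<int, string> _categoryColumn = new();

    public void Add(int docId, string text, string category)
    {
        // Naive analysis: lower-case and split on spaces.
        foreach (var term in text.ToLowerInvariant()
                     .Split(' ', StringSplitOptions.RemoveEmptyEntries))
        {
            if (!_postings.TryGetValue(term, out var ids))
                _postings[term] = ids = new SortedSet<int>();
            ids.Add(docId);
        }
        _categoryColumn[docId] = category;
    }

    // Full-text lookup: one dictionary probe instead of scanning every row.
    public List<int> Search(string term) =>
        _postings.TryGetValue(term.ToLowerInvariant(), out var ids)
            ? ids.ToList()
            : new List<int>();

    // Aggregation: walk the column for the matching ids and count per value.
    public Dictionary<string, int> CountByCategory(IEnumerable<int> docIds) =>
        docIds.GroupBy(id => _categoryColumn[id])
              .ToDictionary(g => g.Key, g => g.Count());
}
```

Searching goes from term to documents, while aggregating goes from document to field value, which is why a search engine keeps both structures and pays for both at index time.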

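Cluster sharding and the two‑phase query process are illustrated with a diagram of the query flow: in the distributed phase the coordinating node forwards the query to every relevant shard, and in the merge phase it combines the per‑shard results into one ranked list.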
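
The two phases can be sketched as a small scatter/gather routine. This is a simplified model, assuming each shard is just an in-memory list of scored hits; `Hit`, `ScatterGather`, `QueryShard`, and `Merge` are hypothetical names used for illustration.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public record Hit(int DocId, double Score);

public static class ScatterGather
{
    // Distributed phase: every shard returns its own best (from + size) hits,
    // because any of them could end up on the requested page.
    public static List<Hit> QueryShard(IEnumerable<Hit> shardDocs, int from, int size) =>
        shardDocs.OrderByDescending(h => h.Score).ThenBy(h => h.DocId)
                 .Take(from + size)
                 .ToList();

    // Merge phase: the coordinating node re-sorts all shard results
    // and cuts out the page the caller asked for.
    public static List<Hit> Merge(IEnumerable<List<Hit>> shardResults, int from, int size) =>
        shardResults.SelectMany(r => r)
                    .OrderByDescending(h => h.Score).ThenBy(h => h.DocId)
                    .Skip(from).Take(size)
                    .ToList();
}
```

Because every shard must return `from + size` candidates, the work grows with the page offset, which is the reason deep pagination with plain offsets becomes expensive.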

Design decisions include wrapping Elasticsearch calls in a .NET 5 WebAPI service to hide technical details, using RabbitMQ for asynchronous writes, and employing both push (CDC) and pull (scheduled batch) data synchronization strategies. The author prefers a pull approach with Quartz.NET scheduled jobs due to operational simplicity.
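
A pull-style synchronization run can be sketched as a watermark loop. This is a minimal sketch, not the author's job: it assumes the source table exposes an `UpdatedAt` column and a numeric `Id` (neither is stated in the article), and `SourceRow`, `PullSync`, and the `bulkIndex` delegate are hypothetical stand-ins for the database query and the Elasticsearch bulk write.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical shape of a changed row read from the source database.
public record SourceRow(long Id, DateTime UpdatedAt, string Payload);

public sealed class PullSync
{
    // Last position already synchronized; Id breaks ties between equal timestamps.
    private (DateTime UpdatedAt, long Id) _watermark = (DateTime.MinValue, 0L);

    // One scheduled run: fetch rows changed after the watermark, index them,
    // then advance the watermark. Returns the number of rows sent.
    public int RunOnce(IEnumerable<SourceRow> source,
                       Action<IReadOnlyList<SourceRow>> bulkIndex,
                       int batchSize = 1000)
    {
        // In a real job this filter would be a WHERE clause on the source table.
        var batch = source
            .Where(r => (r.UpdatedAt, r.Id).CompareTo(_watermark) > 0)
            .OrderBy(r => r.UpdatedAt).ThenBy(r => r.Id)
            .Take(batchSize)
            .ToList();
        if (batch.Count == 0) return 0;

        // If this throws, the watermark is not advanced and the batch is retried.
        bulkIndex(batch);
        _watermark = (batch[^1].UpdatedAt, batch[^1].Id);
        return batch.Count;
    }
}
```

A scheduled job would call `RunOnce` on each trigger. Since a failed run is retried, the write should be idempotent, for example by using the row id as the document id.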

Implementation snippets are shown, such as the abstract ElasticsearchEntity base class and a concrete entity definition:

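// Base class shared by every document type stored in Elasticsearch.
public abstract class ElasticsearchEntity
{
    private Guid? _id;
    // Document id; generated lazily on first read if the caller did not set one.
    public Guid Id { get => _id ??= Guid.NewGuid(); set => _id = value; }
    private long? _timestamp;
    // Creation time, mapped as a long field named "timestamp".
    // DateTimeToTimestampOfMicrosecond is the author's own extension method (not shown here).
    [Number(NumberType.Long, Name = "timestamp")]
    public long Timestamp { get => _timestamp ??= DateTime.Now.DateTimeToTimestampOfMicrosecond(); set => _timestamp = value; }
}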

and a consumer that writes user view duration records:

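// Consumes UserViewDurationMessage items from the queue and writes them to Elasticsearch.
public class UserViewDurationConsumer : BaseConsumer<UserViewDurationMessage>
{
    private readonly ElasticClient _elasticClient;
    public UserViewDurationConsumer(ElasticClient elasticClient) { _elasticClient = elasticClient; }
    // The method name keeps the spelling of the BaseConsumer member it overrides.
    public override void Excute(UserViewDurationMessage msg)
    {
        // Map the queue message to the Elasticsearch document type.
        var document = msg.MapTo<Entity.UserViewDuration>();
        // Write to a per-month index, e.g. userviewrecord-2022-01.
        var result = _elasticClient.Create(document, a => a.Index($"userviewrecord-{msg.CreateDateTime:yyyy-MM}")).GetApiResult();
        // On failure the error is only logged; the message is not retried here.
        if (result.Failed) LoggerHelper.WriteToFile(result.Message);
    }
}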
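
The consumer writes to one index per month, so a query over a date range has to name every monthly index it touches. A small helper along these lines could compute those names. It is a sketch: `MonthlyIndex`, `NameFor`, and `NamesBetween` are hypothetical, and it formats with the invariant culture so that the index name does not depend on the server's regional calendar settings.

```csharp
using System;
using System.Collections.Generic;
using System.Globalization;

public static class MonthlyIndex
{
    // Builds a name such as "userviewrecord-2022-01" for the month containing `date`.
    public static string NameFor(DateTime date, string prefix = "userviewrecord") =>
        prefix + "-" + date.ToString("yyyy-MM", CultureInfo.InvariantCulture);

    // Lists the monthly index names that cover the inclusive range [from, to].
    public static IEnumerable<string> NamesBetween(DateTime from, DateTime to,
                                                   string prefix = "userviewrecord")
    {
        for (var month = new DateTime(from.Year, from.Month, 1);
             month <= to;
             month = month.AddMonths(1))
        {
            yield return NameFor(month, prefix);
        }
    }
}
```

Elasticsearch also accepts wildcard index patterns such as `userviewrecord-*`; an explicit list simply avoids touching months that cannot contain matching data.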

API endpoints for querying reading records and searching works are provided, demonstrating the use of search_after, scroll, and minimumShouldMatch to handle deep pagination and combined must / should clauses.
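
The idea behind search_after is cursor-based (keyset) pagination: instead of skipping N results, the client sends back the sort values of the last hit it received. The sketch below simulates that over an in-memory list and is not the author's endpoint code. `Doc`, `SearchAfterDemo`, and `Page` are hypothetical names, and the sort order (timestamp descending, then id as a tiebreaker) is an assumption.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public record Doc(string Id, long Timestamp);

public static class SearchAfterDemo
{
    // Returns the next `size` documents in (Timestamp desc, Id asc) order.
    // `after` holds the sort values of the last hit on the previous page,
    // or null when requesting the first page.
    public static List<Doc> Page(IEnumerable<Doc> docs, int size,
                                 (long Timestamp, string Id)? after = null)
    {
        IEnumerable<Doc> ordered = docs
            .OrderByDescending(d => d.Timestamp)
            .ThenBy(d => d.Id, StringComparer.Ordinal);

        if (after is not null)
        {
            var cursor = after.Value;
            // Keep only documents that sort strictly after the cursor.
            ordered = ordered.Where(d =>
                d.Timestamp < cursor.Timestamp ||
                (d.Timestamp == cursor.Timestamp &&
                 string.CompareOrdinal(d.Id, cursor.Id) > 0));
        }
        return ordered.Take(size).ToList();
    }
}
```

The tiebreaker matters: without a unique second sort key, documents that share a timestamp with the cursor could be skipped or returned twice.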

Alias management is described to enable zero‑downtime index swaps, using Elasticsearch aliases to point to the latest index while safely removing old ones.
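
An alias swap is a single request to the `_aliases` endpoint that removes the alias from the old index and adds it to the new one; Elasticsearch applies the listed actions atomically, so readers never see the alias pointing at nothing. The helper below only builds that request body. `AliasSwap` and `BuildBody` are hypothetical names, and how the body is sent (client library or raw HTTP) is left open.

```csharp
using System.Text.Json;

public static class AliasSwap
{
    // Builds the JSON body for POST /_aliases that moves `alias`
    // from `oldIndex` to `newIndex` in one atomic request.
    public static string BuildBody(string alias, string oldIndex, string newIndex)
    {
        var body = new
        {
            actions = new object[]
            {
                new { remove = new { index = oldIndex, alias } },
                new { add = new { index = newIndex, alias } },
            }
        };
        return JsonSerializer.Serialize(body);
    }
}
```

For example, `AliasSwap.BuildBody("works", "works-v1", "works-v2")` (illustrative names) yields a body with one `remove` and one `add` action. The old index can be deleted once the swap has succeeded.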

Monitoring is handled with Elastic APM + Kibana, offering observability of the new search layer.

In conclusion, the migration to Elasticsearch was performed smoothly, delivering significant performance gains and meeting both current and future scalability requirements.


