How a Mistaken Delete in ElasticSearch Nearly Erased 17 Million Products – Key Lessons

A senior engineer accidentally issued a DELETE request against an ElasticSearch index holding 17 million product records, wiping out almost all of the data. This article covers the incident, the team's recovery strategy, the scaling problems the rebuild exposed, and the process improvements that followed, as guidance for backend developers.

dbaplus Community

Project Background

The author worked at a fast‑growing e‑commerce startup leading two teams that built core backend services. One team created a product catalog service that stored inventory, product metadata, pricing, and fulfillment data for roughly 17 million items, exposing the data via a REST API.

The catalog is backed by an ElasticSearch cluster because the service needs to support more than 50 different filters, some with full‑text search.

Simplified architecture of the denormalized read model

ElasticSearch Overview

Unlike traditional databases, where write access is often restricted to DBAs, ElasticSearch is accessed directly through its REST interface. In version 5 the document URL pattern was {cluster_endpoint}/{index}/{type}/{id}; mapping types were deprecated in 7.x and removed in 8.0. Operations are performed as HTTP calls using methods such as GET, POST, PUT, and DELETE.
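To make the risk concrete, here is a minimal sketch of how such request targets are formed. The endpoint, index, type, and ID values are hypothetical, and nothing is sent over the network; the code only builds and prints request lines.

```python
def document_url(endpoint: str, index: str, doc_type: str, doc_id: str) -> str:
    """Build a version-5 style document URL: {endpoint}/{index}/{type}/{id}."""
    return f"{endpoint.rstrip('/')}/{index}/{doc_type}/{doc_id}"


def index_url(endpoint: str, index: str) -> str:
    """Build the URL of the index itself."""
    return f"{endpoint.rstrip('/')}/{index}"


def describe(method: str, url: str) -> str:
    """Render a request line as text without sending anything."""
    return f"{method.upper()} {url}"


# Hypothetical values, for illustration only.
ENDPOINT = "http://localhost:9200"

# Reads a single document.
print(describe("GET", document_url(ENDPOINT, "products", "product", "42")))

# Same index URL, different method: this request would drop the whole index.
print(describe("GET", index_url(ENDPOINT, "products")))
print(describe("DELETE", index_url(ENDPOINT, "products")))
```

The last two request lines differ only in the HTTP method, which is why a single wrong selection in an HTTP client is enough to turn a read into an index deletion.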

Event Recap

During a busy Friday, a teammate needed to export data using a filter that was not available in the public API. The author opened a Postman session, intended to issue a GET request, but mistakenly selected DELETE and sent the request, deleting the entire product index.

Selecting HTTP method in Postman
Canceling a request in Postman

The author cancelled the request in Postman, but that only stopped the client: the delete operation had already reached the ElasticSearch server. Subsequent checks showed that only a few hundred documents remained instead of the expected 17 million.

Recovery Options

The team convened an emergency war‑room. Because the catalog is a read‑model, they could rebuild it from upstream services. Two main approaches were considered:

1. Re-import all data via a custom component that pulls it from the other microservices through synchronous REST calls, a process that would take about six days.

2. Replay event streams: many services could replay their events, and some critical domains already supported data replay.

They ultimately combined both methods, reducing the rebuild time from six days to a few hours.
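The source does not describe how the work was split between the two methods. As one possible reading, a minimal sketch could choose a strategy per domain: replay events where a domain supports it, and fall back to the slower REST re-import otherwise. The domain names and the `plan_rebuild` helper are hypothetical.

```python
from typing import Dict, Iterable, Set


def plan_rebuild(domains: Iterable[str], replayable: Set[str]) -> Dict[str, str]:
    """Pick a rebuild strategy for each domain.

    Domains that can replay their event stream use "event_replay";
    all others fall back to "rest_reimport".
    """
    return {
        domain: "event_replay" if domain in replayable else "rest_reimport"
        for domain in domains
    }


# Hypothetical domains, loosely based on the data the catalog is said to hold.
domains = ["inventory", "metadata", "pricing", "fulfillment"]
replayable = {"inventory", "pricing"}

for domain, strategy in plan_rebuild(domains, replayable).items():
    print(f"{domain}: {strategy}")
```

The fewer domains that are left on the REST path, the less load the rebuild places on upstream services.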

Catalog Updater architecture

Lessons Learned

1. Backup and rebuild speed – While most databases are regularly backed up, the ElasticSearch read‑model lacked proper protection. Rebuilding a read‑model of this scale is time‑consuming; the team managed to cut the rebuild window to a few hours by combining full re‑import and event replay.

2. Horizontal scaling limits – The rebuild component relies on synchronous REST calls to many services, which quickly saturates those services and defeats the expected horizontal scalability of micro‑services.

3. Role‑based access control – The team migrated to ElasticSearch 7, introduced X‑Pack (now free), and created distinct read‑only and write roles, restricting direct write access to the index.

4. Process responsibility – Mistakes stemmed from poor processes rather than individuals. The team instituted stricter approval workflows, automated safeguards, and limited direct database access to reduce human error.
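As an illustration of the role separation described in lesson 3, the sketch below builds request bodies for Elasticsearch's create-role API (`PUT /_security/role/<name>`). The index name and role names are assumptions, not the team's actual configuration, and the code only prints the requests rather than sending them.

```python
import json

# Hypothetical index name; the source does not give the real one.
INDEX = "products"


def role_body(privileges):
    """Build a create-role request body scoped to a single index."""
    return {"indices": [{"names": [INDEX], "privileges": list(privileges)}]}


# Read-only role: search and fetch documents, nothing else.
READ_ONLY = role_body(["read"])

# Writer role: may also index, update, and delete documents,
# but is not granted the privilege to delete the index itself.
WRITER = role_body(["read", "write"])

# Hypothetical role names.
for name, body in {"catalog_reader": READ_ONLY, "catalog_writer": WRITER}.items():
    print(f"PUT /_security/role/{name}")
    print(json.dumps(body, indent=2))
```

Neither role includes `delete_index`, the index privilege required to drop an index, so an accidental DELETE on the index URL by a user holding only these roles would be rejected.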

Conclusion

The incident highlighted the fragility of large denormalised read‑models, the importance of automated rebuild pipelines, proper access controls, and robust operational processes. By learning from this near‑catastrophe, the team improved reliability and reduced the risk of similar outages in the future.
