How a Mistaken Delete in ElasticSearch Nearly Erased 17 Million Products – Key Lessons
A senior engineer accidentally sent a DELETE request to an ElasticSearch index holding 17 million product records, causing a major data-loss incident. This summary covers how the team recovered, the scaling problems the rebuild exposed, and the process improvements that followed, as guidance for backend developers.
Project Background
The author worked at a fast‑growing e‑commerce startup leading two teams that built core backend services. One team created a product catalog service that stored inventory, product metadata, pricing, and fulfillment data for roughly 17 million items, exposing the data via a REST API.
The catalog is backed by an ElasticSearch cluster because the service needs to support more than 50 different filters, some with full‑text search.
ElasticSearch Overview
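Unlike traditional databases, where write access is often restricted to DBAs, ElasticSearch is accessed directly through its REST interface. In version 5 the URL pattern was {cluster_endpoint}/{index}/{type}/{id}; mapping types, and with them the {type} segment, were deprecated in version 7 and removed in version 8. Operations are performed as HTTP calls using methods such as GET, POST, PUT, and DELETE, so the same URL can read or destroy data depending only on the method chosen.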
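To make the URL pattern concrete, here is a minimal Python sketch that assembles a version 5 style URL and flags the one combination that matters for this incident: a DELETE whose path stops at the index name. The helper names (build_url, is_index_level_delete) and the example endpoint, index, type, and id are hypothetical and chosen only for illustration; nothing here is sent over the network.

```python
# Illustrative sketch; not taken from the original incident tooling.
# All names and example values are assumptions made for this example.

def build_url(endpoint, index, doc_type=None, doc_id=None):
    """Join the segments of {cluster_endpoint}/{index}/{type}/{id}.

    Segments that are None are left out, so the same helper can
    address a whole index or a single document.
    """
    parts = [endpoint.rstrip("/"), index]
    for segment in (doc_type, doc_id):
        if segment is not None:
            parts.append(str(segment))
    return "/".join(parts)


def is_index_level_delete(method, path):
    """Return True when the request is a DELETE on a one-segment path.

    In ElasticSearch, DELETE /{index} removes the whole index, while
    DELETE /{index}/{type}/{id} removes a single document.
    """
    segments = [s for s in path.split("/") if s]
    return method.upper() == "DELETE" and len(segments) == 1


if __name__ == "__main__":
    print(build_url("http://localhost:9200", "products", "product", "42"))
    print(is_index_level_delete("DELETE", "/products"))
    print(is_index_level_delete("DELETE", "/products/product/42"))
```

A check like this could sit in a client wrapper or proxy so that an index-level DELETE needs an explicit extra confirmation, while ordinary reads and document-level writes pass through unchanged.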
Event Recap
On a busy Friday, a teammate needed to export data using a filter that was not available in the public API. The author opened a Postman session intending to issue a GET request, but mistakenly selected DELETE and sent it, deleting the entire product index.
The author cancelled the request at once, but cancelling only stopped the client; the delete operation had already reached the ElasticSearch server. Subsequent checks showed that only a few hundred documents remained instead of the expected 17 million.
Recovery Options
The team convened an emergency war‑room. Because the catalog is a read‑model, they could rebuild it from upstream services. Two main approaches were considered:
Re‑import all data via a custom component that synchronises the REST API with other micro‑services, a process that would take about six days.
Leverage event streams; many services could replay events, and some critical domains already supported data replay.
They ultimately combined both methods, reducing the rebuild time from six days to a few hours.
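The source does not say how the two methods were combined. One plausible arrangement, sketched below in Python, is to load whatever the bulk re-import provides and then apply replayed events on top, letting the higher version win for each product. The record and event shapes (id, version, fields, deleted) and the function name rebuild_read_model are assumptions made for this example, not the team's actual schema.

```python
# Sketch under stated assumptions: every record and event carries an
# "id" and a monotonically increasing per-product "version".

def rebuild_read_model(bulk_records, events):
    """Merge a bulk re-import with replayed events; newest version wins."""
    model = {}

    # Step 1: seed the model from the bulk re-import.
    for record in bulk_records:
        model[record["id"]] = dict(record)

    # Step 2: replay events in version order on top of the seed.
    for event in sorted(events, key=lambda e: e["version"]):
        current = model.get(event["id"])
        if current is not None and current["version"] >= event["version"]:
            continue  # stale event: the model already holds newer data
        if event.get("deleted"):
            model.pop(event["id"], None)
            continue
        merged = dict(current) if current is not None else {"id": event["id"]}
        merged.update(event.get("fields", {}))
        merged["version"] = event["version"]
        model[event["id"]] = merged

    return model


if __name__ == "__main__":
    bulk = [{"id": "a", "version": 2, "price": 10}]
    replayed = [{"id": "b", "version": 1, "fields": {"price": 7}}]
    print(rebuild_read_model(bulk, replayed))
```

Because stale events are skipped, the two sources can be fed in without coordinating exactly where the bulk import stops and the replay begins, which is what lets them run side by side.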
Lessons Learned
1. Backup and rebuild speed – While most databases are regularly backed up, the ElasticSearch read‑model lacked proper protection. Rebuilding a read‑model of this scale is time‑consuming; the team managed to cut the rebuild window to a few hours by combining full re‑import and event replay.
2. Horizontal scaling limits – The rebuild component relies on synchronous REST calls to many services, which quickly saturates those services and defeats the expected horizontal scalability of micro‑services.
3. Role‑based access control – The team migrated to ElasticSearch 7, enabled the X‑Pack security features (the core ones are now free), and created distinct read‑only and write roles so that direct write access to the index is restricted.
4. Process responsibility – Mistakes stemmed from poor processes rather than individuals. The team instituted stricter approval workflows, automated safeguards, and limited direct database access to reduce human error.
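As an illustration of the read‑only and write roles from lesson 3, the Python sketch below builds the request bodies for two roles without sending anything. It assumes the ElasticSearch 7 security role endpoint (PUT /_security/role/{name}) and the built-in index privileges read and write; the role names catalog_reader and catalog_writer and the index pattern products* are hypothetical. Neither role is granted delete_index, so neither could drop the index.

```python
import json

# Sketch only: role names and index pattern are assumptions for this example.

def role_body(index_pattern, privileges):
    """Build a role definition limited to one index pattern."""
    return {
        "indices": [
            {"names": [index_pattern], "privileges": list(privileges)}
        ]
    }


ROLES = {
    # Can search and fetch documents, nothing else.
    "catalog_reader": role_body("products*", ["read"]),
    # Can also index, update, and delete documents, but not the index itself.
    "catalog_writer": role_body("products*", ["read", "write"]),
}


def role_requests(roles):
    """Yield (method, path, json_body) tuples; no network call is made."""
    for name, body in sorted(roles.items()):
        yield "PUT", "/_security/role/" + name, json.dumps(body, sort_keys=True)


if __name__ == "__main__":
    for method, path, body in role_requests(ROLES):
        print(method, path, body)
```

Day-to-day users and export scripts would then use credentials mapped to the reader role, with the writer role reserved for the service that maintains the index.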
Conclusion
The incident highlighted the fragility of large denormalised read‑models, the importance of automated rebuild pipelines, proper access controls, and robust operational processes. By learning from this near‑catastrophe, the team improved reliability and reduced the risk of similar outages in the future.
dbaplus Community
Enterprise-level professional community for Database, BigData, and AIOps. Daily original articles, weekly online tech talks, monthly offline salons, and quarterly XCOPS&DAMS conferences—delivered by industry experts.
