How a Public Snapshot Leak Almost Cost a Client – Lessons from a Cloud Ops Failure

A cloud engineer mistakenly made a disk snapshot public, exposing a major client’s data to every tenant, and had to rush a rollback. The post‑mortem traced the root causes to missing safeguards and argues for strict review, visual tools, and risk‑aware practices in high‑risk operations.

dbaplus Community

Nov 15, 2023

Background : In December 2018, a senior engineer was asked to share a cloud disk snapshot from user A to user B on behalf of a large client. The team decided to use the existing snapshot‑sharing feature of their cloud management system, believing it would be a quick and simple solution.

Incident : After deploying the change, the engineer opened a new console window and noticed that the snapshot’s flag read public = true. Instead of being shared with user B alone, the snapshot had been shared with all tenants, exposing the client’s data to anyone who could list snapshots. Realising the severity, the engineer panicked, attempted a manual rollback via SQL, and informed the team lead.

Response : The team lead ordered an immediate rollback, which the engineer completed within five minutes, before any other tenant could create a disk from the leaked snapshot. No other user actually accessed the data, but the incident triggered alerts across neighboring teams and led to a post‑mortem meeting.

Reflection : The engineer identified several mistakes:

Performing a low‑frequency, high‑risk operation without proper safeguards.

Skipping product‑level visual tools for a dangerous action and opting for manual CLI commands instead.

Failing to implement double‑check or approval workflows for critical APIs.

Allowing the public flag to be set without explicit validation.

These oversights led to a near‑catastrophic data leak and damaged trust with senior management.
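The last mistake in the list, a public flag that could be set without explicit validation, can be illustrated with a small guard. This is a minimal sketch, not the team's actual system: the names ShareRequest, validate_share, and confirm_public are hypothetical, and it assumes a share request is either aimed at named tenants or deliberately public, never both.

```python
from __future__ import annotations

from dataclasses import dataclass, field


class UnsafeShareError(ValueError):
    """Raised when a share request would expose a snapshot too widely."""


@dataclass
class ShareRequest:
    # All field names here are hypothetical, chosen for illustration.
    snapshot_id: str
    owner: str
    target_tenants: list[str] = field(default_factory=list)
    public: bool = False
    # To go public, the caller must retype the snapshot id on purpose.
    confirm_public: str = ""


def validate_share(req: ShareRequest) -> ShareRequest:
    """Reject any request that is ambiguous or public by accident."""
    if req.public:
        if req.target_tenants:
            raise UnsafeShareError("public and target_tenants are mutually exclusive")
        if req.confirm_public != req.snapshot_id:
            raise UnsafeShareError("public share requires confirm_public == snapshot_id")
        return req
    if not req.target_tenants:
        raise UnsafeShareError("a private share needs at least one target tenant")
    if req.owner in req.target_tenants:
        raise UnsafeShareError("cannot share a snapshot with its own owner")
    return req


if __name__ == "__main__":
    # Sharing from user A to user B passes validation.
    validate_share(ShareRequest("snap-1", "user-a", ["user-b"]))
    # A bare public=True, the situation in the incident, is refused.
    try:
        validate_share(ShareRequest("snap-1", "user-a", public=True))
    except UnsafeShareError as exc:
        print("rejected:", exc)
```

The design choice is that public exposure is never a default or a side effect: it has to be requested explicitly and confirmed by repeating the snapshot id, so a mistyped or defaulted flag fails loudly instead of silently sharing with every tenant.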

Recommendations :

Design high‑risk operations as separate, isolated APIs with mandatory audit trails.

Provide a visual, product‑level interface for critical actions to avoid “manual” or “human‑powered” operations.

Require peer review or automated double‑check mechanisms before executing commands that affect multiple tenants.

Document all requests, keep a record of effort, and push for productization of repetitive high‑risk tasks.

When a feature is not yet productized, involve the team lead to secure temporary safeguards and communicate the risk to stakeholders.
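The first and third recommendations (an isolated path for high‑risk operations with a mandatory audit trail, plus peer review before execution) could be combined roughly as follows. This is a hedged sketch under stated assumptions: HighRiskOperation, HighRiskRunner, and the in‑memory audit_log are hypothetical names for illustration, and a real system would persist the log and authenticate the reviewer.

```python
from __future__ import annotations

import time
from dataclasses import dataclass, field
from typing import Any, Callable, Optional


class ApprovalError(PermissionError):
    """Raised when a high-risk operation lacks a valid second approval."""


@dataclass
class HighRiskOperation:
    # Hypothetical request object; params are passed to the action as keywords.
    name: str
    requested_by: str
    params: dict
    approved_by: Optional[str] = None


@dataclass
class HighRiskRunner:
    # In-memory audit trail for illustration; a real one would be durable.
    audit_log: list = field(default_factory=list)

    def _record(self, event: str, op: HighRiskOperation) -> None:
        self.audit_log.append({
            "ts": time.time(),
            "event": event,
            "op": op.name,
            "requested_by": op.requested_by,
            "approved_by": op.approved_by,
            "params": dict(op.params),
        })

    def approve(self, op: HighRiskOperation, reviewer: str) -> None:
        # Two-person rule: the requester cannot approve their own change.
        if reviewer == op.requested_by:
            self._record("approval_rejected", op)
            raise ApprovalError("requester cannot approve their own operation")
        op.approved_by = reviewer
        self._record("approved", op)

    def execute(self, op: HighRiskOperation, action: Callable[..., Any]) -> Any:
        # Refuse to run anything that has not been reviewed, and log the attempt.
        if op.approved_by is None:
            self._record("execution_blocked", op)
            raise ApprovalError("operation has not been approved by a second person")
        self._record("execution_started", op)
        result = action(**op.params)
        self._record("executed", op)
        return result


if __name__ == "__main__":
    runner = HighRiskRunner()
    op = HighRiskOperation(
        "share_snapshot", "alice", {"snapshot_id": "snap-1", "target": "user-b"}
    )
    runner.approve(op, "bob")
    runner.execute(op, lambda snapshot_id, target: f"{snapshot_id} shared with {target}")
    for entry in runner.audit_log:
        print(entry["event"], entry["op"], entry["approved_by"])
```

Every attempt, including blocked ones, leaves an audit entry, which also supports the recommendation to document all requests.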

By adopting these practices, teams can reduce the likelihood of similar incidents and protect both customers and the organization.
