Building a One‑Person PCI DSS Image‑Signing Service and Surviving a P0 Outage
This article recounts how a solo developer built a Django‑based Docker image signing service to meet PCI DSS requirements, faced two severe incidents—including a 17.5‑hour P0 outage caused by concurrency limits and a misconfigured Rekor service—and shares the operational lessons learned for reliable SRE practice.
Our company needed to meet PCI DSS compliance, which mandates that every container image deployed to production be signed and verified beforehand. As the private Harbor administrator, I was tasked with delivering a solution within ten days.
After evaluating the options, we chose the Sigstore ecosystem and its cosign tool for image signing and verification.
Sigstore is an open‑source project that provides key‑less signing for the software supply chain, hosted by the Linux Foundation’s OpenSSF. Cosign, a component of Sigstore, simplifies container image signing and supports both traditional private‑key and key‑less modes.
Because users could not be expected to run cosign themselves, we built a backend service, named DIS (Docker Image Sign), that exposes an API for signing and verification. The service was implemented with Django and Django REST Framework and deployed on Kubernetes.
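The core of such a service is a thin wrapper that shells out to cosign. The sketch below is illustrative, not DIS's actual code; the key path and image reference are invented, and it assumes cosign's key-based signing mode (`cosign sign --key`):

```python
import subprocess


def build_sign_command(image: str, key_path: str) -> list[str]:
    """Assemble the cosign invocation for key-based signing.

    `image` is a full reference such as harbor.example.com/app/web:1.2.3
    (the hostname and paths here are illustrative).
    """
    return ["cosign", "sign", "--key", key_path, image]


def sign_image(image: str, key_path: str, timeout: int = 30) -> bool:
    """Run cosign as a child process; each call costs several seconds."""
    cmd = build_sign_command(image, key_path)
    result = subprocess.run(cmd, capture_output=True, text=True, timeout=timeout)
    return result.returncode == 0
```

A DRF view would call `sign_image` from its request handler, which is exactly why each request ties up a worker for the full duration of the cosign run.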
PCI DSS requires that all images be signed before deployment, which means the signing step blocks the CI/CD pipeline. DIS therefore had to be highly available (365 × 24) and able to handle concurrent requests.
During the first five days after launch, a project's CI/CD pipeline triggered an unexpected surge of about 600 concurrent signing requests (far above the usual 200). The surge was caused by a misconfigured pipeline that repeatedly invoked signing for hundreds of images, even though only a few had changed. DIS ran on ten 1‑CPU/1‑GB pods, each invoking a cosign shell command that took roughly seven seconds, limiting throughput and ultimately crashing the service.
The root cause was that every request spawned a full cosign process taking several seconds; the work could not be batched or cached, so throughput was hard-capped by pod count and per-call latency.
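A per-pod concurrency cap would at least have turned the overload into queuing rather than a crash. Here is a minimal sketch with a bounded semaphore; the limit of two in-flight cosign calls per pod is an assumed figure, not DIS's actual setting:

```python
import threading

MAX_INFLIGHT = 2                 # assumed per-pod cap, not DIS's real value
_slots = threading.BoundedSemaphore(MAX_INFLIGHT)

_peak_lock = threading.Lock()
inflight = 0
peak_inflight = 0


def sign_with_cap(sign_fn, image: str) -> bool:
    """Run sign_fn(image) with at most MAX_INFLIGHT concurrent executions."""
    global inflight, peak_inflight
    with _slots:                 # blocks when the cap is reached
        with _peak_lock:
            inflight += 1
            peak_inflight = max(peak_inflight, inflight)
        try:
            return sign_fn(image)
        finally:
            with _peak_lock:
                inflight -= 1

# Back-of-the-envelope capacity: 10 pods x (1 call / 7 s) is roughly
# 1.4 signs/s, so a burst of 600 requests needs about 7 minutes to drain.
```

The arithmetic in the closing comment is the real lesson: with a seven-second serial cosign call per pod, no amount of restarting changes the ceiling; only more pods, a faster signing path, or admission control does.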
We quickly restarted the service, increased pod resources, and asked the offending team to pause their CI/CD runs, which restored stability.
After the incident we added monitoring for concurrency, resource utilization, and specific error messages—previously absent.
Approximately forty days later a second P0 incident occurred. DIS depends on a private Rekor service (the immutable log component of Sigstore). A change to Rekor’s DNS entry broke the connection, and the Rekor owner had not communicated the change or performed testing. Our monitoring still failed to detect the outage promptly, resulting in a 17.5‑hour downtime that blocked deployments for a critical SaaS product and cost the company tens of thousands of dollars.
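A periodic dependency probe against Rekor would have caught the DNS break within minutes instead of hours. The sketch below is an assumption about how one might do it, not what DIS ran; the base URL and interval are invented, though Rekor does expose a GET /api/v1/log endpoint:

```python
import urllib.error
import urllib.request


def rekor_healthy(base_url: str, fetch=None, timeout: int = 5) -> bool:
    """Probe Rekor's log endpoint; any HTTP or DNS failure counts as down.

    `fetch` is injectable for testing; by default we do a real HTTP GET.
    """
    url = base_url.rstrip("/") + "/api/v1/log"
    if fetch is None:
        def fetch(u):
            with urllib.request.urlopen(u, timeout=timeout) as resp:
                return resp.status
    try:
        return fetch(url) == 200
    except (urllib.error.URLError, OSError):
        return False
```

Run from a cron job or a Kubernetes liveness-style checker every minute or so, a probe like this turns a silent upstream DNS change into an immediate alert on the consuming service's side, without depending on the Rekor owner to communicate.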
Post‑mortem conclusions emphasized that we should not have promised a production‑grade signing service on such a short timeline without allocating sufficient resources and performing risk assessments.
Key takeaways include the importance of thorough risk evaluation, adequate capacity planning, comprehensive monitoring, and cross‑team communication when relying on shared services.
Finally, the article reflects on the broader SRE principle that failures are inevitable, and argues that internalizing core SRE values is what ultimately makes a service reliable.
Efficient Ops
This public account is maintained by Xiaotianguo and friends and regularly publishes widely read original technical articles. We focus on operations transformation, aiming to accompany you through your operations career as we grow together.