Essential Kubernetes Production Checklist for Web Services
A comprehensive checklist that guides teams through documentation, application design, security, CI/CD, Kubernetes configuration, monitoring, testing, and 24/7 support for reliably running web services with HTTP APIs in production on Kubernetes.
Running applications in production can be tricky. This article presents a thorough checklist for deploying web services (applications exposing an HTTP API) on Kubernetes.
General
Application name, description, purpose, and owning team are clearly documented (e.g., via a service tree).
The application's criticality level is defined (e.g., marking business-critical apps as "critical-path services").
The development team has sufficient Kubernetes knowledge/experience, such as understanding stateless services.
A 24/7 on‑call team is identified and notified.
An upgrade plan exists, including potential rollback steps.
Application
The code repository contains clear instructions on development, configuration, and changes (crucial for emergency fixes).
Dependencies are pinned so that a rebuild or patch release does not unintentionally pull in new library versions.
OpenTracing/OpenTelemetry semantic conventions are followed.
All outbound HTTP calls define timeouts.
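As a minimal sketch of the timeout rule above, using only the Python standard library (the two-second default is an assumed value; tune it per dependency SLO):

```python
import urllib.request

DEFAULT_TIMEOUT = 2.0  # seconds; assumed default, tune per dependency SLO


def fetch(url: str, timeout: float = DEFAULT_TIMEOUT) -> bytes:
    # An explicit timeout prevents a slow dependency from tying up
    # worker threads indefinitely; urlopen raises on expiry instead
    # of blocking forever.
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        return resp.read()
```

The important part is that the timeout is always passed; a call site that forgets it still gets the explicit default rather than the library's unbounded behavior.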
HTTP connection pools are sized appropriately for expected traffic.
Thread pools or non‑blocking asynchronous code are correctly implemented and configured.
Redis and database connection pools have correct sizes.
Retry and back‑off strategies are implemented for dependent services.
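A retry with exponential back-off and full jitter might look like the following sketch (`call_with_retries` and its parameters are hypothetical names, not from the original checklist):

```python
import random
import time


def call_with_retries(fn, attempts=4, base_delay=0.1, sleep=time.sleep):
    """Retry fn with exponential back-off plus full jitter (sketch).

    The sleep function is injectable so tests can run without waiting.
    """
    for attempt in range(attempts):
        try:
            return fn()
        except Exception:
            if attempt == attempts - 1:
                raise  # out of attempts: surface the last error
            # Full jitter: random delay in [0, base * 2^attempt) to avoid
            # synchronized retry storms against a recovering dependency.
            sleep(base_delay * (2 ** attempt) * random.random())
```

Only retry operations that are idempotent, and cap total attempts so retries cannot amplify an outage.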
A rollback mechanism is defined based on business requirements.
Rate‑limiting or throttling mechanisms are in place (often provided by the underlying infrastructure).
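When the infrastructure does not already provide throttling, an in-process token bucket is one common fallback. A sketch (class and parameter names are illustrative; the injectable clock exists only to make it testable):

```python
import time


class TokenBucket:
    """Simple in-process token-bucket rate limiter (sketch).

    Real deployments usually prefer ingress/gateway rate limiting;
    this only protects a single process.
    """

    def __init__(self, rate: float, capacity: int, now=time.monotonic):
        self.rate = rate            # tokens replenished per second
        self.capacity = capacity    # maximum burst size
        self.tokens = float(capacity)
        self.now = now
        self.last = now()

    def allow(self) -> bool:
        # Refill proportionally to elapsed time, then spend one token
        # if available.
        t = self.now()
        self.tokens = min(self.capacity,
                          self.tokens + (t - self.last) * self.rate)
        self.last = t
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False
```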
Application metrics are exposed for collection (e.g., scraped by Prometheus).
Application logs are written to stdout/stderr.
Logs follow best practices (structured logging, meaningful messages), have clearly defined levels, and debug logging is disabled by default in production.
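Structured logging to stdout can be as small as a JSON formatter on a stream handler; a sketch using only the standard library (the logger name and field set are assumptions):

```python
import json
import logging
import sys


class JsonFormatter(logging.Formatter):
    # One JSON object per line, so log collectors can parse fields
    # without fragile regexes.
    def format(self, record):
        return json.dumps({
            "level": record.levelname,
            "logger": record.name,
            "message": record.getMessage(),
        })


handler = logging.StreamHandler(sys.stdout)  # stdout, never a local file
handler.setFormatter(JsonFormatter())
logger = logging.getLogger("web-service")    # hypothetical logger name
logger.addHandler(handler)
logger.setLevel(logging.INFO)                # DEBUG stays off in production
```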
The application process crashes (exits with a non-zero code) on fatal, unrecoverable errors instead of hanging in a broken state or deadlock, so Kubernetes can restart it.
Design and code are reviewed by senior engineers.
Security & Compliance
The application runs as a non‑privileged (non‑root) user.
The container file system is read‑only where possible.
HTTP requests are authenticated and authorized (e.g., using OAuth).
Denial‑of‑service mitigation mechanisms are in place (e.g., ingress rate limiting, WAF).
Security audits have been performed.
Automated vulnerability scanning for code and dependencies is enabled.
Processed data is understood, classified (e.g., PII), and documented.
A threat model has been created and risks recorded.
Other applicable organizational rules and compliance standards are followed.
Continuous Integration / Continuous Delivery
Every change triggers an automated pipeline.
Automated tests are part of the delivery pipeline.
Production deployments require no manual steps.
All relevant team members can deploy and roll back.
Production deployments include smoke tests and optional automatic rollbacks.
Lead time from code commit to production is short (e.g., 15 minutes or less, including test execution).
Kubernetes
The development team has received Kubernetes training and understands related concepts.
Kubernetes manifests use the latest API versions (e.g., apps/v1 for Deployments).
Containers run as non‑root users with read‑only file systems.
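The non-root and read-only requirements translate into a container-level `securityContext`; a sketch (the UID is an arbitrary assumed value):

```yaml
# Container-level securityContext sketch
securityContext:
  runAsNonRoot: true
  runAsUser: 10001              # arbitrary non-root UID (assumption)
  readOnlyRootFilesystem: true
  allowPrivilegeEscalation: false
```

Applications that need scratch space can mount an `emptyDir` volume at a writable path while keeping the root file system read-only.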
Appropriate readiness probes are defined.
Liveness probes are omitted or used only with a clear justification.
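A readiness probe for these two items might look like the following sketch (the `/health` endpoint and port 8080 are assumptions about the application):

```yaml
# Sketch: readiness only; no liveness probe unless there is a
# specific, documented justification, since a misconfigured liveness
# probe can cause restart loops under load.
readinessProbe:
  httpGet:
    path: /health        # assumed health endpoint
    port: 8080           # assumed container port
  initialDelaySeconds: 5
  periodSeconds: 10
  failureThreshold: 3
```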
Deployments have at least two replicas.
Horizontal Pod Autoscaling (HPA) is configured when appropriate.
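A CPU-based HPA sketch using the `autoscaling/v2` API (the name, replica bounds, and 70% target are assumed values to tune from load tests):

```yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: my-service          # hypothetical name
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: my-service
  minReplicas: 2            # matches the two-replica baseline above
  maxReplicas: 10
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 70   # assumed target; derive from load tests
```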
Memory and CPU requests are set based on performance and load testing.
Memory limits equal memory requests to avoid over‑consumption.
CPU limits are either unset or their throttling impact is well understood.
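The three resource items above combine into a `resources` block like this sketch (the concrete numbers are placeholders to be replaced by load-test results):

```yaml
resources:
  requests:
    cpu: 500m          # placeholder; derive from load testing
    memory: 512Mi      # placeholder; derive from load testing
  limits:
    memory: 512Mi      # limit == request avoids overcommit surprises
    # No CPU limit here: this avoids CFS throttling; only set one if
    # its throttling impact is understood and acceptable.
```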
Application runtime settings (e.g., JVM heap, single‑threaded runtime, non‑container‑aware runtimes) are correctly configured for the container environment.
Each container runs a single application process.
The application can handle graceful shutdowns and rolling updates without interruption.
If graceful termination is not handled, a Pod lifecycle hook (e.g., preStop with "sleep 20") is used.
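The preStop workaround mentioned above looks like this sketch (it assumes a `sleep` binary exists in the container image):

```yaml
# Container-level lifecycle hook: delay SIGTERM so load balancers and
# endpoint controllers stop routing traffic before the process exits.
lifecycle:
  preStop:
    exec:
      command: ["sleep", "20"]   # requires sleep in the image
# Pod-level field: must exceed the preStop sleep plus drain time.
terminationGracePeriodSeconds: 30
```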
All required Pod labels are set.
The application is configured for high availability: Pods are spread across failure domains or deployed to multiple clusters.
Kubernetes Services use correct label selectors (e.g., matching not only "app" but also "component" and "environment" for future scaling).
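A Service selector covering more than just `app` might look like this sketch (all names and label values are hypothetical):

```yaml
apiVersion: v1
kind: Service
metadata:
  name: my-service          # hypothetical name
spec:
  selector:
    app: my-service
    component: api          # selecting on component and environment keeps
    environment: prod       # the selector unambiguous as more Deployments
                            # sharing the "app" label are added later
  ports:
    - port: 80
      targetPort: 8080      # assumed container port
```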
Optional: taints and tolerations (together with node selectors or affinity) are used as needed, e.g., to dedicate specific node pools to certain Pods.
Monitoring
Metrics for the four golden signals (latency, traffic, errors, saturation) are collected.
Application metrics are collected (e.g., scraped by Prometheus).
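For scraping to work, the application exposes metrics in the Prometheus text exposition format; a deliberately minimal sketch that renders counters by hand (real services would normally use a Prometheus client library instead):

```python
def render_metrics(counters: dict) -> str:
    # Minimal Prometheus text exposition (sketch): a TYPE line followed
    # by "name value" for each counter, one metric per line.
    lines = []
    for name, value in sorted(counters.items()):
        lines.append(f"# TYPE {name} counter")
        lines.append(f"{name} {value}")
    return "\n".join(lines) + "\n"
```

Serving this string at an assumed `/metrics` endpoint is all Prometheus needs to scrape the process.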
Databases (e.g., PostgreSQL) are monitored.
Service Level Objectives (SLOs) are defined.
Monitoring dashboards exist (e.g., Grafana) and can be provisioned automatically.
Alert rules are defined based on impact rather than root cause.
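An impact-based alert (error-rate burn rather than, say, a single Pod restarting) might look like this Prometheus rule sketch (the metric name `http_requests_total`, the 1% threshold, and the labels are assumptions):

```yaml
# Sketch: page on user-visible impact, not on root causes.
groups:
  - name: slo
    rules:
      - alert: HighErrorRate
        expr: |
          sum(rate(http_requests_total{status=~"5.."}[5m]))
            / sum(rate(http_requests_total[5m])) > 0.01
        for: 10m
        labels:
          severity: page
        annotations:
          summary: "Error rate above 1% for 10 minutes"
```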
Testing
Chaos/breakpoint testing is performed.
Load testing reflects expected traffic patterns.
Backup and restore procedures for data stores (e.g., PostgreSQL) are tested.
24/7 Service Team
All relevant 24/7 service teams are notified of releases (e.g., SRE, incident commanders).
The on‑call team has sufficient knowledge of the application and business context.
The team possesses necessary production access (e.g., kubectl, kube‑web‑view, application logs).
The team has expertise to troubleshoot production issues in the tech stack (e.g., JVM).
The team is trained and confident in executing standard operations (scaling, rollback, etc.).
Monitoring alerts are set up to page the 24/7 team.
Automatic escalation rules are in place (e.g., escalating after 10 minutes without acknowledgment).
Post‑incident analysis and knowledge sharing processes exist.
Regular application‑operation reviews are conducted (e.g., reviewing SLO violations).
Ops Development Stories
Maintained by a like‑minded team, covering both operations and development. Topics span Linux ops, DevOps toolchain, Kubernetes containerization, monitoring, log collection, network security, and Python or Go development. Team members: Qiao Ke, wanger, Dong Ge, Su Xin, Hua Zai, Zheng Ge, Teacher Xia.