Comprehensive Checklist for Deploying Web Services on Kubernetes in Production

This article presents a detailed checklist covering general information, application requirements, security and compliance, CI/CD practices, Kubernetes configuration, monitoring, testing, and 24/7 service team readiness to ensure reliable production deployment of HTTP‑based web services on Kubernetes.

Architecture Digest

General

Application name, description, purpose, and owning team are clearly documented (e.g., via a service tree).

The application's criticality level is defined (e.g., whether it sits on a business-critical request path).

The development team has sufficient knowledge to run applications on Kubernetes (e.g., how to design stateless services).

A 24/7 on‑call team is identified and notified.

A release plan exists, including potential rollback steps.

Application

The code repository contains clear instructions for development, configuration, and changes (crucial for emergency fixes).

Dependencies are pinned to exact versions so that a hotfix does not unintentionally pull in new library versions.

OpenTracing/OpenTelemetry semantic conventions are followed.

All outbound HTTP calls have defined timeouts.

HTTP connection pools are sized appropriately for expected traffic.

Thread pools or non‑blocking asynchronous code are correctly implemented and configured.

Redis and database connection pool sizes are set correctly.

Retry and back‑off strategies are implemented for dependent services.

Fallback mechanisms (how the application degrades when a dependency is unavailable) are defined according to business needs.

Rate‑limiting or throttling mechanisms are in place (often provided by the underlying infrastructure).

Application metrics are exposed for collection (e.g., scraped by Prometheus).

Application logs are written to stdout/stderr.

Logs follow best practices (structured logging, meaningful messages), have clearly defined levels, and debug logging is disabled by default in production.

The application container exits on fatal errors rather than entering unrecoverable states or deadlocks.

Design and code are reviewed by senior engineers.
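
The timeout and retry items above can be sketched in a few lines. This is a minimal illustration that assumes Python's standard library only; the helper names (backoff_delays, call_with_retries, fetch) and the default values are hypothetical, not part of any particular framework.

```python
import random
import time
import urllib.request


def backoff_delays(attempts, base=0.1, cap=2.0, rng=random.random):
    """Yield one "full jitter" delay per retry: rng() * min(cap, base * 2**n)."""
    for n in range(attempts - 1):
        yield rng() * min(cap, base * (2 ** n))


def call_with_retries(fn, attempts=3, base=0.1, cap=2.0,
                      retry_on=(OSError,), sleep=time.sleep):
    """Call fn(); on a retryable error, wait with backoff and try again."""
    delays = backoff_delays(attempts, base, cap)
    while True:
        try:
            return fn()
        except retry_on:
            delay = next(delays, None)
            if delay is None:  # attempts exhausted: surface the last error
                raise
            sleep(delay)


def fetch(url, timeout_s=2.0):
    """Outbound HTTP GET with an explicit timeout (hypothetical helper)."""
    # Without timeout=, urlopen can block indefinitely on a stalled peer.
    with urllib.request.urlopen(url, timeout=timeout_s) as resp:
        return resp.read()
```

A call such as call_with_retries(lambda: fetch(url)) would then retry connection errors and timeouts, which urllib raises as OSError subclasses. HTTP error responses are also OSError subclasses here, so production code would likely narrow retry_on to transient failures and retry only idempotent requests.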

Security and Compliance

The application runs as a non‑privileged (non‑root) user.

The container filesystem is read‑only; no writable layers are required.

HTTP requests are authenticated and authorized (e.g., using OAuth).

Denial‑of‑service mitigation mechanisms are in place (e.g., ingress rate limiting, WAF).

Security audits have been performed.

Automated vulnerability scanning for code and dependencies is enabled.

Processed data is understood, classified (e.g., PII), and documented.

A threat model has been created and risks recorded.

Other applicable organizational rules and compliance standards are followed.
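
Rate limiting is usually handled at the ingress or by a WAF, as noted above. Where an application-level safeguard is also wanted, a token bucket is one common approach. The sketch below is illustrative only: the class name and parameters are assumptions, and it is not thread-safe as written.

```python
import time


class TokenBucket:
    """Allow `rate` requests per second on average, with bursts up to `capacity`."""

    def __init__(self, rate, capacity, clock=time.monotonic):
        self.rate = float(rate)
        self.capacity = float(capacity)
        self.clock = clock
        self.tokens = float(capacity)  # start full so an initial burst is allowed
        self.updated = clock()

    def allow(self, cost=1.0):
        """Return True and consume tokens if the request fits, else False."""
        now = self.clock()
        elapsed = now - self.updated
        self.updated = now
        # Refill for the elapsed time, never beyond the bucket capacity.
        self.tokens = min(self.capacity, self.tokens + elapsed * self.rate)
        if self.tokens >= cost:
            self.tokens -= cost
            return True
        return False
```

A request handler would call allow() once per incoming request and answer with HTTP 429 when it returns False. Keeping one bucket per client or API key limits a single noisy caller without affecting the others.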

Continuous Integration / Continuous Delivery

Every change triggers an automated pipeline.

Automated tests are part of the delivery pipeline.

Production deployments require no manual intervention.

All relevant team members can deploy and roll back.

Production deployments include smoke tests and optional automatic rollbacks.

Lead time from code commit to production is short (e.g., ≤15 minutes, including test execution).
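
The "smoke tests and optional automatic rollbacks" item can be pictured as a small piece of pipeline logic. The sketch below assumes the pipeline supplies three callables (deploy, smoke_test, rollback); these names are hypothetical placeholders for whatever the delivery tooling actually provides.

```python
def deploy_with_smoke_test(deploy, smoke_test, rollback,
                           new_version, previous_version):
    """Deploy new_version, verify it, and roll back automatically on failure.

    Returns the version that is live when the function finishes.
    """
    deploy(new_version)
    try:
        healthy = smoke_test(new_version)
    except Exception:
        healthy = False  # a crashing smoke test counts as a failure
    if healthy:
        return new_version
    rollback(previous_version)
    return previous_version
```

Treating an exception in the smoke test as a failed check errs on the side of rolling back, which is usually the safer default for an unattended production deployment.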

Kubernetes

Development team has received Kubernetes training and understands relevant concepts.

Kubernetes manifests use the latest stable API versions (e.g., apps/v1 for Deployments).

Containers run as non‑root users with read‑only file systems.

Appropriate readiness probes are defined.

Liveness probes are either omitted or used with a clear justification.

Deployments have at least two replicas.

Horizontal Pod Autoscaling (HPA) is configured when appropriate.

Memory and CPU requests are set based on performance and load testing.

Memory limits equal memory requests, so that memory is not overcommitted on the node.

CPU limits are unset or their throttling impact is well understood.

Application runtime parameters (e.g., JVM heap, single‑threaded mode) are correctly configured for containers.

Each container runs a single application process.

The application can handle graceful shutdowns and rolling updates without interruption.

If graceful termination is not handled, a Pod lifecycle hook (e.g., preStop with "sleep 20") is used.

All required pod labels are set.

The application is deployed for high availability, with pods spread across failure domains or multiple clusters.

Kubernetes Service uses correct label selectors (e.g., not only "app" but also "component" and "environment").

Optional: tolerations are used as needed (e.g., together with node selectors or node affinity, to schedule pods onto specific tainted node pools).
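
Several of the Kubernetes items above can be checked mechanically before a manifest is applied. The following sketch assumes the Deployment has already been loaded into a Python dict (for example from YAML); the function name, the example manifest, and its values (my-app, registry.example.com, /health, port 8080) are hypothetical. It covers only a subset of the checklist.

```python
def check_deployment(manifest):
    """Return a list of checklist violations for a Deployment given as a dict."""
    problems = []
    if manifest.get("apiVersion") != "apps/v1":
        problems.append("apiVersion should be apps/v1")
    spec = manifest.get("spec", {})
    # Kubernetes defaults to a single replica when the field is omitted.
    if spec.get("replicas", 1) < 2:
        problems.append("fewer than two replicas")
    pod_spec = spec.get("template", {}).get("spec", {})
    pod_sec = pod_spec.get("securityContext", {})
    for c in pod_spec.get("containers", []):
        name = c.get("name", "<unnamed>")
        sec = c.get("securityContext", {})
        # runAsNonRoot may be set per container or inherited from the pod.
        if not sec.get("runAsNonRoot", pod_sec.get("runAsNonRoot", False)):
            problems.append(f"{name}: runAsNonRoot is not set")
        if not sec.get("readOnlyRootFilesystem"):
            problems.append(f"{name}: root filesystem is writable")
        if "readinessProbe" not in c:
            problems.append(f"{name}: no readinessProbe")
        res = c.get("resources", {})
        requests, limits = res.get("requests", {}), res.get("limits", {})
        for key in ("cpu", "memory"):
            if key not in requests:
                problems.append(f"{name}: no {key} request")
        if requests.get("memory") != limits.get("memory"):
            problems.append(f"{name}: memory limit differs from memory request")
    return problems


# Hypothetical manifest that satisfies the checks above.
GOOD = {
    "apiVersion": "apps/v1",
    "kind": "Deployment",
    "metadata": {"name": "my-app"},
    "spec": {
        "replicas": 2,
        "selector": {"matchLabels": {"app": "my-app"}},
        "template": {
            "metadata": {"labels": {"app": "my-app"}},
            "spec": {"containers": [{
                "name": "web",
                "image": "registry.example.com/my-app:1.0.0",
                "securityContext": {"runAsNonRoot": True,
                                    "readOnlyRootFilesystem": True},
                "readinessProbe": {"httpGet": {"path": "/health", "port": 8080}},
                # No CPU limit; memory limit equals memory request.
                "resources": {"requests": {"cpu": "100m", "memory": "256Mi"},
                              "limits": {"memory": "256Mi"}},
                # Delay SIGTERM so in-flight traffic can drain.
                "lifecycle": {"preStop": {"exec": {"command": ["sleep", "20"]}}},
            }]},
        },
    },
}
```

Running check_deployment(GOOD) returns an empty list. A check like this could run in the delivery pipeline, although a Deployment managed by an HPA often omits replicas on purpose and would need that rule relaxed.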

Monitoring

Metrics for the four golden signals (latency, traffic, errors, and saturation) are collected.

Application metrics are collected (e.g., via Prometheus scraping).

Databases (e.g., PostgreSQL) are monitored.

SLOs are defined.

Monitoring dashboards (e.g., Grafana) exist and can be auto‑provisioned.

Alert rules are defined based on impact rather than root cause.
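
To make the SLO item concrete, an availability SLO can be turned into an error budget with simple arithmetic. The sketch below is one way to do that calculation; the function name and the returned keys are assumptions for illustration, not a standard API.

```python
def error_budget(slo_target, total_requests, failed_requests):
    """Summarise error-budget consumption for an availability SLO.

    slo_target is a fraction such as 0.999 (99.9% of requests succeed).
    """
    if not 0.0 < slo_target < 1.0:
        raise ValueError("slo_target must be strictly between 0 and 1")
    if total_requests <= 0:
        raise ValueError("total_requests must be positive")
    allowed_failures = (1.0 - slo_target) * total_requests
    return {
        "availability": 1.0 - failed_requests / total_requests,
        "allowed_failures": allowed_failures,
        # 1.0 means the budget for this window is exhausted.
        "budget_consumed": failed_requests / allowed_failures,
        "slo_met": failed_requests <= allowed_failures,
    }
```

With a 99.9% target and 1,000,000 requests in the window, about 1,000 failures are allowed, so 250 failed requests consume roughly a quarter of the budget. Alerting on how fast the budget is being consumed is one way to keep alerts tied to impact rather than to individual causes.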

Testing

Fault injection testing (system/chaos testing) is performed.

Load testing reflects expected traffic patterns.

Backup and restore procedures for data stores (e.g., PostgreSQL) are tested.

24/7 Service Team

All relevant 24/7 service teams are notified of releases (e.g., SRE, incident commander).

The 24/7 team has sufficient knowledge of the application and business context.

The team possesses necessary production access (e.g., kubectl, kube‑web‑view, application logs).

The team has expertise to troubleshoot production issues in the tech stack (e.g., JVM).

The team is trained and confident in executing standard operations (scaling, rollback, etc.).

Monitoring alerts that trigger the 24/7 team are configured.

Automatic escalation rules are in place (e.g., escalation after 10 minutes without acknowledgment).

Post‑mortem analysis and incident learning processes exist.

Regular application and operational reviews are conducted (e.g., reviewing SLO violations).
