
Building a Production‑Ready Cloud‑Native Kubernetes Platform: From Zero to SRE Success

This article presents a step‑by‑step guide to designing and implementing a production‑grade Kubernetes platform with GitOps, observability, capacity governance, fault‑injection, and SRE practices, showing how to achieve unified delivery, reliability, and low‑cost operation for high‑concurrency business services.


1. Project Background and Goals

1.1 Background

The organization runs multiple internal systems (user center, order, payment, marketing, admin). Rapid growth exposed problems such as manual releases, invisible runtime state, resource contention, noisy alerts, and expert‑only incident recovery.

1.2 Goals

Unified delivery: a GitOps‑driven pipeline from code commit → image build → automated deployment, with automatic rollback on failure.

Unified runtime: Applications run on a standardized Kubernetes platform with resource limits, self‑healing, autoscaling, and isolation.

Unified observability: Metrics, logs, and events are modeled for system health and business availability.

Unified governance: SLI/SLO‑centric alerting, on‑call rotation, post‑mortem, chaos testing, and capacity planning.

Unified expansion: Multi‑environment, multi‑namespace, multi‑business‑line support with future multi‑cluster and multi‑region evolution.

2. Overall Architecture Design

2.1 Architecture Overview

               ┌────────────────────────────┐
               │  Developer / Platform Eng  │
               └──────────────┬─────────────┘
                              │
                  Git Push / Merge Request
                              │
   ┌──────────────────────────▼───────────────────────────┐
   │                      GitLab CI                        │
   │  Unit test / code scan / image build / sign / push   │
   └──────────────────────────┬───────────────────────────┘
                              │
                  Update GitOps repository
                              │
           ┌──────────────────▼───────────────────┐
           │               Argo CD                │
           │  Desired‑state sync / auto‑rollback  │
           └──────────────────┬───────────────────┘
                              │
   ┌──────────────────────────▼───────────────────────────┐
   │                 Kubernetes Production                 │
   │  Ingress / Service / Deployment / HPA / PDB /         │
   │  NetworkPolicy / RuntimeClass / PriorityClass /       │
   │  LimitRange / ResourceQuota                           │
   └──────────────┬───────────────────────┬───────────────┘
                  │                       │
   ┌──────────────▼────────────┐  ┌───────▼───────────┐
   │    Observability Plane    │  │  Security Plane   │
   │ Prometheus / Alertmanager │  │ RBAC / OIDC       │
   │ Grafana / Loki / Tempo    │  │ Secret management │
   └──────────────┬────────────┘  └───────┬───────────┘
                  │                       │
   ┌──────────────▼───────────────────────▼─────────────┐
   │                  SRE Control Loop                   │
   │  SLI/SLO → Alert → On‑call → Mitigation → RCA      │
   │  Capacity → Chaos → Review → Optimization          │
   └─────────────────────────────────────────────────────┘

2.2 Design Principles

Declarative first: All cluster resources, app configs, alert rules, and dashboards are stored as Git‑managed declarative objects.

Control‑plane / data‑plane separation: GitLab builds images, Argo CD applies manifests, Kubernetes schedules workloads, Prometheus/Loki provide observability.

Platform standardization over individual optimization: A unified template enforces health checks, resource limits, monitoring exposure, and alert definitions.

User‑perceived reliability: Alerts are driven by success rate, latency, saturation, and error budget rather than raw CPU.

Extensibility for high concurrency and multi‑team collaboration: Component choices and layering anticipate future business lines, multi‑environment releases, multi‑region deployment, and capacity elasticity.

3. Core Component Selection and Rationale

3.1 Why Kubernetes as the foundation

Kubernetes supplies the primitives the platform builds on:

Deployment: maintains the desired replica count.

Service: stable service discovery.

Ingress: north‑south traffic entry.

HPA: auto‑scales based on metrics.

PDB: guarantees a minimum number of available replicas during voluntary disruptions such as node drains and rolling maintenance.

Node Affinity / Taint‑Toleration: fine‑grained scheduling isolation.

Kubernetes is a distributed control loop that continuously converges the current state to the desired state, providing self‑healing, elasticity, standardized delivery, and safe rollbacks.

3.2 Why Calico instead of a simple overlay network

Layer‑3 routing with BGP avoids the encapsulation overhead of an overlay network.

Fine‑grained NetworkPolicy supports multi‑tenant isolation.

Mature integration with the Kubernetes ecosystem keeps operational cost low.

Proven stability in large‑scale pod networks.

3.3 Why Prometheus + Loki

Many teams fall into two pitfalls: only monitoring infrastructure metrics, or collecting logs without a unified label model. Prometheus excels at time‑series metric collection and alerting; Loki stores label‑indexed logs at low cost. Combined with Grafana they enable queries such as “which API 5xx spiked” → “show the related pod logs” → “identify node resource contention”.
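As a concrete illustration of that drill‑down, the investigation might start from a PromQL query that surfaces the spiking endpoint and pivot to the matching Loki stream (metric and label names here follow the conventions used later in this article and depend on your instrumentation):

# PromQL: which endpoints are producing the most 5xx responses right now?
topk(5, sum by (app, uri) (rate(http_requests_total{status=~"5.."}[5m])))

# LogQL: pull the error logs of the implicated service over the same window
{namespace="prod", app="payment-service"} |= "ERROR"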

3.4 Why GitOps over traditional scripts

Auditability: every configuration change is a Git commit with review.

Rollback: revert by resetting the Git version.

Reproducibility: new environments can be bootstrapped from the same repo.

Collaboration: developers, platform engineers, and SRE share a single source of truth.

When the number of environments grows, manual scripts quickly become error‑prone; GitOps eliminates version drift, unclear ownership, and unstable rollbacks.

4. Production‑Grade Kubernetes Platform Implementation

4.1 Cluster Planning

Control plane: 3 master nodes spread across availability zones.

Worker nodes: separate pools for general workloads, compute‑intensive jobs, and stateful services.

Container runtime: containerd.

CNI: Calico.

Ingress: Nginx Ingress Controller or cloud‑provider load balancer.

Storage: Ceph RBD/CephFS, cloud CSI disks, or highly‑available NFS.

4.2 Multi‑AZ High Availability

Masters distributed across AZs to avoid single‑site control‑plane loss.

Dedicated etcd nodes or high‑performance disks to prevent control‑plane jitter.

Critical services scheduled on at least two nodes and two AZs.

Use topologySpreadConstraints to avoid pod concentration.

Apply PodDisruptionBudget to guarantee a minimum number of replicas during maintenance.

Example production deployment (excerpt):

apiVersion: apps/v1
kind: Deployment
metadata:
  name: payment-service
  namespace: prod
spec:
  replicas: 4
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxSurge: 1
      maxUnavailable: 1
  selector:
    matchLabels:
      app: payment-service
  template:
    metadata:
      labels:
        app: payment-service
    spec:
      terminationGracePeriodSeconds: 60
      affinity:
        podAntiAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
          - labelSelector:
              matchExpressions:
              - key: app
                operator: In
                values: ["payment-service"]
            topologyKey: kubernetes.io/hostname
      topologySpreadConstraints:
      - maxSkew: 1
        topologyKey: topology.kubernetes.io/zone
        whenUnsatisfiable: DoNotSchedule
        labelSelector:
          matchLabels:
            app: payment-service
      containers:
      - name: app
        image: harbor.example.com/prod/payment-service:1.3.12
        ports:
        - containerPort: 8080
        resources:
          requests:
            cpu: "500m"
            memory: "512Mi"
          limits:
            cpu: "2"
            memory: "1Gi"
        readinessProbe:
          httpGet:
            path: /actuator/health/readiness
            port: 8080
          initialDelaySeconds: 10
          periodSeconds: 5
          failureThreshold: 3
        livenessProbe:
          httpGet:
            path: /actuator/health/liveness
            port: 8080
          initialDelaySeconds: 30
          periodSeconds: 10
          failureThreshold: 3
        startupProbe:
          httpGet:
            path: /actuator/health
            port: 8080
          failureThreshold: 30
          periodSeconds: 5

4.3 Resource Governance & Multi‑Tenant Isolation

ResourceQuota: caps total CPU, memory, PVCs, and pods per namespace.

LimitRange: applies default requests/limits so no pod runs "naked".

PriorityClass: gives core services preemptive scheduling priority.

NetworkPolicy: restricts cross‑service traffic.

Namespace‑level RBAC to reduce accidental privilege escalation.

Example quota:

apiVersion: v1
kind: ResourceQuota
metadata:
  name: prod-quota
  namespace: prod
spec:
  hard:
    requests.cpu: "40"
    requests.memory: 80Gi
    limits.cpu: "80"
    limits.memory: 160Gi
    persistentvolumeclaims: "20"
    pods: "200"
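A companion LimitRange gives every container in the namespace sane defaults so nothing runs without requests/limits (a sketch; tune the values to your workload profile):

apiVersion: v1
kind: LimitRange
metadata:
  name: prod-defaults
  namespace: prod
spec:
  limits:
    - type: Container
      defaultRequest:        # applied when a container omits requests
        cpu: "250m"
        memory: 256Mi
      default:               # applied when a container omits limits
        cpu: "1"
        memory: 512Mi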

4.4 Security Governance

Identity: OIDC integration with corporate IdP.

Authorization: RBAC with least‑privilege principle.

Image security: vulnerability scanning, signing, and admission checks.

Secret management: SealedSecrets or ExternalSecrets backed by Vault.

Runtime hardening: disallow privileged containers, enforce read‑only rootfs, use Seccomp/AppArmor.

OPA Gatekeeper / Kyverno policies example (partial):

Reject Deployments without resources.requests/limits.

Reject images tagged latest.

Reject privileged containers and hostNetwork usage.
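As one concrete instance, the latest‑tag rule can be written as a minimal Kyverno ClusterPolicy (a sketch; the Gatekeeper equivalent would use a Rego ConstraintTemplate instead):

apiVersion: kyverno.io/v1
kind: ClusterPolicy
metadata:
  name: disallow-latest-tag
spec:
  validationFailureAction: Enforce
  rules:
    - name: require-pinned-image-tag
      match:
        any:
          - resources:
              kinds:
                - Pod
      validate:
        message: "Images must use a pinned tag, not 'latest'."
        pattern:
          spec:
            containers:
              - image: "!*:latest"   # reject any image ending in :latest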

5. Observability System Design

5.1 Monitoring Layers

Infrastructure layer: node CPU, memory, disk, network, container restarts, kubelet, etcd, API server.

Platform middleware layer: Nginx Ingress, MySQL, Redis, Kafka, object storage.

Business service layer: request volume, success rate, error rate, P95/P99 latency, thread‑pool saturation, DB connection usage.

If only the first two layers are monitored, you know whether the machine is busy but not whether users are affected.

5.2 SLI / SLO / Error Budget Engineering

For the payment service:

SLI: /api/payments success rate over a 5‑minute window.

SLO: monthly success rate ≥ 99.95%.

Error budget: a 0.05% failure allowance per month (roughly 21.6 minutes of full outage in a 30‑day month).

PromQL for success rate:

sum(rate(http_requests_total{app="payment-service",status!~"5.."}[5m]))
/
sum(rate(http_requests_total{app="payment-service"}[5m]))

PromQL for P99 latency:

histogram_quantile(
  0.99,
  sum(rate(http_server_requests_seconds_bucket{app="payment-service"}[5m])) by (le)
)

5.3 Alert Design Principles

Reflect real user impact.

Provide clear remediation steps.

Minimize noise and duplication.

Four‑level severity:

P1: core‑path failure – immediate escalation.

P2: noticeable degradation – fast response.

P3: potential risk or capacity threshold – same‑day handling.

Info: trend notification for governance, not an on‑call wake‑up.

Example alert for error‑budget burn:

groups:
- name: sre-slo-alerts
  rules:
  - alert: PaymentServiceHighErrorBudgetBurn
    expr: |
      (1 - (sum(rate(http_requests_total{app="payment-service",status!~"5.."}[5m]))
            / sum(rate(http_requests_total{app="payment-service"}[5m]))) ) > 0.02
    for: 10m
    labels:
      severity: critical
      service: payment-service
    annotations:
      summary: "payment-service error budget consumption too fast"
      description: "Error rate has been above threshold for the last 10 minutes; check upstream dependencies, recent releases, and node resources."

5.4 Log System Design

The goal is not just to ingest logs but to enable fast, business‑oriented queries and correlation.

Recommended standard fields:

timestamp, traceId, spanId, level, service, namespace, pod, node, message, errorCode, userId / orderId (business identifiers).
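A structured log line following this model might look like this (all values illustrative):

{"timestamp":"2024-05-20T08:12:33Z","traceId":"9f86d081a3b4","spanId":"4b2a6c","level":"ERROR","service":"payment-service","namespace":"prod","pod":"payment-service-7d4f9c-x2k8v","node":"node-12","errorCode":"PAY_TIMEOUT","orderId":"o-1187","message":"Downstream charge call timed out after 3000ms"}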

Suitable Loki labels (low cardinality):

cluster, namespace, app, container, level.

Unsuitable high‑cardinality labels: requestId, userId, orderId.

Example Loki alert for timeout exceptions:

sum by (app) (
  count_over_time({namespace="prod", app="payment-service"} |= "TimeoutException" [5m])
) > 20

6. GitOps Continuous Delivery Design

6.1 Release Pipeline

Developer pushes code → GitLab CI runs lint, tests, builds Docker image → image pushed to Harbor → GitOps repo is updated with new tag → Argo CD detects change and syncs to target cluster → Prometheus/Grafana validate post‑release health → on‑error automatic or manual rollback.

6.2 Why GitOps fits multi‑environment governance

With many environments, manual releases cause version drift, unclear ownership, and unstable rollbacks. GitOps stores each environment’s desired state in a Kustomize/Helm overlay, making dev, staging, and prod differences explicit and versioned.
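For instance, the prod overlay can be little more than a kustomization.yaml pinning the image tag (a sketch of one possible repo layout; the CI job in section 6.3 rewrites the newTag field on every release):

# apps/payment-service/overlays/prod/kustomization.yaml
apiVersion: kustomize.config.k8s.io/v1beta1
kind: Kustomization
namespace: prod
resources:
  - ../../base           # shared Deployment/Service/HPA manifests
images:
  - name: harbor.example.com/prod/payment-service
    newTag: 1.3.12       # rewritten by CI on every release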

6.3 GitLab CI production‑grade example

stages:
  - test
  - build
  - deploy-config

variables:
  IMAGE_NAME: harbor.example.com/prod/payment-service
  IMAGE_TAG: $CI_COMMIT_SHORT_SHA

unit_test:
  stage: test
  image: maven:3.9-eclipse-temurin-17
  script:
    - mvn -B clean test
  only:
    - merge_requests
    - main

docker_build:
  stage: build
  image: docker:24
  services:
    - docker:24-dind
  script:
    - docker build -t ${IMAGE_NAME}:${IMAGE_TAG} .
    - docker push ${IMAGE_NAME}:${IMAGE_TAG}
  only:
    - main

update_gitops:
  stage: deploy-config
  image: alpine:3.20
  before_script:
    - apk add --no-cache git yq
  script:
    - git clone https://gitlab.example.com/platform/gitops-repo.git
    - cd gitops-repo/apps/payment-service/overlays/prod
    - yq -i '.images[0].newTag = strenv(IMAGE_TAG)' kustomization.yaml  # strenv keeps an all-digit SHA a string
    - git config user.email "[email protected]"
    - git config user.name "ci-bot"
    - git commit -am "release payment-service ${IMAGE_TAG}"
    - git push origin main
  only:
    - main

6.4 Argo CD Application definition example

apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: payment-service-prod
  namespace: argocd
spec:
  project: production
  source:
    repoURL: https://gitlab.example.com/platform/gitops-repo.git
    targetRevision: main
    path: apps/payment-service/overlays/prod
  destination:
    server: https://kubernetes.default.svc
    namespace: prod
  syncPolicy:
    automated:
      prune: true
      selfHeal: true
    syncOptions:
      - CreateNamespace=true

The selfHeal flag ensures that any manual drift is automatically reconciled back to the Git‑defined desired state.

7. SRE Core Practices

7.1 Health Checks, Self‑Healing, Graceful Termination

startupProbe: verifies the application has completed its warm‑up.

readinessProbe: determines when the pod can receive traffic.

livenessProbe: triggers a restart when the container is unhealthy.

Applications must handle SIGTERM to finish in‑flight requests, commit transactions, and shut down cleanly.

7.2 Elastic Scaling based on Business Metrics

CPU‑only HPA is insufficient for high‑concurrency services. Recommended metrics include request queue length, DB connection pool usage, Kafka consumer lag, and gateway QPS. A composite HPA can combine CPU with a custom QPS metric.

Example HPA:

apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: payment-service
  namespace: prod
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: payment-service
  minReplicas: 4
  maxReplicas: 20
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 65
    - type: Pods
      pods:
        metric:
          name: http_requests_per_second
        target:
          type: AverageValue
          averageValue: "120"
  behavior:
    scaleUp:
      stabilizationWindowSeconds: 30
      policies:
        - type: Percent
          value: 100
          periodSeconds: 60
    scaleDown:
      stabilizationWindowSeconds: 300
      policies:
        - type: Percent
          value: 20
          periodSeconds: 60

7.3 Release Stability: Rolling, Canary, Blue‑Green, Auto‑Rollback

Default rolling updates for most stateless services.

Canary releases for high‑risk changes: route a small traffic slice first.

Blue‑Green for core transaction systems: switch traffic between two fully provisioned environments.

Automatic rollback when key business metrics breach thresholds.

When a service mesh or Ingress supports traffic splitting, you can implement percentage‑based ramp‑up, header/cookie‑based user segmentation, or region‑based phased rollouts.
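With the NGINX Ingress Controller, for example, a percentage‑based canary can be expressed declaratively (a sketch; host and service names are illustrative, and a primary Ingress for payment-service is assumed to already serve the same host):

apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: payment-service-canary
  namespace: prod
  annotations:
    nginx.ingress.kubernetes.io/canary: "true"
    nginx.ingress.kubernetes.io/canary-weight: "10"   # send 10% of traffic
spec:
  ingressClassName: nginx
  rules:
    - host: pay.example.com
      http:
        paths:
          - path: /
            pathType: Prefix
            backend:
              service:
                name: payment-service-canary
                port:
                  number: 8080

Raising canary-weight step by step, then promoting and deleting the canary Ingress, implements the 10% → 30% → 100% ramp‑up described in section 9.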

7.4 Chaos Engineering & Fault Drills

Randomly delete core‑service pods.

Inject node failures.

Simulate network latency and packet loss between services.

Add cache miss spikes during load tests.

Simulate database primary‑replica failover or connection timeouts.

Chaos Mesh example (Pod kill):

apiVersion: chaos-mesh.org/v1alpha1
kind: PodChaos
metadata:
  name: payment-service-pod-kill
  namespace: chaos-testing
spec:
  action: pod-kill
  mode: one
  selector:
    namespaces:
      - prod
    labelSelectors:
      app: payment-service
  duration: "30s"
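The network‑latency drill from the list above can be declared the same way (a sketch; injecting 200ms against prod dependencies should be rehearsed in a staging cluster first):

apiVersion: chaos-mesh.org/v1alpha1
kind: NetworkChaos
metadata:
  name: payment-service-latency
  namespace: chaos-testing
spec:
  action: delay
  mode: all
  selector:
    namespaces:
      - prod
    labelSelectors:
      app: payment-service
  delay:
    latency: "200ms"
    jitter: "50ms"
  duration: "2m"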

8. High‑Concurrency Engineering Upgrades

8.1 Four Typical Bottlenecks

Ingress layer: insufficient connections, TLS handshake overhead, uneven load balancing.

Service layer: thread pools, connection pools, JVM heap, GC, lock contention.

Data layer: MySQL hotspot rows, slow SQL, large Redis keys, cache stampede.

Infrastructure layer: CPU contention, disk I/O saturation, cross‑AZ latency, container network jitter.

8.2 Capacity Governance Method

Extract peak QPS, CPU, memory, and network trends from historical Prometheus data (see the query after this list).

Establish a per‑pod safe throughput baseline.

Apply business‑day, marketing‑peak, holiday scaling coefficients.

Reserve redundancy (N+1 nodes or AZ‑failure tolerance).
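For example, the 30‑day peak QPS of the payment service can be pulled with a PromQL subquery (assuming the http_requests_total metric used earlier):

max_over_time(
  sum(rate(http_requests_total{app="payment-service"}[5m]))[30d:5m]
)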

Simple formula:

Target replicas = Peak QPS / Safe per‑pod QPS × Redundancy factor

Example: Peak QPS = 3600, safe per‑pod QPS = 180, redundancy = 1.5 → Target replicas ≈ 30.

HPA maxReplicas, node‑pool capacity, and DB connection limits must be calibrated together.

8.3 Caching, Throttling, Asynchrony

Hot endpoints use local cache or Redis.

Shift heavy synchronous processing to message queues for peak‑shaving.

Apply rate‑limiting and degradation for non‑core paths.

Wrap unstable downstream calls with timeout, retry, circuit‑breaker, and isolation patterns, as in the sketch below.
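A minimal circuit‑breaker sketch with Resilience4j (assumes resilience4j-circuitbreaker on the classpath; the chargeDownstream supplier and the "DEGRADED" fallback are hypothetical stand‑ins for your real client call and degradation path):

import java.time.Duration;
import java.util.function.Supplier;

import io.github.resilience4j.circuitbreaker.CallNotPermittedException;
import io.github.resilience4j.circuitbreaker.CircuitBreaker;
import io.github.resilience4j.circuitbreaker.CircuitBreakerConfig;

public class DownstreamGuard {
    private final CircuitBreaker breaker = CircuitBreaker.of("payment-downstream",
        CircuitBreakerConfig.custom()
            .failureRateThreshold(50)                        // open after 50% failures
            .slidingWindowSize(100)                          // measured over the last 100 calls
            .waitDurationInOpenState(Duration.ofSeconds(30)) // probe the downstream again after 30s
            .build());

    public String charge(Supplier<String> chargeDownstream) {
        Supplier<String> guarded = CircuitBreaker.decorateSupplier(breaker, chargeDownstream);
        try {
            return guarded.get();
        } catch (CallNotPermittedException e) {
            return "DEGRADED";   // fast-fail fallback while the breaker is open
        }
    }
}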

Platform support:

Dedicated node pools for high‑priority workloads.

Elastic scaling rules for async consumers tied to lag metrics.

Gateway connection‑limit, request‑size, and rate‑limit policies.

Dedicated dashboards and alerts for critical paths.

9. End‑to‑End Case: Payment Service Stability

9.1 Scenario

Readiness passed before the pod was fully warmed, causing traffic to hit cold instances.

HPA based only on CPU while the real bottleneck was DB connection pool.

High error‑log volume not correlated with API failure rate.

Missing PodDisruptionBudget caused simultaneous pod eviction during node maintenance.

9.2 Solution

The application adds a warm‑up flag to its readiness check so traffic only arrives after pre‑heating completes.

Expose Micrometer metrics for DB pool, thread pool, and request latency.

Replace CPU‑only HPA with a composite CPU + QPS policy.

Add PDB, anti‑affinity, and cross‑AZ spread for resilience.

Build a payment‑service dashboard aggregating success rate, P99 latency, connection‑pool usage, restart count, and error‑log volume.

Adopt a progressive rollout: 10% → 30% → 100% traffic.

Sample PDB:

apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: payment-service-pdb
  namespace: prod
spec:
  minAvailable: 3
  selector:
    matchLabels:
      app: payment-service

9.3 Outcome

Release failure rate dropped dramatically.

Peak‑time scaling became smoother; connection‑pool saturation is addressed before CPU spikes.

Node maintenance no longer caused service‑wide jitter.

Alerts shifted from generic “machine busy” to user‑impact‑focused “payment success rate drop”.

Mean time to recovery dropped from roughly 45 minutes to about 12 minutes.

Quantified improvements: release time 30 min → 5 min, MTTR 45 min → 12 min, configuration‑related incidents down 60 %, log‑search efficiency up 70 %, resource utilization up 25 % while SLOs stay met.

10. Production‑Grade Code & Config Recommendations

10.1 Mandatory Metrics for Java Spring Boot

HTTP request count, status codes, latency distribution.

JVM heap, GC pause, thread‑pool usage.

Database connection pool active / waiting / timeout counts.

Cache hit ratio.

Message‑queue consumer lag or backlog.

Micrometer bean example (excerpt):

// Imports assumed: io.micrometer.core.instrument.Gauge,
// io.micrometer.core.instrument.binder.MeterBinder,
// com.zaxxer.hikari.HikariDataSource, javax.sql.DataSource,
// java.util.concurrent.ThreadPoolExecutor.
@Bean
MeterBinder paymentMetrics(ThreadPoolTaskExecutor executor, DataSource dataSource) {
    return registry -> {
        // Active worker threads in the payment thread pool
        Gauge.builder("payment_executor_active_count", executor.getThreadPoolExecutor(), ThreadPoolExecutor::getActiveCount)
            .tag("service", "payment-service")
            .register(registry);
        // Active DB connections, exposed only when HikariCP backs the DataSource
        if (dataSource instanceof HikariDataSource hikari) {
            Gauge.builder("payment_db_connections_active", hikari.getHikariPoolMXBean(), bean -> bean.getActiveConnections())
                .tag("service", "payment-service")
                .register(registry);
        }
    };
}

10.2 Graceful Shutdown Implementation (Java)

import java.util.concurrent.atomic.AtomicBoolean;

import org.springframework.context.SmartLifecycle;
import org.springframework.stereotype.Component;

@Component
public class GracefulShutdown implements SmartLifecycle {
    private final AtomicBoolean running = new AtomicBoolean(false);

    @Override
    public void start() { running.set(true); }

    @Override
    public void stop() {
        // Mark the instance as not running (readiness flips to false),
        // then wait so in-flight requests can drain before the context closes.
        running.set(false);
        try { Thread.sleep(15000L); } catch (InterruptedException e) { Thread.currentThread().interrupt(); }
    }

    @Override
    public boolean isRunning() { return running.get(); }
}

This component works together with the readinessProbe: when SIGTERM arrives, the pod is first removed from load‑balancer rotation and then given time to finish in‑flight requests before the process exits.
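On the Kubernetes side this is commonly paired with a short preStop sleep, so endpoint removal has propagated before SIGTERM reaches the application (a sketch; tune the delay to your endpoint‑sync latency and keep terminationGracePeriodSeconds larger than the preStop delay plus drain time):

# container spec excerpt
lifecycle:
  preStop:
    exec:
      command: ["sh", "-c", "sleep 10"]   # let LB/endpoint updates propagate first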

10.3 NetworkPolicy Example

apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: allow-gateway-to-payment
  namespace: prod
spec:
  podSelector:
    matchLabels:
      app: payment-service
  policyTypes:
    - Ingress
  ingress:
    - from:
        - namespaceSelector:
            matchLabels:
              kubernetes.io/metadata.name: gateway
          podSelector:
            matchLabels:
              app: api-gateway
      ports:
        - protocol: TCP
          port: 8080

11. Project Implementation Roadmap

Phase 1 – Basic Platform: HA Kubernetes cluster, image registry, CI pipeline, ingress, storage, log collection, basic monitoring, first batch of containerized apps.

Phase 2 – Delivery Standardization: Introduce Argo CD and the GitOps repo, unified Helm/Kustomize templates, standardized health checks, resource limits, alert templates, and a release rollback process.

Phase 3 – SRE Governance Loop: Define business‑critical SLI/SLO, error budgets and alert severities, on‑call/escalation/post‑mortem procedures, and regular chaos drills and capacity reviews.

Phase 4 – Advanced Capabilities: Service mesh for fine‑grained traffic control, multi‑cluster/region delivery, FinOps cost monitoring, automated fault mitigation and self‑healing.

12. Interview / Presentation FAQ

Why use SLO instead of raw CPU/memory alerts? – SLO measures user‑impact directly, aligning reliability with business goals.

How to make Prometheus highly available? – Deploy Prometheus Operator with multiple replicas, use Thanos or VictoriaMetrics for long‑term storage and global query.

Loki vs ELK? – Loki is cheaper and ideal for label‑based log retrieval in cloud‑native environments; ELK offers powerful full‑text search and complex processing at higher operational cost.

How does GitOps handle secrets? – Store only encrypted secrets (SealedSecrets) or references; the actual values are injected at runtime from Vault or external secret operators.

Why can HPA fail under burst traffic? – Scaling latency and reliance on CPU alone cause late reaction; combine business metrics, pre‑warm replicas, and capacity reservation.

13. Conclusion

A mature cloud‑native platform delivers three long‑term capabilities: faster delivery of change, reliable handling of traffic and failures, and lower cost growth. The value is not merely the Kubernetes cluster, Prometheus dashboards, or Argo CD UI; it is an engineering system where developers know how to onboard, the platform enforces governance, SRE measures and improves reliability, and the business trusts the system under high‑load and fault scenarios.

When presenting this project on a résumé or in a technical talk, highlight the systematic architecture, the production‑grade implementation that supports high concurrency and multi‑team collaboration, and the quantifiable stability loop that turns reliability into a measurable, repeatable process.
