Lessons Learned from Two Years of Production Kubernetes at Grofers

This article recounts Grofers' two‑year journey migrating from Ansible‑managed EC2 instances to Kubernetes, detailing the motivations, migration strategy, operational challenges, observability choices, CI/CD tooling, resource management, security practices, cost considerations, and the overall impact on development velocity and platform stability.

Top Architect

About two years ago we decided to stop deploying applications on EC2 using Ansible and move to a container‑based stack with Kubernetes for orchestration. The migration was demanding, involving hybrid infrastructure, new operational paradigms, and extensive team training.

1. Reasons for Migrating to Kubernetes

Containers and serverless are great for new services, but adopting Kubernetes requires sufficient team bandwidth, configuration expertise, and a DevOps mindset. Even on managed services like EKS, GKE, or AKS, there is a learning curve. The main driver for us was the need for a continuous‑integration‑friendly infrastructure on which to rebuild micro‑services that had accumulated architectural debt.

2. Migration Approach

It took us about 18 months to stabilise a complex CI pipeline that could spin up integration environments for 21 micro‑services in eight minutes, with test cycles under twelve minutes. We built additional tooling, telemetry, and re‑engineered deployment methods to keep development and production environments consistent.

3. Out‑of‑the‑Box Kubernetes Is Not Enough

Kubernetes is a platform for building PaaS solutions, not a ready‑made PaaS. We needed extra components—metrics, logging, service discovery, distributed tracing, configuration/secret management, CI/CD, local development experience, and custom autoscaling. These decisions formed the basis of our internal Kubernetes platform.

4. Operating a Kubernetes Cluster

We initially used kops in the AWS Singapore region because EKS was not available there at the time. Setting up a basic cluster was easy, but configuring autoscaling, resources, and networking for production proved challenging. We learned that operating Kubernetes is complex and often better delegated to a managed service.

5. Observability Stack

We chose Prometheus for metrics and Grafana for visualization, replacing InfluxDB. For logging we migrated from ELK to Loki because of its cost‑effectiveness and its PromQL‑inspired query language (LogQL), which let us use a single Grafana UI for both metrics and logs.
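To illustrate how similar the two query languages look in practice, the sketch below builds one query URL for the Prometheus HTTP API and one for the Loki HTTP API. The hostnames, the `checkout` label value, and the `http_requests_total` metric are assumptions made for illustration; they are not taken from the Grofers setup.

```python
from urllib.parse import urlencode

# Hypothetical in-cluster endpoints; replace with your own.
PROMETHEUS_URL = "http://prometheus.example.internal:9090"
LOKI_URL = "http://loki.example.internal:3100"


def prometheus_query_url(promql: str) -> str:
    """Build an instant-query URL for the Prometheus HTTP API."""
    return f"{PROMETHEUS_URL}/api/v1/query?{urlencode({'query': promql})}"


def loki_query_url(logql: str, limit: int = 100) -> str:
    """Build a range-query URL for the Loki HTTP API."""
    params = urlencode({"query": logql, "limit": limit})
    return f"{LOKI_URL}/loki/api/v1/query_range?{params}"


# PromQL: per-second rate of 5xx responses for a hypothetical service.
error_rate = 'sum(rate(http_requests_total{app="checkout",status=~"5.."}[5m]))'

# LogQL: the same label-selector syntax, followed by a line filter.
error_logs = '{app="checkout"} |= "error"'

metrics_url = prometheus_query_url(error_rate)
logs_url = loki_query_url(error_logs)
```

Both queries select series or streams with the same `{label="value"}` syntax, which is what makes it practical to pivot between a metric panel and the matching logs in one Grafana instance.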

6. Configuration and Secret Management

ConfigMaps and Secrets were insufficient for our needs, so we adopted Consul, Vault, and Consul‑Template. Consul‑Template runs as an init container to render configuration before the application starts, or as a sidecar to watch for changes, refresh secrets, and reload the application gracefully.
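The render, compare, and reload cycle that a sidecar of this kind performs can be sketched roughly as follows. This is not how Consul‑Template itself is implemented. The function names are hypothetical, and `fetch_values` stands in for whatever reads the current keys from Consul or Vault.

```python
import hashlib
import os
import tempfile
from string import Template
from typing import Callable, Mapping


def render(template_text: str, values: Mapping[str, str]) -> str:
    """Fill $placeholders in the template with the current values."""
    return Template(template_text).substitute(values)


def write_if_changed(path: str, content: str) -> bool:
    """Atomically write content to path; return True only if it changed."""
    data = content.encode("utf-8")
    new_digest = hashlib.sha256(data).hexdigest()
    try:
        with open(path, "rb") as fh:
            old_digest = hashlib.sha256(fh.read()).hexdigest()
    except FileNotFoundError:
        old_digest = None
    if new_digest == old_digest:
        return False
    # Write to a temp file in the same directory, then rename, so the
    # application never reads a half-written config file.
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory)
    with os.fdopen(fd, "wb") as fh:
        fh.write(data)
    os.replace(tmp_path, path)
    return True


def sync_once(
    template_text: str,
    fetch_values: Callable[[], Mapping[str, str]],
    path: str,
    reload_app: Callable[[], None],
) -> bool:
    """One iteration of the watch loop: fetch, render, write, reload."""
    changed = write_if_changed(path, render(template_text, fetch_values()))
    if changed:
        reload_app()  # e.g. send SIGHUP to the application process
    return changed
```

An init container would call something like `sync_once` a single time and exit, while a sidecar would call it in a loop. Reloading only when the rendered output changes avoids restarting the application on every poll.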

7. CI/CD Evolution

We continued using Jenkins after the migration but found it suboptimal for cloud‑native workloads. We explored Tekton and Argo Workflows and considered alternatives like Jenkins X, Screwdriver, and Keptn.

8. Deployment Tools

We evaluated Telepresence.io and Skaffold for the develop‑and‑deploy loop: Skaffold watches local source changes and rebuilds and redeploys them to the cluster, while Telepresence lets a service running locally communicate with services in the cluster. Both remain viable options.

9. Resource Requests and Limits

Poorly sized requests caused pod evictions under node memory pressure. We learned to set requests high enough that pods are neither evicted nor OOM‑killed under normal load, but not so high that capacity is wasted, and to keep limits only modestly above requests so that bursts are possible without heavily overcommitting nodes.
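How requests relate to limits also determines a pod's Quality of Service class, which influences how it is treated under node pressure. The sketch below is a simplified Python rendering of the classification rules described in the Kubernetes documentation. It compares quantities as plain strings (so "1" and "1000m" count as different), which the real implementation does not do, and the function name is hypothetical.

```python
from typing import Dict, List

# Shape of one container's resources: {"requests": {...}, "limits": {...}}
Resources = Dict[str, Dict[str, str]]


def qos_class(containers: List[Resources]) -> str:
    """Return the QoS class Kubernetes would assign to a pod (simplified)."""
    any_set = False
    all_guaranteed = True
    for container in containers:
        limits = container.get("limits", {})
        # An unset request defaults to the limit for that resource.
        requests = {**limits, **container.get("requests", {})}
        if requests or limits:
            any_set = True
        for resource in ("cpu", "memory"):
            # Guaranteed needs a limit for both resources, equal to the request.
            if resource not in limits or requests[resource] != limits[resource]:
                all_guaranteed = False
    if not any_set:
        return "BestEffort"
    return "Guaranteed" if all_guaranteed else "Burstable"


# Requests equal to limits for both CPU and memory.
pinned = [{"requests": {"cpu": "500m", "memory": "512Mi"},
           "limits": {"cpu": "500m", "memory": "512Mi"}}]

# A memory limit above the request leaves room to burst.
bursty = [{"requests": {"cpu": "250m", "memory": "256Mi"},
           "limits": {"memory": "512Mi"}}]
```

Here `qos_class(pinned)` yields "Guaranteed" and `qos_class(bursty)` yields "Burstable". Pods whose memory usage exceeds their requests are among the first candidates for eviction when a node runs short, which is one reason undersized requests hurt.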

10. Security and Governance

We used Open Policy Agent to enforce security policies at admission time, such as rejecting Services that would create a public ELB unless they carry an explicit annotation, and to automate change‑management controls.
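Real OPA policies are written in Rego, so the following is only a Python rendering of the decision logic such a rule might encode. The opt‑in annotation `example.com/allow-public-elb` is hypothetical, and the assumption that a value of "true" on the AWS internal load balancer annotation marks a Service as non‑public is a simplification.

```python
from typing import Any, Dict, List

# AWS annotation that requests an internal (non-public) load balancer.
INTERNAL_LB = "service.beta.kubernetes.io/aws-load-balancer-internal"
# Hypothetical opt-in annotation for deliberately public load balancers.
ALLOW_PUBLIC = "example.com/allow-public-elb"


def violations(obj: Dict[str, Any]) -> List[str]:
    """Return denial messages for a Kubernetes object; empty means allowed."""
    if obj.get("kind") != "Service":
        return []
    if obj.get("spec", {}).get("type") != "LoadBalancer":
        return []
    metadata = obj.get("metadata", {})
    annotations = metadata.get("annotations") or {}
    if annotations.get(INTERNAL_LB) == "true":
        return []  # internal ELB, not reachable from the internet
    if annotations.get(ALLOW_PUBLIC) == "true":
        return []  # owner explicitly opted in to a public ELB
    name = metadata.get("name", "<unnamed>")
    return [
        f"Service {name!r} would create a public ELB; "
        f"set {INTERNAL_LB} or {ALLOW_PUBLIC} to 'true'"
    ]


public_service = {
    "kind": "Service",
    "metadata": {"name": "checkout"},
    "spec": {"type": "LoadBalancer"},
}
```

An admission webhook would reject `public_service` because `violations(public_service)` is non‑empty, and would accept it once either annotation is present.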

11. Cost Optimisation

Post‑migration we saw significant cost savings from better resource utilisation, though spot instance usage introduced new considerations for cross‑AZ data transfer. We leveraged spot instances for pre‑release clusters and combined them with reserved instances and savings plans for production.

12. Advanced Platform Features

We built custom controllers and CRDs to automate tasks like converting LoadBalancer services to Ingress, auto‑creating DNS CNAME records, and generating Grafana dashboards via declarative manifests.
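At the core of a controller like this is a reconcile step that compares desired state with actual state and computes the changes to apply. The sketch below shows that step for DNS CNAME records as a pair of pure Python functions. The annotation `example.com/dns-name` and the function names are hypothetical, and a real controller would watch the Kubernetes API and call a DNS provider rather than work on in‑memory dictionaries.

```python
from typing import Dict, List, Tuple

# Hypothetical annotation naming the DNS record a Service should get.
HOSTNAME_ANNOTATION = "example.com/dns-name"


def desired_records(services: List[dict]) -> Dict[str, str]:
    """Map each requested DNS name to the Service's load balancer hostname."""
    records: Dict[str, str] = {}
    for svc in services:
        annotations = svc.get("metadata", {}).get("annotations") or {}
        dns_name = annotations.get(HOSTNAME_ANNOTATION)
        ingress = svc.get("status", {}).get("loadBalancer", {}).get("ingress") or []
        # Skip Services without the annotation or without a provisioned LB.
        if dns_name and ingress and ingress[0].get("hostname"):
            records[dns_name] = ingress[0]["hostname"]
    return records


def plan(
    desired: Dict[str, str], actual: Dict[str, str]
) -> Tuple[Dict[str, str], List[str]]:
    """Return (records to create or update, record names to delete)."""
    upserts = {n: t for n, t in desired.items() if actual.get(n) != t}
    deletes = sorted(n for n in actual if n not in desired)
    return upserts, deletes


services = [{
    "metadata": {"annotations": {HOSTNAME_ANNOTATION: "api.example.com"}},
    "status": {"loadBalancer": {"ingress": [{"hostname": "lb-1.example.net"}]}},
}]
actual = {"old.example.com": "lb-0.example.net"}

upserts, deletes = plan(desired_records(services), actual)
```

Keeping the diff logic free of side effects makes it easy to test, and running it repeatedly converges on the desired state, which is the property controllers rely on.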

Overall, the migration to Kubernetes delivered faster experimentation, lower costs, and improved developer autonomy, but it also required substantial effort in tooling, observability, security, and operational expertise.

Republication Notice

This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contact admin@besthub.dev and we will review it promptly.

Written by

Top Architect

Top Architect focuses on sharing practical architecture knowledge, covering enterprise, system, website, large‑scale distributed, and high‑availability architectures, plus architecture adjustments using internet technologies. We welcome idea‑driven, sharing‑oriented architects to exchange and learn together.
