Lessons Learned from Two Years of Production Kubernetes at Grofers
This article recounts Grofers' two‑year journey migrating from Ansible‑managed EC2 instances to Kubernetes, detailing the motivations, migration strategy, operational challenges, observability choices, CI/CD tooling, resource management, security practices, cost considerations, and the overall impact on development velocity and platform stability.
About two years ago we decided to stop deploying applications on EC2 using Ansible and move to a container‑based stack with Kubernetes for orchestration. The migration was demanding, involving hybrid infrastructure, new operational paradigms, and extensive team training.
1. Reasons for Migrating to Kubernetes
Containers and serverless are great for new services, but adopting Kubernetes requires sufficient bandwidth, configuration expertise, and a DevOps mindset. Even on managed services like EKS, GKE, or AKS, there is a learning curve. The main driver for us was the need for a continuous‑integration‑friendly infrastructure on which to rebuild micro‑services that had accumulated architectural debt.
2. Migration Approach
It took us about 18 months to stabilise a complex CI pipeline that could spin up integration environments for 21 micro‑services in eight minutes, with test cycles under twelve minutes. We built additional tooling, telemetry, and re‑engineered deployment methods to keep development and production environments consistent.
3. Out‑of‑the‑Box Kubernetes Is Not Enough
Kubernetes is a platform for building PaaS solutions, not a ready‑made PaaS. We needed extra components: metrics, logging, service discovery, distributed tracing, configuration/secret management, CI/CD, local development experience, and custom autoscaling. These decisions formed the basis of our internal Kubernetes platform.
4. Operating a Kubernetes Cluster
We initially used kops in the AWS Singapore region because EKS was not available there at the time. Setting up a basic cluster was easy, but configuring autoscaling, resources, and networking for production proved challenging. We learned that operating Kubernetes is complex and often better delegated to a managed service.
5. Observability Stack
We chose Prometheus for metrics and Grafana for visualization, replacing InfluxDB. For logging we migrated from ELK to Loki because of its cost‑effectiveness and because its query language, LogQL, is modelled on PromQL, which let us use a single Grafana UI for both metrics and logs.
6. Configuration and Secret Management
ConfigMaps and Secrets were insufficient for our needs, so we adopted Consul, Vault, and Consul‑Template. Consul‑Template runs as an init container to render configuration before the application starts, or as a sidecar to watch for changes, refresh secrets, and reload the application gracefully.
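The render-and-reload cycle described above can be sketched in a few lines. This is a minimal illustration of the idea, not Consul‑Template itself and not Grofers' implementation: the `fetch`, `write`, and `reload` callables are hypothetical stand-ins for reading from Consul/Vault, writing to a shared volume, and signalling the application.

```python
import hashlib
from string import Template
from typing import Callable, Mapping, Optional


def render(template_text: str, values: Mapping[str, str]) -> str:
    """Fill $placeholders in the template with the current key/value data."""
    return Template(template_text).substitute(values)


class ConfigSyncer:
    """Re-render a config file and trigger a reload only when its content changes."""

    def __init__(
        self,
        template_text: str,
        fetch: Callable[[], Mapping[str, str]],  # hypothetical: latest values from Consul/Vault
        write: Callable[[str], None],            # hypothetical: write rendered file to a shared volume
        reload: Callable[[], None],              # hypothetical: e.g. send SIGHUP to the application
    ) -> None:
        self.template_text = template_text
        self.fetch = fetch
        self.write = write
        self.reload = reload
        self._last_digest: Optional[str] = None

    def sync_once(self) -> bool:
        """Run one cycle; return True if the rendered output changed."""
        rendered = render(self.template_text, self.fetch())
        digest = hashlib.sha256(rendered.encode()).hexdigest()
        if digest == self._last_digest:
            return False  # nothing changed, so do not disturb the application
        self.write(rendered)
        # The first render corresponds to the init-container phase: the
        # application has not started yet, so there is nothing to reload.
        if self._last_digest is not None:
            self.reload()
        self._last_digest = digest
        return True
```

Calling `sync_once()` once before startup mirrors the init-container role; calling it in a loop mirrors the sidecar role. Comparing digests keeps the application from being reloaded when a poll returns unchanged data.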
7. CI/CD Evolution
We continued using Jenkins after the migration but found it suboptimal for cloud‑native workloads. We explored Tekton and Argo Workflows and considered alternatives like Jenkins X, Screwdriver, and Keptn.
8. Deployment Tools
We evaluated Telepresence.io and Skaffold for the development and deployment loop. Skaffold watches local changes and updates the cluster, while Telepresence lets a locally running service communicate with services in the cluster. Both remain viable options.
9. Resource Requests and Limits
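Requests that were set too low caused pod evictions when nodes came under memory pressure. We learned to set requests high enough to cover real usage, but not so high as to waste resources, and to keep limits only modestly above requests, so that pods retain some burst headroom without overcommitting the node and risking OOM kills.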
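One way to arrive at such values is to derive them from observed usage. The sketch below is an assumption about how this could be done, not the method Grofers used: it takes memory samples in MiB, sets the request at a chosen percentile, and sets the limit a fixed factor above the request. The function names, the 95th percentile, and the 1.25 headroom factor are all illustrative.

```python
import math
from typing import Dict, Sequence


def percentile(samples: Sequence[float], pct: float) -> float:
    """Nearest-rank percentile; pct is expected in the range (0, 100]."""
    if not samples:
        raise ValueError("need at least one sample")
    ordered = sorted(samples)
    rank = math.ceil(pct * len(ordered) / 100)
    return ordered[max(rank, 1) - 1]


def suggest_memory(
    samples_mib: Sequence[float],
    request_pct: float = 95.0,     # illustrative default, not a recommendation
    limit_headroom: float = 1.25,  # illustrative default, not a recommendation
) -> Dict[str, int]:
    """Suggest a memory request and limit (in MiB) from observed usage."""
    request = math.ceil(percentile(samples_mib, request_pct))
    # Keep the limit close to the request, but never below the observed peak,
    # otherwise the pod would be OOM-killed by behaviour already seen.
    limit = max(math.ceil(request * limit_headroom), math.ceil(max(samples_mib)))
    return {"request_mib": request, "limit_mib": limit}
```

For example, `suggest_memory([200, 210, 220, 230, 240], request_pct=100)` yields a 240 MiB request and a 300 MiB limit. A workload with rare large spikes will produce a limit well above the request, which is a signal to look at the spikes rather than a value to apply blindly.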
10. Security and Governance
We used Open Policy Agent to enforce security policies, such as preventing public ELB creation without explicit annotations, and to automate change‑management controls.
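In OPA such a rule would be written in Rego and evaluated at admission time. The Python sketch below only mirrors the decision logic so that it can be read and tested in isolation. It assumes the standard AWS annotation `service.beta.kubernetes.io/aws-load-balancer-internal` with the value `"true"` marks an internal load balancer, and it uses a hypothetical opt-in annotation, `platform.example.com/allow-public-elb`, which is not taken from the original article.

```python
from typing import Any, Dict, List

# AWS annotation that requests an internal (non-public) load balancer.
INTERNAL_LB = "service.beta.kubernetes.io/aws-load-balancer-internal"
# Hypothetical opt-in annotation; the real name used at Grofers is not known.
ALLOW_PUBLIC = "platform.example.com/allow-public-elb"


def violations(obj: Dict[str, Any]) -> List[str]:
    """Return denial messages for a Kubernetes object; empty means admit."""
    if obj.get("kind") != "Service":
        return []
    if obj.get("spec", {}).get("type") != "LoadBalancer":
        return []

    metadata = obj.get("metadata", {})
    annotations = metadata.get("annotations") or {}

    if annotations.get(INTERNAL_LB) == "true":
        return []  # internal load balancer, nothing is exposed publicly
    if annotations.get(ALLOW_PUBLIC) == "true":
        return []  # public exposure was explicitly requested

    name = metadata.get("name", "<unnamed>")
    return [
        f"Service {name!r} would create a public load balancer; "
        f"set {INTERNAL_LB}=true or opt in with {ALLOW_PUBLIC}=true"
    ]
```

Denying by default and requiring an explicit annotation means that public exposure shows up in the manifest, where it can be caught in code review.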
11. Cost Optimisation
Post‑migration we saw significant cost savings from better resource utilisation, though spot instance usage introduced new considerations for cross‑AZ data transfer. We leveraged spot instances for pre‑release clusters and combined them with reserved instances and savings plans for production.
12. Advanced Platform Features
We built custom controllers and CRDs to automate tasks like converting LoadBalancer services to Ingress, auto‑creating DNS CNAME records, and generating Grafana dashboards via declarative manifests.
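Controllers of this kind usually follow a reconcile pattern: compare the desired state declared in the cluster with the actual state of the external system, then apply only the difference. The sketch below illustrates that pattern for CNAME records using plain dictionaries. It is an assumed simplification, not Grofers' controller, and `Plan`, `plan_cname_changes`, and `apply_plan` are hypothetical names.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class Plan:
    """Changes needed to move the DNS zone from actual to desired state."""
    create: Dict[str, str]
    update: Dict[str, str]
    delete: List[str]


def plan_cname_changes(desired: Dict[str, str], actual: Dict[str, str]) -> Plan:
    """Diff two {record name: CNAME target} mappings."""
    create = {name: target for name, target in desired.items() if name not in actual}
    update = {
        name: target
        for name, target in desired.items()
        if name in actual and actual[name] != target
    }
    delete = sorted(name for name in actual if name not in desired)
    return Plan(create=create, update=update, delete=delete)


def apply_plan(actual: Dict[str, str], plan: Plan) -> Dict[str, str]:
    """Return the zone contents after applying the plan (stand-in for DNS API calls)."""
    result = dict(actual)
    result.update(plan.create)
    result.update(plan.update)
    for name in plan.delete:
        result.pop(name, None)
    return result
```

Because the plan is computed from a full comparison rather than from individual events, running the loop again after a successful apply produces an empty plan, so a missed or repeated event does no harm.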
Overall, the migration to Kubernetes delivered faster experimentation, lower costs, and improved developer autonomy, but it also required substantial effort in tooling, observability, security, and operational expertise.
Top Architect
Top Architect focuses on sharing practical architecture knowledge, covering enterprise, system, website, large‑scale distributed, and high‑availability architectures, plus architecture adjustments using internet technologies. We welcome idea‑driven, sharing‑oriented architects to exchange and learn together.
