
What We Learned After Two Years of Running Kubernetes in Production

After two years of migrating from Ansible to Kubernetes, we share the hard‑won lessons on why we moved, the challenges of operating a production cluster, decisions on monitoring, logging, CI/CD, security, cost, and how we built an internal platform to streamline development.

MaGe Linux Operations

Aug 7, 2021


About two years ago we decided to stop deploying applications on EC2 with Ansible-based installation and configuration, and to containerize our applications and orchestrate them with Kubernetes instead. We have since migrated most of our infrastructure to Kubernetes. It was a daunting task: we had to run a hybrid deployment until the migration was largely complete, and we had to train the whole team on a new operational paradigm.

In this article we review our experience and share what we learned to help you make better decisions and increase your chances of success.

Clarify Your Reasons for Migrating to Kubernetes

Serverless and containerization are great concepts. If you are building a new business from scratch, you should deploy applications with containers and, if you have the confidence and technical ability to configure and operate Kubernetes, you should use it.

Even with managed services such as EKS, GKE, or AKS, deploying and operating applications on Kubernetes has a learning curve. Your development team must be ready for the challenge and embrace DevOps principles; otherwise the benefits are limited.

If you already deploy on cloud VMs or other PaaS platforms, ask yourself why you need to migrate to Kubernetes and whether it is the only solution.

Our primary reason for migrating was to build a continuous‑integration platform. Accumulated technical debt slowed feature development, so we needed per‑developer integration environments to speed up development and testing without coordination.

Now we can spin up an integrated environment with 21 micro‑services on Kubernetes in eight minutes, and create a fresh environment for each pull request. The entire test cycle (deployment, configuration, test execution) takes less than twelve minutes.

How did we achieve this? It took us about a year and a half of building tools and automation, and of refactoring each application, to stabilize a complex CI pipeline.

We learned that pushing all micro‑services to production to keep development and production environments consistent actually made continuous integration more complex and slower.

We also discovered that using Kubernetes brings benefits such as service discovery, better cost management, elasticity, and governance, even though these were not our original goals.

“A major lesson for us is that we could have taken a path of less resistance when adopting Kubernetes, but we felt forced to adopt it because we saw it as the only option.”

Operating Kubernetes differs from deploying on cloud VMs or bare metal; your engineering team will face a learning curve, and you should consider whether it is worth it now.

Out‑of‑the‑Box Kubernetes Is Far From Enough for Anyone

Kubernetes is not a PaaS solution; it is a platform for building PaaS solutions such as OpenShift.

Most teams need additional infrastructure components and policies, forming an “Internal Kubernetes Platform”.

Metrics

We chose Prometheus for metrics monitoring, as it is the de‑facto standard in the CNCF ecosystem and integrates well with Grafana.

Logging

We moved from an ELK stack to Loki because Loki’s query language is similar to PromQL and integrates seamlessly with Grafana, providing a unified observability UI.

Configuration and Secret Management

Although ConfigMap and Secret objects cover basic needs, we opted for Consul, Vault, and Consul Template. We run Consul Template both as an init container and as a sidecar, so that secrets are refreshed and applications reload gracefully.
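The init-container-plus-sidecar layout can be sketched as a pod manifest built in Python. This is a minimal illustration, not our actual configuration: the image name `hashicorp/consul-template`, the mount paths, the `<name>-templates` ConfigMap, and the function name are all assumptions, and the Consul and Vault connection settings and the reload signalling are left out. The `-template` and `-once` flags are Consul Template's own.

```python
import json


def consul_template_pod(name, app_image, template_arg,
                        ct_image="hashicorp/consul-template"):
    """Build a Pod manifest where Consul Template renders config twice:
    once before the app starts (init container) and continuously (sidecar).

    template_arg uses Consul Template's "source:destination" form.
    All names, images, and paths here are hypothetical.
    """
    rendered = {"name": "rendered", "mountPath": "/etc/app"}
    templates = {"name": "templates", "mountPath": "/templates"}

    def renderer(container_name, extra_args):
        return {
            "name": container_name,
            "image": ct_image,
            "command": ["consul-template", "-template", template_arg] + extra_args,
            "volumeMounts": [rendered, templates],
        }

    return {
        "apiVersion": "v1",
        "kind": "Pod",
        "metadata": {"name": name},
        "spec": {
            # -once renders the templates a single time and exits, so the
            # application never starts without its configuration.
            "initContainers": [renderer("render-once", ["-once"])],
            "containers": [
                {"name": "app", "image": app_image, "volumeMounts": [rendered]},
                # Without -once the process keeps watching and re-renders
                # the files in the shared volume when values change.
                renderer("render-watch", []),
            ],
            "volumes": [
                {"name": "rendered", "emptyDir": {}},
                {"name": "templates", "configMap": {"name": name + "-templates"}},
            ],
        },
    }


if __name__ == "__main__":
    pod = consul_template_pod(
        "api", "example/api:1.0", "/templates/app.conf.ctmpl:/etc/app/app.conf"
    )
    print(json.dumps(pod, indent=2))
```

The shared `emptyDir` volume is what lets the application read files that a different container wrote; how the application notices a change (a signal, a file watcher, a restart) depends on the application and is not shown.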

CI/CD

We continued using Jenkins but found it costly to maintain. We are now exploring Tekton, Argo Workflows, Jenkins X, and other options.

Development Experience

We primarily use Skaffold and Telepresence for local development; Skaffold watches source changes and redeploys, while Telepresence proxies local services to the cluster.

Distributed Tracing

We have not yet implemented distributed tracing but plan to integrate it into Grafana.

Application Packaging, Deployment, and Tools

We experiment with Kustomize, Skaffold, and custom CRDs, allowing developers to choose any open‑source tool that follows open standards.

Operating a Kubernetes Cluster Is Hard

When we started, EKS was not available in Singapore, so we built our own cluster with kops on EC2.

Creating a ready-to-use cluster is relatively easy, but tuning it for production (auto-scaling, networking, security) requires extensive research and custom configuration.

“After two years of production experience we find operating Kubernetes complex; many components require careful handling, and it is often better to let cloud providers manage the heavy lifting.”

You also need to consider upgrades. Even with managed services, upgrades are not always smooth, so automating disaster recovery and upgrade processes is essential.

We use GitOps concepts, eksctl, Terraform, and a custom automation pipeline to provision new clusters and apply changes.

Resource Requests and Limits

Improperly configured requests and limits led to performance and functionality issues, which prompted us to add large buffers to both in order to avoid pod eviction.

“We recommend keeping requests high enough to avoid throttling but not so high that resources are wasted; limits should be close to requests to allow burst capacity without causing evictions.”

In non‑production environments we often set requests low and limits high, effectively over‑committing resources.
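One way to reason about these settings is through the quality-of-service (QoS) class that Kubernetes derives from them. Under node pressure, pods whose usage exceeds their requests are considered for eviction first. The sketch below is a simplified reimplementation of the documented QoS rules for illustration only; the function name is hypothetical, and it compares quantities as written strings, so `"500m"` and `"0.5"` would wrongly count as different.

```python
def qos_class(containers):
    """Return the QoS class for a pod, given each container's `resources` dict.

    Rules (simplified from the Kubernetes documentation):
    - BestEffort: no container sets any CPU or memory request or limit.
    - Guaranteed: every container sets CPU and memory limits, and each
      request equals its limit (a missing request defaults to the limit).
    - Burstable: everything else.
    """
    any_set = False
    guaranteed = True
    for resources in containers:
        requests = resources.get("requests", {})
        limits = resources.get("limits", {})
        for name in ("cpu", "memory"):
            limit = limits.get(name)
            request = requests.get(name, limit)  # request defaults to limit
            if request is not None or limit is not None:
                any_set = True
            if limit is None or request != limit:
                guaranteed = False
    if not any_set:
        return "BestEffort"
    return "Guaranteed" if guaranteed else "Burstable"


if __name__ == "__main__":
    # Low requests, high limits: the over-committed non-production style.
    dev = [{"requests": {"cpu": "100m", "memory": "128Mi"},
            "limits": {"cpu": "2", "memory": "1Gi"}}]
    # Requests equal to limits: no burst capacity, strongest guarantees.
    prod = [{"requests": {"cpu": "500m", "memory": "256Mi"},
             "limits": {"cpu": "500m", "memory": "256Mi"}}]
    print(qos_class(dev))   # Burstable
    print(qos_class(prod))  # Guaranteed
```

Setting requests low and limits high therefore produces Burstable pods, which is a reasonable trade in test clusters where density matters more than predictability.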

Security and Governance

Kubernetes aims to give developers a self‑service cloud platform, reducing the need for dedicated ops teams.

Misconfigurations, such as exposing public ELBs, can introduce risk. We use Open Policy Agent to enforce policies that prevent accidental creation of public load balancers.
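Open Policy Agent policies are written in Rego, but the logic of such a rule is easy to show in Python. The sketch below is a stand-in for illustration, not our actual policy: it rejects a Service of type LoadBalancer unless it carries the AWS internal load balancer annotation. The function name is hypothetical, and the choice of annotation assumes the AWS cloud provider integration.

```python
INTERNAL_ANNOTATION = "service.beta.kubernetes.io/aws-load-balancer-internal"


def deny_reasons(service):
    """Return violation messages for a Service manifest given as a dict.

    An empty list means the object would be admitted.
    """
    if service.get("kind") != "Service":
        return []
    if service.get("spec", {}).get("type") != "LoadBalancer":
        return []
    metadata = service.get("metadata", {})
    annotations = metadata.get("annotations") or {}
    if annotations.get(INTERNAL_ANNOTATION) == "true":
        return []
    name = metadata.get("name", "<unnamed>")
    return [
        f"Service {name} of type LoadBalancer must set "
        f"annotation {INTERNAL_ANNOTATION}=true"
    ]


if __name__ == "__main__":
    public = {
        "kind": "Service",
        "metadata": {"name": "checkout"},
        "spec": {"type": "LoadBalancer"},
    }
    internal = {
        "kind": "Service",
        "metadata": {"name": "checkout",
                     "annotations": {INTERNAL_ANNOTATION: "true"}},
        "spec": {"type": "LoadBalancer"},
    }
    print(deny_reasons(public))    # one violation message
    print(deny_reasons(internal))  # []
```

In a real cluster the equivalent rule would run inside an admission webhook, so the object is rejected before any load balancer is provisioned.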

Cost

Better Resource Utilization

After migration we saw significant cost savings by using fewer compute, memory, and storage resources while maintaining the same capabilities.

“Initial over‑provisioning caused high costs, but once the cluster stabilized we reduced resource requests and eliminated waste.”

Spot Instances

Running workloads on Spot instances saves money; Kubernetes can quickly reschedule interrupted containers.

“Our test cluster now runs on Spot instances, delivering substantial savings.”

ELB Consolidation

We use Ingress to consolidate ELBs in test environments, reducing ELB costs, and we have built a controller to convert LoadBalancer services to NodePort where appropriate.
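The core transformation such a controller performs can be sketched as a pure function over manifests. This is an illustrative assumption about the shape of the conversion, not the controller we built: the function name and the host argument are hypothetical, only the first service port is routed, and load-balancer-specific fields are not cleaned up.

```python
import copy


def to_nodeport_with_ingress(service, host):
    """Convert a LoadBalancer Service into a NodePort Service plus an
    Ingress that routes `host` to it.

    Returns (service, ingress). Services of any other type are returned
    unchanged with ingress set to None. The input is never mutated.
    """
    if service.get("spec", {}).get("type") != "LoadBalancer":
        return service, None

    converted = copy.deepcopy(service)
    converted["spec"]["type"] = "NodePort"

    name = converted["metadata"]["name"]
    namespace = converted["metadata"].get("namespace", "default")
    port = converted["spec"]["ports"][0]["port"]  # first port only

    ingress = {
        "apiVersion": "networking.k8s.io/v1",
        "kind": "Ingress",
        "metadata": {"name": name, "namespace": namespace},
        "spec": {
            "rules": [{
                "host": host,
                "http": {"paths": [{
                    "path": "/",
                    "pathType": "Prefix",
                    "backend": {"service": {"name": name,
                                            "port": {"number": port}}},
                }]},
            }]
        },
    }
    return converted, ingress


if __name__ == "__main__":
    svc = {
        "apiVersion": "v1",
        "kind": "Service",
        "metadata": {"name": "cart", "namespace": "pr-1234"},
        "spec": {"type": "LoadBalancer",
                 "ports": [{"port": 80, "targetPort": 8080}]},
    }
    new_svc, ing = to_nodeport_with_ingress(svc, "cart.pr-1234.example.test")
    print(new_svc["spec"]["type"])  # NodePort
    print(ing["spec"]["rules"][0]["host"])
```

Because many Ingress objects can share one ingress controller, each converted service stops needing a load balancer of its own, which is where the saving comes from.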

Cross‑AZ Data Transfer

While we saved on infrastructure, cross‑AZ data transfer costs increased. Controlling pod placement and service discovery can mitigate this, and service meshes are a potential solution.
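As one example of controlling pod placement, a client can express a preference to be scheduled in the same availability zone as the service it calls most. The sketch below builds the `affinity` stanza for that using the standard `topology.kubernetes.io/zone` node label; the function name and the example labels are assumptions, and whether this reduces transfer costs in practice depends on how traffic is actually routed between pods.

```python
def same_zone_affinity(target_labels, weight=100):
    """Build a pod `affinity` stanza that prefers nodes in the same zone
    as pods matching `target_labels`.

    The rule is a soft preference, so pods still schedule when no node
    in the preferred zone has room. Weight must be between 1 and 100.
    """
    if not 1 <= weight <= 100:
        raise ValueError("weight must be between 1 and 100")
    return {
        "podAffinity": {
            "preferredDuringSchedulingIgnoredDuringExecution": [{
                "weight": weight,
                "podAffinityTerm": {
                    "labelSelector": {"matchLabels": dict(target_labels)},
                    "topologyKey": "topology.kubernetes.io/zone",
                },
            }]
        }
    }


if __name__ == "__main__":
    # Hypothetical: an API pod that talks mostly to pods labelled app=cache.
    affinity = same_zone_affinity({"app": "cache"})
    print(affinity)
```

The returned dict goes under `spec.affinity` in the pod template. A soft preference is used here rather than a hard requirement so that a zone outage or a full zone does not block scheduling.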

CRDs, Operators, and Controllers – Simplify Operations

We have invested in building Operators and CRDs, such as a controller that converts LoadBalancer services to NodePort services behind an Ingress, and another that creates DNS CNAME records automatically.

We also created a CRD for declaratively defining Grafana dashboards, allowing developers to version‑control monitoring alongside application code.

These custom resources and controllers let us focus on building the Grofers Kubernetes Platform to best support our development teams.
