What 4 Years of Startup Infrastructure Taught Me: AWS, Terraform, GitOps & More

After four years running infrastructure at a fast‑growing startup, the author reviews almost every major decision—from choosing AWS over GCP and adopting EKS, RDS, and Redis, to automating post‑mortems with Slack bots, standardising IaC with Terraform and GitOps, and evaluating SaaS tools like DataDog, PagerDuty, and Notion—highlighting the benefits, regrets, and practical lessons learned.

ITPUB

Cloud Provider Selection

The team initially ran workloads on both Google Cloud Platform (GCP) and Amazon Web Services (AWS). AWS provided a dedicated account manager, more responsive support, and a stable API surface with few backward‑incompatible changes. Over time, the Kubernetes integrations that target AWS (e.g., external-dns, external-secrets) matured, making AWS the preferred platform for container workloads.

Managed Services on AWS

Elastic Kubernetes Service (EKS)

EKS is used for a managed control plane. The cost of a managed service is justified unless the organization is extremely cost‑conscious and can afford the operational overhead of self‑hosting the control plane. Deep integration with other AWS services (IAM, Route 53, Load Balancers) outweighs the benefits of alternatives such as ECS.

Relational Database Service (RDS)

RDS is chosen for production databases because data availability is critical. The additional expense of a managed service is worth paying because it greatly reduces the risk of downtime and data loss that comes with operating a self‑managed database.

ElastiCache (Redis)

Redis is used as a fast, feature‑rich cache and general‑purpose data store. Its rich command set and strong documentation make it suitable for more than simple caching, and AWS’s large customer base suggests continued support.

Elastic Container Registry (ECR)

The team migrated container images from an unstable quay.io setup to ECR. This improved image availability and enabled tighter IAM permissions integration with EKS nodes and developer workstations.

AWS VPN

A simple site‑to‑site VPN managed through Okta satisfies the organization’s networking needs. Alternative zero‑trust solutions (e.g., Cloudflare) were evaluated but deemed unnecessary for the current use case.

Infrastructure Automation

Control Tower Account Factory for Terraform (AFT)

AFT automates AWS Control Tower account provisioning and enforces standardized tagging. Tags are used to drive automated VPC peering decisions and cost allocation.
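
As a rough illustration of how tags can drive peering decisions, the sketch below pairs accounts that share an environment tag and have both opted in. The tag keys (`environment`, `peering`), the account names, and the rule itself are assumptions made for the example, not the team's actual policy.

```python
from dataclasses import dataclass, field
from itertools import combinations


@dataclass
class Account:
    """An AWS account together with the tags applied at provisioning time."""
    name: str
    tags: dict = field(default_factory=dict)


def should_peer(a, b):
    """Hypothetical rule: peer only within one environment, and only if both opt in."""
    env_a = a.tags.get("environment")
    same_env = env_a is not None and env_a == b.tags.get("environment")
    both_opted_in = all(acc.tags.get("peering") == "enabled" for acc in (a, b))
    return same_env and both_opted_in


def peering_pairs(accounts):
    """Return every pair of account names that the rule says should be peered."""
    return [(a.name, b.name) for a, b in combinations(accounts, 2) if should_peer(a, b)]


# Illustrative accounts; names and tag values are made up.
accounts = [
    Account("prod-api", {"environment": "prod", "peering": "enabled"}),
    Account("prod-data", {"environment": "prod", "peering": "enabled"}),
    Account("dev-api", {"environment": "dev", "peering": "enabled"}),
    Account("prod-sandbox", {"environment": "prod", "peering": "disabled"}),
]
print(peering_pairs(accounts))
```

Keeping the rule a pure function of tags means a mis-tagged account shows up as a missing or unexpected pair, which is easy to review before any peering is created.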

Process Automation

Slack bot reminders trigger post‑mortem report submissions, reducing manual nudging.

PagerDuty incident templates provide a starting point for incident response documentation.

Bi‑weekly reviews of PagerDuty tickets prioritize critical alerts and prune noise.
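
The Slack reminders described above could be built around logic like the following sketch. The incident fields, the three-day grace period, and the message wording are hypothetical; only the `<@USER_ID>` mention syntax is standard Slack formatting. Sending the message is left out on purpose.

```python
from datetime import date


def overdue_postmortems(incidents, today, grace_days=3):
    """Return resolved incidents with no post-mortem after the grace period."""
    return [
        i for i in incidents
        if i["postmortem_url"] is None
        and (today - i["resolved_on"]).days > grace_days
    ]


def reminder_text(incident, today):
    """Build the reminder message addressed to the incident owner."""
    days = (today - incident["resolved_on"]).days
    return (
        f"<@{incident['owner']}> post-mortem for {incident['id']} "
        f"is still missing, {days} days after resolution."
    )


# Illustrative data; IDs, owners, and dates are made up.
today = date(2024, 1, 10)
incidents = [
    {"id": "INC-1", "owner": "U123", "resolved_on": date(2024, 1, 2), "postmortem_url": None},
    {"id": "INC-2", "owner": "U456", "resolved_on": date(2024, 1, 9), "postmortem_url": None},
    {"id": "INC-3", "owner": "U789", "resolved_on": date(2024, 1, 1), "postmortem_url": "https://example.com/pm/3"},
]
for incident in overdue_postmortems(incidents, today):
    print(reminder_text(incident, today))
```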

Cost‑Tracking Meetings

Monthly cross‑functional meetings review infrastructure and SaaS spend (AWS, DataDog, etc.). AWS costs are broken down by account and by tag (e.g., environment=prod), which makes it possible to identify high‑spend items such as spot instance usage or network egress.
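
A minimal sketch of that breakdown, assuming cost records have already been exported into plain dictionaries. The record shape, account names, and dollar figures are invented for illustration.

```python
from collections import defaultdict


def spend_by(records, key):
    """Sum cost per group and return (group, total) pairs, largest first."""
    totals = defaultdict(float)
    for record in records:
        totals[key(record)] += record["usd"]
    return sorted(totals.items(), key=lambda item: item[1], reverse=True)


# Hypothetical cost records; a real export would carry many more fields.
records = [
    {"account": "prod", "service": "EC2 spot", "tags": {"environment": "prod"}, "usd": 1200.0},
    {"account": "prod", "service": "Data transfer", "tags": {"environment": "prod"}, "usd": 800.0},
    {"account": "dev", "service": "EC2 spot", "tags": {}, "usd": 300.0},
]

by_account = spend_by(records, lambda r: r["account"])
by_service = spend_by(records, lambda r: r["service"])
# Untagged spend gets its own bucket so that gaps in tagging stay visible.
by_environment = spend_by(records, lambda r: r["tags"].get("environment", "untagged"))
print(by_account, by_service, by_environment)
```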

SaaS Tooling Decisions

Identity Management: Early adoption of Okta (instead of Google Workspace groups) provided unified SSO and fine‑grained permission control.

Documentation: Notion serves as a flexible wiki with database‑driven page organization.

Communication: Slack is the default channel; best practices include using threads, setting response expectations, and preferring public channels over direct messages.

Issue Tracking: Linear replaced JIRA for a lighter workflow.

Terraform Execution: Atlantis replaced Terraform Cloud; custom scripts fill gaps where Atlantis lacks features.

CI/CD: GitHub Actions is the primary pipeline. Self‑hosted runners run on EKS via actions-runner-controller, though support for custom workflows is still limited.

Observability: DataDog offers powerful metrics and tracing, but its per‑instance pricing model can become disproportionately expensive for clusters with many short‑lived spot instances or GPU instances.
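
The arithmetic behind that concern can be sketched with a deliberately simplified model. This is not DataDog's actual metering: the sketch assumes a bill driven by the peak concurrent host count and compares it with a bill proportional to host-hours. The rate of 15 per host-month and the host counts are made-up numbers.

```python
HOURS_IN_MONTH = 720  # 30-day month, for illustration only


def peak_based_bill(hourly_host_counts, rate_per_host_month):
    """Simplified model: pay for the highest concurrent host count all month."""
    return max(hourly_host_counts) * rate_per_host_month


def usage_based_bill(hourly_host_counts, rate_per_host_month):
    """Comparison model: pay in proportion to host-hours actually consumed."""
    host_hours = sum(hourly_host_counts)
    return host_hours / HOURS_IN_MONTH * rate_per_host_month


# Five steady hosts all month, plus a 10-hour burst of 50 extra spot instances.
counts = [5] * 710 + [55] * 10
peak_bill = peak_based_bill(counts, 15.0)
usage_bill = usage_based_bill(counts, 15.0)
print(peak_bill, round(usage_bill, 2))
```

Under these assumptions a ten-hour burst dominates the bill, which is the shape of the problem with per-instance pricing on bursty fleets.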

Incident Management: PagerDuty provides reliable alerting at a reasonable cost.

Software and Tooling Choices

Schema Migrations: The desired database schema is stored in Git; a diffing tool compares it with the live database and generates the SQL needed to bring the two in sync.
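
To make the idea concrete, here is a toy diff-based generator. It is a sketch of the general technique, not the tool the team used: schemas are modelled as nested dictionaries, only added tables and added or dropped columns are handled, and the table and column names are invented.

```python
def diff_schema(desired, live):
    """Return SQL statements that move `live` toward `desired`.

    Both arguments map table name -> {column name: column type}.
    Type changes, indexes, and constraints are out of scope for this sketch.
    """
    statements = []
    for table, columns in desired.items():
        if table not in live:
            cols = ", ".join(f"{name} {ctype}" for name, ctype in columns.items())
            statements.append(f"CREATE TABLE {table} ({cols});")
            continue
        for name, ctype in columns.items():
            if name not in live[table]:
                statements.append(f"ALTER TABLE {table} ADD COLUMN {name} {ctype};")
        for name in live[table]:
            if name not in columns:
                statements.append(f"ALTER TABLE {table} DROP COLUMN {name};")
    return statements


# Hypothetical schemas: `desired` is what Git holds, `live` is the database.
desired = {"users": {"id": "BIGINT", "email": "TEXT"}, "orders": {"id": "BIGINT"}}
live = {"users": {"id": "BIGINT", "legacy_flag": "BOOLEAN"}}
for statement in diff_schema(desired, live):
    print(statement)
```

Real tools also have to order statements safely and flag destructive changes such as the dropped column here, which is where most of their complexity lives.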

Development OS: Ubuntu is used for developer workstations due to broad package support.

Internal Tools UI: AppSmith is self‑hosted to expose internal scripts via a simple web UI.

Helm: Helm v3 is the package manager for Kubernetes manifests. Charts are stored as OCI artifacts in ECR, eliminating the need for a custom S3 plugin.

Bazel: Considered overkill for most Go services; its complexity can hinder team adoption.

Telemetry: OpenTelemetry is adopted from day 1 for metrics and tracing, despite early‑stage maturity concerns.

Dependency Updates: Renovatebot is preferred over Dependabot for its configurability, though it requires careful setup.

Kubernetes: Serves as the core platform for long‑running services, with strong AWS integrations (ALB, Route 53, IAM).

IP Management: Purchasing a dedicated CIDR block simplifies partner whitelisting.
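
The benefit is that a partner needs a single rule rather than a changing list of addresses. A small sketch using Python's standard `ipaddress` module; the block shown is a reserved documentation range standing in for the real one.

```python
import ipaddress

# Placeholder block (TEST-NET-3 documentation range), not the team's real CIDR.
OWNED_BLOCK = ipaddress.ip_network("203.0.113.0/24")


def is_from_owned_block(ip):
    """Return True when the address falls inside the dedicated block."""
    return ipaddress.ip_address(ip) in OWNED_BLOCK


print(is_from_owned_block("203.0.113.42"))
print(is_from_owned_block("198.51.100.7"))
```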

GitOps: Flux (v2) is used for declarative deployments; custom tooling visualizes deployment status. ArgoCD was evaluated but not adopted.

Node Autoscaling: Karpenter provides reliable, cost‑effective scaling for EKS clusters, outperforming the default Cluster Autoscaler and SpotInst.

Secret Management: SealedSecrets introduced friction for developers and broke existing AWS secret‑rotation automation.

DNS Automation: ExternalDNS synchronizes Kubernetes Service/Ingress resources to Route 53 without manual intervention.
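
Conceptually this is a reconciliation loop: compare the hostnames declared in the cluster with the records in the zone, and plan the difference. The sketch below shows only that planning step with plain dictionaries. It is not ExternalDNS code; the hostnames and targets are invented, and the `owned` set is a simplified stand-in for the way such tools track which records they manage.

```python
def plan_changes(desired, current, owned):
    """Compute record changes needed to make `current` match `desired`.

    desired / current map hostname -> target; `owned` is the set of hostnames
    this automation is allowed to modify or delete.
    """
    creates = {h: t for h, t in desired.items() if h not in current}
    updates = {
        h: t for h, t in desired.items()
        if h in current and current[h] != t and h in owned
    }
    deletes = sorted(h for h in current if h not in desired and h in owned)
    return creates, updates, deletes


# Hypothetical state: hostnames from Ingress resources vs. records in the zone.
desired = {"api.example.com": "alb-1", "app.example.com": "alb-2"}
current = {"app.example.com": "alb-old", "old.example.com": "alb-3", "mail.example.com": "mx-host"}
owned = {"app.example.com", "old.example.com"}

creates, updates, deletes = plan_changes(desired, current, owned)
print(creates, updates, deletes)
```

Restricting updates and deletes to owned records is what keeps hand-managed entries, such as the mail record above, untouched.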

SSL Management: cert‑manager automates Let’s Encrypt certificate issuance; paid certificates are used only when required by legacy customers.
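
The renewal timing can be sketched as a simple date calculation. The example assumes renewal is attempted once two thirds of the certificate's lifetime has elapsed, which is how cert-manager's default is commonly described; treat the exact fraction as an assumption and check it against the configuration in use. The 90-day lifetime matches typical Let's Encrypt certificates.

```python
from datetime import datetime, timedelta


def renewal_time(not_before, not_after):
    """Return the moment renewal should start: one third of the lifetime before expiry."""
    lifetime = not_after - not_before
    return not_after - lifetime / 3


def needs_renewal(now, not_before, not_after):
    """True once the renewal point has been reached."""
    return now >= renewal_time(not_before, not_after)


# Illustrative 90-day certificate issued on 1 January 2024.
issued = datetime(2024, 1, 1)
expires = issued + timedelta(days=90)
print(renewal_time(issued, expires))
```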

Node AMI: Standard EKS‑optimized AMIs replaced Bottlerocket after encountering persistent CSI network issues.

Infrastructure as Code: Terraform is preferred over CloudFormation for readability, multi‑cloud extensibility, and a gentler learning curve.

Code‑Based IaC: Pulumi and CDK were avoided to keep infrastructure definitions declarative and reduce complexity.

Service Mesh: Istio/Linkerd were deemed over‑engineered for current needs; a “less is more” approach is followed.

Ingress: The Nginx ingress controller is used as a stable, battle‑tested entry point for EKS ingress traffic.

Script Distribution: Homebrew packages scripts and binaries for Linux and macOS engineers.

Programming Language: Go is the default language for new services because of its fast compile times, low runtime overhead, and suitability for I/O‑bound workloads.

Additional Architectural Reflections

Function‑as‑a‑Service (FaaS): Not fully adopted due to lack of GPU‑compatible options and cost‑comparison pitfalls. Lambda is useful for CPU‑bound jobs and provides fine‑grained cost visibility.

Shared Database: Avoid using a single database for multiple applications without clear ownership; risks include tangled CRUD operations, debugging difficulty, and cross‑team incident impact.

Republication Notice

This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contact admin@besthub.dev and we will review it promptly.
