10 Hard‑Earned Infrastructure Lessons Every Engineer Should Know

Drawing from real incidents like SQLite crashes, missing logs, unthrottled APIs, slow container startups, queue bottlenecks, network partitions, unreliable clocks, and weak alerts, this article shares ten concrete infrastructure lessons with code examples, performance data, and practical recommendations to avoid costly pitfalls.

DevOps Coach

Oct 1, 2025

01 – "It works on my machine" is unreliable

I once deployed a Go application backed by SQLite that ran flawlessly on my laptop but crashed almost immediately in production. SQLite allows only one writer at a time, so under concurrent writes the other writers block or fail with "database is locked" errors.

Lesson: Load‑test with realistic concurrent writes in a pre‑release environment, and back cloud services with a database designed for many concurrent writers (for example PostgreSQL or MySQL).

// BAD: SQLite behind a write-heavy service
sql.Open("sqlite3", "./foo.db") // one writer at a time; concurrent writers can fail with "database is locked"
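
One way to act on this lesson is a small load harness that hits a write path from many goroutines and counts failures before release. The sketch below is storage-agnostic and uses only the standard library. The names runConcurrent, singleWriter, and errBusy are hypothetical, and the toy store only loosely imitates a single-writer lock; it is not SQLite and not any driver's API.

```go
package main

import (
	"errors"
	"fmt"
	"sync"
	"sync/atomic"
)

// errBusy stands in for a "database is locked" style error (illustrative only).
var errBusy = errors.New("store busy")

// singleWriter is a toy store that rejects a write while another is in flight.
type singleWriter struct {
	busy atomic.Bool
	rows atomic.Int64
}

// Write succeeds only if no other write is currently running.
func (s *singleWriter) Write() error {
	if !s.busy.CompareAndSwap(false, true) {
		return errBusy
	}
	defer s.busy.Store(false)
	s.rows.Add(1)
	return nil
}

// runConcurrent calls op from `workers` goroutines, `perWorker` times each,
// and returns how many calls succeeded and how many failed.
func runConcurrent(workers, perWorker int, op func() error) (ok, failed int64) {
	var wg sync.WaitGroup
	var okN, failN atomic.Int64
	for w := 0; w < workers; w++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			for i := 0; i < perWorker; i++ {
				if err := op(); err != nil {
					failN.Add(1)
				} else {
					okN.Add(1)
				}
			}
		}()
	}
	wg.Wait()
	return okN.Load(), failN.Load()
}

func main() {
	store := &singleWriter{}
	ok, failed := runConcurrent(50, 1000, store.Write)
	fmt.Printf("ok=%d failed=%d rows=%d\n", ok, failed, store.rows.Load())
}
```

In a real pre‑release test, op would wrap the actual insert or update against the candidate database, and a non‑zero failure count at laptop-level concurrency would be the signal to look closer.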

02 – Logs without context are useless

Initially I overused fmt.Println(), producing logs like "Processing request", "Failed", "Retrying" that gave no insight into which request or user caused the issue.

Lesson: Adopt structured logging from day one.

log.WithFields(log.Fields{
    "user_id": 1234,
    "request_id": ctx.Value("reqID"),
}).Error("payment_failed")
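
The snippet above uses a logrus-style API. For readers who prefer the standard library, here is a minimal sketch of the same idea with log/slog (Go 1.21 or later), which is a swapped-in logger rather than the one used in the original incident. The context key type, the reqIDKey constant, and the helper names are assumptions made for illustration.

```go
package main

import (
	"bytes"
	"context"
	"encoding/json"
	"fmt"
	"log/slog"
	"os"
)

// ctxKey is a private key type so context values cannot collide across packages.
type ctxKey string

// reqIDKey is a hypothetical key under which middleware would store the request ID.
const reqIDKey ctxKey = "reqID"

// logPaymentFailure emits one structured record carrying the user and request IDs.
func logPaymentFailure(ctx context.Context, logger *slog.Logger, userID int) {
	reqID, _ := ctx.Value(reqIDKey).(string) // empty string if no ID was set
	logger.ErrorContext(ctx, "payment_failed",
		slog.Int("user_id", userID),
		slog.String("request_id", reqID),
	)
}

// demoLogLine logs one failure into a buffer and returns the decoded JSON fields.
func demoLogLine() (map[string]any, error) {
	var buf bytes.Buffer
	logger := slog.New(slog.NewJSONHandler(&buf, nil))
	ctx := context.WithValue(context.Background(), reqIDKey, "req-42")
	logPaymentFailure(ctx, logger, 1234)

	fields := map[string]any{}
	err := json.Unmarshal(buf.Bytes(), &fields)
	return fields, err
}

func main() {
	// In a service, the logger would write JSON lines to stdout for the log shipper.
	logger := slog.New(slog.NewJSONHandler(os.Stdout, nil))
	ctx := context.WithValue(context.Background(), reqIDKey, "req-42")
	logPaymentFailure(ctx, logger, 1234)

	fields, err := demoLogLine()
	fmt.Println(fields["request_id"], fields["user_id"], err)
}
```

Because every record is a JSON object, the log backend can filter on request_id or user_id directly instead of grepping free text.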

03 – Rate limiting is like a seatbelt

We launched an open API without any throttling; a client mistake generated 300,000 requests per minute, crashing Elasticsearch, MongoDB, and the entire system.

Lesson: Implement rate limiting at the gateway layer.

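# In the http {} context: key clients by IP, 10 MB of shared state, 1 request/second each.
limit_req_zone $binary_remote_addr zone=api_limit:10m rate=1r/s;
server {
    location /api/ {
        # Queue up to 5 excess requests; reject anything beyond that (503 by default).
        limit_req zone=api_limit burst=5;
    }
}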
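
When there is no gateway in front of a service, the same "steady rate plus small burst" idea can be approximated in application code. The sketch below is a plain token bucket, which is related to but not the same as the leaky-bucket algorithm nginx uses for limit_req. The type and function names are hypothetical, and the clock is passed in so the behaviour is easy to reason about.

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

// tokenBucket admits `rate` requests per second on average, with bursts up to `capacity`.
type tokenBucket struct {
	mu       sync.Mutex
	rate     float64 // tokens added per second
	capacity float64 // maximum tokens that can accumulate
	tokens   float64
	last     time.Time
}

// newTokenBucket starts with a full bucket at time `now`.
func newTokenBucket(rate, capacity float64, now time.Time) *tokenBucket {
	return &tokenBucket{rate: rate, capacity: capacity, tokens: capacity, last: now}
}

// allow refills the bucket for the time elapsed since the last call,
// then spends one token if one is available.
func (b *tokenBucket) allow(now time.Time) bool {
	b.mu.Lock()
	defer b.mu.Unlock()
	if elapsed := now.Sub(b.last).Seconds(); elapsed > 0 {
		b.tokens += elapsed * b.rate
		if b.tokens > b.capacity {
			b.tokens = b.capacity
		}
		b.last = now
	}
	if b.tokens >= 1 {
		b.tokens--
		return true
	}
	return false
}

// countAllowed sends n requests at the same instant and counts how many pass.
func countAllowed(b *tokenBucket, n int, at time.Time) int {
	passed := 0
	for i := 0; i < n; i++ {
		if b.allow(at) {
			passed++
		}
	}
	return passed
}

// demoBurst sends 10 requests at t=0 and 10 more at t=2s against a 1 req/s, burst-5 bucket.
func demoBurst() (first, later int) {
	start := time.Unix(0, 0)
	b := newTokenBucket(1, 5, start)
	first = countAllowed(b, 10, start)
	later = countAllowed(b, 10, start.Add(2*time.Second))
	return first, later
}

func main() {
	first, later := demoBurst()
	fmt.Println(first, later) // prints "5 2"
}
```

A per-client limiter would keep one bucket per key (IP address, API token) and evict idle entries. A gateway-level limit is still preferable where available, because rejected traffic never reaches the application.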

04 – Health checks ≠ simple port ping

Our health endpoint only pinged localhost:8080, so even when the database was down for ten minutes it still returned "200 OK".

Lesson: Health checks must verify critical dependencies.

func HealthHandler(w http.ResponseWriter, r *http.Request) {
    if err := db.Ping(); err != nil {
        http.Error(w, "DB unreachable", http.StatusServiceUnavailable)
        return
    }
    w.Write([]byte("ok"))
}
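
Most services depend on more than one backend, so a variant that checks several dependencies under a shared timeout can be useful. This is a sketch under assumptions: the checker interface, the fakeDep type, and the 2-second budget are illustrative choices, not the original service's code. A *sql.DB satisfies the interface through its PingContext method.

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"net/http"
	"net/http/httptest"
	"time"
)

// checker is a hypothetical interface for anything that can report reachability.
type checker interface {
	PingContext(ctx context.Context) error
}

// fakeDep simulates a dependency that is either up or down (demo only).
type fakeDep struct{ up bool }

func (f fakeDep) PingContext(ctx context.Context) error {
	if !f.up {
		return errors.New("unreachable")
	}
	return nil
}

// healthHandler returns 503 as soon as any dependency fails its ping,
// and 200 only when all of them answer within the timeout.
func healthHandler(deps map[string]checker) http.HandlerFunc {
	return func(w http.ResponseWriter, r *http.Request) {
		ctx, cancel := context.WithTimeout(r.Context(), 2*time.Second)
		defer cancel()
		for name, dep := range deps {
			if err := dep.PingContext(ctx); err != nil {
				http.Error(w, name+" unreachable", http.StatusServiceUnavailable)
				return
			}
		}
		fmt.Fprintln(w, "ok")
	}
}

// probe calls the handler in-process and returns the HTTP status it produced.
func probe(dbUp, cacheUp bool) int {
	h := healthHandler(map[string]checker{
		"db":    fakeDep{up: dbUp},
		"cache": fakeDep{up: cacheUp},
	})
	rec := httptest.NewRecorder()
	h(rec, httptest.NewRequest(http.MethodGet, "/healthz", nil))
	return rec.Code
}

func main() {
	fmt.Println(probe(true, true), probe(false, true)) // prints "200 503"
}
```

One design caution: if a shared dependency fails, every instance reports unhealthy at once. It is worth deciding deliberately whether the orchestrator should restart instances on this signal or only stop routing traffic to them.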

05 – Slow startup defeats auto‑scaling

Containers took up to 80 seconds to start because of cold‑starting dependencies, causing a surge of errors during traffic spikes.

Lesson: Keep container startup time under 15 seconds and pre‑warm caches when possible.

| Step          | Time |
|---------------|------|
| DB connection | 12s  |
| Redis warm‑up | 18s  |
| Image sync    | 32s  |
| Total boot    | 62s  |
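
The table's steps add up to 62 seconds when they run one after another. If the steps are independent of each other (an assumption; the breakdown above does not say), running them concurrently bounds boot time by the slowest step instead of the sum. The sketch below shows both the arithmetic and a small concurrent runner. The step type and function names are hypothetical, and sleeps stand in for real work.

```go
package main

import (
	"errors"
	"fmt"
	"sync"
	"time"
)

// errDemo is a placeholder failure used to exercise the error path.
var errDemo = errors.New("demo failure")

// step is one independent piece of boot work (names are illustrative).
type step struct {
	name string
	run  func() error
}

// bootEstimate returns boot time in seconds when steps run sequentially (sum)
// and when they all run concurrently (slowest step).
func bootEstimate(seconds []int) (sequential, parallel int) {
	for _, s := range seconds {
		sequential += s
		if s > parallel {
			parallel = s
		}
	}
	return sequential, parallel
}

// runParallel runs every step in its own goroutine, waits for all of them,
// and returns the joined errors (nil if every step succeeded).
func runParallel(steps []step) error {
	var (
		wg   sync.WaitGroup
		mu   sync.Mutex
		errs []error
	)
	for _, s := range steps {
		wg.Add(1)
		go func(s step) {
			defer wg.Done()
			if err := s.run(); err != nil {
				mu.Lock()
				errs = append(errs, fmt.Errorf("%s: %w", s.name, err))
				mu.Unlock()
			}
		}(s)
	}
	wg.Wait()
	return errors.Join(errs...)
}

func main() {
	seq, par := bootEstimate([]int{12, 18, 32}) // figures from the table above
	fmt.Printf("sequential=%ds parallel=%ds\n", seq, par)

	// Sleeps scaled down to milliseconds stand in for the real steps.
	sleep := func(ms int) func() error {
		return func() error {
			time.Sleep(time.Duration(ms) * time.Millisecond)
			return nil
		}
	}
	start := time.Now()
	err := runParallel([]step{
		{"db", sleep(12)}, {"redis", sleep(18)}, {"images", sleep(32)},
	})
	fmt.Println("err:", err, "took:", time.Since(start).Round(time.Millisecond))
}
```

Even in the best case this leaves boot time at the 32-second image sync, which is still above the 15-second target, so the slowest step would also need attention (for example by baking the images into the container at build time).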

06 – Queues can become bottlenecks

Our RabbitMQ‑based async architecture stalled when a consumer crashed, accumulating over two million messages and overloading retry queues.

Lesson: Configure TTL and dead‑letter queues early.

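# Declare the dead-letter exchange (DLX)
rabbitmqadmin declare exchange name=dlx type=direct
# Queue that dead-letters into the DLX; the 60000 ms message TTL is an illustrative value
rabbitmqadmin declare queue name=my_queue arguments='{"x-dead-letter-exchange":"dlx","x-message-ttl":60000}'
# A queue must also be bound to dlx, otherwise dead-lettered messages are dropped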
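
The reason dead-lettering prevents a pile-up is easier to see in a toy model: a message that keeps failing is retried a bounded number of times and then moved aside, so it cannot circulate forever. The sketch below is an in-memory illustration of that idea only. It does not use a RabbitMQ client, and the message type, drain function, and the limit of three attempts are assumptions.

```go
package main

import (
	"errors"
	"fmt"
)

// message carries a payload and how many times delivery has been attempted.
type message struct {
	body     string
	attempts int
}

// drain processes the queue until it is empty. A failed message is requeued
// at the back until it has been attempted maxAttempts times, after which it
// is moved to the dead-letter list instead.
func drain(queue []message, maxAttempts int, handle func(string) error) (done, dead []message) {
	for len(queue) > 0 {
		m := queue[0]
		queue = queue[1:]
		m.attempts++
		switch err := handle(m.body); {
		case err == nil:
			done = append(done, m)
		case m.attempts >= maxAttempts:
			dead = append(dead, m) // give up: park it for inspection
		default:
			queue = append(queue, m) // retry later
		}
	}
	return done, dead
}

// demoDrain runs three messages, one of which can never be processed.
func demoDrain() (doneCount, deadCount, poisonAttempts int) {
	handle := func(body string) error {
		if body == "poison" {
			return errors.New("cannot parse payload")
		}
		return nil
	}
	done, dead := drain([]message{{body: "a"}, {body: "poison"}, {body: "b"}}, 3, handle)
	if len(dead) > 0 {
		poisonAttempts = dead[0].attempts
	}
	return len(done), len(dead), poisonAttempts
}

func main() {
	fmt.Println(demoDrain()) // prints "2 1 3"
}
```

Without the attempt limit, the loop above would never terminate for the poison message, which is the in-memory analogue of a retry queue that only grows.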

07 – Network partitions are real

In AWS, an unstable availability zone split our cache cluster, causing some nodes to see stale data while others timed out.

Lesson: Use a cache setup with quorum‑based failover (e.g., Redis Sentinel or a Raft‑based system), so that failover decisions require agreement from a majority of nodes.

+--------+        +--------+        +--------+
| Redis1 | <----> | Redis2 | <----> | Redis3 |
+--------+        +--------+        +--------+
    ^                 |                 ^
   read             quorum          read/write
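
The arithmetic behind quorum is small enough to write down. The sketch below shows the strict-majority rule used by Raft-style systems; the function names are hypothetical, and real systems such as Redis Sentinel add their own configuration on top of this basic rule.

```go
package main

import "fmt"

// hasQuorum reports whether `reachable` voters form a strict majority of `total`.
func hasQuorum(reachable, total int) bool {
	return total > 0 && reachable > total/2
}

// toleratedFailures returns how many voters can be lost while a majority remains.
func toleratedFailures(total int) int {
	if total <= 0 {
		return 0
	}
	return (total - 1) / 2
}

func main() {
	// A 3-node cluster split 2/1: only the side with two nodes may act.
	fmt.Println(hasQuorum(2, 3), hasQuorum(1, 3)) // prints "true false"

	// Adding a fourth node does not buy extra fault tolerance.
	fmt.Println(toleratedFailures(3), toleratedFailures(4), toleratedFailures(5)) // prints "1 1 2"
}
```

This is why such clusters are usually sized with an odd number of voters, and why a two-way even split leaves neither side able to make progress.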

08 – System clocks are unreliable

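We relied on wall‑clock readings from time.Now() for request timeouts, but a leap‑second adjustment made one node’s clock jump backward, which corrupted the elapsed‑time calculations and led to severe issues.

Lesson: Use a monotonic clock for interval calculations. In Go 1.9 and later, time.Now() records a monotonic reading alongside the wall clock, and time.Since and Sub use it when both values carry one. The reading is lost when a value is serialized, converted with Unix(), or passed through Round(0), so durations computed from such values are exposed to clock jumps again.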

start := time.Now()
// Some operation
elapsed := time.Since(start) // Uses monotonic time internally
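
To see when a Go time value still carries its monotonic reading, one can rely on a documented detail of the time package: String() appends an "m=±..." field only when the reading is present. The sketch below uses that for inspection. The helper names are hypothetical, and checking the string form is a debugging aid rather than something to do in production code.

```go
package main

import (
	"fmt"
	"strings"
	"time"
)

// hasMonotonic reports whether t still carries a monotonic clock reading,
// based on the "m=" field that Time.String adds in that case.
func hasMonotonic(t time.Time) bool {
	return strings.Contains(t.String(), " m=")
}

// demoClock returns whether a fresh time.Now() value has a monotonic reading,
// whether it survives Round(0), and whether the measured interval is non-negative.
func demoClock() (fresh, stripped, nonNegative bool) {
	start := time.Now()
	wallOnly := start.Round(0) // Round(0) is the documented way to strip the reading
	elapsed := time.Since(start)
	return hasMonotonic(start), hasMonotonic(wallOnly), elapsed >= 0
}

func main() {
	fresh, stripped, nonNegative := demoClock()
	fmt.Println(fresh, stripped, nonNegative) // prints "true false true"
}
```

In practice this means deadlines are safest when they stay in-process as time.Time values or are handed to context.WithTimeout, rather than being rebuilt from stored Unix timestamps.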

09 – Dashboards aren’t enough; alerts are essential

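Our five dashboards raised no warning when disk usage hit 99.9%, and we only discovered the problem after HTTP 500 errors started to appear.

Lesson: A dashboard only helps if someone is looking at it. Alert on user‑visible symptoms and on resources that are about to run out, not just on static metrics.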

# Prometheus alert
- alert: DiskAlmostFull
  expr: (node_filesystem_avail_bytes / node_filesystem_size_bytes) < 0.05
  for: 2m
  labels:
    severity: critical
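
The for: 2m clause is what keeps a brief dip from paging anyone: the condition has to stay true across consecutive evaluations for the whole window. The sketch below is a simplified model of that behaviour, not Prometheus's actual evaluation code. The sample type, the firesAt function, and the 30-second evaluation interval are assumptions.

```go
package main

import "fmt"

// sample is one evaluation of the available-space ratio at a given second.
type sample struct {
	atSec      int
	availRatio float64
}

// firesAt returns the timestamp (in seconds) at which the alert would fire:
// the first sample where the ratio has stayed below threshold for at least
// forSec seconds without interruption. It returns -1 if that never happens.
func firesAt(samples []sample, threshold float64, forSec int) int {
	pending := false
	since := 0
	for _, s := range samples {
		if s.availRatio >= threshold {
			pending = false // condition cleared: reset the window
			continue
		}
		if !pending {
			pending = true
			since = s.atSec
		}
		if s.atSec-since >= forSec {
			return s.atSec
		}
	}
	return -1
}

// demoSamples dips below 5% at t=30, recovers at t=60, then stays low from t=90.
func demoSamples() []sample {
	return []sample{
		{0, 0.10}, {30, 0.04}, {60, 0.06}, {90, 0.04},
		{120, 0.03}, {150, 0.04}, {180, 0.02}, {210, 0.01},
	}
}

func main() {
	fmt.Println(firesAt(demoSamples(), 0.05, 120)) // prints "210"
}
```

The short dip at t=30 never fires because the window resets at t=60. The trade-off is that a longer window means fewer false pages but a later alert, which matters when a disk is filling quickly.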

10 – Simpler is better in the long run

We once built a flashy event‑driven onboarding pipeline with Kafka, gRPC, and microservices, but only two users submitted the form daily. Re‑implementing with Go and Redis cut infrastructure costs by 80% and reduced bugs by 90%.

Lesson: Design systems for actual usage scale, not imagined complexity.

# Before
[Form API] --> [Kafka] --> [Event Processor] --> [gRPC User Service]

# After
[Form API] --> [Redis Queue] --> [Worker]

Conclusion – Embrace imperfect systems

The lessons above were learned the hard way: midnight emergency calls, hours of obscure debugging, and shocking cloud bills. Infrastructure isn’t about being flawless; it’s about graceful degradation, rapid recovery, and keeping people in the feedback loop. If these insights help you avoid a mistake, you’re already ahead.
