10 Hard‑Earned Infrastructure Lessons Every Engineer Should Know
Drawing from real incidents like SQLite crashes, missing logs, unthrottled APIs, slow container startups, queue bottlenecks, network partitions, unreliable clocks, and weak alerts, this article shares ten concrete infrastructure lessons with code examples, performance data, and practical recommendations to avoid costly pitfalls.
01 – "It works on my machine" is unreliable
I once deployed a Go application using SQLite that ran flawlessly on my laptop, but it crashed immediately in production because SQLite locks the entire database file under concurrent writes.
Lesson: Simulate real concurrent workloads in a pre‑release environment, and back cloud services with a database designed for many concurrent writers (for example PostgreSQL or MySQL) rather than a single‑file embedded store.
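A pre‑release load test does not have to be elaborate. The sketch below uses a hypothetical `singleWriterStore` as a stand‑in for a file‑locked database (it is not SQLite, and none of these names come from the original service). It fires concurrent writers at the store and counts failures. With one writer nothing fails; with several, lock errors usually appear, which is the gap a laptop test hides.

```go
package main

import (
	"errors"
	"fmt"
	"sync"
	"sync/atomic"
	"time"
)

// errLocked mimics the kind of error a single-writer store returns when busy.
var errLocked = errors.New("store is locked")

// singleWriterStore is a hypothetical stand-in for a file-locked database:
// only one write may be in flight, and a concurrent write fails fast.
type singleWriterStore struct {
	mu sync.Mutex
}

// Write holds the lock for the given duration to simulate a slow write.
func (s *singleWriterStore) Write(hold time.Duration) error {
	if !s.mu.TryLock() { // TryLock requires Go 1.18 or later
		return errLocked
	}
	defer s.mu.Unlock()
	time.Sleep(hold)
	return nil
}

// runLoad starts `writers` goroutines that each attempt `perWriter` writes
// and reports how many writes succeeded and how many failed.
func runLoad(write func() error, writers, perWriter int) (ok, failed int64) {
	var wg sync.WaitGroup
	for i := 0; i < writers; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			for j := 0; j < perWriter; j++ {
				if err := write(); err != nil {
					atomic.AddInt64(&failed, 1)
				} else {
					atomic.AddInt64(&ok, 1)
				}
			}
		}()
	}
	wg.Wait()
	return ok, failed
}

func main() {
	store := &singleWriterStore{}
	write := func() error { return store.Write(time.Millisecond) }

	ok, failed := runLoad(write, 1, 20)
	fmt.Printf("1 writer:  ok=%d failed=%d\n", ok, failed)

	ok, failed = runLoad(write, 8, 20)
	fmt.Printf("8 writers: ok=%d failed=%d\n", ok, failed)
}
```

The exact failure count with eight writers varies from run to run; the point is that it is no longer reliably zero.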
// BAD: SQLite under load
sql.Open("sqlite3", "./foo.db") // Not safe under high concurrency
02 – Logs without context are useless
Initially I overused fmt.Println(), producing logs like "Processing request", "Failed", "Retrying" that gave no insight into which request or user caused the issue.
Lesson: Adopt structured logging from day one.
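For readers who prefer the standard library, here is a minimal sketch of the same idea using `log/slog` (Go 1.21 or later). The `render` and `fields` helpers, the `ctxKey` type, and the `req-42` value are illustrative assumptions, not part of the original service. A typed context key is used instead of a bare string so the request ID cannot collide with values set by other packages.

```go
package main

import (
	"bytes"
	"context"
	"encoding/json"
	"fmt"
	"log/slog"
)

// ctxKey is an unexported key type for storing the request ID in a context.
type ctxKey struct{}

func withRequestID(ctx context.Context, id string) context.Context {
	return context.WithValue(ctx, ctxKey{}, id)
}

// requestID returns the stored request ID, or "" if none was set.
func requestID(ctx context.Context) string {
	id, _ := ctx.Value(ctxKey{}).(string)
	return id
}

// render logs one payment failure as JSON and returns the emitted line.
func render(userID int, reqID string) string {
	var buf bytes.Buffer
	logger := slog.New(slog.NewJSONHandler(&buf, nil))
	ctx := withRequestID(context.Background(), reqID)
	logger.ErrorContext(ctx, "payment_failed",
		slog.Int("user_id", userID),
		slog.String("request_id", requestID(ctx)),
	)
	return buf.String()
}

// fields parses a JSON log line back into a map, as a log pipeline would.
func fields(line string) (map[string]any, error) {
	var m map[string]any
	err := json.Unmarshal([]byte(line), &m)
	return m, err
}

func main() {
	line := render(1234, "req-42")
	fmt.Print(line)

	m, err := fields(line)
	fmt.Println(m["request_id"], err)
}
```

Because every line is a JSON object, a query such as "all errors for request req-42" becomes a field match instead of a text search.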
log.WithFields(log.Fields{
	"user_id":    1234,
	"request_id": ctx.Value("reqID"),
}).Error("payment_failed")
03 – Rate limiting is like a seatbelt
We launched an open API without any throttling; a client mistake generated 300,000 requests per minute, crashing Elasticsearch, MongoDB, and the entire system.
Lesson: Implement rate limiting at the gateway layer.
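To make the "rate plus burst" idea concrete, here is a small token‑bucket sketch. It is a different algorithm from the leaky bucket that nginx's `limit_req` uses, and it is an in‑process illustration, not a replacement for gateway‑level limiting. The type and function names are assumptions made for this example. Time is passed in as seconds so the behaviour is easy to trace.

```go
package main

import "fmt"

// tokenBucket is a minimal rate limiter sketch. It is not safe for
// concurrent use; a real service would guard it with a mutex and keep
// one bucket per client.
type tokenBucket struct {
	rate     float64 // tokens added per second
	capacity float64 // maximum burst size
	tokens   float64 // tokens currently available
	last     float64 // time of the last refill, in seconds
}

// newTokenBucket starts with a full bucket at time 0.
func newTokenBucket(rate, capacity float64) *tokenBucket {
	return &tokenBucket{rate: rate, capacity: capacity, tokens: capacity}
}

// allow refills the bucket for the time elapsed since the previous call and
// reports whether a request arriving at time now (in seconds) may proceed.
func (b *tokenBucket) allow(now float64) bool {
	if elapsed := now - b.last; elapsed > 0 {
		b.tokens += elapsed * b.rate
		if b.tokens > b.capacity {
			b.tokens = b.capacity
		}
		b.last = now
	}
	if b.tokens < 1 {
		return false
	}
	b.tokens--
	return true
}

func main() {
	b := newTokenBucket(1, 5) // 1 request/second, bursts of up to 5
	for _, t := range []float64{0, 0, 0, 0, 0, 0, 0.5, 1.0, 1.0} {
		fmt.Printf("t=%.1fs allowed=%v\n", t, b.allow(t))
	}
}
```

The first five requests at t=0 pass as a burst, the sixth is rejected, and one more is admitted once a full second has refilled a token.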
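# In the http block: 10 MB of shared state keyed by client IP, 1 request/second per IP
limit_req_zone $binary_remote_addr zone=api_limit:10m rate=1r/s;
server {
    location /api/ {
        # Allow short bursts of up to 5 queued requests above the rate
        limit_req zone=api_limit burst=5;
    }
}
04 – Health checks ≠ simple port ping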
Our health endpoint only pinged localhost:8080, so even when the database was down for ten minutes it still returned "200 OK".
Lesson: Health checks must verify critical dependencies.
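One refinement worth considering: a dependency that hangs can hang the health check too. The sketch below bounds the check with a timeout using `PingContext`, which `*sql.DB` provides. The `pinger` interface, the `fakeDB` test double, the `probe` helper, the `/healthz` path, and the 50 ms timeout are all assumptions made for illustration.

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"net/http"
	"net/http/httptest"
	"time"
)

// pinger is the one method the handler needs; *sql.DB satisfies it.
type pinger interface {
	PingContext(ctx context.Context) error
}

// healthHandler reports 503 when the dependency fails or does not answer
// within timeout, so a hung database cannot hang the health check itself.
func healthHandler(db pinger, timeout time.Duration) http.HandlerFunc {
	return func(w http.ResponseWriter, r *http.Request) {
		ctx, cancel := context.WithTimeout(r.Context(), timeout)
		defer cancel()
		if err := db.PingContext(ctx); err != nil {
			http.Error(w, "DB unreachable", http.StatusServiceUnavailable)
			return
		}
		w.Write([]byte("ok"))
	}
}

// fakeDB is a hypothetical test double: it can return an error, or hang
// until the context is cancelled.
type fakeDB struct {
	err  error
	hang bool
}

func (f fakeDB) PingContext(ctx context.Context) error {
	if f.hang {
		<-ctx.Done()
		return ctx.Err()
	}
	return f.err
}

var errDown = errors.New("connection refused")

// probe calls the handler once and returns the HTTP status code.
func probe(db pinger) int {
	rec := httptest.NewRecorder()
	req := httptest.NewRequest(http.MethodGet, "/healthz", nil)
	healthHandler(db, 50*time.Millisecond)(rec, req)
	return rec.Code
}

func main() {
	fmt.Println(probe(fakeDB{}))             // 200
	fmt.Println(probe(fakeDB{err: errDown})) // 503
	fmt.Println(probe(fakeDB{hang: true}))   // 503
}
```

Taking an interface rather than `*sql.DB` keeps the handler testable without a real database.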
func HealthHandler(w http.ResponseWriter, r *http.Request) {
	if err := db.Ping(); err != nil {
		http.Error(w, "DB unreachable", http.StatusServiceUnavailable)
		return
	}
	w.Write([]byte("ok"))
}
05 – Slow startup defeats auto‑scaling
Containers took up to 80 seconds to start because of cold‑starting dependencies, causing a surge of errors during traffic spikes.
Lesson: Keep container startup time under 15 seconds and pre‑warm caches when possible.
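One way to approach that target is to stop running boot steps one after another. The sketch below assumes the three steps measured in this section's table are independent of each other, which the original does not state. If they are, starting them together makes the slowest step decide the boot time instead of the sum. The `step` type and function names are hypothetical, and the "run" is simulated with sleeps scaled down to milliseconds.

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

// step is one boot task and how long it takes, in seconds.
type step struct {
	name    string
	seconds int
}

// bootSteps mirrors the measurements in this section's table.
var bootSteps = []step{
	{"DB connection", 12},
	{"Redis warm-up", 18},
	{"Image sync", 32},
}

// sequentialTotal is the boot time when steps run one after another.
func sequentialTotal(steps []step) int {
	total := 0
	for _, s := range steps {
		total += s.seconds
	}
	return total
}

// parallelTotal is the boot time when all steps start together:
// the slowest step decides.
func parallelTotal(steps []step) int {
	longest := 0
	for _, s := range steps {
		if s.seconds > longest {
			longest = s.seconds
		}
	}
	return longest
}

// runParallel starts every step in its own goroutine and waits for all of
// them. scale maps one "second" of the table to a short real duration.
func runParallel(steps []step, scale time.Duration) time.Duration {
	start := time.Now()
	var wg sync.WaitGroup
	for _, s := range steps {
		wg.Add(1)
		go func(s step) {
			defer wg.Done()
			time.Sleep(time.Duration(s.seconds) * scale)
		}(s)
	}
	wg.Wait()
	return time.Since(start)
}

func main() {
	fmt.Println("sequential seconds:", sequentialTotal(bootSteps))
	fmt.Println("parallel seconds:  ", parallelTotal(bootSteps))
	fmt.Println("simulated parallel run:", runParallel(bootSteps, time.Millisecond))
}
```

Even in the best case this only brings 62 seconds down to 32, still above the 15‑second goal, so the image sync itself would also need attention (for example by moving it out of the boot path).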
| Step | Time |
|---------------|------|
| DB connection | 12s |
| Redis warm‑up | 18s |
| Image sync | 32s |
| Total boot | 62s |
06 – Queues can become bottlenecks
Our RabbitMQ‑based async architecture stalled when a consumer crashed, accumulating over two million messages and overloading retry queues.
Lesson: Configure TTL and dead‑letter queues early.
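The policy behind a dead‑letter queue can be modelled in a few lines. The sketch below is an in‑memory illustration only and does not use any RabbitMQ client API; the `message` type, `drain` function, and the `poison` ID are hypothetical. It shows the core idea: a message that keeps failing is set aside after a bounded number of attempts instead of being retried forever and clogging the queue.

```go
package main

import (
	"errors"
	"fmt"
)

// message is a hypothetical queue entry that tracks delivery attempts.
type message struct {
	id       string
	attempts int
}

var errPoison = errors.New("cannot process message")

// drain processes every message, retrying failures up to maxAttempts.
// Messages that still fail are moved to dead instead of being requeued,
// which is the role a dead-letter queue plays.
func drain(queue []message, handle func(message) error, maxAttempts int) (done, dead []message) {
	pending := append([]message(nil), queue...) // copy so the caller's slice is untouched
	for len(pending) > 0 {
		m := pending[0]
		pending = pending[1:]
		m.attempts++
		if err := handle(m); err == nil {
			done = append(done, m)
		} else if m.attempts >= maxAttempts {
			dead = append(dead, m)
		} else {
			pending = append(pending, m) // requeue at the back for another try
		}
	}
	return done, dead
}

func main() {
	handle := func(m message) error {
		if m.id == "poison" {
			return errPoison
		}
		return nil
	}
	done, dead := drain([]message{{id: "a"}, {id: "poison"}, {id: "b"}}, handle, 3)
	fmt.Println("processed:", len(done), "dead-lettered:", len(dead))
}
```

In RabbitMQ the broker does this routing itself: a message is dead‑lettered when it is rejected without requeue, when its TTL expires, or when the queue's length limit is exceeded. The dead‑letter exchange also needs a queue bound to it, otherwise the messages it receives are dropped.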
# Declare DLX
rabbitmqadmin declare exchange name=dlx type=direct
# Queue with DLX
rabbitmqadmin declare queue name=my_queue arguments='{"x-dead-letter-exchange":"dlx"}'
07 – Network partitions are real
In AWS, an unstable availability zone split our cache cluster, causing some nodes to see stale data while others timed out.
Lesson: Use a cache setup with quorum‑based failover (e.g., Redis Sentinel or a Raft‑based system), so that only the side of a partition that holds a majority can elect a new primary.
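The arithmetic behind a majority quorum is simple enough to write down. This is a generic sketch, not the exact rule any particular system implements; Redis Sentinel and Raft‑based stores differ in their details, and `hasQuorum` is a name invented for this example.

```go
package main

import "fmt"

// hasQuorum reports whether a partition that can reach `reachable` of
// `total` voting nodes holds a strict majority.
func hasQuorum(reachable, total int) bool {
	return reachable > total/2
}

func main() {
	// A 3-node cluster split 2/1: only the larger side has a majority.
	fmt.Println(hasQuorum(2, 3)) // true
	fmt.Println(hasQuorum(1, 3)) // false

	// A 4-node cluster split 2/2: neither side has a majority.
	fmt.Println(hasQuorum(2, 4)) // false
}
```

This is also why such clusters are usually sized with an odd number of voters: an even split leaves no side able to make progress.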
+--------+        +--------+        +--------+
| Redis1 | <----> | Redis2 | <----> | Redis3 |
+--------+        +--------+        +--------+
    ^                 |                 ^
   read            quorum          read/write
08 – System clocks are unreliable
We relied on wall‑clock timestamps from time.Now() for request timeouts, but a leap‑second adjustment made one node’s clock jump backward, which threw off the elapsed‑time calculations and led to severe issues.
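Lesson: Use a monotonic clock for interval calculations. In Go (1.9 and later), time.Now() records a monotonic reading alongside the wall‑clock time, and time.Since and Time.Sub use it as long as that reading has not been stripped, for example by serializing the value or calling Round(0).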
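The following sketch makes the monotonic reading visible. It relies on the documented behaviour that `Time.String` appends an `m=±<seconds>` field when a monotonic reading is present, and that `Round(0)` strips it. The helper names `hasMonotonic`, `measure`, and `stripped` are assumptions made for this example.

```go
package main

import (
	"fmt"
	"strings"
	"time"
)

// hasMonotonic reports whether t still carries a monotonic clock reading.
// Time.String appends " m=±<seconds>" in that case.
func hasMonotonic(t time.Time) bool {
	return strings.Contains(t.String(), " m=")
}

// measure returns how long op took. time.Since uses the monotonic reading
// stored in start, so wall-clock adjustments do not affect the result.
func measure(op func()) time.Duration {
	start := time.Now()
	op()
	return time.Since(start)
}

// stripped returns the current time with the monotonic reading removed,
// which is effectively what you have after serializing and parsing a timestamp.
func stripped() time.Time {
	return time.Now().Round(0)
}

func main() {
	fmt.Println("fresh time.Now() has monotonic reading:", hasMonotonic(time.Now()))
	fmt.Println("after Round(0):                        ", hasMonotonic(stripped()))
	fmt.Println("elapsed:", measure(func() { time.Sleep(10 * time.Millisecond) }))
}
```

The practical rule that follows: compute durations from `time.Time` values kept in memory, and treat timestamps that have been stored or sent over the wire as wall‑clock values only.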
start := time.Now() // carries a monotonic reading alongside the wall‑clock time
// Some operation
elapsed := time.Since(start) // uses the monotonic reading, so wall‑clock jumps do not affect it
09 – Dashboards aren’t enough; alerts are essential
Our five dashboards raised no warning when disk usage hit 99.9%, and we only discovered the problem after HTTP 500 errors started appearing.
Lesson: Alerts must monitor symptoms, not just static metrics.
# Prometheus alert
- alert: DiskAlmostFull
  expr: (node_filesystem_avail_bytes / node_filesystem_size_bytes) < 0.05
  for: 2m
  labels:
    severity: critical
10 – Simpler is better in the long run
We once built a flashy event‑driven onboarding pipeline with Kafka, gRPC, and microservices, but only two users submitted the form daily. Re‑implementing with Go and Redis cut infrastructure costs by 80% and reduced bugs by 90%.
Lesson: Design systems for actual usage scale, not imagined complexity.
# Before
[Form API] --> [Kafka] --> [Event Processor] --> [gRPC User Service]
# After
[Form API] --> [Redis Queue] --> [Worker]
Conclusion – Embrace a “wounded” system
The lessons above were learned the hard way: midnight emergency calls, hours of obscure debugging, and shocking cloud bills. Infrastructure isn’t about being flawless; it’s about graceful degradation, rapid recovery, and keeping people in the feedback loop. If these insights help you avoid a mistake, you’re already ahead.