42 Hard‑Earned Lessons for Building Reliable Production Databases

This article translates Mahesh Balakrishnan’s 42‑point guide on building production databases, covering customer focus, project management, design principles, code review practices, strategy, observability, and research, offering concrete advice for engineers and teams creating robust backend systems.

21CTO

Customers (Users)

(1) Keep your customers happy; otherwise the rest of the article is irrelevant.

(2) Have the right number of customers (start with one) and the right customers (who allow you to build key technology); increase this number carefully.

(3) Interact directly with customers. Many internal conflicts can be resolved by saying, "I just talked to the customer, they said…". When building infrastructure, you don’t need to guess requirements; ask them directly.

(4) Recognize that customers may not be able to articulate their true needs; look beyond surface value and spend time understanding their use cases, even reading their code.

Project Management

(5) Have a clear mission statement. Delos’s mission is: "We will become the reliable foundation of FB infra."

(6) Repeatedly assess task difficulty; decision‑makers often lack time, context, or training and may mis‑estimate by orders of magnitude.

(7) Delegate task assignment to ICs (individual contributors), who usually know the codebase and one another's strengths better than managers do, but stay on the critical path for those decisions.

(8) A roadmap is a means, not an end.

(9) If you have a good and/or aligned manager, try to understand, support, and accommodate them; if not, the author admits uncertainty.

(10) Make your project robust to org‑chart changes; ensure manager turnover does not create unfair career outcomes for ICs.

(11) Track how long similar features took in other projects and use that as a baseline for difficulty estimation.

Design

(12) Be conservative on APIs and liberal with implementations.

(13) Introduce new implementations cautiously (gradual, staged rollout).

(14) When designing APIs, code the first implementation, actively plan a second, and hope a third will eventually emerge.

(15) Design APIs with migration to new implementations in mind; custom migrations are costly and unreliable. Each major API should have a single CLI‑driven switch.

(16) Design as a team, implement as individuals; resist the urge to parallelize design.

(17) For storage systems, prioritize consistency and durability over availability early on; consistency is harder to measure and fix.

(18) Maintain multiple API implementations in tests and compare results; the cost is worth the correctness and abstraction protection.

(19) Use late binding: encourage the team to explore the whole design space instead of committing to a specific solution.

(20) After design, any IC should be able to write the code (late binding for implementers).

(21) Have the right amount of abstraction: too little leads to a tangled monolith, too much overwhelms the team.

(22) Avoid using real‑time guarantees or clock comparisons for correctness unless you understand clock error bounds.

(23) Maintain a single source of truth and simple invariants across state types.

(24) Foster a culture where ICs constantly explore radically different designs and keep speculative design discussions alive.

(25) Know your SKU; hardware awareness remains crucial despite the abstraction of cloud computing.
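
Points (15) and (18) can be illustrated together. The sketch below is a minimal, hypothetical example rather than anything taken from Delos: `KVStore`, `DictStore`, `LogStore`, `make_store`, and `differential_test` are all names invented for illustration. It shows one narrow API, two implementations behind it, a single switch that selects between them, and a test that replays the same random operations against every implementation and compares what they return.

```python
import random
from abc import ABC, abstractmethod
from typing import Optional


class KVStore(ABC):
    """Hypothetical narrow API: the only surface callers may depend on."""

    @abstractmethod
    def put(self, key: str, value: str) -> None: ...

    @abstractmethod
    def get(self, key: str) -> Optional[str]: ...

    @abstractmethod
    def delete(self, key: str) -> None: ...


class DictStore(KVStore):
    """First implementation: a plain in-memory dict."""

    def __init__(self) -> None:
        self._data = {}

    def put(self, key, value):
        self._data[key] = value

    def get(self, key):
        return self._data.get(key)

    def delete(self, key):
        self._data.pop(key, None)


class LogStore(KVStore):
    """Second implementation: an append-only log that is replayed on read."""

    def __init__(self) -> None:
        self._log = []  # entries are (op, key, value) tuples

    def put(self, key, value):
        self._log.append(("put", key, value))

    def delete(self, key):
        self._log.append(("del", key, None))

    def get(self, key):
        result = None
        for op, k, v in self._log:
            if k == key:
                result = v if op == "put" else None
        return result


IMPLEMENTATIONS = {"dict": DictStore, "log": LogStore}


def make_store(impl: str) -> KVStore:
    """The single switch: callers pick an implementation by name only."""
    return IMPLEMENTATIONS[impl]()


def differential_test(seed: int, steps: int = 500) -> bool:
    """Apply identical random operations to every implementation.

    Returns False as soon as two implementations disagree on a read.
    """
    rng = random.Random(seed)
    stores = [make_store(name) for name in IMPLEMENTATIONS]
    keys = [f"k{i}" for i in range(8)]
    for _ in range(steps):
        op = rng.choice(["put", "get", "delete"])
        key = rng.choice(keys)
        if op == "put":
            value = str(rng.randint(0, 99))
            for store in stores:
                store.put(key, value)
        elif op == "delete":
            for store in stores:
                store.delete(key)
        else:
            results = [store.get(key) for store in stores]
            if any(r != results[0] for r in results):
                return False
    return True
```

In a real system the value passed to `make_store` would come from a flag or configuration entry, so that moving to a new implementation is a one-line change rather than a custom migration. Because the test only talks to `KVStore`, it also fails loudly if a caller-visible behavior starts depending on one implementation's internals.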

Code Review

(26) In a fast-moving codebase where all code is visible to everyone, APIs can leak implementation details unless reviewers guard the boundaries.

(27) Encourage ICs to critique diffs and create an environment where feedback is welcomed, not resented.

(28) For critical components, consider informal rules like requiring two LGTM approvals or consensus among a subset of ICs.

(29) Do not equate time‑to‑ship with importance; allow critical work to have longer review cycles when needed.

(30) Resist the "ship first, fix later" impulse; allow ICs to discard code that isn’t the right solution.

Strategy

(31) Periodically ask why the team/project exists, what would fill the gap if it disappeared, and how it adds value now and in the future.

(32) Track every major project in your domain; be able to explain their technical designs better than their own ICs and debate scope with their leads.

(33) Avoid competing on raw performance or efficiency; instead compete on fundamental design qualities.

(34) If someone else has a better system for your use case, consider stepping aside.

Observability

(35) Measurement is a means, not an end.

(36) You should be able to detect service problems before customers do.

(37) Place observability above the APIs and outside the implementations, so that implementations can be swapped and their performance compared without rewriting the measurement code or introducing errors into it.

(38) Pay special attention to hard‑to‑measure properties (e.g., consistency) that are often forgotten.

(39) Push critical checks (e.g., consistency) as close to deployment as possible, minimizing reliance on external services.
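
One way to read point (37) is as a wrapper that exposes the same API as the thing it wraps and records metrics at that boundary. The sketch below is a minimal illustration under that assumption; `InMemoryStore` and `ObservedStore` are hypothetical names, and the metrics (call counts, error counts, total latency per operation) are chosen only for the example.

```python
import time
from collections import defaultdict


class InMemoryStore:
    """Hypothetical stand-in for any implementation behind the API."""

    def __init__(self):
        self._data = {}

    def put(self, key, value):
        self._data[key] = value

    def get(self, key):
        return self._data.get(key)


class ObservedStore:
    """Wraps any object exposing put/get and measures it at the API layer.

    The wrapped implementation contains no measurement code, so swapping
    it for another one leaves the metrics logic untouched.
    """

    def __init__(self, inner, clock=time.perf_counter):
        self._inner = inner
        self._clock = clock  # injectable so tests can use a fake clock
        self.calls = defaultdict(int)
        self.errors = defaultdict(int)
        self.total_seconds = defaultdict(float)

    def _timed(self, name, fn, *args):
        start = self._clock()
        try:
            return fn(*args)
        except Exception:
            self.errors[name] += 1
            raise
        finally:
            # Runs on success and on failure, so every call is counted.
            self.calls[name] += 1
            self.total_seconds[name] += self._clock() - start

    def put(self, key, value):
        return self._timed("put", self._inner.put, key, value)

    def get(self, key):
        return self._timed("get", self._inner.get, key)

    def mean_latency(self, name):
        n = self.calls[name]
        return self.total_seconds[name] / n if n else 0.0


store = ObservedStore(InMemoryStore())
store.put("a", "1")
store.get("a")
```

Because `ObservedStore` depends only on the `put`/`get` surface, two implementations can be compared by wrapping each in its own `ObservedStore` and reading `mean_latency` from both, with identical measurement code on each side.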

Research

(40) Track research results in your field; maintain a quick reference to accelerate communication (e.g., "What if we tried X from project Y?").

(41) Experiment with new ideas; favor novelty within feasible solutions and resist copying designs verbatim.

(42) Write papers for audiences without background; this forces you to clarify assumptions, aids hiring, and makes your work more understandable.

Original source: https://maheshba.bitbucket.io/blog/2021/10/19/42Things.html