Common System Design Pitfalls and How to Avoid Them

The article shares real‑world examples of hidden risks in system design (unbounded buffers, corrupted concurrent maps, hash collisions, email floods, single points of failure, full disks, and cache‑miss overloads) and explains why early, thorough design thinking can prevent costly failures.

ITPUB

Importance of Early System Design

Designing a system comprehensively from the outset is difficult, so many teams split design into stages and rely on iterative refinement. Iteration is valuable, but in practice the later refinement and optimization passes are often skipped, so problems left open by the first design stay hidden until they surface in production. Clear early design reduces technical debt, improves resilience to change, and makes later refactoring safer.

Example 1: Unbounded In‑Memory Buffer for User‑Behavior Persistence

A linked‑list buffer was used to collect user‑behavior events and flush them to the database every ten minutes. The design suffers from three critical issues:

Because the flush is triggered only by a timer, a sudden traffic spike can cause the list to grow without bound.

If the cleanup task throws an exception or is slowed down, the list is never cleared, leading to ever‑increasing memory usage.

The flush operation locks the list; a large list or a slow database write blocks other threads, causing lock contention and possible out‑of‑memory (OOM) crashes.

Mitigation strategies include:

Bounding the buffer size and applying back‑pressure or dropping/segmenting excess events.

Using a reliable queue (e.g., a bounded BlockingQueue) with a separate consumer thread that acknowledges successful writes before discarding entries.

Employing lock‑free data structures or partitioned buffers to reduce contention.
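The first mitigation can be sketched in Java as follows. This is a minimal illustration, not the original system's code: the class name BoundedEventBuffer, its methods, and the choice to drop (and count) events when the buffer is full are all assumptions. Events are plain strings and the database write is passed in as a callback.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.atomic.AtomicLong;
import java.util.function.Consumer;

// Hypothetical bounded buffer for user-behavior events (all names are illustrative).
final class BoundedEventBuffer {
    private final BlockingQueue<String> queue;
    private final AtomicLong dropped = new AtomicLong();

    BoundedEventBuffer(int capacity) {
        // ArrayBlockingQueue has a fixed capacity, so memory use stays bounded.
        this.queue = new ArrayBlockingQueue<>(capacity);
    }

    // Non-blocking enqueue: returns false and counts the drop when the buffer is full.
    boolean record(String event) {
        boolean accepted = queue.offer(event);
        if (!accepted) {
            dropped.incrementAndGet();
        }
        return accepted;
    }

    // Moves at most batchSize events out of the queue, then hands them to the
    // writer. Producers are not blocked while the writer runs.
    int flush(int batchSize, Consumer<List<String>> writer) {
        List<String> batch = new ArrayList<>(batchSize);
        queue.drainTo(batch, batchSize);
        if (!batch.isEmpty()) {
            writer.accept(batch);
        }
        return batch.size();
    }

    long droppedCount() {
        return dropped.get();
    }

    int size() {
        return queue.size();
    }
}
```

A scheduled task would call flush repeatedly with a modest batch size, so a slow database write only delays the consumer. Note that this sketch loses a batch if the writer throws; a version that acknowledges writes before discarding entries would need to retry or re-queue the batch.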

Example 2: Concurrent Access to a Non‑Thread‑Safe HashMap

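Concurrent reads and writes to a standard java.util.HashMap can corrupt its internal bucket structure. In JDK 7 and earlier, two threads resizing the map at the same time can link the entries of a bucket into a cycle, after which any lookup in that bucket loops forever and drives CPU usage to 100 %. JDK 8 changed the resize algorithm, but unsynchronized access can still lose updates or corrupt the map. A corrupted map does not recover on its own; it has to be discarded, which in practice usually means restarting the process.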

Recommended fixes:

Replace HashMap with ConcurrentHashMap or another thread‑safe collection.

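If a plain map is required, protect all accesses with a single lock or use Collections.synchronizedMap. The synchronized wrapper only makes individual calls atomic; iteration and check‑then‑act sequences still have to synchronize on the map explicitly.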

Perform thorough concurrency testing (e.g., with jcstress) to detect hidden race conditions.
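As an illustration of the first fix, the sketch below uses ConcurrentHashMap for a shared counter. The HitCounter class and its method names are hypothetical; the point is that compound updates go through an atomic method such as merge rather than a separate get followed by put.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical hit counter shared by many threads (class and method names are illustrative).
final class HitCounter {
    private final Map<String, Long> hits = new ConcurrentHashMap<>();

    // merge() performs the read-modify-write atomically for the given key,
    // so concurrent callers never lose an increment.
    void record(String page) {
        hits.merge(page, 1L, Long::sum);
    }

    long count(String page) {
        return hits.getOrDefault(page, 0L);
    }
}
```

Writing the same update as get followed by put would be racy even on a ConcurrentHashMap, because the two calls are not atomic together.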

Example 3: Cache Key Collisions When Using MD5 Digest

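Using an MD5 hash of a long string as a cache key reduces key length but introduces a non‑zero probability of collisions. Accidental collisions of a 128‑bit digest are rare, but MD5 collisions can also be constructed deliberately, which matters when keys are derived from untrusted input. When two different strings map to the same digest, the cache returns the value stored for the other string, which is an incorrect hit that is hard to diagnose in a large‑scale system.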

Safe approaches:

Store the original string alongside the cached value and verify equality after a cache hit.

Prefer a stronger, longer hash (e.g., SHA‑256) if key length permits.

Design the cache to tolerate occasional collisions, for example by falling back to a database lookup on mismatch.
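The first two approaches can be combined, as in the following sketch. VerifiedCache, its Entry type, and the injectable digest function are assumptions made for illustration; an in-memory map stands in for the real cache. The original key is stored with the value and compared on every hit, so a digest collision turns into a miss instead of a wrong answer.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.Map;
import java.util.Optional;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Function;

// Hypothetical cache keyed by a digest of a long string (all names are illustrative).
final class VerifiedCache {
    // Each entry keeps the original key so that a hit can be verified.
    private static final class Entry {
        final String originalKey;
        final String value;

        Entry(String originalKey, String value) {
            this.originalKey = originalKey;
            this.value = value;
        }
    }

    private final Map<String, Entry> store = new ConcurrentHashMap<>();
    private final Function<String, String> digest;

    VerifiedCache(Function<String, String> digest) {
        this.digest = digest;
    }

    // Hex-encoded SHA-256 of the input, 64 characters long.
    static String sha256Hex(String input) {
        try {
            MessageDigest md = MessageDigest.getInstance("SHA-256");
            byte[] bytes = md.digest(input.getBytes(StandardCharsets.UTF_8));
            StringBuilder hex = new StringBuilder(bytes.length * 2);
            for (byte b : bytes) {
                hex.append(String.format("%02x", b));
            }
            return hex.toString();
        } catch (NoSuchAlgorithmException e) {
            // Every Java platform implementation is required to provide SHA-256.
            throw new IllegalStateException(e);
        }
    }

    void put(String longKey, String value) {
        store.put(digest.apply(longKey), new Entry(longKey, value));
    }

    // Returns the value only if the stored original key matches; a collision is a miss.
    Optional<String> get(String longKey) {
        Entry e = store.get(digest.apply(longKey));
        if (e == null || !e.originalKey.equals(longKey)) {
            return Optional.empty();
        }
        return Optional.of(e.value);
    }
}
```

Typical construction would be new VerifiedCache(VerifiedCache::sha256Hex). On an empty result the caller falls back to the database, which is the third approach in the list above. In this sketch a colliding put overwrites the earlier entry, so the older key simply misses from then on.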

Example 4: Uncontrolled Email Flood and Time‑Based Data Consistency Assumptions

A batch job processes a multi‑line file; for each erroneous line it sends an email notification. Two problems arise:

When many lines fail, thousands of emails are sent, overwhelming the mail server.

The job assumes that two databases become consistent one hour after an update, using this as a trigger for further processing. If the assumption is false, the job may operate on stale or inconsistent data.

Mitigations:

Introduce rate‑limiting or aggregation of error notifications (e.g., send a single summary email per batch).

Replace time‑based consistency checks with explicit data‑synchronization mechanisms such as change data capture, version stamps, or two‑phase commit.
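For the notification problem, one possible shape of the aggregation is sketched below. ErrorSummary and its methods are hypothetical names, and actually sending the mail is left out; the batch job would collect failures while processing the file and send render() once at the end. The sketch does not address the consistency assumption.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical collector that turns many per-line errors into one summary message.
final class ErrorSummary {
    private final int maxDetails;
    private final List<String> details = new ArrayList<>();
    private int total;

    // maxDetails caps how many individual lines are listed in the message body.
    ErrorSummary(int maxDetails) {
        this.maxDetails = maxDetails;
    }

    void add(int lineNumber, String reason) {
        total++;
        if (details.size() < maxDetails) {
            details.add("line " + lineNumber + ": " + reason);
        }
    }

    boolean hasErrors() {
        return total > 0;
    }

    // Body of the single notification sent after the whole file has been processed.
    String render() {
        StringBuilder sb = new StringBuilder();
        sb.append(total).append(" line(s) failed");
        if (total > details.size()) {
            sb.append(" (showing first ").append(details.size()).append(")");
        }
        sb.append('\n');
        for (String d : details) {
            sb.append(d).append('\n');
        }
        return sb.toString();
    }
}
```

Capping the listed details keeps the message size bounded even when nearly every line of a large file fails.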

Example 5: Single Point of Failure in Load‑Balancing Hardware

A telecom service deployed redundant application servers but put them behind a single hardware load balancer (e.g., an F5 device) with no standby. When that load balancer failed, all traffic stopped despite the server redundancy.

Best practice:

Include a highly available load‑balancing layer with active‑passive or active‑active redundancy.

Monitor health of the load‑balancer and automate failover to a backup device.
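The decision logic behind active-passive failover can be illustrated with a small sketch. Real deployments normally rely on the vendor's clustering feature or a floating IP protocol such as VRRP rather than application code; FailoverSelector, the endpoint names, and the health probe passed in as a predicate are all hypothetical.

```java
import java.util.List;
import java.util.Optional;
import java.util.function.Predicate;

// Hypothetical active-passive selector; endpoints and the health probe are placeholders.
final class FailoverSelector {
    private final List<String> endpointsByPriority;
    private final Predicate<String> isHealthy;

    FailoverSelector(List<String> endpointsByPriority, Predicate<String> isHealthy) {
        this.endpointsByPriority = List.copyOf(endpointsByPriority);
        this.isHealthy = isHealthy;
    }

    // Returns the highest-priority endpoint that currently passes the health probe,
    // or an empty Optional when every endpoint is down.
    Optional<String> active() {
        return endpointsByPriority.stream().filter(isHealthy).findFirst();
    }
}
```

With endpoints listed as primary then standby, active() yields the standby only while the primary fails its probe, and yields nothing when both are down, which is the case that should raise an alert.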

Example 6: Unlimited Log or GC‑Log File Growth Leading to Disk Exhaustion

Unbounded file writes (e.g., JVM GC logs, application logs) can fill the disk during high load or when log‑rotation scripts fail, causing the service to become unresponsive.

Preventive measures:

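Configure size‑based rotation. On JDK 8 and earlier the JVM flags are, for example, -Xloggc:gc.log -XX:+UseGCLogFileRotation -XX:NumberOfGCLogFiles=10 -XX:GCLogFileSize=100M; on JDK 9 and later, unified logging takes the equivalent settings as -Xlog:gc*:file=gc.log:time:filecount=10,filesize=100M.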

Deploy log aggregation tools (e.g., ELK, Fluentd) that ship logs off‑node.

Set up disk‑space alerts and automatic cleanup policies.
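A disk-space alert can be as simple as a periodic check like the one sketched here. DiskSpaceCheck, the threshold values, and what happens when the check fires are assumptions; the sketch only shows how to measure the usable fraction of the partition that holds a given directory.

```java
import java.io.File;

// Hypothetical helper for a periodic disk-space check (names and thresholds are illustrative).
final class DiskSpaceCheck {
    // Fraction (0.0 to 1.0) of the partition containing `dir` that is still usable.
    // Returns 0.0 when the size cannot be determined, so the caller errs on the side of alerting.
    static double usableFraction(File dir) {
        long total = dir.getTotalSpace();
        if (total <= 0) {
            return 0.0;
        }
        return (double) dir.getUsableSpace() / total;
    }

    // True when the usable fraction has fallen below the configured threshold.
    static boolean isLow(double usableFraction, double threshold) {
        return usableFraction < threshold;
    }
}
```

A scheduled task could call DiskSpaceCheck.isLow(DiskSpaceCheck.usableFraction(logDir), 0.10), where logDir is the log directory, and raise an alert or trigger cleanup when it returns true.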

Example 7: Cache Warm‑Up Failure After Power Outage

When an Amazon data‑center experienced a power loss, cache servers restarted with empty caches. The cache‑hit rate dropped to near zero, causing a sudden surge of direct database requests that overwhelmed the database and crashed the service.

Resilience strategies:

Implement a warm‑up routine that pre‑loads critical data into the cache on startup.

Use multi‑layer caching (e.g., local in‑process cache plus distributed cache) to reduce immediate load on the database.

Design the database to handle traffic spikes (e.g., connection pooling, rate limiting) and monitor cache health.
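A warm-up routine might look like the sketch below. WarmableCache is a hypothetical in-process cache, the loader function stands in for the database lookup, and the list of hot keys is assumed to come from somewhere such as recent access statistics.

```java
import java.util.Collection;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Function;

// Hypothetical in-process cache with a startup warm-up step (all names are illustrative).
final class WarmableCache {
    private final Map<String, String> cache = new ConcurrentHashMap<>();
    private final Function<String, String> loader; // stands in for a database lookup

    WarmableCache(Function<String, String> loader) {
        this.loader = loader;
    }

    // Intended to run on startup, before the node starts accepting traffic.
    void warmUp(Collection<String> hotKeys) {
        for (String key : hotKeys) {
            cache.computeIfAbsent(key, loader);
        }
    }

    // computeIfAbsent applies the loader at most once per missing key, so
    // concurrent misses on the same key result in a single database lookup.
    String get(String key) {
        return cache.computeIfAbsent(key, loader);
    }

    int size() {
        return cache.size();
    }
}
```

Loading through computeIfAbsent also limits the stampede after a cold start, since many simultaneous requests for one key trigger one load instead of one load each. The trade-off is that a slow loader blocks other callers waiting on that key, so a production version would likely add timeouts and expiry.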

Takeaway

System‑design experience grows over time. Developers should treat any “intuition” that something feels off as a signal to investigate deeper, question assumptions such as “usually this works”, and apply classic patterns (bounded buffers, explicit synchronization, health checks, graceful degradation). Recognizing and addressing these hidden traps early prevents costly failures in production.
