14 Common System Design Mistakes and Lessons Learned from Eight Years of Service Framework Development

Over eight years of building and evolving a service framework, the author reflects on fourteen critical design mistakes, ranging from intrusive XML configuration and poor technology choices to missing protocol versioning, load-balancing flaws, and inadequate monitoring. The lessons highlight the importance of comprehensive, forward-looking architecture for backend engineers.

Alibaba Cloud Infrastructure

In a follow‑up to the previous article "Architect Portrait," the author reviews eight years of system design experience, focusing on fourteen major mistakes made while developing three foundational technology products and three multi‑year projects, many of which required complete rewrites.

Mistake 1: To keep the service framework non-intrusive, Spring beans were declared in an external XML file. This caused deployment confusion because developers did not know where to place the file. The fix was to replace the XML file with a Spring FactoryBean configuration.

Mistake 2: Selecting JBoss Remoting without understanding its 60‑second default timeout led to thread starvation in front‑end web applications. The framework was later rebuilt on Mina, delaying a stable release by over two months.
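
The general remedy for this class of problem is to put an explicit deadline on every remote call instead of inheriting a transport default. The following is a minimal sketch using only java.util.concurrent; it does not use the JBoss Remoting or Mina APIs, and the class name BoundedCall, the method callWithTimeout, and the fallback parameter are hypothetical names chosen for illustration.

```java
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

// Hypothetical helper: bounds how long a caller waits for a remote call.
final class BoundedCall {
    static <T> T callWithTimeout(ExecutorService pool, Callable<T> remoteCall,
                                 long timeoutMillis, T fallback) {
        Future<T> future = pool.submit(remoteCall);
        try {
            // Wait at most timeoutMillis instead of a transport-level default.
            return future.get(timeoutMillis, TimeUnit.MILLISECONDS);
        } catch (TimeoutException e) {
            future.cancel(true); // interrupt the worker that is still waiting
            return fallback;
        } catch (ExecutionException e) {
            return fallback;     // the call itself failed
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt(); // preserve the interrupt flag
            future.cancel(true);
            return fallback;
        }
    }
}
```

With a deadline of a few hundred milliseconds, a slow downstream service costs the web tier a bounded wait per request rather than a thread parked for a full minute.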

Mistake 3: Omitting a version number from the communication protocol forced a hacky runtime check to tell protocol versions apart. The protocol was later redesigned along the lines of existing standards, a reminder that protocol design requires broad knowledge of existing protocols.
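
A version field costs one byte if it is in the header from day one. The sketch below shows one possible frame header with a magic number, a version byte, and a body length. The layout, the MAGIC value, and the FrameHeader class are assumptions for illustration, not the framework's actual wire format.

```java
import java.nio.ByteBuffer;

// Hypothetical frame header: magic (2 bytes) + version (1 byte) + body length (4 bytes).
final class FrameHeader {
    static final short MAGIC = (short) 0xCAFE; // assumed value, illustration only
    static final int HEADER_LENGTH = 7;

    final byte version;
    final int bodyLength;

    FrameHeader(byte version, int bodyLength) {
        this.version = version;
        this.bodyLength = bodyLength;
    }

    // Serializes the header in big-endian order (the ByteBuffer default).
    byte[] encode() {
        return ByteBuffer.allocate(HEADER_LENGTH)
                .putShort(MAGIC)
                .put(version)
                .putInt(bodyLength)
                .array();
    }

    // Parses a header and rejects frames that do not start with the magic number.
    static FrameHeader decode(byte[] bytes) {
        if (bytes.length < HEADER_LENGTH) {
            throw new IllegalArgumentException("header too short");
        }
        ByteBuffer buf = ByteBuffer.wrap(bytes);
        if (buf.getShort() != MAGIC) {
            throw new IllegalArgumentException("bad magic number");
        }
        byte version = buf.get();
        int bodyLength = buf.getInt();
        return new FrameHeader(version, bodyLength);
    }
}
```

A receiver can then branch on the version field to pick a decoder, rather than guessing the protocol revision from the payload.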

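Mistake 4: Using a single long-lived connection through hardware load balancers caused severe load imbalance after service restarts, because existing connections stayed pinned to the servers that were already up. A temporary fix closed each connection after 10,000 requests so that clients would reconnect and be rebalanced; the final fix removed the load balancer as an intermediate hop.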
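
The temporary fix can be pictured as a per-connection request counter. The sketch below is a guess at the shape of such a policy, not the framework's code; the class RecyclingPolicy and its method name are hypothetical, and one instance is assumed per connection.

```java
import java.util.concurrent.atomic.AtomicLong;

// Hypothetical policy object, one per connection: tells the caller when the
// connection has served its quota and should be closed and re-established.
final class RecyclingPolicy {
    private final long maxRequestsPerConnection;
    private final AtomicLong served = new AtomicLong();

    RecyclingPolicy(long maxRequestsPerConnection) {
        this.maxRequestsPerConnection = maxRequestsPerConnection;
    }

    // Returns true exactly once, for the request that reaches the quota,
    // so that only one thread triggers the reconnect.
    boolean recordRequestAndCheckRecycle() {
        return served.incrementAndGet() == maxRequestsPerConnection;
    }
}
```

A client would construct new RecyclingPolicy(10_000) for each connection, call recordRequestAndCheckRecycle() after every request, and reconnect when it returns true, which gives the load balancer a chance to choose a different backend.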

Mistake 5: With no version visibility in production, the team could not tell which machines ran which framework version and had to resort to a cumbersome network-wide scan. Adding the framework version to the connection handshake solved the problem.
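
Once the handshake carries the version, the server side can answer "who runs what" from memory. The sketch below assumes a hypothetical text handshake of the form "HELLO <frameworkVersion>"; the message format and the VersionRegistry class are illustrative assumptions.

```java
import java.util.Map;
import java.util.TreeMap;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical registry that records the framework version each peer reports.
final class VersionRegistry {
    private final Map<String, String> versionByPeer = new ConcurrentHashMap<>();

    // Parses an assumed handshake line such as "HELLO 1.4.2" and records it.
    void onHandshake(String peerAddress, String handshakeLine) {
        String[] parts = handshakeLine.trim().split("\\s+");
        if (parts.length != 2 || !parts[0].equals("HELLO")) {
            throw new IllegalArgumentException("malformed handshake: " + handshakeLine);
        }
        versionByPeer.put(peerAddress, parts[1]);
    }

    // Returns how many connected peers run each version, sorted by version string.
    Map<String, Integer> countByVersion() {
        Map<String, Integer> counts = new TreeMap<>();
        for (String version : versionByPeer.values()) {
            counts.merge(version, 1, Integer::sum);
        }
        return counts;
    }
}
```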

Mistake 6: An attempt at fully dynamic, zero-downtime deployment consumed half a year of effort from two people before it was abandoned, exposing weak attention to detail and slow decision-making.

Mistake 7: A rule file for Layer 7 (application-level) routing by method initially helped isolate resource-heavy methods but later became hard to maintain, illustrating the need for sustainable design.
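
Method-based routing of this kind boils down to a lookup from method name to a server group. The sketch below is an illustrative guess at the idea, not the original rule-file format; the MethodRouter class, the group names, and the method names are all hypothetical.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical router: maps fully qualified method names to server groups.
final class MethodRouter {
    private final Map<String, String> groupByMethod = new LinkedHashMap<>();
    private final String defaultGroup;

    MethodRouter(String defaultGroup) {
        this.defaultGroup = defaultGroup;
    }

    // Adds one rule; returns this so rules can be chained.
    MethodRouter route(String qualifiedMethod, String group) {
        groupByMethod.put(qualifiedMethod, group);
        return this;
    }

    // Methods without an explicit rule fall back to the default group.
    String groupFor(String qualifiedMethod) {
        return groupByMethod.getOrDefault(qualifiedMethod, defaultGroup);
    }
}
```

Every heavy method needs its own rule, and every rename in the service interface silently invalidates one, which is one plausible reason such a rule set becomes hard to maintain as it grows.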

Mistake 8: Introducing OSGi to isolate framework JARs caused a two‑month setup struggle and steep learning curve for new developers. The author would now prefer a simple class‑loader isolation strategy.
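
A simple class-loader isolation strategy is often implemented as a child-first class loader that looks in the framework's own JARs before delegating to the parent. The sketch below shows that general technique under stated assumptions; it is not the author's implementation, and the class name ChildFirstClassLoader is hypothetical.

```java
import java.net.URL;
import java.net.URLClassLoader;

// Hypothetical child-first loader: classes found in the given URLs win over
// the parent's copies, except for core java.* classes.
final class ChildFirstClassLoader extends URLClassLoader {
    ChildFirstClassLoader(URL[] urls, ClassLoader parent) {
        super(urls, parent);
    }

    @Override
    protected Class<?> loadClass(String name, boolean resolve) throws ClassNotFoundException {
        synchronized (getClassLoadingLock(name)) {
            Class<?> c = findLoadedClass(name);
            if (c == null) {
                if (name.startsWith("java.")) {
                    // Core classes must always come from the parent chain.
                    c = super.loadClass(name, false);
                } else {
                    try {
                        c = findClass(name); // look in our own JARs first
                    } catch (ClassNotFoundException e) {
                        c = super.loadClass(name, false); // then delegate to the parent
                    }
                }
            }
            if (resolve) {
                resolveClass(c);
            }
            return c;
        }
    }
}
```

The framework's dependencies would be passed as URLs, so an application can ship a different version of the same library on its own class path without a conflict. Classes shared across the boundary, such as the framework's public API, still need to come from a common parent.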

Mistake 9: Insufficient tracing across services, databases, and caches made multi‑hop failures hard to diagnose. After revisiting a Dapper‑style tracing system, the team realized the importance of end‑to‑end traceability from the start.
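
Dapper-style tracing rests on a small piece of context that travels with every hop: a trace ID shared by the whole request, plus a span ID and parent span ID per call. The sketch below illustrates that idea only; the TraceContext class and the colon-separated header format are assumptions, not the team's actual system.

```java
import java.util.concurrent.ThreadLocalRandom;

// Hypothetical trace context propagated with each call to a service, database, or cache.
final class TraceContext {
    final String traceId;      // shared by every span in one request
    final String spanId;       // identifies this hop
    final String parentSpanId; // empty for the root span

    private TraceContext(String traceId, String spanId, String parentSpanId) {
        this.traceId = traceId;
        this.spanId = spanId;
        this.parentSpanId = parentSpanId;
    }

    private static String newId() {
        return Long.toHexString(ThreadLocalRandom.current().nextLong());
    }

    // Called at the entry point of a request.
    static TraceContext newRoot() {
        return new TraceContext(newId(), newId(), "");
    }

    // Called before each outgoing call; keeps the trace ID, links to the caller's span.
    TraceContext childSpan() {
        return new TraceContext(traceId, newId(), spanId);
    }

    // Assumed wire format: traceId:spanId:parentSpanId
    String toHeader() {
        return traceId + ":" + spanId + ":" + parentSpanId;
    }

    static TraceContext fromHeader(String header) {
        String[] parts = header.split(":", -1); // -1 keeps a trailing empty parent ID
        if (parts.length != 3) {
            throw new IllegalArgumentException("malformed trace header");
        }
        return new TraceContext(parts[0], parts[1], parts[2]);
    }
}
```

Because every log line and timing record can carry the same trace ID, a failure three hops downstream can be tied back to the originating request.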

Mistake 10: Relying on heartbeat-based registration alone made services callable before they were ready and left no way to shut them down gracefully, a sign that the design had not considered the full service lifecycle.
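
One way to close both gaps is to make readiness and draining explicit states rather than inferring them from a heartbeat. The sketch below is an assumption-laden illustration; ServiceRegistry, ServiceInstance, and their methods are hypothetical names, and the registry is an in-memory stand-in.

```java
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical in-memory registry: only addresses marked ready receive traffic.
final class ServiceRegistry {
    private final Set<String> routable = ConcurrentHashMap.newKeySet();

    void markReady(String address) { routable.add(address); }
    void markDraining(String address) { routable.remove(address); }
    boolean isRoutable(String address) { return routable.contains(address); }
}

// Hypothetical service instance with an explicit lifecycle.
final class ServiceInstance {
    enum State { STARTING, READY, DRAINING, STOPPED }

    private final String address;
    private final ServiceRegistry registry;
    private volatile State state = State.STARTING;

    ServiceInstance(String address, ServiceRegistry registry) {
        this.address = address;
        this.registry = registry;
    }

    State state() { return state; }

    // Registers only after warm-up has finished, so callers never see a half-started service.
    void start(Runnable warmUp) {
        warmUp.run();
        state = State.READY;
        registry.markReady(address);
    }

    // Deregisters first, then lets in-flight requests finish before stopping.
    void shutdown(Runnable drainInFlight) {
        state = State.DRAINING;
        registry.markDraining(address);
        drainInFlight.run();
        state = State.STOPPED;
    }
}
```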

Mistake 11: Replacing Xen with a custom lightweight VM approach, without sufficient knowledge of the technology, led to many operational problems. A later switch to LXC resolved many of them, underscoring the value of broad knowledge in technology selection.

Mistake 12: Using an image‑based disk‑quota mechanism caused permanent space consumption and alarms; after a lengthy search, a more flexible solution was adopted, showing the cost of poor initial technical choices.

Mistake 13: Containers used identical UIDs, so the per-UID limit on thread creation was shared across them, and hitting the limit in one VM blocked thread creation in others. The detail was missed because the design was not scrutinized closely enough.

Mistake 14: Overlooking a critical point late in a large project forced a risky weekend push and delayed release, reinforcing that architects must know who the reliable experts are for each subsystem.

Overall, the article stresses that architects must consider development, operations, and future scalability comprehensively, maintain a broad technical perspective, and embed traceability and flexibility early in system design.
