Beyond Scale: Rethinking Architecture Boundaries for Massive Services
This article reflects on years of designing large‑scale backend systems at Tencent, discussing how to define clear architecture boundaries, ensure high availability, integrate diverse technologies, and use observability and monitoring to continuously evolve and improve massive service architectures.
This article was written in 2018 and originally published in part by Tencent Cloud Developer. It revisits practical insights from projects carried out from 2013 to 2018, focusing on the system architecture methodology applied during the development of the "App Store" and the "Honor of Kings" smart robot.
The "Massive Service" series of courses captured Tencent's experience serving billions of users and shaped a generation of backend developers. As massive‑scale techniques become standard, the article asks what architectural considerations remain to demonstrate refined design.
Starting in 2009, Tencent gathered top experts to create the "Massive Service" curriculum, which was initially limited to senior engineers. The first edition became a cornerstone for backend teams, while the 2.0 version released in 2015 had less impact.
High availability is the core of massive services: a combination of methods, values, and tools that lets a system keep serving very large numbers of users and requests.
To aid recollection, the article lists the three most important distributed theories alongside the course outline.
With the rise of massive‑scale internet services, many products now see daily active users in the tens of millions, making the lessons broadly relevant.
Recent years have seen a flood of technologies (RPC frameworks, Docker, cloud computing, micro-services, Service Mesh, distributed storage, DevOps, NoSQL, big data, and various compute frameworks) that make it possible to support ever larger services. The explosion of the mobile internet accelerated this trend further.
What can we still improve in architecture design now that massive-scale techniques have become standard?
The author believes post‑massive‑scale architecture should focus on system‑level logic, internal and external relationships, and complexity. The following points summarize key considerations:
Massive‑scale architectures involve many layers: Android, iOS, H5 front‑ends; backend access layers, business logic layers, foundation layers; caches, NoSQL and RDBMS databases; message queues, and various third‑party dependencies, leading to complex inter‑module relationships.
Defining module boundaries, ensuring high cohesion and low coupling, and isolating subsystems are the first architectural concerns.
Architecture Boundaries
Key points for defining boundaries include:
- Boundary thinking and awareness, exploring and expanding boundaries
- Responsibility separation and firewall isolation
- Contractual spirit
- High cohesion, low coupling, clear layering
First Example
When developing an app that communicates with the backend via HTTP, the following questions must be clarified:
- How to define the protocol layout (header and body)
- What requests and responses occur in each interaction
- Which serialization format to use (JSON, JCE, Protocol Buffers)
- How to define error codes, including secondary codes for overall and per-request errors
In the project, HTTP POST is used with a JCE‑defined layout. The JCE structures are:
struct ReqHead {
    int cmdId;
    ...
};
struct Request {
    ReqHead head;
    vector<char> body;
};
struct PkgReq {
    PkgReqHead head;
    Request request;
};
The HTTP body consists of three parts:
- Three magic bytes identifying the protocol
- A four-byte version number in network byte order
- The actual payload structure
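The three-part body layout can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the magic value `b"APK"` and version `1` are hypothetical, since the article does not give the real values, and the payload is treated as already-serialized bytes.

```python
import struct
from typing import Tuple

# Hypothetical values: the article does not state the real magic bytes or version.
MAGIC = b"APK"
VERSION = 1

# "!" selects network (big-endian) byte order with no padding:
# 3 magic bytes followed by a 4-byte unsigned version, 7 bytes in total.
_PREFIX = struct.Struct("!3sI")


def encode_body(payload: bytes, version: int = VERSION) -> bytes:
    """Build the HTTP body: magic bytes + version + serialized payload."""
    return _PREFIX.pack(MAGIC, version) + payload


def decode_body(body: bytes) -> Tuple[int, bytes]:
    """Validate the prefix and return (version, payload)."""
    if len(body) < _PREFIX.size:
        raise ValueError("body is shorter than the 7-byte prefix")
    magic, version = _PREFIX.unpack_from(body)
    if magic != MAGIC:
        raise ValueError(f"unexpected magic bytes: {magic!r}")
    return version, body[_PREFIX.size:]
```

Checking the magic bytes first lets the server reject traffic that does not speak the protocol before it spends any effort on deserialization.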
Each request uses ReqHead.cmdId to distinguish commands; the body contains the corresponding JCE request/response structures. All JCE definitions reside in a single Protocol.jce file, serving as the sole contract between client and server.
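Dispatching on cmdId can be sketched as a handler table. The error code values, the `register` decorator, and the `echo` command below are all hypothetical names introduced for illustration; the article only says that commands are distinguished by `ReqHead.cmdId` and that per-request error codes exist.

```python
from dataclasses import dataclass
from typing import Callable, Dict

# Hypothetical per-request error codes; the article does not list the real ones.
OK = 0
ERR_UNKNOWN_CMD = -1
ERR_HANDLER_FAILED = -2


@dataclass
class ReqHead:
    cmd_id: int


@dataclass
class Request:
    head: ReqHead
    body: bytes  # serialized command-specific request structure


@dataclass
class Response:
    cmd_id: int
    ret: int     # per-request error code
    body: bytes  # serialized command-specific response structure


Handler = Callable[[bytes], bytes]
_HANDLERS: Dict[int, Handler] = {}


def register(cmd_id: int) -> Callable[[Handler], Handler]:
    """Associate a handler function with one command id."""
    def decorator(fn: Handler) -> Handler:
        _HANDLERS[cmd_id] = fn
        return fn
    return decorator


def dispatch(req: Request) -> Response:
    """Route a request to its handler and wrap the outcome in a Response."""
    handler = _HANDLERS.get(req.head.cmd_id)
    if handler is None:
        return Response(req.head.cmd_id, ERR_UNKNOWN_CMD, b"")
    try:
        return Response(req.head.cmd_id, OK, handler(req.body))
    except Exception:
        return Response(req.head.cmd_id, ERR_HANDLER_FAILED, b"")


@register(1)  # hypothetical command id
def echo(body: bytes) -> bytes:
    return body
```

Because every command goes through the same table, adding a command means adding one structure to the contract file and one handler, without touching the transport code.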
Second Example
The backend also supports TCP long‑connections for efficient request/response and server‑push, complementing HTTP. Because Google Cloud Messaging is unavailable in China, the system integrates multiple vendor push channels (Xiaomi, Huawei, Oppo, Vivo, Meizu) alongside its own long‑connection channel.
A unified PushAPI abstracts these complexities, offering methods such as pushSingleDevice, pushMultiDevice, pushAllOnlineDevice, and pushAllDevice. The interface definition is:
interface PushAPI {
    // Single device push
    int pushSingleDevice(PushSingleDeviceReq req, out PushSingleDeviceRsp rsp);
    // Multi-device push
    int pushMultiDevice(PushMultiDeviceReq req, out PushMultiDeviceRsp rsp);
    // Push to all online devices
    int pushAllOnlineDevice(PushAllOnlineDeviceReq req, out PushAllOnlineDeviceRsp rsp);
    // Push to all devices
    int pushAllDevice(PushAllDeviceReq req, out PushAllDeviceRsp rsp);
};
The Push subsystem follows high cohesion and low coupling, persisting messages to avoid loss and providing monitoring, alerts, and status queries.
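One way to picture how a unified push interface hides the individual channels is the Python sketch below. It is not the project's implementation: the routing policy (long connection for online devices, vendor channel otherwise), the return codes, and names such as `RecordingChannel` and `PushService` are assumptions made for illustration.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional, Protocol, Tuple


class PushChannel(Protocol):
    """Anything that can deliver one message to one device."""

    def send(self, device_id: str, message: str) -> bool: ...


@dataclass
class RecordingChannel:
    """In-memory stand-in for a vendor SDK or the long-connection gateway."""

    name: str
    sent: List[Tuple[str, str]] = field(default_factory=list)

    def send(self, device_id: str, message: str) -> bool:
        self.sent.append((device_id, message))
        return True


@dataclass
class Device:
    device_id: str
    vendor: str   # e.g. "xiaomi", "huawei"
    online: bool  # True if a long connection is currently open


class PushService:
    def __init__(self, long_conn: PushChannel, vendors: Dict[str, PushChannel]):
        self._long_conn = long_conn
        self._vendors = vendors
        # A real system would use durable storage; a list keeps the sketch runnable.
        self.persisted: List[Tuple[str, str]] = []

    def push_single_device(self, device: Device, message: str) -> int:
        """Return 0 on success, -1 if no channel exists, -2 if delivery failed."""
        self.persisted.append((device.device_id, message))  # persist before sending
        channel: Optional[PushChannel] = (
            self._long_conn if device.online else self._vendors.get(device.vendor)
        )
        if channel is None:
            return -1
        return 0 if channel.send(device.device_id, message) else -2

    def push_multi_device(self, devices: List[Device], message: str) -> Dict[str, int]:
        """Push the same message to several devices; map device id to result code."""
        return {d.device_id: self.push_single_device(d, message) for d in devices}
```

Callers only see `push_single_device` and `push_multi_device`; adding or removing a vendor channel changes the dictionary passed to the constructor and nothing else.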
Architectural Politics
- System architecture is tied to organizational structure and team division
- When organizational and system boundaries coincide, boundary clarity and cohesion become critical
- Conway's Law explains why communication patterns manifest in system design
- Architecture politics is the art of collaborative problem solving
The article references a past piece titled "Where Does the System Architecture Boundary End?" for deeper discussion.
Feedback Loops and Observability
- Architecture feedback includes health, performance, call chains, data trends, etc.
- No measurement, no improvement; what you measure is what you get
Monitoring and alerting are essential but insufficient; a broader set of signals constitutes architectural feedback, now commonly called observability.
Case Study: App Search
When integrating Elasticsearch with TAF, the team added extensive monitoring for query volume, latency, and abnormal traffic, distinguishing legitimate traffic from keyword‑spamming bots.
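The article does not say how abnormal traffic was identified. As one plausible illustration, the sketch below flags a client whose query count inside a sliding time window exceeds a threshold. The class name, the window length, and the threshold are all hypothetical.

```python
from collections import defaultdict, deque
from typing import Deque, Dict


class QueryRateMonitor:
    """Flags clients that send more than max_queries within window_seconds."""

    def __init__(self, window_seconds: float, max_queries: int):
        self.window = window_seconds
        self.max_queries = max_queries
        self._events: Dict[str, Deque[float]] = defaultdict(deque)

    def record(self, client_id: str, now: float) -> bool:
        """Record one query at time `now`; return True if the client looks abnormal."""
        events = self._events[client_id]
        events.append(now)
        # Drop timestamps that have fallen out of the window.
        while events and now - events[0] > self.window:
            events.popleft()
        return len(events) > self.max_queries
```

A rate check like this only covers volume. Separating keyword-spamming bots from real users would usually need further signals, such as how repetitive the queried keywords are.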
Case Study: Honor of Kings Smart Robot
To trace end-to-end interactions across device, app, AI backend, and cloud services, the team introduced Zipkin for distributed tracing.
Collected data includes tens of millions of logs and spans, enabling real‑time anomaly detection and performance monitoring.
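To show what span data makes possible, the sketch below models a span with a trace id, a span id, and a parent id, then computes each span's self time, meaning its duration minus the time spent in its direct children. It does not use Zipkin's client API. The service names are hypothetical, and the calculation assumes that child spans do not overlap.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass(frozen=True)
class Span:
    trace_id: str
    span_id: str
    parent_id: Optional[str]  # None for the root span of a trace
    name: str
    duration_us: int


def self_times(spans: List[Span]) -> Dict[str, int]:
    """Map span name to its duration minus the total duration of its direct children.

    Assumes the spans belong to one trace and that children run sequentially.
    """
    child_total: Dict[str, int] = defaultdict(int)
    for span in spans:
        if span.parent_id is not None:
            child_total[span.parent_id] += span.duration_us
    return {s.name: s.duration_us - child_total[s.span_id] for s in spans}


# Hypothetical trace: app -> AI backend -> cloud service.
trace = [
    Span("t1", "a", None, "app", 100_000),
    Span("t1", "b", "a", "ai-backend", 70_000),
    Span("t1", "c", "b", "cloud-service", 30_000),
]
breakdown = self_times(trace)
```

In this example `breakdown` attributes 30 ms to the app, 40 ms to the AI backend, and 30 ms to the cloud service, which points at the hop worth optimizing first.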
Architecture Evolution
- Human-driven proactive architecture improvements
- Business-driven reactive evolution
An example from the App Store access layer shows six iterative stages that steadily increased connection success rates.
Balancing Architecture
- Balance is the art of making choices, and it shows up everywhere
- A big mistake is "dropping the watermelon to pick up sesame seeds" (giving up something important for something trivial); a small mistake is "pressing down the gourd only to have the ladle pop up" (fixing one problem only to create another)
- Balance reflects systemic thinking
Typical trade-offs include:
- Performance, experience, and security
- Time vs. space
- Scalability vs. time cost
- Quality vs. efficiency
The classic CAP theorem is revisited to illustrate inherent trade‑offs between consistency, availability, and partition tolerance.
Choosing between consistency and availability depends on system goals; both cannot be maximized simultaneously under partition scenarios.
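The trade-off can be made concrete with a toy replica that behaves differently once it is cut off from its peers. This is a deliberately simplified model rather than a real replication protocol; `Mode`, `Replica`, and `Unavailable` are names introduced here for illustration.

```python
from enum import Enum
from typing import Optional


class Mode(Enum):
    CP = "consistency first"
    AP = "availability first"


class Unavailable(Exception):
    """Raised when a CP replica refuses to serve during a partition."""


class Replica:
    def __init__(self, mode: Mode):
        self.mode = mode
        self.value: Optional[str] = None
        self.partitioned = False  # True when cut off from the other replicas

    def write(self, value: str) -> None:
        if self.partitioned and self.mode is Mode.CP:
            raise Unavailable("cannot replicate the write during a partition")
        self.value = value

    def read(self) -> Optional[str]:
        if self.partitioned and self.mode is Mode.CP:
            raise Unavailable("cannot confirm the latest value during a partition")
        return self.value  # in AP mode this may be stale
```

During a partition the CP replica gives up availability by refusing requests, while the AP replica keeps answering at the risk of returning data that other replicas have already superseded.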
Empowering Architecture
- Good architecture endures over time
- It is highly compatible, extensible, and inclusive
- It enables programmer growth and career advancement
- Closed-loop feedback gives architecture vitality
Designing thoughtful architectures benefits both the system and the engineers who build and maintain it.
Tech Architecture Stories
Internet tech practitioner sharing insights on business architecture, technology, and a lifelong love of tech.
