Design and Implementation of a High‑Performance Message Notification System
This article presents a comprehensive design of a high‑performance, fault‑tolerant message notification system, covering service partitioning, system architecture, idempotent processing, dynamic error detection, thread‑pool management, retry mechanisms, and stability measures such as traffic‑spike handling, resource isolation, third‑party protection, monitoring, and active‑active deployment.
1. Service Partitioning
The system is divided into several layers: a configuration layer for managing sending configurations, an interface layer exposing RPC and MQ interfaces, a core service layer handling first‑time and retry sending, routing, and execution wrappers, a common component layer, and a storage layer consisting of a cache layer and a persistence layer (Redis, MySQL, Elasticsearch).
2. System Design
2.1 First‑time Message Sending
Requests are processed via RPC or MQ; RPC avoids message loss while MQ provides asynchronous decoupling and traffic shaping.
2.1.1 Idempotency Handling
Idempotency is achieved by using Redis keys to quickly detect duplicate messages within a short time window, avoiding costly database lookups.
private boolean isDuplicate(MessageDto messageDto) {
    String redisKey = getRedisKey(messageDto);
    boolean isDuplicate = false;
    try {
        // setNx succeeds only for the first writer within the 30-minute window
        if (!RedisUtils.setNx(redisKey, messageDto, 30 * 60L)) {
            isDuplicate = true;
        }
        if (isDuplicate) {
            // Key already exists: confirm the payloads really match before dropping
            MessageDto oldDTO = RedisUtils.getObject(redisKey);
            if (Objects.equals(messageDto, oldDTO)) {
                log.info("duplicate message detected");
            } else {
                isDuplicate = false;
            }
        }
    } catch (Exception e) {
        // Fail open: if Redis is unavailable, treat the message as new
        log.warn("duplicate check failed, treating message as new", e);
        isDuplicate = false;
    }
    return isDuplicate;
}
2.1.2 Dynamic Problem-Service Detector
Using Sentinel, the system monitors request counts and failure counts within a configurable time window; when thresholds are exceeded, the service is marked as problematic and routed to an exception executor. Automatic recovery is performed after a silent period followed by a half‑circuit‑breaker phase.
public void loadExecuteHandlerRules(Long durationInSec, Long requestCount, Long failCount) {
    List<ParamFlowRule> rules = new ArrayList<>();
    rules.add(ofParamFlowRule(REQUEST_RESOURCE, requestCount, durationInSec));
    rules.add(ofParamFlowRule(FAIL_RESOURCE, failCount, durationInSec));
    ParamFlowRuleManager.loadRules(rules);
}

public ParamFlowRule ofParamFlowRule(String resource, Long count, Long durationInSec) {
    ParamFlowRule rule = new ParamFlowRule();
    rule.setResource(resource);
    rule.setGrade(RuleConstant.FLOW_GRADE_QPS);
    rule.setCount(count);
    rule.setDurationInSec(durationInSec);
    rule.setParamIdx(0);
    return rule;
}

public static boolean judge(String key, boolean reqSuc) {
    // The service is marked problematic only when both the request-count
    // and the failure-count thresholds are exceeded within the window
    return isBlock(REQUEST_RESOURCE, reqSuc, key) && isBlock(FAIL_RESOURCE, reqSuc, key);
}

public static boolean isBlock(String resource, boolean reqSuc, String key) {
    boolean block = false;
    Entry failEntry = null;
    try {
        // Count a failure (batch 1) only when the request did not succeed
        failEntry = entry(resource, EntryType.IN, reqSuc ? 0 : 1, key);
    } catch (BlockException e) {
        block = true;
    } finally {
        if (failEntry != null) {
            failEntry.exit();
        }
    }
    return block;
}
2.1.3 Sentinel Sliding-Window Implementation (Ring Buffer)
The sliding window divides the time window into multiple slots arranged as a ring buffer; each slot is identified by a window id and records its start time. Requests are counted into the slot selected by the current timestamp, and stale slots are reset and reused as time advances.
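The slot mechanics described above can be sketched as follows. This is a minimal, single-class illustration of the ring-buffer idea, not Sentinel's actual LeapArray implementation; class and field names are assumptions:

```java
import java.util.concurrent.atomic.AtomicLongArray;

// Minimal sketch of a sliding window built on a ring buffer of slots.
public class SlidingWindow {
    private final int slotCount;          // number of slots in the ring
    private final long slotLengthMs;      // duration covered by one slot
    private final long[] slotStart;       // start time of the data in each slot
    private final AtomicLongArray counts; // request count per slot

    public SlidingWindow(int slotCount, long windowLengthMs) {
        this.slotCount = slotCount;
        this.slotLengthMs = windowLengthMs / slotCount;
        this.slotStart = new long[slotCount];
        this.counts = new AtomicLongArray(slotCount);
    }

    // Record one request at the given timestamp.
    public synchronized void add(long nowMs) {
        int idx = (int) ((nowMs / slotLengthMs) % slotCount); // window id -> slot index
        long start = nowMs - nowMs % slotLengthMs;            // start time of this slot
        if (slotStart[idx] != start) {                        // slot holds stale data: reset and reuse
            slotStart[idx] = start;
            counts.set(idx, 0);
        }
        counts.incrementAndGet(idx);
    }

    // Sum of counts over all slots still inside the window.
    public synchronized long total(long nowMs) {
        long sum = 0;
        for (int i = 0; i < slotCount; i++) {
            if (nowMs - slotStart[i] < slotLengthMs * slotCount) {
                sum += counts.get(i);
            }
        }
        return sum;
    }
}
```

The key property is that a slot is never cleared by a timer; it is lazily reset the first time a request maps onto it after it has expired, which keeps bookkeeping O(1) per request.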
2.1.4 Dynamic Thread‑Pool Adjustment
Two separate thread pools are used for normal and abnormal services. The pool size can be adjusted at runtime via configuration (Apollo/Nacos). A custom busy‑check prevents task submission when the queue exceeds a configurable threshold, falling back to MQ persistence.
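The runtime resizing mentioned above can be sketched with `ThreadPoolExecutor`'s `setCorePoolSize`/`setMaximumPoolSize`. The `onPoolSizeChanged` hook is illustrative; Apollo and Nacos deliver such changes through their own listener APIs:

```java
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.ThreadPoolExecutor;
import java.util.concurrent.TimeUnit;

// Sketch of runtime thread-pool resizing driven by a config-center callback.
public class ResizablePool {
    private final ThreadPoolExecutor pool =
            new ThreadPoolExecutor(4, 4, 0L, TimeUnit.MILLISECONDS, new LinkedBlockingQueue<>());

    // Invoked when the config center pushes a new pool size.
    public void onPoolSizeChanged(int newSize) {
        // Grow max before core when enlarging, shrink core before max when reducing,
        // so the invariant core <= max holds at every intermediate step.
        if (newSize >= pool.getCorePoolSize()) {
            pool.setMaximumPoolSize(newSize);
            pool.setCorePoolSize(newSize);
        } else {
            pool.setCorePoolSize(newSize);
            pool.setMaximumPoolSize(newSize);
        }
    }

    public int size() {
        return pool.getCorePoolSize();
    }
}
```

Ordering the two setters matters: `setMaximumPoolSize` throws `IllegalArgumentException` if the new maximum would fall below the current core size.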
ThreadPoolExecutor pool = new ThreadPoolExecutor(poolSize, poolSize,
        0L, TimeUnit.MILLISECONDS, new LinkedBlockingQueue<>());

Notifier notifier = getNotifier();
if (!notifier.isBusy()) {
    notifier.execute(msgContent);
}

public boolean isBusy() {
    // Busy once the backlog exceeds twice the configured handler capacity
    return notifyPool.getQueue().size() >= config.getMaxHandlerSize() * 2;
}
2.2 Retry Message Sending
Failed or throttled messages are retried using a distributed scheduled task framework with shard‑broadcast execution. Locks ensure that the same task is not processed concurrently on multiple nodes.
public void init() {
    ScheduledExecutorService scheduledService = new ScheduledThreadPoolExecutor(taskRepository.size());
    for (Map.Entry<String, TaskHandler> entry : taskRepository.entrySet()) {
        final String taskName = entry.getKey();
        final TaskHandler handler = entry.getValue();
        scheduledService.scheduleAtFixedRate(new Runnable() {
            @Override
            public void run() {
                try {
                    if (handler.isBusy()) {
                        return;
                    }
                    handleTask(taskName, handler);
                } catch (Throwable e) {
                    logger.error(taskName + " task handler fail!", e);
                }
            }
        }, 30, 5, TimeUnit.SECONDS);
    }
}

public void handleTask(String taskName, TaskHandler handler) {
    Lock lock = LockFactory.getLock(taskName);
    List<Task> taskList = null;
    if (lock.tryLock()) {
        try {
            taskList = getTaskList(taskName, handler);
        } finally {
            // Release only when the lock was actually acquired;
            // unlocking an unheld lock throws IllegalMonitorStateException
            lock.unlock();
        }
    }
    if (taskList == null) {
        return;
    }
    handler.handleData(taskList);
}
2.2.1 ES and MySQL Data Synchronization
Message send records are stored in MySQL and indexed in Elasticsearch. Synchronization uses the update_time field as a version to ensure consistency; ES indices are rolled monthly with hot/cold tagging for lifecycle management.
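The last-writer-wins rule keyed on update_time can be sketched as below. The in-memory map stands in for the ES index, and the record fields are assumptions; with the real Elasticsearch client the same effect is typically achieved with external versioning, where a write carrying a version no greater than the indexed one is rejected:

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Sketch of update_time-versioned synchronization from MySQL change events into ES.
public class EsSyncSketch {
    public static class SendRecord {
        final String id;
        final long updateTime; // MySQL update_time, used as the version
        final String status;
        public SendRecord(String id, long updateTime, String status) {
            this.id = id;
            this.updateTime = updateTime;
            this.status = status;
        }
    }

    private final Map<String, SendRecord> index = new ConcurrentHashMap<>();

    // Apply an incoming record only if it is newer than what is indexed.
    // Note: the check-then-act is not atomic; this is a sketch, not production code.
    public boolean upsert(SendRecord incoming) {
        SendRecord current = index.get(incoming.id);
        if (current != null && current.updateTime >= incoming.updateTime) {
            return false; // stale change event: drop it, a newer version is already indexed
        }
        index.put(incoming.id, incoming);
        return true;
    }

    public String statusOf(String id) {
        SendRecord r = index.get(id);
        return r == null ? null : r.status;
    }
}
```

Rejecting stale events rather than blindly overwriting is what keeps ES consistent when change events arrive out of order, e.g. when a retry and its original both replay during a consumer restart.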
3. Stability Guarantees
3.1 Traffic Spikes
Two‑level degradation is applied: gradual increase triggers thread‑pool busy detection, then MQ buffering and delayed consumption; sudden spikes are handled directly by Sentinel, routing to MQ for delayed processing after resources free up.
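The first degradation level described above can be sketched as a simple routing decision: messages run inline while the pool has capacity and are parked in MQ for delayed consumption once it is busy. The `Notifier` and `MqProducer` interfaces here are illustrative stand-ins, not the article's actual types:

```java
// Sketch of busy-aware degradation: inline execution with MQ fallback.
public class DegradingSender {
    interface Notifier {
        boolean isBusy();
        void execute(String msg);
    }

    interface MqProducer {
        void sendDelayed(String msg);
    }

    private final Notifier notifier;
    private final MqProducer mq;

    public DegradingSender(Notifier notifier, MqProducer mq) {
        this.notifier = notifier;
        this.mq = mq;
    }

    // Returns true if handled inline, false if degraded to MQ.
    public boolean send(String msg) {
        if (!notifier.isBusy()) {
            notifier.execute(msg); // normal path: run on the in-process pool
            return true;
        }
        mq.sendDelayed(msg);       // degraded path: buffer in MQ, consume later
        return false;
    }
}
```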
3.2 Problem‑Service Resource Isolation
Separate thread pools for problematic and normal services prevent long‑running error‑service calls from starving normal traffic.
3.3 Third‑Party Service Protection
Rate‑limiting and circuit‑breaker mechanisms shield downstream third‑party services from overload and protect the system from cascading failures.
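The rate-limiting half of this protection is commonly a token bucket. The sketch below is a hand-rolled illustration with assumed capacity and refill parameters; in practice Sentinel or Guava's RateLimiter would be used rather than writing this by hand:

```java
// Sketch of a token-bucket limiter guarding calls to a third-party channel.
public class TokenBucket {
    private final long capacity;     // burst size
    private final double refillPerMs;
    private double tokens;
    private long lastRefillMs;

    public TokenBucket(long capacity, double refillPerSecond, long nowMs) {
        this.capacity = capacity;
        this.refillPerMs = refillPerSecond / 1000.0;
        this.tokens = capacity;
        this.lastRefillMs = nowMs;
    }

    // Try to take one token; false means the call should be rejected or queued.
    public synchronized boolean tryAcquire(long nowMs) {
        // Lazily refill based on elapsed time, capped at bucket capacity
        tokens = Math.min(capacity, tokens + (nowMs - lastRefillMs) * refillPerMs);
        lastRefillMs = nowMs;
        if (tokens >= 1.0) {
            tokens -= 1.0;
            return true;
        }
        return false;
    }
}
```

A token bucket tolerates short bursts up to its capacity while enforcing the average rate, which matches third-party quotas better than a hard per-second counter.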
3.4 Middleware Fault Tolerance
Design accounts for temporary middleware outages (e.g., MQ restart) by ensuring graceful degradation and fallback strategies.
3.5 Monitoring System
A comprehensive monitoring suite tracks request counts, failure rates, and resource usage to enable early detection and rapid response.
3.6 Active‑Active Deployment and Elastic Scaling
Services are deployed across multiple data centers with automatic scaling based on load metrics, ensuring high availability and cost‑effective resource utilization.
4. Conclusion
Effective message notification systems require careful consideration of service architecture, functional modules, and stability mechanisms; there is no one‑size‑fits‑all solution, and designs must be tailored to specific business scenarios.
Zhuanzhuan Tech
A platform for Zhuanzhuan R&D and industry peers to learn and exchange technology, regularly sharing frontline experience and cutting‑edge topics. We welcome practical discussions and sharing; contact waterystone with any questions.