
How to Tame Massive Product‑Sync Traffic in a Multi‑Store Chain System

This article analyzes the stability challenges of a multi‑store chain’s product‑copy mechanism, outlines design goals for isolation and scalability, and presents short‑ and long‑term monitoring, flow‑control, and emergency‑response strategies to ensure reliable large‑scale operations.

Youzan Coder

Business Background

Youzan Chain is a multi‑store brand management system. Product data is copied from headquarters to every store, which enables centralized control alongside per‑store customization (price, inventory, title). Each product change at headquarters therefore triggers a creation or update request for every store and partner store, a 1 × (N + M) amplification pattern (N stores, M partner stores). NSQ handles the asynchronous data synchronization.
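To make the amplification concrete, here is a minimal fan‑out sketch in Java. The MessagePublisher interface, topic names, and payload encoding are illustrative placeholders standing in for the real NSQ producer API, which the article does not detail.

```java
import java.nio.charset.StandardCharsets;
import java.util.List;

// Placeholder for the real NSQ producer; publish(topic, payload) is an
// assumed abstraction, not the actual client API.
interface MessagePublisher {
    void publish(String topic, byte[] payload);
}

record ProductChange(long productId, String changeType) {}

class ProductSyncFanOut {
    private final MessagePublisher publisher;

    ProductSyncFanOut(MessagePublisher publisher) {
        this.publisher = publisher;
    }

    // One headquarters change becomes N + M downstream messages:
    // one sync request per store plus one per partner store.
    void fanOut(ProductChange change, List<Long> storeIds, List<Long> partnerIds) {
        for (long storeId : storeIds) {
            publisher.publish("product_sync", encode(change, storeId));
        }
        for (long partnerId : partnerIds) {
            publisher.publish("product_sync_partner", encode(change, partnerId));
        }
    }

    private byte[] encode(ProductChange change, long targetId) {
        // Serialization elided; tag each payload with its target store.
        return (targetId + ":" + change.productId() + ":" + change.changeType())
                .getBytes(StandardCharsets.UTF_8);
    }
}
```

A chain with 5,000 stores and 500 partner stores thus turns one edit into 5,500 messages; containing this amplification is what the rest of the article is about.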

Stability Challenges

Uncontrolled Traffic

Traffic shaping is insufficient during sudden spikes.

Traffic sources are not differentiated, so the actions of large merchants affect small merchants.

The design does not scale to ultra‑large chains.

Uneven Traffic Distribution

A single product change can be amplified thousands of times, causing large merchants to dominate CPU, memory, and network resources.

Lack of Traffic‑Level Visibility

Business‑level monitoring is missing, making it hard to locate high‑traffic events without deep developer knowledge.

Support for Ultra‑Large Chains

The original copy‑based model cannot meet the timeliness and resource‑usage requirements of chains with tens of thousands of stores.

Design Goals

Chain merchants must not impact non‑chain merchants.

Chain merchants must not impact each other.

The system must support chains with thousands of stores.

Solution Approach

Short‑Term Measures

Monitor and rate‑limit amplification nodes in the call chain (by channel, store, product, etc.).

Apply upstream channel throttling.

Merge asynchronous messages to reduce volume.

Long‑Term Measures

Isolate resources between chain and non‑chain services.

Provide VIP isolation for chain traffic.

Systematize hotspot detection and isolation mechanisms.

Key Mechanisms

Traffic Discovery – Monitoring & Alerts

Problems: detection relies on log analysis; alerts are coarse‑grained.

Solution: set pre‑alert thresholds on traffic volume and response time (P90/P95 percentiles) and enable per‑scenario top‑N monitoring to quickly locate hot stores, products, or groups.

Effect: internal detection replaces external reports, reducing detection latency by >30 minutes and cutting issue‑location time by two‑thirds.
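The top‑N part can be sketched as a simple in‑process counter; this is an illustration of the idea, not Youzan's actual monitoring stack. Requests are counted per key (store, product, or group) within a window, and the N largest counts become alert candidates.

```java
import java.util.Comparator;
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.LongAdder;

// Count requests per key within a window; the N largest counts at the
// end of the window are the hotspot candidates to alert on.
class TopNHotspotCounter {
    private final ConcurrentHashMap<String, LongAdder> counts = new ConcurrentHashMap<>();

    void record(String key) {
        counts.computeIfAbsent(key, k -> new LongAdder()).increment();
    }

    // Snapshot the N hottest keys, then reset for the next window.
    List<Map.Entry<String, Long>> drainTopN(int n) {
        List<Map.Entry<String, Long>> top = counts.entrySet().stream()
                .map(e -> Map.entry(e.getKey(), e.getValue().sum()))
                .sorted(Map.Entry.<String, Long>comparingByValue(Comparator.reverseOrder()))
                .limit(n)
                .toList();
        counts.clear();
        return top;
    }
}
```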

Traffic Isolation – VIP Isolation

Problem: large merchants share the same application and storage layers with regular merchants, causing resource contention.

Solution: split request paths for chain and non‑chain traffic, add source/store tags to APIs that cannot be separated, and perform dynamic channel isolation before message consumption.

Effect: >20 business flows now have VIP isolation, preventing large‑merchant spikes from affecting other merchants.
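The channel‑isolation step can be pictured as a dispatcher that routes consumption to separate worker pools by merchant tier. The tier lookup and pool sizes below are illustrative assumptions.

```java
import java.util.Set;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

// Dynamic isolation before message consumption: VIP (large chain)
// merchants get a dedicated worker pool, so their bursts cannot
// exhaust the pool that serves everyone else.
class VipIsolatingDispatcher {
    private final ExecutorService vipPool = Executors.newFixedThreadPool(16);
    private final ExecutorService sharedPool = Executors.newFixedThreadPool(8);
    private final Set<Long> vipMerchantIds;

    VipIsolatingDispatcher(Set<Long> vipMerchantIds) {
        this.vipMerchantIds = vipMerchantIds;
    }

    void dispatch(long merchantId, Runnable syncTask) {
        (vipMerchantIds.contains(merchantId) ? vipPool : sharedPool).execute(syncTask);
    }
}
```

The design point is that a backlog in the VIP pool queues VIP work only; the shared pool keeps draining regular merchants' messages.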

Traffic Control – Rate Limiting & Message Merging

Rate Limiting

Identify the bottleneck (the “shortest stave in the barrel”) from the call‑link timing diagram, calculate per‑chain traffic quotas, and back‑propagate the limits to the amplification nodes.

Support per‑action limits (e.g., product creation/edit), flow‑rate control, temporary interruption, and forced skipping of abnormal traffic.
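A minimal sketch of per‑chain, per‑action limiting using a plain token bucket. The quota values and the forced‑skip switch are illustrative; per the article, real quotas are derived from the measured capacity of the bottleneck node.

```java
import java.util.concurrent.ConcurrentHashMap;

// One token bucket per (chain, action) pair, e.g. (chainId, "product_create").
class ActionRateLimiter {
    private static final class Bucket {
        final double ratePerSec;
        double tokens;
        long lastRefillNanos = System.nanoTime();

        Bucket(double ratePerSec) {
            this.ratePerSec = ratePerSec;
            this.tokens = ratePerSec; // allow one second's burst initially
        }

        synchronized boolean tryAcquire() {
            long now = System.nanoTime();
            tokens = Math.min(ratePerSec, tokens + (now - lastRefillNanos) / 1e9 * ratePerSec);
            lastRefillNanos = now;
            if (tokens >= 1.0) {
                tokens -= 1.0;
                return true;
            }
            return false;
        }
    }

    private final ConcurrentHashMap<String, Bucket> buckets = new ConcurrentHashMap<>();
    private volatile boolean forceSkip = false; // emergency switch: drop abnormal traffic

    void setForceSkip(boolean on) {
        forceSkip = on;
    }

    boolean allow(long chainId, String action, double quotaPerSec) {
        if (forceSkip) return false;
        return buckets
                .computeIfAbsent(chainId + ":" + action, k -> new Bucket(quotaPerSec))
                .tryAcquire();
    }
}
```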

Message Merging

Batch binlog messages in fixed time windows (e.g., 10 s) to reduce sync traffic. Example: inventory deduction at the C‑end averages 100 operations/min, i.e. ~1.67/s, so a 10 s window collects ~16.7 events that collapse into one message, a ~16.7:1 compression ratio.
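A sketch of fixed‑window merging, assuming events keyed by product id and a pluggable flush handler: everything buffered inside one window is flushed as a single merged message per key.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;
import java.util.function.BiConsumer;

class WindowedMessageMerger<E> {
    private final Object lock = new Object();
    private Map<Long, List<E>> buffer = new HashMap<>();

    WindowedMessageMerger(int windowSeconds, BiConsumer<Long, List<E>> flushHandler) {
        ScheduledExecutorService scheduler = Executors.newSingleThreadScheduledExecutor();
        scheduler.scheduleAtFixedRate(() -> {
            Map<Long, List<E>> drained;
            synchronized (lock) {          // swap out the buffer atomically
                drained = buffer;
                buffer = new HashMap<>();
            }
            drained.forEach(flushHandler); // one merged message per key
        }, windowSeconds, windowSeconds, TimeUnit.SECONDS);
    }

    void add(long key, E event) {
        synchronized (lock) {
            buffer.computeIfAbsent(key, k -> new ArrayList<>()).add(event);
        }
    }
}
```

With a hypothetical publishMergedSync handler, new WindowedMessageMerger<>(10, (productId, events) -> publishMergedSync(productId, events)) would realize the 10 s window from the example above.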

Effect: high‑frequency operations now have action‑level throttling; >20 flows support flow control, enabling rapid chain‑scale drills.

Emergency Plans

Goal: give on‑call staff runbooks simple enough to execute without deep system knowledge, so normal service can be restored quickly.

Principles: define observable metrics that trigger a plan, list responsible parties, and provide step‑by‑step actions.
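One possible shape for a runbook entry, expressed as a data structure; the fields and steps below are illustrative assumptions, not Youzan's actual schema. The point is that the trigger is an observable metric and every step is concrete enough for on‑call staff to follow.

```java
import java.util.List;

// Illustrative runbook shape: observable trigger, owner, ordered steps.
record Runbook(
        String triggerMetric,   // the observable condition that activates the plan
        String owner,           // responsible party to page
        List<String> steps) {}  // ordered, low-threshold actions

class Runbooks {
    static final Runbook CHAIN_SPIKE = new Runbook(
            "chain_sync_qps above threshold for 1 min",
            "product-platform on-call",
            List.of(
                    "Confirm the hot chain/store from the top-N dashboard",
                    "Apply the per-chain rate limit for the offending action",
                    "If the backlog keeps growing, enable message merging for the chain",
                    "Escalate to the owner if not recovered within 5 minutes"));
}
```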

Result: response time for chain spikes reduced from 40 minutes to under 5 minutes, with impact confined to specific stores or products.

Production Drills

Regular full‑link stress tests during low‑traffic windows simulate large‑scale promotions to validate isolation and control mechanisms.

Outcomes: identified 20+ issues, adjusted thresholds ~10 times, and continuously improved monitoring coverage; recent drills automatically detected hotspots and applied predefined plans.

Future Outlook

Short‑Term

Automated detection of abnormal traffic and hotspot auto‑reporting.

Integrate rules and algorithms to suggest the appropriate runbook for each detected anomaly.

On‑call staff can trigger runbooks directly from the monitoring dashboard.

Long‑Term

The copy‑based approach scales poorly: each product replicated per store consumes massive storage and hampers consistency.

Proposed reference model: store a single master product at headquarters and only store differential data at stores, dramatically reducing sync traffic and storage usage.
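A sketch of the reference model's read path under assumed data shapes: the master product is stored once at headquarters, each store keeps only the fields it overrides (price, inventory, title), and a store read merges the diff over the master instead of reading a full copy.

```java
import java.util.HashMap;
import java.util.Map;

// Reference model: one master record per product, plus a sparse
// per-store diff holding only the overridden fields.
class ReferenceProductStore {
    private final Map<Long, Map<String, Object>> masterProducts = new HashMap<>();
    private final Map<String, Map<String, Object>> storeDiffs = new HashMap<>(); // key: storeId:productId

    Map<String, Object> readForStore(long storeId, long productId) {
        Map<String, Object> merged = new HashMap<>(masterProducts.getOrDefault(productId, Map.of()));
        merged.putAll(storeDiffs.getOrDefault(storeId + ":" + productId, Map.of()));
        return merged;
    }

    void overrideField(long storeId, long productId, String field, Object value) {
        storeDiffs.computeIfAbsent(storeId + ":" + productId, k -> new HashMap<>()).put(field, value);
    }
}
```

Under this model, a headquarters edit to a field no store has overridden requires no per‑store sync at all, which is what removes the 1 × (N + M) amplification of the copy‑based design.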

Tags: Monitoring, Operations, Scalability, System Design, Flow Control, Message Merging, Product Sync