Log-Based Replay and Comparison System for Gateway Migration Verification
The team built a log-based replay framework to verify a gateway migration covering thousands of APIs. It pulls and cleans logs from the old gateway, samples requests, replays them concurrently against both the old and new gateways, compares the JSON responses under configurable ignore rules, retries failures, and treats an interface as safe to migrate once the response-code distributions of the two gateways match.
Background : In the first half of the year the company rebuilt its gateway system and needed to migrate thousands of retail‑business API calls to the new gateway. Manual functional verification was impractical due to the volume (over 1,000 interfaces) and the risk of service disruption.
System Design : A replay‑and‑compare framework was built that records traffic from the old gateway logs, replays it against both the old and new gateways, and automatically compares the responses. The process consists of four main steps: log pulling & cleaning, replay execution, result comparison, and failure handling.
2.1 Log Pulling and Cleaning : Logs are fetched from the test-environment gateway within a short time window (e.g., the previous hour) so that the recorded data is not stale by the time it is replayed. After retrieval the logs are cleaned in two passes: filtering and sampling. Filtering removes interfaces that are not part of the migration, as well as internal traffic generated by the replay itself. Sampling keeps a manageable yet representative subset. It follows the Pareto principle at the granularity of interface plus scenario, where each distinct response code of an interface counts as a separate scenario.
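The cleaning step could be sketched roughly as follows. The record shape (dicts with `api` and `code` keys), the `is_replay` marker, the `migrating_apis` set, and the per-scenario cap are all assumptions made for illustration, not the team's actual log format or sampling rule.

```python
import random
from collections import defaultdict


def clean_and_sample(records, migrating_apis, per_scenario=3, seed=0):
    """Filter log records, then keep at most `per_scenario` per (api, code) pair."""
    rng = random.Random(seed)  # seeded so a run can be reproduced
    buckets = defaultdict(list)
    for rec in records:
        if rec["api"] not in migrating_apis:
            continue  # interface is not part of the migration
        if rec.get("is_replay"):
            continue  # traffic generated by the replay itself (hypothetical flag)
        # Each response code of an interface is treated as its own scenario.
        buckets[(rec["api"], rec["code"])].append(rec)

    sampled = []
    for key in sorted(buckets):
        group = buckets[key]
        sampled.extend(rng.sample(group, min(per_scenario, len(group))))
    return sampled
```

Capping each (interface, response code) bucket keeps rare scenarios, such as an error code seen once, in the sample while stopping high-volume scenarios from dominating the replay.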
2.2 Replay Execution : Both the old and new gateways are called concurrently for each sampled request. Concurrency is achieved with Python’s multiprocessing module, for example:
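    import multiprocessing

    p = multiprocessing.Pool(8)  # 8 worker processes
    line = file.readline()
    while line:
        # Hand each log line to a worker; the arguments are elided in the original.
        p.apply_async(process_line, args=(...))
        line = file.readline()
    p.close()  # no more tasks will be submitted
    p.join()   # wait for all workers to finish

Benchmarks on a 6-core machine showed a significant speed-up (e.g., pool(1) ≈ 1292 s, pool(8) ≈ 155 s, pool(12) ≈ 102 s).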
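The original does not show `process_line`, so the following is only a minimal sketch of what such a worker might do. The log line format (JSON with `path` and `body`), the two base URLs, and the injectable `send` callable are all hypothetical.

```python
import json
import urllib.request

OLD_GATEWAY = "http://old-gateway.example"  # hypothetical base URL
NEW_GATEWAY = "http://new-gateway.example"  # hypothetical base URL


def http_post(url, body):
    """POST a JSON body and return the decoded JSON response."""
    req = urllib.request.Request(
        url,
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req, timeout=10) as resp:
        return json.loads(resp.read().decode("utf-8"))


def process_line(line, send=http_post):
    """Replay one logged request against both gateways and return both responses."""
    entry = json.loads(line)  # assumed log format: {"path": ..., "body": ...}
    old_resp = send(OLD_GATEWAY + entry["path"], entry["body"])
    new_resp = send(NEW_GATEWAY + entry["path"], entry["body"])
    return old_resp, new_resp
```

In this sketch the two calls for a single line run one after the other; the concurrency comes from the pool processing many lines at once. Passing `send` as a parameter makes the worker easy to exercise without a live gateway.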
2.3 Result Comparison : The core comparison recursively walks the JSON responses from the two gateways. A dictionary ignore_dict defines per‑interface fields that should be ignored or treated specially (e.g., timestamps, generated IDs, unordered lists).
    ignore_dict = {
        'youzan.retail.stock.receiving.order.export.1.0.0': ['response'],
        'youzan.retail.trademanager.refundorder.export.1.0.0': ['response'],
        'youzan.retail.trade.api.service.pay.qrcode.1.0.1': ['url'],
        'youzan.retail.product.spu.queryone.1.0.0': ['list']
    }

The comparison function compare_data handles dictionaries, lists, and primitive values, applying a COMP_SWITCH flag to enable a compatibility mode when strict equality fails.
    def compare_data(data_1, data_2, COMP_SWITCH, ignore_list):
        """Return (only_in_data_1, only_in_data_2, differing_values) for two JSON values."""
        # __doignore and __process_list are helpers not shown in the original.
        if isinstance(data_1, dict) and isinstance(data_2, dict):
            diff_data = {}
            only_data_1_has = {}
            only_data_2_has = {}
            d2_keys = list(data_2.keys())
            for d1k in data_1.keys():
                # In compatibility mode, skip keys listed for this interface.
                if COMP_SWITCH and __doignore(d1k, ignore_list):
                    continue
                if d1k in d2_keys:
                    d2_keys.remove(d1k)
                    # Key exists on both sides: recurse into the values.
                    t1, t2, td = compare_data(data_1.get(d1k), data_2.get(d1k), COMP_SWITCH, ignore_list)
                    if t1:
                        only_data_1_has[d1k] = t1
                    if t2:
                        only_data_2_has[d1k] = t2
                    if td:
                        diff_data[d1k] = td
                else:
                    only_data_1_has[d1k] = data_1.get(d1k)
            # Whatever is left in d2_keys exists only in data_2.
            for d2k in d2_keys:
                if COMP_SWITCH and __doignore(d2k, ignore_list):
                    continue
                only_data_2_has[d2k] = data_2.get(d2k)
            return only_data_1_has, only_data_2_has, diff_data
        else:
            if data_1 == data_2:
                return None, None, None
            else:
                # Strict equality failed: in compatibility mode, give lists a second chance.
                if COMP_SWITCH and isinstance(data_1, list) and isinstance(data_2, list):
                    if __process_list(data_1, data_2, COMP_SWITCH, ignore_list):
                        return None, None, None
                    else:
                        return None, None, [data_1, data_2]
                else:
                    return None, None, [data_1, data_2]

Compatibility mode allows ignoring fields such as timestamps or IDs and performing unordered list comparison, reducing false positives.
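The helpers `__doignore` and `__process_list` are not shown in the original. As a rough illustration of what compatibility mode could involve, the sketch below uses hypothetical stand-ins (`do_ignore`, `strip_ignored`, `lists_match_unordered`) that drop ignored keys and then compare lists as multisets of canonical JSON strings. This is one possible approach, not necessarily the one the team used.

```python
import json


def do_ignore(key, ignore_list):
    """Return True when `key` is listed as ignorable for the current interface."""
    return key in ignore_list


def strip_ignored(value, ignore_list):
    """Recursively drop ignored keys so they cannot affect the comparison."""
    if isinstance(value, dict):
        return {
            k: strip_ignored(v, ignore_list)
            for k, v in value.items()
            if not do_ignore(k, ignore_list)
        }
    if isinstance(value, list):
        return [strip_ignored(v, ignore_list) for v in value]
    return value


def lists_match_unordered(list_1, list_2, ignore_list):
    """Compare two lists while ignoring the order of their top-level elements."""
    def canon(items):
        # sort_keys makes the encoding independent of dict key order.
        return sorted(
            json.dumps(strip_ignored(item, ignore_list), sort_keys=True)
            for item in items
        )
    return len(list_1) == len(list_2) and canon(list_1) == canon(list_2)
```

Note that only the top-level element order is ignored here; lists nested inside an element are still compared in order.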
2.4 Failure Handling : Each replay run is assigned a batch ID. Failed comparisons are retried after a short interval (e.g., 30 s) to avoid race conditions on resources that the previous attempt may have created. Failures that persist after retrying are escalated for manual investigation.
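A retry loop of this kind might look like the sketch below. `replay_once` is a hypothetical callable that replays one request and returns True when the old and new responses match; the retry count is an assumption, and the 30-second default only mirrors the example interval given above.

```python
import time


def replay_with_retry(replay_once, retries=2, interval=30.0, sleep=time.sleep):
    """Call `replay_once()` until it reports a match or the retries run out.

    Returns (matched, attempts). `sleep` is injectable so the wait can be
    skipped in tests.
    """
    attempts = 0
    while True:
        attempts += 1
        if replay_once():
            return True, attempts
        if attempts > retries:
            # Still failing after all retries: leave it for manual investigation.
            return False, attempts
        sleep(interval)  # give resources from the previous attempt time to settle
```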
2.5 Result Judgment & Migration Strategy : Successful replays record the response codes and their frequencies (e.g., {'200': 94, '234000001': 16, '234000002': 1} ). When the distribution of codes matches between old and new gateways and no replay failures remain, the interface is considered safe to migrate. After batch migration, live traffic is gradually shifted to the new gateway with monitoring to confirm stability.
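The distribution check can be illustrated with `collections.Counter`. The `code` field name and the `safe_to_migrate` helper are assumptions for the sake of the example.

```python
from collections import Counter


def code_distribution(responses):
    """Count how often each response code appears, keyed by the code as a string."""
    return Counter(str(resp.get("code")) for resp in responses)


def safe_to_migrate(old_responses, new_responses, failed_count):
    """An interface passes when no failures remain and both distributions match."""
    return (
        failed_count == 0
        and code_distribution(old_responses) == code_distribution(new_responses)
    )
```

Comparing counters rather than ordered lists means the order in which replays complete does not matter, only how often each code was returned.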
3. Solution for General Regression Scenarios : By configuring the target environment via a custom X-Service-Chain header, the same framework can be used to compare any two backend environments (e.g., branch vs. main, pre‑prod vs. prod) without code changes.
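Routing a replayed request to a particular environment might then be as simple as setting the header when the request is built. The header name comes from the text above; the format of its value, the URL, and the `build_request` helper are assumptions.

```python
import json
import urllib.request


def build_request(url, body, service_chain):
    """Build a POST request routed to the environment named by `service_chain`."""
    return urllib.request.Request(
        url,
        data=json.dumps(body).encode("utf-8"),
        headers={
            "Content-Type": "application/json",
            "X-Service-Chain": service_chain,  # value format is an assumption
        },
    )
```

Building the same request twice with different `service_chain` values gives the two sides of a comparison, for example a feature branch against the main branch.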
Conclusion : The log‑based replay and comparison system enabled verification of over 1,200 interfaces, generated ~300 k replay records, uncovered 30+ issues, and ensured a smooth production cut‑over. It also provides a reusable foundation for broader regression testing and automated validation of backend services.
Youzan Coder
Official Youzan tech channel, delivering technical insights and occasional daily updates from the Youzan tech team.