Comprehensive Message Traceability and Real-Time Log Processing for Xianyu
Xianyu's new Message Quality Platform links client, API, and server logs by a unique messageId, cleans and clusters real-time telemetry, correlates user behavior, and visualizes abnormal nodes. The resulting end-to-end traceability cuts incident investigation time by over 90% and can be applied to other pipelines.
Background: Xianyu handles over a hundred million messages daily; reliable messaging is critical for second‑hand transactions. Users need chat to negotiate, and message loss or delay can affect deals and even lead to fraud.
Problem definition: From the user's side, issues appear as lost or delayed messages. Technically, loss stems from the client-side architecture: messages arrive both by API pull and by push over the ACCS long connection, and when many arrive simultaneously they can be lost during merging or fail to persist. Delays are caused by latency and blockage in the ACCS channel.
Key questions: How can problems be detected before release? How can online issues be discovered quickly? How can user-reported (public-opinion) incidents be located efficiently?
Full-link traceability construction: To diagnose incidents, the team aggregates server-side node logs, API logs, client message-status logs, and behavior logs, linking them by a unique messageId. This creates a complete trace of a message from client to server and back. To protect privacy, the actual message content is not stored.
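The linking step can be illustrated with a minimal sketch. The record fields (`ts`, `source`, `node`) and the function name `build_traces` are assumptions made for this example, not the platform's actual schema; the only detail taken from the article is that records are joined on messageId and carry no message content.

```python
from collections import defaultdict

def build_traces(*sources):
    """Merge records from several log sources into per-message timelines."""
    traces = defaultdict(list)
    for source in sources:
        for record in source:
            traces[record["messageId"]].append(record)
    # Sort each timeline by timestamp so the path reads in the order it happened.
    for events in traces.values():
        events.sort(key=lambda r: r["ts"])
    return dict(traces)

# Hypothetical records: identifiers and node names only, no message content.
client_logs = [
    {"messageId": "m1", "ts": 1, "source": "client", "node": "send"},
    {"messageId": "m1", "ts": 4, "source": "client", "node": "display"},
]
server_logs = [
    {"messageId": "m1", "ts": 2, "source": "server", "node": "store"},
    {"messageId": "m2", "ts": 3, "source": "server", "node": "store"},
]
api_logs = [{"messageId": "m1", "ts": 3, "source": "api", "node": "sync"}]

traces = build_traces(client_logs, server_logs, api_logs)
print([e["node"] for e in traces["m1"]])  # ['send', 'store', 'sync', 'display']
```

A trace that stops early, such as `m2` here, which never reaches a client node, is the kind of gap an investigation would start from.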
Log reporting: Critical nodes such as message merging, persistence, display, domain sync, and updates are instrumented. Each node reports its status together with the messageId. The client reuses existing telemetry pipelines to avoid heavy SDK integration; server logs are sent through the existing SLS pipeline.
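As a rough sketch of what a per-node report might carry, the example below defines a small record and serializes it to JSON. The class name `NodeReport`, its fields, and the `"ok"`/`"fail"` status values are hypothetical; the article only states that each node reports its status together with the messageId.

```python
import json
import time
from dataclasses import dataclass, asdict

@dataclass
class NodeReport:
    message_id: str
    node: str     # e.g. "merge", "persist", "display" (illustrative names)
    status: str   # "ok" or "fail" (assumed status vocabulary)
    ts_ms: int    # report time in milliseconds

def make_report(message_id, node, ok, now_ms=None):
    """Build a report; now_ms can be injected to keep tests deterministic."""
    ts = int(time.time() * 1000) if now_ms is None else now_ms
    return NodeReport(message_id, node, "ok" if ok else "fail", ts)

def serialize(report):
    """Encode the report as a JSON line for an existing telemetry pipeline."""
    return json.dumps(asdict(report), sort_keys=True)

report = make_report("m1", "persist", ok=False, now_ms=1000)
print(serialize(report))
```

Keeping the payload to an identifier, a node name, and a status is what lets the report ride on an existing telemetry channel without a new SDK.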
Real-time log cleaning: The platform subscribes to minute-level telemetry, filters it for message-related events, and clusters the remaining records by messageId and utdid, reducing log volume by a factor of tens. The cleaned data is written back to SLS for traceability and to TDDL for monitoring dashboards.
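The filter-and-cluster step might look like the following sketch. The event names in `MESSAGE_EVENTS`, the record layout, and the rule that the latest status per node wins are all assumptions for illustration; the article only specifies filtering for message-related events and clustering by messageId and utdid.

```python
# Hypothetical names of message-related telemetry events.
MESSAGE_EVENTS = {"msg_merge", "msg_persist", "msg_display"}

def clean_batch(events):
    """Drop unrelated events, then fold the rest into one record per (messageId, utdid)."""
    clusters = {}
    for ev in events:
        if ev.get("event") not in MESSAGE_EVENTS:
            continue  # not message-related, discard
        key = (ev["messageId"], ev["utdid"])
        cluster = clusters.setdefault(
            key, {"messageId": ev["messageId"], "utdid": ev["utdid"], "nodes": {}}
        )
        # Assumption: a later report for the same node overwrites the earlier one.
        cluster["nodes"][ev["event"]] = ev["status"]
    return list(clusters.values())

batch = [
    {"event": "page_view", "utdid": "u1"},
    {"event": "msg_merge", "messageId": "m1", "utdid": "u1", "status": "ok"},
    {"event": "msg_persist", "messageId": "m1", "utdid": "u1", "status": "fail"},
    {"event": "msg_merge", "messageId": "m2", "utdid": "u1", "status": "ok"},
    {"event": "msg_persist", "messageId": "m1", "utdid": "u1", "status": "ok"},
]
cleaned = clean_batch(batch)
print(len(batch), "->", len(cleaned))  # 5 -> 2
```

Folding many raw events into one record per message and device is where the volume reduction comes from.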
User behavior logs: Click and page‑exposure events are correlated with message failures to reconstruct the user path leading to an anomaly. Combined with server API call logs, this helps reproduce and pinpoint the exact scenario.
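One simple way to reconstruct the path leading to an anomaly is to collect the same device's behavior events in a time window before the failure. The sketch below assumes a 30-second window and hypothetical fields (`utdid`, `ts`, `type`, `page`); the article does not describe how the correlation is actually performed.

```python
def behavior_before(failure, behavior_events, window_s=30):
    """Return the device's click/exposure events in the window before the failure, oldest first."""
    start = failure["ts"] - window_s
    path = [
        b for b in behavior_events
        if b["utdid"] == failure["utdid"] and start <= b["ts"] <= failure["ts"]
    ]
    return sorted(path, key=lambda b: b["ts"])

failure = {"messageId": "m1", "utdid": "u1", "ts": 100}
behavior = [
    {"utdid": "u1", "ts": 60, "type": "click", "page": "home"},         # too early
    {"utdid": "u1", "ts": 95, "type": "click", "page": "send_button"},
    {"utdid": "u1", "ts": 80, "type": "exposure", "page": "chat"},
    {"utdid": "u2", "ts": 90, "type": "click", "page": "chat"},         # other device
    {"utdid": "u1", "ts": 120, "type": "click", "page": "profile"},     # after failure
]
path = behavior_before(failure, behavior)
print([b["page"] for b in path])  # ['chat', 'send_button']
```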
Front‑end interaction: The UI groups trace results by client uplink, server processing, and client downlink, highlighting abnormal nodes so developers can quickly see where the problem lies.
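The grouping and highlighting logic can be sketched as follows. The stage keys mirror the three groups named above, while the node names, the `status` field, and the rule that anything other than `"ok"` counts as abnormal are assumptions for this example.

```python
STAGE_ORDER = ["client_uplink", "server_processing", "client_downlink"]

def group_by_stage(events):
    """Bucket trace events into the three stages, preserving their order."""
    grouped = {stage: [] for stage in STAGE_ORDER}
    for ev in events:
        grouped[ev["stage"]].append(ev)
    return grouped

def first_abnormal(events):
    """Return (stage, node) of the first non-ok node along the path, or None."""
    grouped = group_by_stage(events)
    for stage in STAGE_ORDER:
        for ev in grouped[stage]:
            if ev["status"] != "ok":
                return stage, ev["node"]
    return None

trace = [
    {"stage": "client_uplink", "node": "send", "status": "ok"},
    {"stage": "server_processing", "node": "store", "status": "ok"},
    {"stage": "client_downlink", "node": "merge", "status": "ok"},
    {"stage": "client_downlink", "node": "persist", "status": "fail"},
    {"stage": "client_downlink", "node": "display", "status": "missing"},
]
print(first_abnormal(trace))  # ('client_downlink', 'persist')
```

Reporting the first abnormal node rather than every one points at the likely root cause, since later nodes often fail only because an earlier one did.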
Summary and outlook: The Message Quality Platform now provides end-to-end visibility, cutting investigation time by over 90%. The approach is reusable for other pipelines, and future work includes automated testing, intelligent alerts, and deeper integration with CI/CD.
Xianyu Technology
Official account of the Xianyu technology team