How WeChat’s Ekko Achieves Ultra‑Low‑Latency Model Updates for Billion‑User Recommendations

At the 16th USENIX OSDI conference, Tencent's WeChat team presented the award-winning Ekko system, a low-latency model-update solution for massive recommendation workloads. Ekko speeds up model updates by orders of magnitude, scales to trillion-parameter models, and has measurably boosted engagement for over a billion daily users.

WeChat Backend Team

The 16th USENIX Symposium on Operating Systems Design and Implementation (OSDI), often called the "Oscars" of computer systems, announced its annual best-paper list, featuring a paper from Tencent's WeChat team titled "Ekko: A Large-Scale Deep Learning Recommender System with Low-Latency Model Update."

OSDI is a premier venue that brings together top researchers from academia and industry to advance operating system technologies, with an acceptance rate of about 19.4% (49 of 253 submissions) this year.

Ekko originates from WeChat’s internal WePS project and addresses the need for rapid model updates in real‑time social scenarios for a user base exceeding one billion. Existing solutions could not keep up with the scale, prompting the development of Ekko.

Key Components of the Ekko Solution

Efficient P2P model‑update transmission service: Coordinates thousands of globally deployed parameter servers to perform real‑time updates, introduces an improved version‑vector algorithm, and implements a log‑free data‑synchronization mechanism that avoids interference from slow or failed machines.
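To make the version-vector idea concrete, here is a minimal sketch of how two replicas can exchange only the updates the other side has not yet seen, with no per-update log. The `Replica` class, its last-writer-wins tie-breaking, and the key/value layout are illustrative assumptions; Ekko's published protocol is considerably more refined.

```python
# Hedged sketch of version-vector replica sync: each replica tracks the
# highest update counter it has seen from every peer, so a sync round can
# send only updates newer than the requester's vector (no update log kept).
class Replica:
    def __init__(self, replica_id):
        self.id = replica_id
        self.clock = 0                 # local update counter
        self.store = {}                # key -> (value, (origin_id, clock))
        self.version_vector = {}       # origin_id -> highest clock seen

    def put(self, key, value):
        self.clock += 1
        self.store[key] = (value, (self.id, self.clock))
        self.version_vector[self.id] = self.clock

    def updates_since(self, remote_vv):
        """Return only the updates the remote replica has not yet seen."""
        return {
            k: (v, (rid, c))
            for k, (v, (rid, c)) in self.store.items()
            if c > remote_vv.get(rid, 0)
        }

    def merge(self, updates):
        for key, (value, (rid, c)) in updates.items():
            if key not in self.store:
                self.store[key] = (value, (rid, c))
            else:
                _, (cur_rid, cur_c) = self.store[key]
                # last-writer-wins; (clock, id) breaks ties deterministically
                if (c, rid) > (cur_c, cur_rid):
                    self.store[key] = (value, (rid, c))
            self.version_vector[rid] = max(self.version_vector.get(rid, 0), c)
```

Because the receiver's version vector fully describes what it already holds, a slow or failed peer only delays its own convergence; it never blocks other replicas from syncing with each other.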

SLO‑aware model‑update scheduler: Prioritizes important gradients on congested networks using freshness‑SLO and quality‑SLO metrics, allowing Ekko to select updates that most affect recommendation quality for prioritized P2P delivery.

Model state manager: Supports disaster recovery for models exceeding 1 TB in size and more than 1,000 instances, enabling rapid detection of performance regressions and incremental rollback in distributed environments within 2.4 seconds.
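One way to picture the rollback logic: keep quality metrics for recent model versions and, when the live metric regresses past a tolerance, roll back to the newest version that still met the bar. The snapshot layout and 5% tolerance are illustrative assumptions; Ekko's actual state manager works incrementally across a distributed deployment.

```python
# Hedged sketch of regression detection and rollback-target selection,
# assuming a history of (version, quality_metric) snapshots, oldest first.
def rollback_target(snapshots, current_metric, tolerance=0.05):
    """Return the newest acceptable version if quality regressed, else None."""
    best = max(metric for _, metric in snapshots)
    threshold = best * (1 - tolerance)
    if current_metric >= threshold:
        return None  # still within tolerance; no rollback needed
    for version, metric in reversed(snapshots):
        if metric >= threshold:
            return version
    return None
```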

In comprehensive tests, Ekko delivered orders-of-magnitude higher model-update throughput than state-of-the-art deep-learning recommendation systems, up to 100× faster than the previous best solution.

The system has been deployed in WeChat for two years, storing hundreds of terabytes of models across thousands of machines and serving over one billion daily users in scenarios such as Channels, Top Stories (看一看), and Subscription Accounts. After full adoption in Channels, global model-update latency dropped below 2.4 seconds, driving a 40% increase in daily active users and an 87% rise in total video plays within six months.

[Figure: existing recommendation-system update schemes compared with the Ekko architecture]

[Figure: Ekko's speedup under heterogeneous network bandwidth]

The paper's recognition at a top international conference reflects Tencent's long-term investment in computer-systems research. The WeChat team plans to keep advancing foundational theory and key technologies and to apply them across business scenarios to improve the product experience and support industry services.

Written by

WeChat Backend Team

Official account of the WeChat backend development team, sharing their experience in large-scale distributed system development.
