
Huya Live Streaming Recommendation Architecture: Business Background, System Design, Vector Retrieval, and Ranking

This article presents a comprehensive overview of Huya's live‑streaming recommendation system, covering business background, overall architecture, vector‑based retrieval, detailed ranking pipeline, technical challenges, deployment strategies, scalability, and future outlook.

Guest Speaker: Li Cha (Huya Live)

Editor: Luo Zhuang (Soul)

Platform: DataFunTalk

Introduction: Hello, I am Li Cha from Huya Live's recommendation engineering team. Huya's live‑stream recommendation focuses on top streamers, emphasizing relationship graphs, textual cues, and long‑term value, which leads to distinct engineering requirements compared with other recommendation scenarios.

The talk covers the following topics:

Business Background

System Architecture

Vector Retrieval

Ranking

Summary and Outlook

01

Business Background

Huya's recommendation scenarios include homepage live recommendations, square video recommendations, and live‑room ad recommendations. Live streaming is a top‑streamer‑centric scenario that values relationship chains, textual cues, and long‑term value, resulting in unique business demands that are reflected in the system architecture.

02

System Architecture

Huya's recommendation pipeline follows the typical industry architecture with some customizations. The ingestion layer handles transparent passing, fusion, degradation, and deduplication. The profiling layer provides long‑term, short‑term, and real‑time user and streamer features. Downstream modules include recall, ranking, re‑ranking, and supporting platform services.

Compared with typical image/video recommendation, Huya requires higher‑frequency deduplication because streamer attributes can change rapidly (e.g., a gamer switching to a talent stream). This imposes stricter timeliness requirements on the deduplication process.
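To make the timeliness requirement concrete, one common way to implement fast-expiring deduplication is a small TTL-keyed cache: impressions are filtered only while their entry is fresh, so a streamer who changes category can surface again quickly. This is a minimal sketch, not Huya's actual implementation; the class name and TTL value are illustrative.

```python
import time

class TtlDedupFilter:
    """Deduplication cache whose entries expire quickly, so a streamer
    whose attributes change (e.g. a gamer switching to a talent stream)
    can be re-recommended soon after. A sketch; names/TTL illustrative."""

    def __init__(self, ttl_seconds=60.0):
        self.ttl = ttl_seconds
        self._seen = {}  # streamer_id -> timestamp of last impression

    def should_filter(self, streamer_id, now=None):
        now = time.time() if now is None else now
        last = self._seen.get(streamer_id)
        if last is not None and now - last < self.ttl:
            return True  # shown recently: deduplicate
        self._seen[streamer_id] = now
        return False     # fresh (or expired): let it through
```

A shorter TTL trades deduplication strength for responsiveness to streamer attribute changes.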

Later sections will dive into vector retrieval and ranking, which cover most of the technical depth of the recommendation system.

03

Vector Retrieval

1. Background

In 2016 Google published the vector‑based retrieval architecture used in YouTube recommendation and search, showing significant gains. Many modern recommendation systems now improve business metrics by optimizing embeddings.

Huya initially used brute‑force retrieval due to a small number of streamers. As the platform grew, brute‑force became infeasible, prompting a shift to vector retrieval at the beginning of this year.
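For intuition, brute-force retrieval is just a full dot-product scan over every streamer embedding, so its cost grows linearly with catalogue size. A minimal NumPy sketch (dimensions and sizes are illustrative):

```python
import numpy as np

# Brute-force retrieval: score the user vector against every streamer
# embedding and take the top-k. One dot product per streamer, so cost
# is O(N * d) per query -- fine for a small catalogue, not at scale.
rng = np.random.default_rng(0)
streamer_vecs = rng.standard_normal((10_000, 64)).astype(np.float32)
user_vec = rng.standard_normal(64).astype(np.float32)

scores = streamer_vecs @ user_vec      # similarity of every streamer
top_k = np.argsort(-scores)[:20]       # indices of the 20 best matches
```

ANN libraries such as ScaNN avoid the full scan by pruning most candidates before scoring.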

We evaluated Facebook's open‑source Faiss and Google's open‑source ScaNN, and adopted ScaNN because its algorithmic optimizations (notably anisotropic vector quantization) better suited our needs.

2. Technical Challenges

Production requires a high‑throughput, low‑latency, highly available system.

Data must be updated quickly to meet vector‑retrieval business needs, and the system must tolerate failures.

Efficient data‑building pipelines are needed to guarantee service quality.

3. Architecture Implementation

We designed a read‑write‑separated, file‑based architecture:

The index builder produces vector embeddings and writes them to binary .npy files, reducing size and simplifying debugging. The builder interacts with models via an SDK and can be used independently for testing.
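The .npy output step can be sketched with plain NumPy: embeddings and an id list go into compact binary files that a consumer can later memory-map. File names are illustrative, not Huya's actual layout.

```python
import numpy as np

# Builder side: write the id list and embedding matrix as binary .npy
# files -- compact on disk and trivially inspectable for debugging.
ids = np.array([101, 102, 103], dtype=np.int64)
embeddings = np.random.default_rng(1).standard_normal((3, 64)).astype(np.float32)

np.save("streamer_ids.npy", ids)
np.save("streamer_embeddings.npy", embeddings)

# Server side: memory-map the matrix back instead of copying it all in.
loaded = np.load("streamer_embeddings.npy", mmap_mode="r")
```

Keeping ids and vectors in separate, versioned files also makes validation (row counts must match) straightforward.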

File distribution uses Alibaba's open‑source Dragonfly for P2P delivery, integrating with the company's file system.

The online server is split into a retrieval engine and an operator module, both accessed via SDK.

Retrieval Engine: Supports ANN and brute‑force search, with load/unload and double‑buffer switching for stability.
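The double-buffer switching idea can be sketched as two index slots: queries read the active slot without locking while a new version loads off to the side, and an atomic reference swap publishes it. This is a minimal illustration of the pattern, not Huya's engine; all names are illustrative.

```python
import threading

class DoubleBufferedIndex:
    """Readers query the active index slot lock-free; a loader prepares
    a new version separately and publishes it with one reference swap."""

    def __init__(self, initial_index):
        self._active = initial_index        # reference assignment is atomic in CPython
        self._swap_lock = threading.Lock()  # serialises concurrent loaders only

    def search(self, query, k):
        return self._active.search(query, k)  # readers never take the lock

    def swap_in(self, new_index):
        with self._swap_lock:
            self._active = new_index  # old version is reclaimed once readers drop it

class ListIndex:
    """Stand-in index: exact top-k over a score list."""
    def __init__(self, scores):
        self.scores = scores
    def search(self, query, k):
        return sorted(range(len(self.scores)), key=lambda i: -self.scores[i])[:k]
```

The swap is what makes load/unload safe: in-flight queries finish against the old version while new ones see the new one.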

Operator Module: Designed with a generic input‑output interface for easy extension and reuse.

Deployment is managed through a control platform, improving iteration speed.

Online queries use a lock‑free double‑buffered index load, batch processing, pure‑memory computation, LRU caching, and CPU instruction optimizations to achieve high throughput and low latency. Builder and server are decoupled, and the service is stateless for rapid scaling.

Data updates are fast: a 2‑million‑record dataset can be loaded into memory within 5 seconds and distributed in 10 seconds. Files are versioned by timestamp, supporting multi‑version online loading with validation and alerting, completing the whole update cycle within a minute.
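Timestamp-versioned file names make "load the newest valid version, fall back otherwise" a simple sorted scan. A sketch of that selection logic; the naming scheme and validator are illustrative:

```python
# Versioned index files carry a build timestamp in the name; the server
# loads the newest file that passes validation and keeps an older
# version as fallback if the latest build is bad.
FILES = [
    "index.20240101120000.npy",
    "index.20240101120500.npy",
    "index.20240101121000.npy",
]

def pick_latest_valid(files, is_valid):
    for name in sorted(files, reverse=True):  # newest timestamp first
        if is_valid(name):
            return name
    return None  # nothing valid: keep serving the in-memory version

# Suppose the newest build failed its validation (e.g. row-count check):
latest = pick_latest_valid(FILES, is_valid=lambda n: "121000" not in n)
```

Pairing this with validation and alerting is what lets the whole update cycle stay safely under a minute.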

Offline builder optimizations include a semi‑automatic hyper‑parameter search tool, distributed locking for task acquisition, multi‑process parallel building, and extensive metric validation (latency, recall, etc.). Currently, Top‑20 ANN recall reaches 0.99, build jobs succeed over 90% of the time, and three builder nodes can complete 50+ build tasks within minutes.
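The recall metric used here is standard: the fraction of the exact (brute-force) top-20 that the approximate search also returned. A small sketch with a simulated ANN result (values are illustrative, not Huya's measurements):

```python
import numpy as np

# Recall@20: overlap between exact and approximate top-20 result sets.
scores = np.arange(1000, dtype=np.float32)            # stand-in similarity scores
exact_top20 = set(np.argsort(-scores)[:20].tolist())  # items 980..999
approx_top20 = (exact_top20 - {980}) | {0}            # simulated ANN: one miss
recall_at_20 = len(exact_top20 & approx_top20) / 20   # 19/20 = 0.95
```

The hyper-parameter search mentioned above is essentially tuning index parameters until this number clears the target (0.99) at acceptable latency.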

Scalability is achieved at service, data, and engine layers: stateless services, distributed lock‑based builder, configurable data shards, standard data‑read APIs, compute‑storage integration, and heterogeneous file distribution.

04

Ranking

1. Data Flow

The ranking pipeline consists of offline training, online scoring, and feature processing. Feature processing extracts long‑term, short‑term, and real‑time user/streamer interests. User profile service uses LRU caching and graceful degradation; streamer profile service employs local double‑buffer caching to handle high read amplification.
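The "LRU caching plus graceful degradation" pattern for the user-profile service can be sketched as a cache that serves stale (or empty) data when the backend call fails, rather than failing the request. This is an illustrative sketch under that assumption, not Huya's service code.

```python
from collections import OrderedDict

class ProfileCache:
    """LRU cache in front of a profile backend. On backend failure the
    stale cached profile (or an empty default) is served -- the request
    degrades gracefully instead of erroring. Names illustrative."""

    def __init__(self, capacity=10_000):
        self.capacity = capacity
        self._data = OrderedDict()

    def get(self, user_id, fetch):
        try:
            profile = fetch(user_id)            # remote profile-service call
        except Exception:
            return self._data.get(user_id, {})  # degrade: stale or empty
        self._data[user_id] = profile
        self._data.move_to_end(user_id)         # mark most-recently used
        if len(self._data) > self.capacity:
            self._data.popitem(last=False)      # evict least-recently used
        return profile
```

The streamer side differs mainly in read amplification (many users score the same streamers), which is why a local double-buffered cache fits better there.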

2. Features

Features are stored as TFRecord files whose Protocol Buffers schemas make them self‑describing and easy to validate. Offline feature extraction uses JNI to call the same extractor as the online path, ensuring consistency.
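The point of calling the same extractor from both paths is that identical raw inputs must yield identical features offline and online. Huya does this via JNI into shared code; the shared-function idea is sketched here in Python with illustrative field names.

```python
# One extractor implementation, invoked by both the offline (training)
# and online (serving) paths, eliminates feature skew by construction.
def extract_features(raw):
    return {
        "watch_time_bucket": min(int(raw.get("watch_seconds", 0)) // 60, 10),
        "is_subscribed": int(bool(raw.get("subscribed", False))),
    }

offline_row = extract_features({"watch_seconds": 400, "subscribed": True})
online_row = extract_features({"watch_seconds": 400, "subscribed": True})
# Identical inputs produce identical features in both pipelines.
```

Any bucketing or default-value logic lives in exactly one place, so it cannot drift between training and serving.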

3. Inference Optimizations

Integrated the gRPC‑based inference service as a dynamic library to fit the company’s ecosystem.

Applied common community optimizations: model warm‑up and dedicated thread pools.

Throttled model‑download bandwidth during peak periods to keep traffic under control.

Moved user‑feature copying from the client side to the inference service, reducing bandwidth by over 50 %.

After these optimizations, the ranking service achieves four‑nines (99.99%) availability and saves more than 50% of data‑transfer bandwidth.
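The source of the user-feature saving is simple arithmetic: the client sends the user features once instead of once per scored candidate, and the inference service tiles them across the batch. The numbers below are illustrative, not Huya's measurements.

```python
# Old scheme: client duplicates user features for every candidate in
# the scoring batch. New scheme: send once, copy server-side.
num_candidates = 500          # candidates scored per request (illustrative)
user_feature_bytes = 2_000    # serialized user-feature size (illustrative)

client_side_copy = num_candidates * user_feature_bytes  # old: per-candidate
server_side_copy = user_feature_bytes                   # new: sent once
saving = 1 - server_side_copy / client_side_copy        # user-feature portion
```

The user-feature portion shrinks by roughly the batch-size factor, which is how the overall transfer drops by more than half.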

05

Summary and Outlook

We conclude with a brief outlook: the architecture still has many optimization opportunities. We will continue to follow business trends, refine the platform, and improve iteration efficiency.

Thank you for listening.


About Us:

DataFun focuses on sharing and exchange around big data and AI technology. Founded in 2017, it has held over 100 offline and more than 100 online salons, forums, and conferences in Beijing, Shanghai, Shenzhen, Hangzhou, and other cities, inviting nearly 1,000 experts and scholars. Its WeChat public account, DataFunTalk, has produced over 500 original articles with millions of reads and more than 130,000 followers.

Thoughtful readers leave field notes, pushback, and hard-won operational detail here.