
Testing Challenges and Quality Assurance Improvements for News Recommendation Systems

To cope with NetEase News's multi‑stage, feature‑rich, and rapidly iterated recommendation pipeline, the QA team introduced detailed per‑stage logging, controllable recall and user‑profile injection, configurable filters, a forced‑push mechanism, an integrated performance‑testing platform, automated case configuration, centralized requirement tracking, and developer self‑test tools. Together these measures markedly improved testing quality and efficiency; the article closes with future automation goals.

NetEase Media Technology Team

Background

As recommendation strategies are continuously iterated, the content forms in NetEase News become increasingly diverse, features become more fine‑grained, and user characteristics become richer. Consequently, the recommendation strategies grow more complex, making the stability of online recommendation services a key focus for the quality assurance team. By exploring and summarizing the business characteristics of the news recommendation system, the team has devised a set of testing methods that better suit iterative business updates, thereby improving testing quality and efficiency.

The main topics covered are:

Basic characteristics and testing difficulties of the news recommendation system;

Solutions adopted by the testing team to address these difficulties;

Measures taken to improve testing efficiency;

Future directions for business testing.

News Recommendation Business Testing Challenges

1. Multiple stages in the recommendation chain

A user request passes through recall, filtering, coarse ranking, fine ranking, and re‑ranking stages. Each stage applies strategies based on user and content features, and the final result alone cannot verify the effectiveness of intermediate strategies.

2. Complex user profile features

Recommendation strategies are tailored to a wide range of user attributes, including basic information, activity levels, and preference signals derived from clicks, browsing, etc. Testing must select appropriate user profiles that match the target audience of each strategy.

3. Diverse content forms and features

News articles include text, video, and various UGC formats. Different article styles trigger different recommendation policies, and articles are enriched with multiple feature dimensions via model inference or manual labeling.

4. Short iteration cycles

The service is a pure backend service, so most requirements do not depend on client releases. On average there are more than 10 requirements per week, with an average testing time of 1.5 hours per requirement, resulting in very short testing windows.

Quality Assurance Improvements

To tackle the above challenges, the team introduced targeted improvements in business testing and performance testing.

1. Business testing improvements

1.1 Increase logging at each stage

When a test targets intermediate stages such as recall, filtering, or ranking, the results are not visible in the final recommendation list. Two problems follow:

Logic such as article promotion, demotion, or truncation cannot be verified without explicit logs.

Problem localization previously relied on developers stepping through the code, which was inefficient.

To make these stages observable, detailed logs in a unified format were added to each stage, enabling easier analysis and debugging.
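The unified log format described above can be sketched as a small helper that emits one structured line per pipeline stage; the field names and key schema here are illustrative assumptions, not NetEase's actual format.

```python
import json

def log_stage(stage, request_id, items, extra=None):
    """Emit one structured log line per pipeline stage so intermediate
    results (recall, filter, rank) remain observable after the request.
    All field names are illustrative, not the production schema."""
    record = {
        "stage": stage,            # e.g. "recall", "filter", "coarse_rank"
        "request_id": request_id,  # ties all stages of one request together
        "count": len(items),
        "doc_ids": [it["doc_id"] for it in items],
    }
    if extra:
        record.update(extra)     # stage-specific detail, e.g. drop reasons
    return json.dumps(record, sort_keys=True)

# One request's filter stage: two docs kept, one dropped as a duplicate.
line = log_stage("filter", "req-42",
                 [{"doc_id": "a1"}, {"doc_id": "b2"}],
                 extra={"dropped": ["c3"], "reason": "duplicate"})
```

Because every stage shares one schema keyed by request ID, a tester can grep a single request across recall, filtering, and ranking to see exactly where an article was promoted, demoted, or truncated.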

1.2 Controllable testing

1.2.1 Controllable test-document recall

The recall module selects content based on user features and request parameters. Because recall results are large and complex, testers could not manually construct the needed articles. By allowing custom article injection into the recall request, specific logic can be verified.
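Custom article injection can be sketched as a thin wrapper around the recall call that merges tester-supplied documents into the organic result; the function names and document shape are assumptions for illustration.

```python
def recall_with_injection(request, recall_fn, injected_docs=()):
    """Merge tester-supplied articles into the recall result so specific
    downstream logic can be exercised without hunting for matching
    production content. Names and doc shape are illustrative."""
    docs = list(recall_fn(request))
    existing = {d["doc_id"] for d in docs}
    # Injected docs are appended, skipping any already recalled organically.
    docs.extend(d for d in injected_docs if d["doc_id"] not in existing)
    return docs

# A stub recall function returns one organic doc; the tester injects one
# crafted doc plus a duplicate of the organic one (which is skipped).
docs = recall_with_injection(
    {"user_id": "u1"},
    lambda req: [{"doc_id": "hot-1"}],
    injected_docs=[{"doc_id": "test-ugc-video"}, {"doc_id": "hot-1"}],
)
```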

1.2.2 Controllable test-user features

Similarly, user-profile features can be overridden by rewriting the feature vectors, constructing exactly the user model a strategy targets. User behavior (e.g., click history) can be simulated by inserting recommendation and click records into Redis.
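Seeding click history can be sketched as follows. The key format is an illustrative assumption, and a minimal in-memory stand-in is used here in place of a real Redis client (in production this would be e.g. a `redis.Redis` instance, whose `rpush` has the same call shape).

```python
class FakeStore:
    """Minimal in-memory stand-in for a Redis client (rpush only),
    used here so the sketch runs without a live Redis."""
    def __init__(self):
        self.data = {}

    def rpush(self, key, value):
        self.data.setdefault(key, []).append(value)

def inject_click_history(store, user_id, doc_ids):
    """Seed a user's simulated click history so behaviour-keyed
    recommendation strategies fire deterministically under test.
    The key format is illustrative, not the production schema."""
    key = f"user:{user_id}:clicks"
    for doc_id in doc_ids:
        store.rpush(key, doc_id)
    return key

store = FakeStore()
key = inject_click_history(store, "tester-01", ["doc-a", "doc-b"])
```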

1.3 Control variables

Because many strategies may apply to the same article, it is difficult to isolate the logic under test. The solution is to make filters configurable: blacklists prevent certain filters from running, while whitelists force only specific filters to run, allowing testers to target the exact logic.
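The blacklist/whitelist mechanism can be sketched as a filter runner that consults test-time overrides; the filter names and item fields are illustrative assumptions.

```python
def run_filters(items, filters, blacklist=(), whitelist=None):
    """Apply content filters with test-time overrides.
    blacklist: filter names to skip; whitelist: if given, run ONLY these.
    Filter names and item fields are illustrative."""
    active = [f for f in filters
              if f.__name__ not in blacklist
              and (whitelist is None or f.__name__ in whitelist)]
    for f in active:
        items = [it for it in items if f(it)]  # keep items the filter passes
    return items

def dedup_filter(item):
    """Drop items flagged as duplicates."""
    return not item.get("dup", False)

def region_filter(item):
    """Drop items outside the target region."""
    return item.get("region") == "cn"

items = [{"doc_id": "a", "dup": True, "region": "cn"},
         {"doc_id": "b", "region": "us"}]
# Whitelist isolates dedup_filter: region_filter is not run, so "b"
# survives even though its region would normally exclude it.
only_dedup = run_filters(items, [dedup_filter, region_filter],
                         whitelist={"dedup_filter"})
```

This lets a tester verify the deduplication logic alone, without its result being masked by every other filter that would normally touch the same article.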

1.4 Content forced push

To avoid reliance on normal recommendation results during front‑end integration, a forced‑push mechanism was added. By configuring a specific article and its display style, the content can be guaranteed to appear, enabling front‑end verification without extensive data preparation.
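The forced-push mechanism can be sketched as a merge step that pins configured articles, with their display style, at the head of the list; the config keys are illustrative assumptions.

```python
def apply_forced_push(recommendations, forced):
    """Prepend force-pushed articles (each with a fixed display style) so
    front-end integration can verify rendering without relying on the
    organic recommendation result. Config keys are illustrative."""
    pushed = [{"doc_id": f["doc_id"],
               "style": f.get("style", "default"),
               "forced": True}
              for f in forced]
    ids = {p["doc_id"] for p in pushed}
    # Drop organic duplicates of the forced articles, keep the rest.
    rest = [r for r in recommendations if r["doc_id"] not in ids]
    return pushed + rest

feed = apply_forced_push(
    [{"doc_id": "x"}, {"doc_id": "y"}],
    forced=[{"doc_id": "y", "style": "big_image"}],
)
```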

2. Performance testing improvements

2.1 Load‑test data transformation

Initially, load tests used randomly generated parameter combinations, which lacked realistic coverage. Later, real traffic data from production was incorporated, improving scenario coverage and allowing targeted load generation for specific user groups.
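Replaying recorded traffic, optionally restricted to a user group, can be sketched as a sampler over the production request log; the record fields are illustrative assumptions.

```python
import random

def build_load_requests(traffic_log, n, user_group=None, seed=0):
    """Sample recorded production requests for replay in a load test.
    Filtering by user group lets the load target a specific population.
    The log record shape ("group" field) is illustrative."""
    pool = [r for r in traffic_log
            if user_group is None or r.get("group") == user_group]
    rng = random.Random(seed)  # seeded for reproducible load runs
    return [rng.choice(pool) for _ in range(n)]

traffic_log = [{"uid": 1, "group": "heavy"},
               {"uid": 2, "group": "light"},
               {"uid": 3, "group": "heavy"}]
reqs = build_load_requests(traffic_log, 5, user_group="heavy", seed=1)
```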

2.2 A performance-testing platform within the recommendation quality-and-efficiency platform

Previous performance testing suffered from several issues:

Every user of the NPT load platform required training, and each test required creating a new task, which risked accidental script changes.

NPT only recorded basic performance metrics; recommendation‑specific metrics (recall count, filter count, etc.) had to be recorded separately.

Lack of standardized thresholds made it hard for non‑owners to judge whether results met expectations.

The new platform integrates NPT load triggering with the Sentinel monitoring system, maps task IDs to columns, and configures required parameters and thresholds. After a test, data are automatically stored in a database, formatted for front‑end display, and flagged against predefined thresholds.
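The automatic pass/fail flagging against per-column thresholds can be sketched as a simple comparison over the stored metrics; the metric names and threshold shape are illustrative assumptions.

```python
def flag_metrics(results, thresholds):
    """Compare measured metrics against configured (low, high) thresholds
    and flag anything out of range, mirroring the platform's automatic
    report. Metric names and the threshold shape are illustrative."""
    flags = {}
    for name, value in results.items():
        lo, hi = thresholds.get(name, (float("-inf"), float("inf")))
        flags[name] = "ok" if lo <= value <= hi else "out_of_range"
    return flags

results = {"p99_ms": 80, "recall_count": 120}
thresholds = {"p99_ms": (0, 100),        # latency budget
              "recall_count": (200, 1000)}  # expected recall volume
flags = flag_metrics(results, thresholds)
```

Standardizing thresholds this way is what lets a non-owner read the report: any "out_of_range" flag means the run failed its column's criteria, with no domain judgment required.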

Benefits after deployment:

Operation simplification – only essential inputs (server IP, time, concurrency) are needed.

Standardized and customizable performance criteria per column, with automatic report generation and alerts for out‑of‑range metrics.

Visual comparison of performance data via automatically generated charts.

Direct linkage of performance results to JIRA tickets, facilitating review and retrospectives.

Efficiency Gains

Beyond improving testing quality, the team focused on boosting efficiency through several concrete measures.

1. Test case configuration and automation

Most recommendation requirements are validated via A/B experiments, which are cumbersome to modify manually on the Kunlun platform. After the engine migrated to the Noah container, IP addresses became dynamic, and the existing test framework could not adapt quickly. By extracting mutable parameters (e.g., the experiment version) into configuration files and adding new API parameters to control experiment inclusion and exclusion, test execution time was dramatically reduced.
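Extracting mutable parameters into configuration can be sketched with a request template filled from an external config; the parameter names, host, and URL shape are illustrative assumptions.

```python
import string

def render_case(template, config):
    """Fill a test-case template from an external config so values that
    change every iteration (experiment version, engine host) are edited
    in one place rather than in every case. Keys are illustrative."""
    return string.Template(template).substitute(config)

# In practice this dict would be loaded from a config file.
config = {"version": "v3.2", "host": "10.0.0.1"}
req = render_case("http://${host}/rec?abtest_version=${version}", config)
```

When the experiment version bumps or a container is rescheduled to a new IP, only the config changes; every test case picks up the new values on the next run.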

2. Centralized requirement management

Before full rollout, each recommendation requirement undergoes an online A/B test; once the test passes, both the code and the corresponding regression cases must be added. The high volume of requirements led to missed regression cases, so a centralized requirement dashboard was introduced to clearly display the status of each case and prevent omissions.

3. Self-test tools to free up QA resources

Modules with frequent weekly changes (e.g., recall) required extensive QA effort. A self‑test tool was built for developers, defining clear pass criteria and reducing the need for manual QA verification. After deployment, QA workload decreased significantly and testing efficiency increased.

Future Plans

Going forward, the focus will remain on enhancing testing efficiency and quality. The team will continue to uncover untested business scenarios, ensure comprehensive coverage of all logic paths, extract common testing patterns, and develop automated tools tailored to the recommendation system to further streamline testing.
