How We Scaled Headline Recommendation Data with MySQL, Redis, and Pipeline Optimizations
This article details the architecture and evolution of a headline recommendation system, covering data aggregation, storage strategies using MySQL and Redis, challenges with reload latency and memory usage, and the optimizations—including data separation, Redis migration, and query pipeline improvements—that enabled scalable, efficient backend operations.
Business Introduction
The headline data for the Chinese Calendar app is aggregated by recommendation algorithms and includes ALS algorithm data, user‑profile data, timeliness data, non‑timeliness data, fixed‑investment data, surprise data, channel data, hot‑list data, and user‑reading‑recommendation data. Startup modes are cold start and user‑profile start.
Cold start: no user profile or profile score < 8.
User profile: a series of tags generated from the user's headline browsing, each represented by a Long ID (e.g., entertainment = 285L, travel = 1127L).
Timeliness data: time‑related data that disappears automatically (e.g., news, entertainment).
Non‑timeliness data: time‑independent data that persists long‑term (e.g., health).
Fixed‑investment data: manually placed data from the admin backend, usually fixed‑position items such as ads or posts.
Surprise data: data that falls outside the user profile.
Channel data: data composed of multiple tags; a channel is a parent of many tags, and tags are the basic units of a user profile.
Hot‑list data: high‑scoring data calculated from real‑time user click logs.
User‑reading‑recommendation data: related data calculated from real‑time user click logs.
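The choice between the two startup modes can be sketched as below. Only the "score < 8" threshold comes from the text; the class and method names, the use of an int score, and the tag map layout (tag ID to weight) are assumptions made for illustration.

```java
import java.util.Map;

/** Hypothetical helper; only the threshold of 8 is taken from the article. */
final class StartModeSelector {
    enum StartMode { COLD_START, USER_PROFILE }

    static final int MIN_PROFILE_SCORE = 8;

    /**
     * profileTags maps tag ID -> weight (assumed layout); profileScore is
     * assumed to be computed elsewhere from the user's browsing history.
     */
    static StartMode select(Map<Long, Double> profileTags, int profileScore) {
        boolean noProfile = profileTags == null || profileTags.isEmpty();
        if (noProfile || profileScore < MIN_PROFILE_SCORE) {
            return StartMode.COLD_START;
        }
        return StartMode.USER_PROFILE;
    }
}
```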
Data Storage
Headline data is fetched from partners via scheduled third‑party API calls. After classification by channel tags, the data is stored in a MySQL database. The headline service periodically reloads the MySQL data into Redis, and then reloads from Redis into local memory. Aggregation assembles the in‑memory data according to the algorithms.
The two reload steps support horizontal scaling. If every service node reloaded directly from MySQL, connection pressure on the database would grow with each added node, and the nodes could end up with inconsistent loads. Using Redis as an intermediate layer reduces database load, because Redis offers higher concurrency and faster access.
In local memory, data is divided into several pools, each with a specific data structure:
New pool: stores newly fetched non‑timely data. Structure: Set<Long>
Old pool: stores items with click‑through and PV data. Structure: List<Long>
Video pool: stores all video items. Structure: List<WnlLifeCardItemBean>
Non‑timely tag pool: stores IDs of non‑timely entries for each tag. Structure: Multimap<Long, Long>
Timely tag pool: stores IDs of timely entries for each tag. Structure: Multimap<Long, Long>
Almanac pool: stores data under almanac tags. Structure: List<WnlLifeCardItemBean>
Constellation pool: stores data under constellation tags. Structure: List<WnlLifeCardItemBean>
Future reminder pool: stores reminder data for movies, sports, etc. Structure: List<WnlLifeCardItemBean>
TotalMap: a map of all IDs to bean objects.
Additional recommendation data from the big‑data platform is kept in Redis with the structure Set<Long>. Note: WnlLifeCardItemBean is the bean returned by the headline service; Long values represent bean IDs or tag IDs.
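A reduced sketch of a few of these pools may help show how the ID-based structures relate to TotalMap. It uses only JDK collections, so Map<Long, Set<Long>> stands in for Multimap<Long, Long>, and the nested Item class is a hypothetical stand-in for WnlLifeCardItemBean, whose real fields are not given in the article.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

final class LocalPools {
    /** Stand-in for WnlLifeCardItemBean; these fields are assumptions. */
    static final class Item {
        final long id;
        final boolean timely;
        final List<Long> tagIds;

        Item(long id, boolean timely, List<Long> tagIds) {
            this.id = id;
            this.timely = timely;
            this.tagIds = tagIds;
        }
    }

    final Set<Long> newPool = new HashSet<>();                      // new non-timely IDs
    final Map<Long, Set<Long>> nonTimelyTagPool = new HashMap<>();  // tag ID -> item IDs
    final Map<Long, Set<Long>> timelyTagPool = new HashMap<>();     // tag ID -> item IDs
    final Map<Long, Item> totalMap = new HashMap<>();               // item ID -> bean

    /** Registers an item in TotalMap and in the tag pool that matches its timeliness. */
    void add(Item item) {
        totalMap.put(item.id, item);
        Map<Long, Set<Long>> tagPool = item.timely ? timelyTagPool : nonTimelyTagPool;
        for (Long tagId : item.tagIds) {
            tagPool.computeIfAbsent(tagId, k -> new HashSet<>()).add(item.id);
        }
        if (!item.timely) {
            newPool.add(item.id);
        }
    }

    /** Resolves the IDs stored under a tag back to beans through TotalMap. */
    List<Item> byTag(long tagId, boolean timely) {
        Map<Long, Set<Long>> tagPool = timely ? timelyTagPool : nonTimelyTagPool;
        List<Item> result = new ArrayList<>();
        for (Long id : tagPool.getOrDefault(tagId, Collections.emptySet())) {
            Item item = totalMap.get(id);
            if (item != null) {
                result.add(item);
            }
        }
        return result;
    }
}
```

Keeping only IDs in the tag pools means each bean is held once, in TotalMap, and every pool lookup ends with an ID-to-bean resolution.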
Early Data Update Method
Data updates occur in two places: Redis and local memory. New data is read from Redis by a Spring Quartz job in each API service and synchronized to local memory. Redis data is refreshed by a separate background module, also using Spring Quartz, which reads from MySQL and writes to Redis. Additionally, each API service runs a per‑second task to update PV and click counts.
Because some content is manually approved (e.g., ads) and some is automatically published, this batch update cannot reflect changes instantly, so manual placements and removals take effect only after a delay.
Problems Encountered
Data‑update loss: when the background task updates Redis while an API service is reloading data into local memory, the local data can be overwritten.
Excessive reload time: API services load data lazily on user request; a cache limits concurrent reloads, but if the cache timeout is shorter than the load time, concurrent reloads cause contention.
High memory and CPU consumption: as non‑timely data accumulates, deserialization from Redis becomes CPU‑intensive, and limiting the number of items read may omit valuable data.
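The reload contention described above can be illustrated with a guard that lets at most one reload run at a time while other callers keep serving the previous snapshot. This is a generic sketch of a common technique, not the fix the article adopted; the class name and its API are hypothetical.

```java
import java.util.concurrent.atomic.AtomicBoolean;
import java.util.concurrent.atomic.AtomicReference;
import java.util.function.Supplier;

/** Hypothetical guard: one reload in flight, readers always see a complete snapshot. */
final class SingleFlightReloader<T> {
    private final AtomicBoolean reloading = new AtomicBoolean(false);
    private final AtomicReference<T> snapshot;
    private final Supplier<T> loader;

    SingleFlightReloader(T initial, Supplier<T> loader) {
        this.snapshot = new AtomicReference<>(initial);
        this.loader = loader;
    }

    /** Returns true if this call performed the reload, false if one was already running. */
    boolean tryReload() {
        if (!reloading.compareAndSet(false, true)) {
            return false; // another caller is loading; keep using the old snapshot
        }
        try {
            snapshot.set(loader.get()); // swap in the new data in one step
            return true;
        } finally {
            reloading.set(false);
        }
    }

    T current() {
        return snapshot.get();
    }
}
```

Unlike a cache entry with a timeout, the flag is released only when the load finishes, so a slow load cannot be overlapped by a second one.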
Business Data Separation
To avoid reloading all data for a single change, data is separated by business domain. Each SQL statement now loads only its specific data type, reducing memory spikes and CPU usage. Updates are performed per business category, and Redis publish/subscribe messages keep Redis and local memory in sync.
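The per-category update can be sketched as a dispatcher that maps a category name to the routine that reloads just that category. The Redis subscription itself is left out: onMessage is assumed to be called from whatever pub/sub callback the Redis client provides, and the class name, category names, and message format are hypothetical.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

/** Hypothetical dispatcher; the real message format is not given in the article. */
final class CategoryReloadDispatcher {
    private final Map<String, Runnable> reloaders = new ConcurrentHashMap<>();

    /** Registers the routine that reloads one business category into local memory. */
    void register(String category, Runnable reloader) {
        reloaders.put(category, reloader);
    }

    /**
     * Intended to be invoked with the payload of a change notification.
     * Returns false when no reloader is registered for the category.
     */
    boolean onMessage(String category) {
        Runnable reloader = reloaders.get(category);
        if (reloader == null) {
            return false;
        }
        reloader.run(); // reload only the affected category
        return true;
    }
}
```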
Migrate Recommendation Data to Redis
Even with data separation, loading all non‑timely data into memory creates large temporary objects and frequent GC. Options considered were increasing machine memory, implementing incremental updates, or moving data to Redis. The chosen solution migrates recommendation data to Redis, storing both base data and index data. Indexes are ordered ID sets (e.g., news, channel, exposure, CTR) whose scores are updated via the big‑data platform.
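The index idea, an ordered set of IDs ranked by a score that is updated from outside, can be sketched in memory as follows. The article does not name the Redis data type it uses (a sorted set would be a natural fit), so this sketch uses a plain map as a stand-in, and all names are hypothetical.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** In-memory stand-in for one score-ordered index, e.g. a CTR index. */
final class ScoredIndex {
    private final Map<Long, Double> scores = new HashMap<>();

    /** Inserts an ID or overwrites its score, as a score refresh would. */
    void upsert(long id, double score) {
        scores.put(id, score);
    }

    /** Returns up to n IDs, highest score first; ties are broken by smaller ID. */
    List<Long> top(int n) {
        List<Map.Entry<Long, Double>> entries = new ArrayList<>(scores.entrySet());
        entries.sort((a, b) -> {
            int byScore = Double.compare(b.getValue(), a.getValue());
            return byScore != 0 ? byScore : Long.compare(a.getKey(), b.getKey());
        });
        List<Long> ids = new ArrayList<>();
        for (int i = 0; i < entries.size() && i < n; i++) {
            ids.add(entries.get(i).getKey());
        }
        return ids;
    }
}
```

Because the index holds only IDs and scores, the service can read a small ranked slice and fetch the matching base data by ID, instead of deserializing the whole data set.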
During recommendation, enough data is pre‑computed and placed into a user‑reading cache; when the cache depletes, aggregation is triggered again.
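A minimal sketch of this refill-on-depletion behavior is shown below. It keeps the per-user cache in local memory for brevity, and the aggregator function, which is assumed to return several pages of item IDs per call, is a hypothetical placeholder for the real aggregation step.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Function;

/** Hypothetical per-user cache of precomputed recommendation IDs. */
final class ReadingCache {
    private final Map<String, Deque<Long>> cache = new HashMap<>();
    private final Function<String, List<Long>> aggregator;

    ReadingCache(Function<String, List<Long>> aggregator) {
        this.aggregator = aggregator;
    }

    /** Serves one page from the cache, running aggregation only when too few IDs remain. */
    List<Long> nextPage(String userId, int pageSize) {
        Deque<Long> queue = cache.computeIfAbsent(userId, k -> new ArrayDeque<>());
        if (queue.size() < pageSize) {
            queue.addAll(aggregator.apply(userId)); // one run yields several pages
        }
        List<Long> page = new ArrayList<>(pageSize);
        while (page.size() < pageSize && !queue.isEmpty()) {
            page.add(queue.poll());
        }
        return page;
    }
}
```

With an aggregator that returns six IDs per run and a page size of three, two consecutive requests are served by a single aggregation run, and the third request triggers the next run.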
Data Capture
Previously, each data source required a separate background module and Quartz configuration, lacking monitoring and visual management. The new framework introduces a dedicated data‑capture project (ulike) with backend management for configurable tasks.
Mgr: backend UI for managing data‑source and task configurations, viewing captured data and monitoring.
MySQL: stores Mgr configuration data.
Scheduler: handles task scheduling.
Redis: publishes commands to Processor for task execution.
Processor: processes capture commands and business logic.
Engine: parses source data to extract required fields.
Recommendation Data Query Optimization
Convert multiple Redis commands into pipeline mode.
Cache multiple pages of recommendation results in a single calculation.
Use the iterator pattern for tag index access, resetting the cursor after prolonged continuous access to ensure fresh data.
Introduce multithreaded asynchronous computation.
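The last item, multithreaded asynchronous computation, can be sketched with the JDK's CompletableFuture: each candidate source is computed on a thread pool and the results are merged in source order. The class name and the idea of modelling each source as a Supplier are assumptions; the article does not describe how its computation is split.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutorService;
import java.util.function.Supplier;

/** Hypothetical fan-out/fan-in helper for recommendation candidates. */
final class AsyncAggregator {
    /** Runs every source concurrently on the pool, then merges results in source order. */
    static List<Long> aggregate(List<Supplier<List<Long>>> sources, ExecutorService pool) {
        List<CompletableFuture<List<Long>>> futures = new ArrayList<>();
        for (Supplier<List<Long>> source : sources) {
            futures.add(CompletableFuture.supplyAsync(source, pool));
        }
        List<Long> merged = new ArrayList<>();
        for (CompletableFuture<List<Long>> future : futures) {
            merged.addAll(future.join()); // waits for this source; the others keep running
        }
        return merged;
    }
}
```

Total latency then approaches that of the slowest source rather than the sum of all sources, provided the pool has enough threads.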
Afterword
The value of headline information to users hinges on the recommendation algorithm. Continuous big‑data analysis drives algorithm adjustments and aggregation optimizations to deliver the best user experience.
21CTO
