How to Build a Real‑Time Price Update System for Billion‑Item E‑Commerce
This article explains the design of a distributed, real‑time price‑update service that handles massive product data, combines query‑driven crawling, observer‑pattern notifications, and multiple data sources to keep e‑commerce price and inventory information fresh within minutes.
1. What is the requirement?
Many internet applications depend on data that is updated in real time, such as search results or product prices and inventory. When the data volume is huge, how can updates stay both stable and timely? This article uses price updates in Youdao shopping search (Huihui, 惠惠网) as an example to describe a server‑side design for a real‑time data update system.
1.1 Pain point one: big data
Both web search and shopping search must handle billions, or even hundreds of billions, of items, far more than a single machine can sustain. This scale strains both system scalability and operational stability.
1.2 Pain point two: real‑time
Batch updates are insufficient when users need the latest information within minutes, such as price and stock for hot items. Real‑time updates are essential.
In the product‑page screenshots, the real‑time price is circled in red, the price trend in blue, and the stock status in yellow. The other screenshots show price history and real‑time competitor prices.
Figure 1: Price service data snapshot on the search detail page
Figure 2: Price history screenshot of a Sony HX10 camera
Figure 3: Shopping assistant display example
Traditional search engines refresh their data through scheduled batch crawls, which introduces unacceptable latency for prices. In e‑commerce, thousands of price changes occur every minute, so batch crawling alone cannot keep up. A dedicated price server was therefore built to provide minute‑level updates.
2. Overall Architecture Design
The price server combines distributed and real‑time characteristics.
2.1 Distributed characteristic
To handle the rapid growth of product numbers, a centralized architecture is insufficient. The system adopts a distributed model with three services: task scheduler, crawler, and result writer (PriceServer, PriceUpdateServer, PriceWriteBackServer). They communicate via RPC, and each service can be scaled horizontally. The distributed design improves performance, scalability, and fault tolerance.
Figure 4: Hierarchical structure of the price server
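The article does not describe how crawl tasks are spread across crawler instances once the service is scaled out. One common option, shown here purely as an assumption, is to partition URLs with a stable hash. The function names `pick_worker` and `partition` and the worker labels are hypothetical.

```python
import hashlib
from typing import Dict, List


def pick_worker(url: str, workers: List[str]) -> str:
    """Map a URL to one worker using a hash that is stable across processes."""
    digest = hashlib.md5(url.encode("utf-8")).digest()
    index = int.from_bytes(digest[:8], "big") % len(workers)
    return workers[index]


def partition(urls: List[str], workers: List[str]) -> Dict[str, List[str]]:
    """Group crawl tasks by the worker that should fetch them."""
    assignment: Dict[str, List[str]] = {w: [] for w in workers}
    for url in urls:
        assignment[pick_worker(url, workers)].append(url)
    return assignment
```

With plain modulo hashing, adding or removing a worker remaps most URLs. If that churn matters, consistent hashing is the usual alternative.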
2.2 Real‑time characteristic
Real‑time is reflected in two aspects: immediate data updates and immediate propagation of changes to other services.
2.2.1 Query‑driven real‑time crawling
Because crawling capacity is limited, the system prioritizes hot items identified from user browsing and search logs. The workflow is: user browses an item → shopping assistant sends the item URL list to the task scheduler → scheduler assigns a crawl task → crawler fetches the latest price → result writer updates the database and notifies listeners. This strategy is called Query Triggered Crawling (QTC).
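The scheduling step of this workflow can be sketched as follows. This is a minimal, in‑memory illustration rather than the actual PriceServer logic: the class name, the per‑URL recrawl interval, and the use of view counts as the priority are all assumptions.

```python
import heapq
from collections import Counter


class QueryTriggeredScheduler:
    """Toy task scheduler: URLs that users view more often are crawled first."""

    def __init__(self, min_interval_s=60.0):
        self.min_interval_s = min_interval_s  # assumed minimum gap between crawls of one URL
        self.hits = Counter()                 # url -> views since its last crawl
        self.last_crawled = {}                # url -> time of its last crawl

    def report_views(self, urls):
        """Called when the shopping assistant reports URLs a user browsed."""
        self.hits.update(urls)

    def next_tasks(self, now, limit):
        """Return up to `limit` URLs, hottest first, skipping recently crawled ones."""
        due = [
            (count, url)
            for url, count in self.hits.items()
            if now - self.last_crawled.get(url, float("-inf")) >= self.min_interval_s
        ]
        batch = [url for _, url in heapq.nlargest(limit, due)]
        for url in batch:
            self.last_crawled[url] = now
            del self.hits[url]  # reset the counter once the URL is scheduled
        return batch
```

A URL that was crawled recently stays in the counter until its interval expires, so sustained user interest is not lost while the limited crawl budget goes to other hot items.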
2.2.2 Real‑time data change notification
The price server must push price change events to downstream services (price‑drop alerts, global price comparison, etc.). It uses the Observer pattern (publish/subscribe) to achieve low‑coupling, high‑cohesion event delivery.
Two mature systems that use a similar pattern are Google Caffeine and LinkedIn Databus.
Figure 5: Google Caffeine architecture (Percolator observer)
Google Caffeine uses Percolator, which is built on Bigtable, to trigger observers when specific columns change, avoiding the long latency of MapReduce batch updates.
Figure 6: LinkedIn Databus data‑transfer bus
Databus captures database update logs and streams the resulting change events to consumers with millisecond‑level latency. Each consumer can subscribe to only the events it needs.
Inspired by these designs, the price server’s result writer detects price changes, updates the database, and publishes events via the Observer pattern to services such as price‑drop reminders and global price comparison.
Figure 7: Data flow from result writer to other services
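The publish step described above can be illustrated with a small Observer‑pattern sketch. It is not the PriceWriteBackServer implementation: the in‑memory dictionary stands in for the database, and `ResultWriter`, `PriceChange`, and `make_price_drop_alert` are hypothetical names chosen for the example.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Optional


@dataclass(frozen=True)
class PriceChange:
    url: str
    old_price: Optional[float]  # None when the item is seen for the first time
    new_price: float


class ResultWriter:
    """Stores the latest price and notifies subscribers only when it changes."""

    def __init__(self) -> None:
        self._prices: Dict[str, float] = {}  # stand-in for the real database
        self._observers: List[Callable[[PriceChange], None]] = []

    def subscribe(self, callback: Callable[[PriceChange], None]) -> None:
        self._observers.append(callback)

    def write(self, url: str, price: float) -> bool:
        """Persist a crawled price; return True if it differed from the stored one."""
        old = self._prices.get(url)
        if old == price:
            return False  # unchanged price: nothing to store or publish
        self._prices[url] = price
        event = PriceChange(url, old, price)
        for notify in self._observers:
            notify(event)
        return True


def make_price_drop_alert(sink: List[PriceChange]) -> Callable[[PriceChange], None]:
    """Example subscriber: collects events where the price went down."""
    def on_change(event: PriceChange) -> None:
        if event.old_price is not None and event.new_price < event.old_price:
            sink.append(event)
    return on_change
```

The writer knows nothing about what its subscribers do with an event, so a downstream service such as price comparison could be added by calling `subscribe` without touching the write path.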
The result writer also logs all price updates for offline analysis, enabling price‑history reports and e‑commerce analytics.
3. Data Sources
The price server obtains data from three sources: (1) real‑time crawling by the crawler service, (2) proactive data feeds from partner e‑commerce platforms, and (3) scheduled batch crawling and parsing.
Figure 8: Sources of product price data
Real‑time crawling provides the freshest data for hot items, while batch crawling and partner feeds improve coverage for the majority of products. Additional triggers, such as user‑subscribed price‑drop lists and global price‑comparison updates, also feed the system.
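The article does not say how records from the three sources are reconciled when they disagree. One simple policy, assumed here only for illustration, is to keep the most recently observed price per item. The `PriceRecord` fields and source labels are hypothetical.

```python
from dataclasses import dataclass
from typing import Dict, Iterable


@dataclass(frozen=True)
class PriceRecord:
    url: str
    price: float
    observed_at: float  # Unix timestamp of when the price was seen
    source: str         # assumed labels: "realtime", "feed", or "batch"


def merge_freshest(records: Iterable[PriceRecord]) -> Dict[str, PriceRecord]:
    """Keep, for each URL, the record with the latest observation time."""
    latest: Dict[str, PriceRecord] = {}
    for rec in records:
        current = latest.get(rec.url)
        if current is None or rec.observed_at > current.observed_at:
            latest[rec.url] = rec
    return latest
```

Under this policy a real‑time crawl naturally overrides an older batch result for a hot item, while items that only appear in batch crawls or partner feeds still get a price.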
4. Conclusion
The article presented a real‑time big‑data update framework using a price server as a case study, drawing on Google Caffeine and LinkedIn Databus. The common design points are the Observer pattern and fine‑grained data propagation, which overcome the latency of traditional MapReduce batch updates and address both big‑data scale and real‑time requirements.
21CTO