
How to Build a Real‑Time Price Update System for Billion‑Item E‑Commerce

This article explains the design of a distributed, real‑time price‑update service that handles massive product data, combines query‑driven crawling, observer‑pattern notifications, and multiple data sources to keep e‑commerce price and inventory information fresh within minutes.

21CTO

Dec 1, 2015


1. What is the requirement?

Many internet applications need real‑time data updates: search results, product prices, and inventory levels are typical examples. When the data volume is huge, how can updates stay both stable and timely? This article uses price updates in Youdao shopping search (惠惠网) as an example to describe a server‑side design for a real‑time data update system.

1.1 Pain point one: big data

Both web search and shopping search must handle hundreds of millions or even billions of items, which a single machine cannot sustain. At this scale, both system scalability and operational stability become challenges.

1.2 Pain point two: real‑time

Batch updates are insufficient when users need the latest information within minutes, such as price and stock for hot items. Real‑time updates are essential.

The product page screenshots below show where this data surfaces: the real‑time price (red ellipse), the price trend (blue ellipse), and the stock status (yellow ellipse). The other screenshots show a price history and real‑time competitor prices.

Figure 1: Price service data snapshot on the search detail page

Figure 2: Price history screenshot of a Sony HX10 camera

Figure 3: Shopping assistant display example

Traditional search engines update data via scheduled batch crawls, causing unacceptable latency. In e‑commerce, thousands of price changes occur per minute, so batch crawling cannot keep up. A dedicated price server is built to provide minute‑level updates.

2. Overall Architecture Design

The price server combines distributed and real‑time characteristics.

2.1 Distributed characteristic

A centralized architecture cannot keep up with the rapid growth in the number of products. The system therefore adopts a distributed model with three services: a task scheduler, a crawler, and a result writer (PriceServer, PriceUpdateServer, and PriceWriteBackServer, respectively). They communicate via RPC, and each service can be scaled horizontally. The distributed design improves performance, scalability, and fault tolerance.

Figure 4: Hierarchical structure of the price server
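
The article does not say how crawl tasks are spread across crawler instances. As a minimal sketch of one common approach, the function below maps each product URL to a crawler index with a stable hash, so the same URL always lands on the same instance. The name `pick_crawler` and the choice of SHA‑256 are assumptions made for illustration, not part of the original system.

```python
import hashlib


def pick_crawler(url: str, crawler_count: int) -> int:
    """Map a product URL to a crawler instance index in a stable way.

    Hypothetical helper: the original system's routing rule is not documented.
    """
    if crawler_count <= 0:
        raise ValueError("crawler_count must be positive")
    # A cryptographic digest is stable across processes and machines,
    # unlike Python's built-in hash(), which is salted per process for strings.
    digest = hashlib.sha256(url.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % crawler_count
```

Plain modulo routing reassigns most URLs whenever the number of instances changes; a consistent‑hashing ring is the usual alternative when instances are added or removed often.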

2.2 Real‑time characteristic

The real‑time requirement shows up in two places: data must be updated immediately, and changes must propagate immediately to other services.

2.2.1 Query‑driven real‑time crawling

Because crawling capacity is limited, the system prioritizes hot items identified from user browsing and search logs. The workflow is: user browses an item → shopping assistant sends the item URL list to the task scheduler → scheduler assigns a crawl task → crawler fetches the latest price → result writer updates the database and notifies listeners. This strategy is called Query Triggered Crawling (QTC).
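
The scheduling step of this workflow can be sketched as follows. This is a toy, in‑memory illustration under stated assumptions, not the scheduler the article describes: the class name `QueryTriggeredScheduler`, the view‑count priority, and the five‑minute minimum recrawl interval are all hypothetical choices.

```python
import heapq


class QueryTriggeredScheduler:
    """Toy task scheduler: the most-viewed URLs are crawled first,
    and a URL dispatched recently is held back until it is due again."""

    def __init__(self, min_recrawl_seconds: int = 300) -> None:
        self.min_recrawl_seconds = min_recrawl_seconds  # assumed freshness window
        self.pending_views = {}    # url -> views reported since last dispatch
        self.last_dispatched = {}  # url -> time the URL was last handed out

    def report_views(self, urls) -> None:
        """Called with the URL list sent by the shopping assistant."""
        for url in urls:
            self.pending_views[url] = self.pending_views.get(url, 0) + 1

    def next_tasks(self, now: float, limit: int) -> list:
        """Return up to `limit` due URLs, hottest first."""
        due = [
            (views, url)
            for url, views in self.pending_views.items()
            if now - self.last_dispatched.get(url, float("-inf"))
            >= self.min_recrawl_seconds
        ]
        tasks = [url for _, url in heapq.nlargest(limit, due)]
        for url in tasks:
            del self.pending_views[url]
            self.last_dispatched[url] = now
        return tasks
```

URLs that are viewed again before the interval has passed stay in the pending set and are dispatched once they become due, so limited crawl capacity is spent on items that are both popular and stale.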

2.2.2 Real‑time data change notification

The price server must push price change events to downstream services (price‑drop alerts, global price comparison, etc.). It uses the Observer pattern (publish/subscribe) to achieve low‑coupling, high‑cohesion event delivery.

Two mature systems that use a similar pattern are Google Caffeine and LinkedIn Databus.

Figure 5: Google Caffeine architecture (Percolator observer)

Google Caffeine uses Percolator, which is built on Bigtable, to trigger observers when specific columns change. This avoids the long latency of MapReduce batch updates.

Figure 6: LinkedIn Databus data‑transfer bus

Databus captures database update logs and streams change events to consumers with millisecond‑level latency, and each consumer can select only the events it needs.

Inspired by these designs, the price server’s result writer detects price changes, updates the database, and publishes events via the Observer pattern to services such as price‑drop reminders and global price comparison.

Figure 7: Data flow from result writer to other services

The result writer also logs all price updates for offline analysis, enabling price‑history reports and e‑commerce analytics.
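
The behavior described in this section (detect a change, update the store, log the update, notify subscribers) can be sketched in a few lines. This is a single‑process illustration with assumed names (`ResultWriter`, `PriceChange`, `PriceDropAlerts`) and an in‑memory dictionary standing in for the database; the real services communicate over RPC, which is not modeled here.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple


@dataclass(frozen=True)
class PriceChange:
    url: str
    old_cents: int
    new_cents: int
    observed_at: int  # Unix seconds


class ResultWriter:
    """Subject in the Observer pattern: stores prices and publishes changes."""

    def __init__(self) -> None:
        self.prices: Dict[str, int] = {}                  # stand-in for the database
        self.update_log: List[Tuple[int, str, int]] = []  # stand-in for the offline log
        self.listeners: List[Callable[[PriceChange], None]] = []

    def subscribe(self, listener: Callable[[PriceChange], None]) -> None:
        self.listeners.append(listener)

    def write(self, url: str, price_cents: int, observed_at: int) -> None:
        old = self.prices.get(url)
        self.prices[url] = price_cents
        # Every update is logged, whether or not the price changed.
        self.update_log.append((observed_at, url, price_cents))
        # Only a real change to a known item is published to listeners.
        if old is not None and old != price_cents:
            event = PriceChange(url, old, price_cents, observed_at)
            for listener in self.listeners:
                listener(event)


class PriceDropAlerts:
    """Example observer: keeps only the events where the price went down."""

    def __init__(self) -> None:
        self.drops: List[PriceChange] = []

    def on_change(self, event: PriceChange) -> None:
        if event.new_cents < event.old_cents:
            self.drops.append(event)
```

The writer knows nothing about what each listener does with an event, which is the low coupling the Observer pattern is meant to provide: a new downstream service only needs to call `subscribe`.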

3. Data Sources

The price server obtains data from three sources: (1) real‑time crawling by the crawler service, (2) proactive data feeds from partner e‑commerce platforms, and (3) scheduled batch crawling and parsing.

Figure 8: Sources of product price data

Real‑time crawling provides the freshest data for hot items, while batch crawling and partner feeds improve coverage for the majority of products. Additional triggers, such as user‑subscribed price‑drop lists and global price‑comparison updates, also feed the system.
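
The article does not describe how values from the three sources are reconciled when they disagree. One plausible rule, shown here purely as an assumption, is "newest observation wins": each record carries the time the price was seen, and the most recent record per URL is kept regardless of source. The `Observation` type and the source labels are hypothetical.

```python
from dataclasses import dataclass
from typing import Dict, Iterable


@dataclass(frozen=True)
class Observation:
    url: str
    price_cents: int
    observed_at: int  # Unix seconds when the price was seen
    source: str       # "realtime", "feed", or "batch" (labels assumed here)


def merge_newest(observations: Iterable[Observation]) -> Dict[str, Observation]:
    """Keep the most recently observed record for each URL."""
    latest: Dict[str, Observation] = {}
    for obs in observations:
        current = latest.get(obs.url)
        # On a timestamp tie, the record seen first in the input is kept.
        if current is None or obs.observed_at > current.observed_at:
            latest[obs.url] = obs
    return latest
```

Under this rule a hot item is normally served from its real‑time crawl, while long‑tail items fall back to whatever the partner feed or batch crawl last reported.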

4. Conclusion

This article presented a real‑time big‑data update framework using a price server as a case study, drawing on Google Caffeine and LinkedIn Databus. What these designs share is the Observer pattern and fine‑grained propagation of data changes, which together overcome the latency of traditional MapReduce batch updates and address both big‑data scale and real‑time requirements.
