How Causal Inference Meets Large Language Models to Revolutionize E‑commerce Pricing

This article summarizes a QCon talk on combining causal inference with large language models to build a retrieval‑augmented generation pricing system for e‑commerce, detailing the three‑step pricing algorithm, the challenges of LLM‑based modeling, process‑reward tree search, reinforcement‑learning fine‑tuning, and the resulting gains in accuracy and speed.

JD Tech

Introduction

The author presented a talk titled “Causal Inference Meets Large Models: Transforming E‑commerce Pricing Strategies” at the QCon Global Software Development Conference hosted by InfoQ. The presentation explains how large‑model techniques can address e‑commerce pricing challenges, optimize product pricing, and improve decision‑making precision.

Three‑Step Pricing Algorithm

To mimic consumer behavior, the system generates reasonable price suggestions for a target product based on similar‑item prices. The workflow consists of:

Input the description of the product whose price needs to be evaluated.

Retrieve from the database a set of similar products and their prices.

Produce a price suggestion for the target product and output the reasoning logic.

This capability is already deployed in the internal new‑product price‑review pipeline, where suppliers submit thousands of new items daily for review by the procurement team. Automating this process dramatically reduces manual review costs.
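To make the workflow concrete, here is a minimal, self‑contained sketch of the three steps in Python. The product pool, the string‑matching similarity, and the median heuristic in step 3 are toy stand‑ins introduced for illustration; in the actual system, step 3 is performed by the trained model, which also emits the reasoning text.

from difflib import SequenceMatcher
from statistics import median

# Toy in-memory product pool standing in for the real similar-item database.
PRODUCT_POOL = [
    {"title": "organic walnuts 500g bag", "price": 35.0},
    {"title": "organic walnuts 150g box", "price": 12.0},
    {"title": "roasted cashews 500g bag", "price": 42.0},
]

def retrieve_similar(description, top_k=2):
    # Step 2: pull the most similar products and their prices from the pool.
    return sorted(
        PRODUCT_POOL,
        key=lambda p: SequenceMatcher(None, description, p["title"]).ratio(),
        reverse=True,
    )[:top_k]

def suggest_price(description):
    # Steps 1-3: take the target description, retrieve comparables, and return
    # a price suggestion plus a human-readable rationale.
    similar = retrieve_similar(description)            # step 2
    price = median(p["price"] for p in similar)        # step 3 (toy heuristic)
    rationale = "referenced: " + ", ".join(p["title"] for p in similar)
    return price, rationale

print(suggest_price("organic walnuts 250g bag"))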

LLM‑Based Modeling Method

During modeling, three main difficulties were encountered:

Full‑category coverage: Hundreds of categories exist, each with distinct price‑comparison logic (e.g., unit‑price conversion, material‑based price impact).

Complex product information: Sellers use bundles, gifts, or special SKUs, increasing comparison difficulty.

Explainability: The pricing process must clearly state which similar items were referenced and why.

Large language models (LLMs) provide a new solution:

Rich domain knowledge enables handling diverse category‑specific comparison logic.

Strong comprehension of complex product descriptions.

Unlike traditional machine‑learning models, LLMs can output both price predictions and explanatory text.

RAG Architecture

The retrieval‑augmented generation (RAG) pipeline is designed as follows:

[Figure: RAG pricing workflow diagram]

Retriever: Retrieves the most similar competitor products from the product pool using text similarity and embedding similarity, then feeds them as prompts to the generation model.

Generator: Uses a reasoning model to derive the target product's price from the similar‑product prices, improving accuracy and explainability.

Reinforcement‑Learning Reward Design: Constructs rewards from three aspects: (a) pricing error (difference between model price and actual transaction price), (b) price gap among similar items (aiming for minimal variance), and (c) attribute extraction accuracy.
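As a rough illustration of how these three reward aspects could be combined, the sketch below uses simple normalized scores and hand‑picked weights; the scaling and the weights are assumptions for illustration, not values given in the talk.

from statistics import mean, pstdev

def rl_reward(pred_price, actual_price, similar_unit_prices,
              extracted_attrs, true_attrs, weights=(0.5, 0.25, 0.25)):
    # (a) Pricing error: relative gap between the model price and the
    # actual transaction price (smaller gap -> larger reward).
    r_price = max(0.0, 1.0 - abs(pred_price - actual_price) / max(actual_price, 1e-6))
    # (b) Price gap among similar items: reward low dispersion
    # (coefficient of variation) of the retrieved unit prices.
    cv = pstdev(similar_unit_prices) / max(mean(similar_unit_prices), 1e-6)
    r_gap = max(0.0, 1.0 - cv)
    # (c) Attribute extraction accuracy: fraction of product attributes
    # (unit, count, material, ...) recovered correctly.
    r_attr = sum(extracted_attrs.get(k) == v for k, v in true_attrs.items()) / max(len(true_attrs), 1)
    return weights[0] * r_price + weights[1] * r_gap + weights[2] * r_attr

# Example call with illustrative values.
print(rl_reward(18.0, 20.0, [0.07, 0.08],
                {"unit": "g", "count": 250}, {"unit": "g", "count": 250}))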

Process Reward and Tree Search

When training the reasoning model, the Chain‑of‑Thought (CoT) cannot rely on manually labeled intermediate steps; the model must generate them autonomously. Pure exploration yields many low‑quality attempts, while pure exploitation keeps the model stuck in its existing reasoning patterns. Combining process rewards with a tree‑search mechanism lets the model explore new reasoning paths while effectively leveraging what it has already learned, significantly boosting reasoning correctness and training efficiency.
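A minimal sketch of the idea: at every reasoning step the model proposes several candidate continuations, a process reward scores each partial chain, and only the best‑scoring branches are expanded further. The sampler, the scoring function, and the beam width below are placeholders, not the production components.

import random

def propose_candidates(partial_cot, n_candidates=4):
    # Placeholder sampler: in practice the LLM proposes n candidate
    # continuations of the current chain of thought.
    return [partial_cot + [f"step{len(partial_cot) + 1}-v{i}"] for i in range(n_candidates)]

def process_reward(partial_cot):
    # Placeholder process-reward model scoring a partial reasoning chain.
    return random.random()

def guided_tree_search(n_steps=3, keep=2):
    # Level-by-level expansion with process-reward pruning: explore several
    # candidates per step, but only keep the highest-scoring branches,
    # balancing exploration of new paths with exploitation of learned knowledge.
    frontier = [[]]  # start from an empty chain of thought
    for _ in range(n_steps):
        candidates = [c for chain in frontier for c in propose_candidates(chain)]
        candidates.sort(key=process_reward, reverse=True)
        frontier = candidates[:keep]
    return frontier

print(guided_tree_search())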

Implementation Details

The price calculation proceeds in three steps:

Convert unit prices to a common basis (e.g., yuan per jin) and compute a reward based on the coefficient of variation among similar items.

Sort prices, simplifying the calculation to a ranking problem; reward depends on the discrepancy between the model’s ranking and the ground‑truth ranking.

Compute the final price; reward is based on the deviation between the predicted price and the actual price.

[Figure: Step‑wise price calculation diagram]
# Step 1: Calculate unit price -- convert every similar item to a common basis
unit = "jin"
unit_count = {"B7": 150, "B1": 500}  # quantity per SKU; further items elided (...)

# Step 2: Compute ranking -- sort the similar items by unit price
order = ["B7", "B1"]  # further items elided (...)

# Step 3: Compute final price for the target product
price = 0.04  # example estimated price per gram
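Tying rewards to these three structured outputs, a rough sketch is given below; the quantities being rewarded (coefficient of variation, rank agreement, relative price error) follow the description above, while the 1‑minus‑error scaling is an assumption. Steps 1 and 3 mirror the price‑gap and pricing‑error terms sketched earlier.

from statistics import mean, pstdev

def step1_reward(unit_prices):
    # Step 1: lower coefficient of variation among the similar items'
    # unit prices -> higher reward.
    cv = pstdev(unit_prices) / max(mean(unit_prices), 1e-6)
    return max(0.0, 1.0 - cv)

def step2_reward(pred_order, true_order):
    # Step 2: reward the agreement between the model's ranking and the
    # ground-truth ranking (fraction of positions that match).
    hits = sum(p == t for p, t in zip(pred_order, true_order))
    return hits / max(len(true_order), 1)

def step3_reward(pred_price, actual_price):
    # Step 3: reward shrinks with the relative deviation of the predicted
    # price from the actual transaction price.
    return max(0.0, 1.0 - abs(pred_price - actual_price) / max(actual_price, 1e-6))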

Pre‑training

Because the CoT is generated according to a specific template, a prompt is first designed to produce a large batch of samples that strictly follow the CoT structure. These samples are then used for supervised fine‑tuning (SFT) of the base model, ensuring that the model outputs the expected CoT format and improving the soundness and accuracy of each of the three reasoning steps.
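For illustration, a template‑constrained SFT sample might look like the following; the field names, wording, and specific items are hypothetical; only the three‑step structure comes from the text.

# Hypothetical prompt template forcing generations to follow the three-step CoT.
COT_TEMPLATE = (
    "Step 1: convert every similar item to a common unit price.\n"
    "Step 2: rank the unit prices from low to high.\n"
    "Step 3: derive the final price of the target product and explain the choice."
)

# One SFT training example built from that template (illustrative values).
sft_sample = {
    "prompt": (
        "Target product: organic walnuts 250g bag\n"
        "Similar items: B7 (150g, 12 yuan), B1 (500g, 35 yuan)\n"
        + COT_TEMPLATE
    ),
    "completion": (
        "Step 1: B7 = 0.08 yuan/g, B1 = 0.07 yuan/g\n"
        "Step 2: order = [B1, B7]\n"
        "Step 3: about 0.075 yuan/g, so 250g is priced at 18.75 yuan, "
        "referencing B1 and B7"
    ),
}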

Reinforcement Learning

A breadth‑first‑search (BFS) strategy is employed to expand and collect CoT samples. At each reasoning step, the model generates multiple candidate continuations, and the process reward selects the high‑quality candidates to expand at the next step. PPO is used as the RL algorithm, with the reward function modified to incorporate the process rewards. Under standard PPO, the reward signal reaching early but critical tokens decays quickly; the design therefore adds forward‑looking rewards to early tokens (a step‑2 term α^k·r₂ and a step‑3 term β^(k+t)·r₃), ensuring they receive sufficient signal.

The final reward formula is illustrated below:

[Figure: Reward calculation diagram]
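Since the diagram itself is not reproduced here, the sketch below shows one way the forward‑looking terms could be attached to token‑level rewards: each token before a step boundary receives that step's reward discounted by its distance to the boundary. The exponent form and the values of alpha and beta are assumptions consistent with the description above, not the exact formula from the talk.

def token_rewards(n_tokens, step2_end, step3_end, r2, r3, alpha=0.9, beta=0.9):
    # Tokens up to the end of step 2 receive alpha**k * r2, where k is the
    # distance to the step-2 boundary; tokens up to the end of step 3 receive
    # beta**m * r3 analogously, so early but critical tokens keep a usable signal.
    rewards = [0.0] * n_tokens
    for i in range(n_tokens):
        if i <= step2_end:
            rewards[i] += (alpha ** (step2_end - i)) * r2
        if i <= step3_end:
            rewards[i] += (beta ** (step3_end - i)) * r3
    return rewards

# Example: a 10-token rollout where step 2 ends at token 4 and step 3 at token 9.
print(token_rewards(10, step2_end=4, step3_end=9, r2=0.8, r3=1.0))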

Experimental Results

Accuracy: Traditional deep‑learning models tailored for a few categories (e.g., stationery) achieved only 44% accuracy on a random sample of common third‑level categories. The proposed method raised overall accuracy to 74%.

Speed: Prompt engineering with strong general‑purpose reasoning models can achieve similar accuracy, but each inference takes more than ten minutes and often gets stuck in loops. After training, the 7B open‑source base model performs inference on a single GPU in a few seconds.

Future Optimizations

End‑to‑end RAG + LLM joint training is planned to allow the retrieval stage to be informed by final pricing outcomes, enabling “explore‑exploit” strategies during similarity retrieval.

Adversarial learning will be introduced to automatically select primary items from the candidate pool, focusing training on poorly performing categories and improving both retrieval quality and overall pricing performance.

Related Work

DeepSeek‑R1 (2025) explores incentivizing reasoning capability in LLMs via reinforcement learning (arXiv:2501.12948). Silver et al. (2016) demonstrate AlphaGo's combination of deep neural networks and tree search. Zhang et al. (2024) propose ReST‑MCTS, a self‑training LLM framework guided by process rewards and tree search (arXiv:2406.03816). Lightman et al. (2023) introduce step‑by‑step verification (arXiv:2305.20050). The PRM800K dataset (OpenAI) provides 800K labeled CoT steps for process‑reward modeling, showing significant accuracy gains on the MATH benchmark.

Tags: large language models, retrieval-augmented generation, reinforcement learning, causal inference, e‑commerce pricing