How Causal Inference Meets Large Language Models to Revolutionize E‑commerce Pricing

At QCon 2025, the author presented an approach that integrates causal inference with large language models, combining Retrieval‑Augmented Generation, process rewards, and tree search to generate accurate, explainable e‑commerce pricing recommendations. The method raised accuracy from 44% to 74% while cutting inference time to a few seconds.


Introduction

During QCon 2025, the author delivered a talk titled “Causal Inference and Large‑Model Fusion: Transforming E‑commerce Pricing Strategies”, describing how large‑model techniques can address pricing challenges and improve decision‑making accuracy, and inviting discussion from practitioners.

Pricing Workflow

The proposed workflow consists of three steps: (1) input product description, (2) retrieve similar products and their prices from a database, (3) generate a price suggestion with explanatory logic. This capability is already used for self‑operated new‑product price review, reducing manual audit cost.
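
As a concrete illustration, here is a minimal, self‑contained sketch of that three‑step loop. The Product type, the token‑overlap similarity standing in for embedding recall, and the prompt wording are all illustrative assumptions, not the production system:

from dataclasses import dataclass

# Illustrative stand-ins for the production retriever and generator;
# the data and prompt wording are assumptions, not the real system.

@dataclass
class Product:
    title: str
    price: float

def similarity(a: str, b: str) -> float:
    # Crude token-overlap similarity standing in for embedding recall.
    ta, tb = set(a.lower().split()), set(b.lower().split())
    return len(ta & tb) / max(len(ta | tb), 1)

def build_pricing_prompt(description: str, pool: list[Product], top_k: int = 3) -> str:
    # Step 2: recall the most similar products and their prices.
    neighbors = sorted(pool, key=lambda p: similarity(description, p.title),
                       reverse=True)[:top_k]
    # Step 3: hand them to the generator; the LLM call itself is omitted.
    return (
        f"Target product: {description}\n"
        "Reference products and prices:\n"
        + "\n".join(f"- {p.title}: {p.price:.2f}" for p in neighbors)
        + "\nSuggest a price and explain the comparison logic."
    )

pool = [Product("organic rice 5kg", 45.0),
        Product("jasmine rice 10kg", 82.0),
        Product("rice cooker", 199.0)]
print(build_pricing_prompt("organic jasmine rice 5kg", pool))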

Modeling with Large Language Models

Key difficulties include covering hundreds of categories with diverse price‑comparison logic, handling complex product information (bundles, gifts, special models), and providing interpretable reasoning. Large language models address these by possessing rich domain knowledge, understanding complex items, and offering explanations beyond traditional machine‑learning predictions.

We adopt a Retrieval‑Augmented Generation (RAG) architecture. The process includes:

Retriever: recall the most similar competing products from the product pool using text similarity and embeddings, and feed them to the generator as part of the prompt.

Generator: infer the target product price based on similar‑product prices, improving accuracy and interpretability.

Reinforcement‑learning reward design: three reward components—price error, price gap among similar items, and attribute extraction accuracy.
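
One plausible way to combine those three components into a single scalar reward is a weighted sum; the weights and per‑component formulas below are assumptions for illustration, not the published design:

# Assumed weights and formulas; only the three component names come
# from the article.

def pricing_reward(pred_price: float, true_price: float,
                   similar_prices: list[float],
                   pred_attrs: dict, true_attrs: dict,
                   weights=(0.5, 0.25, 0.25)) -> float:
    # 1) Price error: smaller relative error earns a larger reward.
    price_r = max(0.0, 1.0 - abs(pred_price - true_price) / true_price)

    # 2) Price gap among similar items: reward predictions that stay
    #    inside the band spanned by the retrieved reference prices.
    gap_r = 1.0 if min(similar_prices) <= pred_price <= max(similar_prices) else 0.0

    # 3) Attribute extraction accuracy: fraction of attributes matched.
    attr_r = sum(pred_attrs.get(k) == v for k, v in true_attrs.items()) / len(true_attrs)

    w1, w2, w3 = weights
    return w1 * price_r + w2 * gap_r + w3 * attr_r

print(pricing_reward(19.9, 21.0, [18.0, 20.5, 24.9],
                     {"brand": "A", "size": "500g"},
                     {"brand": "A", "size": "500g"}))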

Process Reward and Tree‑Search Optimization

During CoT model training, pure exploration yields many low‑quality attempts, while pure exploitation limits creativity. Combining process rewards with a tree‑search mechanism lets the model explore new reasoning paths while efficiently exploiting what it has already learned, improving both the rate of correct reasoning and training efficiency.
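
A toy sketch of this balance, assuming a beam‑style tree search in which a process‑reward model scores partial reasoning paths; propose_step and process_reward are random stand‑ins for the real sampler and reward model:

import random

def propose_step(path: list[str]) -> str:
    # Stand-in for sampling one more reasoning step from the model.
    return f"step{len(path) + 1}-v{random.randint(0, 9)}"

def process_reward(path: list[str]) -> float:
    # Stand-in for the learned step-level (process) reward model.
    return random.random()

def guided_search(depth: int = 3, branch: int = 4, beam: int = 2) -> list[str]:
    frontier: list[list[str]] = [[]]
    for _ in range(depth):
        # Explore: sample several candidate continuations of each path.
        candidates = [path + [propose_step(path)]
                      for path in frontier for _ in range(branch)]
        # Exploit: let the process reward keep only the best paths.
        frontier = sorted(candidates, key=process_reward, reverse=True)[:beam]
    return frontier[0]

print(guided_search())

Widening branch shifts the balance toward exploration, while shrinking beam shifts it toward exploitation.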

Implementation Details

Three steps are used to compute rewards (a sketch follows this list):

Step 1: convert all reference prices to unit prices and compute a reward from the coefficient of variation among similar items.

Step 2: sort the prices and reward agreement between the model’s predicted ranking and the true ranking.

Step 3: reward the closeness of the estimated price to the actual price.
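
The functions below are one plausible reading of those three rewards, not the production formulas; each maps its step’s description to a bounded score:

from statistics import mean, stdev

# One plausible reading of the three step rewards; the exact formulas
# are assumptions reconstructed from the descriptions above.

def step1_reward(unit_prices: list[float]) -> float:
    # Coefficient of variation of the converted unit prices; lower
    # dispersion means more comparable references, hence more reward.
    cv = stdev(unit_prices) / mean(unit_prices)
    return 1.0 / (1.0 + cv)

def step2_reward(pred_order: list[str], true_order: list[str]) -> float:
    # Fraction of item pairs whose relative order matches the true
    # ranking (a Kendall-tau-style agreement score).
    pos = {item: i for i, item in enumerate(true_order)}
    pairs = [(a, b) for i, a in enumerate(pred_order) for b in pred_order[i + 1:]]
    return sum(pos[a] < pos[b] for a, b in pairs) / len(pairs)

def step3_reward(pred_price: float, true_price: float) -> float:
    # Relative price error mapped into [0, 1].
    return max(0.0, 1.0 - abs(pred_price - true_price) / true_price)

print(step1_reward([0.045, 0.044, 0.040]))
print(step2_reward(["B7", "B1", "B4"], ["B7", "B4", "B1"]))
print(step3_reward(0.040, 0.045))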

Example of the model’s three‑step CoT output:

step 1: compute unit prices
OK, first convert the total prices of all reference products to a uniform “yuan per 斤 (500 g)” basis: ...
{
  "unit": "斤",
  "unit_count": {"B7": 150, "B1": 500, ...}
}

step 2: compute the ranking
I now need to handle the user’s request: insert product A into set C while keeping unit prices sorted from high to low. ...
{
  "order": ["B7", "B1", ...]
}

step 3: compute the price
Assuming A’s estimated unit price is 0.0450 yuan per gram, it should be inserted after B4 (0.04453) and before B6 (0.03993). ...
{
  "price": 0.04
}

Pre‑training

We generate a large set of strict CoT‑template samples via prompt engineering, then perform supervised fine‑tuning (SFT) on the base model to ensure correct CoT format and improve reasoning quality at each of the three steps.
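
For illustration, one such template‑constrained SFT sample might look like the following; the field names and template wording are assumptions that mirror the three‑step output shown earlier:

import json

# Illustrative SFT sample; field names and template wording are
# assumptions, not the production prompt.

TEMPLATE = """step 1: convert unit prices
{step1_reasoning}
{step1_json}
step 2: compute the ranking
{step2_reasoning}
{step2_json}
step 3: compute the price
{step3_reasoning}
{step3_json}"""

sample = {
    "prompt": "Target product A with reference products B1..B7 and their prices ...",
    "response": TEMPLATE.format(
        step1_reasoning="Convert every reference price to yuan per 500 g.",
        step1_json=json.dumps({"unit": "jin", "unit_count": {"B7": 150, "B1": 500}}),
        step2_reasoning="Insert A into the list sorted by unit price, high to low.",
        step2_json=json.dumps({"order": ["B7", "B1"]}),
        step3_reasoning="A lands between its nearest reference neighbors.",
        step3_json=json.dumps({"price": 0.04}),
    ),
}
print(sample["response"])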

Reinforcement Learning

We use a breadth‑first‑search (BFS) strategy to expand CoT samples. At each inference step the model produces multiple candidate solutions; process rewards select high‑quality candidates for the next stage. PPO is employed with adjusted rewards that accumulate forward‑looking signals, ensuring early‑stage tokens receive sufficient reward.
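
The “forward‑looking” adjustment can be read as a standard discounted return over the per‑step process rewards, so that an early step is credited for where the trajectory eventually lands; a minimal sketch, with the discount factor as an assumption:

# gamma is an assumed discount factor, not a published value.

def accumulate_rewards(step_rewards: list[float], gamma: float = 0.95) -> list[float]:
    # Walk backward so each step's signal includes discounted credit
    # from everything after it; early steps are no longer starved.
    returns, running = [], 0.0
    for r in reversed(step_rewards):
        running = r + gamma * running
        returns.append(running)
    return returns[::-1]

# A weak first step still earns credit from strong later steps.
print(accumulate_rewards([0.1, 0.8, 0.9]))  # [1.672..., 1.655, 0.9]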

Experimental Results

Accuracy: traditional deep‑learning models customized for a few categories achieved 44% accuracy on a random sample of common three‑level categories. Our method raised overall accuracy to 74%.

Speed: while prompt engineering with high‑performance models can reach similar accuracy, inference takes over ten minutes and often falls into repetitive loops. Our 7B open‑source base model completes inference on a single GPU in a few seconds.

Future Optimizations

End‑to‑End RAG + LLM Joint Training

Currently, product retrieval and LLM training are separate, preventing the model from learning which retrieved samples are positive or negative based on final pricing outcomes.

Adversarial Learning for Sample Selection

Instead of rule‑based selection of primary products, adversarial learning could automatically pick challenging items for focused training, extending the explore‑exploit strategy throughout the pipeline.

Related Work

Early attempts with Process Reward Modeling (PRM) and Monte‑Carlo Tree Search (MCTS), as reported for DeepSeek‑R1, faced issues such as defining universal step splits, judging the correctness of intermediate steps, and reward hacking. MCTS additionally suffered from an exponentially growing search space and the difficulty of training an accurate value function.

AlphaGo’s supervised pre‑training on 30 million positions from human expert games, followed by self‑play reinforcement learning, inspired our use of retrieval‑augmented generation, process rewards, and tree search for reasoning tasks.

OpenAI’s work on process reward modeling produced the PRM800K dataset of roughly 800,000 step‑level labels on chain‑of‑thought solutions, enabling a model to predict, from any prefix, the probability of eventually reaching a correct answer, which significantly improves the reliability of multi‑step reasoning.

Conclusion

The integration of causal inference, large language models, RAG, process rewards, and tree‑search offers a scalable, explainable solution for e‑commerce pricing, achieving higher accuracy and orders‑of‑magnitude faster inference.

Tags: reinforcement learning, causal inference, retrieval‑augmented generation, e‑commerce pricing
Written by JD Cloud Developers

JD Cloud Developers is JD Technology Group’s platform for technical sharing and communication among AI, cloud computing, IoT, and related developers. It publishes JD product and technology information, industry content, and tech event news. Embrace technology and partner with developers to envision the future.
