Causal Inference + LLMs: Transforming E‑Commerce Pricing Strategies

This article describes how integrating causal inference with large language models and Retrieval‑Augmented Generation can automate and explain e‑commerce product pricing, detailing the three‑step workflow, reinforcement‑learning rewards, experimental results, and future directions for end‑to‑end RAG‑LLM training.


In April 2025, at the InfoQ QCon Global Software Development Conference, the author delivered a talk titled “Causal Inference and Large Model Fusion: Transforming E‑Commerce Pricing Strategies”. The presentation explained how large‑model techniques can address e‑commerce pricing challenges, improve the scientific basis of price decisions, and enable precise, explainable recommendations.

Introduction

With the rapid growth of e‑commerce and increasing price transparency, consumers compare multiple products before purchase. To emulate this behavior, a three‑step algorithm was designed: (1) input the product description to be priced; (2) retrieve similar products and their prices from a database; (3) generate a price suggestion and output the reasoning logic. This capability is already used for new‑product price review, dramatically reducing manual audit costs.
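To make the workflow concrete, the sketch below walks through the same three steps with a toy token-overlap retriever and a hypothetical prompt builder. The names, similarity measure, and prompt wording are illustrative assumptions, not the production implementation.

from dataclasses import dataclass
from typing import List

@dataclass
class Product:
    description: str
    price: float

def token_overlap_similarity(a: str, b: str) -> float:
    # Toy stand-in for the embedding-based retrieval described later in the article.
    ta, tb = set(a.lower().split()), set(b.lower().split())
    return len(ta & tb) / max(len(ta | tb), 1)

def retrieve_similar(target: str, catalog: List[Product], k: int = 5) -> List[Product]:
    # Step 2: fetch the k most similar priced products from the database.
    ranked = sorted(catalog, key=lambda p: token_overlap_similarity(target, p.description), reverse=True)
    return ranked[:k]

def build_pricing_prompt(target: str, neighbors: List[Product]) -> str:
    # Step 3 input: a prompt combining the target description with reference prices.
    lines = [f"- {p.description}: {p.price:.2f} yuan" for p in neighbors]
    return ("Target product: " + target + "\n"
            "Reference products and prices:\n" + "\n".join(lines) + "\n"
            "Suggest a price and explain the reasoning step by step.")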

Modeling with Large Language Models

Key challenges in modeling include covering hundreds of categories with diverse price‑comparison logic, handling complex product information (bundles, gifts, special models), and providing interpretable price derivations. Large language models (LLMs) address these challenges by offering rich domain knowledge, understanding complex product details, and delivering explanations beyond traditional machine‑learning predictions.

The proposed workflow adopts a Retrieval‑Augmented Generation (RAG) architecture:

RAG architecture diagram

Components:

1. Retriever: retrieves the most similar competitor products based on textual similarity and embeddings, feeding them as prompts to the generator.

2. Generator: uses a reasoning model to derive the target product's price from similar-product prices, improving accuracy and explainability.

3. Reinforcement-learning reward design: three reward aspects – pricing error, the price gap among similar items, and attribute-extraction accuracy (a combined-reward sketch follows this list).
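As a rough illustration of how the three reward aspects could be combined, the following sketch uses simple placeholder formulas and weights; the exact reward shapes used in the actual system are not disclosed in the talk.

def pricing_error_reward(predicted: float, actual: float) -> float:
    # Reward decays with relative pricing error (illustrative shape, not the exact formula).
    rel_err = abs(predicted - actual) / max(actual, 1e-6)
    return max(0.0, 1.0 - rel_err)

def price_gap_reward(predicted: float, neighbor_prices: list) -> float:
    # Penalize suggestions that fall far outside the band of similar-product prices.
    lo, hi = min(neighbor_prices), max(neighbor_prices)
    if lo <= predicted <= hi:
        return 1.0
    gap = min(abs(predicted - lo), abs(predicted - hi)) / max(hi, 1e-6)
    return max(0.0, 1.0 - gap)

def attribute_reward(extracted: dict, gold: dict) -> float:
    # Fraction of product attributes (unit, pack count, gifts, etc.) extracted correctly.
    if not gold:
        return 1.0
    hits = sum(1 for k, v in gold.items() if extracted.get(k) == v)
    return hits / len(gold)

def total_reward(predicted, actual, neighbor_prices, extracted, gold,
                 weights=(0.5, 0.25, 0.25)) -> float:
    # Weighted sum of the three aspects; the weights here are placeholders.
    w1, w2, w3 = weights
    return (w1 * pricing_error_reward(predicted, actual)
            + w2 * price_gap_reward(predicted, neighbor_prices)
            + w3 * attribute_reward(extracted, gold))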

Process Reward and Tree Search Optimization

During CoT (Chain‑of‑Thought) training, pure exploration yields many low‑quality attempts, while pure exploitation can get stuck. By combining process rewards with a tree‑search mechanism, the model explores new reasoning paths while efficiently exploiting learned knowledge, boosting inference correctness and training efficiency.

Implementation of Process Reward and Tree Search

The price inference builds a prompt from the target product description and similar‑product data, then generates a price estimate through the LLM. The CoT process consists of three steps:

Step 1: Convert unit prices to a common unit; reward is based on the coefficient of variation among similar items.

Step 2: Rank prices; reward reflects the difference between the model’s ranking and the true ranking.

Step 3: Compute the final price; reward is the deviation between the estimated and actual price.

Example model outputs illustrating the three steps are shown below:

Step 1: compute unit prices
"OK, first convert the total prices of all reference products to a common unit of yuan per jin (斤): ..."
{
  "unit": "斤",
  "unit_count": {"B7": 150, "B1": 500, ...}
}

Step 2: compute the ranking
"I need to insert product A into set C while keeping the unit prices sorted from high to low. First, I carefully read the task requirements and input information provided by the user to make sure I understand them correctly. ..."
{
  "order": ["B7", "B1", ...]
}

Step 3: compute the price
"Assuming product A's estimated unit price is 0.0450 yuan/gram, it should be inserted after B4 (0.04453) and before B6 (0.03993). ..."
{
  "price": 0.04
}
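One plausible way to turn the three steps into process rewards is sketched below. The coefficient-of-variation mapping, the rank-displacement penalty, and the relative-deviation formula are assumptions consistent with the step descriptions above, not the exact production formulas.

import statistics

def step1_reward(unit_prices: dict) -> float:
    # Step 1: lower coefficient of variation among the converted unit prices -> higher reward.
    values = list(unit_prices.values())
    if len(values) < 2 or statistics.mean(values) == 0:
        return 0.0
    cv = statistics.stdev(values) / statistics.mean(values)
    return 1.0 / (1.0 + cv)

def step2_reward(predicted_order: list, true_order: list) -> float:
    # Step 2: reward shrinks with the total displacement between predicted and true ranking.
    pos = {item: i for i, item in enumerate(true_order)}
    displacement = sum(abs(i - pos[item]) for i, item in enumerate(predicted_order) if item in pos)
    max_disp = max(len(true_order) ** 2 / 2, 1)
    return max(0.0, 1.0 - displacement / max_disp)

def step3_reward(predicted_price: float, actual_price: float) -> float:
    # Step 3: reward is driven by the relative deviation from the actual price.
    return max(0.0, 1.0 - abs(predicted_price - actual_price) / max(actual_price, 1e-6))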

Pre‑training

During pre‑training, a strict CoT template is generated via prompt engineering, and supervised fine‑tuning (SFT) aligns the base model to produce the desired three‑step reasoning format, improving step‑wise rationality and accuracy.
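A minimal sketch of what such a strict template and one SFT pair could look like, assuming a JSON-per-step output format like the snippets above; the wording is hypothetical.

COT_TEMPLATE = """You are a pricing assistant. Answer strictly in three steps.
Step 1: convert all reference prices to a common unit and output {{"unit": ..., "unit_count": {{...}}}}.
Step 2: insert product A into the reference list sorted by unit price and output {{"order": [...]}}.
Step 3: derive the final price of product A and output {{"price": ...}}.

Target product: {target}
Reference products: {references}"""

def build_sft_example(target: str, references: str, gold_cot: str) -> dict:
    # One supervised fine-tuning pair: the prompt is rendered from the strict template,
    # the completion is a curated three-step reasoning trace.
    return {"prompt": COT_TEMPLATE.format(target=target, references=references),
            "completion": gold_cot}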

Reinforcement Learning

A breadth‑first search (BFS) strategy expands all active prefixes, retaining the top‑B candidates after scoring. This approach is used both in pre‑training data collection and in preference‑based fine‑tuning (DPO), allowing the model to automatically select high‑quality main products and apply the explore‑exploit paradigm throughout the pipeline.
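A compact sketch of the breadth-first, top-B expansion is given below; llm.sample_step and score_step are assumed interfaces standing in for the trained generator and the process reward, not a specific library API.

import heapq

def bfs_tree_search(llm, prompt: str, score_step, beam_width: int = 4,
                    num_steps: int = 3, samples_per_prefix: int = 4):
    # Breadth-first expansion of reasoning prefixes: at each CoT step, sample several
    # continuations per active prefix (explore), score them with the process reward,
    # and keep only the top-B candidates (exploit).
    beams = [("", 0.0)]  # (reasoning prefix, cumulative reward)
    for step in range(num_steps):
        candidates = []
        for prefix, cum_reward in beams:
            for _ in range(samples_per_prefix):
                continuation = llm.sample_step(prompt, prefix, step)
                reward = score_step(step, continuation)
                candidates.append((prefix + continuation, cum_reward + reward))
        beams = heapq.nlargest(beam_width, candidates, key=lambda c: c[1])
    return beams  # top-B full reasoning chains, usable as SFT data or DPO preference pairs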

Experimental Results

Accuracy: Traditional deep‑learning models tailored to a few categories achieved only 44% accuracy on a random sample of common third‑level categories. The proposed method raised overall accuracy to 74%.

Speed: Prompt engineering with high-quality reasoning models can reach similar accuracy but requires over ten minutes per inference and often gets stuck in loops. After training, the 7B open-source base model completes the entire pipeline in a few seconds on a single GPU.

Future Optimizations

End‑to‑End RAG + LLM Joint Training

Currently, product retrieval and LLM training are separate, preventing the model from learning which retrieved samples are positive or negative based on final pricing outcomes. Joint training would propagate reward signals back to the retrieval stage.
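One way such joint training could be set up, sketched here purely as an assumption rather than the planned design, is to treat retrieval as a stochastic policy and weight the log-probability of the sampled reference products by the final pricing reward:

import torch

def retriever_policy_gradient_loss(doc_logits: torch.Tensor,
                                   sampled_idx: torch.Tensor,
                                   final_reward: torch.Tensor) -> torch.Tensor:
    # REINFORCE-style sketch: documents are sampled from a softmax over relevance logits,
    # and the final pricing reward scales the log-probability of the sampled documents,
    # so good or bad pricing outcomes flow back to the retriever.
    log_probs = torch.log_softmax(doc_logits, dim=-1)
    chosen = log_probs.gather(-1, sampled_idx.unsqueeze(-1)).squeeze(-1)
    return -(final_reward * chosen).mean()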

Adversarial Learning for Sample Selection

Instead of rule‑based selection of main products, adversarial learning could automatically pick challenging items, focusing training on weak categories and extending the explore‑exploit strategy throughout the workflow.

Related Work

Earlier attempts such as Process Reward Modeling (PRM) and Monte‑Carlo Tree Search (MCTS) were explored in DeepSeek‑R1 but faced issues like defining universal step splits, judging intermediate correctness, and reward hacking. MCTS suffered from exponential search space and difficulty training a reliable value function.

Recent advances like PRM800K, a dataset of roughly 800,000 labeled CoT steps, enable supervised fine-tuning of reward models that predict the probability of eventual correctness from any prefix. Using these models for step-wise scoring and Best-of-N selection improves multi-step reasoning reliability, achieving 78.2% accuracy on a MATH benchmark subset.
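For reference, step-wise scoring plus Best-of-N selection can be sketched as follows, assuming a prm_score function that maps a reasoning prefix to the estimated probability of eventual correctness; the product aggregation is one common choice, not the only one.

def best_of_n(candidates, prm_score):
    # candidates: list of reasoning chains, each a list of step strings.
    # prm_score(prefix) -> estimated probability that the chain ends up correct.
    def chain_score(steps):
        score = 1.0
        for prefix_len in range(1, len(steps) + 1):
            score *= prm_score(steps[:prefix_len])
        return score
    return max(candidates, key=chain_score)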

Overall, the integration of causal inference, RAG, process rewards, and reinforcement learning offers a powerful framework for scalable, explainable e‑commerce pricing.

Tags: RAG, reinforcement learning, causal inference, e-commerce pricing, process reward
Written by JD Tech Talk, the official JD Tech public account delivering best practices and technology innovation.
