How Heterogeneous Treatment Effect Analysis Uncovers Sub‑Group Performance

This article explains the concept of heterogeneous treatment effects, outlines how to select dimensions for HTE analysis, describes a Python‑based MVP tool for automated CATE exploration, and showcases a real‑world experiment case where sub‑group insights turned a non‑significant overall result into actionable findings.

JD Retail Technology

Nov 20, 2025

What Is Experiment Heterogeneity?

In A/B testing, the same treatment can produce different effects for distinct user groups. Heterogeneous Treatment Effects (HTE) analysis quantifies this variation, enabling a fine‑grained view of treatment impact across sub‑populations.

Key Definitions

ATE (Average Treatment Effect): the mean effect of the treatment across all subjects.

CATE (Conditional Average Treatment Effect): the mean effect of the treatment among subjects that satisfy a specific condition, such as belonging to a given user segment.

ITE (Individual Treatment Effect): the effect of the treatment on a single subject.
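
In potential-outcomes notation, these three quantities are commonly written as below. The symbols are assumptions introduced here for illustration: Y_i(1) and Y_i(0) are the outcomes subject i would have under treatment and under control, and X is the vector of subject covariates.

```latex
\begin{align}
\text{ITE:}\quad  \tau_i  &= Y_i(1) - Y_i(0) \\
\text{ATE:}\quad  \tau    &= \mathbb{E}\left[\, Y(1) - Y(0) \,\right] \\
\text{CATE:}\quad \tau(x) &= \mathbb{E}\left[\, Y(1) - Y(0) \mid X = x \,\right]
\end{align}
```

The ITE is never observed directly, because each subject is seen in only one arm. The ATE is the CATE averaged over the distribution of X, which is why a near-zero ATE can hide sub-groups with effects of opposite sign.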

Why Heterogeneity Matters for Business

Identify how a strategy performs for different user segments, revealing hidden business logic and informing subsequent experiments.

Discover optimal sub‑populations so that a globally ineffective strategy can be deployed selectively, while avoiding loss from negatively impacted groups.

Model experiment results to predict and serve dynamic optimal‑audience recommendations in production.

Among JD.com retail experiments in June 2023, only about 23% exhibited detectable heterogeneity from the audience perspective.

Selecting Dimensions for HTE Analysis

A candidate dimension X can be used when it is independent of the random assignment, T ⟂ X, so that splitting by X keeps the treatment and control groups comparable within every segment. In practice, analysts often take as X the value of a user tag recorded on the day before the user first enters the experiment, which ensures that the tag cannot have been influenced by the treatment.

Simple derivations and visual examples (see images below) illustrate why the pre‑experiment tag satisfies the independence assumption.
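
The argument can also be written out in a few lines. This is a sketch under stated assumptions, not a reproduction of the original figures: Y(1) and Y(0) are potential outcomes (notation introduced here), T is assigned at random, and X is measured before first exposure.

```latex
% Randomization with a pre-exposure tag gives
%   (Y(0), Y(1), X) \perp T,
% which implies both T \perp X and (Y(0), Y(1)) \perp T \mid X.
\begin{align}
\mathbb{E}[\,Y \mid T = 1, X = x\,] - \mathbb{E}[\,Y \mid T = 0, X = x\,]
  &= \mathbb{E}[\,Y(1) \mid T = 1, X = x\,] - \mathbb{E}[\,Y(0) \mid T = 0, X = x\,] \\
  &= \mathbb{E}[\,Y(1) \mid X = x\,] - \mathbb{E}[\,Y(0) \mid X = x\,] \\
  &= \tau(x)
\end{align}
```

If the tag were measured after exposure, it could itself respond to the treatment. The second equality would then fail, and the within-segment difference in means would no longer equal the CATE.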

Overview of HTE Analysis Methods

HTE can be explored with a spectrum of techniques, ranging from simple stratified analysis (group‑by dimension and compare treatment effects) to advanced machine‑learning approaches such as causal forests, meta‑learners, and uplift trees. The choice of method depends on data volume, dimensionality, and the need for interpretability.
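
The simplest end of that spectrum, stratified analysis, can be sketched as follows. The example uses only the Python standard library and synthetic data. The function name, column names, and the normal-approximation p-value are assumptions made for illustration, not part of any tool described in this article.

```python
import math
import random
from collections import defaultdict
from statistics import NormalDist, mean, variance


def stratified_effects(rows, dimension, metric="metric", arm="treated"):
    """Treatment-minus-control difference in means for each level of `dimension`."""
    groups = defaultdict(lambda: {0: [], 1: []})
    for row in rows:
        groups[row[dimension]][row[arm]].append(row[metric])

    report = {}
    for level, arms in groups.items():
        control, treated = arms[0], arms[1]
        effect = mean(treated) - mean(control)
        # Welch-style standard error; the p-value below relies on a normal
        # approximation, which is reasonable only for large groups.
        se = math.sqrt(variance(treated) / len(treated)
                       + variance(control) / len(control))
        p_value = 2 * (1 - NormalDist().cdf(abs(effect / se)))
        report[level] = {"effect": effect, "se": se, "p_value": p_value,
                         "n": len(treated) + len(control)}
    return report


# Synthetic data: the treatment helps "new" users (+2.0) and hurts "loyal" users (-2.0),
# so the overall average effect is close to zero.
rng = random.Random(0)
rows = []
for _ in range(4000):
    segment = rng.choice(["new", "loyal"])
    treated = rng.randint(0, 1)
    true_effect = 2.0 if segment == "new" else -2.0
    rows.append({"segment": segment, "treated": treated,
                 "metric": 10.0 + true_effect * treated + rng.gauss(0, 1)})

report = stratified_effects(rows, "segment")
for level, stats in sorted(report.items()):
    print(level, round(stats["effect"], 2), round(stats["p_value"], 4))
```

Stratification stays interpretable but scales poorly: every extra dimension multiplies the number of cells and shrinks each one, which is one reason tree-based and meta-learner methods are used when many tags are in play.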

MVP Python Package for CATE Exploration

The data‑science team built a lightweight Python package that automates CATE analysis in roughly six lines of code. Its core workflow consists of:

SQL Generation & Data Retrieval

Provide a YAML configuration that specifies experiment split tables, metric definitions, and user‑tag tables used for CATE studies.

The package translates the configuration into SQL, executes the query, and returns a pandas DataFrame containing treatment assignment, metric values, and tag columns.
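
The configuration-to-SQL step might look roughly like the sketch below. The configuration schema, every table and column name, and the `build_query` helper are hypothetical. The configuration is shown as the dict a YAML loader would produce, and the date arithmetic assumes a Hive-style `DATE_SUB(date, days)` function.

```python
# Hypothetical configuration, shown as the dict a YAML loader would produce.
# All table and column names are illustrative assumptions.
config = {
    "experiment": {"table": "exp_assignment", "user_key": "user_id",
                   "arm_column": "treated",
                   "first_exposure_column": "first_exposure_date"},
    "metrics": {"table": "user_daily_metrics", "date_column": "dt",
                "columns": {"order_amount": "SUM(m.order_amount)"}},
    "tags": {"table": "user_tags", "date_column": "dt",
             "columns": ["user_level", "is_plus_member"]},
}


def build_query(cfg):
    """Render one SQL statement that returns one row per user.

    Identifiers are interpolated directly, so the config must come from a
    trusted source.
    """
    exp, met, tag = cfg["experiment"], cfg["metrics"], cfg["tags"]
    key = exp["user_key"]
    first_day = f"e.{exp['first_exposure_column']}"
    tag_cols = [f"t.{col}" for col in tag["columns"]]
    metric_cols = [f"{expr} AS {name}" for name, expr in met["columns"].items()]
    group_cols = [f"e.{key}", f"e.{exp['arm_column']}"] + tag_cols
    select_list = ",\n       ".join(group_cols + metric_cols)
    return (
        f"SELECT {select_list}\n"
        f"FROM {exp['table']} e\n"
        # Tags are read from the day before first exposure, so the treatment
        # cannot have changed them.
        f"LEFT JOIN {tag['table']} t\n"
        f"  ON t.{key} = e.{key} AND t.{tag['date_column']} = DATE_SUB({first_day}, 1)\n"
        # Metrics are accumulated from first exposure onward.
        f"LEFT JOIN {met['table']} m\n"
        f"  ON m.{key} = e.{key} AND m.{met['date_column']} >= {first_day}\n"
        f"GROUP BY {', '.join(group_cols)}"
    )


sql = build_query(config)
print(sql)
```

Joining tags on the day before first exposure is what carries the pre-experiment-tag rule from the dimension-selection section into the generated query.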

Automatic Sub‑Population Search

Using the retrieved data, the tool searches for sub‑populations that maximize the difference in CATE for a target metric.

Parameters such as maximum depth, minimum subgroup size, and regularization strength can be tuned via the API.
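
To make the search step concrete, here is a minimal one-level version: for each tag value, compare the CATE inside the candidate subgroup with the CATE outside it, and keep the split with the largest gap. This is an illustrative stand-in, not the package's actual algorithm. The function names are hypothetical, only a minimum-size constraint is shown, and greater depth would come from applying the same search recursively to each side.

```python
import random
from statistics import mean


def cate(rows, metric="metric", arm="treated"):
    """Difference in means between treated and control rows."""
    treated = [r[metric] for r in rows if r[arm] == 1]
    control = [r[metric] for r in rows if r[arm] == 0]
    return mean(treated) - mean(control)


def arm_sizes_ok(rows, min_size, arm="treated"):
    """True when both arms contain at least `min_size` rows."""
    n_treated = sum(r[arm] for r in rows)
    return min(n_treated, len(rows) - n_treated) >= min_size


def best_split(rows, tags, min_size=200):
    """Find the (tag, value) split with the largest inside-vs-outside CATE gap."""
    best = None
    for tag in tags:
        for value in sorted({r[tag] for r in rows}):
            inside = [r for r in rows if r[tag] == value]
            outside = [r for r in rows if r[tag] != value]
            if not (arm_sizes_ok(inside, min_size) and arm_sizes_ok(outside, min_size)):
                continue  # skip subgroups too small to estimate reliably
            cate_in, cate_out = cate(inside), cate(outside)
            gap = abs(cate_in - cate_out)
            if best is None or gap > best["gap"]:
                best = {"tag": tag, "value": value, "gap": gap,
                        "cate_inside": cate_in, "cate_outside": cate_out}
    return best


# Synthetic data: only "high"-level users benefit (+1.5); everyone else loses 0.5.
# The "channel" tag has no influence on the effect.
rng = random.Random(1)
rows = []
for _ in range(6000):
    level = rng.choice(["low", "mid", "high"])
    channel = rng.choice(["app", "web"])
    treated = rng.randint(0, 1)
    true_effect = 1.5 if level == "high" else -0.5
    rows.append({"level": level, "channel": channel, "treated": treated,
                 "metric": 5.0 + true_effect * treated + rng.gauss(0, 1)})

best = best_split(rows, ["level", "channel"])
print(best["tag"], best["value"], round(best["gap"], 2))
```

Because the search picks the largest gap among many candidates, the winning subgroup tends to look stronger than it really is, so it still needs the statistical validation described next.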

Statistical Validation & Reporting

Per‑subgroup hypothesis tests (e.g., t‑test or bootstrap) assess whether observed CATE differences are statistically significant.

Multi‑metric breakdowns and feature‑importance descriptions are generated to aid profiling.
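
As one possible form of the per-subgroup test, the sketch below computes a percentile-bootstrap confidence interval for a subgroup's CATE. The function name, resample count, and synthetic data are assumptions for illustration; the package may use a different procedure.

```python
import random


def avg(values):
    return sum(values) / len(values)


def bootstrap_cate_ci(treated, control, n_boot=1000, alpha=0.05, seed=0):
    """Point estimate and percentile-bootstrap interval for mean(treated) - mean(control)."""
    rng = random.Random(seed)
    diffs = []
    for _ in range(n_boot):
        # Resample each arm independently, with replacement.
        t_sample = rng.choices(treated, k=len(treated))
        c_sample = rng.choices(control, k=len(control))
        diffs.append(avg(t_sample) - avg(c_sample))
    diffs.sort()
    lower = diffs[int((alpha / 2) * n_boot)]
    upper = diffs[int((1 - alpha / 2) * n_boot) - 1]
    return avg(treated) - avg(control), lower, upper


# Synthetic subgroup with a true effect of +1.0 on the metric.
rng = random.Random(42)
control = [rng.gauss(10.0, 2.0) for _ in range(400)]
treated = [rng.gauss(11.0, 2.0) for _ in range(400)]

estimate, lower, upper = bootstrap_cate_ci(treated, control)
significant = lower > 0 or upper < 0  # interval excludes zero
print(round(estimate, 2), round(lower, 2), round(upper, 2), significant)
```

The subgroup's effect is treated as significant at level `alpha` when the interval excludes zero. When many subgroups are tested at once, the chance of a false positive grows with the number of tests.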

Typical usage example:

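from cate_tool import CATEExplorer

# Point the explorer at the YAML file that names the split, metric, and tag tables.
explorer = CATEExplorer(config_path="config.yaml")
# Run the workflow described above: data retrieval, sub-population search, validation.
results = explorer.run()
# Print the resulting report.
print(results.summary())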

Real‑World Showcase

A production experiment with an overall negative and non‑significant lift was re‑examined using the package. The analysis identified two mutually exclusive sub‑populations:

A subgroup with a statistically significant positive CATE.

A subgroup with a statistically significant negative CATE.

Subsequent profiling (see image) revealed distinct user characteristics for each subgroup, enabling product managers to target the negative segment for future improvements.