
Estimating Clustered Data Causal Effects with DiConfounder: A Double‑Difference Framework

This article presents a comprehensive approach to estimating causal effects on clustered data using a double‑difference method. It introduces the DiConfounder algorithm, built on extensions of the Rubin Causal Model, details the data characteristics, model assumptions, and six‑step pipeline, and reports competitive results on the ACIC2022 challenge.


The article introduces the concept of clustered data, where samples are grouped (e.g., by city, product category, or hospital) and experiments assign entire clusters to treatment or control, highlighting the need for cluster‑level analysis due to natural grouping, regulatory, ethical, or cost constraints.

Two motivating examples are given: (1) price‑discrimination‑free product pricing experiments, and (2) policy analysis of medical insurance effects where the policy can only be applied at the hospital level, which is the background of the ACIC2022 data challenge.

The ACIC2022 dataset contains four years of medical‑policy data for 3,400 hospitals, each with over one million rows (total >100 GB). The data are highly imbalanced, contain missing values, and exhibit large distributional differences between the control and treatment groups, measured by the absolute standardized mean difference (ASMD).
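The imbalance metric mentioned above is straightforward to compute per covariate. A minimal sketch (the pooled‑standard‑deviation denominator is a common convention; the article does not specify which variant was used):

```python
import numpy as np

def asmd(x_treat, x_ctrl):
    """Absolute standardized mean difference of one covariate between
    treatment and control samples, using a pooled-SD denominator."""
    x_treat = np.asarray(x_treat, dtype=float)
    x_ctrl = np.asarray(x_ctrl, dtype=float)
    pooled_sd = np.sqrt((x_treat.var(ddof=1) + x_ctrl.var(ddof=1)) / 2.0)
    return abs(x_treat.mean() - x_ctrl.mean()) / pooled_sd
```

Values above roughly 0.1–0.25 are conventionally read as meaningful covariate imbalance between groups.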

The authors adopt the Rubin Causal Model (RCM) with its three core assumptions (no unmeasured confounding, positivity, and SUTVA), modifying the first to a "differential no‑confounding" condition across the pre‑ and post‑policy periods. They define three related estimands (yearly SATT, group‑level SATT, and hospital‑level SATT) and derive a double‑difference estimator.
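The hospital‑level double‑difference estimator can be sketched as follows (the notation here is assumed for illustration, not taken from the talk): for each treated hospital, take its post‑period minus pre‑period mean outcome, then subtract the predicted pre/post change for a comparable untreated hospital, and average over the $N_1$ treated hospitals:

```latex
% Sketch of a hospital-level double-difference SATT estimator.
% Z_i: treatment indicator, X_i: covariates, \bar{Y}: period-mean outcomes.
\widehat{\tau}^{\mathrm{DD}}
  = \frac{1}{N_1} \sum_{i:\, Z_i = 1}
    \Big[ \big(\bar{Y}_{i,\mathrm{post}} - \bar{Y}_{i,\mathrm{pre}}\big)
        - \widehat{m}_0(X_i) \Big],
\quad
\widehat{m}_0(x)
  = \widehat{\mathbb{E}}\big[\bar{Y}_{\mathrm{post}} - \bar{Y}_{\mathrm{pre}}
      \,\big|\, Z = 0,\, X = x \big]
```

Differencing within each hospital removes time‑invariant hospital‑level confounders, which is why only the weaker "differential no‑confounding" condition on the pre/post change is needed.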

To estimate the causal effect, they employ a causal‑forest based R‑Learner (causal forests per Wager & Athey; the R‑Learner per Nie & Wager) and design the DiConfounder algorithm, which consists of six steps: (1) feature engineering with hospital‑level and patient‑level attributes, including trend features; (2) handling and imputing missing data; (3–5) training the outcome, propensity‑score, and auxiliary models and feeding their predictions into the causal forest; (6) aggregating SATT estimates and quantifying uncertainty via bootstrap or analytical formulas.

The final model achieved 80 % coverage (against a 90 % target) and an RMSE of roughly 10, better than the competition's average error. Analysis shows the method works best when between‑group differences are small and individual differences are large, and less well when group heterogeneity dominates.
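The coverage figure above depends on how step 6 builds its intervals. Because treatment is assigned at the cluster level, a bootstrap should resample whole hospitals, not individual rows; a sketch of such a cluster bootstrap (function name and percentile-interval choice are my assumptions):

```python
import numpy as np

def cluster_bootstrap_ci(effects, cluster_ids, n_boot=2000, alpha=0.10, seed=0):
    """Percentile bootstrap CI for the mean effect, resampling whole
    clusters (e.g. hospitals) to respect within-cluster correlation."""
    effects = np.asarray(effects, dtype=float)
    cluster_ids = np.asarray(cluster_ids)
    by_cluster = [effects[cluster_ids == c] for c in np.unique(cluster_ids)]
    rng = np.random.default_rng(seed)
    stats = []
    for _ in range(n_boot):
        # Draw clusters with replacement, keep each one's rows intact.
        pick = rng.integers(0, len(by_cluster), size=len(by_cluster))
        stats.append(np.concatenate([by_cluster[i] for i in pick]).mean())
    lo, hi = np.quantile(stats, [alpha / 2, 1 - alpha / 2])
    return lo, hi
```

Resampling rows instead of clusters would ignore within-hospital correlation and typically produce intervals that are too narrow, i.e. under-coverage of exactly the kind the 80 %-vs-90 % gap suggests.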

In summary, DiConfounder transforms the estimand into a differential form, applies a double‑robust causal‑forest estimator, and provides uncertainty quantification, offering a practical solution for large‑scale clustered causal inference problems such as healthcare policy evaluation.

Tags: Machine Learning, causal inference, double difference, clustered data, DiConfounder, healthcare policy
Written by

DataFunSummit

Official account of the DataFun community, dedicated to sharing big data and AI industry summit news and speaker talks, with regular downloadable resource packs.
