How Machine Learning Predicts Genetic Variant Penetrance Across Populations

Researchers at the Icahn School of Medicine at Mount Sinai trained gradient‑boosting tree models on more than one million electronic health records, covering ten hereditary diseases, to quantify the penetrance of genetic variants. The resulting probabilistic risk scores could improve clinical interpretation and patient management.

Data Party THU

Background

Variable penetrance describes the phenomenon where the same genetic variant causes disease in some individuals but remains benign in others. Traditional variant interpretation relies on limited cohort data and often yields uncertain pathogenicity classifications.
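As a point of reference, penetrance in its simplest empirical form is the fraction of carriers of a variant who actually show the disease. The sketch below illustrates only that definition; the function name `empirical_penetrance` and the toy cohort are hypothetical and are not taken from the study.

```python
def empirical_penetrance(carrier_affected):
    """Fraction of carriers who show the disease phenotype.

    carrier_affected: iterable of booleans, one per carrier of a variant
    (True if that carrier has the disease).
    """
    statuses = list(carrier_affected)
    if not statuses:
        raise ValueError("need at least one carrier to estimate penetrance")
    return sum(statuses) / len(statuses)


# Hypothetical toy cohort: 3 of 12 carriers are affected.
toy_carriers = [True] * 3 + [False] * 9
print(empirical_penetrance(toy_carriers))  # 0.25
```

With few carriers per variant this raw fraction is noisy, which is one reason a model that pools information across many patients is attractive.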

Objective

The study aimed to develop a scalable, data‑driven approach to estimate the penetrance of genetic variants across large populations using machine‑learning techniques.

Data

More than one million de‑identified electronic health records (EHRs) from a multi‑institutional health system were aggregated. For each patient, the dataset included:

Genotype information (variant calls and allele frequencies).

Demographic variables (age, sex, ancestry).

Clinical phenotypes extracted from diagnosis codes, laboratory measurements, and medication histories.

Ten hereditary diseases with well‑characterized genetic etiologies were selected as modeling targets.

Method

Gradient‑boosting tree (GBT) models were trained separately for each disease. GBTs were chosen because they capture non‑linear interactions among genetic, environmental, and demographic factors without requiring explicit feature engineering.

Key steps:

Curate variant‑level annotations (e.g., loss‑of‑function, missense, population allele frequency) and link them to patient phenotypes derived from the EHR.

Split the data into training (80 %) and hold‑out validation (20 %) sets, stratified by disease status.

Train a GBT classifier to predict disease presence given the combined genotype‑phenotype feature vector.

Convert the classifier’s probability output into a continuous penetrance score ranging from 0 (no risk) to 1 (certain disease).

Validate the penetrance scores against known pathogenic and benign variants from ClinVar and functional assays.
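The split, training, and scoring steps above can be sketched in plain Python. This is a minimal illustration under stated assumptions, not the study's implementation: the two features (`is_lof`, `age_decade`), the synthetic data, and all function names are made up, and the weak learners are depth‑one stumps. Stumps cannot model interactions between features; a production GBT library with deeper trees would be needed for that.

```python
import math
import random


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def stratified_split(labels, test_frac=0.2, seed=0):
    """Return (train_idx, test_idx), keeping the class ratio in both parts."""
    rng = random.Random(seed)
    train, test = [], []
    for cls in sorted(set(labels)):
        idx = [i for i, y in enumerate(labels) if y == cls]
        rng.shuffle(idx)
        n_test = round(len(idx) * test_frac)
        test += idx[:n_test]
        train += idx[n_test:]
    return train, test


def fit_stump(X, residuals):
    """Least-squares regression stump: one feature, one threshold, two leaves."""
    best = None
    for j in range(len(X[0])):
        for t in sorted({row[j] for row in X}):
            left = [r for row, r in zip(X, residuals) if row[j] <= t]
            right = [r for row, r in zip(X, residuals) if row[j] > t]
            if not left or not right:
                continue
            lm, rm = sum(left) / len(left), sum(right) / len(right)
            sse = sum((r - lm) ** 2 for r in left) + sum((r - rm) ** 2 for r in right)
            if best is None or sse < best[0]:
                best = (sse, j, t, lm, rm)
    return best[1:] if best else None


def fit_gbt(X, y, n_rounds=50, lr=0.3):
    """Gradient boosting on log-loss, with stumps fitted to the residuals."""
    p = min(max(sum(y) / len(y), 1e-6), 1 - 1e-6)
    f0 = math.log(p / (1 - p))  # start from the log-odds of the base rate
    scores = [f0] * len(y)
    stumps = []
    for _ in range(n_rounds):
        # Negative gradient of log-loss with respect to the raw score.
        residuals = [yi - sigmoid(s) for yi, s in zip(y, scores)]
        stump = fit_stump(X, residuals)
        if stump is None:
            break
        j, t, lm, rm = stump
        stumps.append(stump)
        scores = [s + lr * (lm if row[j] <= t else rm) for row, s in zip(X, scores)]
    return f0, lr, stumps


def penetrance_score(model, row):
    """Predicted disease probability in (0, 1), read as a penetrance score."""
    f0, lr, stumps = model
    z = f0 + sum(lr * (lm if row[j] <= t else rm) for j, t, lm, rm in stumps)
    return sigmoid(z)


# Synthetic toy data (assumption): risk is elevated for older LoF carriers.
rng = random.Random(42)
X, y = [], []
for _ in range(400):
    is_lof = rng.randint(0, 1)
    age_decade = rng.randint(2, 7)
    risk = 0.65 if (is_lof and age_decade >= 5) else 0.05
    X.append([is_lof, age_decade])
    y.append(1 if rng.random() < risk else 0)

train_idx, test_idx = stratified_split(y)
model = fit_gbt([X[i] for i in train_idx], [y[i] for i in train_idx])
for row in ([1, 6], [0, 6]):
    print(row, round(penetrance_score(model, row), 3))
```

The held‑out indices in `test_idx` are left unused here; in practice they would feed the validation step, for example by checking that scores separate known pathogenic from known benign variants.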

Figure 1: Study design and workflow

Results

The models produced a continuous penetrance score (0–1) for each variant. Across 1,600 evaluated variants, the ML‑derived penetrance correlated with functional categories:

Loss‑of‑function (LoF) variants showed the highest median scores.

Benign variants had the lowest median scores.
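A comparison like this comes down to grouping per‑variant scores by functional category and taking the median of each group. The sketch below uses invented scores purely to show the computation; the values and the `median_score_by_category` helper are hypothetical, not data from the study.

```python
from collections import defaultdict
from statistics import median


def median_score_by_category(variants):
    """Map each functional category to the median of its penetrance scores.

    variants: iterable of (category, score) pairs.
    """
    groups = defaultdict(list)
    for category, score in variants:
        groups[category].append(score)
    return {category: median(scores) for category, scores in groups.items()}


# Invented toy scores, for illustration only.
toy_variants = [
    ("LoF", 0.82), ("LoF", 0.40), ("LoF", 0.91),
    ("missense", 0.35), ("missense", 0.55), ("missense", 0.48),
    ("benign", 0.05), ("benign", 0.12), ("benign", 0.09),
]
print(median_score_by_category(toy_variants))
# {'LoF': 0.82, 'missense': 0.48, 'benign': 0.09}
```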

Detailed LoF analysis (n = 228):

48 variants (21 %) received high penetrance (≥ 0.75).

41 variants (18 %) received low penetrance (≤ 0.25).

High‑penetrance LoF carriers exhibited significantly reduced glomerular filtration rate (GFR); six were diagnosed with polycystic kidney disease (PKD) and one with chronic kidney disease (CKD), with several progressing to end‑stage renal disease or requiring dialysis.

Low‑penetrance carriers remained clinically stable, consistent with the idea that a lower penetrance score corresponds to milder or absent phenotypic expression.
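The high and low bands in the LoF analysis follow from two cut‑offs on the score, 0.75 and 0.25. The sketch below applies those cut‑offs and recomputes the reported percentages. The per‑variant scores are placeholders chosen only to reproduce the reported counts (48 high, 41 low, and the remaining 139 of 228 in between), and the function names are hypothetical.

```python
from collections import Counter


def penetrance_band(score, high=0.75, low=0.25):
    """Label a 0-1 penetrance score as 'high', 'low', or 'intermediate'."""
    if not 0.0 <= score <= 1.0:
        raise ValueError("penetrance score must lie in [0, 1]")
    if score >= high:
        return "high"
    if score <= low:
        return "low"
    return "intermediate"


def band_summary(scores):
    """Return {band: (count, rounded percent of all scores)}."""
    counts = Counter(penetrance_band(s) for s in scores)
    total = len(scores)
    return {band: (n, round(100 * n / total)) for band, n in counts.items()}


# Placeholder scores matching the reported LoF counts (n = 228).
lof_scores = [0.9] * 48 + [0.1] * 41 + [0.5] * 139
print(band_summary(lof_scores))
# {'high': (48, 21), 'low': (41, 18), 'intermediate': (139, 61)}
```

Both cut‑offs are inclusive, so a score of exactly 0.75 counts as high and exactly 0.25 as low, matching the ≥ and ≤ signs in the text.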

Figure 2: ML penetrance validation across variant categories

Implications

The probabilistic penetrance estimates enable nuanced risk communication. For example, carriers of BRCA1 variants receive a quantified probability range rather than a binary “high‑risk” label, allowing clinicians to prioritize follow‑up, tailor preventive interventions, and avoid unnecessary anxiety for low‑risk carriers.

Limitations and Future Work

The approach depends on the quality, completeness, and demographic representativeness of the underlying EHR dataset; under‑represented populations may limit generalizability. Planned extensions include:

Incorporating additional multi‑disease, multi‑ethnic cohorts to improve model robustness.

Embedding the penetrance model directly into electronic medical record systems for real‑time decision support.

Source: ScienceAI