Carnegie Team Uses Random Forests on 406 Samples to Detect 3.3‑Billion‑Year‑Old Life
An interdisciplinary Carnegie research team combined pyrolysis‑GC‑MS with supervised random‑forest machine learning on 406 modern and ancient samples, achieving up to 100% accuracy in distinguishing biogenic from abiotic organic matter and successfully identifying molecular biosignatures dating back 3.3 billion years.
Background
Decoding organic molecules buried in ancient rocks is crucial for understanding Earth’s history and the evolution of life, especially the origins of photosynthesis and atmospheric oxygen. Traditional methods such as fossil morphology and isotope analysis are limited to relatively recent samples, leaving a gap in the record of early biosignatures.
Dataset
The Carnegie Earth and Planetary Laboratory assembled 406 carbon‑containing samples spanning from ~3.8 Ga Archean rocks to 10 Ma recent sediments. The collection includes 141 sedimentary rocks, 65 fossils, 123 modern organisms, 42 meteorites (mostly carbonaceous chondrites), and 35 laboratory‑synthesized organic mixtures. Nine primary categories (e.g., modern animals, modern plants – photosynthetic and non‑photosynthetic, fossilized cyanobacteria, coal, oil shale, animal fossils, modern fungi, meteorites, synthetic samples) were defined for supervised learning, with an additional three auxiliary samples to aid discrimination of photosynthetic versus non‑photosynthetic organisms.
Methodology
The workflow consisted of five steps:
Collect 406 diverse carbon samples.
Extract macromolecular organic matter from meteorites and ancient rocks.
Analyze each extract with pyrolysis‑gas chromatography coupled to electron‑impact mass spectrometry (py‑GC‑MS) using a CDS 6150 probe, Agilent 8860 GC, Agilent 5999 quadrupole MS, and a 30 m 5% phenyl‑methylpolysiloxane column. The pyrolysis program heated samples at 500 °C s⁻¹ to 610 °C and held that temperature for 10 s. The GC oven ramped from 50 °C to 300 °C at 5 °C min⁻¹, and the MS scanned m/z 45‑700 with 70 eV ionization.
Convert each chromatogram into a 3 240 × 150 matrix (time × m/z) and retain 8 149 normalized features after preprocessing.
Train a supervised random‑forest classifier (Leo Breiman’s algorithm) on the feature set.
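The preprocessing step can be illustrated with a minimal sketch. The article does not say how the 3 240 × 150 matrix (486 000 raw intensities) is reduced to 8 149 features, so the normalization (scaling each sample to unit total intensity) and the filter (dropping features that are zero in every sample) are assumptions chosen only for illustration, as are the function names and toy values.

```python
from typing import List

Matrix = List[List[float]]  # rows = retention-time scans, columns = m/z channels


def flatten(matrix: Matrix) -> List[float]:
    """Unroll a time x m/z intensity matrix into one feature vector (row-major)."""
    return [value for row in matrix for value in row]


def normalize(features: List[float]) -> List[float]:
    """Scale a sample so its intensities sum to 1 (assumed normalization)."""
    total = sum(features)
    if total == 0:
        return list(features)
    return [value / total for value in features]


def drop_uninformative(samples: List[List[float]]) -> List[List[float]]:
    """Keep only feature columns that are nonzero in at least one sample (assumed filter)."""
    n_features = len(samples[0])
    keep = [j for j in range(n_features) if any(s[j] != 0 for s in samples)]
    return [[s[j] for j in keep] for s in samples]


# Toy data: two samples, each a 2 x 3 matrix standing in for 3240 x 150.
sample_a = [[0.0, 2.0, 0.0], [0.0, 6.0, 2.0]]
sample_b = [[0.0, 1.0, 0.0], [0.0, 0.0, 3.0]]
features = drop_uninformative([normalize(flatten(m)) for m in (sample_a, sample_b)])
```

Each row of `features` is then one sample's input vector for the classifier; in the toy data, three of the six raw columns survive the filter.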
Model Training and Validation
A subset of 272 samples with clear phylogenetic or physiological labels was split 75 %/25 % into training and test sets, preserving class proportions. Ten‑fold cross‑validation was repeated ten times to estimate generalisation error. Four random‑forest models were built, each addressing a different binary classification task: modern biogenic vs. abiotic, ancient biogenic vs. abiotic, ancient biogenic (excluding coal/wood) vs. abiotic, and photosynthetic vs. non‑photosynthetic.
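The splitting scheme described above can be sketched with the standard library alone. This is an illustrative sketch, not the authors' code: the helper names, seeds, and class sizes are hypothetical, and the cross-validation folds here are shuffled but not stratified, which may differ from the study.

```python
import random
from collections import defaultdict
from typing import Dict, Iterator, List, Sequence, Tuple


def stratified_split(labels: Sequence[str], test_fraction: float = 0.25,
                     seed: int = 0) -> Tuple[List[int], List[int]]:
    """Return (train_idx, test_idx), holding out test_fraction of each class."""
    rng = random.Random(seed)
    by_class: Dict[str, List[int]] = defaultdict(list)
    for index, label in enumerate(labels):
        by_class[label].append(index)
    train: List[int] = []
    test: List[int] = []
    for indices in by_class.values():
        rng.shuffle(indices)
        n_test = round(len(indices) * test_fraction)
        test.extend(indices[:n_test])
        train.extend(indices[n_test:])
    return sorted(train), sorted(test)


def repeated_kfold(indices: Sequence[int], k: int = 10, repeats: int = 10,
                   seed: int = 0) -> Iterator[Tuple[List[int], List[int]]]:
    """Yield (train_idx, validation_idx) for k-fold CV repeated `repeats` times."""
    rng = random.Random(seed)
    for _ in range(repeats):
        shuffled = list(indices)
        rng.shuffle(shuffled)  # a fresh shuffle gives different folds each repeat
        folds = [shuffled[i::k] for i in range(k)]
        for held_out in range(k):
            validation = folds[held_out]
            training = [i for j, fold in enumerate(folds) if j != held_out for i in fold]
            yield training, validation


# Toy labels: 272 samples; the class sizes are illustrative, not from the study.
labels = ["biogenic"] * 200 + ["abiotic"] * 72
train_idx, test_idx = stratified_split(labels)
splits = list(repeated_kfold(train_idx))
```

With these toy labels the split holds out a quarter of each class, and the generator produces 10 × 10 = 100 train/validation pairs over the training indices.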
Results
In preliminary pairwise tests across the nine known categories (36 combinations), 25 of 36 models achieved ≥90 % accuracy, with 19 exceeding 95 %. Specific outcomes included:
Model #1 (modern organisms vs. abiotic): 98 % overall accuracy, AUC 0.977 (train) and 1.000 (test), 10‑fold CV 98.3 %.
Model #2 (ancient biogenic vs. abiotic): 95 % accuracy on 87 biogenic samples (80 % with probability > 0.6), 90 % accuracy on 69 abiotic samples, AUC 0.924/0.926 (train/test), CV 92.7 %.
Applying Model #2 to 109 unknown ancient rocks classified 61 % as biogenic (probability > 0.5) and 29 % with probability > 0.6.
Model #3 (ancient biogenic vs. abiotic, excluding coal/wood): 100 % accuracy on biogenic samples (80 % classified as biogenic with high confidence), 77 % accuracy on abiotic samples, AUC 0.873/0.863 (train/test), CV 91.6 %.
Combining Models #2 and #3 identified 11 ancient samples as biogenic, the oldest being a 3.33 Ga sample from the Josefsdal chert in the Barberton greenstone belt (South Africa).
Trend analysis showed that the proportion of samples classified as biogenic falls with increasing geological age: 93 % in Phanerozoic, 73 % in Proterozoic, and 47 % in Archean specimens. This suggests progressive degradation of organic molecules over time or a relatively larger abiotic contribution in older rocks.
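Two of the figures quoted in the results lend themselves to a quick check: the number of pairwise tasks among nine categories, and the way probability cutoffs of 0.5 and 0.6 turn model outputs into "biogenic" and "high-confidence" fractions. The sketch below is illustrative only; the category names are placeholders and the probabilities are made up, not taken from the study.

```python
from itertools import combinations
from typing import Dict, Sequence

# Placeholder names; only the number of categories (nine) comes from the article.
categories = [f"category_{i}" for i in range(1, 10)]
pairwise_tasks = list(combinations(categories, 2))  # every unordered pair


def summarize(probabilities: Sequence[float], call: float = 0.5,
              confident: float = 0.6) -> Dict[str, float]:
    """Fraction of samples above the biogenic call and high-confidence cutoffs."""
    n = len(probabilities)
    return {
        "biogenic": sum(p > call for p in probabilities) / n,
        "high_confidence": sum(p > confident for p in probabilities) / n,
    }


# Made-up biogenic probabilities for ten hypothetical unknown samples.
example = summarize([0.2, 0.45, 0.55, 0.58, 0.7, 0.9, 0.35, 0.62, 0.51, 0.1])
```

Nine categories give 9 × 8 / 2 = 36 unordered pairs, matching the 36 combinations reported. Because the high-confidence cutoff is stricter, its fraction can never exceed the fraction called biogenic, which is why the 29 % figure for unknown rocks is a subset of the 61 %.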
Implications
The study, published in PNAS under the title “A robust, agnostic molecular biosignature based on machine learning,” demonstrates that integrating py‑GC‑MS with random‑forest classification can reliably detect highly degraded biosignatures, extending the searchable window of life on Earth to >3 Ga and offering a template for extraterrestrial sample analysis.
While the approach pushes past the limits of traditional paleobiology, the authors note remaining challenges, such as optimizing feature selection and expanding the reference library, that define future research directions.
HyperAI Super Neural
Deconstructing the sophistication and universality of technology, covering cutting-edge AI for Science case studies.
