Scanning 100 Million Hubble Images in 3 Days: ESA’s AnomalyMatch Finds Over 1,000 Rare Objects
ESA’s ESAC team introduced AnomalyMatch, a semi‑supervised active‑learning framework that, with fewer than ten labeled anomalies, processed roughly 100 million Hubble cutouts in just 2–3 days, uncovering 1,339 distinct anomalous astrophysical objects such as merging galaxies, gravitational lenses, and jellyfish galaxies.
Background and Motivation
Modern large‑scale sky surveys generate billions of images, which makes the systematic discovery of rare astrophysical anomalies (objects that test galaxy‑evolution models, theories of gravity, and cosmology) increasingly data‑intensive. Traditional supervised methods struggle because labeled examples of such anomalies are extremely scarce and the class distribution is heavily imbalanced.
Dataset Construction from the Hubble Legacy Archive
The study used source cutouts generated by O’Ryan et al. from the Hubble Legacy Archive. Only F814W‑band Wide Field Camera 3 images that were already calibrated and mosaicked were retained, yielding about 99.6 million cutouts of 150 × 150 pixels each. Each cutout was linearly stretched between the limits computed by Astropy’s ZScaleInterval and saved as a grayscale JPEG. The images were stored in roughly a thousand HDF5 files for efficient access.
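A minimal sketch of the linear stretch step, in plain Python so it runs without dependencies. The function name and the sample values are illustrative, not taken from the study. In practice the limits would come from Astropy, where ZScaleInterval().get_limits(data) returns a (vmin, vmax) pair; here they are passed in by hand.

```python
def linear_stretch_to_uint8(pixels, vmin, vmax):
    """Map pixel values linearly from [vmin, vmax] to 0..255, clipping outliers.

    `pixels` is a list of rows (a 2-D image); vmin/vmax are the interval
    limits, e.g. as returned by a z-scale interval estimator.
    """
    if vmax <= vmin:
        raise ValueError("vmax must be greater than vmin")
    stretched = []
    for row in pixels:
        out_row = []
        for value in row:
            clipped = min(max(value, vmin), vmax)        # clip to the interval
            fraction = (clipped - vmin) / (vmax - vmin)  # 0.0 .. 1.0
            out_row.append(int(round(fraction * 255)))   # 8-bit grayscale level
        stretched.append(out_row)
    return stretched


# Hypothetical 2 x 3 cutout with one negative and one saturated pixel.
cutout = [[-5.0, 0.0, 20.0], [100.0, 400.0, 20.0]]
gray = linear_stretch_to_uint8(cutout, vmin=0.0, vmax=100.0)
```

Values below vmin map to 0 and values above vmax map to 255, which is what keeps a few very bright pixels from washing out the faint structure in the rest of the cutout.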
Initial training data comprised three edge‑on protoplanetary‑disk anomalies, 128 manually verified normal cutouts, and the remaining unlabeled pool. Active‑learning iterations expanded the labeled set to 1,400 images (375 anomalies, 1,025 normals), with anomalies dominated by merging galaxies (178) and strong gravitational‑lens systems (63).
AnomalyMatch Framework
AnomalyMatch treats rare‑object detection as an extremely imbalanced binary classification problem and fuses semi‑supervised learning with an active‑learning loop. The backbone is an EfficientNet model trained on both labeled and unlabeled data. The supervised branch uses focal loss with dynamic class‑weighting and oversampling of the rare class. The unsupervised branch generates pseudo‑labels from weakly augmented views of unlabeled cutouts and applies consistency regularization against strongly augmented views of the same cutouts, encouraging the network to learn robust morphological features.
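The two loss terms described above can be sketched per sample as follows. This is an illustration of the general technique (binary focal loss plus FixMatch‑style pseudo‑labeling), not the project's code: the function names, the alpha and gamma defaults, and the 0.95 confidence threshold are all assumptions.

```python
import math


def binary_focal_loss(p_anomaly, label, alpha=0.75, gamma=2.0):
    """Focal loss for one sample.

    p_anomaly: predicted probability of the anomaly class.
    label:     1 for anomaly, 0 for normal.
    alpha:     weight on the anomaly class (assumed value).
    gamma:     focusing parameter; larger values down-weight easy samples.
    """
    p_t = p_anomaly if label == 1 else 1.0 - p_anomaly
    alpha_t = alpha if label == 1 else 1.0 - alpha
    p_t = min(max(p_t, 1e-12), 1.0)  # guard against log(0)
    return -alpha_t * (1.0 - p_t) ** gamma * math.log(p_t)


def pseudo_label(p_weak, threshold=0.95):
    """Hard label from the weakly augmented view, or None if not confident."""
    confidence = max(p_weak, 1.0 - p_weak)
    if confidence < threshold:
        return None
    return 1 if p_weak >= 0.5 else 0


def consistency_loss(p_weak, p_strong, threshold=0.95):
    """Cross-entropy of the strong view against the weak view's pseudo-label.

    Unconfident samples contribute zero loss (they are masked out).
    """
    label = pseudo_label(p_weak, threshold)
    if label is None:
        return 0.0
    p_t = p_strong if label == 1 else 1.0 - p_strong
    return -math.log(max(p_t, 1e-12))


easy = binary_focal_loss(0.9, 1)   # anomaly predicted confidently: small loss
hard = binary_focal_loss(0.1, 1)   # anomaly missed: large loss
```

The (1 - p_t)^gamma factor is what makes the loss suit imbalanced data: the many easy normal cutouts contribute almost nothing, so the gradient is dominated by the few hard or rare examples.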
Training proceeds in stages: a supervised warm‑up on the few labeled samples, followed by semi‑supervised training that incorporates pseudo‑labels from the unlabeled pool. After each epoch, the model infers an “anomaly score” for every unlabeled cutout; scores are calibrated to improve ranking reliability.
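Ranking the unlabeled pool by anomaly score is a plain top‑k selection. A small sketch, assuming scores are held in a dictionary keyed by a hypothetical cutout ID (the calibration step is not specified in this article and is not modeled here):

```python
import heapq


def top_candidates(scores, k):
    """Return the k (cutout_id, score) pairs with the highest anomaly score."""
    return heapq.nlargest(k, scores.items(), key=lambda item: item[1])


# Hypothetical scores for five unlabeled cutouts.
scores = {"c1": 0.12, "c2": 0.97, "c3": 0.55, "c4": 0.91, "c5": 0.03}
best = top_candidates(scores, k=2)
```

heapq.nlargest avoids sorting the whole pool, which matters when k is a few thousand and the pool holds on the order of 100 million scores.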
Interactive Active‑Learning Loop
A web interface presents the highest‑scoring candidates to domain experts, who quickly confirm or reject each one. Their labels are fed back into the training cycle, updating class weights and pseudo‑label thresholds, and so forming a closed loop (model recommends, expert confirms, model iterates) that continuously refines performance.
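One round of this loop can be sketched as below. Everything here is hypothetical scaffolding: the function names, the ask_expert callback standing in for the web interface, and the inverse‑frequency weighting rule are assumptions, since the article does not say how the class weights are recomputed.

```python
def active_learning_round(scores, labels, ask_expert, batch_size=3):
    """Show the top-scoring unlabeled cutouts to an expert and record answers.

    scores:     {cutout_id: anomaly score} for the whole pool.
    labels:     {cutout_id: 1 or 0} for cutouts labeled so far.
    ask_expert: callable returning 1 (anomaly) or 0 (normal) for a cutout ID.
    """
    unlabeled = [(cid, s) for cid, s in scores.items() if cid not in labels]
    unlabeled.sort(key=lambda item: item[1], reverse=True)
    updated = dict(labels)  # leave the caller's dictionary untouched
    for cid, _ in unlabeled[:batch_size]:
        updated[cid] = ask_expert(cid)
    return updated


def inverse_frequency_weights(labels):
    """Class weights proportional to 1 / class count (an assumed rule)."""
    n_anomaly = sum(1 for v in labels.values() if v == 1)
    n_normal = len(labels) - n_anomaly
    if n_anomaly == 0 or n_normal == 0:
        raise ValueError("need at least one label of each class")
    total = len(labels)
    return {"anomaly": total / (2 * n_anomaly), "normal": total / (2 * n_normal)}


scores = {"c1": 0.97, "c2": 0.10, "c3": 0.88, "c4": 0.91, "c5": 0.42}
labels = {"c2": 0}
truth = {"c1": 1, "c3": 0, "c4": 1, "c5": 0}  # stands in for the expert
labels = active_learning_round(scores, labels, ask_expert=truth.get)
weights = inverse_frequency_weights(labels)
```

Under this weighting rule, the final labeled set reported above (375 anomalies, 1,025 normals) would give the anomaly class a weight of about 1.87 and the normal class about 0.68.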
Scalability and Performance
Applying the trained model to the full Hubble archive required only about 2.5 days of compute time for a single inference pass, with support for checkpointing and incremental updates. This efficiency demonstrates the framework’s suitability for upcoming ultra‑large surveys such as Euclid and the Roman Space Telescope.
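As a back‑of‑the‑envelope check, the quoted figures imply a sustained throughput of roughly 460 cutouts per second. The arithmetic uses only the numbers stated in this article (99.6 million cutouts, about 2.5 days); the helper function is illustrative.

```python
N_CUTOUTS = 99_600_000      # cutouts in the archive sample
INFERENCE_DAYS = 2.5        # quoted time for one full inference pass

seconds = INFERENCE_DAYS * 24 * 3600
cutouts_per_second = N_CUTOUTS / seconds


def days_needed(n_images, rate_per_second):
    """Days for one inference pass over n_images at a fixed throughput."""
    return n_images / rate_per_second / 86_400


print(f"{cutouts_per_second:.0f} cutouts per second")
```

The same helper gives a rough scaling estimate for a larger survey, on the assumption that throughput stays constant and cutouts are the same size.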
Scientific Results
From the top 5,000 high‑scoring candidates, aggressive radial matching (10 arcsec radius) removed duplicates, leaving 1,339 unique anomalous objects. Expert visual inspection and cross‑matching with SIMBAD and ESASky classified them:
629 merging or interacting galaxies (≈47 % of the 1,339 objects).
Numerous strong gravitational‑lens candidates, including 39 lens arcs and many new potential lenses.
35 jellyfish galaxies, 11 clump‑type galaxies, and several overlapping systems.
13 lensed quasars and 13 relativistic‑jet host galaxies discovered without prior training on these classes, illustrating the model’s ability to generalize.
Three broader categories: “special galaxies” (irregular morphologies), “normal galaxies” (false‑positive anomalies, ~10 % of detections), and 43 “unknown” objects that defy current classification.
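The de‑duplication step that precedes this classification (radial matching within 10 arcsec) can be sketched as a greedy pass over candidates sorted by score. This is one plausible way to do it, not necessarily the authors' procedure; the dictionary fields and the sample coordinates are hypothetical.

```python
import math


def angular_separation_arcsec(ra1, dec1, ra2, dec2):
    """Great-circle separation in arcseconds (haversine); inputs in degrees."""
    ra1, dec1, ra2, dec2 = map(math.radians, (ra1, dec1, ra2, dec2))
    sin_ddec = math.sin((dec2 - dec1) / 2.0)
    sin_dra = math.sin((ra2 - ra1) / 2.0)
    a = sin_ddec ** 2 + math.cos(dec1) * math.cos(dec2) * sin_dra ** 2
    return math.degrees(2.0 * math.asin(min(1.0, math.sqrt(a)))) * 3600.0


def deduplicate(candidates, radius_arcsec=10.0):
    """Keep the highest-scoring candidate within each matching radius.

    Candidates are visited from highest to lowest score; one is kept only
    if no already-kept candidate lies within radius_arcsec of it.
    """
    kept = []
    for cand in sorted(candidates, key=lambda c: c["score"], reverse=True):
        is_new = all(
            angular_separation_arcsec(cand["ra"], cand["dec"], k["ra"], k["dec"])
            > radius_arcsec
            for k in kept
        )
        if is_new:
            kept.append(cand)
    return kept


# Hypothetical candidates: "b" lies about 3.6 arcsec from "a", "c" about 36.
candidates = [
    {"id": "a", "ra": 150.000, "dec": 2.0, "score": 0.99},
    {"id": "b", "ra": 150.001, "dec": 2.0, "score": 0.95},
    {"id": "c", "ra": 150.010, "dec": 2.0, "score": 0.90},
]
unique = deduplicate(candidates)
```

The pairwise comparison is quadratic in the number of candidates, which is fine for a list of 5,000; a much larger list would call for a spatial index instead.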
Publication
The full study, titled “Identifying astrophysical anomalies in 99.6 million source cutouts from the Hubble legacy archive using AnomalyMatch,” appears in Astronomy & Astrophysics (DOI: https://doi.org/10.1051/0004-6361/202555512).
HyperAI Super Neural
Deconstructing the sophistication and universality of technology, covering cutting-edge AI for Science case studies.
