
Pixel‑Level Foundation Model for Earth Observation Sets New SOTA Across Tasks, Excelling with Sparse Labels

A joint team from Cambridge, Aalto and Bristol introduces TESSERA, a pixel-level remote-sensing foundation model. It combines a Barlow Twins self-supervised scheme with a novel d-pixel data organization and reaches state-of-the-art accuracy on classification, segmentation and regression tasks, especially when annotations are scarce.

HyperAI Super Neural

Jun 10, 2026


Background and Motivation

Earth-observation satellites provide long-term, large-scale data crucial for agriculture, forestry, ecology and land management. Raw observations, however, suffer from clouds, irregular revisit intervals, resolution mismatches and sensor noise. These defects make direct high-precision analysis difficult, particularly for phenology and short-term disturbances.

Conventional pipelines use cloud‑removal and denoising to synthesize clean images, which improves usability but often erases fine temporal details needed for precise monitoring.

Most existing remote‑sensing foundation models are trained on heavily filtered, idealized data (cloud‑free composites or temporal averages), discarding valuable information present in imperfect observations and limiting robustness to sparse, noisy time series.

New Temporal Feature Learning Paradigm

The research team applies the Barlow Twins self‑supervised algorithm to construct a new temporal feature learning paradigm that does not filter out cloudy data. By enforcing feature consistency across different observation subsets of the same pixel, the model learns stable spatio‑temporal surface patterns, yielding representations invariant to sampling variations.
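The consistency objective can be illustrated with the standard Barlow Twins loss, which pushes the cross-correlation matrix between two views' embeddings toward the identity. The sketch below is a minimal NumPy version, not the team's training code: the function name, the off-diagonal weight `lambda_offdiag=5e-3` and the toy data are assumptions made for illustration.

```python
import numpy as np

def barlow_twins_loss(z_a, z_b, lambda_offdiag=5e-3, eps=1e-9):
    """Standard Barlow Twins objective for two (N, D) batches of embeddings.

    lambda_offdiag is an assumed value; the article does not state the weight used.
    """
    n = z_a.shape[0]
    # Standardize every embedding dimension across the batch.
    z_a = (z_a - z_a.mean(axis=0)) / (z_a.std(axis=0) + eps)
    z_b = (z_b - z_b.mean(axis=0)) / (z_b.std(axis=0) + eps)
    # Cross-correlation matrix between the two views, shape (D, D).
    c = z_a.T @ z_b / n
    diag = np.diag(c)
    # Invariance term: matching dimensions should correlate perfectly.
    on_diag = np.sum((1.0 - diag) ** 2)
    # Redundancy term: different dimensions should be decorrelated.
    off_diag = np.sum(c ** 2) - np.sum(diag ** 2)
    return float(on_diag + lambda_offdiag * off_diag)

# Toy data: two noisy views of the same signal versus an unrelated batch.
rng = np.random.default_rng(0)
shared = rng.normal(size=(256, 16))
view_a = shared + 0.05 * rng.normal(size=shared.shape)
view_b = shared + 0.05 * rng.normal(size=shared.shape)
unrelated = rng.normal(size=shared.shape)

loss_consistent = barlow_twins_loss(view_a, view_b)
loss_unrelated = barlow_twins_loss(view_a, unrelated)
```

In this toy setting the loss for the two views of the same signal comes out far lower than for the unrelated pair, which is the behaviour the training objective rewards.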

Building on this, they propose TESSERA, a pixel‑level foundation model for multimodal Sentinel‑1 (SAR) and Sentinel‑2 (optical) time series.

Dataset Construction

Two datasets are created:

A global pre-training set covering 2017 to 2024, spanning over 3,000 grid tiles and containing roughly 800 million d-pixel samples. Each d-pixel aggregates optical and SAR observations over time, accompanied by binary masks indicating cloud or missing data.

Downstream evaluation sets drawn from six public benchmarks across classification, segmentation and regression, covering regions including Germany, France, Austria, Finland and Malaysia, with both large-scale and fine-grained subsets.

Additionally, the team builds two new benchmarks: an Austrian parcel‑level crop mapping dataset and a Southeast Asian forest canopy height dataset derived from LiDAR‑corrected measurements.
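One way to picture the d-pixel used in the pre-training set is as a small per-pixel record holding both sensors' time series plus their validity masks. The container below is a hypothetical sketch: the class name `DPixel`, the field names, the 10 optical bands and the 2 SAR channels are assumptions for illustration, since the article gives only the general description.

```python
from dataclasses import dataclass
import numpy as np

@dataclass
class DPixel:
    """Hypothetical container for one pixel's multi-sensor time series."""
    optical: np.ndarray        # (T_opt, n_bands) optical reflectances
    optical_doy: np.ndarray    # (T_opt,) day of year of each optical acquisition
    optical_valid: np.ndarray  # (T_opt,) False where cloudy or missing
    sar: np.ndarray            # (T_sar, n_channels) SAR backscatter
    sar_doy: np.ndarray        # (T_sar,) day of year of each SAR acquisition
    sar_valid: np.ndarray      # (T_sar,) False where missing

    def valid_optical(self):
        """Return only the usable optical observations and their dates."""
        keep = self.optical_valid
        return self.optical[keep], self.optical_doy[keep]

    def valid_sar(self):
        """Return only the usable SAR observations and their dates."""
        keep = self.sar_valid
        return self.sar[keep], self.sar_doy[keep]

# Toy example: 6 optical dates (2 cloudy) and 4 SAR dates (1 missing).
rng = np.random.default_rng(0)
pixel = DPixel(
    optical=rng.random((6, 10)).astype(np.float32),  # 10 bands is an assumption
    optical_doy=np.array([12, 47, 98, 160, 231, 300]),
    optical_valid=np.array([True, False, True, True, False, True]),
    sar=rng.random((4, 2)).astype(np.float32),       # 2 channels is an assumption
    sar_doy=np.array([5, 90, 180, 270]),
    sar_valid=np.array([True, True, False, True]),
)
optical_obs, optical_days = pixel.valid_optical()
sar_obs, sar_days = pixel.valid_sar()
```

Keeping the masks next to the raw values, rather than dropping cloudy dates up front, means the choice of which observations to use can be deferred to training time.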

Model Architecture and Training

TESSERA uses a dual-branch encoder with separate optical and SAR streams. The encoder embeds the valid observations, adds learnable day-of-year position encodings, and processes the sequence with a Transformer followed by a gated recurrent unit (GRU) to capture long-range dependencies. The fused multimodal representation is a 128-dimensional vector, later quantized to 8-bit integers, which cuts storage by about 75% with negligible accuracy loss.
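The storage figure is consistent with simple arithmetic: a 128-dimensional vector takes 512 bytes as 32-bit floats and 128 bytes as 8-bit integers, a 75% reduction. The sketch below shows one common scheme, symmetric linear quantization with a single scale per vector. The article does not say which scheme TESSERA uses, so the functions and the float32 baseline are assumptions.

```python
import numpy as np

def quantize_int8(embedding):
    """Symmetric linear quantization to int8 with one scale per vector (assumed scheme)."""
    max_abs = float(np.max(np.abs(embedding)))
    scale = max_abs / 127.0 if max_abs > 0 else 1.0
    q = np.clip(np.round(embedding / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize_int8(q, scale):
    """Map int8 codes back to approximate float32 values."""
    return q.astype(np.float32) * np.float32(scale)

rng = np.random.default_rng(0)
embedding = rng.normal(size=128).astype(np.float32)  # stand-in for a model embedding

q, scale = quantize_int8(embedding)
restored = dequantize_int8(q, scale)

bytes_fp32 = embedding.nbytes            # 128 values * 4 bytes
bytes_int8 = q.nbytes                    # 128 values * 1 byte
saving = 1.0 - bytes_int8 / bytes_fp32   # fraction of storage removed
max_error = float(np.max(np.abs(restored - embedding)))
```

The per-vector scale adds a few bytes of overhead that this calculation ignores, and the rounding error per component is bounded by half the scale.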

During pre‑training, each d‑pixel yields two random observation subsets (different “views”). The model is trained to map both views to consistent embeddings, encouraging learning of underlying stable surface rules rather than snapshot features. Mixed regularization and global shuffling further improve robustness to observation perturbations and spatial autocorrelation.
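The two-view step can be sketched as drawing two independent random subsets from the valid observations of one pixel. This is a minimal illustration under stated assumptions: the helper `sample_view`, the view size of 8 dates and the fallback to sampling with replacement are hypothetical choices, not details given in the article.

```python
import numpy as np

def sample_view(values, doy, valid, n_samples, rng):
    """Draw a random subset of valid observations, ordered by day of year."""
    valid_idx = np.flatnonzero(valid)
    if valid_idx.size == 0:
        raise ValueError("pixel has no valid observations")
    # Fall back to sampling with replacement when too few valid dates exist.
    replace = valid_idx.size < n_samples
    chosen = rng.choice(valid_idx, size=n_samples, replace=replace)
    chosen = chosen[np.argsort(doy[chosen], kind="stable")]
    return values[chosen], doy[chosen]

# Toy pixel: 40 acquisition dates, 10 bands, roughly 40% of dates masked out.
rng = np.random.default_rng(0)
doy = np.sort(rng.choice(np.arange(1, 366), size=40, replace=False))
values = rng.normal(size=(40, 10))
valid = rng.random(40) > 0.4

# Two independent "views" of the same pixel, each built from 8 valid dates.
view_a, doy_a = sample_view(values, doy, valid, 8, rng)
view_b, doy_b = sample_view(values, doy, valid, 8, rng)
```

Both views would then pass through the same encoder, and a consistency loss such as Barlow Twins would compare the resulting embeddings.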

Experimental Evaluation

Experiments compare TESSERA against several remote-sensing foundation models and classic vision models under three annotation ratios (1%, 30%, 100%). Lightweight adapters are used for downstream inference to ensure a fair comparison.
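The annotation ratios can be reproduced in spirit by subsampling the labelled training set. The sketch below uses class-stratified sampling so that every class keeps at least one example at the 1% setting. The article does not describe how the subsets were drawn, so the function `subsample_labels` and the stratification are assumptions.

```python
import numpy as np

def subsample_labels(labels, ratio, rng):
    """Return sorted indices of a class-stratified subset with about `ratio` of the labels."""
    keep = []
    for cls in np.unique(labels):
        idx = np.flatnonzero(labels == cls)
        # Keep at least one example per class, even at very low ratios.
        n_keep = max(1, int(round(ratio * idx.size)))
        keep.append(rng.choice(idx, size=n_keep, replace=False))
    return np.sort(np.concatenate(keep))

# Toy label vector: 10,000 samples spread over 5 classes.
rng = np.random.default_rng(0)
labels = rng.integers(0, 5, size=10_000)

# One training subset per annotation ratio used in the experiments.
subsets = {ratio: subsample_labels(labels, ratio, rng) for ratio in (0.01, 0.30, 1.00)}
```

Each subset would be used to fit the same lightweight adapter on frozen embeddings, so that differences in accuracy reflect the representation rather than the downstream head.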

Classification: Across national-scale tree species and fine-grained crop classification tasks, TESSERA maintains stable performance even with only 1% labeled data, achieving about 8 percentage points higher accuracy than the best baseline, thanks to its modeling of long-term phenological cycles.

Segmentation: On large-scale farmland segmentation, TESSERA reaches leading accuracy with full labels and outperforms all baselines in low-label settings, using only a lightweight decoder while preserving deployment efficiency. Boundaries are clearer and inter-crop confusion is reduced.

Regression: For above-ground biomass estimation and forest canopy height inversion, TESSERA consistently yields the lowest prediction errors and the most continuous spatial outputs, closely matching LiDAR ground truth in the canopy height task.

Overall, TESSERA shows stable advantages across all three task families, with pronounced gains under sparse annotations, data sparsity and missing observations, indicating stronger robustness and generalization than models reliant on high‑quality training data.

Implications

The work challenges the belief that remote‑sensing foundation models require ideal data, demonstrating that self‑supervised learning on raw, imperfect observations can produce robust, high‑performing representations. While data cleaning remains valuable, shifting focus toward models that can directly handle noisy, incomplete data may accelerate progress toward universal Earth‑observation AI.


