Ant Group Unveils SkySense: 2.06‑Billion‑Parameter Multimodal Remote‑Sensing Foundation Model Accepted at CVPR 2024
Ant Group introduced SkySense, a 2.06‑billion‑parameter multimodal remote‑sensing foundation model that outperformed 18 international rivals across 17 benchmark tasks, was accepted to CVPR 2024, and aims to support applications such as agriculture, urban planning, and disaster response.
On February 28, Ant Group launched SkySense, a 2.06‑billion‑parameter multimodal remote‑sensing foundation model and the latest multimodal research result from its Baoling large‑model platform. The accompanying paper has been accepted at CVPR 2024, a top computer‑vision conference.
Evaluation results show that SkySense surpasses all international peer models across 17 test scenarios, making it the largest‑scale, most comprehensive, and most accurate multimodal remote‑sensing foundation model to date. It can be used to observe and interpret terrain and crops, effectively assisting agricultural production and management.
Caption: SkySense outperforms the latest international remote‑sensing models in all 17 evaluations.
With the rapid development of artificial intelligence, the combination of large‑model technology and satellite remote sensing has produced new breakthroughs. SkySense is a multimodal remote‑sensing model built on Ant Group’s Baoling large‑model platform.
SkySense was evaluated on 17 internationally recognized public datasets covering seven common remote‑sensing perception tasks such as land‑use monitoring, high‑resolution target recognition, and change detection. It was compared with 18 leading global models, including IBM‑NASA’s Prithvi.
The results show that SkySense ranked first in all 17 evaluations; on the FAIR1M2.0 high‑resolution remote‑sensing object‑detection leaderboard, for example, its mean average precision (mAP) exceeded the runner‑up's by more than 3%.
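For readers unfamiliar with the metric, mAP averages per‑class average precision (AP), the area under each class's precision‑recall curve. The sketch below shows the bare arithmetic; real detection benchmarks such as FAIR1M additionally match predicted boxes to ground truth by IoU before labeling each detection a true or false positive, a step omitted here.

```python
# Minimal sketch of AP/mAP arithmetic (illustrative only; box matching omitted).
from typing import List

def average_precision(is_tp: List[bool], num_gt: int) -> float:
    """AP for one class. Detections are pre-sorted by descending confidence;
    is_tp[i] says whether detection i matched a ground-truth object."""
    tp = fp = 0
    ap = 0.0
    for hit in is_tp:
        if hit:
            tp += 1
            # Accumulate area under the precision-recall curve,
            # one recall step (1/num_gt) at a time.
            ap += (tp / (tp + fp)) / num_gt
        else:
            fp += 1
    return ap

def mean_average_precision(per_class_ap: List[float]) -> float:
    """mAP is simply the mean of per-class AP values."""
    return sum(per_class_ap) / len(per_class_ap)

# Example: 3 ground-truth objects, 4 ranked detections for one class.
print(average_precision([True, False, True, True], num_gt=3))  # ~0.806
print(mean_average_precision([0.82, 0.74, 0.91]))              # ~0.823
```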
CVPR, the IEEE‑organized Conference on Computer Vision and Pattern Recognition, is widely regarded as one of the three top conferences in the field.
Traditional remote‑sensing image understanding focuses on single‑modality, single‑task modeling, lacking integrated modeling of multimodal data, time series, and geographic priors, which limits generalization across massive data and diverse tasks.
SkySense addresses these limitations by jointly modeling text, infrared, visible‑light, and synthetic‑aperture radar (SAR) imagery across multiple resolutions and temporal sequences, delivering strong performance across varied tasks. Leveraging Baoling's multimodal capabilities, researchers pre‑trained the model on an internally built remote‑sensing dataset of 1.9 billion images, yielding a 2.06‑billion‑parameter model that is the largest, most task‑comprehensive, and most accurate multimodal remote‑sensing model worldwide. SkySense can be applied to urban planning, forest protection, emergency rescue, green finance, agricultural monitoring, and more; its data and inference services are currently offered through Ant's internal MEarth platform.
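To make the fusion idea concrete, here is a minimal sketch, not Ant's released code, of the general pattern the paragraph describes: a separate encoder per modality, temporal pooling over an image time series, and a shared fusion head. All module names, channel counts, and the concatenation‑based fusion are illustrative assumptions rather than SkySense's actual architecture.

```python
# Toy multimodal remote-sensing model: per-modality encoders + temporal
# pooling + late fusion. Purely illustrative; not SkySense's architecture.
import torch
import torch.nn as nn

class ModalityEncoder(nn.Module):
    """Tiny conv backbone that maps an image to a global pooled embedding."""
    def __init__(self, in_channels: int, embed_dim: int = 256):
        super().__init__()
        self.backbone = nn.Sequential(
            nn.Conv2d(in_channels, 64, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(64, embed_dim, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1),  # -> (B, embed_dim, 1, 1)
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.backbone(x).flatten(1)  # (B, embed_dim)

class MultimodalFusion(nn.Module):
    """Encodes an optical time series and a SAR image, fuses by concatenation."""
    def __init__(self, embed_dim: int = 256, num_classes: int = 10):
        super().__init__()
        self.optical = ModalityEncoder(in_channels=3, embed_dim=embed_dim)
        self.sar = ModalityEncoder(in_channels=2, embed_dim=embed_dim)  # e.g. VV/VH bands
        self.head = nn.Linear(2 * embed_dim, num_classes)

    def forward(self, optical_seq: torch.Tensor, sar: torch.Tensor) -> torch.Tensor:
        # optical_seq: (B, T, 3, H, W); pool embeddings over the T acquisition dates.
        b, t, c, h, w = optical_seq.shape
        opt = self.optical(optical_seq.reshape(b * t, c, h, w))
        opt = opt.reshape(b, t, -1).mean(dim=1)        # temporal average pooling
        fused = torch.cat([opt, self.sar(sar)], dim=1)  # late fusion by concatenation
        return self.head(fused)

if __name__ == "__main__":
    model = MultimodalFusion()
    logits = model(torch.randn(2, 4, 3, 64, 64), torch.randn(2, 2, 64, 64))
    print(logits.shape)  # torch.Size([2, 10])
```

Concatenation is the simplest fusion choice; published remote‑sensing foundation models often use attention‑based fusion and add geographic or temporal positional priors, consistent with the integrated modeling this article describes.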
Ant Group also plans to open SkySense’s model parameters to the industry to foster collaborative development of intelligent remote‑sensing technology.
SkySense was jointly developed by Ant’s AI innovation team NextEvo and Wuhan University. NextEvo leads Baoling’s development and focuses on computer vision, NLP, multimodal AI, AIGC, digital humans, and AI engineering.
Last year, the multimodal team was upgraded under the leadership of Dr. Yang Ming, a world‑renowned computer‑vision expert who holds a Ph.D. from Northwestern University and was a founding member of Facebook AI Research (FAIR). He previously worked at NEC's U.S. labs, FAIR, and Horizon Robotics.
Currently, Ant’s multimodal research results are applied in large‑scale AI interactions for Alipay’s “Five Blessings” festival and Ant Medical’s digital‑human scenarios.