2020 Computer Vision Breakthroughs: Self‑Supervised Learning, Transformer Attention Modeling, and Neural Radiance Fields
The talk reviews three major 2020 advances in computer vision—self‑supervised learning surpassing supervised pre‑training, the successful adoption of Transformer‑based attention models for detection and classification, and the emergence of Neural Radiance Fields for view synthesis—while highlighting related research from Microsoft Research Asia and the broader community.
In this presentation, Hu Han (MSRA) and editor Zhu Yushi introduce the most impactful computer‑vision research of 2020, focusing on three breakthroughs: self‑supervised learning, Transformer‑based attention modeling, and Neural Radiance Fields (NeRF).
1. Self‑Supervised Learning – 2020 saw self‑supervised methods (MoCo, SimCLR) outperform supervised pre‑training on downstream tasks for the first time, a milestone for the field. The importance of self‑supervision is motivated by Yann LeCun's "cake" analogy and by parallels with how human infants learn. The talk reviews the traditional supervised pre‑training + fine‑tuning paradigm, then the newer self‑supervised pre‑training + fine‑tuning paradigm, exemplified by MoCo's gains across seven downstream tasks.
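MoCo and SimCLR both build on a contrastive objective: embeddings of two augmented views of the same image are pulled together, while other images in the batch serve as negatives. A minimal NumPy sketch of this InfoNCE-style loss (the function name and batch layout are illustrative, not the papers' exact implementations):

```python
import numpy as np

def info_nce_loss(queries, keys, temperature=0.1):
    """Contrastive loss over a batch: row i of `queries` and row i of
    `keys` embed two augmented views of the same image."""
    q = queries / np.linalg.norm(queries, axis=1, keepdims=True)
    k = keys / np.linalg.norm(keys, axis=1, keepdims=True)
    logits = q @ k.T / temperature               # (N, N) pairwise similarities
    logits -= logits.max(axis=1, keepdims=True)  # numerical stability
    probs = np.exp(logits)
    probs /= probs.sum(axis=1, keepdims=True)
    # positives sit on the diagonal; everything else is a negative
    return float(-np.log(np.diag(probs)).mean())
```

MoCo additionally keeps a large queue of negative keys produced by a momentum encoder, while SimCLR relies on large in-batch negatives; the loss itself is the shared core.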
Subsequent developments include PIC (a single‑branch unsupervised feature learner) and PixPro (pixel‑level self‑supervision), which improve dense prediction tasks such as object detection and segmentation. PixPro introduces a pixel propagation module (smoothing each pixel's feature with those of similar pixels) and can drop the pixel‑level contrastive branch entirely, achieving notable gains on Pascal VOC and other dense‑prediction benchmarks.
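The core idea of pixel-level self-supervision can be sketched as a consistency loss between per-pixel features of two augmented views: pixels that originate from the same location in the source image should produce similar features. This is a simplified illustration, not PixPro's exact formulation (which adds the propagation module on one branch):

```python
import numpy as np

def pixel_consistency_loss(feat_a, feat_b, matches):
    """feat_a, feat_b: (H*W, C) per-pixel features from two augmented
    views; `matches` pairs indices that map to the same source pixel."""
    a = feat_a / np.linalg.norm(feat_a, axis=1, keepdims=True)
    b = feat_b / np.linalg.norm(feat_b, axis=1, keepdims=True)
    cos = np.array([a[i] @ b[j] for i, j in matches])
    return float((1.0 - cos).mean())  # 0 when matched pixels agree exactly
```

Because the supervision signal is per-pixel rather than per-image, the learned features transfer better to dense tasks such as detection and segmentation.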
2. Transformer Attention Modeling in Vision – Transformers, long dominant in NLP, were successfully applied to vision in 2020 through works such as DETR (end‑to‑end object detection) and the Vision Transformer (ViT) for image classification. RelationNet++ uses a Transformer decoder to fuse multiple object‑representation schemes, reaching 52.7 mAP on COCO. The talk also surveys earlier attention‑based work such as non‑local networks (NLNet) and recent advances that replace or complement convolutions with attention mechanisms for pixel–pixel, object–object, and object–pixel relationships.
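All of these models share the same primitive: scaled dot-product attention, in which every query attends to every key and returns a weighted mixture of the values. A minimal single-head sketch (variable names illustrative):

```python
import numpy as np

def attention(Q, K, V):
    """Single-head scaled dot-product attention: softmax(QK^T / sqrt(d)) V."""
    d = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d)                 # (n_q, n_k) similarities
    scores -= scores.max(axis=-1, keepdims=True)  # numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)  # softmax over keys
    return weights @ V                            # mixture of values
```

In ViT the queries, keys, and values come from image patches treated as tokens; in DETR's decoder, learned object queries attend to encoded image features, which is how attention naturally expresses the object–pixel relationships the talk discusses.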
Additional topics cover video‑based pre‑training, multimodal self‑supervision, and the unification of CV and NLP modeling via Transformers, highlighting the shift toward a common modeling framework across modalities.
3. Neural Radiance Fields (NeRF) – NeRF is presented as a landmark achievement for low‑level vision, enabling high‑quality view synthesis by representing scenes as continuous radiance fields.
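NeRF represents a scene as a function mapping a 3D position and viewing direction to a density and color, and renders a pixel by compositing samples along the camera ray with the standard volume-rendering quadrature. A sketch of that compositing step for one ray (variable names illustrative; the density/color network itself is omitted):

```python
import numpy as np

def render_ray(sigmas, colors, deltas):
    """Composite a pixel color from samples along one camera ray.
    sigmas: (N,) volume densities; colors: (N, 3) RGB per sample;
    deltas: (N,) spacing between consecutive samples."""
    alpha = 1.0 - np.exp(-sigmas * deltas)  # opacity of each segment
    # transmittance: probability the ray reaches sample i unoccluded
    trans = np.cumprod(np.concatenate(([1.0], 1.0 - alpha[:-1])))
    weights = trans * alpha
    return (weights[:, None] * colors).sum(axis=0)
```

Because this rendering is differentiable, the scene representation can be optimized directly from posed photographs, which is what makes the high-quality view synthesis possible.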
The presentation concludes that computer vision is entering an era dominated by self‑supervised and Transformer‑based attention models, which are poised to become a unified modeling paradigm for both vision and language tasks.
DataFunTalk
Dedicated to sharing and discussing big data and AI technology applications, aiming to empower a million data scientists. Regularly hosts live tech talks and curates articles on big data, recommendation/search algorithms, advertising algorithms, NLP, intelligent risk control, autonomous driving, and machine learning/deep learning.