Unlocking Vision AI: Inside Alibaba’s EasyCV All‑in‑One Self‑Supervised & Transformer Framework

EasyCV is Alibaba’s open‑source, PyTorch‑based visual modeling platform that unifies self‑supervised learning and Transformer techniques, offering a comprehensive algorithm suite, pre‑trained models, high‑performance training/inference optimizations, extensible architecture, and seamless cloud deployment for a wide range of computer‑vision tasks.

Alibaba Cloud Developer

Apr 26, 2022

Introduction

In recent years, self‑supervised learning and Transformers have reshaped computer vision. To bring these advances to Alibaba Cloud, the PAI team built EasyCV, an all‑in‑one visual modeling toolbox that bundles a rich set of self‑supervised algorithms and state‑of‑the‑art Vision Transformers, covering image classification, metric learning, object detection, and key‑point detection. EasyCV offers out‑of‑the‑box training and inference with deep performance optimizations, and it is fully compatible with Alibaba’s Lingjie system.

What is EasyCV

EasyCV is an open‑source, PyTorch‑based framework focused on self‑supervised learning and Transformer techniques. It powers many Alibaba business units (search, e‑commerce, video, travel) and serves enterprise customers on Alibaba Cloud. The project is hosted at https://github.com/alibaba/EasyCV.

Project Background

Self‑supervised pre‑training now rivals supervised methods on many vision tasks, while Transformers have set new SOTA results across the board. However, fragmented codebases and inconsistent implementations hinder reproducibility and performance. EasyCV consolidates SOTA self‑supervised algorithms and Transformer models into a unified, easy‑to‑use framework, adding IO optimizations, training acceleration, quantization, and model management via PAI.

Main Features

Comprehensive self‑supervised algorithm suite (SimCLR, MoCo v1/v2, SwAV, MoBY, DINO, MAE) with benchmark tools.

Rich model zoo with both CNN backbones (ResNet, ResNeXt, HRNet, DarkNet, Inception, MobileNet, etc.) and Vision Transformers (ViT, Swin, Timm models).

Highly extensible architecture: configurable training, evaluation, export, and inference APIs; modular design supports custom necks, heads, data pipelines, and evaluators.

High‑performance training: multi‑node, multi‑GPU, fp16 support, DALI‑accelerated data loading, TFRecord format for large‑scale self‑supervised workloads.

Technical Architecture

The engine is built on PyTorch and integrates the PyTorch training accelerator. The framework consists of four layers:

Framework layer reusing openmmlab/mmcv interfaces, providing Trainer, Hooks, Evaluators, and visualization utilities.

Data layer abstracting various data sources (CIFAR, ImageNet, COCO) and supporting raw images and TFRecord formats, with configurable preprocessing pipelines.

Model layer offering modular backbones, losses, necks, and task‑specific heads; the ModelZoo includes self‑supervised, classification, metric learning, detection, and key‑point algorithms.

Inference layer delivering end‑to‑end APIs, optimized by PAI‑Blade for both online and offline scenarios.
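
The Trainer and Hooks split in the framework layer can be pictured with a small, dependency‑free sketch. This is not EasyCV or mmcv code: `Hook`, `LossLoggerHook`, and `MiniTrainer` are hypothetical names used only to show how callbacks can attach to a training loop without modifying it.

```python
class Hook:
    """Base class: every callback is a no-op unless a subclass overrides it."""

    def before_epoch(self, trainer):
        pass

    def after_iter(self, trainer):
        pass

    def after_epoch(self, trainer):
        pass


class LossLoggerHook(Hook):
    """Hypothetical hook that records the mean loss of each epoch."""

    def __init__(self):
        self.epoch_means = []
        self._losses = []

    def before_epoch(self, trainer):
        self._losses = []

    def after_iter(self, trainer):
        self._losses.append(trainer.last_loss)

    def after_epoch(self, trainer):
        if self._losses:  # skip epochs that saw no batches
            self.epoch_means.append(sum(self._losses) / len(self._losses))


class MiniTrainer:
    """Toy trainer: `step_fn` stands in for forward, backward, and optimizer step."""

    def __init__(self, step_fn, hooks=()):
        self.step_fn = step_fn
        self.hooks = list(hooks)
        self.last_loss = None

    def _call(self, stage):
        # Dispatch one stage name to every registered hook, in order.
        for hook in self.hooks:
            getattr(hook, stage)(self)

    def fit(self, batches, epochs):
        for _ in range(epochs):
            self._call('before_epoch')
            for batch in batches:
                self.last_loss = self.step_fn(batch)
                self._call('after_iter')
            self._call('after_epoch')


logger = LossLoggerHook()
trainer = MiniTrainer(step_fn=lambda batch: float(sum(batch)), hooks=[logger])
trainer.fit(batches=[[1, 2], [3, 4]], epochs=2)
print(logger.epoch_means)
```

Because the loop only knows about stage names, logging, checkpointing, or evaluation can be added as further hooks without editing `fit`.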

Ease of Use

Training can be launched via configuration files or Python APIs. Example command‑line usage:

# Config file method
python tools/train.py configs/classification/cifar10/r50.py --work_dir work_dirs/classification/cifar10/r50 --fp16

# Simple argument method
python tools/train.py --model_type Classification --model.num_classes 10 \
    --data.data_source.type ClsSourceImageList --data.data_source.list data/train.txt
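
The dotted flags in the simple argument method, such as `--model.num_classes`, read naturally as paths into a nested configuration. How EasyCV parses them internally is not covered here; the sketch below only illustrates the general idea, and `apply_overrides` and the base config values are hypothetical.

```python
import copy


def apply_overrides(config, overrides):
    """Return a copy of `config` with dotted-key overrides merged in."""
    merged = copy.deepcopy(config)
    for dotted_key, value in overrides.items():
        *parents, leaf = dotted_key.split('.')
        node = merged
        for part in parents:
            # Create intermediate dicts on demand, e.g. data -> data_source.
            node = node.setdefault(part, {})
        node[leaf] = value
    return merged


# Hypothetical base config; the real r50.py contents are not shown in this article.
base = {'model': {'type': 'Classification', 'num_classes': 1000}}
overrides = {
    'model.num_classes': 10,
    'data.data_source.type': 'ClsSourceImageList',
    'data.data_source.list': 'data/train.txt',
}
cfg = apply_overrides(base, overrides)
print(cfg)
```

Working on a deep copy keeps the base configuration reusable across runs with different overrides.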

Python API example:

import easycv.tools
config_path = 'configs/classification/cifar10/r50.py'
easycv.tools.train(config_path, gpus=8, fp16=False, master_port=29527)

Inference example:

import cv2
from easycv.predictors.classifier import TorchClassifier
# Checkpoint exported at the end of training
output_ckpt = 'work_dirs/classification/cifar10/r50/epoch_350_export.pth'
tcls = TorchClassifier(output_ckpt)
# OpenCV loads images in BGR order; convert to RGB before prediction
img = cv2.imread('aeroplane_s_000004.png')
img = cv2.cvtColor(img, cv2.COLOR_BGR2RGB)
# predict() takes a list of images
output = tcls.predict([img])
print(output)

Extensibility

All modules are registered and can be instantiated via configuration files, allowing users to swap backbones, heads, or evaluators without code changes. Custom necks, heads, and data pipelines can be added by registering them with the framework.

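import torch.nn as nn
from easycv.models.registry import NECKS  # assumed import path; check your EasyCV version

@NECKS.register_module()
class Projection(nn.Module):
    """Customized neck: a single linear projection."""
    def __init__(self, input_size, output_size):
        super().__init__()  # nn.Module must be initialized before submodules are assigned
        self.proj = nn.Linear(input_size, output_size)
    def forward(self, input):
        return self.proj(input)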

model = dict(
    type='Classification',
    backbone=dict(...),
    neck=dict(type='Projection', input_size=2048, output_size=512),
    head=dict(type='ClsHead', embedding_size=512, num_classes=1000)
)
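
The register‑then‑build pattern behind this config can be sketched in plain Python. This is a simplified illustration loosely modeled on the mmcv‑style registry, not EasyCV's actual implementation; `Registry`, `TOY_NECKS`, and `Scale` are hypothetical names.

```python
class Registry:
    """Minimal name -> class registry with a config-driven build step."""

    def __init__(self, name):
        self.name = name
        self._modules = {}

    def register_module(self):
        def decorator(cls):
            if cls.__name__ in self._modules:
                raise KeyError(f'{cls.__name__} is already registered in {self.name}')
            self._modules[cls.__name__] = cls
            return cls  # return the class unchanged so it stays usable directly
        return decorator

    def build(self, cfg):
        args = dict(cfg)  # shallow copy so the caller's config is left untouched
        module_type = args.pop('type')
        if module_type not in self._modules:
            raise KeyError(f'{module_type} is not registered in {self.name}')
        return self._modules[module_type](**args)


TOY_NECKS = Registry('neck')


@TOY_NECKS.register_module()
class Scale:
    """Hypothetical neck that multiplies every feature by a constant."""

    def __init__(self, factor):
        self.factor = factor

    def __call__(self, features):
        return [self.factor * x for x in features]


neck_cfg = dict(type='Scale', factor=2)
neck = TOY_NECKS.build(neck_cfg)
print(neck([1, 2, 3]))  # [2, 4, 6]
```

The `type` key selects the class and the remaining keys become constructor arguments, which is why swapping a component only requires editing the config.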

Performance Highlights

Training benchmarks show substantial speedups when using DALI + TFRecord instead of raw JPEG loading (e.g., 140% faster for dino_deit_small_p16 with fp16 at a batch size of 32×8). Multi‑node, fp16, and large‑batch configurations further improve throughput across models such as MoCo‑v2, SwAV, and MAE.
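
A figure like "140% faster" is easy to misread, so a short calculation may help. The sketch assumes the figure means throughput rose by 140% over the baseline, which the benchmark text does not state explicitly; the helper names are hypothetical.

```python
def speedup_ratio(percent_faster):
    """Throughput relative to the baseline, e.g. 140 -> 2.4x."""
    return 1.0 + percent_faster / 100.0


def relative_time(percent_faster):
    """Fraction of the baseline wall-clock time needed for the same work."""
    return 1.0 / speedup_ratio(percent_faster)


ratio = speedup_ratio(140)
fraction = relative_time(140)
print(f'{ratio:.1f}x throughput, {fraction:.1%} of the baseline wall-clock time')
```

Under that assumption, a job that took 10 hours with raw JPEG loading would take a little over 4 hours with DALI + TFRecord.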

Application Scenarios

Business‑unit (BU) pipelines run self‑supervised pre‑training on millions of images and then fine‑tune on downstream tasks, achieving measurable gains (e.g., +1% over baselines).

Public‑cloud users can run end‑to‑end workflows—from data annotation to model serving—across classification, detection, segmentation, and key‑point tasks with minimal configuration.

Specific use cases include smart inspection of worker installations, image‑based recommendation feature extraction (CTR boost >10%), and custom defect detection models for panel manufacturers.

Roadmap

Monthly releases focusing on Transformer classification performance and benchmarks.

Expand self‑supervised benchmarks to detection and segmentation.

Develop more Transformer‑based downstream tasks (detection & segmentation).

Add dataset download and training APIs for common image tasks.

Integrate further inference optimizations and edge‑deployment support.

Long‑term: explore efficient Transformers, multimodal pre‑training, and lightweight joint training‑inference optimizations.

References

Model compression tutorial: https://github.com/alibaba/EasyCV/blob/master/docs/source/tutorials/compression.md

PAI‑Blade: https://www.aliyun.com/activity/bigdata/blade

Similar image matching solution: https://help.aliyun.com/document_detail/313270.html

PAI product page: https://www.aliyun.com/product/bigdata/learn?spm=5176.19720258.J_3207526240.78.e9392c4aJWW64C
