
How to Build Multimodal Image Tagging with RAM and BERT in DataWorks Notebook

This tutorial walks through using DataWorks Notebook with GPU support to combine the open‑vocabulary visual model RAM with the language model BERT for zero‑shot multimodal image tagging. It covers environment setup, model installation, dataset preparation, the tagging code, and result visualization.

Alibaba Cloud Big Data AI Platform

Mar 21, 2025


Overview

DataWorks is an all‑in‑one intelligent data development and governance platform that integrates more than ten years of Alibaba’s big‑data construction methodology. Its Notebook provides an interactive environment that supports GPU resources, enabling end‑to‑end data cleaning, feature engineering, model training and inference in a single platform.

Goal

This tutorial demonstrates how to use the open‑vocabulary visual model RAM together with the language model BERT to perform zero‑shot multimodal image tagging within a DataWorks Notebook.

Preparation

Enter DataWorks Gallery and load the “Image Tagging with RAM and BERT” case.

Create a workspace and a personal development environment, selecting a GPU instance (e.g., 24 GB A10).

Choose the official DSW image modelscope:1.18.0-pytorch2.3.0-gpu-py310-cu121-ubuntu22.04.
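Before installing anything, it can help to confirm that the notebook kernel actually sees the GPU. The following is a minimal sketch, assuming PyTorch ships with the image selected above; the helper name `pick_device` is hypothetical. It mirrors the device selection used later in the tagging operator and falls back to CPU when PyTorch or CUDA is unavailable.

```python
def pick_device():
    """Return 'cuda:0' when PyTorch can see a GPU, otherwise 'cpu'."""
    try:
        import torch  # assumed to be preinstalled in the selected image
    except ImportError:
        return "cpu"
    return "cuda:0" if torch.cuda.is_available() else "cpu"


print("Inference device:", pick_device())
```

If this prints `cpu` on a GPU instance, the instance type or image selection is worth rechecking before downloading the models.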

Install RAM

! pip install git+https://github.com/xinyu1205/recognize-anything.git

Download Dataset

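import os
# Region of the current DataWorks environment, read from an environment variable.
region = os.getenv('DATAWORKS_REGION')
# IPython substitutes {region} into the shell command before running it.
!wget https://dataworks-notebook-{region}.oss-{region}-internal.aliyuncs.com/public-datasets/Image_Tagging_GPU/Qwen2-VL-History.zip
# -p keeps the cell re-runnable when ./data already exists.
!mkdir -p data && unzip -q Qwen2-VL-History.zip -d ./data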
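The download URL depends on the `DATAWORKS_REGION` environment variable, so an unset variable produces a malformed host name. The sketch below reproduces the URL pattern from the `wget` command and fails early when the region is missing. The helper `dataset_url` is hypothetical and is not part of DataWorks.

```python
import os


def dataset_url(region, object_key):
    """Build the internal OSS URL for a file in the Image_Tagging_GPU public dataset."""
    if not region:
        raise ValueError("DATAWORKS_REGION is not set; run this inside DataWorks Notebook")
    host = f"dataworks-notebook-{region}.oss-{region}-internal.aliyuncs.com"
    return f"https://{host}/public-datasets/Image_Tagging_GPU/{object_key}"


region = os.getenv("DATAWORKS_REGION")
if region:
    print(dataset_url(region, "Qwen2-VL-History.zip"))
else:
    print("DATAWORKS_REGION is not set")
```

Because the host is an `-internal` endpoint, the URL is presumably reachable only from inside the matching Alibaba Cloud region.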

Download Models

# -p keeps the cell re-runnable when ./models already exists.
!mkdir -p models
# RAM++ checkpoint (Swin-Large backbone).
!wget -P ./models https://dataworks-notebook-{region}.oss-{region}-internal.aliyuncs.com/public-datasets/Image_Tagging_GPU/ram_plus_swin_large_14m.pth
# BERT files, later passed to ram_plus as text_encoder_type.
!wget https://dataworks-notebook-{region}.oss-{region}-internal.aliyuncs.com/public-datasets/Image_Tagging_GPU/bert-base-uncased.zip
!unzip -q bert-base-uncased.zip -d ./models
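A quick existence check can catch a failed download before the model load fails with a less obvious error. This sketch assumes the archive unpacks to `./models/bert-base-uncased`, which is the path the tagging operator passes to `ram_plus`. The helper `missing_paths` is hypothetical.

```python
import os

# Paths the tagging operator expects to find after the downloads.
EXPECTED_PATHS = [
    "./models/ram_plus_swin_large_14m.pth",
    "./models/bert-base-uncased",  # assumed unzip target
]


def missing_paths(paths):
    """Return the subset of paths that do not exist on disk."""
    return [p for p in paths if not os.path.exists(p)]


missing = missing_paths(EXPECTED_PATHS)
print("Missing:", missing if missing else "none")
```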

Tagging Operator

from collections import Counter
from datasets import Image
import os, numpy as np, torch
from ram.models import ram_plus
from ram.transform import get_transform
import logging
logging.basicConfig(level=logging.INFO, format='%(asctime)s:%(name)s:%(levelname)s:%(message)s')

IMAGE_BASE_DIR = './data'

def load_image(path):
    img_feature = Image()
    img = img_feature.decode_example(img_feature.encode_example(path))
    return img.convert('RGB')

class ImageTaggingMapper(object):
    """Generate image tags."""
    def __init__(self, image_field='images', tag_field_name='image_tags', *args, **kwargs):
        super().__init__(*args, **kwargs)
        # Name of the sample field that holds the list of image paths.
        self.image_field = image_field
        logging.info('Loading recognizeAnything model...')
        self.model = ram_plus(pretrained='./models/ram_plus_swin_large_14m.pth',
                              text_encoder_type='./models/bert-base-uncased',
                              image_size=384, vit='swin_l', threshold=0.68)
        self.device = torch.device('cuda:0' if torch.cuda.is_available() else 'cpu')
        self.model.to(self.device).eval()
        self.transform = get_transform(image_size=384)
        self.tag_field_name = tag_field_name

    def process_single(self, sample):
        # Skip samples that already carry tags.
        if self.tag_field_name in sample:
            return sample
        # Samples without images get an empty tag array.
        if self.image_field not in sample or not sample[self.image_field]:
            sample[self.tag_field_name] = np.array([[]], dtype=np.str_)
            return sample
        image_paths = sample[self.image_field]
        image_tags = []
        for img_path in image_paths:
            img_path = os.path.join(IMAGE_BASE_DIR, img_path)
            image = load_image(img_path)
            image_tensor = torch.unsqueeze(self.transform(image), dim=0).to(self.device)
            with torch.no_grad():
                tags, _ = self.model.generate_tag(image_tensor)
            words = [w.strip() for w in tags[0].split('|')]
            word_count = Counter(words)
            sorted_word_list = [item for item, _ in word_count.most_common()]
            image_tags.append(np.array(sorted_word_list, dtype=np.str_))
        sample[self.tag_field_name] = image_tags
        return sample
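The post-processing inside `process_single` can be tried without a GPU or model weights. The sketch below isolates that step: it splits a `|`-separated tag string, strips whitespace, and returns unique tags ordered by frequency. The function name `parse_tags` and the input string are made up for illustration, and dropping empty entries is an extra step that the operator itself does not perform.

```python
from collections import Counter


def parse_tags(raw):
    """Split a '|'-separated tag string into unique tags, most frequent first."""
    words = [w.strip() for w in raw.split("|")]
    words = [w for w in words if w]  # extra step: drop empty entries
    # most_common() keeps first-seen order among tags with equal counts.
    return [tag for tag, _ in Counter(words).most_common()]


print(parse_tags("temple | sky | temple | building"))
# → ['temple', 'sky', 'building']
```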

Run Tagging on the Dataset

from datasets import load_dataset

data_path = './data/train.json'
tag_data_path = './out_tag_data.json'

dataset = load_dataset('json', data_files=data_path)
image_tagging_op = ImageTaggingMapper()

dataset = dataset.map(function=image_tagging_op.process_single)

dataset['train'].to_json(tag_data_path, force_ascii=False)
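The output file holds one JSON record per line, with one tag list per image in `image_tags`. As a rough sanity check on the result, the sketch below counts how many images carry each tag. The helper `tag_frequencies` and the sample records are hypothetical and only mimic the shape of the operator's output.

```python
import json
from collections import Counter


def tag_frequencies(jsonl_lines, tag_field="image_tags"):
    """Count how many images carry each tag across JSON Lines records."""
    counts = Counter()
    for line in jsonl_lines:
        line = line.strip()
        if not line:
            continue
        record = json.loads(line)
        for tags in record.get(tag_field) or []:
            counts.update(set(tags))  # count each tag once per image
    return counts


# Made-up records in the shape the operator writes: one tag list per image.
sample_lines = [
    '{"images": ["a.jpg"], "image_tags": [["temple", "sky"]]}',
    '{"images": ["b.jpg", "c.jpg"], "image_tags": [["sky"], ["ink", "sky"]]}',
]
print(tag_frequencies(sample_lines).most_common(1))
```

On the real output, the same function can be fed the lines of `out_tag_data.json` opened with `encoding='utf-8'`.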

Visualize Results

import matplotlib.pyplot as plt
# Alias PIL's Image so it does not shadow datasets.Image, which load_image() relies on.
from PIL import Image as PILImage, ImageDraw, ImageFont
import json, random, os

TAG_FONT_SIZE = 30
FONT_PATH = './DingTalk JinBuTi.ttf'
TAG_COLOR = 'red'

def visualize_with_tags(image_path, tags):
    img = PILImage.open(image_path)
    draw = ImageDraw.Draw(img)
    try:
        font = ImageFont.truetype(FONT_PATH, TAG_FONT_SIZE)
    except Exception:
        # Fall back to PIL's built-in font when the TTF file is not available.
        font = ImageFont.load_default()
    # One tag per line under a header.
    tag_text = 'Tags:\n' + ', \n'.join(tags)
    draw.text((10, 10), tag_text, fill=TAG_COLOR, font=font)
    plt.figure(figsize=(10, 8))
    plt.imshow(img)
    plt.axis('off')
    plt.title(os.path.basename(image_path))
    plt.show()

with open(tag_data_path, 'r', encoding='utf-8') as f:  # the file was written with force_ascii=False
    data_list = f.readlines()
random.shuffle(data_list)
for item in data_list[:3]:
    item = json.loads(item)
    for img_path, tags in zip(item['images'], item['image_tags']):
        visualize_with_tags(os.path.join(IMAGE_BASE_DIR, img_path), tags)
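Images with many tags can produce a label that runs past the bottom edge of the picture. One possible alternative, sketched below with only the standard library, packs several tags per line and wraps at a fixed character width. The helper `format_tag_label`, its `width` parameter, and the English `Tags:` header are assumptions for illustration; the resulting string could be passed to `draw.text` in place of `tag_text`.

```python
import textwrap


def format_tag_label(tags, width=40):
    """Build a label: a 'Tags:' header, then comma-separated tags wrapped to `width` characters."""
    body = textwrap.fill(", ".join(tags), width=width)
    return "Tags:\n" + body


print(format_tag_label(["architecture", "building", "city", "pillar", "temple"], width=24))
```

The width is measured in characters, not pixels, so a suitable value depends on the font size and the image width.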

Sample Output

Image path: images/instance_1579398113589784578.jpg

Tags: ['calligraphy', 'ink', 'manuscript', 'mark', 'pen', 'scroll', 'text', 'write', 'writing']

Image path: images/instance_1579398113581395972.jpg

Tags: ['artifact', 'hole', 'metal', 'rust', 'writing']

Image path: images/instance_1586990758474346497.jpg

Tags: ['architecture', 'building', 'city', 'pillar', 'entrance', 'palace', 'place', 'plaza', 'red', 'shrine', 'sky', 'structure', 'temple', 'worship']
