How to Build Multimodal Image Tagging with RAM and BERT in DataWorks Notebook
This tutorial walks through using DataWorks Notebook with GPU support to combine the open‑vocabulary visual model RAM and the language model BERT for zero‑shot multimodal image tagging. It covers environment setup, model installation, dataset preparation, the tagging code, and result visualization.
Overview
DataWorks is an all‑in‑one intelligent data development and governance platform built on more than ten years of Alibaba’s big‑data engineering practice. Its Notebook provides an interactive environment with GPU support, so data cleaning, feature engineering, model training, and inference can all run on a single platform.
Goal
This tutorial demonstrates how to use the open‑vocabulary visual model RAM together with the language model BERT to perform zero‑shot multimodal image tagging within a DataWorks Notebook.
Preparation
Enter DataWorks Gallery and load the “Image Tagging with RAM and BERT” case.
Create a workspace and a personal development environment, selecting a GPU instance (e.g., 24 GB A10).
Choose the official DSW image modelscope:1.18.0-pytorch2.3.0-gpu-py310-cu121-ubuntu22.04.
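Once the instance is running, it can be worth confirming that a GPU is actually visible before installing anything. The following is a minimal sketch, assuming the image ships the standard `nvidia-smi` utility on the `PATH`; the helper name `list_gpus` is hypothetical and not part of DataWorks.

```python
import shutil
import subprocess

def list_gpus():
    """Return the GPU descriptions reported by nvidia-smi, or [] if none are visible."""
    exe = shutil.which("nvidia-smi")
    if exe is None:
        return []  # nvidia-smi not found: probably a CPU-only instance
    try:
        out = subprocess.run(
            [exe, "--query-gpu=name,memory.total", "--format=csv,noheader"],
            capture_output=True, text=True, check=True, timeout=30,
        ).stdout
    except (subprocess.SubprocessError, OSError):
        return []  # driver not loaded or the query failed
    return [line.strip() for line in out.splitlines() if line.strip()]

gpus = list_gpus()
print(gpus if gpus else "No GPU detected; check the instance type.")
```

An empty list suggests the personal development environment was created without a GPU instance, in which case the tagging code later in the tutorial would fall back to the CPU.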
Install RAM
!pip install git+https://github.com/xinyu1205/recognize-anything.git
Download Dataset
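The download commands in this section interpolate `{region}` from the `DATAWORKS_REGION` environment variable. If that variable is unset, `os.getenv` returns `None` and the URL silently becomes invalid. A minimal sketch of a guard follows; the helper name `dataset_url` is hypothetical, the URL pattern is copied from the commands in this tutorial, and the region value in the example is illustrative only.

```python
import os

# URL pattern taken from the wget commands in this tutorial
BASE = ("https://dataworks-notebook-{region}.oss-{region}-internal.aliyuncs.com"
        "/public-datasets/Image_Tagging_GPU/{name}")

def dataset_url(name, region=None):
    """Build a download URL, failing early if the region is unknown."""
    region = region or os.getenv("DATAWORKS_REGION")
    if not region:
        raise RuntimeError("DATAWORKS_REGION is not set; pass region explicitly.")
    return BASE.format(region=region, name=name)

# Example with an explicit region (illustrative value, not from the tutorial)
print(dataset_url("Qwen2-VL-History.zip", region="cn-shanghai"))
```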
import os
region = os.getenv('DATAWORKS_REGION')
!wget https://dataworks-notebook-{region}.oss-{region}-internal.aliyuncs.com/public-datasets/Image_Tagging_GPU/Qwen2-VL-History.zip
!mkdir -p data && unzip -q Qwen2-VL-History.zip -d ./data
Download Models
!mkdir -p models
!wget -P ./models https://dataworks-notebook-{region}.oss-{region}-internal.aliyuncs.com/public-datasets/Image_Tagging_GPU/ram_plus_swin_large_14m.pth
!wget https://dataworks-notebook-{region}.oss-{region}-internal.aliyuncs.com/public-datasets/Image_Tagging_GPU/bert-base-uncased.zip
!unzip -q bert-base-uncased.zip -d ./models
Tagging Operator
from collections import Counter
import logging
import os

import numpy as np
import torch
from datasets import Image
from ram.models import ram_plus
from ram.transform import get_transform

logging.basicConfig(level=logging.INFO, format='%(asctime)s:%(name)s:%(levelname)s:%(message)s')

IMAGE_BASE_DIR = './data'

def load_image(path):
    # Decode the file through the datasets Image feature, then force RGB
    img_feature = Image()
    img = img_feature.decode_example(img_feature.encode_example(path))
    return img.convert('RGB')

class ImageTaggingMapper(object):
    """Generate image tags."""

    def __init__(self, image_field='images', tag_field_name='image_tags', *args, **kwargs):
        super().__init__(*args, **kwargs)
        self.image_field = image_field
        logging.info('Loading recognizeAnything model...')
        self.model = ram_plus(pretrained='./models/ram_plus_swin_large_14m.pth',
                              text_encoder_type='./models/bert-base-uncased',
                              image_size=384, vit='swin_l', threshold=0.68)
        self.device = torch.device('cuda:0' if torch.cuda.is_available() else 'cpu')
        self.model.to(self.device).eval()
        self.transform = get_transform(image_size=384)
        self.tag_field_name = tag_field_name

    def process_single(self, sample):
        # Skip samples that already carry tags
        if self.tag_field_name in sample:
            return sample
        # Samples without images get an empty tag array
        if self.image_field not in sample or not sample[self.image_field]:
            sample[self.tag_field_name] = np.array([[]], dtype=np.str_)
            return sample
        image_paths = sample[self.image_field]
        image_tags = []
        for img_path in image_paths:
            img_path = os.path.join(IMAGE_BASE_DIR, img_path)
            image = load_image(img_path)
            image_tensor = torch.unsqueeze(self.transform(image), dim=0).to(self.device)
            with torch.no_grad():
                tags, _ = self.model.generate_tag(image_tensor)
            # Split the '|'-separated tag string and deduplicate
            words = [w.strip() for w in tags[0].split('|')]
            word_count = Counter(words)
            sorted_word_list = [item for item, _ in word_count.most_common()]
            image_tags.append(np.array(sorted_word_list, dtype=np.str_))
        sample[self.tag_field_name] = image_tags
        return sample
Run Tagging on the Dataset
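Before running the operator over the whole dataset, its post-processing step can be tried in isolation, without loading the model. The operator splits the `|`-separated tag string, strips whitespace, and deduplicates with `Counter.most_common()`. A minimal sketch of that step follows; the function name `parse_tags` and the input string are illustrative, not part of RAM.

```python
from collections import Counter

def parse_tags(raw):
    """Split a '|'-separated tag string and deduplicate, most frequent first."""
    words = [w.strip() for w in raw.split("|")]
    # most_common() orders by count; tags with equal counts keep first-seen order
    return [word for word, _ in Counter(words).most_common()]

print(parse_tags("ink | scroll | text | ink"))
```

Because each tag normally appears once per image, the `Counter` step mostly acts as an order-preserving deduplication.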
from datasets import load_dataset
data_path = './data/train.json'
tag_data_path = './out_tag_data.json'
dataset = load_dataset('json', data_files=data_path)
image_tagging_op = ImageTaggingMapper()
dataset = dataset.map(function=image_tagging_op.process_single)
dataset['train'].to_json(tag_data_path, force_ascii=False)
Visualize Results
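The output file holds one JSON object per line, which is how the reading loop in this section parses it. Before plotting, a short text summary can confirm that tags were actually produced. This is a minimal sketch, assuming each record carries the `images` and `image_tags` fields used in this tutorial; the helper name `summarize` and the demo records are hypothetical.

```python
import json
import os
import tempfile

def summarize(path):
    """Return (records, images, images without tags) for a JSON Lines file."""
    n_records = n_images = n_untagged = 0
    with open(path, "r", encoding="utf-8") as f:
        for line in f:
            if not line.strip():
                continue  # tolerate blank lines
            record = json.loads(line)
            n_records += 1
            for tags in record.get("image_tags", []):
                n_images += 1
                if not tags:
                    n_untagged += 1
    return n_records, n_images, n_untagged

# Demo on a tiny hand-written file; in the notebook, pass './out_tag_data.json'
demo = [
    {"images": ["images/a.jpg"], "image_tags": [["metal", "rust"]]},
    {"images": ["images/b.jpg", "images/c.jpg"], "image_tags": [["sky"], []]},
]
with tempfile.TemporaryDirectory() as tmp:
    demo_path = os.path.join(tmp, "demo.json")
    with open(demo_path, "w", encoding="utf-8") as f:
        f.writelines(json.dumps(r) + "\n" for r in demo)
    print(summarize(demo_path))
```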
import json
import os
import random

import matplotlib.pyplot as plt
# Aliased so it does not shadow datasets.Image used by load_image
from PIL import Image as PILImage, ImageDraw, ImageFont

TAG_FONT_SIZE = 30
FONT_PATH = './DingTalk JinBuTi.ttf'
TAG_COLOR = 'red'

def visualize_with_tags(image_path, tags):
    img = PILImage.open(image_path)
    draw = ImageDraw.Draw(img)
    try:
        font = ImageFont.truetype(FONT_PATH, TAG_FONT_SIZE)
    except Exception:
        # Fall back to PIL's built-in font if the TTF file is missing
        font = ImageFont.load_default()
    # Overlay the tags on the image, one per line
    tag_text = 'Tags:\n' + ',\n'.join(tags)
    draw.text((10, 10), tag_text, fill=TAG_COLOR, font=font)
    plt.figure(figsize=(10, 8))
    plt.imshow(img)
    plt.axis('off')
    plt.title(os.path.basename(image_path))
    plt.show()

# The output file holds one JSON object per line
with open(tag_data_path, 'r', encoding='utf-8') as f:
    data_list = f.readlines()
random.shuffle(data_list)
# Show three randomly chosen samples
for item in data_list[:3]:
    item = json.loads(item)
    for img_path, tags in zip(item['images'], item['image_tags']):
        visualize_with_tags(os.path.join(IMAGE_BASE_DIR, img_path), tags)
Sample Output
Image path: images/instance_1579398113589784578.jpg
Tags: ['calligraphy', 'ink', 'manuscript', 'mark', 'pen', 'scroll', 'text', 'write', 'writing']
Image path: images/instance_1579398113581395972.jpg
Tags: ['artifact', 'hole', 'metal', 'rust', 'writing']
Image path: images/instance_1586990758474346497.jpg
Tags: ['architecture', 'building', 'city', 'pillar', 'entrance', 'palace', 'place', 'plaza', 'red', 'shrine', 'sky', 'structure', 'temple', 'worship']
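Tags like these become useful once they can be queried. As a hedged illustration, the sketch below builds a simple tag-to-image index from the three sample results above; the helper name `build_tag_index` is hypothetical, and the data is copied from the sample output rather than read from the output file.

```python
from collections import defaultdict

# Tags copied from the sample output above
samples = {
    "images/instance_1579398113589784578.jpg": [
        "calligraphy", "ink", "manuscript", "mark", "pen",
        "scroll", "text", "write", "writing"],
    "images/instance_1579398113581395972.jpg": [
        "artifact", "hole", "metal", "rust", "writing"],
    "images/instance_1586990758474346497.jpg": [
        "architecture", "building", "city", "pillar", "entrance", "palace",
        "place", "plaza", "red", "shrine", "sky", "structure", "temple", "worship"],
}

def build_tag_index(samples):
    """Map each tag to the list of image paths that carry it."""
    index = defaultdict(list)
    for path, tags in samples.items():
        for tag in tags:
            index[tag].append(path)
    return dict(index)

index = build_tag_index(samples)
print(index["writing"])  # the two images tagged 'writing'
```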
Alibaba Cloud Big Data AI Platform
The Alibaba Cloud Big Data AI Platform builds on Alibaba’s leading cloud infrastructure, big‑data and AI engineering capabilities, scenario algorithms, and extensive industry experience to offer enterprises and developers a one‑stop, cloud‑native big‑data and AI capability suite. It boosts AI development efficiency, enables large‑scale AI deployment across industries, and drives business value.
