How We Scaled Billion‑Image Asset Ingestion with Dataworks: Lessons & Tricks

Faced with importing billions of image assets, we redesigned the pipeline around the Dataworks open‑API, clustered tables, data sharding, cube tables, and custom key generation. The result was faster parallel processing, better fault tolerance, and flexible attribute storage. We also share practical insights on scheduling, parameterized views, and output services.

Alibaba Cloud Developer

Oct 29, 2024

Introduction

The author participated in the design and development of an industrial AI asset library that handles billions of image assets, and shares engineering practices learned during its construction.

Import Practice Summary

Initially the platform imported only hundreds of thousands of images, but demand quickly grew into the billions. With the Dataworks open‑API alone, an import could not finish within 24 hours, so we completely redesigned the import chain.

1. Using Clustered Tables to Improve Parallelism

ODPS supports HASH and RANGE clustered tables. HASH clustering distributes rows into buckets by hashing the specified columns. To let joins exploit the clustering, set the bucket count to a power of two (e.g., 512, 1024, 2048), and cluster both the source and target tables with consistent bucket settings.

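The clustering clause of the CREATE TABLE statement has the following form:

[clustered by | range clustered by (<col_name> [, <col_name>, ...]) [sorted by (<col_name> [asc | desc] [, <col_name> [asc | desc] ...])] into <number_of_buckets> buckets]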
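To see why consistent, power-of-two bucket counts help, the sketch below assigns keys to buckets with a hash taken modulo the bucket count. It is an illustration only: `bucket_of` is a hypothetical helper that uses MD5 as a stand-in, not the hash function ODPS actually applies, and the key names are made up.

```python
import hashlib

def bucket_of(key: str, num_buckets: int) -> int:
    """Map a key to a bucket index. MD5 is a stand-in hash, not ODPS's own."""
    digest = hashlib.md5(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_buckets

keys = [f"image_{i}" for i in range(10_000)]

# Because 512 divides 1024, a key in bucket b of a 1024-bucket table always
# falls in bucket b % 512 of a 512-bucket table, so buckets can be paired
# up for a join without reshuffling rows.
aligned = all(bucket_of(k, 1024) % 512 == bucket_of(k, 512) for k in keys)
print(aligned)  # True
```

When the two bucket counts do not divide each other, this pairing no longer holds, and rows have to be redistributed before the join.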

2. Using Data Sharding for Parallelism and Fault Tolerance

Even with clustered tables, importing billions of images remains slow, and a single node failure aborts the whole workflow. By splitting the job into multiple shards, each handling at most ten million images, we gain fault isolation, controllable parallelism, and finer‑grained control.

Only the failed shard needs to be re‑run.

Resources are utilized efficiently with many concurrent shards.

Task control becomes more granular.
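The sharding idea can be sketched as follows. This is a minimal, sequential illustration under stated assumptions: `plan_shards`, `run_shards`, and `flaky_import` are hypothetical names, row offsets stand in for whatever the real pipeline shards on, and a production scheduler would run shards concurrently.

```python
from typing import Callable, Dict, List, Tuple

Shard = Tuple[int, int]           # half-open row range [start, end)
MAX_SHARD_SIZE = 10_000_000       # per-shard cap mentioned in the text

def plan_shards(total_rows: int, max_size: int = MAX_SHARD_SIZE) -> List[Shard]:
    """Split [0, total_rows) into ranges of at most max_size rows."""
    return [(start, min(start + max_size, total_rows))
            for start in range(0, total_rows, max_size)]

def run_shards(shards: List[Shard],
               import_shard: Callable[[int, int], None]) -> List[Shard]:
    """Run every shard and return the ones that failed, so only they are retried."""
    failed = []
    for start, end in shards:
        try:
            import_shard(start, end)
        except Exception:
            failed.append((start, end))
    return failed

# Simulated import that fails once on the second shard.
attempts: Dict[Shard, int] = {}

def flaky_import(start: int, end: int) -> None:
    attempts[(start, end)] = attempts.get((start, end), 0) + 1
    if start == 10_000_000 and attempts[(start, end)] == 1:
        raise RuntimeError("simulated node failure")

shards = plan_shards(25_000_000)
failed = run_shards(shards, flaky_import)        # first pass over all shards
still_failed = run_shards(failed, flaky_import)  # retry only the failed shard
print(len(shards), failed, still_failed)
```

In this run the first and third shards execute once, and only the second shard is executed a second time.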

3. Image Key Generation Choices

Requirements: a globally unique key per image with an extremely low probability of collision. Several algorithms were evaluated:

Perceptual hash (pHash): DCT‑based, robust to scaling, enables similarity calculation via Hamming distance. Cons: relatively high computational cost and longer hash.

Average hash (aHash): Simple average‑gray method, easy to compute. Cons: loses detail, higher collision risk.

Difference hash (dHash): Compares adjacent pixel brightness, simple. Cons: sensitive to image size and can be crafted to collide.

Image MD5: Direct MD5 of the image bytes. Pros: simple and low collision probability. Cons: any change alters the hash, cannot be used for similarity.

In practice we chose image MD5 for the unique primary key and omitted fingerprint computation due to cost considerations.
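A minimal sketch of the MD5 key using Python's standard `hashlib` is shown below. The function names are hypothetical, and the article does not say how the production pipeline computes the digest; the streaming variant is included only because image files can be large.

```python
import hashlib

def image_key(image_bytes: bytes) -> str:
    """Return the 32-character hex MD5 digest of the raw image bytes."""
    return hashlib.md5(image_bytes).hexdigest()

def image_key_from_file(path: str, chunk_size: int = 1 << 20) -> str:
    """Compute the same key by streaming the file in chunks."""
    md5 = hashlib.md5()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            md5.update(chunk)
    return md5.hexdigest()

original = b"fake image bytes"
modified = b"fake image bytez"   # a single byte differs

print(image_key(original))
print(image_key(original) == image_key(modified))  # False
```

The last line illustrates the drawback noted above: any change to the bytes yields an unrelated key, so the key identifies exact duplicates only.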

Cube Table for Extensible Attributes

Basic metadata (e.g., file size, format) is stable, while business attributes vary and may be unbounded. Cube tables provide a flexible, scalable way to store arbitrary attributes compared with wide tables.
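As a rough illustration, the sketch below assumes the cube table stores one row per (image key, attribute name, attribute value); the article does not spell out the exact layout, and all names here are hypothetical. The point is that a new business attribute becomes a new row rather than a new column.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

# Assumed layout: one row per (image_key, attr_name, attr_value).
CubeRow = Tuple[str, str, str]

def to_cube_rows(image_key: str, attrs: Dict[str, str]) -> List[CubeRow]:
    """Flatten an attribute dict into tall (key, name, value) rows."""
    return [(image_key, name, value) for name, value in sorted(attrs.items())]

def from_cube_rows(rows: List[CubeRow]) -> Dict[str, Dict[str, str]]:
    """Group tall rows back into one attribute dict per image key."""
    result: Dict[str, Dict[str, str]] = defaultdict(dict)
    for key, name, value in rows:
        result[key][name] = value
    return dict(result)

rows = to_cube_rows("k1", {"category": "shoes", "season": "winter"})
# A different business adds a new attribute without any schema change.
rows += to_cube_rows("k2", {"brand": "acme"})

print(rows)
print(from_cube_rows(rows))
```

A wide table would need an `ALTER TABLE` for every new attribute and would be mostly empty when attributes differ by business line.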

Scheduling Practice Summary

Dataworks offers periodic node scheduling, but our scenarios require trigger‑based execution. We built a custom task scheduling framework on top of Dataworks open‑API.

1. Open‑API

Access requires a BU‑C account; detailed integration steps are documented in the official Dataworks Open‑API guide.

2. Task Execution Framework Design

We separate task definitions from execution instances and introduce trigger records for instantiation. A typical image‑asset import task consists of:

Material preparation

Data sharding & bucket creation

Key generation & OSS upload

Base attribute write

Extended attribute write

Each task also includes type, scheduling attributes (one‑time or periodic, timing), dependencies (configuration, source filters), and status. Only triggered and instantiated tasks can run.

2.1 Trigger & Trigger Record

Triggers fire tasks and generate a trigger record, which then instantiates task instances.

2.2 Task Instance

A task instance is the minimal executable unit. Each instance has a type (ODPS node, ODPS SQL, Java), dependency nodes, a group used for concurrency control, and an execution status.

2.3 Trigger Example
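As a minimal sketch of the flow described in 2.1 and 2.2, the code below fires a trigger, creates a trigger record, and instantiates task instances from task definitions. Every class, field, and status name is a hypothetical stand-in chosen for illustration; none of them comes from the Dataworks open‑API or from the authors' framework, and the node types assigned to each step are assumptions.

```python
from dataclasses import dataclass, field
from itertools import count
from typing import Dict, List, Tuple

@dataclass
class TaskDefinition:
    name: str
    node_type: str                                  # e.g. "ODPS SQL" or "Java"
    depends_on: List[str] = field(default_factory=list)

@dataclass
class TriggerRecord:
    record_id: int
    params: Dict[str, str]

@dataclass
class TaskInstance:
    record_id: int
    task_name: str
    node_type: str
    depends_on: List[str]
    status: str = "PENDING"

_record_ids = count(1)

def fire_trigger(definitions: List[TaskDefinition],
                 params: Dict[str, str]) -> Tuple[TriggerRecord, List[TaskInstance]]:
    """Create one trigger record and one instance per task definition."""
    record = TriggerRecord(next(_record_ids), params)
    instances = [TaskInstance(record.record_id, d.name, d.node_type, list(d.depends_on))
                 for d in definitions]
    return record, instances

def runnable(instances: List[TaskInstance]) -> List[TaskInstance]:
    """Pending instances whose dependencies have all succeeded."""
    done = {i.task_name for i in instances if i.status == "SUCCESS"}
    return [i for i in instances
            if i.status == "PENDING" and set(i.depends_on) <= done]

definitions = [
    TaskDefinition("material_preparation", "Java"),
    TaskDefinition("sharding_and_buckets", "ODPS SQL", ["material_preparation"]),
    TaskDefinition("key_generation_and_upload", "ODPS node", ["sharding_and_buckets"]),
]

record, instances = fire_trigger(definitions, {"source": "demo"})
first = runnable(instances)       # only the step without dependencies
first[0].status = "SUCCESS"
second = runnable(instances)      # the next step is now unblocked
print([i.task_name for i in first], [i.task_name for i in second])
```

Keeping definitions separate from instances means the same definition can be triggered many times, with each trigger record owning its own set of instances and statuses.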

Asset Output Capability Summary

After storing massive images, fast, flexible, and secure access is essential. We employ:

OpenSearch: Supports vector search for image‑by‑image queries with low latency.

Holo (Hologres): PostgreSQL‑compatible, builds billion‑row indexes in minutes and supports simple attribute queries.

MySQL: Sharded tables for primary‑key lookups.
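For the MySQL path, one possible way to route a primary-key lookup to a sharded table is sketched below. The routing rule, the table prefix, and the shard count are assumptions made for illustration; the article does not describe the actual sharding scheme.

```python
def shard_table(image_key: str, num_tables: int = 64,
                prefix: str = "image_asset_") -> str:
    """Pick a table from the first 8 hex digits of the MD5 key (hypothetical rule)."""
    index = int(image_key[:8], 16) % num_tables
    return f"{prefix}{index:02d}"

key = "900150983cd24fb0d6963f7d28e17f72"   # MD5 of b"abc"
print(shard_table(key))  # image_asset_24
```

Because the key is already a uniformly distributed hash, a simple modulo over its prefix spreads rows evenly without a separate lookup table.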

For offline output we adopt parameterized VIEWs, which encapsulate complex SQL logic while allowing callers to pass parameters, thus improving reuse and ensuring data isolation.

Traditional MaxCompute VIEWs cannot accept parameters; the MaxCompute 2.0 engine supports parameterized VIEWs that can take arbitrary tables or variables as arguments.

Conclusion

Two major version iterations revealed the trade‑off between over‑design and under‑design. Practical engineering requires balancing future scalability with current complexity, guided by deep system and business understanding.

Appendix

Python implementations of perceptual hash, average hash, difference hash, and hash comparison.

import cv2
import numpy as np
# perceptual hash algorithm
def pic_p_hash(img, img_size = 32, hash_size = 8):
    # shrink to img_size x img_size, then convert to gray
    img = cv2.resize(img, (img_size, img_size))
    gray = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)
    # 2-D DCT on a float copy of the gray image
    dct = cv2.dct(np.float32(gray))
    # keep only the top-left hash_size x hash_size low-frequency block
    dct = dct[:hash_size, :hash_size]
    # mean of the retained coefficients
    avg = np.mean(dct)
    # generate hash: 1 where the coefficient is above the mean
    bits = (dct > avg).astype(int).flatten()
    # return a '0'/'1' string so hash_cmp counts differing bits
    return ''.join(str(x) for x in bits)
# average hash algorithm
def pic_avg_hash(img):
    # resize to 8x8
    img = cv2.resize(img, (8, 8), interpolation=cv2.INTER_CUBIC)
    gray = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)
    # use the array mean; accumulating uint8 pixels in a Python loop can overflow
    avg = gray.mean()
    # 1 where the pixel is brighter than the mean, row by row
    return ''.join('1' if p > avg else '0' for p in gray.flatten())
# difference hash algorithm
def pic_dif_hash(img):
    img = cv2.resize(img,(9,8),interpolation=cv2.INTER_CUBIC)
    gray = cv2.cvtColor(img,cv2.COLOR_BGR2GRAY)
    hash_str = ''
    for i in range(8):
        for j in range(8):
            hash_str += '1' if gray[i,j] > gray[i,j+1] else '0'
    return hash_str
# hash comparison
def hash_cmp(hash1,hash2):
    if len(hash1) != len(hash2):
        return -1
    n = 0
    for i in range(len(hash1)):
        if hash1[i] != hash2[i]:
            n += 1
    return n

if __name__ == '__main__':
    img1 = cv2.imread('/path/a.jpeg')
    img2 = cv2.imread('/path/b.jpeg')
    img3 = cv2.imread('/path/c.jpeg')
    img4 = cv2.imread('/path/d.png')
    img5 = cv2.imread('/path/e.jpeg')
    imgHash1 = pic_p_hash(img1, 32)
    imgHash2 = pic_p_hash(img2, 32)
    imgHash3 = pic_p_hash(img3, 32)
    imgHash4 = pic_p_hash(img4, 32)
    imgHash5 = pic_p_hash(img5, 32)
    print(imgHash5)
    cmp1 = hash_cmp(imgHash1, imgHash2)
    cmp2 = hash_cmp(imgHash1, imgHash3)
    cmp3 = hash_cmp(imgHash1, imgHash4)
    cmp4 = hash_cmp(imgHash1, imgHash5)
    print(cmp1)
    print(cmp2)
    print(cmp3)
    print(cmp4)
