How We Scaled Billion‑Image Asset Ingestion with Dataworks: Lessons & Tricks
Faced with importing billions of image assets, we redesigned the pipeline around the Dataworks open‑API, clustered tables, data sharding, cube tables, and custom key generation. The result was faster parallel processing, better fault tolerance, and flexible attribute storage. We also share practical insights on scheduling, parameterized views, and output services.
Introduction
The author participated in the design and development of an industrial AI asset library that handles billions of image assets, and shares engineering practices learned during its construction.
Import Practice Summary
Initially the platform imported only hundreds of thousands of images, but demand quickly grew to billions. The Dataworks open‑API alone could not finish an import of that size within 24 hours, which led to a complete redesign of the import chain.
1. Using Clustered Tables to Improve Parallelism
ODPS supports HASH and RANGE clustered tables. HASH clustering distributes rows into buckets based on a hash of the specified column; for join optimization bucket numbers should be powers of two (e.g., 512, 1024, 2048). Both source and target tables must be clustered with consistent bucket settings.
clustered by | range clustered by (<col_name> [, <col_name> ...]) [sorted by (<col_name> [asc|desc] ...)] into <number_of_buckets> buckets
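To illustrate why consistent bucket settings matter for joins, here is a minimal sketch in plain Python. It does not call ODPS: `zlib.crc32` stands in for the engine's internal hash function, and `bucket_of`, `cluster`, and `bucketed_join` are hypothetical names used only for this illustration.

```python
import zlib
from collections import defaultdict


def bucket_of(key, num_buckets):
    # Stand-in hash (assumption): the real engine uses its own hash function.
    return zlib.crc32(key.encode("utf-8")) % num_buckets


def cluster(rows, num_buckets):
    """Group (key, value) rows by bucket number."""
    buckets = defaultdict(list)
    for key, value in rows:
        buckets[bucket_of(key, num_buckets)].append((key, value))
    return buckets


def bucketed_join(left, right, num_buckets=512):
    """Join two row sets bucket by bucket; both sides use the same bucket count."""
    left_buckets = cluster(left, num_buckets)
    right_buckets = cluster(right, num_buckets)
    joined = []
    # Equal keys always land in the same bucket number, so each bucket pair
    # can be joined independently (for example, by a separate worker).
    for b in left_buckets.keys() & right_buckets.keys():
        right_index = defaultdict(list)
        for key, value in right_buckets[b]:
            right_index[key].append(value)
        for key, value in left_buckets[b]:
            for right_value in right_index.get(key, []):
                joined.append((key, value, right_value))
    return joined
```

If the two sides used different bucket counts, matching keys could fall into different bucket numbers and the per-bucket join would miss them, which is why the source and target settings have to agree.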
2. Using Data Sharding for Parallelism and Fault Tolerance
Even with clustered tables, importing billions of images remains slow, and a single node failure aborts the whole workflow. Splitting the job into shards, each handling at most ten million images, gives us fault isolation, controllable parallelism, and finer‑grained task control:
Only the failed shard needs to be re‑run.
Resources are utilized efficiently with many concurrent shards.
Task control becomes more granular.
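The sharding idea can be sketched as follows. This is a simplified illustration, not the production pipeline: `plan_shards`, `run_shards`, and the `import_shard` callback are hypothetical names, and a thread pool stands in for whatever executor actually runs the shards.

```python
from concurrent.futures import ThreadPoolExecutor

SHARD_SIZE = 10_000_000  # upper bound per shard, as described in the text


def plan_shards(total_rows, shard_size=SHARD_SIZE):
    """Split [0, total_rows) into half-open (start, end) ranges."""
    return [(start, min(start + shard_size, total_rows))
            for start in range(0, total_rows, shard_size)]


def run_shards(shards, import_shard, max_workers=8):
    """Run shards concurrently and return the shards that failed."""
    def attempt(shard):
        try:
            import_shard(shard)
            return None
        except Exception:
            return shard  # the failure stays isolated to this shard

    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return [s for s in pool.map(attempt, shards) if s is not None]
```

A caller would run `failed = run_shards(plan_shards(total), import_shard)` and then pass only `failed` back into `run_shards`, so successful shards are never repeated. `max_workers` is the knob that controls parallelism.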
3. Image Key Generation Choices
Requirements: a globally unique key per image with an extremely low probability of collision. Several algorithms were evaluated:
Perceptual hash (pHash): DCT‑based, robust to scaling, enables similarity calculation via Hamming distance. Cons: relatively high computational cost and longer hash.
Average hash (aHash): Simple average‑gray method, easy to compute. Cons: loses detail, higher collision risk.
Difference hash (dHash): Compares adjacent pixel brightness, simple. Cons: sensitive to image size and can be crafted to collide.
Image MD5: Direct MD5 of the image bytes. Pros: simple and low collision probability. Cons: any change alters the hash, cannot be used for similarity.
In practice we chose image MD5 for the unique primary key and omitted fingerprint computation due to cost considerations.
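A minimal sketch of MD5‑based key generation using Python's standard `hashlib`. The function names are illustrative and not taken from the original system; reading the file in chunks is an assumption made here to keep memory use flat for large images.

```python
import hashlib


def md5_key_from_bytes(data):
    """Return the hex MD5 digest of raw image bytes."""
    return hashlib.md5(data).hexdigest()


def md5_key_from_file(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of a file, read in 1 MiB chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()
```

Because the key depends on every byte, a re‑encoded or resized copy of the same picture gets a different key, which matches the drawback noted above.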
4. Cube Table for Extensible Attributes
Basic metadata (e.g., file size, format) is stable, while business attributes vary and may be unbounded. Cube tables provide a flexible, scalable way to store arbitrary attributes compared with wide tables.
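The original does not spell out the cube table schema. As a sketch, assume it stores one row per (image key, attribute name, attribute value), so a new business attribute becomes new rows rather than a new column. `to_attribute_rows` and the sample attribute names are hypothetical.

```python
def to_attribute_rows(image_key, attributes):
    """Flatten an attribute dict into (key, name, value) rows.

    Values are stored as strings so that attributes of any type fit
    the same three-column layout (an assumption of this sketch).
    """
    return [(image_key, name, str(value))
            for name, value in sorted(attributes.items())]


def to_attribute_dict(rows):
    """Rebuild the attribute dict of one image from its rows."""
    return {name: value for _, name, value in rows}
```

With a wide table, adding an attribute such as `season` would require a schema change; in this layout it only adds rows.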
Scheduling Practice Summary
Dataworks offers periodic node scheduling, but our scenarios require trigger‑based execution. We built a custom task scheduling framework on top of Dataworks open‑API.
1. Open‑API
Access requires a BU‑C account; detailed integration steps are documented in the official Dataworks Open‑API guide.
2. Task Execution Framework Design
We separate task definitions from execution instances and introduce trigger records for instantiation. A typical image‑asset import task consists of:
Material preparation
Data sharding & bucket creation
Key generation & OSS upload
Base attribute write
Extended attribute write
Each task also includes type, scheduling attributes (one‑time or periodic, timing), dependencies (configuration, source filters), and status. Only triggered and instantiated tasks can run.
2.1 Trigger & Trigger Record
Triggers fire tasks and generate a trigger record, which then instantiates task instances.
2.2 Task Instance
A task instance is the minimal executable unit. Instances have a type (ODPS node, ODPS SQL, Java), dependency nodes, grouping for concurrency control, and execution status.
2.3 Trigger Example
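A minimal sketch of how a trigger might produce a trigger record and task instances. All class names, fields, and the linear step dependency are assumptions for illustration; none are taken from the original framework.

```python
from dataclasses import dataclass, field
from enum import Enum
from itertools import count


class Status(Enum):
    PENDING = "pending"
    RUNNING = "running"
    SUCCESS = "success"
    FAILED = "failed"


@dataclass
class TaskDefinition:
    name: str
    steps: list                  # ordered step names
    schedule: str = "one-time"   # or "periodic"


@dataclass
class TaskInstance:
    trigger_id: int
    step: str
    depends_on: list
    status: Status = Status.PENDING


@dataclass
class TriggerRecord:
    trigger_id: int
    task_name: str
    instances: list = field(default_factory=list)


_trigger_ids = count(1)


def fire(task):
    """Create a trigger record and instantiate one instance per step."""
    record = TriggerRecord(next(_trigger_ids), task.name)
    previous = []
    for step in task.steps:
        # Assumption: each step depends only on the step before it.
        record.instances.append(
            TaskInstance(record.trigger_id, step, list(previous)))
        previous = [step]
    return record
```

Firing the same definition twice yields two independent records, each with its own instances and statuses, which reflects the separation of task definition from execution instance described above.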
Asset Output Capability Summary
After storing massive images, fast, flexible, and secure access is essential. We employ:
OpenSearch: Supports vector search for image‑by‑image queries with low latency.
Holo: Based on PostgreSQL, builds billion‑row indexes in minutes and supports simple attribute queries.
MySQL: Sharded tables for primary‑key lookups.
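For the sharded MySQL lookups, one plausible routing scheme is to derive the shard from the MD5 key itself. The sketch below is an assumption, not the original design: the shard count of 1024, the `image_asset` table prefix, and the function name are all hypothetical.

```python
def shard_table_for(md5_key, shard_count=1024, prefix="image_asset"):
    """Map a hex MD5 key to the name of the shard table that holds it."""
    # The first 8 hex digits (32 bits) are enough to spread keys evenly.
    index = int(md5_key[:8], 16) % shard_count
    return f"{prefix}_{index:04d}"
```

Since the mapping is a pure function of the key, a primary‑key lookup touches exactly one shard table and needs no routing metadata.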
For offline output we adopt parameterized VIEWs, which encapsulate complex SQL logic while allowing callers to pass parameters, thus improving reuse and ensuring data isolation.
Traditional MaxCompute VIEWs cannot accept parameters; the new 2.0 engine supports parameterized VIEWs that can take arbitrary tables or variables as arguments.
Conclusion
Two major version iterations revealed the trade‑off between over‑design and under‑design. Practical engineering requires balancing future scalability with current complexity, guided by deep system and business understanding.
Appendix
Python implementations of perceptual hash, average hash, difference hash, and hash comparison.
import cv2
import numpy as np


# Perceptual hash (pHash): DCT-based fingerprint, returned as a hex string
def pic_p_hash(img, hash_size=32):
    img = cv2.resize(img, (hash_size, hash_size))
    # Convert to grayscale
    gray = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)
    # 2-D discrete cosine transform (cv2.dct needs floating-point input)
    dct = cv2.dct(np.float32(gray))
    # Keep the top-left hash_size x hash_size block; because the image was
    # resized to hash_size x hash_size, this is the whole coefficient matrix
    dct = dct[:hash_size, :hash_size]
    # Threshold every coefficient against the mean
    avg = np.mean(dct)
    bits = (dct > avg).astype(int).flatten()
    phash_str = ''.join(str(x) for x in bits)
    # hash_size * hash_size bits at 4 bits per hex digit; pad with zeros so
    # that every hash has the same length and can be compared
    return hex(int(phash_str, 2))[2:].zfill(hash_size * hash_size // 4)


# Average hash (aHash): 64-character string of '0'/'1'
def pic_avg_hash(img):
    # Resize to 8x8
    img = cv2.resize(img, (8, 8), interpolation=cv2.INTER_CUBIC)
    gray = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)
    # Use the array mean; adding uint8 pixels one by one can overflow
    avg = gray.mean()
    hash_str = ''
    for i in range(8):
        for j in range(8):
            hash_str += '1' if gray[i, j] > avg else '0'
    return hash_str


# Difference hash (dHash): 64-character string of '0'/'1'
def pic_dif_hash(img):
    # cv2.resize takes (width, height): 9 columns x 8 rows
    img = cv2.resize(img, (9, 8), interpolation=cv2.INTER_CUBIC)
    gray = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)
    hash_str = ''
    for i in range(8):
        for j in range(8):
            # Compare each pixel with its right-hand neighbour
            hash_str += '1' if gray[i, j] > gray[i, j + 1] else '0'
    return hash_str


# Hash comparison: number of positions at which the two strings differ.
# For '0'/'1' strings this is the Hamming distance; for hex strings it is
# the number of differing hex digits. Returns -1 if the lengths differ.
def hash_cmp(hash1, hash2):
    if len(hash1) != len(hash2):
        return -1
    n = 0
    for i in range(len(hash1)):
        if hash1[i] != hash2[i]:
            n += 1
    return n


if __name__ == '__main__':
    img1 = cv2.imread('/path/a.jpeg')
    img2 = cv2.imread('/path/b.jpeg')
    img3 = cv2.imread('/path/c.jpeg')
    img4 = cv2.imread('/path/d.png')
    img5 = cv2.imread('/path/e.jpeg')
    imgHash1 = pic_p_hash(img1, 32)
    imgHash2 = pic_p_hash(img2, 32)
    imgHash3 = pic_p_hash(img3, 32)
    imgHash4 = pic_p_hash(img4, 32)
    imgHash5 = pic_p_hash(img5, 32)
    print(imgHash5)
    cmp1 = hash_cmp(imgHash1, imgHash2)
    cmp2 = hash_cmp(imgHash1, imgHash3)
    cmp3 = hash_cmp(imgHash1, imgHash4)
    cmp4 = hash_cmp(imgHash1, imgHash5)
    print(cmp1)
    print(cmp2)
    print(cmp3)
    print(cmp4)
Alibaba Cloud Developer
Alibaba's official tech channel, featuring all of its technology innovations.