Taobao AI Virtual Try-On: Offline Data Processing and Performance Optimization
Taobao’s AI virtual try-on system pre-computes fitting results offline and writes them into the Item Center (IC) via scalable ScheduleX tasks. By optimizing pagination, locking, and flow control, it processes millions of apparel items in under thirty minutes at a 99.9% success rate, with checkpoint-resume and monitoring for reliability.
With the rapid growth of e‑commerce, users expect more intuitive shopping experiences, especially for apparel. Taobao’s virtual try‑on project uses AI to provide personalized fitting, aiming to improve conversion by efficiently writing try‑on material into the IC (Item Center) extension structure via scheduled tasks.
Background
Clothing items are non‑standard; users cannot estimate fit from model images alone, leading to low conversion for items lacking complete data.
What Taobao Try‑On Has Achieved
Expanded coverage to dresses, tops, etc.
Supported multiple SKUs per product with length guides.
Improved model realism and added user‑photo try‑on.
Enhanced visual quality of try‑on results.
Cooperation Scenarios
LAZADA: replace missing model images with Southeast‑Asian models.
Taobao detail page, cart: add try‑on badge for users.
BC chat (merchant-consumer chat): real-time try-on entry.
Challenges in Detail Page
Immersive try‑on requires page navigation, breaking the purchase flow.
Recommended items and wardrobe are unsuitable for detail pages.
Real‑time try‑on adds latency and stresses GPU resources.
To address these, the team added an AI try‑on anchor directly on the main image, pre‑computed try‑on results offline, and wrote them into IC.
Offline Task: Writing Try‑On Material to IC
Implementation:

```java
@Override
public ProcessResult process(final JobContext jobContext) throws Exception {
    // handle master (root) task
    if (isRootTask(jobContext)) {
        return processRootTask(jobContext);
    }
    // handle sub-task
    if (StringUtils.equals(jobContext.getTaskName(), SUB_TASK_NAME)) {
        return processDressOfflineDataWritingIcTask(jobContext);
    }
    return new ProcessResult(true);
}

@Override
public ProcessResult reduce(final JobContext jobContext) throws Exception {
    // result aggregation; shown in full below
}
```
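Here `processRootTask` (not shown in the article) splits the total workload into page-range sub-tasks, which the ScheduleX framework then fans out to worker machines. The splitting arithmetic can be sketched without the ScheduleX SDK; the `PageRange` type and the fixed page size below are illustrative assumptions, not the production code:

```java
import java.util.ArrayList;
import java.util.List;

public class PageSplitSketch {
    // Illustrative page-range descriptor for one sub-task (an assumption).
    public record PageRange(long startInclusive, long endExclusive) {}

    // Split [0, total) into pageSize-wide ranges; the root task would hand
    // the resulting list to the framework's map(subTasks, SUB_TASK_NAME).
    public static List<PageRange> split(long total, long pageSize) {
        List<PageRange> ranges = new ArrayList<>();
        for (long offset = 0; offset < total; offset += pageSize) {
            ranges.add(new PageRange(offset, Math.min(offset + pageSize, total)));
        }
        return ranges;
    }

    public static void main(String[] args) {
        List<PageRange> ranges = split(20_000, 8_000);
        System.out.println(ranges.size());   // 3
        System.out.println(ranges.get(2));   // PageRange[startInclusive=16000, endExclusive=20000]
    }
}
```

With the dynamic pagination described later in this article, `pageSize` would be derived from the total row count and the number of machines rather than a constant.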
Data preprocessing uses ODPS + ScheduleX grid tasks. The preprocessing aggregates multiple images per item into a single JSON field extend_info to reduce downstream QPS.
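The point of the aggregation is that one IC write per item replaces one write per image. A minimal sketch of producing such a field; the field name `tryOnImages`, the item IDs, and the URLs are assumptions for illustration, not the production `extend_info` schema:

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

public class ExtendInfoSketch {
    // Aggregate an item's image URLs into a single JSON string, mimicking the
    // extend_info field produced by the ODPS preprocessing job.
    // ("tryOnImages" is an illustrative field name, not the real schema.)
    public static String buildExtendInfo(List<String> imageUrls) {
        String array = imageUrls.stream()
                .map(u -> "\"" + u + "\"")
                .collect(Collectors.joining(","));
        return "{\"tryOnImages\":[" + array + "]}";
    }

    public static void main(String[] args) {
        Map<Long, List<String>> imagesByItem = new LinkedHashMap<>();
        imagesByItem.put(1001L, List.of("https://img.example/a.jpg", "https://img.example/b.jpg"));
        // One downstream write per item instead of one per image keeps IC QPS low.
        imagesByItem.forEach((itemId, urls) ->
                System.out.println(itemId + " -> " + buildExtendInfo(urls)));
    }
}
```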
Example of assembling task context:
```java
/**
 * Assemble offline task context.
 *
 * @param jobContext task basic info
 * @param context    offline task context
 */
public void assembleContextParam(final JobContext jobContext, final DressWritingIcTaskContext context) {
    final JSONObject params;
    try {
        params = JSON.parseObject(jobContext.getInstanceParameters());
    } catch (Exception e) {
        throw new RuntimeException(e);
    }
    context.setUpdateType(TaskUpdateTypeEnum.parse(params.getString(UPDATE_TYPE)));
    context.setTableName(params.getString(ODPS_TABLE));
    context.setProjectName(params.getString(ODPS_PROJECT));
    context.setPartition(params.getString(PARTITION));
    context.setTaskId(jobContext.getTaskId());
    context.setJobInstanceId(jobContext.getJobInstanceId());
}
```

Sub-task processing logic:
```java
@Override
public ProcessResult process(final JobContext jobContext) throws Exception {
    if (isRootTask(jobContext)) {
        return processRootTask(jobContext);
    }
    if (StringUtils.equals(jobContext.getTaskName(), SUB_TASK_NAME)) {
        return processDressOfflineDataWritingIcTask(jobContext);
    }
    return new ProcessResult(true);
}
```

Core sub-task method:
```java
/**
 * Sub-task main flow.
 */
private ProcessResult processDressOfflineDataWritingIcTask(final JobContext jobContext) {
    // 1. get sub-task context
    final DressWritingIcTaskContext dataWritingIcTask = (DressWritingIcTaskContext) jobContext.getTask();
    // 2. process records by page
    final TaskUpdateResult taskUpdateResult = processRecordsByPage(dataWritingIcTask);
    // 3. return result
    return new ProcessResult(true, JSONObject.toJSONString(taskUpdateResult));
}
```

Result aggregation and notification:
```java
/**
 * Aggregate results and send DingTalk robot notifications.
 */
@Override
public ProcessResult reduce(final JobContext jobContext) {
    final TaskUpdateResult processResult = dressWritingIcProcessManager.getSuccessCountFromProcessResult(jobContext);
    // assemble data ...
    // send DingTalk notification ...
}
```

Counting successful sub-tasks:
```java
public TaskUpdateResult getSuccessCountFromProcessResult(final JobContext jobContext) {
    final TaskUpdateResult taskUpdateResult = new TaskUpdateResult();
    for (String value : jobContext.getTaskResults().values()) {
        if (StringUtils.isNotBlank(value)) {
            try {
                // merge the fields we need from each sub-task's result
            } catch (Exception e) {
                LoggerUtil.error(logger, e, "Parse taskUpdateResult failed, value:", value);
            }
        }
    }
    return taskUpdateResult;
}
```

Performance Goals
Horizontal scalability: achieve million‑item processing within an hour.
99.9% success rate for labeling.
Breakpoint resume and visualized progress.
Optimization points:
Reduce lock scope and add retry on lock failure.
Dynamic pagination based on total data size and machine count.
Thread‑pool pagination with even distribution.
Replace manual sleep‑based rate limiting with Sentinel’s built‑in flow control.
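On the last point: Sentinel enforces per-resource QPS rules (a FlowRule with grade FLOW_GRADE_QPS), so callers no longer pace themselves with Thread.sleep. The effect of such a rule can be sketched with a minimal fixed-window limiter in plain Java; this is a stand-in for Sentinel's behavior, not its implementation:

```java
// Minimal fixed-window QPS limiter sketching what a Sentinel QPS rule
// enforces; the production system uses Sentinel itself, not this class.
public class QpsLimiterSketch {
    private final long maxPerSecond;
    private long windowStart = System.currentTimeMillis();
    private long count = 0;

    public QpsLimiterSketch(long maxPerSecond) {
        this.maxPerSecond = maxPerSecond;
    }

    // Returns true if the call is admitted in the current one-second window.
    public synchronized boolean tryAcquire() {
        long now = System.currentTimeMillis();
        if (now - windowStart >= 1000) {
            windowStart = now;  // start a new window
            count = 0;
        }
        return ++count <= maxPerSecond;
    }

    public static void main(String[] args) {
        QpsLimiterSketch limiter = new QpsLimiterSketch(100); // per-machine IC cap
        int admitted = 0;
        for (int i = 0; i < 150; i++) {
            if (limiter.tryAcquire()) {
                admitted++;   // rejected calls would be retried or delayed
            }
        }
        System.out.println(admitted); // 100
    }
}
```

In Sentinel terms this corresponds to loading a FlowRule with count 100 on the IC-write resource, then wrapping each write in an SphU.entry / BlockException check.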
Sample pagination logic comment:
From earlier measurements, a single-threaded update request takes about 40 ms on average, so a single thread on one machine yields roughly 25 QPS. The IC (Item Center) team does not require rate-limit negotiation for flows below the thousand-QPS level; we cap each machine at 100 QPS, and with 10 machines online the theoretical ceiling is 1,000 QPS, though actual throughput may fall below that.

When dispatching sub-tasks under the 100-QPS-per-machine cap, each machine runs 4 threads in parallel. Taking the maximum amount of data one machine can process in a minute as the bound: 60 × 1,000 ms / 30 ms × 4 = 8,000 records, so the page-size threshold configured in the switch is 8,000.

The 40 ms above is the latency of most requests; 30 ms is the latency of the faster requests, which is the value used in the sizing bound.

Result after optimization:
Processed millions of records in half an hour with 99.9% success.
Breakpoint resume prevents duplicate updates.
Monitoring via ScheduleX console, DingTalk alerts, and ODPS trace logs.
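The breakpoint-resume behavior can be sketched as a per-page idempotency check: completed page offsets are recorded so that a restarted run skips work already written to IC. The in-memory set below stands in for whatever persistent progress store the real task uses (an assumption for this sketch):

```java
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;

// Sketch of checkpoint-resume: completed page offsets are recorded, so a
// restarted sub-task skips pages already written to IC and never updates
// the same records twice. The set stands in for a persistent store.
public class ResumeSketch {
    private final Set<Long> completedOffsets = ConcurrentHashMap.newKeySet();

    // Returns true if the page was processed on this call, false if skipped.
    public boolean processPage(long offset) {
        if (completedOffsets.contains(offset)) {
            return false; // already written in a previous run; skip duplicate update
        }
        // ... write this page's try-on material to IC ...
        completedOffsets.add(offset); // checkpoint only after a successful write
        return true;
    }

    public static void main(String[] args) {
        ResumeSketch task = new ResumeSketch();
        task.processPage(0);
        task.processPage(8000);
        // Simulated restart over the same pages: nothing is re-written.
        System.out.println(task.processPage(0));     // false
        System.out.println(task.processPage(16000)); // true
    }
}
```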
Future Outlook
The team plans to further improve model realism, main‑image composition, and explore more immersive, multi‑dimensional try‑on experiences.
DaTaobao Tech
Official account of DaTaobao Technology