How AI Model Inference Optimization Boosted Address Standardization Speed by 4×

By applying high‑performance operators, quantization, and AI compiler optimizations with Alibaba Cloud PAI Blade and Intel Xeon back‑ends, the address‑standardization service’s deep‑learning models achieved up to 4.11× faster end‑to‑end inference without sacrificing accuracy, enabling more complex models and lower latency.

Alibaba Cloud Big Data AI Platform

Aug 12, 2022

Overview

Deep learning inference performance is critical for services such as address standardization. Optimizing inference can reduce response time, lower cost, and allow more complex models without degrading latency.

Inference Optimization Methodology

Natural language processing models such as RNN‑based networks and BERT are hard to serve efficiently on x86 CPUs. The proposed solution combines three techniques to accelerate inference: high‑performance operators, model quantization, and AI compiler optimizations.

Key Techniques

Model compression: quantization, sparsity, pruning.

High‑performance operators tailored to the model graph.

AI compiler optimizations: graph fusion, operator fusion, code generation.
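Of the techniques listed above, quantization is the easiest to show in a few lines. The sketch below is a minimal, pure-Python illustration of symmetric per-tensor INT8 quantization, assuming a single scale derived from the largest absolute value. The function names (quantize_int8, dequantize) and the sample weights are hypothetical, and this is not the scheme PAI-Blade necessarily uses; production toolchains typically add calibration data and per-channel scales.

```python
from typing import List, Tuple


def quantize_int8(values: List[float]) -> Tuple[List[int], float]:
    """Map floats to integers in [-127, 127] using one shared scale."""
    max_abs = max(abs(v) for v in values)
    # Assumption: symmetric range, so zero maps exactly to zero.
    scale = max_abs / 127.0 if max_abs > 0 else 1.0
    quantized = [max(-127, min(127, round(v / scale))) for v in values]
    return quantized, scale


def dequantize(quantized: List[int], scale: float) -> List[float]:
    """Recover approximate float values from the INT8 codes."""
    return [q * scale for q in quantized]


# Hypothetical weights, chosen only for illustration.
weights = [0.5, -1.27, 0.03, 1.0]
q, scale = quantize_int8(weights)
restored = dequantize(q, scale)

# The round-trip error per value is bounded by half a quantization step.
max_error = max(abs(a - b) for a, b in zip(weights, restored))
print(q, scale, max_error)
```

The accuracy cost of quantization comes from this rounding error, which is why the evaluation reports an accuracy metric next to each latency figure.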

Address Standardization Service

The service addresses non‑standard address data across many industries. Alibaba DAMO‑Lab provides an address purification service that standardizes address inputs, builds a unified address library, and offers high‑performance search, vector recall, and re‑ranking models.

Blade Optimization Platform

PAI‑Blade exposes all of the above optimizations through a unified interface. It integrates high‑performance operators, the Intel Custom Backend, and the BladeDISC compiler to deliver end‑to‑end inference acceleration.

High‑Performance Operators on Intel Xeon

The LSTM optimizations for Intel Xeon rely on AVX‑512 instructions, operator fusion, and cache‑aware scheduling. Variable‑length input sequences are batched with PyTorch's pack_padded_sequence(), so the LSTM does not spend compute on padding positions.
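To make the packing idea concrete, the following pure-Python sketch mimics the time-major layout that a packed sequence uses: sequences are sorted by decreasing length, and each time step keeps only the sequences that are still active. It illustrates the concept only and is not PyTorch's implementation; the helper name pack_sequences and the sample batch are assumptions made for this example.

```python
from typing import List, Sequence, Tuple


def pack_sequences(seqs: Sequence[Sequence[int]]) -> Tuple[List[int], List[int]]:
    """Return (data, batch_sizes) for a batch of variable-length sequences.

    data holds the tokens step by step (time-major); batch_sizes[t] is the
    number of sequences that still have a token at time step t.
    """
    ordered = sorted(seqs, key=len, reverse=True)
    max_len = len(ordered[0]) if ordered else 0
    data: List[int] = []
    batch_sizes: List[int] = []
    for t in range(max_len):
        step = [s[t] for s in ordered if len(s) > t]
        data.extend(step)
        batch_sizes.append(len(step))
    return data, batch_sizes


# Hypothetical token-id batch with lengths 3, 2 and 1.
batch = [[1, 2, 3], [4, 5], [6]]
data, batch_sizes = pack_sequences(batch)

# A padded batch would process len(batch) * max_len positions instead.
padded_positions = len(batch) * max(len(s) for s in batch)
packed_positions = len(data)
print(data, batch_sizes, padded_positions, packed_positions)
```

In this toy batch the packed layout processes 6 positions where a padded batch would process 9, and the saving grows as sequence lengths become more uneven.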

Custom Backend Features

The Intel Custom Backend introduces a primitive cache to reuse compiled primitives, graph fusion to eliminate intermediate tensors, and memory optimizations that reduce runtime overhead.
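The primitive cache can be pictured as memoization keyed on everything that determines a compiled kernel. The sketch below is a hypothetical, simplified model of that idea in plain Python: the class PrimitiveCache, its key layout (operator name, shape, dtype), and the build counter are all assumptions for illustration and do not reflect the actual Intel Custom Backend code.

```python
from typing import Callable, Dict, List, Tuple

# Assumed cache key: operator name, input shape, and data type.
Key = Tuple[str, Tuple[int, ...], str]
Primitive = Callable[[List[float]], List[float]]


class PrimitiveCache:
    """Reuse a built primitive whenever the same key is requested again."""

    def __init__(self) -> None:
        self._cache: Dict[Key, Primitive] = {}
        self.builds = 0  # counts how often the expensive build path runs

    def get(self, op: str, shape: Tuple[int, ...], dtype: str) -> Primitive:
        key = (op, shape, dtype)
        if key not in self._cache:
            self._cache[key] = self._build(op)
        return self._cache[key]

    def _build(self, op: str) -> Primitive:
        # Stand-in for costly primitive creation (e.g. kernel generation).
        self.builds += 1
        if op == "relu":
            return lambda xs: [max(0.0, x) for x in xs]
        raise ValueError(f"unsupported op: {op}")


cache = PrimitiveCache()
first = cache.get("relu", (4,), "f32")
second = cache.get("relu", (4,), "f32")   # same key: served from the cache
other = cache.get("relu", (8,), "f32")    # new shape: triggers another build
output = first([-1.0, 0.5, 2.0, -3.0])
print(cache.builds, output)
```

Because inference services tend to see the same shapes repeatedly, most requests hit the cache and skip the build step entirely.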

Performance Evaluation

Two representative address‑search models were evaluated on an Alibaba ECS g7.large instance equipped with an Intel Xeon Platinum 8369B CPU.

ESIM (LSTM‑based): LSTM‑A latency dropped from 0.199 ms to 0.066 ms (a 3.02× speedup), and overall end‑to‑end latency dropped from 6.3 ms to 3.4 ms (a 1.85× speedup), with accuracy maintained.

BERT: The 4‑layer INT8‑quantized model reduced latency from 37.0 ms to 9.0 ms (a 4.11× speedup). The macro F1 score rose from 77.24 to 78.85, so in this evaluation the quantization and compiler optimizations improved speed without costing accuracy.
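The reported speedups are simply the ratio of baseline latency to optimized latency. The short check below recomputes them from the latency figures quoted above; the helper name speedup and the dictionary labels are illustrative choices, not names from the original evaluation.

```python
def speedup(baseline_ms: float, optimized_ms: float) -> float:
    """Speedup factor: how many times faster the optimized run is."""
    return baseline_ms / optimized_ms


# (baseline latency, optimized latency) in milliseconds, from the text above.
results = {
    "ESIM LSTM-A": (0.199, 0.066),
    "ESIM end-to-end": (6.3, 3.4),
    "BERT 4-layer INT8": (37.0, 9.0),
}

speedups = {
    name: round(speedup(base, opt), 2) for name, (base, opt) in results.items()
}

for name, factor in speedups.items():
    print(f"{name}: {factor:.2f}x")
```

The ESIM end-to-end gain is smaller than the LSTM-A operator gain, presumably because the LSTM accounts for only part of the total request time.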
