Risk Detection Model Service Framework and Acceleration for Alibaba Content Risk Control
Alibaba’s new RiskDetection (RD) service framework replaces the bulky Inference‑kgb engine with a Triton‑inspired, Python‑driven kernel that unifies multiple back‑ends and standardizes tensor APIs. It accelerates image, text, and video risk models via HighService and EAS, delivering real‑time content risk control, scalable caching and batching, and significant GPU speedups for Double‑11 promotions.
1. Business background and problems
Content is a key carrier in advertising. High exposure amplifies the impact of risk leakage, making content risk‑control essential. The existing Inference‑kgb engine has become bulky due to the growing number of models, leading to capability, efficiency, quality, and cost issues.
2. Industry comparison and selection
We evaluated Alibaba Cloud EAS, DAMO‑Aquila, CRO Lingjing, Alibaba Mom HighService, and the open‑source Triton Server. The comparison table (language support, quantization, batching, model types, SDK, cloud deployment, etc.) shows that EAS and Triton provide the most comprehensive features for our needs.
3. Model service framework (RD)
The new RiskDetection (RD) kernel follows the NVIDIA Triton Server design and defines a standard business API (Model, Version, Tensor) with dynamic batching. It abstracts multiple back‑ends (EAS, HighService, Aquila) to provide unified model serving and acceleration.
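The dynamic batching idea mentioned above can be sketched as follows. This is a minimal illustration of the general technique used by Triton-style servers, not the RD implementation; class and parameter names are assumptions.

```python
import threading
import time
from queue import Queue, Empty

class DynamicBatcher:
    """Illustrative dynamic batcher: collect requests until the batch is
    full or a small delay elapses, then run one batched inference call."""

    def __init__(self, infer_fn, max_batch=8, max_delay=0.005):
        self.infer_fn = infer_fn      # batched inference function
        self.max_batch = max_batch    # upper bound on batch size
        self.max_delay = max_delay    # max seconds to wait for more requests
        self.queue = Queue()

    def submit(self, item):
        """Called by request threads; blocks until the batch result arrives."""
        done, box = threading.Event(), []
        self.queue.put((item, done, box))
        done.wait()
        return box[0]

    def _loop_once(self):
        """One scheduling round: gather a batch, infer, hand results back."""
        batch = [self.queue.get()]            # block for the first request
        deadline = time.monotonic() + self.max_delay
        while len(batch) < self.max_batch:
            timeout = deadline - time.monotonic()
            if timeout <= 0:
                break
            try:
                batch.append(self.queue.get(timeout=timeout))
            except Empty:
                break
        results = self.infer_fn([item for item, _, _ in batch])
        for (_, done, box), result in zip(batch, results):
            box.append(result)
            done.set()
```

Batching trades a few milliseconds of queueing delay for much higher GPU utilization, since one large matrix multiply is far cheaper than many small ones.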
3.1 Standardized interfaces
Business‑side interfaces support image, text, and video inputs/outputs. Data‑side interfaces adopt the Tensor‑in‑Tensor‑out pattern compatible with KServe. Key data structures include TensorDataType, TensorShape, TensorDataContent, InferParameter, TensorEntity, PredictRequest, and PredictResponse.
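The Tensor‑in‑Tensor‑out structures named above might look like the following sketch. The type names come from the article, but the fields are assumptions modeled loosely on the KServe v2 inference protocol, not the actual RD schema.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import Any, Dict, List

class TensorDataType(Enum):
    FP32 = "FP32"
    INT64 = "INT64"
    BYTES = "BYTES"

@dataclass
class TensorEntity:
    """One named tensor: shape, element type, and flattened contents."""
    name: str
    shape: List[int]                 # TensorShape
    datatype: TensorDataType
    data: List[Any]                  # TensorDataContent, row-major flattened

@dataclass
class PredictRequest:
    """Tensor-in request addressed to a (model, version) pair."""
    model_name: str
    model_version: str
    inputs: List[TensorEntity]
    parameters: Dict[str, Any] = field(default_factory=dict)  # InferParameter

@dataclass
class PredictResponse:
    """Tensor-out response mirroring the request addressing."""
    model_name: str
    model_version: str
    outputs: List[TensorEntity]
```

Keeping the wire format tensor-shaped means any back-end that can consume named tensors is interchangeable behind the same business API.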
3.2 RD kernel technical solution
RD consists of three logical components: Predictor (standard Tensor inference), Transformer (pre‑ and post‑processing), and Backends (actual serving engines). This modular design enables flexible deployment and scaling.
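The three components can be sketched as a minimal pipeline. This is an illustrative skeleton under assumed interfaces, not RD's actual classes:

```python
class Backend:
    """Actual serving engine (e.g. an EAS or HighService client)."""
    def infer(self, tensors):
        raise NotImplementedError

class EchoBackend(Backend):
    """Stand-in backend that returns its input unchanged."""
    def infer(self, tensors):
        return tensors

class Transformer:
    """Pre- and post-processing around standard Tensor inference."""
    def preprocess(self, raw):
        return {"input": raw}       # e.g. decode an image into tensors
    def postprocess(self, tensors):
        return tensors["input"]     # e.g. map logits to risk labels

class Predictor:
    """Standard Tensor-in/Tensor-out entry point wiring the two together."""
    def __init__(self, transformer, backend):
        self.transformer = transformer
        self.backend = backend

    def predict(self, raw):
        tensors = self.transformer.preprocess(raw)
        outputs = self.backend.infer(tensors)
        return self.transformer.postprocess(outputs)
```

Because the Predictor only sees the Transformer and Backend interfaces, swapping EAS for HighService (or co-locating the Transformer on CPU while the Backend runs on GPU) requires no change to business code.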
3.3 Data consistency guarantees
We ensure feature‑level consistency across online, near‑line, and offline scenarios by synchronizing model images, versions, and resources. Business‑level consistency is achieved by allowing different resource mixes (GPU vs. CPU) where strict numerical parity is not required.
4. Model inference acceleration
Previous Inference‑kgb relied on native TensorRT and required C++ re‑implementation for every model update. The new framework uses Python for model development and supports multiple acceleration back‑ends (HighService, EAS), dramatically reducing development cycles.
4.1 HighService backend integration
HighService is an internal heterogeneous‑computing framework that decouples GPU and CPU workloads, uses multi‑process CPU execution, and integrates TensorRT for PyTorch models, achieving large performance gains.
4.2 EAS backend integration
EAS provides seamless PAI integration, Blade acceleration, and comprehensive service/operation features. We adopted the Mediaflow SDK for lightweight DAG‑based model deployment. Example code:
# pipeline construction
with graph.as_default():
    mediaflow.MediaData() \
        .map(tensorflow_op.tensorflow, args=cfg) \
        .output("output")

# invocation
results = engine.run(data_frame, ctx, graph)

Mediaflow enables DAG‑style model composition and inherits Blade acceleration.
5. Service features and effects
Key service/operation features include Caching, Batching, and Scaling, configurable per model. The unified three‑tuple (image + model file + config) simplifies rapid service launch and version management.
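The three-tuple might be expressed as a per-model config like the sketch below. All field names and values here are illustrative assumptions, not the actual RD schema:

```python
# Hypothetical per-model service config capturing the
# (image + model file + config) three-tuple.
service_config = {
    "image": "registry.example.com/rd-runtime:1.0",        # runtime image
    "model_file": "oss://models/spam-detect/v3/model.pt",  # model artifact
    "config": {                                            # per-model knobs
        "caching":  {"enabled": True, "ttl_seconds": 3600},
        "batching": {"max_batch_size": 32, "max_delay_ms": 5},
        "scaling":  {"min_replicas": 2, "max_replicas": 20},
    },
}

def is_complete(cfg):
    """Check that all three parts of the tuple are present."""
    return all(key in cfg for key in ("image", "model_file", "config"))
```

Versioning the whole tuple together means a rollback restores runtime, weights, and tuning parameters atomically.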
Offline support is provided via the Starling scheduler on the Drogo platform, leveraging Hippo resources and ODPS for large‑scale feature extraction.
Business impact: after RD rollout, GPU‑accelerated models achieved dozens‑fold speedup, enabling real‑time risk detection during Double‑11 promotions. The InferenceProxy layer adds QoS‑aware traffic steering based on business tags, ensuring stable service under high load.
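The tag-based steering described above could be as simple as a mapping from business tag to resource pool. The tags and pool names below are assumptions for illustration, not production values:

```python
# Illustrative sketch of QoS-aware steering in an InferenceProxy:
# latency-critical tags go to a reserved pool, everything else to shared.
def route(business_tag, qos_pools, default_pool="shared-pool"):
    """Map a request's business tag to a backend resource pool."""
    return qos_pools.get(business_tag, default_pool)

qos_pools = {
    "realtime-promo": "gpu-reserved",  # latency-critical Double-11 traffic
    "nearline-scan": "cpu-shared",     # throughput-oriented batch scanning
}
```

Separating pools by tag keeps a burst of batch traffic from starving real-time risk checks during peak promotion hours.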
6. Future outlook
We plan to continue expanding GPU/CPU acceleration, explore PPU solutions, standardize caching/batching/scaling across back‑ends, streamline model lifecycle management, and improve cost efficiency through better resource utilization and ROI‑driven model selection.
Alimama Tech
Official Alimama tech channel, showcasing all of Alimama's technical innovations.