How Alibaba Cloud Scales Search Recommendations with Big Data, AI, and LLMs
This article details Alibaba Cloud's end‑to‑end architecture for search and advertising recommendation, covering the data platform, AI services, feature‑store design, training and inference optimizations, and the integration of large language models for new recommendation scenarios.
Overview
The speaker, a product architect from Alibaba Cloud's Machine Learning Platform PAI, introduces a three‑part roadmap: the technical architecture of search‑recommendation advertising, engineering and algorithm practices in that scenario, and explorations that combine large models with the platform.
Search‑Recommendation Architecture
When a user opens an app such as Taobao or Tmall, the front‑end displays a recommendation feed. The back‑end receives an exposure request and routes it through a business engine and an A/B system, which decide the strategies applied at each stage: recall, coarse ranking, and fine ranking. Models such as DeepFM and FM consume features from a feature platform (user behavior, product price, sales, clicks, etc.). Real‑time user actions are streamed into Flink, which generates real‑time features; the same events are persisted to an offline big‑data platform to build day‑level training samples.
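The recall → coarse ranking → fine ranking funnel described above can be sketched as a toy pipeline. The stage sizes, scoring functions, and catalog here are illustrative stand‑ins, not Alibaba's real models:

```python
import random

CATALOG = list(range(10_000))  # candidate item ids (toy catalog)

def recall(user_id: int, k: int = 500) -> list[int]:
    """Cheap candidate generation, e.g. from inverted indexes or embeddings."""
    rng = random.Random(user_id)  # deterministic per user for the demo
    return rng.sample(CATALOG, k)

def coarse_rank(user_id: int, items: list[int], k: int = 50) -> list[int]:
    """Lightweight model (e.g. a two-tower dot product) prunes candidates."""
    score = lambda i: (i * 2654435761 + user_id) % 1000  # stand-in for a model
    return sorted(items, key=score, reverse=True)[:k]

def fine_rank(user_id: int, items: list[int], k: int = 10) -> list[int]:
    """Heavier model (e.g. DeepFM) ranks the short list with rich features."""
    score = lambda i: ((i + user_id) * 40503) % 997  # stand-in for DeepFM
    return sorted(items, key=score, reverse=True)[:k]

def recommend(user_id: int) -> list[int]:
    candidates = recall(user_id)          # thousands -> hundreds
    shortlist = coarse_rank(user_id, candidates)  # hundreds -> tens
    return fine_rank(user_id, shortlist)  # tens -> final feed

feed = recommend(user_id=42)
```

The point of the funnel is cost control: each stage applies a progressively more expensive model to a progressively smaller candidate set.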
Compute and Resource Layer
At the bottom, Alibaba Cloud provides heterogeneous compute resources (GPU, CPU, high‑bandwidth RDMA networking, fast storage) managed by cluster schedulers such as Feitian (the Apsara system underlying ODPS), container services, and the Lingjun intelligent‑compute cluster designed for large‑model workloads.
Big‑Data + AI PaaS Platform
The unified PaaS layer combines batch and real‑time big‑data services: MaxCompute (Alibaba's offline data warehouse, comparable in role to Hadoop/Hive), Hologres (a real‑time engine that pairs OLAP analytics with Redis‑like high‑QPS point lookups), Flink for stream processing, and EMR as the managed cloud counterpart of the open‑source big‑data stack. This platform feeds data into the AI services.
AI Platform Capabilities
Key AI modules include data labeling (PAI‑iTAG), data cleaning, a FeatureStore, interactive development (PAI‑DSW), visual development (PAI‑Designer), distributed training (PAI‑DLC), dataset acceleration, network and operator optimizations, and model serving via PAI‑EAS.
FeatureStore Design
FeatureStore provides a unified repository for structured features, synchronizing data from offline sources (HDFS, MaxCompute) to online stores (Hologres, TableStore, FeatureDB). It supports feature lineage, multi‑level caching, and local‑memory pre‑fetching to meet sub‑second latency requirements in recommendation pipelines.
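The multi‑level caching idea can be illustrated with a two‑level lookup: an in‑process cache backed by an "online store" standing in for Hologres/TableStore. The class and method names are illustrative, not PAI FeatureStore's real API:

```python
import time

class OnlineStore:
    """Pretend remote KV store holding the latest feature values."""
    def __init__(self, data):
        self._data = data

    def get(self, key):
        return self._data.get(key)

class CachedFeatureView:
    """Local-memory cache in front of the online store, with a TTL."""
    def __init__(self, store, ttl_seconds=1.0):
        self._store = store
        self._ttl = ttl_seconds
        self._cache = {}  # key -> (value, expiry timestamp)
        self.hits = self.misses = 0

    def get(self, key):
        entry = self._cache.get(key)
        now = time.monotonic()
        if entry and entry[1] > now:
            self.hits += 1          # served from local memory
            return entry[0]
        self.misses += 1
        value = self._store.get(key)  # remote fetch on miss
        self._cache[key] = (value, now + self._ttl)
        return value

store = OnlineStore({"user:42:ctr_7d": 0.031, "item:7:price": 19.9})
view = CachedFeatureView(store)
view.get("user:42:ctr_7d")  # miss: goes to the online store
view.get("user:42:ctr_7d")  # hit: served from local memory
```

In a real pipeline the cache (plus pre‑fetching of hot keys) is what keeps per‑request feature lookup within the sub‑second budget.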
EasyRec Recommendation Library
EasyRec abstracts common recommendation algorithms (e.g., DeepFM) and supports multiple compute back‑ends (MaxCompute, Hadoop, Spark, local). It offers data source connectors (OSS, MaxCompute, HDFS, Hive), FeatureGenerator for consistent offline‑online logic, AutoML‑HPO, automatic feature generation/selection, model distillation, training acceleration, offline evaluation, and early‑stop mechanisms.
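The offline‑online consistency that FeatureGenerator provides comes from defining each feature transform once and running the identical code on both training samples and live requests. The names below (FeatureDef, bucketize) are an illustrative sketch, not EasyRec's real API:

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class FeatureDef:
    """One declarative feature: where it comes from and how it is transformed."""
    name: str
    source_field: str
    transform: Callable[[float], int]

def bucketize(boundaries):
    """Map a raw value to the index of its bucket."""
    def f(x):
        return sum(1 for b in boundaries if x >= b)
    return f

PRICE_FEATURE = FeatureDef("price_bucket", "price", bucketize([10, 50, 200]))

def generate(feature: FeatureDef, record: dict) -> dict:
    return {feature.name: feature.transform(record[feature.source_field])}

# Offline: applied row-by-row over a training table in MaxCompute/Spark.
offline = generate(PRICE_FEATURE, {"price": 19.9})
# Online: applied to a live request; identical code path, identical result.
online = generate(PRICE_FEATURE, {"price": 19.9})
```

Because both paths share one definition, training/serving feature skew is ruled out by construction rather than caught by auditing.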
Training and Inference Optimizations
Multi‑level cache and automatic feature eviction to reduce memory and compute load.
WorkQueue pattern for producer‑consumer data feeding across heterogeneous servers.
Feature selection and knowledge distillation to simplify models.
Communication reduction via single‑node fusion and pipeline parallelism.
Hardware acceleration with AVX/AMX matrix ops, AllReduce sync, and incremental embedding updates.
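The first item above, automatic feature eviction, can be sketched as frequency‑based pruning of an embedding table: ids seen too rarely are dropped to cap memory. The threshold and the dict‑backed "table" are illustrative; real systems do this inside the parameter server or embedding store:

```python
from collections import Counter

class EvictingEmbeddingTable:
    """Embedding table that drops rows for rarely-seen feature ids."""
    def __init__(self, dim=4, min_count=2):
        self.dim = dim
        self.min_count = min_count
        self.counts = Counter()  # access counts since last eviction pass
        self.table = {}          # feature id -> embedding vector

    def lookup(self, fid):
        self.counts[fid] += 1
        if fid not in self.table:
            self.table[fid] = [0.0] * self.dim  # lazily initialized row
        return self.table[fid]

    def evict(self):
        """Drop rows seen fewer than min_count times, then reset counts."""
        cold = [fid for fid in self.table if self.counts[fid] < self.min_count]
        for fid in cold:
            del self.table[fid]
        self.counts.clear()
        return len(cold)

emb = EvictingEmbeddingTable()
for fid in [1, 1, 2, 3, 3, 3]:
    emb.lookup(fid)
evicted = emb.evict()  # id 2 appeared only once, so its row is evicted
```

Long‑tail ids dominate e‑commerce feature spaces, so evicting cold rows shrinks the table far more than it hurts model quality.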
Inference improvements include AVX/AMX‑accelerated embedding lookup, bf16+int8 quantization on GPUs, AutoPlacement between CPU/GPU, GPU multi‑stream SessionGroup, and feature‑cache optimizations that have yielded ~4× QPS over native TensorFlow Serving in e‑commerce workloads.
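The int8 side of the bf16+int8 scheme can be illustrated with symmetric per‑row quantization of an embedding row: store weights in 8 bits and dequantize on lookup. This is a common simplification for illustration, not necessarily PAI's exact scheme:

```python
def quantize_int8(row):
    """Symmetric quantization: scale the row so its max magnitude maps to 127."""
    scale = max(abs(v) for v in row) / 127 or 1.0
    q = [round(v / scale) for v in row]
    return q, scale

def dequantize(q, scale):
    return [v * scale for v in q]

row = [0.5, -1.27, 0.0, 1.0]
q, scale = quantize_int8(row)
restored = dequantize(q, scale)
# Reconstruction error is bounded by the quantization step (the scale).
max_err = max(abs(a - b) for a, b in zip(row, restored))
```

Shrinking embedding rows from 32‑bit floats to 8‑bit integers cuts memory traffic by roughly 4×, which is where much of the serving speedup comes from.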
LLM‑Driven Recommendation Scenarios
Four emerging use cases are highlighted: e‑commerce product guidance, content recommendation, enterprise knowledge‑base assistants, and educational question answering. Prompt engineering examples demonstrate category‑pair recommendations, query rewriting for ad keywords, and synonym generation to improve search relevance.
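The query‑rewriting and synonym‑generation use cases boil down to prompt templates fed to a chat‑completion API. The wording below is a plausible sketch, not the prompts used in the talk:

```python
# Illustrative prompt templates; plug the resulting strings into any
# chat-completion API. Wording and parameter names are assumptions.

QUERY_REWRITE_PROMPT = """\
You are an e-commerce search assistant.
Rewrite the user query into {n} advertiser-friendly keyword phrases.
Keep the shopping intent; do not invent brands.
Query: {query}
Keywords:"""

SYNONYM_PROMPT = """\
List {n} synonyms or close variants of the product term "{term}"
that shoppers might type into a search box, one per line."""

def build_rewrite_prompt(query: str, n: int = 5) -> str:
    return QUERY_REWRITE_PROMPT.format(query=query, n=n)

def build_synonym_prompt(term: str, n: int = 5) -> str:
    return SYNONYM_PROMPT.format(term=term, n=n)

prompt = build_rewrite_prompt("cheap waterproof hiking shoes")
```

Constraining the model ("keep the shopping intent", "do not invent brands") is what makes the output usable for ad‑keyword matching rather than free‑form text.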
RAG (Retrieval‑Augmented Generation) Pipeline
The end‑to‑end RAG workflow is modularized into Document Extraction, Indexing, Pre‑Retrieval (query rewriting), Retrieval, Post‑Retrieval, Generation, and Evaluation. Supported data types include multi‑modal documents (PDF, Word, PPT) with OCR, hierarchical structure handling, and vector stores such as Elasticsearch, Hologres, and Milvus. The pipeline can generate evaluation sets via large‑model‑based RefGPT and measures metrics like hit rate, accuracy, semantic similarity, and faithfulness.
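The modular stages above can be sketched as a skeleton pipeline. The retriever here is a toy keyword scorer and the generator is a stub; in PAI‑RAG these would be a vector store (Hologres, Elasticsearch, or Milvus) and an LLM call:

```python
def pre_retrieval(query: str) -> str:
    """Query-rewriting stage (here: trivial normalization)."""
    return query.lower().strip()

def retrieve(query: str, docs: list[str], k: int = 2) -> list[str]:
    """Toy retriever: score documents by query-word overlap."""
    score = lambda d: sum(w in d.lower() for w in query.split())
    return sorted(docs, key=score, reverse=True)[:k]

def post_retrieval(passages: list[str]) -> list[str]:
    """Rerank/dedupe stage (here: pass-through)."""
    return passages

def generate(query: str, passages: list[str]) -> str:
    """Generation stub; a real system would prompt an LLM with the context."""
    context = "\n".join(passages)
    return f"[answer to '{query}' grounded in {len(passages)} passages]\n{context}"

DOCS = [
    "Hologres supports vector search for RAG pipelines.",
    "Flink produces real-time features for recommendation.",
    "Milvus is a dedicated vector database.",
]

query = pre_retrieval("  Vector search for RAG  ")
passages = post_retrieval(retrieve(query, DOCS))
answer = generate(query, passages)
```

Keeping each stage behind its own function is what lets the evaluation module score stages independently (e.g. hit rate for Retrieval, faithfulness for Generation).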
Open‑Source PAI‑RAG
PAI‑RAG is released as an open‑source project, offering a Gradio front‑end for easy configuration, data upload, and interactive testing of the RAG chain. It aims to simplify adaptation to various enterprise scenarios.
Conclusion
The presentation concludes with an invitation to join the open‑source community and contribute to the evolving ecosystem.