HighService: A High‑Performance Pythonic AI Service Framework for Model Inference and Global Resource Scheduling
HighService, Alibaba’s Pythonic AI service framework, accelerates large‑model inference and maximizes GPU utilization. It separates CPU and GPU processes, offers out‑of‑the‑box quantization, parallelism, and caching, and dynamically reallocates idle GPUs across clusters through a master‑worker scheduler, keeping online latency low while boosting offline throughput for diffusion and LLM workloads.
HighService is a high‑performance Pythonic AI service framework developed by Alibaba’s advertising platform to accelerate inference of large models (StableDiffusion, LLM, etc.) and improve cluster resource utilization. It addresses both online scenarios (real‑time user requests) and offline batch processing, aiming to keep low latency for online services while maximizing offline throughput.
The framework’s capabilities focus on three aspects: (1) accelerating large‑model inference and improving hardware utilization; (2) global resource scheduling to exploit idle GPU capacity; (3) rapid onboarding of new models and business features.
Design Philosophy
HighService is split into three dimensions: a multifunctional service framework, an inference acceleration library, and a global resource scheduler. It uses a generic HTTP interface, automatically enables monitoring (latency, QPM, failure rate, alerts), and provides out‑of‑the‑box support for common acceleration techniques such as parallelism, low‑precision quantization, prefix caching, and MoE.
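Of the acceleration techniques listed above, prefix caching is easy to illustrate: work already done for a previously seen prompt prefix (e.g., a shared system prompt) is reused instead of recomputed. The sketch below is illustrative only and is not the HighService API; a plain dict stands in for a real KV‑cache.

```python
# Hedged sketch of prefix caching: reuse cached state for the longest
# previously seen prefix of an incoming prompt. A dict stands in for a
# real attention KV-cache; all names here are hypothetical.

class PrefixCache:
    def __init__(self):
        self.cache = {}  # prompt prefix -> cached state (stub)

    def put(self, prompt, state):
        """Store the computed state for a prompt prefix."""
        self.cache[prompt] = state

    def longest_prefix(self, prompt):
        """Return (prefix, state) for the longest cached prefix of prompt."""
        best = ""
        for p in self.cache:
            if prompt.startswith(p) and len(p) > len(best):
                best = p
        return best, self.cache.get(best)

cache = PrefixCache()
cache.put("You are a helpful assistant. ", "kv_state_A")
# A new request sharing the system prompt hits the cache:
prefix, state = cache.longest_prefix("You are a helpful assistant. Hi!")
```

Only the suffix after `prefix` would then need fresh computation, which is where the latency savings come from.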
Main Functions
The architecture consists of a CPU‑GPU separated process model: the CPU process handles request reception and business logic, then forwards model inputs to a GPU process for inference, avoiding Python GIL bottlenecks.
Example of a custom CUDA process:

```python
import high_service as hs

class CudaDetInfer(hs.CudaProcessBase):
    def load_model(self, model_paths):
        return DetInferModel(model_paths)

# Call the model
out = hs.CPUClient.get('CudaDetInfer').run('forward', [torch.ones([2, 3])])
```

HighService also supports standard LLM loaders (e.g., TBStars, Qwen) and provides a simple API for LLM inference:
```python
import high_service as hs

model = hs.llm.start_llm_engine(model_paths)
out = model.forward(inputs)
```

Global Resource Scheduling
To cope with the massive GPU demand of Alibaba’s advertising workloads, HighService dynamically reallocates GPU resources across clusters. When an online service’s traffic spikes, idle GPUs from less‑busy services are transferred to the hot service, increasing offline task throughput without hurting online latency.
A universal “Busy” metric is introduced to reflect system load: it measures the average active time of CPU processes over a 10‑second window. A high Busy value indicates GPU saturation, prompting automatic scaling decisions.
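A minimal sketch of such a metric, assuming per‑process activity samples over a 10‑second sliding window (the class and method names below are hypothetical, not the HighService API):

```python
import time

# Hypothetical sketch of a "Busy" metric: fraction of the sliding window
# during which a CPU process was active. The article describes a
# 10-second window; everything else here is illustrative.

WINDOW_SECONDS = 10.0

class BusyTracker:
    def __init__(self, window=WINDOW_SECONDS):
        self.window = window
        self.samples = []  # (timestamp, active seconds in the interval)

    def record(self, timestamp, active_seconds):
        """Record how long the process was active since the last sample."""
        self.samples.append((timestamp, active_seconds))
        cutoff = timestamp - self.window
        # Drop samples that have fallen out of the window.
        self.samples = [(t, a) for t, a in self.samples if t >= cutoff]

    def busy(self):
        """Average fraction of the window spent active, in [0, 1]."""
        if not self.samples:
            return 0.0
        total_active = sum(a for _, a in self.samples)
        return min(total_active / self.window, 1.0)

tracker = BusyTracker()
now = time.time()
# Three 1-second sampling intervals with 0.9s, 0.8s, 1.0s of activity:
tracker.record(now - 2, 0.9)
tracker.record(now - 1, 0.8)
tracker.record(now, 1.0)
busy_value = tracker.busy()  # 2.7 active seconds over a 10s window
```

A value near 1.0 would mean the CPU side is saturated feeding the GPU, which is the signal that triggers scaling decisions.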
Distributed Architecture
HighService adopts a master‑worker model: the Master discovers Workers via Vipserver/TRI, registers their model capabilities, and forwards requests to the least‑busy Worker. This design solves uneven request distribution and multi‑model deployment complexities.
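The routing decision itself can be sketched as follows; worker discovery (Vipserver/TRI in the article) is stubbed out, and all names are illustrative rather than the HighService API:

```python
# Hedged sketch of least-busy routing in a master-worker setup: the
# Master keeps a registry of workers and their model capabilities, and
# forwards each request to the least-loaded worker that can serve it.

from dataclasses import dataclass

@dataclass
class Worker:
    name: str
    models: set          # model capabilities registered with the Master
    busy: float = 0.0    # current load metric reported by the worker

class Master:
    def __init__(self):
        self.workers = []

    def register(self, worker):
        """Called when service discovery finds a new worker."""
        self.workers.append(worker)

    def route(self, model_name):
        """Forward a request to the least-busy worker serving the model."""
        candidates = [w for w in self.workers if model_name in w.models]
        if not candidates:
            raise LookupError(f"no worker serves {model_name}")
        return min(candidates, key=lambda w: w.busy)

master = Master()
master.register(Worker("w1", {"qwen"}, busy=0.8))
master.register(Worker("w2", {"qwen", "sd"}, busy=0.3))
chosen = master.route("qwen")  # picks the less-loaded w2
```

Because routing consults per‑worker load rather than round‑robin position, hot models on busy machines stop attracting new traffic until their load drops.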
Application‑Specific Optimizations
For StableDiffusion, HighService applies pipeline parallelism (e.g., CFG with positive/negative prompts) and FlashAttention to achieve up to 1.8× speed‑up. Batching is used to improve GPU utilization for low‑compute LLM workloads, delivering 1‑10× throughput gains under latency constraints.
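The batching step can be sketched as collecting requests until either the batch is full or a latency budget expires, then running one fused forward pass. This is a generic pattern, not HighService's implementation; the constants and names are assumptions:

```python
import time
from collections import deque

# Hedged sketch of latency-bounded batch assembly: drain up to
# MAX_BATCH queued requests, but never wait longer than MAX_WAIT_S,
# so online latency stays bounded while the GPU sees larger batches.

MAX_BATCH = 8
MAX_WAIT_S = 0.01  # latency budget for batch assembly (illustrative)

def collect_batch(queue, max_batch=MAX_BATCH, max_wait=MAX_WAIT_S):
    """Drain up to max_batch requests, waiting at most max_wait seconds."""
    batch = []
    deadline = time.monotonic() + max_wait
    while len(batch) < max_batch and time.monotonic() < deadline:
        if queue:
            batch.append(queue.popleft())
        else:
            time.sleep(0.001)  # brief pause while waiting for arrivals
    return batch

requests = deque(range(20))
first = collect_batch(requests)  # a full batch of 8 requests
```

Under this scheme a lightly loaded service degrades gracefully to small batches (the deadline fires first), while a loaded one fills batches immediately, which is where the quoted 1‑10× throughput gains come from.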
LLM services (e.g., AI‑XiaoWan, creative copy generation, content moderation) benefit from integrated support for vLLM, TensorRT‑LLM, streaming/non‑streaming APIs, and advanced techniques such as speculative sampling, prefix caching, quantization, and continuous batching.
Conclusion
HighService has evolved from a single‑GPU inference accelerator for small models to a comprehensive AI service platform supporting multi‑GPU, cluster‑wide dynamic scheduling, and distributed inference for both diffusion and large language models. Future work includes scaling to ultra‑large models (≥671B parameters) and further collaboration with the AI community.
Alimama Tech
Official Alimama tech channel, showcasing all of Alimama's technical innovations.