HighService: A High‑Performance Pythonic AI Service Framework for Model Inference and Global Resource Scheduling
HighService, Alibaba’s Pythonic AI service framework, accelerates large‑model inference and maximizes GPU utilization. It separates CPU and GPU processes, offers out‑of‑the‑box quantization, parallelism, and caching, and dynamically reallocates idle GPUs across clusters through a master‑worker scheduler, keeping online latency low while boosting offline throughput for diffusion and LLM workloads.
HighService is a high‑performance Pythonic AI service framework developed by Alibaba’s advertising platform to accelerate inference of large models (StableDiffusion, LLM, etc.) and improve cluster resource utilization. It addresses both online scenarios (real‑time user requests) and offline batch processing, aiming to keep low latency for online services while maximizing offline throughput.
The framework’s capabilities focus on three aspects: (1) accelerating large‑model inference and improving hardware utilization; (2) global resource scheduling to exploit idle GPU capacity; (3) rapid onboarding of new models and business features.
Design Philosophy
HighService is split into three dimensions: a multifunctional service framework, an inference acceleration library, and a global resource scheduler. It uses a generic HTTP interface, automatically enables monitoring (latency, QPM, failure rate, alerts), and provides out‑of‑the‑box support for common acceleration techniques such as parallelism, low‑precision quantization, prefix caching, and MoE.
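Of the acceleration techniques listed above, prefix caching is easy to illustrate: work already done for a previously seen prompt prefix (e.g., a shared system prompt) is reused instead of recomputed. The sketch below is illustrative only and is not the HighService API; a plain dict stands in for a real KV‑cache.

```python
# Hedged sketch of prefix caching: reuse cached state for the longest
# previously seen prefix of an incoming prompt. A dict stands in for a
# real attention KV-cache; all names here are hypothetical.

class PrefixCache:
    def __init__(self):
        self.cache = {}  # prompt prefix -> cached state (stub)

    def put(self, prompt, state):
        """Store the computed state for a prompt prefix."""
        self.cache[prompt] = state

    def longest_prefix(self, prompt):
        """Return (prefix, state) for the longest cached prefix of prompt."""
        best = ""
        for p in self.cache:
            if prompt.startswith(p) and len(p) > len(best):
                best = p
        return best, self.cache.get(best)

cache = PrefixCache()
cache.put("You are a helpful assistant. ", "kv_state_A")
# A new request sharing the system prompt hits the cache:
prefix, state = cache.longest_prefix("You are a helpful assistant. Hi!")
```

Only the suffix after `prefix` would then need fresh computation, which is where the latency savings come from.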
Main Functions
The architecture consists of a CPU‑GPU separated process model: the CPU process handles request reception and business logic, then forwards model inputs to a GPU process for inference, avoiding Python GIL bottlenecks.
Example of a custom CUDA process:

```python
import high_service as hs

class CudaDetInfer(hs.CudaProcessBase):
    def load_model(self, model_paths):
        return DetInferModel(model_paths)

# Call the model
out = hs.CPUClient.get('CudaDetInfer').run('forward', [torch.ones([2, 3])])
```

HighService also supports standard LLM loaders (e.g., TBStars, Qwen) and provides a simple API for LLM inference:
```python
import high_service as hs

model = hs.llm.start_llm_engine(model_paths)
out = model.forward(inputs)
```

Global Resource Scheduling
To cope with the massive GPU demand of Alibaba’s advertising workloads, HighService dynamically reallocates GPU resources across clusters. When an online service’s traffic spikes, idle GPUs from less‑busy services are transferred to the hot service, increasing offline task throughput without hurting online latency.
A universal “Busy” metric is introduced to reflect system load: it measures the average active time of CPU processes over a 10‑second window. A high Busy value indicates GPU saturation, prompting automatic scaling decisions.
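A minimal sketch of such a metric, assuming per‑process activity samples over a 10‑second sliding window (the class and method names below are hypothetical, not the HighService API):

```python
import time

# Hypothetical sketch of a "Busy" metric: fraction of the sliding window
# during which a CPU process was active. The article describes a
# 10-second window; everything else here is illustrative.

WINDOW_SECONDS = 10.0

class BusyTracker:
    def __init__(self, window=WINDOW_SECONDS):
        self.window = window
        self.samples = []  # (timestamp, active seconds in the interval)

    def record(self, timestamp, active_seconds):
        """Record how long the process was active since the last sample."""
        self.samples.append((timestamp, active_seconds))
        cutoff = timestamp - self.window
        # Drop samples that have fallen out of the window.
        self.samples = [(t, a) for t, a in self.samples if t >= cutoff]

    def busy(self):
        """Average fraction of the window spent active, in [0, 1]."""
        if not self.samples:
            return 0.0
        total_active = sum(a for _, a in self.samples)
        return min(total_active / self.window, 1.0)

tracker = BusyTracker()
now = time.time()
# Three 1-second sampling intervals with 0.9s, 0.8s, 1.0s of activity:
tracker.record(now - 2, 0.9)
tracker.record(now - 1, 0.8)
tracker.record(now, 1.0)
busy_value = tracker.busy()  # 2.7 active seconds over a 10s window
```

A value near 1.0 would mean the CPU side is saturated feeding the GPU, which is the signal that triggers scaling decisions.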
Distributed Architecture
HighService adopts a master‑worker model: the Master discovers Workers via Vipserver/TRI, registers their model capabilities, and forwards requests to the least‑busy Worker. This design solves uneven request distribution and multi‑model deployment complexities.
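The routing decision itself can be sketched as follows; worker discovery (Vipserver/TRI in the article) is stubbed out, and all names are illustrative rather than the HighService API:

```python
# Hedged sketch of least-busy routing in a master-worker setup: the
# Master keeps a registry of workers and their model capabilities, and
# forwards each request to the least-loaded worker that can serve it.

from dataclasses import dataclass

@dataclass
class Worker:
    name: str
    models: set          # model capabilities registered with the Master
    busy: float = 0.0    # current load metric reported by the worker

class Master:
    def __init__(self):
        self.workers = []

    def register(self, worker):
        """Called when service discovery finds a new worker."""
        self.workers.append(worker)

    def route(self, model_name):
        """Forward a request to the least-busy worker serving the model."""
        candidates = [w for w in self.workers if model_name in w.models]
        if not candidates:
            raise LookupError(f"no worker serves {model_name}")
        return min(candidates, key=lambda w: w.busy)

master = Master()
master.register(Worker("w1", {"qwen"}, busy=0.8))
master.register(Worker("w2", {"qwen", "sd"}, busy=0.3))
chosen = master.route("qwen")  # picks the less-loaded w2
```

Because routing consults per‑worker load rather than round‑robin position, hot models on busy machines stop attracting new traffic until their load drops.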
Application‑Specific Optimizations
For StableDiffusion, HighService applies pipeline parallelism (e.g., CFG with positive/negative prompts) and FlashAttention to achieve up to 1.8× speed‑up. Batching is used to improve GPU utilization for low‑compute LLM workloads, delivering 1‑10× throughput gains under latency constraints.
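The batching step can be sketched as collecting requests until either the batch is full or a latency budget expires, then running one fused forward pass. This is a generic pattern, not HighService's implementation; the constants and names are assumptions:

```python
import time
from collections import deque

# Hedged sketch of latency-bounded batch assembly: drain up to
# MAX_BATCH queued requests, but never wait longer than MAX_WAIT_S,
# so online latency stays bounded while the GPU sees larger batches.

MAX_BATCH = 8
MAX_WAIT_S = 0.01  # latency budget for batch assembly (illustrative)

def collect_batch(queue, max_batch=MAX_BATCH, max_wait=MAX_WAIT_S):
    """Drain up to max_batch requests, waiting at most max_wait seconds."""
    batch = []
    deadline = time.monotonic() + max_wait
    while len(batch) < max_batch and time.monotonic() < deadline:
        if queue:
            batch.append(queue.popleft())
        else:
            time.sleep(0.001)  # brief pause while waiting for arrivals
    return batch

requests = deque(range(20))
first = collect_batch(requests)  # a full batch of 8 requests
```

Under this scheme a lightly loaded service degrades gracefully to small batches (the deadline fires first), while a loaded one fills batches immediately, which is where the quoted 1‑10× throughput gains come from.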
LLM services (e.g., AI‑XiaoWan, creative copy generation, content moderation) benefit from integrated support for vLLM, TensorRT‑LLM, streaming/non‑streaming APIs, and advanced techniques such as speculative sampling, prefix caching, quantization, and continuous batching.
Conclusion
HighService has evolved from a single‑GPU inference accelerator for small models to a comprehensive AI service platform supporting multi‑GPU, cluster‑wide dynamic scheduling, and distributed inference for both diffusion and large language models. Future work includes scaling to ultra‑large models (≥671B parameters) and further collaboration with the AI community.
Alimama Tech
Official Alimama tech channel, showcasing all of Alimama's technical innovations.