How vLLM 0.8.4 Implements Multi‑LoRA for Efficient Large‑Model Inference

This article provides a step‑by‑step technical walkthrough of vLLM 0.8.4 on a single GPU, detailing the platform’s startup, model loading, Multi‑LoRA deployment, internal ZMQ communication, request scheduling, and inference execution, with key source‑code excerpts throughout.

Background

Since the release of ChatGPT on 2022‑11‑30, 58.com TEG‑AI Lab has built a Model‑as‑a‑Service (MaaS) platform that supports training and inference of large language models (LLMs). The platform launched in May 2023 and hosts a domain‑specific model called Lingxi, which has seen daily inference traffic grow to tens of millions of requests.

Multi‑LoRA Inference Service

To maximize GPU utilization and reduce inference cost, the team extended vLLM 0.8.4 with Multi‑LoRA support. Multi‑LoRA serves several LoRA adapters from a single copy of their shared base model on one GPU; the only requirement is that every adapter was fine‑tuned from the same base checkpoint.

1. Single‑LoRA Deployment

A classic deployment merges the LoRA adapter weights into the base model after fine‑tuning, producing a standalone model that runs on a single GPU.

2. Multi‑LoRA Deployment

Multi‑LoRA keeps the base model unchanged and loads several LoRA adapters at runtime. Inactive adapters are stored separately on CPU, and their low‑rank updates are applied on top of the base model only during the forward pass, which saves GPU memory and enables on‑the‑fly adapter switching.
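
As a concrete reference before diving into the internals, this is roughly how a client drives Multi‑LoRA through vLLM’s offline API (the model name and adapter path below are placeholders; the pattern follows the public vLLM LoRA example):

from vllm import LLM, SamplingParams
from vllm.lora.request import LoRARequest

# Placeholder base model and adapter path; substitute your own checkpoints.
llm = LLM(model="meta-llama/Llama-2-7b-hf", enable_lora=True, max_loras=4)

sampling = SamplingParams(temperature=0.0, max_tokens=64)

# Each request names the adapter it wants; requests targeting different
# adapters can be batched together against the shared base model.
outputs = llm.generate(
    ["Translate to SQL: list all users"],
    sampling,
    lora_request=LoRARequest("sql_adapter", 1, "/path/to/sql_lora"),
)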

Key Source Code

vllm/executor/uniproc_executor.py

class UniProcExecutor(ExecutorBase):
    """Runs the model in the server process, driving a single local worker."""
    # UniProcExecutor defines no __init__ of its own; it inherits
    # ExecutorBase.__init__, which stores the engine config and then calls
    # self._init_executor() to create the in-process worker.

The executor’s initialization eventually calls collective_rpc to create the worker, initialize its device, and load the model.
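
In the single‑process case, collective_rpc reduces to a direct method call on the one local worker; a minimal sketch of the idea (the real signature also accepts timeouts and keyword arguments):

def collective_rpc(self, method: str, args: tuple = ()) -> list:
    # Single-process case: invoke the method directly on the local worker
    # and wrap the result in a list, mirroring the multi-worker interface.
    answer = getattr(self.driver_worker, method)(*args)
    return [answer]

# During _init_executor, roughly this sequence runs:
#   self.collective_rpc("init_device")  # bind the worker to its GPU
#   self.collective_rpc("load_model")   # construct the model, load weights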

Model Loading Process

The load_model function creates the model on the target device, gathers the set of parameters that need weights, loads the weights from disk, validates them, performs optional post‑processing (e.g., quantization), and returns the model in evaluation mode.

def load_model(self, vllm_config: VllmConfig) -> nn.Module:
    device_config = vllm_config.device_config
    model_config = vllm_config.model_config
    target_device = torch.device(device_config.device)
    with set_default_torch_dtype(model_config.dtype):
        with target_device:
            model = _initialize_model(vllm_config=vllm_config)
            weights_to_load = {name for name, _ in model.named_parameters()}
            loaded_weights = model.load_weights(self.get_all_weights(model_config, model))
            # ... validation and post‑processing ...
    return model.eval()

Multi‑LoRA Loading

Two entry points exist: static loading at service start (e.g., via the --lora-modules launch flag) and dynamic loading through the /v1/load_lora_adapter API when the environment variable VLLM_ALLOW_RUNTIME_LORA_UPDATING is enabled. Both paths end up in add_adapter on the server side; a client‑side example follows the excerpt below.

def add_adapter(self, lora_request: LoRARequest) -> bool:
    if lora_request.lora_int_id not in self.list_adapters():
        # New adapter: load it from disk onto CPU, evicting the
        # least-recently-used adapter if the cache is at capacity.
        lora = self._load_adapter(lora_request)
        if len(self._adapter_manager) + 1 > self._adapter_manager.capacity:
            self._adapter_manager.remove_oldest_adapter()
        loaded = self._adapter_manager.add_adapter(lora)
    else:
        # Already resident: just confirm it is present.
        loaded = self._adapter_manager.get_adapter(lora_request.lora_int_id) is not None
    # Mark the adapter active so upcoming batches can route requests to it.
    self._adapter_manager.activate_adapter(lora_request.lora_int_id)
    return loaded
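
For the dynamic path, a client registers a new adapter over HTTP; a minimal sketch using the /v1/load_lora_adapter endpoint mentioned above (server URL and adapter path are placeholders):

import requests

# Requires the server to run with VLLM_ALLOW_RUNTIME_LORA_UPDATING=True.
resp = requests.post(
    "http://localhost:8000/v1/load_lora_adapter",
    json={
        "lora_name": "sql_adapter",        # name used in later requests
        "lora_path": "/path/to/sql_lora",  # local checkpoint directory
    },
)
resp.raise_for_status()

# Subsequent /v1/completions calls select the adapter by setting
# "model": "sql_adapter" in the request body.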

The _load_adapter method validates the adapter, loads its weights onto CPU, and constructs a LoRAModel instance.

def _load_adapter(self, lora_request: LoRARequest) -> LoRAModel:
    lora_path = get_adapter_absolute_path(lora_request.lora_path)
    peft_helper = PEFTHelper.from_local_dir(lora_path, self.max_position_embeddings)
    peft_helper.validate_legal(self.lora_config)
    # expected_lora_modules is computed earlier from the model's supported
    # LoRA target modules (elided in this excerpt).
    lora = self._lora_model_cls.from_local_checkpoint(
        lora_path,
        expected_lora_modules,
        peft_helper=peft_helper,
        lora_model_id=lora_request.lora_int_id,
        device="cpu",
        dtype=self.lora_config.lora_dtype,
        target_embedding_padding=self.vocab_size + self.lora_config.lora_extra_vocab_size,
        embedding_modules=self.embedding_modules,
        embedding_padding_modules=self.embedding_padding_modules,
        weights_mapper=hf_to_vllm_mapper,
    )
    return lora

After registration, _create_merged_loras_inplace packs the LoRA weights of modules that the base model fuses (e.g., the q/k/v projections fused into a single qkv_proj layer) into one PackedLoRALayerWeights object and replaces the original entries.

def _create_merged_loras_inplace(self, lora_model: LoRAModel) -> None:
    for module_name, new_module_names in self.packed_modules.items():
        replacement_loras = []
        replaced_module = set()
        for r in new_module_names:
            lora = self._get_lora_layer_weights(lora_model, r)
            replacement_loras.append(lora)
            if lora:
                replaced_module.add(r)
        if not any(replacement_loras):
            continue
        lora_model.loras[module_name] = PackedLoRALayerWeights.pack(replacement_loras)
        for module in replaced_module:
            lora_model.loras.pop(module, None)
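
The packed_modules bookkeeping comes from a per‑model mapping between fused base‑model layers and the per‑projection names found in a PEFT checkpoint; for Llama‑style models the convention looks roughly like this (a sketch, not a verbatim copy of any one model file):

# Fused layers in the base model, mapped to the per-projection LoRA
# module names that appear in a PEFT checkpoint.
packed_modules_mapping = {
    "qkv_proj": ["q_proj", "k_proj", "v_proj"],
    "gate_up_proj": ["gate_proj", "up_proj"],
}
# _create_merged_loras_inplace therefore packs the q/k/v LoRA weights into
# a single PackedLoRALayerWeights entry keyed by "qkv_proj".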

Internal Communication

All requests are sent through a ZeroMQ socket (zmq_socket), with EngineCoreRequestType distinguishing message types. LoRA loading uses the UTILITY request type, which _handle_client_request dispatches in the engine‑core process.

def _handle_client_request(self, request_type: EngineCoreRequestType, request: Any) -> None:
    if request_type == EngineCoreRequestType.ADD:
        self.add_request(request)
    elif request_type == EngineCoreRequestType.ABORT:
        self.abort_requests(request)
    elif request_type == EngineCoreRequestType.UTILITY:
        call_id, method_name, args = request
        output = UtilityOutput(call_id)
        try:
            method = getattr(self, method_name)
            output.result = method(*self._convert_msgspec_args(method, args))
        except BaseException as e:
            output.failure_message = f"Call to {method_name} failed: {str(e)}"
        self.output_queue.put_nowait(EngineCoreOutputs(utility_output=output))
    else:
        logger.error("Unrecognized input request type: %s", request_type)
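
Conceptually, the client side of a UTILITY call tags the request with a call_id, sends the method name and arguments over the socket, and blocks until the matching UtilityOutput comes back; a simplified sketch (send_request and wait_for are illustrative names, not the actual vLLM client API):

import uuid

def call_utility(self, method_name: str, *args):
    # Illustrative only: tag the call so the reply can be matched later.
    call_id = str(uuid.uuid4())
    self.send_request(EngineCoreRequestType.UTILITY, (call_id, method_name, args))
    result = self.wait_for(call_id)  # blocks until the UtilityOutput arrives
    if result.failure_message is not None:
        raise RuntimeError(result.failure_message)
    return result.result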

Inference Request Flow

When a client calls /v1/completions, the request is transformed into an EngineCoreRequestType.ADD message, placed into the input_queue, and later scheduled by the Scheduler. The scheduler decides which requests run, which are pre‑empted, and how many new tokens each request can generate based on KV‑cache availability.

def schedule(self) -> SchedulerOutput:
    # Simplified excerpt. First, keep already-running requests going.
    token_budget = self.max_num_scheduled_tokens
    req_index = 0
    while req_index < len(self.running) and token_budget > 0:
        request = self.running[req_index]
        num_new_tokens = min(
            request.num_tokens_with_spec - request.num_computed_tokens,
            token_budget)
        new_blocks = self.kv_cache_manager.allocate_slots(request, num_new_tokens)
        if new_blocks is None:
            # Out of KV-cache blocks: preempt the lowest-priority (last)
            # running request and retry; it rejoins the front of the
            # waiting queue to be resumed later.
            preempted_req = self.running.pop()
            self.kv_cache_manager.free(preempted_req)
            self.waiting.appendleft(preempted_req)
            continue
        token_budget -= num_new_tokens
        req_index += 1
        # Record scheduling info ...
    # Then admit waiting requests under similar checks (LoRA limits,
    # token budget, KV-cache availability), and finally build a
    # SchedulerOutput covering new, resumed, and running requests.
    return scheduler_output

The scheduler enforces constraints such as the maximum number of concurrently active LoRA adapters (max_loras), the per‑step token budget, and KV‑cache capacity, moving unschedulable requests back to the waiting queue.
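
A simplified sketch of the adapter‑limit check applied while admitting waiting requests (scheduled_loras, a set of adapter ids already admitted to this batch, is illustrative shorthand for the scheduler's bookkeeping):

# Simplified: stop admitting a waiting request if its adapter would
# exceed the number of LoRA slots available on the GPU.
if self.lora_config and request.lora_request:
    req_lora_id = request.lora_request.lora_int_id
    if (req_lora_id not in scheduled_loras
            and len(scheduled_loras) >= self.lora_config.max_loras):
        break  # leave this and later requests in the waiting queue
    scheduled_loras.add(req_lora_id)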

Execution of a Scheduled Request

After scheduling, the core calls execute_model on the GPU worker. The worker’s GPUModelRunner prepares inputs, activates the required LoRA adapters, and runs the forward pass.

def execute_model(self, scheduler_output: SchedulerOutput):
    # Simplified excerpt: _prepare_inputs builds the attention metadata and
    # also fills the batched input tensors (input_ids, positions, ...).
    attn_metadata, logits_indices, spec_decode_metadata = self._prepare_inputs(scheduler_output)
    with set_forward_context(attn_metadata, self.vllm_config):
        output = self.model(
            input_ids=input_ids,
            positions=positions,
            intermediate_tensors=intermediate_tensors,
            inputs_embeds=inputs_embeds,
        )
    return output
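
Between input preparation and the forward pass, the runner also tells the LoRA manager which adapters the batch needs; the condensed sketch below approximates that activation step (the names are condensed from the worker's LoRA management code and should be read as a sketch, not verbatim source):

# Sketch: collect the LoRA request for each scheduled sequence and hand the
# token-to-adapter mapping to the manager, which stages the needed adapters
# into GPU slots and refreshes the punica_wrapper metadata.
lora_requests = {req.lora_request for req in scheduled_reqs if req.lora_request}
lora_mapping = LoRAMapping(token_lora_ids, prompt_lora_ids)
self.lora_manager.set_active_adapters(lora_requests, lora_mapping)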

During the forward pass, each LoRA‑enabled linear layer calls apply, which forwards the activation to the Punica kernel via punica_wrapper.add_lora_linear. The wrapper holds metadata (adapter indices, weight pointers, scaling factors) required by the custom kernel.

def apply(self, x: torch.Tensor, bias: Optional[torch.Tensor] = None) -> torch.Tensor:
    # Run the base layer first (possibly quantized), then add the LoRA
    # delta to its output in place.
    output = self.base_layer.quant_method.apply(self.base_layer, x, bias)
    if x.ndim == 3 and output.ndim == 3:
        # The Punica kernels expect 2-D (num_tokens, hidden_size) tensors.
        output = output.flatten(0, 1)
        x = x.flatten(0, 1)
    self.punica_wrapper.add_lora_linear(
        output, x, self.lora_a_stacked, self.lora_b_stacked,
        self.lora_bias_stacked, 1.0, self.output_slices)
    return output
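
Per token, the kernel's effect is the standard LoRA update; a dense PyTorch reference of what add_lora_linear contributes for a single adapter (ignoring the batching of tokens across different adapters and the slicing across output_slices):

import torch

def lora_linear_reference(output: torch.Tensor, x: torch.Tensor,
                          lora_a: torch.Tensor, lora_b: torch.Tensor,
                          scale: float = 1.0) -> torch.Tensor:
    # output: (num_tokens, out_features), result of the base linear layer
    # lora_a: (in_features, rank), lora_b: (rank, out_features)
    # The kernel performs this in two steps: a "shrink" (x @ A) down to the
    # low rank, then an "expand" ((x @ A) @ B) back up to out_features.
    return output + scale * (x @ lora_a) @ lora_b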

Model Preparation for LoRA

When the model is first loaded, _create_lora_modules walks through all sub‑modules, replaces supported layers (e.g., Linear, ColumnParallelLinear, RowParallelLinear) with LoRA‑aware counterparts, registers them, and attaches a shared punica_wrapper for later metadata updates.

def _create_lora_modules(self):
    for module_name, module in self.model.named_modules(remove_duplicate=False):
        if not self._match_target_modules(module_name):
            continue
        new_module = replace_submodule(
            self.model, module_name,
            from_layer(module, self.lora_slots, self.lora_config,
                       self.packed_modules_mapping.get(module_name.split('.')[-1], []),
                       self.model.config)
        )
        new_module.set_mapping(self.punica_wrapper)
        self.register_module(module_name, new_module)

The helper replace_submodule simply swaps the attribute on the parent module.

def replace_submodule(model: nn.Module, module_name: str, new_module: nn.Module) -> nn.Module:
    parent = model.get_submodule('.'.join(module_name.split('.')[:-1]))
    setattr(parent, module_name.split('.')[-1], new_module)
    return new_module
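
A quick usage example of replace_submodule on a toy module (the Toy class is for illustration only):

import torch.nn as nn

class Toy(nn.Module):
    def __init__(self):
        super().__init__()
        self.block = nn.Sequential(nn.Linear(4, 4), nn.ReLU())

model = Toy()
# Swap the inner Linear for a LoRA-aware (here: just wider) replacement;
# the parent module "block" keeps its other children untouched.
replace_submodule(model, "block.0", nn.Linear(4, 8))
assert isinstance(model.block[0], nn.Linear) and model.block[0].out_features == 8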

Conclusion

This article walked through vLLM 0.8.4’s startup, model loading, Multi‑LoRA deployment, internal messaging, request scheduling, and inference execution, highlighting the core functions and data flows that let multiple LoRA adapters share a single base model on one GPU. The code excerpts provide a practical reference for engineers building LLM serving platforms with dynamic LoRA support.

Tags: vLLM, GPU inference, Model Serving, LoRA adapters, Multi‑LoRA
Written by 58 Tech, the official tech channel of 58, a platform for tech innovation, sharing, and communication.