How vLLM 0.8.4 Implements Multi‑LoRA for Efficient Large‑Model Inference
This article walks step by step through vLLM 0.8.4 on a single GPU, covering platform startup, model loading, Multi‑LoRA deployment, internal ZMQ communication, request scheduling, and inference execution, illustrated with key source‑code excerpts and architecture diagrams.
Background
Since the release of ChatGPT on 2022‑11‑30, 58.com TEG‑AI Lab has built a Model‑as‑a‑Service (MaaS) platform that supports training and inference of large language models (LLMs). The platform launched in May 2023 and hosts a domain‑specific model called Lingxi, which has seen daily inference traffic grow to tens of millions of requests.
Multi‑LoRA Inference Service
To maximize GPU utilization and reduce inference cost, the team extended vLLM 0.8.4 with Multi‑LoRA support. Multi‑LoRA loads multiple LoRA adapters alongside a single copy of their shared base model on one GPU, provided all adapters were fine‑tuned from the same base checkpoint.
1. Single‑LoRA Deployment
A classic deployment merges the LoRA adapter weights into the base model after fine‑tuning, producing a standalone model that runs on a single GPU.
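The merge step can be sketched in plain numpy. Shapes and the alpha/r scaling follow the standard LoRA formulation; the variable names are illustrative, not vLLM's:

```python
import numpy as np

def merge_lora(W: np.ndarray, A: np.ndarray, B: np.ndarray,
               alpha: float, r: int) -> np.ndarray:
    """Fold a LoRA adapter into a base weight matrix.

    W: (out, in) base weight; A: (r, in) down-projection;
    B: (out, r) up-projection. The adapter contribution is
    scaled by alpha / r, as in the original LoRA paper.
    """
    return W + (alpha / r) * (B @ A)

# Toy example: a 4x4 base layer with a rank-2 adapter.
rng = np.random.default_rng(0)
W = rng.standard_normal((4, 4))
A = rng.standard_normal((2, 4))
B = rng.standard_normal((4, 2))
W_merged = merge_lora(W, A, B, alpha=16, r=2)

# The merged forward pass equals base output plus the scaled adapter path.
x = rng.standard_normal(4)
assert np.allclose(W_merged @ x, W @ x + 8 * (B @ (A @ x)))
```

After merging, the adapter matrices can be discarded and the model served like any ordinary checkpoint, which is exactly why this classic deployment cannot share a GPU across adapters.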
2. Multi‑LoRA Deployment
Multi‑LoRA keeps the base model unchanged and loads several LoRA adapters at runtime. The adapters are stored separately on the CPU and applied on top of the base model only during inference, which saves GPU memory and enables on‑the‑fly adapter switching.
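From the client's perspective, an adapter is selected per request. A minimal sketch, assuming the server was started with LoRA enabled and an adapter registered under the illustrative name "sql-lora" (with Multi‑LoRA, the OpenAI‑compatible endpoint selects an adapter via the "model" field):

```python
import json

# Hypothetical adapter name; base-model requests would pass the served
# base model name in "model" instead.
payload = {
    "model": "sql-lora",          # LoRA adapter name, not the base model
    "prompt": "Write a SQL query that returns the top 10 users.",
    "max_tokens": 64,
}
body = json.dumps(payload)
# POST `body` to http://localhost:8000/v1/completions (server assumed local).
```

Because only the small A/B matrices differ per request, many such adapters can be multiplexed over one resident base model.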
Key Source Code
vllm/executor/uniproc_executor.py
class UniProcExecutor(ExecutorBase):
    # UniProcExecutor defines no __init__ of its own;
    # ExecutorBase.__init__ runs instead and ends by calling
    # the subclass hook self._init_executor().

The executor's initialization eventually calls collective_rpc to create the workers, initialize their devices, and load the model.
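The initialization chain can be modeled with a small sketch. The class and worker method bodies here are toy stand-ins; only the _init_executor/collective_rpc pattern mirrors the real code:

```python
class ExecutorBase:
    def __init__(self, vllm_config):
        self.vllm_config = vllm_config
        self._init_executor()          # subclass hook

    def collective_rpc(self, method: str, args: tuple = ()):
        # Fan a method call out to every worker and gather results.
        return [getattr(w, method)(*args) for w in self.workers]

class ToyUniProcExecutor(ExecutorBase):
    def _init_executor(self):
        self.workers = [ToyWorker()]
        self.collective_rpc("init_device")
        self.collective_rpc("load_model")

class ToyWorker:
    def __init__(self):
        self.calls = []
    def init_device(self):
        self.calls.append("init_device")
    def load_model(self):
        self.calls.append("load_model")

ex = ToyUniProcExecutor(vllm_config=None)
assert ex.workers[0].calls == ["init_device", "load_model"]
```

With a single process there is exactly one worker, so collective_rpc degenerates to a direct method call; the same interface lets multi-GPU executors broadcast to many workers.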
Model Loading Process
The load_model function creates the model on the target device, gathers the set of parameters that need weights, loads the weights from disk, validates them, performs optional post‑processing (e.g., quantization), and returns the model in evaluation mode.
def load_model(self, vllm_config: VllmConfig) -> nn.Module:
    device_config = vllm_config.device_config
    model_config = vllm_config.model_config
    target_device = torch.device(device_config.device)
    with set_default_torch_dtype(model_config.dtype):
        with target_device:
            model = _initialize_model(vllm_config=vllm_config)
        weights_to_load = {name for name, _ in model.named_parameters()}
        loaded_weights = model.load_weights(self.get_all_weights(model_config, model))
        # ... validation and post‑processing ...
    return model.eval()

Multi‑LoRA Loading
Two entry points exist: static loading at service start and dynamic loading via the /v1/load_lora_adapter API when the environment variable VLLM_ALLOW_RUNTIME_LORA_UPDATING is enabled.
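A sketch of the dynamic entry point. The endpoint and the VLLM_ALLOW_RUNTIME_LORA_UPDATING gate are described above; the payload field names used here are assumptions based on common usage, and the adapter name/path are illustrative:

```python
import json

def build_load_adapter_request(name: str, path: str):
    """Assemble the URL and JSON body for a runtime adapter load."""
    url = "http://localhost:8000/v1/load_lora_adapter"
    payload = {"lora_name": name, "lora_path": path}
    return url, json.dumps(payload)

url, body = build_load_adapter_request("sql-lora", "/models/loras/sql-lora")
# POST `body` to `url` with requests/httpx once the server is running
# (and VLLM_ALLOW_RUNTIME_LORA_UPDATING=True is set in its environment).
```

Both entry points converge on add_adapter below, so statically and dynamically loaded adapters behave identically afterwards.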
def add_adapter(self, lora_request: LoRARequest) -> bool:
    if lora_request.lora_int_id not in self.list_adapters():
        lora = self._load_adapter(lora_request)
        if len(self._adapter_manager) + 1 > self._adapter_manager.capacity:
            self._adapter_manager.remove_oldest_adapter()
        loaded = self._adapter_manager.add_adapter(lora)
    else:
        loaded = self._adapter_manager.get_adapter(lora_request.lora_int_id) is not None
    self._adapter_manager.activate_adapter(lora_request.lora_int_id)
    return loaded

The _load_adapter method validates the adapter, loads its weights onto the CPU, and constructs a LoRAModel instance.
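The capacity check in add_adapter above can be modeled as a small LRU-style cache. Class and method names here are toy stand-ins for the adapter manager:

```python
from collections import OrderedDict

class ToyAdapterManager:
    """Toy model of the eviction logic: when the registry is full,
    the oldest adapter is evicted before a new one is admitted."""

    def __init__(self, capacity: int):
        self.capacity = capacity
        self._adapters = OrderedDict()   # lora_int_id -> adapter weights

    def add_adapter(self, lora_id: int, weights) -> bool:
        if lora_id in self._adapters:
            self._adapters.move_to_end(lora_id)   # refresh recency
            return False
        if len(self._adapters) + 1 > self.capacity:
            self._adapters.popitem(last=False)    # evict oldest entry
        self._adapters[lora_id] = weights
        return True

mgr = ToyAdapterManager(capacity=2)
assert mgr.add_adapter(1, "w1")
assert mgr.add_adapter(2, "w2")
assert mgr.add_adapter(3, "w3")          # forces eviction of adapter 1
assert list(mgr._adapters) == [2, 3]
```

Because evicted adapters live on disk and are cheap to reload, a bounded CPU-side registry keeps memory flat even with a long tail of rarely used adapters.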
def _load_adapter(self, lora_request: LoRARequest) -> LoRAModel:
    lora_path = get_adapter_absolute_path(lora_request.lora_path)
    peft_helper = PEFTHelper.from_local_dir(lora_path, self.max_position_embeddings)
    peft_helper.validate_legal(self.lora_config)
    lora = self._lora_model_cls.from_local_checkpoint(
        lora_path,
        expected_lora_modules,
        peft_helper=peft_helper,
        lora_model_id=lora_request.lora_int_id,
        device="cpu",
        dtype=self.lora_config.lora_dtype,
        target_embedding_padding=self.vocab_size + self.lora_config.lora_extra_vocab_size,
        embedding_modules=self.embedding_modules,
        embedding_padding_modules=self.embedding_padding_modules,
        weights_mapper=hf_to_vllm_mapper,
    )
    return lora

After registration, _create_merged_loras_inplace packs LoRA weights of modules that were combined (e.g., multiple linear layers) into a single PackedLoRALayerWeights object and replaces the original entries.
def _create_merged_loras_inplace(self, lora_model: LoRAModel) -> None:
    for module_name, new_module_names in self.packed_modules.items():
        replacement_loras = []
        replaced_module = set()
        for r in new_module_names:
            lora = self._get_lora_layer_weights(lora_model, r)
            replacement_loras.append(lora)
            if lora:
                replaced_module.add(r)
        if not any(replacement_loras):
            continue
        lora_model.loras[module_name] = PackedLoRALayerWeights.pack(replacement_loras)
        for module in replaced_module:
            lora_model.loras.pop(module, None)

Internal Communication
All requests are sent through a ZeroMQ socket (zmq_socket), with EngineCoreRequestType differentiating the message types. LoRA loading uses the UTILITY request type, which is handled by _handle_client_request in the core process.
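On the wire, such a message can be pictured as a type header frame plus a serialized payload frame. A dependency-free sketch; the byte values and the "add_lora" method name are illustrative, and vLLM actually serializes with msgspec/msgpack rather than JSON:

```python
import enum
import json

class EngineCoreRequestType(enum.Enum):
    ADD = b"\x00"
    ABORT = b"\x01"
    UTILITY = b"\x02"

def frame_utility_request(call_id: int, method: str, args: list) -> list:
    """Build a two-frame message like the one sent over zmq_socket:
    a one-byte type header plus a serialized (call_id, method, args)
    payload that the core unpacks in _handle_client_request."""
    return [EngineCoreRequestType.UTILITY.value,
            json.dumps([call_id, method, args]).encode()]

frames = frame_utility_request(7, "add_lora", [{"lora_int_id": 3}])
header, payload = frames
assert header == EngineCoreRequestType.UTILITY.value
assert json.loads(payload) == [7, "add_lora", [{"lora_int_id": 3}]]
```

The call_id lets the client match the eventual UtilityOutput in the output stream back to the originating RPC.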
def _handle_client_request(self, request_type: EngineCoreRequestType, request: Any) -> None:
    if request_type == EngineCoreRequestType.ADD:
        self.add_request(request)
    elif request_type == EngineCoreRequestType.ABORT:
        self.abort_requests(request)
    elif request_type == EngineCoreRequestType.UTILITY:
        call_id, method_name, args = request
        output = UtilityOutput(call_id)
        try:
            method = getattr(self, method_name)
            output.result = method(*self._convert_msgspec_args(method, args))
        except BaseException as e:
            output.failure_message = f"Call to {method_name} failed: {str(e)}"
        self.output_queue.put_nowait(EngineCoreOutputs(utility_output=output))
    else:
        logger.error("Unrecognized input request type: %s", request_type)

Inference Request Flow
When a client calls /v1/completions, the request is transformed into an EngineCoreRequestType.ADD message, placed into the input_queue, and later scheduled by the Scheduler. The scheduler decides which requests run, which are pre‑empted, and how many new tokens each request can generate based on KV‑cache availability.
def schedule(self) -> SchedulerOutput:
    # Process running requests
    while req_index < len(self.running) and token_budget > 0:
        request = self.running[req_index]
        num_new_tokens = min(request.num_tokens_with_spec - request.num_computed_tokens, token_budget)
        new_blocks = self.kv_cache_manager.allocate_slots(request, num_new_tokens)
        if new_blocks is None:
            preempted_req = self.running.pop()
            self.kv_cache_manager.free(preempted_req)
            self.waiting.appendleft(preempted_req)
            continue
        # Record scheduling info ...
    # Process waiting requests with similar checks (LoRA limits, token budget, cache)
    # Build SchedulerOutput containing new, resumed, and running requests
    return scheduler_output

The scheduler enforces constraints such as the maximum number of concurrent LoRA adapters (max_loras), the token budget, and the KV-cache size, moving unschedulable requests back to the waiting queue.
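The budget-and-preemption loop above can be condensed into a runnable toy. This is a sketch, not vLLM's scheduler: requests are (id, tokens_needed) pairs, and the bookkeeping that returns a preempted request's KV blocks to the pool is omitted for brevity:

```python
from collections import deque

def schedule_step(running: deque, waiting: deque, token_budget: int,
                  free_kv_blocks: int, blocks_per_token: int = 1) -> dict:
    """Give each running request as many new tokens as the budget and
    free KV blocks allow; preempt the most recent running request
    when blocks run out (the real scheduler also frees its blocks)."""
    scheduled = {}
    i = 0
    while i < len(running) and token_budget > 0:
        req_id, tokens_needed = running[i]
        n = min(tokens_needed, token_budget)
        if n * blocks_per_token > free_kv_blocks:
            preempted = running.pop()        # lowest-priority request
            waiting.appendleft(preempted)    # retried next step
            continue
        free_kv_blocks -= n * blocks_per_token
        token_budget -= n
        scheduled[req_id] = n
        i += 1
    return scheduled

running = deque([("a", 4), ("b", 4)])
waiting = deque()
out = schedule_step(running, waiting, token_budget=6, free_kv_blocks=5)
assert out == {"a": 4}                 # "b" ran out of blocks
assert list(waiting) == [("b", 4)]     # and was preempted to waiting
```

Preempting from the tail mirrors the real scheduler's policy of sacrificing the lowest-priority running request first.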
Execution of a Scheduled Request
After scheduling, the core calls execute_model on the GPU worker. The worker’s GPUModelRunner prepares inputs, activates the required LoRA adapters, and runs the forward pass.
def execute_model(self, scheduler_output: SchedulerOutput):
    attn_metadata, logits_indices, spec_decode_metadata = self._prepare_inputs(scheduler_output)
    with set_forward_context(attn_metadata, self.vllm_config):
        output = self.model(
            input_ids=input_ids,
            positions=positions,
            intermediate_tensors=intermediate_tensors,
            inputs_embeds=inputs_embeds,
        )
    return output

During the forward pass, each LoRA-enabled linear layer calls apply, which forwards the activation to the Punica kernel via punica_wrapper.add_lora_linear. The wrapper holds the metadata (adapter indices, weight pointers, scaling factors) required by the custom kernel.
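Mathematically, the fused Punica kernel accumulates a per-token shrink-then-expand product into the base layer's output. A numpy reference of that math (the kernel itself batches this on the GPU; shapes and the adapter_ids convention here are illustrative):

```python
import numpy as np

def add_lora_linear(output, x, lora_a, lora_b, adapter_ids, scale=1.0):
    """Reference math for the fused kernel: for each token, project
    through its adapter's A matrix (shrink), then B (expand), and
    add the scaled result into the base output in place."""
    for i, aid in enumerate(adapter_ids):
        if aid < 0:            # token served by the bare base model
            continue
        output[i] += scale * (lora_b[aid] @ (lora_a[aid] @ x[i]))

rng = np.random.default_rng(0)
x = rng.standard_normal((3, 8))            # 3 tokens, hidden size 8
out = np.zeros((3, 4))                     # base-layer output (zeroed here)
lora_a = rng.standard_normal((2, 2, 8))    # 2 adapters, rank 2, shrink
lora_b = rng.standard_normal((2, 4, 2))    # 2 adapters, expand
add_lora_linear(out, x, lora_a, lora_b, adapter_ids=[0, 1, -1])

assert np.allclose(out[2], 0)              # token 2 used no adapter
assert np.allclose(out[0], lora_b[0] @ (lora_a[0] @ x[0]))
```

Stacking all adapters' A and B matrices and indexing by a per-token adapter id is what lets one kernel launch serve a batch that mixes many LoRAs.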
def apply(self, x: torch.Tensor, bias: Optional[torch.Tensor] = None) -> torch.Tensor:
    output = self.base_layer.quant_method.apply(self.base_layer, x, bias)
    if x.ndim == 3 and output.ndim == 3:
        output = output.flatten(0, 1)
        x = x.flatten(0, 1)
    self.punica_wrapper.add_lora_linear(
        output, x, self.lora_a_stacked, self.lora_b_stacked,
        self.lora_bias_stacked, 1.0, self.output_slices)
    return output

Model Preparation for LoRA
When the model is first loaded, _create_lora_modules walks through all sub‑modules, replaces supported layers (e.g., Linear, ColumnParallelLinear, RowParallelLinear) with LoRA‑aware counterparts, registers them, and attaches a shared punica_wrapper for later metadata updates.
def _create_lora_modules(self):
    for module_name, module in self.model.named_modules(remove_duplicate=False):
        if not self._match_target_modules(module_name):
            continue
        new_module = replace_submodule(
            self.model, module_name,
            from_layer(module, self.lora_slots, self.lora_config,
                       self.packed_modules_mapping.get(module_name.split('.')[-1], []),
                       self.model.config))
        new_module.set_mapping(self.punica_wrapper)
        self.register_module(module_name, new_module)

The helper replace_submodule simply swaps the attribute on the parent module.
def replace_submodule(model: nn.Module, module_name: str, new_module: nn.Module) -> nn.Module:
    parent = model.get_submodule('.'.join(module_name.split('.')[:-1]))
    setattr(parent, module_name.split('.')[-1], new_module)
    return new_module

Conclusion
The article walks through vLLM 0.8.4’s startup, model loading, Multi‑LoRA deployment, internal messaging, request scheduling, and inference execution, highlighting the core functions and data flows that enable multiple LoRA adapters to share a single base model on one GPU. The detailed code excerpts and diagrams provide a practical reference for engineers building LLM serving platforms with dynamic LoRA support.
58 Tech
Official tech channel of 58, a platform for tech innovation, sharing, and communication.