How to Deploy and Optimize Enterprise‑Scale LLM Inference Services: A Practical Guide
This guide walks you through deploying large language models such as ChatGLM and Llama in production, covering environment setup, model quantization, dynamic batching, service configuration, Nginx load balancing, monitoring, troubleshooting, and best‑practice recommendations for high‑performance, cost‑effective AI inference.
Overview
The team needed to deploy large language models such as ChatGLM and Llama for enterprise AI assistants and document analysis. Initial production runs suffered from high latency, excessive GPU memory usage, and frequent time‑outs, prompting a two‑month effort to bring inference performance to acceptable levels.
Technical Characteristics
Resource‑intensive: a 13B model in FP16 requires ~26 GB of GPU memory for weights alone; inference adds roughly 30% more.
Dynamic load: prompt lengths vary from a few tokens to several thousand, so request latency ranges from milliseconds to tens of seconds.
Complex tuning space: quantization, KV‑cache management, batching strategies, CUDA kernel tweaks, and tensor parallelism each expose dozens of inter‑dependent parameters.
Applicable Scenarios
Enterprise AI assistant platform – hundreds to thousands of concurrent users, sub‑3 s first‑token latency.
Batch document processing – long inputs (4k‑8k tokens), throughput‑oriented.
API service provider – mixed request lengths, need to meet p99 latency while controlling per‑GPU QPS.
Preparation
System Check
# Check OS version
cat /etc/os-release
# GPU info
nvidia-smi
# CUDA version
nvcc --version
# Memory and storage checks
free -h
df -h /data
# Verify NVMe read speed (>=3 GB/s)
dd if=/data/testfile of=/dev/null bs=1M count=10240 iflag=direct  # direct I/O bypasses the page cache for a true disk read speed
Common pitfall: a PCIe slot running at x8 instead of x16 halves host–device bandwidth. Verify with nvidia-smi topo -m.
Install Dependencies
# System packages
sudo apt update && sudo apt upgrade -y
sudo apt install -y build-essential cmake git curl wget libssl-dev libffi-dev python3-dev python3-pip libaio-dev ninja-build
# NVIDIA driver and CUDA
sudo apt install -y nvidia-driver-535 nvidia-utils-535
wget https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2204/x86_64/cuda-keyring_1.1-1_all.deb
sudo dpkg -i cuda-keyring_1.1-1_all.deb
sudo apt update
sudo apt install -y cuda-toolkit-12-1
# Python environment
python3 -m venv /opt/llm-inference
source /opt/llm-inference/bin/activate
pip install torch==2.1.0 torchvision==0.16.0 --index-url https://download.pytorch.org/whl/cu121
pip install transformers==4.35.0 accelerate==0.24.0 sentencepiece protobuf
pip install flash-attn==2.3.3 --no-build-isolation # compilation can be slow
pip install bitsandbytes==0.41.1 # model quantization
Flash Attention is essential for long‑sequence speed‑ups; if compilation fails, the service can run without it and it can be added later.
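Since the service should start even when flash-attn fails to compile, a small availability probe can decide which attention backend to request at load time. This is a sketch; the `attn_implementation` argument it selects between is available in recent transformers releases.

```python
import importlib.util

def flash_attn_available() -> bool:
    """True if the optional flash-attn package is importable."""
    return importlib.util.find_spec("flash_attn") is not None

# Fall back to the default eager attention when flash-attn is absent,
# so a failed compilation never blocks service startup.
attn_impl = "flash_attention_2" if flash_attn_available() else "eager"
print(f"attention implementation: {attn_impl}")
```

The chosen string can then be passed as `attn_implementation=attn_impl` when loading the model.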
Core Configuration
Model Download & Quantization
# Create model directory
sudo mkdir -p /data/models && sudo chown $USER:$USER /data/models
# Clone ChatGLM3‑6B (example)
cd /data/models
git lfs install
git clone https://huggingface.co/THUDM/chatglm3-6b
# Optional: use a mirror
# export HF_ENDPOINT=https://hf-mirror.com
# huggingface-cli download THUDM/chatglm3-6b --local-dir /data/models/chatglm3-6b
Model files are tens of gigabytes; use aria2c for resumable downloads or set up an internal cache server.
# quantize_model.py
import torch
from transformers import AutoTokenizer, AutoModel, BitsAndBytesConfig

model_path = "/data/models/chatglm3-6b"
quantized_path = "/data/models/chatglm3-6b-int4"

tokenizer = AutoTokenizer.from_pretrained(model_path, trust_remote_code=True)
model = AutoModel.from_pretrained(
    model_path,
    trust_remote_code=True,
    device_map="auto",
    quantization_config=BitsAndBytesConfig(
        load_in_4bit=True,
        bnb_4bit_compute_dtype=torch.float16,
    ),
)
# Saving 4-bit weights requires a recent transformers/bitsandbytes combination;
# on older versions, skip save_pretrained and pass the quantization config at load time instead.
model.save_pretrained(quantized_path)
tokenizer.save_pretrained(quantized_path)
print(f"Quantized model saved to {quantized_path}")
INT4 quantization reduces memory from ~12 GB (FP16) to ~3 GB with negligible speed loss.
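The memory numbers above can be sanity-checked with a back-of-envelope estimate: weight memory is parameter count times bytes per parameter, plus roughly the 30% inference overhead (activations and KV cache) mentioned earlier. A rough helper, with illustrative numbers only:

```python
def model_memory_gb(n_params_billion: float, bits_per_weight: int, overhead: float = 0.3) -> float:
    """Estimated GPU memory: weight bytes plus a fractional inference overhead."""
    weight_bytes = n_params_billion * 1e9 * bits_per_weight / 8
    return weight_bytes * (1 + overhead) / 1e9

print(round(model_memory_gb(6, 16), 1))  # FP16 6B model -> ~15.6 GB
print(round(model_memory_gb(6, 4), 1))   # INT4 6B model -> ~3.9 GB
```

The weights-only figures (12 GB and 3 GB) match the note above; the overhead term explains why a "12 GB model" will not fit on a 16 GB card under load.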
Inference Service Configuration
# config.yaml (excerpt)
server:
host: 0.0.0.0
port: 8000
workers: 1
timeout: 300
model:
path: /data/models/chatglm3-6b-int4
device: cuda:0
dtype: float16
trust_remote_code: true
max_memory: {0: "22GB"}
inference:
max_batch_size: 16
max_input_length: 8192
max_output_length: 2048
batch_timeout: 50 # ms
temperature: 0.8
top_p: 0.8
top_k: 50
repetition_penalty: 1.1
optimization:
use_flash_attention: true
use_kv_cache: true
compile: false
cache:
type: redis
host: localhost
port: 6379
ttl: 3600
logging:
level: INFO
file: /var/log/llm-inference/server.log
rotation: 100MB
retention: 7
Parameter notes:
max_batch_size: 8‑16 for 7B‑13B models on an A100; larger batches raise throughput but also memory use.
batch_timeout: 30‑50 ms for online services, 100‑200 ms for batch jobs.
use_flash_attention: always enable for long sequences (2‑3× speedup).
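The cache section of the config points at Redis; for local testing without a Redis instance, a minimal in-process stand-in with the same get/set-with-TTL semantics (keyed, e.g., by a prompt hash) looks like this sketch:

```python
import time

class TTLCache:
    """Tiny in-process stand-in for the Redis response cache in config.yaml."""
    def __init__(self, ttl_s: float = 3600.0):
        self.ttl_s = ttl_s
        self._store = {}

    def get(self, key):
        entry = self._store.get(key)
        if entry is None:
            return None
        value, expires_at = entry
        if time.monotonic() > expires_at:  # lazy expiry on read
            del self._store[key]
            return None
        return value

    def set(self, key, value):
        self._store[key] = (value, time.monotonic() + self.ttl_s)

cache = TTLCache(ttl_s=0.05)
cache.set("prompt-hash", "cached completion")
print(cache.get("prompt-hash"))  # cached completion
time.sleep(0.1)
print(cache.get("prompt-hash"))  # None (expired)
```

In production Redis is still preferable: it survives restarts and is shared across the load-balanced instances.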
Service Code (FastAPI)
# server.py (excerpt)
import asyncio, time, yaml, logging
from fastapi import FastAPI, HTTPException
from pydantic import BaseModel
import torch
from transformers import AutoTokenizer, AutoModel
logging.basicConfig(level=logging.INFO, format='%(asctime)s - %(levelname)s - %(message)s')
logger = logging.getLogger(__name__)
with open('/opt/llm-inference/config.yaml', 'r') as f:
config = yaml.safe_load(f)
app = FastAPI(title="LLM Inference Service")
class InferenceRequest(BaseModel):
prompt: str
max_length: int = 2048
temperature: float = 0.8
top_p: float = 0.8
top_k: int = 50
class InferenceResponse(BaseModel):
text: str
prompt_tokens: int
completion_tokens: int
total_tokens: int
inference_time: float
class BatchProcessor:
"""Dynamic batch processor"""
def __init__(self, model, tokenizer, cfg):
self.model = model
self.tokenizer = tokenizer
self.cfg = cfg
self.queue = []
self.processing = False
self.batch_timeout = cfg['inference']['batch_timeout'] / 1000
self.max_batch_size = cfg['inference']['max_batch_size']
async def add_request(self, prompt, params):
future = asyncio.Future()
self.queue.append((prompt, params, future, time.time()))
if not self.processing:
asyncio.create_task(self._process_batch())
return await future
async def _process_batch(self):
if self.processing:
return
self.processing = True
await asyncio.sleep(self.batch_timeout)
if not self.queue:
self.processing = False
return
batch = []
while self.queue and len(batch) < self.max_batch_size:
batch.append(self.queue.pop(0))
prompts = [p[0] for p in batch]
params_list = [p[1] for p in batch]
futures = [p[2] for p in batch]
start_times = [p[3] for p in batch]
logger.info(f"Processing batch of size {len(batch)}")
try:
inputs = self.tokenizer(
prompts,
return_tensors="pt",
padding=True,
truncation=True,
max_length=self.cfg['inference']['max_input_length'],
).to(self.model.device)
            with torch.no_grad():
                outputs = self.model.generate(
                    **inputs,
                    # Simplification: the whole batch uses the first request's sampling params
                    max_new_tokens=params_list[0]['max_length'],
                    temperature=params_list[0]['temperature'],
                    top_p=params_list[0]['top_p'],
                    top_k=params_list[0]['top_k'],
                    do_sample=True,
                    pad_token_id=self.tokenizer.eos_token_id,
                )
            input_len = inputs['input_ids'].shape[1]
            # Decode only the newly generated tokens, not the echoed prompt
            responses = self.tokenizer.batch_decode(outputs[:, input_len:], skip_special_tokens=True)
            for i, future in enumerate(futures):
                if not future.done():
                    prompt_tokens = int(inputs['attention_mask'][i].sum())  # exclude padding
                    result = {
                        'text': responses[i],
                        'prompt_tokens': prompt_tokens,
                        'completion_tokens': outputs[i].shape[0] - input_len,
                        'inference_time': time.time() - start_times[i],
                    }
                    future.set_result(result)
except Exception as e:
logger.error(f"Batch processing error: {e}", exc_info=True)
for future in futures:
if not future.done():
future.set_exception(e)
self.processing = False
if self.queue:
asyncio.create_task(self._process_batch())
# Load model and tokenizer
logger.info("Loading model...")
model_start = time.time()
tokenizer = AutoTokenizer.from_pretrained(config['model']['path'], trust_remote_code=config['model']['trust_remote_code'])
model = AutoModel.from_pretrained(
config['model']['path'],
trust_remote_code=config['model']['trust_remote_code'],
device_map=config['model']['device'],
torch_dtype=getattr(torch, config['model']['dtype']),
).eval()
logger.info(f"Model loaded in {time.time() - model_start:.2f}s")
batch_processor = BatchProcessor(model, tokenizer, config)
@app.post("/v1/completions", response_model=InferenceResponse)
async def create_completion(request: InferenceRequest):
try:
params = {
'max_length': request.max_length,
'temperature': request.temperature,
'top_p': request.top_p,
'top_k': request.top_k,
}
result = await batch_processor.add_request(request.prompt, params)
return InferenceResponse(
text=result['text'],
prompt_tokens=result['prompt_tokens'],
completion_tokens=result['completion_tokens'],
total_tokens=result['prompt_tokens'] + result['completion_tokens'],
inference_time=result['inference_time'],
)
except Exception as e:
logger.error(f"Inference error: {e}", exc_info=True)
raise HTTPException(status_code=500, detail=str(e))
@app.get("/health")
async def health_check():
return {
"status": "healthy",
"model": config['model']['path'],
"device": str(model.device),
"queue_size": len(batch_processor.queue),
}
if __name__ == "__main__":
import uvicorn
    uvicorn.run(app, host=config['server']['host'], port=config['server']['port'], workers=config['server']['workers'], log_config=None)
Launch and Validation
Start Service
# Create log directory
sudo mkdir -p /var/log/llm-inference
sudo chown $USER:$USER /var/log/llm-inference
# Systemd unit file (/etc/systemd/system/llm-inference.service)
[Unit]
Description=LLM Inference Service
After=network.target
[Service]
Type=simple
User=llm  # systemd does not expand $USER; put the actual service account name here
WorkingDirectory=/opt/llm-inference
Environment="PATH=/opt/llm-inference/bin:/usr/local/cuda-12.1/bin:/usr/bin"
Environment="LD_LIBRARY_PATH=/usr/local/cuda-12.1/lib64"
ExecStart=/opt/llm-inference/bin/python /opt/llm-inference/server.py
Restart=always
RestartSec=10
StandardOutput=append:/var/log/llm-inference/stdout.log
StandardError=append:/var/log/llm-inference/stderr.log
[Install]
WantedBy=multi-user.target
# Reload and start
sudo systemctl daemon-reload
sudo systemctl start llm-inference
sudo systemctl enable llm-inference
Functional Verification
# Health check
curl http://localhost:8000/health
# Expected output e.g. {"status":"healthy","model":"/data/models/chatglm3-6b-int4","device":"cuda:0","queue_size":0}
# Test inference
curl -X POST http://localhost:8000/v1/completions \
-H "Content-Type: application/json" \
  -d '{"prompt": "介绍一下人工智能的发展历史", "max_length": 512, "temperature": 0.7}'
Observe the response time and output quality.
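When load-testing the endpoint, record per-request latencies and summarize them with percentiles rather than averages, since a few long generations dominate the mean. A minimal nearest-rank percentile helper (the sample latencies below are made up):

```python
def percentile(latencies, p):
    """Nearest-rank p-th percentile of a list of latency samples (seconds)."""
    data = sorted(latencies)
    rank = max(1, min(len(data), round(p / 100 * len(data))))
    return data[rank - 1]

samples = [0.8, 1.1, 0.9, 2.5, 1.3, 0.7, 1.0, 4.2, 1.2, 0.95]
print(f"p50={percentile(samples, 50):.2f}s p95={percentile(samples, 95):.2f}s")
```

Feed it the wall-clock times measured around each curl or aiohttp call; the p95 value is what the SLO targets in this guide refer to.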
Example Nginx Reverse Proxy
# /etc/nginx/sites-available/llm-inference
upstream llm_backend {
ip_hash;
server 192.168.1.101:8000 max_fails=3 fail_timeout=30s;
server 192.168.1.102:8000 max_fails=3 fail_timeout=30s;
server 192.168.1.103:8000 max_fails=3 fail_timeout=30s;
keepalive 32;
}
server {
listen 80;
server_name llm-api.example.com;
    # NOTE: log_format llm_log and limit_req_zone must be declared in the http{} context, e.g.:
    #   log_format llm_log '$remote_addr [$time_local] "$request" $status $request_time';
    #   limit_req_zone $binary_remote_addr zone=llm_limit:10m rate=10r/s;
    access_log /var/log/nginx/llm-access.log llm_log;
    error_log /var/log/nginx/llm-error.log warn;
    limit_req zone=llm_limit burst=20 nodelay;
location / {
proxy_pass http://llm_backend;
proxy_http_version 1.1;
proxy_set_header Connection "";
proxy_set_header Host $host;
proxy_set_header X-Real-IP $remote_addr;
proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
proxy_connect_timeout 10s;
proxy_send_timeout 300s;
proxy_read_timeout 300s;
proxy_buffering off;
client_max_body_size 10m;
}
location /health {
proxy_pass http://llm_backend;
access_log off;
}
}
Real‑World Cases
Intelligent Customer Service
# qa_service.py (excerpt)
import aiohttp, asyncio
class KnowledgeBaseQA:
def __init__(self, llm_endpoint: str, kb_endpoint: str):
self.llm_endpoint = llm_endpoint
self.kb_endpoint = kb_endpoint
async def retrieve_context(self, question: str, top_k: int = 3):
async with aiohttp.ClientSession() as session:
async with session.post(f"{self.kb_endpoint}/search", json={"query": question, "top_k": top_k}) as resp:
results = await resp.json()
return [item['content'] for item in results['documents']]
async def generate_answer(self, question: str, contexts):
        context_text = "\n".join(f"参考资料{i+1}:\n{c}" for i, c in enumerate(contexts))
        prompt = (
            f"根据以下参考资料回答问题。如果没有相关信息,请回答\"抱歉,我无法根据现有资料回答这个问题\"。\n"
            f"{context_text}\n"
            f"问题:{question}\n"
            f"答案:"
        )
async with aiohttp.ClientSession() as session:
async with session.post(f"{self.llm_endpoint}/v1/completions", json={"prompt": prompt, "max_length": 512, "temperature": 0.3, "top_p": 0.85}, timeout=aiohttp.ClientTimeout(total=30)) as resp:
result = await resp.json()
return result['text']
async def answer_question(self, question: str):
start = asyncio.get_event_loop().time()
contexts = await self.retrieve_context(question)
retrieve_time = asyncio.get_event_loop().time() - start
answer = await self.generate_answer(question, contexts)
total_time = asyncio.get_event_loop().time() - start
return {"question": question, "answer": answer, "contexts": contexts, "retrieve_time": retrieve_time, "total_time": total_time}
# Example usage
async def main():
qa = KnowledgeBaseQA(llm_endpoint="http://localhost:8000", kb_endpoint="http://localhost:9000")
result = await qa.answer_question("如何重置密码?")
print(f"Q: {result['question']}")
print(f"A: {result['answer']}")
print(f"耗时: {result['total_time']:.2f}秒")
if __name__ == "__main__":
    asyncio.run(main())
Result example (≈1.2 s):
Q: 如何重置密码?
A: 根据参考资料,重置密码的步骤如下:1. 登录系统后点击右上角头像;2. 选择"账户设置";3. 点击"修改密码"按钮;4. 输入原密码和新密码;5. 点击"确认"保存。如果忘记原密码,可以点击登录页面的"忘记密码",通过邮箱验证码重置。
耗时: 1.23秒
Batch Document Summarization
# batch_summarize.py (excerpt)
import asyncio, aiohttp
async def summarize_document(session, doc_id: str, content: str):
    prompt = f"请为以下文档生成200字以内的摘要,突出核心要点:\n{content[:4000]}\n摘要:"
async with session.post("http://localhost:8000/v1/completions", json={"prompt": prompt, "max_length": 300, "temperature": 0.5}, timeout=aiohttp.ClientTimeout(total=60)) as resp:
result = await resp.json()
return {"doc_id": doc_id, "summary": result['text'], "tokens": result['total_tokens'], "time": result['inference_time']}
async def batch_summarize(documents, concurrency=32):
async with aiohttp.ClientSession() as session:
semaphore = asyncio.Semaphore(concurrency)
        async def process_one(doc):
            async with semaphore:
                try:
                    return await summarize_document(session, doc['id'], doc['content'])
                except Exception as e:
                    # Failed documents carry an 'error' key so the success filter below can skip them
                    return {"doc_id": doc['id'], "error": str(e)}
tasks = [process_one(doc) for doc in documents]
return await asyncio.gather(*tasks)
# Example run
docs = [{"id": f"doc_{i}", "content": f"这是第{i}份文档的内容..." * 100} for i in range(1000)]
results = asyncio.run(batch_summarize(docs, concurrency=32))
success = [r for r in results if 'error' not in r]
print(f"成功处理: {len(success)}/{len(docs)}")
print(f"平均耗时: {sum(r['time'] for r in success)/len(success):.2f}秒")
print(f"总tokens: {sum(r['tokens'] for r in success)}")
Best Practices
Performance Optimization
Model quantization first: INT4 cuts memory by >75% with <2% accuracy loss for ChatGLM3‑6B.
Dynamic batching tuning: for chat bots, max_batch_size=8 with batch_timeout=30ms yields ~1.5 s p95 latency at ~45 QPS; for document processing, max_batch_size=24 with batch_timeout=150ms roughly doubles throughput.
KV‑cache reuse: cache past key‑values across dialogue turns to avoid recomputing earlier context.
Security Hardening
Rate‑limit at Nginx and application level (e.g., 30 requests/min per IP).
Validate prompts against a blacklist to prevent prompt injection.
Mask sensitive data (phone numbers, IDs, emails) in logs.
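Log masking can be a small regex pass applied before any prompt or response reaches the log files. The patterns below are illustrative only (mainland-China mobile numbers, 18-character ID numbers, emails); extend the list to match the data your service actually sees:

```python
import re

# Illustrative patterns; order matters: the longer ID pattern must run
# before the shorter phone pattern so IDs are not partially masked as phones.
PATTERNS = [
    (re.compile(r"\d{17}[\dXx]"), "[ID]"),
    (re.compile(r"1[3-9]\d{9}"), "[PHONE]"),
    (re.compile(r"[\w.+-]+@[\w-]+\.[\w.]+"), "[EMAIL]"),
]

def mask_sensitive(text: str) -> str:
    """Replace sensitive substrings before a prompt or response is logged."""
    for pattern, repl in PATTERNS:
        text = pattern.sub(repl, text)
    return text

print(mask_sensitive("联系 13912345678 或 user@example.com"))  # 联系 [PHONE] 或 [EMAIL]
```

Call `mask_sensitive` in the logging path only; the model itself should still receive the raw prompt.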
High Availability
Deploy multiple instances behind Nginx with ip_hash to keep cache locality.
Implement graceful degradation: switch to a smaller model or cached response when GPU memory >90 %.
Maintain a hot‑standby model instance for instant failover.
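The degradation policy above is easiest to keep correct when isolated in one pure routing function that the request handler calls; the thresholds and backend names here are illustrative assumptions:

```python
def choose_backend(gpu_mem_used_pct: float, cache_hit: bool) -> str:
    """Degradation ladder: cached answer, then the primary model, then a smaller fallback."""
    if cache_hit:
        return "cache"           # cheapest: reuse a previous completion
    if gpu_mem_used_pct > 90:
        return "fallback-small"  # shed load when GPU memory is nearly exhausted
    return "primary"

print(choose_backend(95.0, False))  # fallback-small
```

Because the function takes plain values rather than reading GPU state itself, the degradation logic can be unit-tested without a GPU.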
Troubleshooting & Monitoring
Common Errors
CUDA out of memory : Reduce max_batch_size, limit max_input_length, or use more aggressive quantization.
Inference timeout : Increase service timeout, enable Flash Attention, lower max_output_length.
Model loading slow : Store model on local NVMe SSD and pre‑warm.
Batch processing stuck : Decrease batch_timeout or lower max_batch_size when request volume is low.
Low GPU utilisation : Increase max_batch_size or concurrent requests; check tokenization bottlenecks.
Poor text quality : Adjust quantization level or generation parameters ( temperature, top_p, repetition_penalty).
Monitoring
GPU utilisation, memory, temperature: nvidia-smi or nvidia‑gpu‑exporter for Prometheus.
Service health endpoint /health provides queue size and model info.
Expose Prometheus metrics in server.py (request count, latency histogram, queue gauge, GPU memory gauge).
Typical alert thresholds: GPU utilisation <30 % or >95 %; memory usage >90 %; temperature >85 °C; queue size >50; p95 latency >5 s.
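Those thresholds translate directly into a rule check that can run inside the exporter or a periodic job; the metric dictionary keys here are assumptions, not an existing API:

```python
def check_alerts(metrics: dict) -> list:
    """Return the names of alert rules whose thresholds are breached."""
    alerts = []
    if metrics["gpu_util"] < 30 or metrics["gpu_util"] > 95:
        alerts.append("gpu_utilisation")   # idle (wasted capacity) or saturated
    if metrics["mem_used_pct"] > 90:
        alerts.append("gpu_memory")
    if metrics["temp_c"] > 85:
        alerts.append("gpu_temperature")
    if metrics["queue_size"] > 50:
        alerts.append("queue_backlog")
    if metrics["p95_latency_s"] > 5:
        alerts.append("latency_p95")
    return alerts

print(check_alerts({"gpu_util": 97, "mem_used_pct": 70, "temp_c": 80,
                    "queue_size": 12, "p95_latency_s": 6.2}))
```

In practice the same conditions would live in Prometheus alerting rules; keeping a Python copy makes them easy to test against recorded metric snapshots.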
Backup & Recovery
# backup.sh (excerpt)
#!/bin/bash
BACKUP_DIR="/data/backups/llm-inference"
TIMESTAMP=$(date +%Y%m%d_%H%M%S)
BACKUP_PATH="$BACKUP_DIR/backup_$TIMESTAMP"
mkdir -p "$BACKUP_PATH"
# Config files
cp /opt/llm-inference/config.yaml "$BACKUP_PATH/"
cp /etc/systemd/system/llm-inference.service "$BACKUP_PATH/"
cp /etc/nginx/sites-available/llm-inference "$BACKUP_PATH/"
# Recent logs (last 7 days)
find /var/log/llm-inference -name "*.log" -mtime -7 -exec cp {} "$BACKUP_PATH/" \;
# Custom model (if any)
if [ -d "/data/models/custom-model" ]; then
tar -czf "$BACKUP_PATH/custom-model.tar.gz" -C /data/models custom-model
fi
# Compress backup
cd "$BACKUP_DIR"
tar -czf "backup_$TIMESTAMP.tar.gz" "backup_$TIMESTAMP"
rm -rf "backup_$TIMESTAMP"
# Keep last 30 days
find "$BACKUP_DIR" -name "backup_*.tar.gz" -mtime +30 -delete
Restore steps: stop services, extract the backup, restore config files, verify model files, reload systemd units, and start services.
Conclusion
Key Takeaways
Model quantization (INT4) provides the highest memory‑performance gain.
Dynamic batching is essential; tune max_batch_size and batch_timeout per workload.
Continuous monitoring of GPU metrics and request latency prevents silent degradation.
Performance tuning is a system‑wide effort: model choice, hardware, batch strategy, and configuration all interact.
Next‑Level Topics
Tensor‑parallel and pipeline‑parallel training for >70B models (DeepSpeed, Megatron‑LM).
Continuous batching and PagedAttention (vLLM) for further throughput gains.
Model pruning and knowledge‑distillation for ultra‑low latency scenarios.
References
Transformers official documentation.
NVIDIA TensorRT‑LLM.
Flash Attention paper.
vLLM technical blog.
Hugging Face community forum.