How to Deploy and Optimize Enterprise‑Scale LLM Inference Services: A Practical Guide

This guide walks you through deploying large language models such as ChatGLM and Llama in production, covering environment setup, model quantization, dynamic batching, service configuration, Nginx load balancing, monitoring, troubleshooting, and best‑practice recommendations for high‑performance, cost‑effective AI inference.


Overview

The team needed to deploy large language models such as ChatGLM and Llama for enterprise AI assistants and document analysis. Initial production runs suffered from high latency, excessive GPU memory usage, and frequent time‑outs, prompting a two‑month effort to bring inference performance to acceptable levels.

Technical Characteristics

Resource-intensive: a 13B model needs >26 GB of GPU memory just for its FP16 weights; inference overhead (KV cache, activations) can add roughly 30 % more.

Dynamic load: prompt lengths range from a few tokens to several thousand, so request latency varies from milliseconds to tens of seconds.

Complex tuning space: quantization, KV-cache management, batching strategies, CUDA kernel tweaks, and tensor parallelism each expose dozens of interdependent parameters.

Applicable Scenarios

Enterprise AI assistant platform – hundreds to thousands of concurrent users, sub‑3 s first‑token latency.

Batch document processing – long inputs (4k‑8k tokens), throughput‑oriented.

API service provider – mixed request lengths, need to meet p99 latency while controlling per‑GPU QPS.

Preparation

System Check

# Check OS version
cat /etc/os-release
# GPU info
nvidia-smi
# CUDA version
nvcc --version
# Memory and storage checks
free -h
df -h /data
# Verify NVMe sequential read speed (target >= 3 GB/s); iflag=direct bypasses the page cache
# (assumes a large test file such as /data/testfile already exists)
dd if=/data/testfile of=/dev/null bs=1M count=10240 iflag=direct

Common pitfall: PCIe slot running at x8 instead of x16 halves bandwidth. Verify with nvidia-smi topo -m.
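
For a scripted check of the negotiated link width, something like the following small sketch (using nvidia-smi's query interface; the file name pcie_check.py is arbitrary) can be run on each node:

# pcie_check.py - print the PCIe generation and link width each GPU actually negotiated
import subprocess

output = subprocess.run(
    ["nvidia-smi",
     "--query-gpu=index,name,pcie.link.gen.current,pcie.link.width.current",
     "--format=csv,noheader"],
    capture_output=True, text=True, check=True,
).stdout
print(output)  # a current width of 8 on an x16 card usually points to a slot or riser problem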

Install Dependencies

# System packages
sudo apt update && sudo apt upgrade -y
sudo apt install -y build-essential cmake git curl wget libssl-dev libffi-dev python3-dev python3-pip libaio-dev ninja-build
# NVIDIA driver and CUDA
sudo apt install -y nvidia-driver-535 nvidia-utils-535
wget https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2204/x86_64/cuda-keyring_1.1-1_all.deb
sudo dpkg -i cuda-keyring_1.1-1_all.deb
sudo apt update
sudo apt install -y cuda-toolkit-12-1
# Python environment
python3 -m venv /opt/llm-inference
source /opt/llm-inference/bin/activate
pip install torch==2.1.0 torchvision==0.16.0 --index-url https://download.pytorch.org/whl/cu121
pip install transformers==4.35.0 accelerate==0.24.0 sentencepiece protobuf
pip install flash-attn==2.3.3 --no-build-isolation  # compilation can be slow
pip install bitsandbytes==0.41.1  # model quantization

Flash Attention is important for long-sequence speed-ups; if compilation fails, the service can still run without it and the package can be added later.
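
A quick way to confirm that the compiled wheel actually loads against the local CUDA/PyTorch build is a minimal import check such as this sketch (the file name check_flash_attn.py is arbitrary):

# check_flash_attn.py - verify that flash-attn imports cleanly in the service virtualenv
try:
    import flash_attn
    print(f"flash-attn {flash_attn.__version__} is available")
except ImportError as exc:
    print(f"flash-attn not available, the service will fall back to standard attention: {exc}")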

Core Configuration

Model Download & Quantization

# Create model directory
sudo mkdir -p /data/models && sudo chown $USER:$USER /data/models
# Clone ChatGLM3‑6B (example)
cd /data/models
git lfs install
git clone https://huggingface.co/THUDM/chatglm3-6b
# Optional: use a mirror
# export HF_ENDPOINT=https://hf-mirror.com
# huggingface-cli download THUDM/chatglm3-6b --local-dir /data/models/chatglm3-6b

Model files are tens of gigabytes; use aria2c for resumable downloads or set up an internal cache server.
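
If git-lfs proves unreliable, a sketch using huggingface_hub (installed as a transformers dependency) downloads the snapshot with parallel transfers and skips files that are already complete when re-run:

# download_model.py - model download via huggingface_hub (sketch)
from huggingface_hub import snapshot_download

snapshot_download(
    repo_id="THUDM/chatglm3-6b",
    local_dir="/data/models/chatglm3-6b",
    max_workers=8,  # parallel file transfers; re-running skips files that already exist locally
)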

# quantize_model.py
import torch
from transformers import AutoTokenizer, AutoModel, BitsAndBytesConfig

model_path = "/data/models/chatglm3-6b"
quantized_path = "/data/models/chatglm3-6b-int4"

# NF4 4-bit quantization via bitsandbytes; compute happens in FP16
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.float16,
)
tokenizer = AutoTokenizer.from_pretrained(model_path, trust_remote_code=True)
model = AutoModel.from_pretrained(
    model_path,
    trust_remote_code=True,
    device_map="auto",
    quantization_config=bnb_config,
)
# Note: saving 4-bit weights requires a recent transformers/bitsandbytes; if save_pretrained
# refuses, upgrade the two packages or simply quantize at load time inside the service.
model.save_pretrained(quantized_path)
tokenizer.save_pretrained(quantized_path)
print(f"Quantized model saved to {quantized_path}")

INT4 quantization reduces weight memory from ~12 GB (FP16) to ~3 GB with negligible speed loss.
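
To confirm the saving on your own hardware, a short check like the sketch below (get_memory_footprint is a standard transformers helper; the file name memory_check.py is arbitrary) prints the weight footprint after 4-bit loading:

# memory_check.py - print the footprint of the model when loaded in 4-bit (sketch)
import torch
from transformers import AutoModel, BitsAndBytesConfig

model = AutoModel.from_pretrained(
    "/data/models/chatglm3-6b",
    trust_remote_code=True,
    device_map="auto",
    quantization_config=BitsAndBytesConfig(load_in_4bit=True),
)
print(f"Weights: {model.get_memory_footprint() / 1024**3:.1f} GB")
print(f"CUDA allocated: {torch.cuda.memory_allocated() / 1024**3:.1f} GB")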

Inference Service Configuration

# config.yaml (excerpt)
server:
  host: 0.0.0.0
  port: 8000
  workers: 1
  timeout: 300
model:
  path: /data/models/chatglm3-6b-int4
  device: cuda:0
  dtype: float16
  trust_remote_code: true
  max_memory: {0: "22GB"}
inference:
  max_batch_size: 16
  max_input_length: 8192
  max_output_length: 2048
  batch_timeout: 50  # ms
  temperature: 0.8
  top_p: 0.8
  top_k: 50
  repetition_penalty: 1.1
optimization:
  use_flash_attention: true
  use_kv_cache: true
  compile: false
cache:
  type: redis
  host: localhost
  port: 6379
  ttl: 3600
logging:
  level: INFO
  file: /var/log/llm-inference/server.log
  rotation: 100MB
  retention: 7

Parameter notes:

max_batch_size: 8-16 for 7B-13B models on an A100; larger batches raise throughput but also memory use.

batch_timeout: 30-50 ms for online services, 100-200 ms for batch jobs.

use_flash_attention: always enable for long sequences (2-3× speedup).

Service Code (FastAPI)

# server.py (excerpt)
import asyncio, time, yaml, logging
from fastapi import FastAPI, HTTPException
from pydantic import BaseModel
import torch
from transformers import AutoTokenizer, AutoModel

logging.basicConfig(level=logging.INFO, format='%(asctime)s - %(levelname)s - %(message)s')
logger = logging.getLogger(__name__)

with open('/opt/llm-inference/config.yaml', 'r') as f:
    config = yaml.safe_load(f)

app = FastAPI(title="LLM Inference Service")

class InferenceRequest(BaseModel):
    prompt: str
    max_length: int = 2048
    temperature: float = 0.8
    top_p: float = 0.8
    top_k: int = 50

class InferenceResponse(BaseModel):
    text: str
    prompt_tokens: int
    completion_tokens: int
    total_tokens: int
    inference_time: float

class BatchProcessor:
    """Dynamic batch processor"""
    def __init__(self, model, tokenizer, cfg):
        self.model = model
        self.tokenizer = tokenizer
        self.cfg = cfg
        self.queue = []
        self.processing = False
        self.batch_timeout = cfg['inference']['batch_timeout'] / 1000
        self.max_batch_size = cfg['inference']['max_batch_size']

    async def add_request(self, prompt, params):
        future = asyncio.Future()
        self.queue.append((prompt, params, future, time.time()))
        if not self.processing:
            asyncio.create_task(self._process_batch())
        return await future

    async def _process_batch(self):
        if self.processing:
            return
        self.processing = True
        await asyncio.sleep(self.batch_timeout)
        if not self.queue:
            self.processing = False
            return
        batch = []
        while self.queue and len(batch) < self.max_batch_size:
            batch.append(self.queue.pop(0))
        prompts = [p[0] for p in batch]
        params_list = [p[1] for p in batch]
        futures = [p[2] for p in batch]
        start_times = [p[3] for p in batch]
        logger.info(f"Processing batch of size {len(batch)}")
        try:
            inputs = self.tokenizer(
                prompts,
                return_tensors="pt",
                padding=True,
                truncation=True,
                max_length=self.cfg['inference']['max_input_length'],
            ).to(self.model.device)
            # Run the blocking generate() call in a worker thread so it does not stall the
            # event loop. For simplicity this excerpt applies the first request's sampling
            # parameters to the whole batch.
            def _generate():
                with torch.no_grad():
                    return self.model.generate(
                        **inputs,
                        max_new_tokens=params_list[0]['max_length'],
                        temperature=params_list[0]['temperature'],
                        top_p=params_list[0]['top_p'],
                        top_k=params_list[0]['top_k'],
                        do_sample=True,
                        pad_token_id=self.tokenizer.eos_token_id,
                    )
            outputs = await asyncio.get_running_loop().run_in_executor(None, _generate)
            # generate() returns prompt + completion; decode only the newly generated tokens
            input_len = inputs['input_ids'].shape[1]
            generated = outputs[:, input_len:]
            responses = self.tokenizer.batch_decode(generated, skip_special_tokens=True)
            for i, future in enumerate(futures):
                if not future.done():
                    result = {
                        'text': responses[i],
                        # count real prompt tokens via the attention mask (padding excluded)
                        'prompt_tokens': int(inputs['attention_mask'][i].sum().item()),
                        'completion_tokens': int(generated[i].shape[0]),
                        'inference_time': time.time() - start_times[i],
                    }
                    future.set_result(result)
        except Exception as e:
            logger.error(f"Batch processing error: {e}", exc_info=True)
            for future in futures:
                if not future.done():
                    future.set_exception(e)
        self.processing = False
        if self.queue:
            asyncio.create_task(self._process_batch())

# Load model and tokenizer
logger.info("Loading model...")
model_start = time.time()
tokenizer = AutoTokenizer.from_pretrained(config['model']['path'], trust_remote_code=config['model']['trust_remote_code'])
model = AutoModel.from_pretrained(
    config['model']['path'],
    trust_remote_code=config['model']['trust_remote_code'],
    device_map=config['model']['device'],
    torch_dtype=getattr(torch, config['model']['dtype']),
).eval()
logger.info(f"Model loaded in {time.time() - model_start:.2f}s")

batch_processor = BatchProcessor(model, tokenizer, config)

@app.post("/v1/completions", response_model=InferenceResponse)
async def create_completion(request: InferenceRequest):
    try:
        params = {
            'max_length': request.max_length,
            'temperature': request.temperature,
            'top_p': request.top_p,
            'top_k': request.top_k,
        }
        result = await batch_processor.add_request(request.prompt, params)
        return InferenceResponse(
            text=result['text'],
            prompt_tokens=result['prompt_tokens'],
            completion_tokens=result['completion_tokens'],
            total_tokens=result['prompt_tokens'] + result['completion_tokens'],
            inference_time=result['inference_time'],
        )
    except Exception as e:
        logger.error(f"Inference error: {e}", exc_info=True)
        raise HTTPException(status_code=500, detail=str(e))

@app.get("/health")
async def health_check():
    return {
        "status": "healthy",
        "model": config['model']['path'],
        "device": str(model.device),
        "queue_size": len(batch_processor.queue),
    }

if __name__ == "__main__":
    import uvicorn
    uvicorn.run(app, host=config['server']['host'], port=config['server']['port'], workers=config['server']['workers'], log_config=None)  # note: workers > 1 requires passing the app as an import string ("server:app")

Launch and Validation

Start Service

# Create log directory
sudo mkdir -p /var/log/llm-inference
sudo chown $USER:$USER /var/log/llm-inference
# Systemd unit file (/etc/systemd/system/llm-inference.service)
[Unit]
Description=LLM Inference Service
After=network.target

[Service]
Type=simple
# systemd does not expand $USER; replace 'llm' below with the actual service account
User=llm
WorkingDirectory=/opt/llm-inference
Environment="PATH=/opt/llm-inference/bin:/usr/local/cuda-12.1/bin:/usr/bin"
Environment="LD_LIBRARY_PATH=/usr/local/cuda-12.1/lib64"
ExecStart=/opt/llm-inference/bin/python /opt/llm-inference/server.py
Restart=always
RestartSec=10
StandardOutput=append:/var/log/llm-inference/stdout.log
StandardError=append:/var/log/llm-inference/stderr.log

[Install]
WantedBy=multi-user.target

# Reload and start
sudo systemctl daemon-reload
sudo systemctl start llm-inference
sudo systemctl enable llm-inference

Functional Verification

# Health check
curl http://localhost:8000/health
# Expected output e.g. {"status":"healthy","model":"/data/models/chatglm3-6b-int4","device":"cuda:0","queue_size":0}

# Test inference
curl -X POST http://localhost:8000/v1/completions \
  -H "Content-Type: application/json" \
  -d '{"prompt": "介绍一下人工智能的发展历史", "max_length": 512, "temperature": 0.7}'

Observe response time and output quality.
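
Beyond single requests, a small concurrency smoke test helps confirm that dynamic batching kicks in. The sketch below (the file name bench.py is arbitrary; it targets the /v1/completions endpoint defined earlier) fires a handful of parallel requests and reports rough latency figures:

# bench.py - minimal concurrency smoke test against the inference service (sketch)
import asyncio
import time

import aiohttp

async def one_request(session, i):
    payload = {"prompt": f"用一句话介绍Linux({i})", "max_length": 64, "temperature": 0.7}
    start = time.perf_counter()
    async with session.post("http://localhost:8000/v1/completions", json=payload) as resp:
        await resp.json()
    return time.perf_counter() - start

async def main(concurrency: int = 8):
    async with aiohttp.ClientSession() as session:
        latencies = sorted(await asyncio.gather(*[one_request(session, i) for i in range(concurrency)]))
    print(f"requests: {len(latencies)}  p50: {latencies[len(latencies) // 2]:.2f}s  max: {latencies[-1]:.2f}s")

if __name__ == "__main__":
    asyncio.run(main())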

Example Nginx Reverse Proxy

# /etc/nginx/sites-available/llm-inference
# limit_req_zone must sit outside the server block (this file is included in the http context)
limit_req_zone $binary_remote_addr zone=llm_limit:10m rate=10r/s;

upstream llm_backend {
    ip_hash;
    server 192.168.1.101:8000 max_fails=3 fail_timeout=30s;
    server 192.168.1.102:8000 max_fails=3 fail_timeout=30s;
    server 192.168.1.103:8000 max_fails=3 fail_timeout=30s;
    keepalive 32;
}
server {
    listen 80;
    server_name llm-api.example.com;
    # use the default log format, or define a custom "llm_log" format in the http context first
    access_log /var/log/nginx/llm-access.log;
    error_log /var/log/nginx/llm-error.log warn;
    limit_req zone=llm_limit burst=20 nodelay;
    location / {
        proxy_pass http://llm_backend;
        proxy_http_version 1.1;
        proxy_set_header Connection "";
        proxy_set_header Host $host;
        proxy_set_header X-Real-IP $remote_addr;
        proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
        proxy_connect_timeout 10s;
        proxy_send_timeout 300s;
        proxy_read_timeout 300s;
        proxy_buffering off;
        client_max_body_size 10m;
    }
    location /health {
        proxy_pass http://llm_backend;
        access_log off;
    }
}

Real‑World Cases

Intelligent Customer Service

# qa_service.py (excerpt)
import aiohttp, asyncio
class KnowledgeBaseQA:
    def __init__(self, llm_endpoint: str, kb_endpoint: str):
        self.llm_endpoint = llm_endpoint
        self.kb_endpoint = kb_endpoint
    async def retrieve_context(self, question: str, top_k: int = 3):
        async with aiohttp.ClientSession() as session:
            async with session.post(f"{self.kb_endpoint}/search", json={"query": question, "top_k": top_k}) as resp:
                results = await resp.json()
                return [item['content'] for item in results['documents']]
    async def generate_answer(self, question: str, contexts):
        context_text = "\n\n".join([f"参考资料{i+1}:\n{c}" for i, c in enumerate(contexts)])
        prompt = (
            f"根据以下参考资料回答问题。如果没有相关信息,请回答\"抱歉,我无法根据现有资料回答这个问题\"。\n\n"
            f"{context_text}\n\n"
            f"问题:{question}\n\n"
            f"答案:"
        )
        async with aiohttp.ClientSession() as session:
            async with session.post(f"{self.llm_endpoint}/v1/completions", json={"prompt": prompt, "max_length": 512, "temperature": 0.3, "top_p": 0.85}, timeout=aiohttp.ClientTimeout(total=30)) as resp:
                result = await resp.json()
                return result['text']
    async def answer_question(self, question: str):
        start = asyncio.get_event_loop().time()
        contexts = await self.retrieve_context(question)
        retrieve_time = asyncio.get_event_loop().time() - start
        answer = await self.generate_answer(question, contexts)
        total_time = asyncio.get_event_loop().time() - start
        return {"question": question, "answer": answer, "contexts": contexts, "retrieve_time": retrieve_time, "total_time": total_time}

# Example usage
async def main():
    qa = KnowledgeBaseQA(llm_endpoint="http://localhost:8000", kb_endpoint="http://localhost:9000")
    result = await qa.answer_question("如何重置密码?")
    print(f"Q: {result['question']}")
    print(f"A: {result['answer']}")
    print(f"耗时: {result['total_time']:.2f}秒")

if __name__ == "__main__":
    asyncio.run(main())

Result example (≈1.2 s):

Q: 如何重置密码?
A: 根据参考资料,重置密码的步骤如下:1. 登录系统后点击右上角头像;2. 选择"账户设置";3. 点击"修改密码"按钮;4. 输入原密码和新密码;5. 点击"确认"保存。如果忘记原密码,可以点击登录页面的"忘记密码",通过邮箱验证码重置。
耗时: 1.23秒

Batch Document Summarization

# batch_summarize.py (excerpt)
import asyncio, aiohttp
async def summarize_document(session, doc_id: str, content: str):
    prompt = f"请为以下文档生成200字以内的摘要,突出核心要点:\n\n{content[:4000]}\n\n摘要:"
    async with session.post("http://localhost:8000/v1/completions", json={"prompt": prompt, "max_length": 300, "temperature": 0.5}, timeout=aiohttp.ClientTimeout(total=60)) as resp:
        result = await resp.json()
        return {"doc_id": doc_id, "summary": result['text'], "tokens": result['total_tokens'], "time": result['inference_time']}

async def batch_summarize(documents, concurrency=32):
    async with aiohttp.ClientSession() as session:
        semaphore = asyncio.Semaphore(concurrency)
        async def process_one(doc):
            async with semaphore:
                return await summarize_document(session, doc['id'], doc['content'])
        tasks = [process_one(doc) for doc in documents]
        return await asyncio.gather(*tasks, return_exceptions=True)  # failed requests come back as exceptions instead of aborting the whole batch

# Example run
docs = [{"id": f"doc_{i}", "content": f"这是第{i}份文档的内容..." * 100} for i in range(1000)]
results = asyncio.run(batch_summarize(docs, concurrency=32))
success = [r for r in results if isinstance(r, dict)]  # filter out requests that raised exceptions
print(f"成功处理: {len(success)}/{len(docs)}")
print(f"平均耗时: {sum(r['time'] for r in success)/len(success):.2f}秒")
print(f"总tokens: {sum(r['tokens'] for r in success)}")

Best Practices

Performance Optimization

Model quantization first: INT4 cuts memory use by more than 75 % with <2 % accuracy loss for ChatGLM3-6B.

Dynamic batching tuning: for chat bots, max_batch_size=8 and batch_timeout=30ms yield p95 latency of ~1.5 s at ~45 QPS; for document processing, max_batch_size=24 and batch_timeout=150ms roughly double throughput.

KV-cache reuse: cache past key-values across dialogue turns to avoid recomputing the shared prefix; a minimal sketch follows.
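
The mechanism behind KV-cache reuse is illustrated by the sketch below, assuming a standard Hugging Face causal LM; ChatGLM's custom modelling code exposes the same use_cache / past_key_values machinery through generate(), so treat this as an illustration rather than production code:

# kv_cache_sketch.py - incremental decoding with past_key_values (illustration only)
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

model_path = "/data/models/chatglm3-6b"  # path used throughout this guide
tok = AutoTokenizer.from_pretrained(model_path, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    model_path, trust_remote_code=True, torch_dtype=torch.float16, device_map="cuda:0"
).eval()

input_ids = tok("你好,请介绍一下Linux。", return_tensors="pt").input_ids.to(model.device)
generated = []
with torch.no_grad():
    out = model(input_ids=input_ids, use_cache=True)   # one full pass over the prompt
    past = out.past_key_values                         # cached K/V for every layer
    next_id = out.logits[:, -1:].argmax(dim=-1)        # greedy decoding for brevity
    for _ in range(32):                                # each step feeds only the newest token
        generated.append(next_id.item())
        out = model(input_ids=next_id, past_key_values=past, use_cache=True)
        past = out.past_key_values
        next_id = out.logits[:, -1:].argmax(dim=-1)
print(tok.decode(generated))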

Security Hardening

Rate‑limit at Nginx and application level (e.g., 30 requests/min per IP).

Validate prompts against a blacklist to prevent prompt injection.

Mask sensitive data (phone numbers, IDs, emails) in logs.
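
For log masking, a logging filter along these lines plugs into the standard-library logger already used in server.py (a sketch; the regular expressions are examples and should be adapted to your data):

# log_masking.py - mask phone numbers, national IDs and emails before they reach the log file (sketch)
import logging
import re

PATTERNS = [
    (re.compile(r"\b1[3-9]\d{9}\b"), "[PHONE]"),           # mainland China mobile numbers
    (re.compile(r"\b\d{17}[\dXx]\b"), "[ID]"),             # 18-digit national ID numbers
    (re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+"), "[EMAIL]"),  # email addresses
]

class MaskingFilter(logging.Filter):
    def filter(self, record: logging.LogRecord) -> bool:
        message = record.getMessage()
        for pattern, replacement in PATTERNS:
            message = pattern.sub(replacement, message)
        record.msg, record.args = message, ()
        return True

# Example: attach the filter to the module logger (in server.py this is `logger`)
logging.getLogger(__name__).addFilter(MaskingFilter())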

High Availability

Deploy multiple instances behind Nginx with ip_hash to keep cache locality.

Implement graceful degradation: switch to a smaller model or a cached response when GPU memory usage exceeds 90 % (a minimal check is sketched after this list).

Maintain a hot‑standby model instance for instant failover.
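
A minimal memory-pressure check behind such a degradation switch might look like the sketch below (torch.cuda.mem_get_info reports free and total bytes on the active device; fallback_response is a hypothetical helper):

# gpu_pressure.py - sketch of the check behind a graceful-degradation switch
import torch

def should_degrade(threshold: float = 0.90) -> bool:
    """Return True when GPU memory usage on the current device exceeds the threshold."""
    free, total = torch.cuda.mem_get_info()  # free and total bytes on the active CUDA device
    return (1 - free / total) > threshold

# In the request handler one might then route to a smaller model or a cached answer:
# if should_degrade():
#     return fallback_response(prompt)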

Troubleshooting & Monitoring

Common Errors

CUDA out of memory: reduce max_batch_size, limit max_input_length, or use more aggressive quantization.

Inference timeout: increase the service timeout, enable Flash Attention, and lower max_output_length.

Slow model loading: store the model on local NVMe SSD and pre-warm it at startup.

Batch processing stuck: decrease batch_timeout or lower max_batch_size when request volume is low.

Low GPU utilisation: increase max_batch_size or the number of concurrent requests; check for tokenization bottlenecks.

Poor text quality: adjust the quantization level or the generation parameters (temperature, top_p, repetition_penalty).

Monitoring

GPU utilisation, memory, and temperature: nvidia-smi on the host, or a Prometheus GPU exporter such as NVIDIA's dcgm-exporter.

Service health endpoint /health provides queue size and model info.

Expose Prometheus metrics in server.py (request count, latency histogram, queue gauge, GPU memory gauge); a sketch follows the alert thresholds below.

Typical alert thresholds: GPU utilisation <30 % or >95 %; memory usage >90 %; temperature >85 °C; queue size >50; p95 latency >5 s.
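
A hedged sketch of such instrumentation with prometheus_client (not part of the server.py shown above; the metric names are illustrative) could look like this:

# metrics.py - Prometheus instrumentation sketch for server.py (metric names are illustrative)
from prometheus_client import Counter, Gauge, Histogram, make_asgi_app

REQUESTS = Counter("llm_requests_total", "Completed inference requests", ["status"])
LATENCY = Histogram("llm_inference_seconds", "End-to-end inference latency",
                    buckets=(0.5, 1, 2, 3, 5, 10, 30, 60))
QUEUE_SIZE = Gauge("llm_batch_queue_size", "Requests waiting in the dynamic batch queue")
GPU_MEMORY = Gauge("llm_gpu_memory_allocated_bytes", "GPU memory allocated by this process")

# In server.py the exporter can be mounted on the existing FastAPI app:
#   app.mount("/metrics", make_asgi_app())
# and the metrics updated inside create_completion, e.g.:
#   REQUESTS.labels(status="ok").inc()
#   LATENCY.observe(result['inference_time'])
#   QUEUE_SIZE.set(len(batch_processor.queue))
#   GPU_MEMORY.set(torch.cuda.memory_allocated())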

Backup & Recovery

# backup.sh (excerpt)
#!/bin/bash
BACKUP_DIR="/data/backups/llm-inference"
TIMESTAMP=$(date +%Y%m%d_%H%M%S)
BACKUP_PATH="$BACKUP_DIR/backup_$TIMESTAMP"
mkdir -p "$BACKUP_PATH"
# Config files
cp /opt/llm-inference/config.yaml "$BACKUP_PATH/"
cp /etc/systemd/system/llm-inference.service "$BACKUP_PATH/"
cp /etc/nginx/sites-available/llm-inference "$BACKUP_PATH/"
# Recent logs (last 7 days)
find /var/log/llm-inference -name "*.log" -mtime -7 -exec cp {} "$BACKUP_PATH/" \;
# Custom model (if any)
if [ -d "/data/models/custom-model" ]; then
    tar -czf "$BACKUP_PATH/custom-model.tar.gz" -C /data/models custom-model
fi
# Compress backup
cd "$BACKUP_DIR"
tar -czf "backup_$TIMESTAMP.tar.gz" "backup_$TIMESTAMP"
rm -rf "backup_$TIMESTAMP"
# Keep last 30 days
find "$BACKUP_DIR" -name "backup_*.tar.gz" -mtime +30 -delete

Restore steps: stop services, extract backup, restore config files, verify model files, reload systemd units, and start services.

Conclusion

Key Takeaways

Model quantization (INT4) provides the highest memory‑performance gain.

Dynamic batching is essential; tune max_batch_size and batch_timeout per workload.

Continuous monitoring of GPU metrics and request latency prevents silent degradation.

Performance tuning is a system‑wide effort: model choice, hardware, batch strategy, and configuration all interact.

Next‑Level Topics

Tensor-parallel and pipeline-parallel inference for >70B models (DeepSpeed, Megatron-LM).

Continuous batching and PagedAttention (vLLM) for further throughput gains.

Model pruning and knowledge‑distillation for ultra‑low latency scenarios.

References

Transformers official documentation.

NVIDIA TensorRT‑LLM.

Flash Attention paper.

vLLM technical blog.

Hugging Face community forum.
