Step‑by‑Step Guide to Calling Locally Deployed LLMs via OpenAI API Format in Python
This tutorial explains the OpenAI‑style request and response schema, demonstrates low‑level API calls with the requests library, compares them to the high‑level openai package, and walks through building a streaming multi‑turn chatbot that interacts with a locally hosted large language model.
OpenAI‑style protocol
The OpenAI format defines a JSON‑based HTTP schema for LLM services. Core request fields are:
base_url – model service endpoint.
api_key – authentication token (optional for local deployments).
messages – list of objects with role (system, user, assistant, tool) and content.
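Putting the core fields together, a minimal request body looks like the following sketch (the model name and message contents are illustrative):

```python
# Minimal sketch of an OpenAI-style chat request body; values are illustrative.
payload = {
    "model": "Qwen3-4B",
    "messages": [
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Hello"},
    ],
}

# Note that api_key travels in the Authorization header, not in the body:
headers = {"Authorization": "Bearer 111", "Content-Type": "application/json"}

print(payload["messages"][1]["content"])  # -> Hello
```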
Function calling extends the protocol with two additions: tool_calls, which appears in an assistant message when the model wants to invoke a function, and the tool role, used for the message that returns the function result together with the corresponding tool_call_id.
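A hypothetical sketch of the two message shapes in a function-calling round trip (the function name get_weather, its arguments, and the id values are made up for illustration; real ids are generated by the server):

```python
import json

# 1. Assistant turn: the model requests a function invocation via tool_calls.
assistant_turn = {
    "role": "assistant",
    "content": None,  # no text reply; the model is asking for a tool call
    "tool_calls": [{
        "id": "call_0",  # illustrative; normally server-generated
        "type": "function",
        "function": {
            "name": "get_weather",
            "arguments": json.dumps({"city": "Beijing"}),  # JSON-encoded string
        },
    }],
}

# 2. Tool turn: the caller runs the function and sends back the result,
#    echoing the matching tool_call_id so the model can pair them up.
tool_turn = {
    "role": "tool",
    "tool_call_id": "call_0",
    "content": json.dumps({"city": "Beijing", "temp_c": 21}),
}
```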
Response formats:
Non‑streaming – a single chat completion object containing id, choices (where message.content holds the full reply), created, model, and usage (token counts).
Streaming – when stream=True, the server returns a sequence of chat completion chunk objects. Each chunk’s choices[0].delta.content holds an incremental text fragment that must be concatenated until the stream ends.
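The concatenation step can be sketched as follows; the chunk dicts are hand-written stand-ins for real chat completion chunk payloads:

```python
# Minimal sketch: concatenating delta fragments from streamed chunk objects.
chunks = [
    {"choices": [{"delta": {"role": "assistant"}}]},        # first chunk carries the role
    {"choices": [{"delta": {"content": "Hello"}}]},
    {"choices": [{"delta": {"content": ", world"}}]},
    {"choices": [{"delta": {}, "finish_reason": "stop"}]},  # final chunk: no content
]

reply = ""
for chunk in chunks:
    delta = chunk["choices"][0]["delta"]
    fragment = delta.get("content")
    if fragment is not None:  # role-only and final chunks carry no text
        reply += fragment

print(reply)  # -> Hello, world
```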
Model deployment with vLLM
Start a vLLM service for the Qwen‑3‑4B model:
vllm serve ./Qwen3-4B/ \
    --served-model-name Qwen3-4B \
    --api-key 111 \
    --max-model-len 32768 \
    --gpu-memory-utilization 0.9 \
    --port 6666

The service then listens at http://localhost:6666/v1.
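Before wiring up a client, it is worth confirming the server is reachable. Assuming the flags above, one way is to hit the standard /v1/models listing endpoint:

```shell
# Sanity check: list the served models (uses the --api-key value from above).
curl http://localhost:6666/v1/models \
  -H "Authorization: Bearer 111"
```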
Low‑level implementation with requests
import requests

class OpenAI:
    def __init__(self, base_url, api_key):
        self.base_url = base_url
        self.api_key = api_key
        self.headers = {
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json"
        }

    def make_request(self, model, messages):
        """POST a non-streaming chat-completion request."""
        url = f"{self.base_url}/chat/completions"
        payload = {
            "model": model,
            "messages": messages,
            "stream": False,
            "max_tokens": 2048,
            "temperature": 0.7,
            "top_p": 1.0
        }
        return requests.post(url, headers=self.headers, json=payload)

    def extract_response(self, result):
        """Pull the reply text and usage stats out of the response JSON."""
        if "choices" in result and len(result["choices"]) > 0:
            message = result["choices"][0].get("message", {})
            content = message.get("content", "")
            finish_reason = result["choices"][0].get("finish_reason", "")
            usage = result.get("usage", {})
            print(f"✓ Tokens used: prompt={usage.get('prompt_tokens', 0)}, "
                  f"completion={usage.get('completion_tokens', 0)}, "
                  f"total={usage.get('total_tokens', 0)}")
            print(f"✓ Finish reason: {finish_reason}")
            print("\n" + "=" * 50)
            print("Model reply:")
            print("=" * 50)
            print(content)
            print("=" * 50)
            return {
                "content": content,
                "role": message.get("role", "assistant"),
                "finish_reason": finish_reason,
                "usage": usage,
                "full_response": result
            }
        print("No valid content in response")
        return None

    def chat_completion(self, model, messages):
        response = self.make_request(model, messages)
        if response.status_code != 200:
            print(f"HTTP error: {response.status_code}")
            print(f"Message: {response.text}")
            return None
        return self.extract_response(response.json())

    def chat(self, model, messages):
        result = self.chat_completion(model, messages)
        return result["content"] if result else None

Test the client:
if __name__ == "__main__":
    base_url = "http://localhost:6666/v1"
    API_KEY = "111"
    model = "Qwen3-4B"
    client = OpenAI(base_url, API_KEY)
    messages = [
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Hello"}
    ]
    response = client.chat(model, messages)
    print(response)

High‑level implementation with the openai package
from openai import OpenAI

client = OpenAI(base_url="http://localhost:6666/v1", api_key="111")
messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "Hello"}
]
response = client.chat.completions.create(model="Qwen3-4B", messages=messages)
print(response.choices[0].message.content)

The library wraps the same HTTP schema in a concise interface.
Streaming multi‑turn chatbot
from openai import OpenAI

client = OpenAI(base_url="http://localhost:6666/v1", api_key="111")
messages = [{"role": "system", "content": "You are a friendly AI assistant, happy to help users solve problems."}]

print("=" * 50)
print("Welcome to the multi-turn chatbot! (streaming edition)")
print("Type 'exit' or 'quit' to leave")
print("Type 'clear' or 'reset' to clear the conversation history")
print("=" * 50)

turn = 1
while True:
    user_input = input(f"\n[Turn {turn}] You: ").strip()
    if user_input.lower() in ['exit', 'quit']:
        print("Goodbye!")
        break
    if user_input.lower() in ['clear', 'reset']:
        messages = [{"role": "system", "content": "You are a friendly AI assistant, happy to help users solve problems."}]
        turn = 1
        print("Conversation history cleared; starting a new dialogue")
        continue
    if not user_input:
        continue
    messages.append({"role": "user", "content": user_input})
    print("\nAI: ", end="", flush=True)
    full_response = ""
    try:
        stream = client.chat.completions.create(
            model="Qwen3-4B",
            messages=messages,
            stream=True,
            max_tokens=1000
        )
        for chunk in stream:
            if chunk.choices[0].delta.content is not None:
                content = chunk.choices[0].delta.content
                print(content, end="", flush=True)
                full_response += content
        print()
        if full_response:
            messages.append({"role": "assistant", "content": full_response})
            turn += 1
    except Exception as e:
        print(f"\nRequest failed: {e}")
        # Drop the failed user message so the history stays consistent
        if messages and messages[-1]["role"] == "user":
            messages.pop()

Test run: after a user turn stating the user's name, the follow-up question "What is my name?" is answered correctly, confirming that the accumulated messages list supplies context for multi-turn dialogue.
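The model can answer because of how the loop manages messages: the entire list, including earlier turns, is resent with every request. A toy illustration, with a hand-written assistant reply standing in for a real model response:

```python
# Toy illustration of multi-turn context: the full history is resent each turn.
messages = [{"role": "system", "content": "You are a friendly AI assistant."}]

# Turn 1: the user states a fact; a mocked assistant reply is appended after it.
messages.append({"role": "user", "content": "My name is Alice."})
messages.append({"role": "assistant", "content": "Nice to meet you, Alice!"})

# Turn 2: this request still carries turn 1, so the model can recall the name.
messages.append({"role": "user", "content": "What is my name?"})

# All four elements are sent; the earlier "Alice" turn provides the answer.
print(len(messages))  # -> 4
```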
Fun with Large Models
Master's graduate from Beijing Institute of Technology, published four top‑journal papers, previously worked as a developer at ByteDance and Alibaba. Currently researching large models at a major state‑owned enterprise. Committed to sharing concise, practical AI large‑model development experience, believing that AI large models will become as essential as PCs in the future. Let's start experimenting now!
