Step‑by‑Step Guide to Calling Locally Deployed LLMs via OpenAI API Format in Python

This tutorial explains the OpenAI‑style request and response schema, demonstrates low‑level API calls with the requests library, compares them to the high‑level openai package, and walks through building a streaming multi‑turn chatbot that interacts with a locally hosted large language model.


OpenAI‑style protocol

The OpenAI format defines a JSON‑based HTTP schema for LLM services. A client needs three core pieces of information:

base_url – the model service endpoint.

api_key – authentication token, sent as a Bearer header (often a placeholder for local deployments).

messages – a list of objects, each with a role (system, user, assistant, or tool) and content.

Function calling extends the protocol with two additions:

tool_calls – appears in an assistant message when the model wants to invoke a function.

tool – the role of a message that returns the function result together with the corresponding tool_call_id.
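To make the shape of that exchange concrete, here is a sketch of the two messages as plain Python dicts. The function name, arguments, and call id are hypothetical; only the field layout follows the OpenAI schema.

```python
# Sketch of a tool-calling exchange as plain dicts (function name and
# arguments are invented for illustration; the field layout follows the schema).
assistant_turn = {
    "role": "assistant",
    "content": None,
    "tool_calls": [{
        "id": "call_0",
        "type": "function",
        "function": {"name": "get_weather", "arguments": '{"city": "Beijing"}'},
    }],
}

# The caller executes the function itself, then reports the result back in a
# tool message whose tool_call_id matches the assistant's request.
tool_turn = {
    "role": "tool",
    "tool_call_id": assistant_turn["tool_calls"][0]["id"],
    "content": '{"temp_c": 21}',
}

print(tool_turn["tool_call_id"])  # → call_0
```

Both messages are then appended to the conversation history before asking the model for its final answer.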

Response formats:

Non‑streaming – a single chat completion object containing id, choices (where message.content holds the full reply), created, model, and usage (token counts).

Streaming – when stream=True, the server returns a sequence of chat completion chunk objects. Each chunk’s choices[0].delta.content holds an incremental text fragment that must be concatenated until the stream ends.
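The concatenation step can be sketched offline with illustrative, already-parsed chunk dicts (the fragment text here is made up; the `choices[0].delta.content` layout is from the schema):

```python
# Illustrative parsed chunks: each carries an incremental fragment in
# choices[0].delta.content; the final chunk's delta has no content key.
chunks = [
    {"choices": [{"delta": {"content": "Hel"}}]},
    {"choices": [{"delta": {"content": "lo!"}}]},
    {"choices": [{"delta": {}}]},  # end of stream
]

reply = ""
for chunk in chunks:
    fragment = chunk["choices"][0]["delta"].get("content")
    if fragment is not None:  # skip empty terminal deltas
        reply += fragment

print(reply)  # → Hello!
```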

Model deployment with vLLM

Start a vLLM service for the Qwen3-4B model:

vllm serve ./Qwen3-4B/ \
    --served-model-name Qwen3-4B \
    --api-key 111 \
    --max-model-len 32768 \
    --gpu-memory-utilization 0.9 \
    --port 6666

The service listens at http://localhost:6666/v1.

Low‑level implementation with requests

import requests

class OpenAI:
    """Minimal client mimicking the openai package's interface over raw HTTP."""
    def __init__(self, base_url, api_key):
        self.base_url = base_url
        self.api_key = api_key
        self.headers = {
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json"
        }

    def make_request(self, model, messages):
        url = f"{self.base_url}/chat/completions"
        payload = {
            "model": model,
            "messages": messages,
            "stream": False,
            "max_tokens": 2048,
            "temperature": 0.7,
            "top_p": 1.0
        }
        return requests.post(url, headers=self.headers, json=payload)

    def extract_response(self, result):
        if "choices" in result and len(result["choices"]) > 0:
            message = result["choices"][0].get("message", {})
            content = message.get("content", "")
            finish_reason = result["choices"][0].get("finish_reason", "")
            usage = result.get("usage", {})
            print(f"✓ Tokens used: prompt={usage.get('prompt_tokens',0)}, "
                  f"completion={usage.get('completion_tokens',0)}, total={usage.get('total_tokens',0)}")
            print(f"✓ Finish reason: {finish_reason}")
            print("\n" + "="*50)
            print("Model reply:")
            print("="*50)
            print(content)
            print("="*50)
            return {
                "content": content,
                "role": message.get("role", "assistant"),
                "finish_reason": finish_reason,
                "usage": usage,
                "full_response": result
            }
        print("No valid content in response")
        return None

    def chat_completion(self, model, messages):
        response = self.make_request(model, messages)
        if response.status_code != 200:
            print(f"HTTP error: {response.status_code}")
            print(f"Message: {response.text}")
            return None
        return self.extract_response(response.json())

    def chat(self, model, messages):
        result = self.chat_completion(model, messages)
        return result["content"] if result else None

Test the client:

if __name__ == "__main__":
    base_url = "http://localhost:6666/v1"
    API_KEY = "111"
    model = "Qwen3-4B"
    client = OpenAI(base_url, API_KEY)
    messages = [
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Hello"}
    ]
    response = client.chat(model, messages)
    print(response)
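A locally hosted service can occasionally drop or refuse a request under load. A small retry-with-backoff wrapper (a generic sketch, not part of the article's client) can harden calls such as `client.chat(model, messages)`:

```python
import time

def with_retries(fn, attempts=3, backoff=1.0):
    """Call fn(); on exception, wait and retry with exponential backoff.

    A generic sketch -- wrap e.g. `lambda: client.chat(model, messages)`.
    """
    for attempt in range(attempts):
        try:
            return fn()
        except Exception:
            if attempt == attempts - 1:
                raise  # out of attempts: surface the last error
            time.sleep(backoff * (2 ** attempt))

# Demonstration with a stub that fails once, then succeeds.
calls = {"n": 0}
def flaky():
    calls["n"] += 1
    if calls["n"] < 2:
        raise ConnectionError("transient")
    return "ok"

print(with_retries(flaky, backoff=0.01))  # → ok
```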

High‑level implementation with the openai package

from openai import OpenAI
client = OpenAI(base_url="http://localhost:6666/v1", api_key="111")
messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "Hello"}
]
response = client.chat.completions.create(model="Qwen3-4B", messages=messages)
print(response.choices[0].message.content)

The library wraps the same HTTP schema, providing a concise interface.

Streaming multi‑turn chatbot

from openai import OpenAI
client = OpenAI(base_url="http://localhost:6666/v1", api_key="111")
messages = [{"role": "system", "content": "You are a friendly AI assistant, happy to help users solve problems."}]
print("="*50)
print("Welcome to the multi-turn chatbot! (streaming edition)")
print("Type 'exit' or 'quit' to leave")
print("Type 'clear' or 'reset' to clear the conversation history")
print("="*50)
turn = 1
while True:
    user_input = input(f"\n[Turn {turn}] You: ").strip()
    if user_input.lower() in ['exit', 'quit', '退出']:
        print("Goodbye!")
        break
    if user_input.lower() in ['clear', 'reset', '清除', '重置']:
        messages = [{"role": "system", "content": "You are a friendly AI assistant, happy to help users solve problems."}]
        turn = 1
        print("Conversation history cleared; starting a new dialogue")
        continue
    if not user_input:
        continue
    messages.append({"role": "user", "content": user_input})
    print("\nAI: ", end="", flush=True)
    full_response = ""
    try:
        stream = client.chat.completions.create(
            model="Qwen3-4B",
            messages=messages,
            stream=True,
            max_tokens=1000
        )
        for chunk in stream:
            if chunk.choices[0].delta.content is not None:
                content = chunk.choices[0].delta.content
                print(content, end="", flush=True)
                full_response += content
        print()
        if full_response:
            messages.append({"role": "assistant", "content": full_response})
            turn += 1
    except Exception as e:
        print(f"\nRequest failed: {e}")
        if messages and messages[-1]["role"] == "user":
            messages.pop()

Test run: after sending the user message "我的名字是苍井空" ("My name is Sora Aoi"), a subsequent query "我叫什么名字?" ("What is my name?") is answered correctly, confirming that the accumulated messages list supplies the context needed for multi‑turn dialogue.
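Because the loop appends every turn, messages grows without bound, and a long session can eventually approach the server's 32768-token context limit. A naive trimming helper (a sketch, not part of the original script; a production version would count tokens rather than messages) keeps the system prompt plus the most recent exchanges:

```python
def trim_history(messages, max_turns=8):
    """Keep the system prompt plus the last max_turns user/assistant pairs.

    A naive sketch: counts messages, not tokens.
    """
    system = [m for m in messages if m["role"] == "system"]
    rest = [m for m in messages if m["role"] != "system"]
    return system + rest[-2 * max_turns:]

# Build a 20-turn history and trim it down to the last 3 exchanges.
history = [{"role": "system", "content": "You are a helpful assistant."}]
for i in range(20):
    history.append({"role": "user", "content": f"question {i}"})
    history.append({"role": "assistant", "content": f"answer {i}"})

trimmed = trim_history(history, max_turns=3)
print(len(trimmed))  # → 7
```

Calling `trim_history(messages)` just before each `client.chat.completions.create` call would bound the prompt size at the cost of forgetting older turns.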

Republication Notice

This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contact admin@besthub.dev and we will review it promptly.

Written by Fun with Large Models

Master's graduate from Beijing Institute of Technology, published four top‑journal papers, previously worked as a developer at ByteDance and Alibaba. Currently researching large models at a major state‑owned enterprise. Committed to sharing concise, practical AI large‑model development experience, believing that AI large models will become as essential as PCs in the future. Let's start experimenting now!
