Turn LM Studio into a Local OpenAI‑Compatible API Server
This guide shows how to select a model in LM Studio, expose a local port, start the HTTP server, and interact with it via curl, covering model listing, chat completions, and the difference between streaming and full‑response modes.
Overview
LM Studio can expose a local HTTP endpoint that mimics the OpenAI API, allowing client applications to call a loaded model through standard REST calls.
1. Select a Model
Open the Developer tab in LM Studio and choose the desired model from the list (e.g., llama-4-maverick-17b-128e-instruct).
2. Configure Port Exposure
In the Server settings set the listening port (default 1234) and enable CORS so that web pages or other tools can connect.
3. Start the Service
Switch to the Status tab. LM Studio will start an HTTP server and print the available endpoints.
2025-04-26 20:55:13 [INFO] [LM STUDIO SERVER] Success! HTTP server listening on port 1234
2025-04-26 20:55:13 [INFO] [LM STUDIO SERVER] Supported endpoints:
GET http://localhost:1234/v1/models
POST http://localhost:1234/v1/chat/completions
POST http://localhost:1234/v1/completions
POST http://localhost:1234/v1/embeddings
[LM STUDIO SERVER] Logs are saved into /Users/javaedge/.lmstudio/server-logs
Server started.
4. Quick‑Start API Calls
4.1 List Loaded Models
Verify that the server is reachable and see which models are loaded:
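The endpoint answers with the OpenAI‑style list shape (`{"object": "list", "data": [...]}`). A short Python sketch of parsing that response; the sample payload below is illustrative, not captured server output:

```python
import json

# Illustrative /v1/models payload in the OpenAI list format
# (field values are examples, not real server output).
sample = '''{
  "object": "list",
  "data": [
    {"id": "llama-4-maverick-17b-128e-instruct", "object": "model"}
  ]
}'''

# Collect the model identifiers from the "data" array.
models = [m["id"] for m in json.loads(sample)["data"]]
print(models)  # ['llama-4-maverick-17b-128e-instruct']
```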
curl http://127.0.0.1:1234/v1/models
4.2 Chat Completion
Send a /v1/chat/completions request that follows the OpenAI JSON schema. The request is stateless; the client must include the full conversation history each time.
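Because the server keeps no session state, the client owns the transcript and resends it in full on every turn. A minimal Python sketch of that bookkeeping; `send` is a hypothetical stand-in for the HTTP call, returning a canned reply so the sketch runs offline:

```python
# Client-side conversation state for a stateless chat endpoint.
def send(messages):
    # Hypothetical stand-in: a real client would POST `messages`
    # to /v1/chat/completions and return the assistant's text.
    return "A model am I, here to reply."

history = [{"role": "system", "content": "Always answer in rhymes."}]

def ask(user_text):
    history.append({"role": "user", "content": user_text})
    reply = send(history)  # the FULL history goes out on every turn
    history.append({"role": "assistant", "content": reply})
    return reply

ask("Introduce yourself.")
print(len(history))  # 3: system + user + assistant
```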
curl http://127.0.0.1:1234/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "llama-4-maverick-17b-128e-instruct",
"messages": [
{ "role": "system", "content": "Always answer in rhymes." },
{ "role": "user", "content": "Introduce yourself." }
],
"temperature": 0.7,
"max_tokens": -1,
"stream": true
}'
4.3 Streaming vs. Full Response
If "stream": true, LM Studio returns each generated token as soon as it is produced, enabling low‑latency, incremental display. Setting "stream": false makes the server accumulate the entire completion before sending the response, which can increase latency for long outputs.
JavaEdge
Hands‑on development experience at multiple leading tech firms; now a software architect at a Shanghai state‑owned enterprise and founder of Programming Yanxuan. Nearly 300k online followers; expertise in distributed system design, AIGC application development, and quantitative finance.
