Run OpenAI’s Open‑Source gpt‑oss Models Locally with Ollama – A Quick Guide

OpenAI’s new open‑source gpt‑oss models, available in 20B and 120B sizes, can be run locally via Ollama with features like agentic capabilities, configurable reasoning, fine‑tuning, and MXFP4 quantization, and the article provides step‑by‑step installation, usage, and integration instructions.


OpenAI released its latest open‑weight model series, gpt‑oss, and partnered with Ollama to let developers run the models locally.

gpt‑oss model overview

Two model sizes are offered:

gpt-oss-20b : 21‑billion‑parameter model optimized for low latency and local or domain‑specific use, suitable for personal computers.

gpt-oss-120b : 117‑billion‑parameter flagship model for production, general‑purpose and heavy inference, best run on servers with professional‑grade GPUs.

Key features

gpt-oss includes built‑in agent capabilities such as function calling, web browsing, and structured output generation (e.g., JSON). It also provides full chain‑of‑thought visibility, configurable reasoning effort (low/medium/high), fine‑tuning support, and an Apache 2.0 license.

Ollama also supports built‑in web search.

Technical deep‑dive: MXFP4 quantization

OpenAI uses MXFP4 quantization (≈4.25 bits per parameter) to compress the models. Over 90 % of parameters come from MoE layers; after quantization, gpt-oss-20b runs smoothly on a system with only 16 GB RAM, and gpt-oss-120b fits into a single 80 GB GPU.

Ollama natively supports the MXFP4 format, requiring no extra conversion and matching OpenAI’s reference implementation in benchmarks.
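As a rough sanity check on those figures, the weight‑storage footprint can be estimated from the bits‑per‑parameter number. This is a sketch: the parameter counts (~21 B and ~117 B) are approximate, and a real deployment also needs memory for activations and the KV cache.

```java
public class Mxfp4Memory {
    // Approximate weight-storage size for an MXFP4-quantized model.
    // bitsPerParam ≈ 4.25: MXFP4 stores 4-bit values plus a shared per-block
    // scale, which adds a fraction of a bit per parameter.
    static double weightGigabytes(double paramsInBillions, double bitsPerParam) {
        // 1e9 params * bits / 8 bits-per-byte / 1e9 bytes-per-GB simplifies to:
        return paramsInBillions * bitsPerParam / 8.0;
    }

    public static void main(String[] args) {
        System.out.printf("gpt-oss-20b  (~21B params):  ~%.1f GB%n", weightGigabytes(21, 4.25));
        System.out.printf("gpt-oss-120b (~117B params): ~%.1f GB%n", weightGigabytes(117, 4.25));
    }
}
```

This yields roughly 11 GB and 62 GB for the weights alone, consistent with the 16 GB RAM and 80 GB GPU figures above once runtime overhead is added.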

Quick start guide

Install Ollama for your OS, then verify the installation by running ollama in a terminal. Pull and run a model with a single command:

ollama run gpt-oss:20b

For the 120B version (requires a GPU with ≥80 GB VRAM):

ollama run gpt-oss:120b

After the model downloads, you can chat directly in the terminal.

Ways to interact with the model

1. Command‑line tool

Use ollama run to chat with a model interactively; subcommands such as ollama list and ollama show display installed‑model information.


2. cURL API

Ollama exposes an HTTP service on port 11434. Example:

curl http://localhost:11434/api/generate -d '{
  "model": "gpt-oss:20b",
  "prompt": "What is water made of?"
}'
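Ollama also exposes a chat‑style endpoint at /api/chat that accepts a message list, which is more convenient for multi‑turn conversations. A minimal non‑streaming sketch (the prompt text is illustrative):

```shell
curl http://localhost:11434/api/chat -d '{
  "model": "gpt-oss:20b",
  "messages": [
    {"role": "user", "content": "Explain MXFP4 quantization in one sentence."}
  ],
  "stream": false
}'
```

With "stream": false the response arrives as a single JSON object instead of a stream of partial chunks.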

3. Java integration

Add the spring-ai-starter-model-ollama dependency and configure the base URL and model:

<dependency>
    <groupId>org.springframework.ai</groupId>
    <artifactId>spring-ai-starter-model-ollama</artifactId>
</dependency>

spring.ai.ollama.base-url=http://localhost:11434
spring.ai.ollama.chat.model=gpt-oss:20b

Then use the injected ChatModel to call the model. Note that the current Spring AI 1.0 release does not support configuring the reasoning‑effort parameter for gpt‑oss; use the HTTP API for that purpose.
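A minimal usage sketch with the injected ChatModel, assuming the starter dependency and properties above are in place (the service class name and prompt are illustrative):

```java
import org.springframework.ai.chat.model.ChatModel;
import org.springframework.stereotype.Service;

@Service
public class GptOssService {

    private final ChatModel chatModel;

    // The Ollama starter auto-configures a ChatModel bean pointing at
    // spring.ai.ollama.base-url with the configured chat model.
    public GptOssService(ChatModel chatModel) {
        this.chatModel = chatModel;
    }

    public String ask(String question) {
        // call(String) sends a single user message and returns the reply text
        return chatModel.call(question);
    }
}
```

This fragment requires a running Spring context and a local Ollama instance, so it is shown for orientation rather than as a standalone program.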

Important reminder: If you already have Ollama installed, upgrade to the latest version to ensure full support for the gpt‑oss models.
Written by Java Architecture Diary
Committed to sharing original, high‑quality technical articles; no fluff or promotional content.