OpenAI Unveils gpt-oss 120B & 20B: Open‑Source MoE Models with 4‑Bit Quantization
OpenAI's gpt-oss series introduces two open‑source large language models—gpt‑oss‑120b and gpt‑oss‑20b—featuring Mixture‑of‑Experts architecture, 4‑bit MXFP4 quantization, extensive benchmark results, and broad deployment options across cloud and consumer hardware.
OpenAI announced the gpt-oss series on August 5, 2025, releasing two open-source large language models: gpt-oss-120b (117 billion parameters) and gpt-oss-20b (21 billion parameters).
Both models adopt a Mixture‑of‑Experts (MoE) architecture to achieve fast, efficient inference on cloud and consumer‑grade hardware, and are released under the permissive Apache 2.0 license.
1. Main Specifications and Performance
Two model scales: total parameters of 21 billion (gpt-oss-20b) and 117 billion (gpt-oss-120b); the MoE design activates only about 3.6 billion and 5.1 billion parameters per token, respectively.
Efficient quantization: 4-bit MXFP4 quantization reduces memory footprint while preserving speed and quality.
Hardware-friendly: The 117-billion-parameter model runs on a single 80 GB H100 GPU; the 21-billion-parameter model fits within 16 GB of memory, enabling on-device and consumer deployments (see the back-of-envelope check after the table below).
Model | Memory Requirement | Typical Hardware Scenario
gpt-oss-20b | 16 GB RAM | Laptop (e.g., RTX 4080)
gpt-oss-120b | 80 GB GPU memory | Single H100 GPU / Blackwell GPU cluster
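As a rough sanity check on these memory figures, here is a back-of-envelope calculation assuming ~4 bits (0.5 bytes) per weight under MXFP4 and ignoring activations and KV cache:

# Approximate weight memory at ~4 bits per parameter under MXFP4
def weight_gb(params_billion: float, bits_per_param: float = 4.0) -> float:
    return params_billion * 1e9 * bits_per_param / 8 / 1e9

print(f"gpt-oss-120b: ~{weight_gb(117):.0f} GB")  # ~59 GB, fits a single 80 GB H100
print(f"gpt-oss-20b:  ~{weight_gb(21):.1f} GB")   # ~10.5 GB, fits within 16 GB RAM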
Reasoning-first: Text-only models with chain-of-thought capability; reasoning effort can be tuned to balance latency, cost, and performance (see the sketch after this list).
Tool compatibility: Supports instruction following and tool use; OpenAI recommends running inference through the Responses API.
Broad inference support across frameworks such as Transformers, vLLM, llama.cpp, and Ollama.
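Reasoning effort is selected per request. A minimal sketch, assuming a local OpenAI-compatible endpoint (e.g., Ollama) and the documented convention of setting the level in the system prompt:

from openai import OpenAI

# Assumes Ollama is serving gpt-oss:20b on its default port
client = OpenAI(base_url="http://localhost:11434/v1", api_key="ollama")

response = client.chat.completions.create(
    model="gpt-oss:20b",
    messages=[
        # gpt-oss reads the reasoning level (low/medium/high) from the system prompt
        {"role": "system", "content": "Reasoning: high"},
        {"role": "user", "content": "Prove that the square root of 2 is irrational."}
    ]
)
print(response.choices[0].message.content)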
2. Core Highlights
Architectural foundation: A Transformer backbone enhanced with an MoE design reduces the number of parameters active per token, boosting efficiency (a routing sketch follows this list).
Attention mechanism: Alternates dense and locally banded sparse attention, uses grouped multi-query attention (group size 8) with RoPE positional encoding, and natively supports a 128k context length.
Quantization and acceleration: MXFP4 quantization compresses the MoE expert weights to 4-bit precision with <1% quality loss, cutting their memory usage by roughly 4× versus 16-bit weights; three reasoning-effort levels (low/medium/high) dynamically balance speed and depth.
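To make "active parameters" concrete, here is a minimal top-k MoE routing sketch in plain NumPy. The expert count and k are toy values, not gpt-oss's actual configuration; the point is that only the selected experts run per token:

import numpy as np

def moe_layer(x, gate_w, experts, k=2):
    """Route token vector x to the top-k experts and mix their outputs."""
    logits = x @ gate_w                       # router score per expert
    top = np.argsort(logits)[-k:]             # indices of the k highest-scoring experts
    weights = np.exp(logits[top])
    weights /= weights.sum()                  # softmax over selected experts only
    # Only k expert networks execute; the rest stay idle, which is why
    # active parameters are far fewer than total parameters
    return sum(w * experts[i](x) for w, i in zip(weights, top))

rng = np.random.default_rng(0)
d, n_experts = 64, 8                          # toy sizes for illustration
experts = [lambda v, W=rng.normal(size=(d, d)) / np.sqrt(d): v @ W
           for _ in range(n_experts)]
gate_w = rng.normal(size=(d, n_experts))
y = moe_layer(rng.normal(size=d), gate_w, experts)
print(y.shape)  # (64,)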
3. Performance Evaluation
Both gpt-oss-120b and gpt-oss-20b were benchmarked on standard academic tests covering programming, competition math, medical queries, and tool usage, and compared against OpenAI's o3, o3-mini, and o4-mini models.
gpt-oss-120b outperforms o3-mini on Codeforces, MMLU, HLE, and TauBench, and matches or exceeds o4-mini on HealthBench and AIME 2024/2025. Despite its much smaller size, gpt-oss-20b matches or surpasses o3-mini on the same benchmarks, and even pulls ahead on competition math and medical queries.
4. Chain‑of‑Thought (CoT)
OpenAI found that directly supervising a model's chain-of-thought during training reduces developers' ability to detect anomalous behavior; the gpt-oss models therefore keep their native CoT free of direct alignment supervision, allowing researchers to inspect the raw reasoning to:
- Detect anomalous model behavior.
- Identify potential deception.
- Recognize possible abuse cases.
5. Developer Ecosystem
The weights for both models are freely downloadable from Hugging Face in MXFP4‑quantized format, enabling gpt‑oss‑120b to run within 80 GB memory and gpt‑oss‑20b within 16 GB. They can be deployed locally, on devices, or via third‑party inference services such as Hugging Face, Azure, vLLM, Ollama, llama.cpp, LM Studio, AWS, Fireworks, Together AI, Baseten, Databricks, Vercel, Cloudflare, and OpenRouter.
Trial site: https://gpt-oss.com/
Model downloads: https://huggingface.co/openai/gpt-oss-120b and https://huggingface.co/openai/gpt-oss-20b
Reference implementation: https://github.com/openai/gpt-oss
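The weights can also be fetched programmatically; a minimal sketch using the huggingface_hub client:

from huggingface_hub import snapshot_download

# Download the MXFP4-quantized checkpoint into the local Hugging Face cache
path = snapshot_download("openai/gpt-oss-20b")
print(path)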
Tool | Applicable Scenario | Code Example
Ollama | Local terminal/GUI interaction | ollama run gpt-oss:20b
Transformers | Python integration | pipeline("text-generation", model="openai/gpt-oss-20b")
vLLM | High-concurrency production serving | vllm serve openai/gpt-oss-120b
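The Transformers one-liner in the table expands to roughly the following; torch_dtype="auto" and device_map="auto" are common pipeline arguments used here as a sketch, not OpenAI's official recipe:

from transformers import pipeline

# Recent transformers versions accept chat-style message lists directly
pipe = pipeline(
    "text-generation",
    model="openai/gpt-oss-20b",
    torch_dtype="auto",   # keep the checkpoint's stored precision
    device_map="auto"     # spread layers across available devices
)

messages = [{"role": "user", "content": "Summarize MXFP4 quantization in one sentence."}]
out = pipe(messages, max_new_tokens=128)
print(out[0]["generated_text"][-1]["content"])  # last message is the model's reply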
Local API Example
from openai import OpenAI

# Point the OpenAI client at a local Ollama server (default port 11434)
client = OpenAI(
    base_url="http://localhost:11434/v1",  # Local Ollama API
    api_key="ollama"  # Dummy key; Ollama ignores it but the client requires one
)

response = client.chat.completions.create(
    model="gpt-oss:20b",
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Explain what MXFP4 quantization is."}
    ]
)

print(response.choices[0].message.content)

Function Call Example
tools = [
    {
        "type": "function",
        "function": {
            "name": "get_weather",
            "description": "Get current weather in a given city",
            "parameters": {
                "type": "object",
                "properties": {"city": {"type": "string"}},
                "required": ["city"]
            }
        }
    }
]

response = client.chat.completions.create(
    model="gpt-oss:20b",
    messages=[{"role": "user", "content": "What's the weather in Berlin right now?"}],
    tools=tools
)

# When the model decides to use the tool, the reply carries a tool call instead of text
print(response.choices[0].message)
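The example stops at the model's tool call. A minimal sketch of completing the loop, with a hypothetical get_weather stub (not part of the original example) and assuming the local endpoint supports tool-role messages:

import json

def get_weather(city: str) -> str:
    # Hypothetical stub; a real implementation would call a weather API
    return f"Sunny, 22°C in {city}"

msg = response.choices[0].message
if msg.tool_calls:
    call = msg.tool_calls[0]
    args = json.loads(call.function.arguments)
    result = get_weather(**args)
    # Feed the tool result back so the model can compose a final answer
    followup = client.chat.completions.create(
        model="gpt-oss:20b",
        messages=[
            {"role": "user", "content": "What's the weather in Berlin right now?"},
            msg,
            {"role": "tool", "tool_call_id": call.id, "content": result}
        ],
        tools=tools
    )
    print(followup.choices[0].message.content)

6. Why OpenAI Open-Sources the Models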
Releasing gpt-oss-120b and gpt-oss-20b marks a major step for open-source large models, delivering high performance and safer inference while accelerating research and innovation. The models lower barriers for emerging markets, resource-constrained industries, and small organizations, enabling broader, more transparent AI development.
Reference documents:
https://openai.com/open-models/
https://openai.com/index/introducing-gpt-oss/