OpenAI Unveils gpt-oss 120B & 20B: Open‑Source MoE Models with 4‑Bit Quantization
OpenAI's gpt-oss series introduces two open‑source large language models—gpt‑oss‑120b and gpt‑oss‑20b—featuring Mixture‑of‑Experts architecture, 4‑bit MXFP4 quantization, extensive benchmark results, and broad deployment options across cloud and consumer hardware.
OpenAI announced the gpt-oss series on August 5, 2025, releasing two open-source large language models: gpt-oss-120b (117 billion parameters) and gpt-oss-20b (21 billion parameters).
Both models adopt a Mixture‑of‑Experts (MoE) architecture to achieve fast, efficient inference on cloud and consumer‑grade hardware, and are released under the permissive Apache 2.0 license.
1. Main Specifications and Performance
Two model scales: total parameters of 21 billion (gpt-oss-20b) and 117 billion (gpt-oss-120b); the MoE design activates only about 3.6 billion and 5.1 billion parameters per token, respectively.
Efficient quantization: 4-bit MXFP4 quantization reduces memory footprint while preserving speed and quality.
Hardware-friendly: The 117-billion-parameter model runs on a single 80 GB H100 GPU; the 21-billion-parameter model fits within 16 GB of memory, enabling on-device and consumer deployments (see the back-of-envelope check after the table below).
Model | Memory Requirement | Typical Hardware Scenario
gpt-oss-20b | 16 GB RAM | Laptop (e.g., RTX 4080)
gpt-oss-120b | 80 GB GPU memory | Single H100 GPU / Blackwell GPU cluster
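As a rough sanity check on these memory figures, here is a back-of-envelope calculation assuming ~4 bits (0.5 bytes) per weight under MXFP4 and ignoring activations and KV cache:

# Approximate weight memory at ~4 bits per parameter under MXFP4
def weight_gb(params_billion: float, bits_per_param: float = 4.0) -> float:
    return params_billion * 1e9 * bits_per_param / 8 / 1e9

print(f"gpt-oss-120b: ~{weight_gb(117):.0f} GB")  # ~59 GB, fits a single 80 GB H100
print(f"gpt-oss-20b:  ~{weight_gb(21):.1f} GB")   # ~10.5 GB, fits within 16 GB RAM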
Reasoning-first: Text-only models with chain-of-thought capability; reasoning effort can be tuned to balance latency, cost, and performance (see the sketch after this list).
Tool compatibility: Supports instruction following and tool use; OpenAI recommends running inference through the Responses API.
Broad inference support across frameworks such as Transformers, vLLM, llama.cpp, and Ollama.
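Reasoning effort is selected per request. A minimal sketch, assuming a local OpenAI-compatible endpoint (e.g., Ollama) and the documented convention of setting the level in the system prompt:

from openai import OpenAI

# Assumes Ollama is serving gpt-oss:20b on its default port
client = OpenAI(base_url="http://localhost:11434/v1", api_key="ollama")

response = client.chat.completions.create(
    model="gpt-oss:20b",
    messages=[
        # gpt-oss reads the reasoning level (low/medium/high) from the system prompt
        {"role": "system", "content": "Reasoning: high"},
        {"role": "user", "content": "Prove that the square root of 2 is irrational."}
    ]
)
print(response.choices[0].message.content)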
2. Core Highlights
Architectural foundation: A Transformer backbone enhanced with an MoE design reduces the number of parameters active per token, boosting efficiency (a routing sketch follows this list).
Attention mechanism: Alternates dense and locally banded sparse attention, uses grouped multi-query attention (group size 8) with RoPE positional encoding, and natively supports a 128k context length.
Quantization and acceleration: MXFP4 quantization compresses the MoE expert weights to 4-bit precision with <1% quality loss, cutting their memory usage by roughly 4× versus 16-bit weights; three reasoning-effort levels (low/medium/high) dynamically balance speed and depth.
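To make "active parameters" concrete, here is a minimal top-k MoE routing sketch in plain NumPy. The expert count and k are toy values, not gpt-oss's actual configuration; the point is that only the selected experts run per token:

import numpy as np

def moe_layer(x, gate_w, experts, k=2):
    """Route token vector x to the top-k experts and mix their outputs."""
    logits = x @ gate_w                       # router score per expert
    top = np.argsort(logits)[-k:]             # indices of the k highest-scoring experts
    weights = np.exp(logits[top])
    weights /= weights.sum()                  # softmax over selected experts only
    # Only k expert networks execute; the rest stay idle, which is why
    # active parameters are far fewer than total parameters
    return sum(w * experts[i](x) for w, i in zip(weights, top))

rng = np.random.default_rng(0)
d, n_experts = 64, 8                          # toy sizes for illustration
experts = [lambda v, W=rng.normal(size=(d, d)) / np.sqrt(d): v @ W
           for _ in range(n_experts)]
gate_w = rng.normal(size=(d, n_experts))
y = moe_layer(rng.normal(size=d), gate_w, experts)
print(y.shape)  # (64,)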
3. Performance Evaluation
Both gpt-oss-120b and gpt-oss-20b were benchmarked on standard academic tests covering programming, competition math, medical queries, and tool usage, and compared against OpenAI's o3, o3-mini, and o4-mini models.
gpt-oss-120b outperforms o3-mini on Codeforces, MMLU, HLE, and TauBench, and matches or exceeds o4-mini on HealthBench and AIME 2024/2025. Despite its much smaller size, gpt-oss-20b matches or surpasses o3-mini on the same benchmarks, and even pulls ahead on competition math and medical queries.
4. Chain‑of‑Thought (CoT)
OpenAI found that directly supervising a model's chain-of-thought during training reduces developers' ability to detect anomalous behavior; the gpt-oss models therefore keep their native CoT free of direct alignment supervision, allowing researchers to inspect the raw reasoning to:
- Detect anomalous model behavior.
- Identify potential deception.
- Recognize possible abuse cases.
5. Developer Ecosystem
The weights for both models are freely downloadable from Hugging Face in MXFP4‑quantized format, enabling gpt‑oss‑120b to run within 80 GB memory and gpt‑oss‑20b within 16 GB. They can be deployed locally, on devices, or via third‑party inference services such as Hugging Face, Azure, vLLM, Ollama, llama.cpp, LM Studio, AWS, Fireworks, Together AI, Baseten, Databricks, Vercel, Cloudflare, and OpenRouter.
Trial site: https://gpt-oss.com/
Model downloads: https://huggingface.co/openai/gpt-oss-120b and https://huggingface.co/openai/gpt-oss-20b
Reference implementation: https://github.com/openai/gpt-oss
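The weights can also be fetched programmatically; a minimal sketch using the huggingface_hub client:

from huggingface_hub import snapshot_download

# Download the MXFP4-quantized checkpoint into the local Hugging Face cache
path = snapshot_download("openai/gpt-oss-20b")
print(path)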
Tool | Applicable Scenario | Code Example
Ollama | Local terminal/GUI interaction | ollama run gpt-oss:20b
Transformers | Python integration | pipeline("text-generation", model="openai/gpt-oss-20b")
vLLM | High-concurrency production serving | vllm serve openai/gpt-oss-120b
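The Transformers one-liner in the table expands to roughly the following; torch_dtype="auto" and device_map="auto" are common pipeline arguments used here as a sketch, not OpenAI's official recipe:

from transformers import pipeline

# Recent transformers versions accept chat-style message lists directly
pipe = pipeline(
    "text-generation",
    model="openai/gpt-oss-20b",
    torch_dtype="auto",   # keep the checkpoint's stored precision
    device_map="auto"     # spread layers across available devices
)

messages = [{"role": "user", "content": "Summarize MXFP4 quantization in one sentence."}]
out = pipe(messages, max_new_tokens=128)
print(out[0]["generated_text"][-1]["content"])  # last message is the model's reply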
Local API Example
from openai import OpenAI

# Point the OpenAI client at a local Ollama server (default port 11434)
client = OpenAI(
    base_url="http://localhost:11434/v1",  # Local Ollama API
    api_key="ollama"  # Dummy key; Ollama ignores it but the client requires one
)

response = client.chat.completions.create(
    model="gpt-oss:20b",
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Explain what MXFP4 quantization is."}
    ]
)

print(response.choices[0].message.content)

Function Call Example
tools = [
    {
        "type": "function",
        "function": {
            "name": "get_weather",
            "description": "Get current weather in a given city",
            "parameters": {
                "type": "object",
                "properties": {"city": {"type": "string"}},
                "required": ["city"]
            }
        }
    }
]

response = client.chat.completions.create(
    model="gpt-oss:20b",
    messages=[{"role": "user", "content": "What's the weather in Berlin right now?"}],
    tools=tools
)

# When the model decides to use the tool, the reply carries a tool call instead of text
print(response.choices[0].message)
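The example stops at the model's tool call. A minimal sketch of completing the loop, with a hypothetical get_weather stub (not part of the original example) and assuming the local endpoint supports tool-role messages:

import json

def get_weather(city: str) -> str:
    # Hypothetical stub; a real implementation would call a weather API
    return f"Sunny, 22°C in {city}"

msg = response.choices[0].message
if msg.tool_calls:
    call = msg.tool_calls[0]
    args = json.loads(call.function.arguments)
    result = get_weather(**args)
    # Feed the tool result back so the model can compose a final answer
    followup = client.chat.completions.create(
        model="gpt-oss:20b",
        messages=[
            {"role": "user", "content": "What's the weather in Berlin right now?"},
            msg,
            {"role": "tool", "tool_call_id": call.id, "content": result}
        ],
        tools=tools
    )
    print(followup.choices[0].message.content)

6. Why OpenAI Open-Sources the Models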
Releasing gpt-oss-120b and gpt-oss-20b marks a major step for open-source large models, delivering high performance and safer inference while accelerating research and innovation. The models lower barriers for emerging markets, resource-constrained industries, and small organizations, enabling broader, more transparent AI development.
Reference documents:
https://openai.com/open-models/
https://openai.com/index/introducing-gpt-oss/