Deep Dive into OpenAI’s GPT‑OSS and GPT‑5: Features, Performance, and Controversies

The article provides a detailed analysis of OpenAI’s newly released open‑source GPT‑OSS models (20B and 120B) and the closed‑source GPT‑5 family, covering their architectures, training pipelines, benchmark results, practical usage observations, pricing, and the mixed user feedback that surrounds GPT‑5.

Fun with Large Models

Introduction

OpenAI announced a simultaneous push in both open‑source and closed‑source large language models, releasing the GPT‑OSS series and the highly anticipated GPT‑5 family. This article examines their background, core characteristics, benchmark performance, practical usage, pricing, and the controversies surrounding GPT‑5.

1. Open‑source model – GPT‑OSS

1.1 Background

On 5 August 2025 OpenAI released GPT‑OSS‑20B and GPT‑OSS‑120B, the first open‑source dialogue models since GPT‑2 (2019). Both models employ a Mixture‑of‑Experts (MoE) architecture with grouped multi‑query attention and native FP4 mixed‑precision, enabling inference on as little as 16 GB VRAM.
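The 16 GB figure follows from simple arithmetic on the weight storage: at 4 bits per parameter, a 20B-parameter model needs roughly 10 GB for weights alone, leaving headroom for activations and KV cache. A minimal sketch of that back-of-envelope calculation (ignoring per-layer quantization overhead and runtime buffers, which add a few GB in practice):

```python
def weight_memory_gb(params_billion: float, bits_per_weight: float) -> float:
    """Approximate memory needed just for model weights, in decimal GB."""
    bytes_total = params_billion * 1e9 * bits_per_weight / 8
    return bytes_total / 1e9

# 20B parameters at 4-bit (FP4) vs. 16-bit (BF16)
fp4 = weight_memory_gb(20, 4)    # ~10 GB -> fits on a 16 GB card
bf16 = weight_memory_gb(20, 16)  # ~40 GB -> needs multi-GPU or offloading
print(f"FP4: {fp4:.0f} GB, BF16: {bf16:.0f} GB")
```

The same arithmetic explains why fine-tuning needs more (24 GB): optimizer state and gradients add memory on top of the quantized weights.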

Local tests showed >40 tokens/s on an RTX 3090 and >200 tokens/s on an RTX 5090. The models support major GPUs, various inference frameworks, and CPU‑GPU hybrid deployment.

1.2 Core features

Architecture: MoE expert model with grouped multi‑query attention; FP4 mixed‑precision reduces hardware cost (inference 16 GB, fine‑tuning 24 GB).

Performance: GPT‑OSS‑20B matches O3‑mini level; GPT‑OSS‑120B reaches O4‑mini level. Benchmarks such as Humanity's Last Exam (HLE), highlighted at the Grok‑4 release, place GPT‑OSS‑20B between O3‑mini and O4‑mini, while GPT‑OSS‑120B is comparable to O4‑mini. Compared with Qwen3‑235B‑A22B‑Thinking‑2507, GPT‑OSS‑120B surpasses it with less than half the parameters and hardware cost.
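The "less than half the parameters" claim checks out against the publicly reported figures (approximate numbers from each model's card; the active-parameter counts are what dominate per-token compute in MoE models):

```python
# Publicly reported parameter counts, in billions (approximate, from model cards):
models = {
    "GPT-OSS-120B":    {"total_b": 117, "active_b": 5.1},
    "Qwen3-235B-A22B": {"total_b": 235, "active_b": 22},
}

# Total-parameter ratio: GPT-OSS-120B vs. Qwen3-235B-A22B
ratio = models["GPT-OSS-120B"]["total_b"] / models["Qwen3-235B-A22B"]["total_b"]
print(f"GPT-OSS-120B holds {ratio:.0%} of Qwen3's total parameters")
```

Because only a small expert subset activates per token, the gap in per-token compute is even larger than the total-parameter ratio suggests.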

Training process: The same pipeline as OpenAI’s O4 series – pre‑training, full‑parameter instruction fine‑tuning, then reinforcement learning from human feedback (RLHF). Compared with DeepSeek‑R1, GPT‑OSS adopts stricter unsupervised chain‑of‑thought alignment, improving reasoning efficiency.

Usability: Supports tool calling, structured output, and controllable inference intensity via system prompts; higher intensity yields stronger reasoning at the cost of latency.
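The inference-intensity control works by declaring a reasoning level in the system prompt. A minimal sketch of assembling such a request, assuming the `Reasoning: low|medium|high` convention from OpenAI's published GPT-OSS guidance (exact wording may vary by serving framework, so verify against your deployment's docs):

```python
def build_messages(user_prompt: str, effort: str = "medium") -> list[dict]:
    """Build a chat request that sets GPT-OSS reasoning intensity via the
    system prompt. Higher effort -> longer chain of thought, higher latency."""
    if effort not in {"low", "medium", "high"}:
        raise ValueError(f"unsupported effort level: {effort!r}")
    return [
        {"role": "system", "content": f"Reasoning: {effort}"},
        {"role": "user", "content": user_prompt},
    ]

msgs = build_messages("Prove that the square root of 2 is irrational.", effort="high")
```

The resulting message list can be passed to any OpenAI-compatible chat endpoint serving GPT-OSS.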

1.3 Significance

GPT‑OSS marks OpenAI’s most substantial open‑source release to date, raising the performance ceiling for community models and potentially accelerating the open‑source LLM ecosystem.

2. Closed‑source model – GPT‑5

2.1 Background

Following the 2023 release of GPT‑4, OpenAI introduced three GPT‑5 variants: the full‑size GPT‑5, GPT‑5 Mini, and GPT‑5 Nano, each targeting different application scenarios.

2.2 Core features

Training: Synthetic data pre‑training combined with RLHF, emphasizing reasoning, programming, instruction following, and tool use.

Controllable inference: The reasoning_effort parameter adjusts the depth of the model’s thinking chain; setting it to minimal disables deliberation for the fastest responses.
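In the API this surfaces as a request parameter rather than a prompt convention. A minimal sketch of assembling such a request in the style of OpenAI's Responses API (parameter names follow OpenAI's documentation at the time of writing, but check them against your SDK version; the call itself is only indicated in a comment):

```python
def gpt5_request(prompt: str, effort: str = "minimal") -> dict:
    """Assemble request parameters that control GPT-5's reasoning depth.
    'minimal' skips deliberation for the fastest response."""
    if effort not in {"minimal", "low", "medium", "high"}:
        raise ValueError(f"unsupported effort level: {effort!r}")
    return {
        "model": "gpt-5",
        "input": prompt,
        "reasoning": {"effort": effort},
    }

params = gpt5_request("Summarize this changelog in one sentence.")
# With the official SDK this would be sent as client.responses.create(**params).
```

Raising the effort level trades latency for deeper multi-step reasoning, mirroring the intensity control in GPT-OSS.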

Performance: Benchmarks show GPT‑5 surpasses O3 by >10 % across reasoning, coding, instruction following, and tool‑calling tasks. Hallucinations are reduced to roughly 10 % of O3 levels, and the model handles sensitive queries with nuanced guidance rather than blunt refusal.

Programming capability: Demonstrations include generating thousands of lines of code in a single response, creating complete front‑end pages, interactive games, dashboards, and workflow plugins. In Cursor’s Agent mode, GPT‑5 reads tens of thousands of lines, debugs, and creates multi‑file projects using built‑in tools such as summary and files.

Cost: Usage fees are about one‑third of Claude‑4’s price.

2.3 Controversies

While many praise GPT‑5’s coding strength, users report mixed results: a generated Flappy Bird game was unplayable, and code refactoring sometimes produced clean but non‑functional code. Ethan Mollick (Wharton School, University of Pennsylvania) notes that GPT‑5 is an ensemble model with uneven component performance, and OpenAI’s lack of transparency about model selection may cause user confusion.

In the author’s ten‑day evaluation, GPT‑5 performed well on well‑specified, logically clear prompts but showed instability on both simple and complex questions, occasionally excelling on harder tasks.

3. Conclusion

OpenAI’s simultaneous push in open‑source (GPT‑OSS) and closed‑source (GPT‑5) solidifies its leadership in the large‑model arena. GPT‑5 delivers notable capability jumps, yet overall progress appears to be slowing, akin to incremental upgrades of an existing vehicle rather than a revolutionary new transport.

Tags: OpenAI, GPT-5, GPT-OSS, model benchmarking

Written by Fun with Large Models
Master's graduate from Beijing Institute of Technology, published four top‑journal papers, previously worked as a developer at ByteDance and Alibaba. Currently researching large models at a major state‑owned enterprise. Committed to sharing concise, practical AI large‑model development experience, believing that AI large models will become as essential as PCs in the future. Let's start experimenting now!
