Introduction
OpenAI has released gpt‑oss‑120b and gpt‑oss‑20b, a new series of open‑weight reasoning models. Released under the Apache 2.0 license, these text‑only models are designed for robust instruction following, tool use, and strong reasoning capabilities, making them well‑suited for integration into advanced agentic workflows. This release reflects OpenAI’s ongoing commitment to enabling innovation and encouraging collaborative safety within the AI community.
A key question is how these models compare to other leading options in the fast‑moving open‑ and semi‑open‑weight ecosystem. In this blog, we look at GPT‑OSS in detail and compare its capabilities with models like GLM‑4.5, Qwen3‑Thinking, DeepSeek‑R1, and Kimi K2.
GPT‑OSS: Architecture and Core Strengths
The gpt‑oss models build on the foundations of GPT‑2 and GPT‑3, incorporating a Mixture‑of‑Experts (MoE) design to improve efficiency during both training and inference. This approach activates only a subset of parameters per token, giving the models the scale of very large systems while controlling compute cost.
There are two models in the family:
- gpt‑oss‑120b: 116.8 billion total parameters, with about 5.1 billion active per token across 36 layers.
- gpt‑oss‑20b: 20.9 billion total parameters, with 3.6 billion active per token across 24 layers.
Both models share several architectural choices:
- Residual stream dimension of 2880.
- Grouped Query Attention with 64 query heads and 8 key‑value heads.
- Rotary position embeddings for improved contextual reasoning.
- Extended context length of 131,072 tokens using YaRN.
To make deployment practical, OpenAI applied MXFP4 quantization to the MoE weights. This allows the 120 billion‑parameter model to run on a single 80 GB GPU and the 20 billion‑parameter variant to operate on hardware with as little as 16 GB of memory.
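To get a feel for what that footprint means in practice, the sketch below loads the smaller checkpoint through Hugging Face Transformers. It is a minimal example, assuming the `openai/gpt-oss-20b` repository ID and a recent Transformers release that understands the shipped quantized weights; adjust the model ID and generation settings for your own setup.

```python
# Minimal sketch: running gpt-oss-20b via Hugging Face Transformers.
# Assumes the openai/gpt-oss-20b checkpoint and a GPU with roughly 16 GB of memory.
from transformers import pipeline

generator = pipeline(
    "text-generation",
    model="openai/gpt-oss-20b",
    torch_dtype="auto",   # keep the checkpoint's shipped precision where supported
    device_map="auto",    # place layers on the available GPU(s) automatically
)

messages = [
    {"role": "user", "content": "Explain Mixture-of-Experts routing in two sentences."},
]
output = generator(messages, max_new_tokens=200)
print(output[0]["generated_text"][-1])  # last message of the generated conversation
```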
Another notable feature is variable reasoning effort. Developers can specify “low,” “medium,” or “high” reasoning levels via the system prompt, which dynamically adjusts the length of the Chain‑of‑Thought (CoT). This provides flexibility in balancing accuracy, latency, and compute cost.
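How you set the level depends on your serving stack; a common pattern is to state it in the system prompt. Below is a minimal sketch assuming a local OpenAI‑compatible endpoint (for example vLLM or Ollama); the base URL, API key, and model name are placeholders.

```python
# Sketch: selecting reasoning effort through the system prompt.
# The endpoint URL, API key, and model name below are illustrative placeholders.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-needed")

response = client.chat.completions.create(
    model="gpt-oss-120b",
    messages=[
        # "Reasoning: high" requests a longer, more careful chain of thought;
        # swap in "medium" or "low" to trade accuracy for latency and cost.
        {"role": "system", "content": "You are a helpful assistant. Reasoning: high"},
        {"role": "user", "content": "A train covers 120 km in 90 minutes. What is its average speed in km/h?"},
    ],
)
print(response.choices[0].message.content)
```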
The models are also trained with built‑in support for agentic workflows, including:
- A browsing tool for real‑time web search and retrieval.
- A Python tool for stateful code execution in a Jupyter‑like environment.
- Support for custom developer functions, enabling complex workflows with interleaved reasoning, tool use, and user interaction (a minimal function‑calling sketch follows this list).
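Here is a hedged sketch of that third capability, a custom developer function wired through an OpenAI‑compatible chat completions endpoint. The endpoint URL, model name, and `get_weather` tool are illustrative placeholders, not part of the gpt‑oss release itself.

```python
# Sketch: exposing a custom developer function to the model and returning its result.
# Assumes an OpenAI-compatible endpoint with tool-calling support; all names are placeholders.
import json
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-needed")

tools = [{
    "type": "function",
    "function": {
        "name": "get_weather",
        "description": "Return the current weather for a city.",
        "parameters": {
            "type": "object",
            "properties": {"city": {"type": "string"}},
            "required": ["city"],
        },
    },
}]

messages = [{"role": "user", "content": "What's the weather in Berlin right now?"}]
response = client.chat.completions.create(model="gpt-oss-120b", messages=messages, tools=tools)

# For this sketch we assume the model chose to call the tool.
call = response.choices[0].message.tool_calls[0]
args = json.loads(call.function.arguments)
result = {"city": args["city"], "temp_c": 21}  # stand-in for a real weather lookup

# Feed the tool result back so the model can produce the final answer.
messages.append(response.choices[0].message)
messages.append({"role": "tool", "tool_call_id": call.id, "content": json.dumps(result)})
final = client.chat.completions.create(model="gpt-oss-120b", messages=messages, tools=tools)
print(final.choices[0].message.content)
```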
GPT‑OSS in Context: Comparing Performance Across Models
The open‑model ecosystem is full of capable contenders — GLM‑4.5, Qwen3 Thinking, DeepSeek R1, and Kimi K2 — each with different strengths and trade‑offs. Comparing them with GPT‑OSS gives a clearer view of how these models perform across reasoning, coding, and agentic workflows.
Reasoning and Knowledge
On broad knowledge and reasoning tasks, GPT‑OSS delivers some of the highest scores relative to its size.
- On MMLU‑Pro, GPT‑OSS‑120b reaches 90.0%, ahead of GLM‑4.5 (84.6%), Qwen3 Thinking (84.4%), DeepSeek R1 (85.0%), and Kimi K2 (81.1%).
- GPT‑OSS shines on competition‑style math. On AIME 2024 it hits 96.6% with tools, and on AIME 2025 it pushes to 97.9%, outperforming all the others.
- On GPQA Diamond, a PhD‑level science benchmark, GPT‑OSS‑120b achieves 80.9% with tools, comparable to GLM‑4.5 (79.1%) and Qwen3 Thinking (81.1%), and just behind DeepSeek R1 (81.0%).
What makes these numbers notable is the balance between model size and performance. GPT‑OSS‑120b has 116.8B total parameters, with only 5.1B active per token thanks to its Mixture‑of‑Experts design. GLM‑4.5 and Qwen3 Thinking are significantly larger models overall, which partially explains their strong tool‑use and coding results. DeepSeek R1 also leans toward higher parameter counts and deeper token usage for reasoning tasks (up to 20k tokens per query), while Kimi K2 is tuned as a more specialized instruct model.
This means GPT‑OSS manages frontier‑level reasoning scores while using fewer active parameters, making it more efficient for developers who want deep reasoning without the cost of running very large dense models.
Coding and Software Engineering
Modern AI coding benchmarks focus on a model’s ability to understand large codebases, make changes, and execute multi‑step reasoning.
- On SWE‑bench Verified, GPT‑OSS‑120b scores 62.4%, close to GLM‑4.5 (64.2%) and DeepSeek R1 (≈65.8% in agentic mode).
- On Terminal‑Bench, GLM‑4.5 leads with 37.5%, followed by Kimi K2 at around 30%.
- GLM‑4.5 also shows strong results in head‑to‑head agentic coding tasks, with over 50% win rates against Kimi K2 and over 80% against Qwen3, while maintaining a high success rate for tool‑based coding workflows.
Here again, model size matters. GLM‑4.5 is a much larger model than GPT‑OSS‑120b, which gives it an edge in agentic coding workflows. But for developers who want solid code‑editing capabilities in a model that can run on a single 80 GB GPU, GPT‑OSS offers an appealing balance.
Agentic Tool Use and Function Calling
Agentic capabilities — where a model autonomously calls tools, executes functions, and solves multi‑step tasks — are increasingly important.
- On TAU‑bench Retail, GPT‑OSS‑120b scores 67.8%, compared to GLM‑4.5’s 79.7% and Kimi K2’s 70.6%.
- On BFCL‑v3 (a function‑calling benchmark), GLM‑4.5 leads with 77.8%, followed by Qwen3 Thinking at 71.9% and GPT‑OSS around 67–68%.
These results highlight a trade‑off: GLM‑4.5 dominates in function‑calling and agentic workflows, but it does so as a significantly larger, resource‑intensive model. GPT‑OSS delivers competitive results while staying accessible to developers who can’t afford multi‑GPU clusters.
Putting It All Together
Here’s a quick snapshot of how these models stack up:
| Benchmark | GPT‑OSS‑120b (High) | GLM‑4.5 | Qwen3 Thinking | DeepSeek R1 | Kimi K2 |
|---|---|---|---|---|---|
| MMLU‑Pro | 90.0% | 84.6% | 84.4% | 85.0% | 81.1% |
| AIME 2024 | 96.6% (with tools) | ~91% | ~91.4% | ~87.5% | ~69.6% |
| AIME 2025 | 97.9% (with tools) | ~92% | ~92.3% | ~87.5% | ~49.5% |
| GPQA Diamond (Science) | ~80.9% (with tools) | 79.1% | 81.1% | 81.0% | 75.1% |
| SWE‑bench Verified | 62.4% | 64.2% | — | ~65.8% | 65.8% (agentic) |
| TAU‑bench Retail | 67.8% | 79.7% | ~67.8% | ~63.9% | ~70.6% |
| BFCL‑v3 Function Calling | ~67–68% | 77.8% | 71.9% | 37.0% | — |
Key takeaways:
- GPT‑OSS punches above its weight in reasoning and long‑form CoT tasks while using fewer active parameters.
- GLM‑4.5 is a heavyweight model that excels at agentic workflows and function‑calling but requires far more compute.
- DeepSeek R1 and Qwen3 offer strong hybrid reasoning performance at larger sizes, while Kimi K2 is an instruct‑style model tuned for agentic coding workflows.
Conclusion
GPT‑OSS brings frontier‑level reasoning and long‑form CoT capabilities with a smaller active‑parameter footprint than many dense models. GLM‑4.5 leads in agentic workflows and function‑calling but requires significantly more compute. DeepSeek R1 and Qwen3 deliver strong hybrid reasoning at larger scales, while Kimi K2 is tuned for agentic coding workflows.
This makes GPT‑OSS a compelling balance of reasoning performance, coding ability, and deployment efficiency, well‑suited for experimentation, integration into agentic systems, or resource‑aware production workloads.
If you want to try the GPT‑OSS‑20B model, its smaller size makes it practical to run locally on your own hardware using Ollama and expose it via a public API with Clarifai’s Local Runners, giving you full control over your compute and keeping your data local. Check out the tutorial here.
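As a rough illustration of the local path, the snippet below uses the Ollama Python client. It assumes Ollama is installed and the model has already been pulled; the `gpt-oss:20b` tag is an assumption and may differ in your setup.

```python
# Minimal local sketch with the Ollama Python client.
# Assumes `ollama pull gpt-oss:20b` has already been run; the tag is an assumption.
import ollama

response = ollama.chat(
    model="gpt-oss:20b",
    messages=[{"role": "user", "content": "Summarize Mixture-of-Experts models in two sentences."}],
)
print(response["message"]["content"])
```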
If you want to try out the full‑scale GPT‑OSS‑120B model, you can try it directly on the playground here.