This blog post focuses on new features and improvements. For a comprehensive list, including bug fixes, please see the release notes.
GPT-OSS-120B: Benchmarking Speed, Scale, and Cost Efficiency
Artificial Analysis has benchmarked Clarifai’s Compute Orchestration with the GPT-OSS-120B model—one of the most advanced open-source large language models available today. The results underscore Clarifai’s position as one of the top hardware- and GPU-agnostic inference engines for AI workloads where speed, flexibility, efficiency, and reliability matter most.
What the benchmark shows (P50, last 72h; single query, 1k-token prompt):
- High throughput: 313 output tokens per second—among the very fastest measured in this configuration.
- Low latency: 0.27s time-to-first-token (TTFT), so responses begin streaming almost instantly.
- Compelling price/performance: placed in the benchmark’s “most attractive quadrant” (high speed + low price).
Pricing that scales:
Clarifai offers GPT-OSS-120B at $0.09 per 1M input tokens and $0.36 per 1M output tokens. Artificial Analysis displays a blended price (3:1 input:output) of just $0.16 per 1M tokens, i.e. (3 × $0.09 + 1 × $0.36) / 4 ≈ $0.16, placing Clarifai significantly below the $0.26–$0.28 cluster of competitors while matching or exceeding their performance.
Below is a comparison of output speed versus price across major providers for GPT-OSS-120B. Clarifai stands out in the “most attractive quadrant,” combining high throughput with competitive pricing.
Output Speed vs. Price
This chart compares latency (time to first token) against output speed. Clarifai demonstrates one of the lowest latencies while maintaining top-tier throughput—placing it among the best-in-class providers.
Latency vs. Output Speed
Why GPT-OSS-120B Matters
As one of the leading open-source “GPT-OSS” models, GPT-OSS-120B reflects the growing demand for transparent, community-driven alternatives to closed-source LLMs. Running a model of this scale requires infrastructure that can not only deliver high speed and low latency, but also keep costs under control at production scale. That’s exactly where Clarifai’s Compute Orchestration makes a difference.
Why This Benchmark Matters
These results are more than numbers—they show how Clarifai has engineered every layer of the stack to optimize GPU utilization. With Compute Orchestration, multiple models can run on the same GPUs, workloads scale elastically, and enterprises can squeeze more value out of every accelerator. The payoff is fast, reliable, and cost-efficient inference that can support both experimentation and large-scale deployment.
Check the full benchmarks on Artificial Analysis here.
Here’s a quick demo of how to access the GPT-OSS-120B model in the Playground.
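If you’d rather hit the model from code than from the Playground UI, here is a minimal sketch using Clarifai’s OpenAI-compatible API. The base URL and model identifier below are assumptions; copy the exact values from the model’s page or the snippet the Playground generates.

```python
# Minimal sketch: calling GPT-OSS-120B on Clarifai via the OpenAI-compatible API.
# The base_url and model identifier are assumptions -- grab the exact values
# from the Playground's generated snippet or the model's page.
from openai import OpenAI

client = OpenAI(
    base_url="https://api.clarifai.com/v2/ext/openai/v1",  # assumed OpenAI-compatible endpoint
    api_key="YOUR_CLARIFAI_PAT",                           # your Clarifai personal access token
)

response = client.chat.completions.create(
    model="https://clarifai.com/openai/chat-completion/models/gpt-oss-120b",  # placeholder model URL
    messages=[{"role": "user", "content": "Give me one sentence on GPU-agnostic inference."}],
)
print(response.choices[0].message.content)
```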
Local Runners
Local Runners let you develop and run models on your own hardware—laptops, workstations, edge boxes—while making them callable through Clarifai’s cloud API. Clarifai handles the public URL, routing, and authentication; your model executes locally and your data stays on your machine. It behaves like any other Clarifai‑hosted model.
Why teams use Local Runners
- Build where your data and tools live. Keep models close to local files, internal databases, and OS‑level utilities.
- No custom networking. Start a runner and get a public URL—no port‑forwarding or reverse proxies.
- Use your own compute. Bring your GPUs and custom setups; the platform still provides the API, workflows, and governance around them.
New: Ollama Toolkit (now in the CLI)
We’ve added an Ollama Toolkit to the Clarifai CLI so you can initialize an Ollama‑backed model directory in one command (and choose any model from the Ollama library). It pairs perfectly with Local Runners—download, run, and expose an Ollama model via a public API with a minimal setup.
The CLI supports `--toolkit ollama` plus flags like `--model-name`, `--port`, and `--context-length`, making it trivial to target specific Ollama models.
Example workflow: run Gemma 3 270M or GPT‑OSS 20B locally and serve it through a public API
- Pick a model in Ollama. Two good options:
  - Gemma 3 270M (tiny, fast; 32K context): `gemma3:270m`
  - GPT‑OSS 20B (OpenAI open‑weight, optimized for local use): `gpt-oss:20b`
- Initialize the project with the Ollama Toolkit. Run the init command shown above, swapping `--model-name` for your pick (e.g., `gpt-oss:20b`). This creates a new model directory structure that is compatible with the Clarifai platform; you can customize or optimize the generated model by modifying the `1/model.py` file as needed.
- Start your Local Runner. From the model directory, start the runner (see the first sketch after this list). The runner registers with Clarifai and exposes your local model via a public URL, and the CLI prints a ready‑to‑run client snippet.
- Call it like any Clarifai model. For example, with the Python SDK (see the second sketch after this list). Behind the scenes, the API call is routed to your machine; results return to the caller over Clarifai’s secure control plane.
Deep dive: We published a step‑by‑step guide that walks through running Ollama models locally and exposing them with Local Runners. Check it out here.
Try it on the Developer Plan
You can start for free, or use the Developer Plan—$1/month for the first year—which includes up to 5 Local Runners and unlimited runner hours.
Check out the full example and setup guide in the documentation here.
Billing
We’ve made billing more transparent and flexible with this release. Monthly spending limits have been introduced: $100 for Developer and Essential plans, and $500 for the Professional plan. If you need higher limits, you can reach out to our team.
We’ve also added a new credit card pre-authorization process. A temporary charge is applied to verify card validity and available funds — $50 for Developer, $100 for Essential, and $500 for Professional plans. The amount is automatically refunded within seven days, ensuring a seamless verification experience.
Control Center
- The Control Center gets even more flexible and informative with this update. You can now resize charts to half their original size on the configure page, making side-by-side comparisons smoother and layouts more manageable.
- Charts are smarter too: the Stored Inputs Cost chart now correctly shows the average cost for the selected period, while longer date ranges automatically display weekly aggregated data for easier readability. Empty charts display meaningful messages instead of zeros, so you always know when data isn’t available.
- We’ve also added cross-links between compute cost and usage charts, making it simple to navigate between these views and get a complete picture of your AI infrastructure.
Additional Changes
- Python SDK: Fixed Local Runner CLI command, updated protocol and gRPC versions, integrated secrets, corrected num_threads defaults, added stream_options validation, prevented downloading original checkpoints, improved model upload and deployment, and added user confirmation to prevent Dockerfile overwrite during uploads. Check all SDK updates here.
- Platform Updates: Added a public resource filter to quickly view Community-shared resources, improved Playground error messaging for streaming limits, and extended login session duration for Google and GitHub SSO users to seven days. Find all platform changes here.
Ready to start building?
With Local Runners, you can now serve models, MCP servers, or agents directly from your own hardware without uploading model weights or managing infrastructure. It’s the fastest way to test, iterate, and securely run models from your laptop, workstation, or on-prem server. You can read the documentation to get started, or check out the blog to see how to run Ollama models locally and expose them via a public API.