Introduction
Running Large Language Models (LLMs) and other open-source models locally offers significant advantages for developers, and this is where Ollama shines. It simplifies downloading, setting up, and running these powerful models on your own machine, giving you greater control, enhanced privacy, and lower costs compared to cloud-based solutions.
While running models locally offers immense benefits, integrating them with cloud-based projects or sharing them for broader access can be a challenge. This is precisely where Clarifai Local Runners come in. Local Runners enable you to expose your locally running Ollama models via a public API endpoint, allowing seamless integration with any project, anywhere, effectively bridging the gap between your local environment and the cloud.
In this post, we’ll walk through how to run open-source models using Ollama and expose them with a public API using Clarifai Local Runners. This makes your local models accessible globally while still running entirely on your machine.
Local Runners Explained
Local Runners let you run models on your own machine, whether it’s your laptop, workstation, or on-prem server, while exposing them through a secure, public API endpoint. You don’t need to upload the model to the cloud. The model stays local but behaves like it’s hosted on Clarifai.
Once initialized, the Local Runner opens a secure tunnel to Clarifai’s control plane. Any requests to your model’s Clarifai API endpoint are routed to your machine, processed locally, and returned to the caller. From the outside, it functions like any other hosted model. Internally, everything runs on your hardware.
Local Runners are especially useful for:
- Fast local development: Build, test, and iterate on models in your own environment without deployment delays. Inspect traffic, test outputs, and debug in real time.
- Using your own hardware: Take advantage of local GPUs or custom hardware setups. Let your machine handle inference while Clarifai manages routing and API access.
- Private and offline data: Run models that rely on local files, internal databases, or private APIs. Keep everything on-prem while still exposing a usable endpoint.
Local Runners give you the flexibility of local execution along with the reach of a managed API, all without giving up control over your data or environment.
Expose Local Ollama Models via Public API
This section will walk you through the steps to get your Ollama model running locally and accessible via a Clarifai public endpoint.
Prerequisites
Before we begin, ensure you have:
- Ollama installed and running on your machine
- Python and pip installed (for the Clarifai SDK and CLI)
- A Clarifai account, along with your User ID and a Personal Access Token (PAT)
Step 1: Install Clarifai and Login
First, install the Clarifai Python SDK:
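The SDK also installs the clarifai command-line tool used throughout the rest of this guide. A typical install looks like this:

```bash
pip install --upgrade clarifai
```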
Next, log in to Clarifai to configure your context. This links your local environment to your Clarifai account, allowing you to manage and expose your models.
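Assuming the CLI that ships with the SDK is on your PATH, the login flow looks like this:

```bash
clarifai login
```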
Follow the prompts to enter your User ID and Personal Access Token (PAT). If you need help obtaining these, refer to the documentation here.
Step 2: Set Up Your Local Ollama Model for Clarifai
Next, you’ll prepare your local Ollama model so it can be accessed by Clarifai’s Local Runners. This step sets up the necessary files and configuration to expose your model through a public API endpoint using Clarifai’s platform.
Use the following command to initialize the setup:
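Run this inside the directory where you want the model files created:

```bash
clarifai model init --toolkit ollama
```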
This generates three key files within your project directory:
- model.py
- config.yaml
- requirements.txt
These define how Clarifai will communicate with your locally running Ollama model.
You can also customize the command with the following options:
- --model-name: Name of the Ollama model you want to serve. This pulls from the Ollama model library (defaults to llama3:8b).
- --port: The port where your Ollama model is running (defaults to 23333).
- --context-length: Sets the model’s context length (defaults to 8192).
For example, to use the gemma:2b model with a 16K context length (16384 tokens) on port 8008, run:
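```bash
clarifai model init --toolkit ollama --model-name gemma:2b --port 8008 --context-length 16384
```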
After this step, your local model is ready to be exposed using Clarifai Local Runners.
Step 3: Start the Clarifai Local Runner
Once your local Ollama model is configured, the next step is to run Clarifai’s Local Runner. This exposes your local model to the internet through a secure Clarifai endpoint.
Navigate into the model directory and run:
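A minimal sketch, assuming the generated directory is named ollama-model-upload (as shown later in this post) and that the runner is started with the clarifai model local-runner subcommand:

```bash
cd ollama-model-upload
clarifai model local-runner
```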
Once the runner starts, you will receive a public Clarifai URL. This URL is your gateway to accessing your locally running Ollama model from anywhere. Requests made to this Clarifai endpoint will be securely routed to your local machine, allowing your Ollama model to process them.
Running Inference on Your Exposed Model
With your Ollama model running locally and exposed via Clarifai Local Runner, you can now send inference requests to it from anywhere using the Clarifai SDK or an OpenAI-compatible endpoint.
Inference Using the OpenAI-Compatible Method
Set your Clarifai PAT as an environment variable:
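For example, in a bash-compatible shell (CLARIFAI_PAT is the variable the examples below read):

```bash
export CLARIFAI_PAT="your_personal_access_token"
```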
Then, you can use the OpenAI client to send requests:
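Below is a minimal sketch using the openai Python package. The base_url is assumed to be Clarifai's OpenAI-compatible endpoint, and the model value is a placeholder Clarifai model URL; substitute the endpoint and URL shown for your model in the Clarifai UI or in the local runner output.

```python
import os
from openai import OpenAI

# Point the standard OpenAI client at Clarifai's OpenAI-compatible endpoint
# (assumed here; use the endpoint shown in your Clarifai account if it differs).
client = OpenAI(
    base_url="https://api.clarifai.com/v2/ext/openai/v1",
    api_key=os.environ["CLARIFAI_PAT"],
)

response = client.chat.completions.create(
    # Placeholder: replace with your model's Clarifai URL.
    model="https://clarifai.com/<user_id>/<app_id>/models/<model_id>",
    messages=[
        {"role": "user", "content": "Explain the benefits of running LLMs locally in two sentences."},
    ],
)
print(response.choices[0].message.content)
```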
For multimodal inference, you can include image data:
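A sketch for image input, assuming the Ollama model you initialized supports vision. It uses the standard OpenAI chat format with an image passed as a base64 data URL; the endpoint and model URL are the same placeholders as above.

```python
import base64
import os
from openai import OpenAI

client = OpenAI(
    base_url="https://api.clarifai.com/v2/ext/openai/v1",  # assumed endpoint, as above
    api_key=os.environ["CLARIFAI_PAT"],
)

# Read a local image and encode it as a base64 data URL (an https:// image URL also works).
with open("photo.jpg", "rb") as f:
    image_b64 = base64.b64encode(f.read()).decode("utf-8")

response = client.chat.completions.create(
    model="https://clarifai.com/<user_id>/<app_id>/models/<model_id>",  # placeholder model URL
    messages=[
        {
            "role": "user",
            "content": [
                {"type": "text", "text": "Describe what is in this image."},
                {"type": "image_url", "image_url": {"url": f"data:image/jpeg;base64,{image_b64}"}},
            ],
        }
    ],
)
print(response.choices[0].message.content)
```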
Inference with Clarifai SDK
You can also use the Clarifai Python SDK for inference. The model URL can be obtained from your Clarifai account.
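A minimal sketch with the Clarifai Python SDK, assuming the model class generated by the Ollama toolkit exposes a predict method that accepts a prompt argument; check the generated 1/model.py for the exact method names and signatures.

```python
import os
from clarifai.client import Model

# Placeholder model URL -- copy the real one from your Clarifai account.
model = Model(
    url="https://clarifai.com/<user_id>/<app_id>/models/<model_id>",
    pat=os.environ["CLARIFAI_PAT"],
)

# Assumes the generated model exposes `predict(prompt=...)`; adjust to match
# the methods defined in your 1/model.py.
response = model.predict(prompt="Summarize the advantages of local inference.")
print(response)
```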
Customizing Ollama Model Configuration
The clarifai model init --toolkit ollama command generates the following model file structure:
ollama-model-upload/
├── 1/
│   └── model.py
├── config.yaml
└── requirements.txt
You can customize the generated files to control how your model works:
- 1/model.py – Customize this to tailor your model’s behavior, implement custom logic, or optimize performance.
- config.yaml – Define settings such as compute requirements, which is especially useful when deploying to dedicated compute using Compute Orchestration (see the sketch after this list).
- requirements.txt – List any Python packages your model requires.
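For illustration, a config.yaml of this kind might look roughly like the sketch below. The field names follow Clarifai's general model-upload configuration format as assumed here; treat the file generated by clarifai model init as the source of truth and adjust values rather than structure.

```yaml
# Illustrative sketch only -- the generated config.yaml is authoritative.
model:
  id: "my-ollama-model"          # placeholder model ID
  user_id: "your_user_id"
  app_id: "your_app_id"
  model_type_id: "text-to-text"

build_info:
  python_version: "3.12"

inference_compute_info:          # used when deploying via Compute Orchestration
  cpu_limit: "2"
  cpu_memory: "8Gi"
  num_accelerators: 0
```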
This setup gives you full control over how your Ollama model is exposed and used via API. Refer to the documentation here.
Conclusion
Running open-source models locally with Ollama gives you full control over privacy, latency, and customization. With Clarifai Local Runners, you can expose these models via a public API without relying on centralized infrastructure. This setup makes it easy to plug local models into larger workflows or agentic systems, while keeping compute and data fully in your control. If you want to scale beyond your machine, check out Compute Orchestration to deploy models on dedicated GPU nodes.