
# Introduction
Stress testing is crucial for understanding how your application behaves under heavy load. For machine learning-powered APIs, it is especially important because model inference can be CPU-intensive. By simulating a large number of users, we can identify performance bottlenecks, determine the capacity of our system, and ensure reliability.
In this tutorial, we will be using:
- FastAPI: A modern, fast (high-performance) web framework for building APIs with Python.
- Uvicorn: An ASGI server to run our FastAPI application.
- Locust: An open-source load testing tool. You define user behavior with Python code, and swarm your system with hundreds of simultaneous users.
- Scikit-learn: For our example machine learning model.
# 1. Project Setup and Dependencies
Set up the project structure and install the necessary dependencies.
Create a `requirements.txt` file and add the following Python packages:

```text
fastapi==0.115.12
locust==2.37.10
numpy==2.3.0
pandas==2.3.0
pydantic==2.11.5
scikit-learn==1.7.0
uvicorn==0.34.3
orjson==3.10.18
```

Open your terminal, create a virtual environment, and activate it:

```bash
python -m venv venv
venv\Scripts\activate        # Windows
# source venv/bin/activate   # macOS/Linux
```

Install all the Python packages using the `requirements.txt` file:

```bash
pip install -r requirements.txt
```
# 2. Building the FastAPI Application
In this section, we will create three files: one that trains and serves the regression model, one for the Pydantic models, and one for the FastAPI application.

`app/ml_model.py`: This file handles the machine learning model. It uses a singleton pattern to ensure only one instance of the model is loaded. The model is a Random Forest Regressor trained on the California housing dataset. If a pre-trained model (`model.pkl` and `scaler.pkl`) doesn't exist, it trains and saves a new one.
```python
import os
import threading

import joblib
import numpy as np
from sklearn.datasets import fetch_california_housing
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler


class MLModel:
    _instance = None
    _lock = threading.Lock()

    def __new__(cls):
        if cls._instance is None:
            with cls._lock:
                if cls._instance is None:
                    cls._instance = super().__new__(cls)
        return cls._instance

    def __init__(self):
        if not hasattr(self, "initialized"):
            self.model = None
            self.scaler = None
            self.model_path = "model.pkl"
            self.scaler_path = "scaler.pkl"
            self.feature_names = None
            self.initialized = True
            self.load_or_create_model()

    def load_or_create_model(self):
        """Load existing model or create a new one using California housing dataset"""
        if os.path.exists(self.model_path) and os.path.exists(self.scaler_path):
            self.model = joblib.load(self.model_path)
            self.scaler = joblib.load(self.scaler_path)
            housing = fetch_california_housing()
            self.feature_names = housing.feature_names
            print("Model loaded successfully")
        else:
            print("Creating new model...")
            housing = fetch_california_housing()
            X, y = housing.data, housing.target
            self.feature_names = housing.feature_names
            X_train, X_test, y_train, y_test = train_test_split(
                X, y, test_size=0.2, random_state=42
            )
            self.scaler = StandardScaler()
            X_train_scaled = self.scaler.fit_transform(X_train)
            self.model = RandomForestRegressor(
                n_estimators=50,  # Reduced for faster predictions
                max_depth=8,  # Reduced for faster predictions
                random_state=42,
                n_jobs=1,  # Single thread for consistency
            )
            self.model.fit(X_train_scaled, y_train)
            joblib.dump(self.model, self.model_path)
            joblib.dump(self.scaler, self.scaler_path)
            X_test_scaled = self.scaler.transform(X_test)
            score = self.model.score(X_test_scaled, y_test)
            print(f"Model R² score: {score:.4f}")

    def predict(self, features):
        """Make prediction for house price"""
        features_array = np.array(features).reshape(1, -1)
        features_scaled = self.scaler.transform(features_array)
        prediction = self.model.predict(features_scaled)[0]
        return prediction * 100000  # Target is in units of $100,000

    def get_feature_info(self):
        """Get information about the features"""
        return {
            "feature_names": list(self.feature_names),
            "num_features": len(self.feature_names),
            "description": "California housing dataset features",
        }


# Initialize model as singleton
ml_model = MLModel()
```
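Because `__new__` always returns the same instance, importing `ml_model` elsewhere or calling `MLModel()` again never reloads or retrains the model. A minimal sketch to verify this, assuming it is run from the project root (the script itself is illustrative and not part of the project):

```python
# check_singleton.py (hypothetical): verify the singleton behavior
from app.ml_model import MLModel, ml_model

another = MLModel()          # __new__ returns the already-created instance
assert another is ml_model   # same object, so the model is loaded only once
print("Both names point to the same MLModel instance")
```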
`app/pydantic_models.py`: This file defines the Pydantic models for request and response data validation and serialization.
```python
from typing import List

from pydantic import BaseModel, Field


class PredictionRequest(BaseModel):
    features: List[float] = Field(
        ...,
        description="List of 8 features: MedInc, HouseAge, AveRooms, AveBedrms, Population, AveOccup, Latitude, Longitude",
        min_length=8,
        max_length=8,
    )

    model_config = {
        "json_schema_extra": {
            "examples": [
                {"features": [8.3252, 41.0, 6.984, 1.024, 322.0, 2.556, 37.88, -122.23]}
            ]
        }
    }
```
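Because `min_length` and `max_length` are both 8, Pydantic rejects malformed payloads before they ever reach the endpoint logic. A quick illustrative check (not part of the project files):

```python
# Hypothetical snippet demonstrating request validation
from pydantic import ValidationError

from app.pydantic_models import PredictionRequest

# Valid: exactly 8 floats
req = PredictionRequest(
    features=[8.3252, 41.0, 6.984, 1.024, 322.0, 2.556, 37.88, -122.23]
)
print(req.features)

# Invalid: too few features raises a ValidationError
try:
    PredictionRequest(features=[1.0, 2.0])
except ValidationError as exc:
    print(f"Rejected with {exc.error_count()} validation error(s)")
```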
`app/main.py`: This file is the core FastAPI application, defining the API endpoints.
```python
import asyncio
from contextlib import asynccontextmanager

from fastapi import FastAPI, HTTPException
from fastapi.responses import ORJSONResponse

from .ml_model import ml_model
from .pydantic_models import PredictionRequest


@asynccontextmanager
async def lifespan(app: FastAPI):
    # Pre-load the model
    _ = ml_model.get_feature_info()
    yield


app = FastAPI(
    title="California Housing Price Prediction API",
    version="1.0.0",
    description="API for predicting California housing prices using Random Forest model",
    lifespan=lifespan,
    default_response_class=ORJSONResponse,
)


@app.get("/health")
async def health_check():
    """Health check endpoint"""
    return {"status": "healthy", "message": "Service is operational"}


@app.get("/model-info")
async def model_info():
    """Get information about the ML model"""
    try:
        feature_info = await asyncio.to_thread(ml_model.get_feature_info)
        return {
            "model_type": "Random Forest Regressor",
            "dataset": "California Housing Dataset",
            "features": feature_info,
        }
    except Exception:
        raise HTTPException(
            status_code=500, detail="Error retrieving model information"
        )


@app.post("/predict")
async def predict(request: PredictionRequest):
    """Make house price prediction"""
    if len(request.features) != 8:
        raise HTTPException(
            status_code=400,
            detail=f"Expected 8 features, got {len(request.features)}",
        )
    try:
        # Run the CPU-bound prediction in a worker thread so it
        # doesn't block the event loop
        prediction = await asyncio.to_thread(ml_model.predict, request.features)
        return {
            "prediction": float(prediction),
            "status": "success",
            "features_used": request.features,
        }
    except ValueError as e:
        raise HTTPException(status_code=400, detail=str(e))
    except Exception:
        raise HTTPException(status_code=500, detail="Prediction error")
```
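Once the server is running (see section 4), you can exercise the prediction endpoint directly from the command line; this request reuses the example payload defined in the Pydantic model:

```bash
curl -X POST http://localhost:8000/predict \
  -H "Content-Type: application/json" \
  -d '{"features": [8.3252, 41.0, 6.984, 1.024, 322.0, 2.556, 37.88, -122.23]}'
```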
Key points:
- `lifespan` manager: Ensures the ML model is loaded during application startup.
- `asyncio.to_thread`: This is crucial because scikit-learn's `predict` method is CPU-bound (synchronous). Running it in a separate thread prevents it from blocking FastAPI's asynchronous event loop, allowing the server to handle other requests concurrently; the sketch after this list demonstrates the effect.
- Endpoints:
  - `/health`: A simple health check.
  - `/model-info`: Provides metadata about the ML model.
  - `/predict`: Accepts a list of features and returns a house price prediction.
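To see why offloading matters, here is a minimal, self-contained sketch (independent of the project code) showing that a blocking call wrapped in `asyncio.to_thread` lets other coroutines keep running:

```python
import asyncio
import time


def cpu_bound_task():
    time.sleep(1)  # stand-in for a slow, synchronous model.predict()
    return 42


async def heartbeat():
    for _ in range(4):
        print("event loop is still responsive")
        await asyncio.sleep(0.25)


async def main():
    # Both run concurrently; the blocking work happens in a worker thread
    result, _ = await asyncio.gather(asyncio.to_thread(cpu_bound_task), heartbeat())
    print("result:", result)


asyncio.run(main())
```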
`run_server.py`: This script runs the FastAPI application using Uvicorn with four worker processes.

```python
import uvicorn

if __name__ == "__main__":
    uvicorn.run("app.main:app", host="localhost", port=8000, workers=4)
```
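If you prefer not to use a launcher script, the same server can be started directly with the Uvicorn CLI; this command is equivalent to the script above:

```bash
uvicorn app.main:app --host localhost --port 8000 --workers 4
```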
All the files and configurations are available at the GitHub repository: kingabzpro/Stress-Testing-FastAPI
# 3. Writing the Locust Stress Test
Now, let’s create the stress test script using Locust.
`tests/locustfile.py`: This file defines the behavior of simulated users.
```python
import json
import logging
import random

from locust import HttpUser, between, task

# Reduce logging to improve performance
logging.getLogger("urllib3").setLevel(logging.WARNING)


class HousingAPIUser(HttpUser):
    # Each simulated user waits 0.5 to 2 seconds between tasks
    wait_time = between(0.5, 2)

    def generate_random_features(self):
        """Generate random but realistic California housing features"""
        return [
            round(random.uniform(0.5, 15.0), 4),  # MedInc
            round(random.uniform(1.0, 52.0), 1),  # HouseAge
            round(random.uniform(2.0, 10.0), 2),  # AveRooms
            round(random.uniform(0.5, 2.0), 2),  # AveBedrms
            round(random.uniform(3.0, 35000.0), 0),  # Population
            round(random.uniform(1.0, 10.0), 2),  # AveOccup
            round(random.uniform(32.0, 42.0), 2),  # Latitude
            round(random.uniform(-124.0, -114.0), 2),  # Longitude
        ]

    @task(1)
    def model_info(self):
        """Test model info endpoint"""
        with self.client.get("/model-info", catch_response=True) as response:
            if response.status_code == 200:
                response.success()
            else:
                response.failure(f"Model info failed: {response.status_code}")

    @task(3)
    def single_prediction(self):
        """Test single prediction endpoint"""
        features = self.generate_random_features()
        with self.client.post(
            "/predict", json={"features": features}, catch_response=True, timeout=10
        ) as response:
            if response.status_code == 200:
                try:
                    data = response.json()
                    if "prediction" in data:
                        response.success()
                    else:
                        response.failure("Invalid response format")
                except json.JSONDecodeError:
                    response.failure("Failed to parse JSON")
            elif response.status_code == 503:
                response.failure("Service unavailable")
            else:
                response.failure(f"Status code: {response.status_code}")
```
Key points:
- `wait_time = between(0.5, 2)`: Each simulated user waits between 0.5 and 2 seconds between executing tasks.
- `generate_random_features` creates realistic random feature data for the prediction requests.
- The task weights (`@task(1)` and `@task(3)`) mean each user makes roughly one model_info request for every three single_prediction requests.
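Before launching the full 500-user run in the next section, you can sanity-check the locustfile with a brief single-user headless run (all flags below are standard Locust CLI options):

```bash
locust -f tests/locustfile.py --host http://localhost:8000 --users 1 --spawn-rate 1 --run-time 10s --headless
```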
# 4. Running the Stress Test
- To evaluate the performance of your application under load, begin by starting your asynchronous machine learning application in one terminal:

```bash
python run_server.py
```

You should see output similar to the following:

```text
Model loaded successfully
INFO: Started server process [26216]
INFO: Waiting for application startup.
INFO: Application startup complete.
INFO: Uvicorn running on http://0.0.0.0:8000 (Press CTRL+C to quit)
```

- Open your browser and navigate to http://localhost:8000/docs. Use the interactive API documentation to test your endpoints and ensure they are functioning correctly.
- Open a new terminal window, activate the virtual environment, and navigate to your project's root directory to run Locust with the web UI:

```bash
locust -f tests/locustfile.py --host http://localhost:8000
```

Access the Locust web UI at http://localhost:8089 in your browser.

- In the Locust web UI, set the total number of users to 500, the spawn rate to 10 users per second, and run the test for one minute.
- During the test, Locust will display real-time statistics, including the number of requests, failures, and response times for each endpoint.
- Once the test is complete, click on the Charts tab to view interactive graphs showing the number of users, requests per second, and response times.
- To run Locust without the web UI and automatically generate an HTML report, use the following command:

```bash
locust -f tests/locustfile.py --host http://localhost:8000 --users 500 --spawn-rate 10 --run-time 60s --headless --html report.html
```

After the test finishes, an HTML report named `report.html` will be saved in your project directory for later review.
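If you also want machine-readable results, for example to compare runs in a notebook, Locust's `--csv` option writes statistics files (such as `results_stats.csv`) alongside the HTML report:

```bash
locust -f tests/locustfile.py --host http://localhost:8000 --users 500 --spawn-rate 10 --run-time 60s --headless --html report.html --csv results
```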

# Final Thoughts
Our app can handle a large number of users because we are using a simple machine learning model. Notably, the results show that the model-info endpoint has a higher response time than the predict endpoint, which is impressive for the prediction path. Running this kind of stress test locally is the best way to validate your application before pushing it to production.
If you would like to experience this setup firsthand, please visit the kingabzpro/Stress-Testing-FastAPI repository and follow the instructions in the documentation.
Abid Ali Awan (@1abidaliawan) is a certified data scientist professional who loves building machine learning models. Currently, he is focusing on content creation and writing technical blogs on machine learning and data science technologies. Abid holds a Master’s degree in technology management and a bachelor’s degree in telecommunication engineering. His vision is to build an AI product using a graph neural network for students struggling with mental illness.