
# Introduction
Stress testing is crucial for understanding how your application behaves under heavy load. For machine learning-powered APIs, it is especially important because model inference can be CPU-intensive. By simulating a large number of users, we can identify performance bottlenecks, determine the capacity of our system, and ensure reliability.
In this tutorial, we will be using:
- FastAPI: A modern, fast (high-performance) web framework for building APIs with Python.
- Uvicorn: An ASGI server to run our FastAPI application.
- Locust: An open-source load testing tool. You define user behavior with Python code, and swarm your system with hundreds of simultaneous users.
- Scikit-learn: For our example machine learning model.
# 1. Project Setup and Dependencies
Set up the project structure and install the necessary dependencies.
Create a `requirements.txt` file and add the following Python packages:

```text
fastapi==0.115.12
locust==2.37.10
numpy==2.3.0
pandas==2.3.0
pydantic==2.11.5
scikit-learn==1.7.0
uvicorn==0.34.3
orjson==3.10.18
```

Open your terminal, create a virtual environment, and activate it:

```bash
python -m venv venv
venv\Scripts\activate        # Windows
# source venv/bin/activate   # macOS/Linux
```

Install all the Python packages using the `requirements.txt` file:

```bash
pip install -r requirements.txt
```
# 2. Building the FastAPI Application
In this section, we will create three files: one that trains and serves the regression model, one for the Pydantic models, and one for the FastAPI application.

`app/ml_model.py`: This file handles the machine learning model. It uses a singleton pattern to ensure only one instance of the model is loaded. The model is a Random Forest Regressor trained on the California housing dataset. If a pre-trained model (`model.pkl` and `scaler.pkl`) doesn't exist, it trains and saves a new one.
```python
import os
import threading

import joblib
import numpy as np
from sklearn.datasets import fetch_california_housing
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler


class MLModel:
    _instance = None
    _lock = threading.Lock()

    def __new__(cls):
        if cls._instance is None:
            with cls._lock:
                if cls._instance is None:
                    cls._instance = super().__new__(cls)
        return cls._instance

    def __init__(self):
        if not hasattr(self, "initialized"):
            self.model = None
            self.scaler = None
            self.model_path = "model.pkl"
            self.scaler_path = "scaler.pkl"
            self.feature_names = None
            self.initialized = True
            self.load_or_create_model()

    def load_or_create_model(self):
        """Load existing model or create a new one using California housing dataset"""
        if os.path.exists(self.model_path) and os.path.exists(self.scaler_path):
            self.model = joblib.load(self.model_path)
            self.scaler = joblib.load(self.scaler_path)
            housing = fetch_california_housing()
            self.feature_names = housing.feature_names
            print("Model loaded successfully")
        else:
            print("Creating new model...")
            housing = fetch_california_housing()
            X, y = housing.data, housing.target
            self.feature_names = housing.feature_names
            X_train, X_test, y_train, y_test = train_test_split(
                X, y, test_size=0.2, random_state=42
            )
            self.scaler = StandardScaler()
            X_train_scaled = self.scaler.fit_transform(X_train)
            self.model = RandomForestRegressor(
                n_estimators=50,  # Reduced for faster predictions
                max_depth=8,  # Reduced for faster predictions
                random_state=42,
                n_jobs=1,  # Single thread for consistency
            )
            self.model.fit(X_train_scaled, y_train)
            joblib.dump(self.model, self.model_path)
            joblib.dump(self.scaler, self.scaler_path)
            X_test_scaled = self.scaler.transform(X_test)
            score = self.model.score(X_test_scaled, y_test)
            print(f"Model R² score: {score:.4f}")

    def predict(self, features):
        """Make prediction for house price"""
        features_array = np.array(features).reshape(1, -1)
        features_scaled = self.scaler.transform(features_array)
        prediction = self.model.predict(features_scaled)[0]
        return prediction * 100000  # Target is in units of $100,000

    def get_feature_info(self):
        """Get information about the features"""
        return {
            "feature_names": list(self.feature_names),
            "num_features": len(self.feature_names),
            "description": "California housing dataset features",
        }


# Initialize model as singleton
ml_model = MLModel()
```
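Because `__new__` always returns the same instance, importing `ml_model` elsewhere or calling `MLModel()` again never reloads or retrains the model. A minimal sketch to verify this, assuming it is run from the project root (the script itself is illustrative and not part of the project):

```python
# check_singleton.py (hypothetical): verify the singleton behavior
from app.ml_model import MLModel, ml_model

another = MLModel()          # __new__ returns the already-created instance
assert another is ml_model   # same object, so the model is loaded only once
print("Both names point to the same MLModel instance")
```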
`app/pydantic_models.py`: This file defines the Pydantic models for request and response data validation and serialization.
```python
from typing import List

from pydantic import BaseModel, Field


class PredictionRequest(BaseModel):
    features: List[float] = Field(
        ...,
        description="List of 8 features: MedInc, HouseAge, AveRooms, AveBedrms, Population, AveOccup, Latitude, Longitude",
        min_length=8,
        max_length=8,
    )

    model_config = {
        "json_schema_extra": {
            "examples": [
                {"features": [8.3252, 41.0, 6.984, 1.024, 322.0, 2.556, 37.88, -122.23]}
            ]
        }
    }
```
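Because `min_length` and `max_length` are both 8, Pydantic rejects malformed payloads before they ever reach the endpoint logic. A quick illustrative check (not part of the project files):

```python
# Hypothetical snippet demonstrating request validation
from pydantic import ValidationError

from app.pydantic_models import PredictionRequest

# Valid: exactly 8 floats
req = PredictionRequest(
    features=[8.3252, 41.0, 6.984, 1.024, 322.0, 2.556, 37.88, -122.23]
)
print(req.features)

# Invalid: too few features raises a ValidationError
try:
    PredictionRequest(features=[1.0, 2.0])
except ValidationError as exc:
    print(f"Rejected with {exc.error_count()} validation error(s)")
```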
`app/main.py`: This file is the core FastAPI application, defining the API endpoints.
```python
import asyncio
from contextlib import asynccontextmanager

from fastapi import FastAPI, HTTPException
from fastapi.responses import ORJSONResponse

from .ml_model import ml_model
from .pydantic_models import PredictionRequest


@asynccontextmanager
async def lifespan(app: FastAPI):
    # Pre-load the model
    _ = ml_model.get_feature_info()
    yield


app = FastAPI(
    title="California Housing Price Prediction API",
    version="1.0.0",
    description="API for predicting California housing prices using Random Forest model",
    lifespan=lifespan,
    default_response_class=ORJSONResponse,
)


@app.get("/health")
async def health_check():
    """Health check endpoint"""
    return {"status": "healthy", "message": "Service is operational"}


@app.get("/model-info")
async def model_info():
    """Get information about the ML model"""
    try:
        feature_info = await asyncio.to_thread(ml_model.get_feature_info)
        return {
            "model_type": "Random Forest Regressor",
            "dataset": "California Housing Dataset",
            "features": feature_info,
        }
    except Exception:
        raise HTTPException(
            status_code=500, detail="Error retrieving model information"
        )


@app.post("/predict")
async def predict(request: PredictionRequest):
    """Make house price prediction"""
    if len(request.features) != 8:
        raise HTTPException(
            status_code=400,
            detail=f"Expected 8 features, got {len(request.features)}",
        )
    try:
        # Run the CPU-bound prediction in a worker thread so it
        # doesn't block the event loop
        prediction = await asyncio.to_thread(ml_model.predict, request.features)
        return {
            "prediction": float(prediction),
            "status": "success",
            "features_used": request.features,
        }
    except ValueError as e:
        raise HTTPException(status_code=400, detail=str(e))
    except Exception:
        raise HTTPException(status_code=500, detail="Prediction error")
```
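Once the server is running (see section 4), you can exercise the prediction endpoint directly from the command line; this request reuses the example payload defined in the Pydantic model:

```bash
curl -X POST http://localhost:8000/predict \
  -H "Content-Type: application/json" \
  -d '{"features": [8.3252, 41.0, 6.984, 1.024, 322.0, 2.556, 37.88, -122.23]}'
```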
Key points:
- `lifespan` manager: Ensures the ML model is loaded during application startup.
- `asyncio.to_thread`: This is crucial because scikit-learn's `predict` method is CPU-bound (synchronous). Running it in a separate thread prevents it from blocking FastAPI's asynchronous event loop, allowing the server to handle other requests concurrently; the sketch after this list demonstrates the effect.
- Endpoints:
  - `/health`: A simple health check.
  - `/model-info`: Provides metadata about the ML model.
  - `/predict`: Accepts a list of features and returns a house price prediction.
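To see why offloading matters, here is a minimal, self-contained sketch (independent of the project code) showing that a blocking call wrapped in `asyncio.to_thread` lets other coroutines keep running:

```python
import asyncio
import time


def cpu_bound_task():
    time.sleep(1)  # stand-in for a slow, synchronous model.predict()
    return 42


async def heartbeat():
    for _ in range(4):
        print("event loop is still responsive")
        await asyncio.sleep(0.25)


async def main():
    # Both run concurrently; the blocking work happens in a worker thread
    result, _ = await asyncio.gather(asyncio.to_thread(cpu_bound_task), heartbeat())
    print("result:", result)


asyncio.run(main())
```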
`run_server.py`: This script runs the FastAPI application using Uvicorn with four worker processes.

```python
import uvicorn

if __name__ == "__main__":
    uvicorn.run("app.main:app", host="localhost", port=8000, workers=4)
```
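If you prefer not to use a launcher script, the same server can be started directly with the Uvicorn CLI; this command is equivalent to the script above:

```bash
uvicorn app.main:app --host localhost --port 8000 --workers 4
```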
All the files and configurations are available at the GitHub repository: kingabzpro/Stress-Testing-FastAPI
# 3. Writing the Locust Stress Test
Now, let’s create the stress test script using Locust.
`tests/locustfile.py`: This file defines the behavior of simulated users.
```python
import json
import logging
import random

from locust import HttpUser, between, task

# Reduce logging to improve performance
logging.getLogger("urllib3").setLevel(logging.WARNING)


class HousingAPIUser(HttpUser):
    # Each simulated user waits 0.5 to 2 seconds between tasks
    wait_time = between(0.5, 2)

    def generate_random_features(self):
        """Generate random but realistic California housing features"""
        return [
            round(random.uniform(0.5, 15.0), 4),  # MedInc
            round(random.uniform(1.0, 52.0), 1),  # HouseAge
            round(random.uniform(2.0, 10.0), 2),  # AveRooms
            round(random.uniform(0.5, 2.0), 2),  # AveBedrms
            round(random.uniform(3.0, 35000.0), 0),  # Population
            round(random.uniform(1.0, 10.0), 2),  # AveOccup
            round(random.uniform(32.0, 42.0), 2),  # Latitude
            round(random.uniform(-124.0, -114.0), 2),  # Longitude
        ]

    @task(1)
    def model_info(self):
        """Test model info endpoint"""
        with self.client.get("/model-info", catch_response=True) as response:
            if response.status_code == 200:
                response.success()
            else:
                response.failure(f"Model info failed: {response.status_code}")

    @task(3)
    def single_prediction(self):
        """Test single prediction endpoint"""
        features = self.generate_random_features()
        with self.client.post(
            "/predict", json={"features": features}, catch_response=True, timeout=10
        ) as response:
            if response.status_code == 200:
                try:
                    data = response.json()
                    if "prediction" in data:
                        response.success()
                    else:
                        response.failure("Invalid response format")
                except json.JSONDecodeError:
                    response.failure("Failed to parse JSON")
            elif response.status_code == 503:
                response.failure("Service unavailable")
            else:
                response.failure(f"Status code: {response.status_code}")
```
Key points:
- `wait_time = between(0.5, 2)`: Each simulated user waits between 0.5 and 2 seconds between executing tasks.
- `generate_random_features` creates realistic random feature data for the prediction requests.
- The task weights (`@task(1)` and `@task(3)`) mean each user makes roughly one model_info request for every three single_prediction requests.
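Before launching the full 500-user run in the next section, you can sanity-check the locustfile with a brief single-user headless run (all flags below are standard Locust CLI options):

```bash
locust -f tests/locustfile.py --host http://localhost:8000 --users 1 --spawn-rate 1 --run-time 10s --headless
```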
# 4. Running the Stress Test
- To evaluate the performance of your application under load, begin by starting your asynchronous machine learning application in one terminal:

```bash
python run_server.py
```

You should see output similar to the following:

```text
Model loaded successfully
INFO: Started server process [26216]
INFO: Waiting for application startup.
INFO: Application startup complete.
INFO: Uvicorn running on http://0.0.0.0:8000 (Press CTRL+C to quit)
```

- Open your browser and navigate to http://localhost:8000/docs. Use the interactive API documentation to test your endpoints and ensure they are functioning correctly.
- Open a new terminal window, activate the virtual environment, and navigate to your project's root directory to run Locust with the web UI:

```bash
locust -f tests/locustfile.py --host http://localhost:8000
```

Access the Locust web UI at http://localhost:8089 in your browser.

- In the Locust web UI, set the total number of users to 500, the spawn rate to 10 users per second, and run the test for one minute.
- During the test, Locust will display real-time statistics, including the number of requests, failures, and response times for each endpoint.
- Once the test is complete, click on the Charts tab to view interactive graphs showing the number of users, requests per second, and response times.
- To run Locust without the web UI and automatically generate an HTML report, use the following command:

```bash
locust -f tests/locustfile.py --host http://localhost:8000 --users 500 --spawn-rate 10 --run-time 60s --headless --html report.html
```

After the test finishes, an HTML report named `report.html` will be saved in your project directory for later review.
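If you also want machine-readable results, for example to compare runs in a notebook, Locust's `--csv` option writes statistics files (such as `results_stats.csv`) alongside the HTML report:

```bash
locust -f tests/locustfile.py --host http://localhost:8000 --users 500 --spawn-rate 10 --run-time 60s --headless --html report.html --csv results
```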

# Final Thoughts
Our app can handle a large number of users because we are using a simple machine learning model. Notably, the results show that the model-info endpoint has a higher response time than the predict endpoint, which is impressive for the prediction path. Running this kind of stress test locally is the best way to validate your application before pushing it to production.
If you would like to experience this setup firsthand, please visit the kingabzpro/Stress-Testing-FastAPI repository and follow the instructions in the documentation.
Abid Ali Awan (@1abidaliawan) is a certified data scientist professional who loves building machine learning models. Currently, he is focusing on content creation and writing technical blogs on machine learning and data science technologies. Abid holds a Master’s degree in technology management and a bachelor’s degree in telecommunication engineering. His vision is to build an AI product using a graph neural network for students struggling with mental illness.