Prompt Engineering for Outlier Detection

December 9, 2025

Image by Author

# Introduction

Outliers in a given dataset represent extreme values. They are so extreme that they can ruin your analysis by heavily distorting statistics like the mean. For example, in a player height dataset, 12 feet is an outlier even for NBA players and would significantly pull the mean upward.

How do we handle them? We will answer this question by performing a real-life data project requested by Physician Partners during the data scientist recruitment process.

We will first explore detection methods and how each one defines an outlier, and then craft prompts to execute the process.

# What Are Outlier Detection & Removal Methods?

Outlier detection depends on the dataset you have. How?

For instance, if your dataset distribution is normal, you can use the standard deviation or the Z-score to detect them. However, if your dataset does not follow a normal distribution, you can use the Percentile Method, Principal Component Analysis (PCA), or the Interquartile Range (IQR) method.

You can check this article to see how to detect outliers using a box plot.

In this section, we will discover methodologies and Python code to apply these techniques.

// Standard Deviation Method

In this method, we can define outliers by measuring how much each value deviates from the mean.

For example, in the graph below, you can see the normal distribution and $ \pm3 $ standard deviations from the mean.

To use this method, first measure the mean and calculate the standard deviation. Next, determine the threshold by adding and subtracting three standard deviations from the mean, and filter the dataset to keep only the values within this range. Here is the Pandas code that performs this operation.

import pandas as pd
import numpy as np

col = df['column']

mean = col.mean()
std = col.std()

lower = mean - 3 * std
upper = mean + 3 * std

# Keep values within the 3 std dev range
filtered_df = df[(col >= lower) & (col <= upper)]

We make one assumption: the dataset should follow a normal distribution. What is a normal distribution? It means that the data follows a balanced, bell-shaped distribution. Here is an example:

By using this method, you will flag about 0.3% of the data as outliers, since 3 standard deviations from the mean covers about 99.7% of the data.
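As a quick sanity check on the 0.3% figure, the sketch below applies the same three-standard-deviation rule, expressed as Z-scores, to synthetic normally distributed data. The `heights` series and its parameters are made up for illustration and are not part of the project dataset.

```python
import numpy as np
import pandas as pd

# Synthetic, roughly normal data (an assumption for illustration only)
rng = np.random.default_rng(42)
heights = pd.Series(rng.normal(loc=78, scale=3, size=100_000), name="height_in")

# Z-score: how many standard deviations each value lies from the mean
z_scores = (heights - heights.mean()) / heights.std()

# |z| > 3 is the same rule as falling outside mean +/- 3 * std
is_outlier = z_scores.abs() > 3

share_flagged = is_outlier.mean()
print(f"Flagged share: {share_flagged:.2%}")
```

On data that really is normal, the flagged share should land close to 0.3%. On skewed data the share can be quite different, which is why the normality assumption matters.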

// IQR

The Interquartile Range (IQR) represents the middle 50% of your data and shows the most common values in your dataset, as shown in the graph below.

To detect outliers using IQR, first calculate the IQR. In the following code, we define the first and third quartiles and subtract the first quartile from the third to find the IQR. The range between them holds the middle half of the data ($ 0.75 - 0.25 = 0.5 $).

Q1 = df['column'].quantile(0.25)
Q3 = df['column'].quantile(0.75)

IQR = Q3 - Q1

Once you have the IQR, you must create the filter, defining the boundaries.

lower = Q1 - 1.5 * IQR
upper = Q3 + 1.5 * IQR

Any value outside these bounds will be flagged as an outlier.

filtered_df = df[(df['column'] >= lower) & (df['column'] <= upper)]

As you can see from the image below, the IQR represents the box in the middle. You can clearly see the boundaries we have defined ($ Q1 - 1.5 \text{ IQR} $ and $ Q3 + 1.5 \text{ IQR} $).

You can apply IQR to any distribution, but it works best if the distribution is not highly skewed.
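The IQR steps can also be wrapped into a small helper. This is a minimal sketch: the function name `iqr_bounds`, the parameter `k`, and the sample heights are hypothetical and chosen only to echo the 12-foot example from the introduction.

```python
import pandas as pd

def iqr_bounds(values: pd.Series, k: float = 1.5) -> tuple[float, float]:
    """Return the (lower, upper) fences Q1 - k * IQR and Q3 + k * IQR."""
    q1 = values.quantile(0.25)
    q3 = values.quantile(0.75)
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr

# Made-up player heights in feet, with one impossible value
heights_ft = pd.Series([5.9, 6.1, 6.3, 6.5, 6.6, 6.7, 6.8, 7.0, 12.0])

lower, upper = iqr_bounds(heights_ft)
outliers = heights_ft[(heights_ft < lower) | (heights_ft > upper)]
print(outliers.tolist())
```

Here only the 12.0 entry falls outside the fences. Because quartiles barely move when a single extreme value is added, the fences stay tight around the bulk of the data.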

// Percentile

The Percentile Method involves removing values that fall below or above chosen percentile thresholds.

Thresholds that cut off the most extreme 1% to 5% of the data are commonly used, because that slice usually contains the outliers.

We already used quantile() for this kind of cutoff in the last section while calculating the IQR, like this:

Q1 = df['column'].quantile(0.25)
Q3 = df['column'].quantile(0.75)

For instance, let’s treat values above the 99th percentile and below the 1st percentile as outliers.

lower_p = df['column'].quantile(0.01)
upper_p = df['column'].quantile(0.99)

Finally, filter the dataset based on these boundaries.

filtered_df = df[(df['column'] >= lower_p) & (df['column'] <= upper_p)]

This method does not rely on assumptions, unlike standard deviation (normal distribution) and IQR methods (non-highly skewed distribution).
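One consequence of this assumption-free design is worth seeing in code: percentile cutoffs always flag a fixed share of rows, even when nothing in the data is unusual. The sketch below uses synthetic data without any injected extremes; the series name `clean` and its parameters are illustrative assumptions.

```python
import numpy as np
import pandas as pd

# Well-behaved synthetic data with no deliberately extreme values
rng = np.random.default_rng(0)
clean = pd.Series(rng.normal(loc=100, scale=10, size=10_000))

lower_p = clean.quantile(0.01)
upper_p = clean.quantile(0.99)

# Flag everything below the 1st or above the 99th percentile
flagged = (clean < lower_p) | (clean > upper_p)
print(f"Flagged {flagged.sum()} of {len(clean)} rows ({flagged.mean():.1%})")
```

Roughly 2% of the rows are flagged regardless of the distribution, so the percentile method is better read as trimming than as evidence that the flagged values are anomalous.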

# Outliers Detection Data Project From Physician Partners

Physician Partners is a healthcare group that helps doctors coordinate patient care more effectively. In this data project, they asked us to create an algorithm that can find outliers in the data in one or several columns.

First, let’s explore the dataset using this code.

sfrs = pd.read_csv('sfr_test.csv')
sfrs.head()

Here is the output:

member_unique_id	gender	dob	eligible_year	eligible_month	affiliation_type	pbp_group	plan_name	npi	line_of_business
1	F	21/06/1990	2020	202006	Affiliate	NON-SNP	MEDICARE – CAREFREE	1	HMO
2	M	02/01/1948	2020	202006	Affiliate	NON-SNP	NaN	1	HMO
3	M	14/06/1948	2020	202006	Affiliate	NON-SNP	MEDICARE – CAREFREE	1	HMO
4	M	10/02/1954	2020	202006	Affiliate	D-SNP	MEDICARE – CARENEEDS	1	HMO
5	M	31/12/1953	2020	202006	Affiliate	NON-SNP	NaN	1	HMO

However, there are more columns we did not see with the head() method. To see them, let’s use the info() method by calling sfrs.info().

And let’s see the output.

This dataset contains synthetic healthcare and financial information, including demographics, plan details, clinical flags, and financial columns used to identify unusually high-spending members.

Here are those columns and their explanations.

Column	Explanation
member_unique_id	member’s ID
gender	member’s gender
dob	member’s date of birth
eligible_year	year
eligible_month	month
affiliation_type	doctor’s type
pbp_group	health plan group
plan_name	health plan name
npi	doctor’s ID
line_of_business	health plan type
esrd	True if the patient is on dialysis
hospice	True if the patient is in hospice

As you can see from the project data description, there is a catch: some data points include a dollar sign (“$”), so this needs to be taken care of.

Let’s view this column closely.

Here is the output.

The dollar signs and these commas need to be addressed so we can perform proper data analysis.
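If you wanted to do this cleanup yourself in Pandas rather than through a prompt, it could look like the sketch below. The column names are taken from the financial columns listed later in the prompt, but the values and the helper name `to_numeric_currency` are made up, and the sketch assumes the columns are stored as strings.

```python
import pandas as pd

# Toy frame: real column names, invented values
raw = pd.DataFrame({
    "ipa_funding": ["$1,234.50", "$87.00", None],
    "pcp_cap": ["$2,000", "15.25", "$0.99"],
})

def to_numeric_currency(col: pd.Series) -> pd.Series:
    """Strip "$" and "," from a string column and convert it to numbers."""
    stripped = col.str.replace(r"[$,]", "", regex=True)
    # errors="coerce" turns anything that still is not a number into NaN
    return pd.to_numeric(stripped, errors="coerce")

numeric = raw.apply(to_numeric_currency)
print(numeric)
```

Missing entries stay missing after the conversion, so they can be handled explicitly in a later step instead of silently becoming zeros.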

# Prompt Crafting for Outlier Detection

Now we are aware of the specifics of the dataset. It is time to write two different prompts: one to detect outliers and a second to remove them.

// Prompt to Detect Outliers

We have learned three different techniques, so we should include them in the prompt.

Also, as you can see from the info() method output, the dataset has NaNs (missing values): most columns have 10,530 entries, but some columns have missing values (e.g., the plan_name column with 6,606 non-null values). This should be taken care of.

Here is the prompt:

You are a data analysis assistant. I have attached a dataset. Your task is to detect outliers using three methods: Standard Deviation, IQR, and Percentile.

Follow these steps:

1. Load the attached dataset and remove both the “$” sign and any comma separators (“,”) from financial columns, then convert them to numeric.

2. Handle missing values by removing rows with NA in the numeric columns we analyze.

3. Apply the three methods to the financial columns:

Standard Deviation Method: flag values outside mean +/- 3 * std

IQR Method: flag values outside Q1 - 1.5 * IQR and Q3 + 1.5 * IQR

Percentile Method: use the 1st and 99th percentiles as cutoffs

4. Instead of listing all results for each column, compute and output only:

– the total number of outliers detected across all financial columns for each method
– the average number of outliers per column for each method

Additionally, save the row indices of the detected outliers into three separate CSV files:
– sd_outlier_indices.csv
– iqr_outlier_indices.csv
– percentile_outlier_indices.csv

Output only the summary counts and save the indices to CSV.

financial_columns = [
"ipa_funding",
"ma_premium",
"ma_risk_score",
"mbr_with_rx_rebates",
"partd_premium",
"pcp_cap",
"pcp_ffs",
"plan_premium",
"prof",
"reinsurance",
"risk_score_partd",
"rx",
"rx_rebates",
"rx_with_rebates",
"rx_without_rebates",
"spec_cap"
]

The prompt above will first load the dataset, strip the “$” signs and commas from the financial columns, and handle missing values by removing them. Next, it will output the number of outliers found in the financial columns and create three CSV files, one per technique, containing the row indices of the detected outliers.
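To make it easier to judge what the model returns, here is a rough sketch of the logic the prompt describes. It is not the code ChatGPT generates: the helper `outlier_mask`, the synthetic `data` frame, and the temporary output folder are assumptions for illustration, and only the three CSV file names come from the prompt.

```python
import tempfile
from pathlib import Path

import numpy as np
import pandas as pd

def outlier_mask(col: pd.Series, method: str) -> pd.Series:
    """Return a boolean mask that is True where a value is flagged."""
    if method == "sd":
        lower = col.mean() - 3 * col.std()
        upper = col.mean() + 3 * col.std()
    elif method == "iqr":
        q1, q3 = col.quantile(0.25), col.quantile(0.75)
        lower, upper = q1 - 1.5 * (q3 - q1), q3 + 1.5 * (q3 - q1)
    elif method == "percentile":
        lower, upper = col.quantile(0.01), col.quantile(0.99)
    else:
        raise ValueError(f"unknown method: {method}")
    return (col < lower) | (col > upper)

# Synthetic stand-in for two already-numeric financial columns
rng = np.random.default_rng(7)
data = pd.DataFrame({
    "pcp_cap": rng.lognormal(mean=3, sigma=1, size=1_000),
    "rx": rng.normal(loc=200, scale=25, size=1_000),
})

out_dir = Path(tempfile.mkdtemp())  # temporary folder for the index files
summary = {}
for method in ("sd", "iqr", "percentile"):
    masks = data.apply(lambda col: outlier_mask(col, method))
    per_column = masks.sum()  # number of flagged values in each column
    summary[method] = {
        "total": int(per_column.sum()),
        "avg_per_column": float(per_column.mean()),
    }
    # A row is saved if it is flagged in at least one column
    rows = masks.index[masks.any(axis=1)]
    pd.Series(rows, name="row_index").to_csv(
        out_dir / f"{method}_outlier_indices.csv", index=False
    )

print(pd.DataFrame(summary).T)
```

Running something like this on the real columns gives you reference counts to compare with the summary the model reports.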

// Prompt to Remove the Outliers

After finding the indices of the outliers, the next step is to remove those rows. To do that, we will write a second prompt.

You are a data analysis assistant. I have attached a dataset along with a CSV which includes indices which are outliers.

Your task is to remove these outliers and return a clean version of the dataset.

1. Load the dataset.
2. Remove all given outliers using the given indices.
3. Confirm how many values were removed.
4. Return the cleaned dataset.

This prompt first loads the dataset and removes the outliers using the given indices.
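For comparison, the same removal step is only a few lines of Pandas. This sketch uses a tiny invented frame, and the `outlier_indices` series stands in for the contents of one of the index CSV files.

```python
import pandas as pd

# Invented data; row label 2 plays the role of a detected outlier
original = pd.DataFrame({"pcp_cap": [10.0, 12.0, 950.0, 11.0, 9.0]})
outlier_indices = pd.Series([2], name="row_index")  # stand-in for the CSV

# Keep every row whose index label is not in the outlier list
cleaned = original[~original.index.isin(outlier_indices)]

removed = len(original) - len(cleaned)
print(f"Removed {removed} row(s); {len(cleaned)} remain")
```

The indices only make sense against the same row labels that were used during detection, so the dataset should not be re-sorted or re-indexed between the two prompts.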

# Testing Prompts

Let’s test how those prompts work. First, download the dataset.

// Outlier Detection Prompt

Now, attach the dataset you have to ChatGPT (or the Large Language Model (LLM) of your choice). Paste the prompt to detect outliers after attaching the dataset. Let’s see the output.

The output shows how many outliers each method detected, the average per column, and, as requested, the CSV files containing the row indices of these outliers.

We then ask it to make all CSVs downloadable with this prompt:

Prepare the cleaned CSVs for download

Here is the output with links.

// Outlier Removal Prompt

This is the final step. Select the method you want to use to remove outliers, then copy the outlier removal prompt. Attach the CSV with this prompt and send it.

We removed the outliers. Now, let’s validate it using Python. The following code will read the cleaned dataset and compare the shapes to show the before-and-after.

cleaned = pd.read_csv("/cleaned_dataset.csv")

print("Before:", sfrs.shape)
print("After :", cleaned.shape)
print("Removed rows:", sfrs.shape[0] - cleaned.shape[0])

Here is the output.

This confirms that 791 rows were removed as outliers, using the Standard Deviation method with ChatGPT.
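Comparing shapes tells you how many rows disappeared, but not whether they were the right ones. A stricter check, sketched below under stated assumptions, recomputes the three-standard-deviation bounds on the original data and counts how many values in the cleaned data still fall outside them. The function name `count_remaining_outliers` and the synthetic `before` and `after` frames are hypothetical.

```python
import numpy as np
import pandas as pd

def count_remaining_outliers(original, cleaned, columns):
    """Count cleaned values outside mean +/- 3 * std computed on the original."""
    remaining = 0
    for name in columns:
        mean, std = original[name].mean(), original[name].std()
        lower, upper = mean - 3 * std, mean + 3 * std
        outside = (cleaned[name] < lower) | (cleaned[name] > upper)
        remaining += int(outside.sum())
    return remaining

# Synthetic data with one extreme row appended at index 500
rng = np.random.default_rng(1)
before = pd.DataFrame({
    "pcp_cap": np.append(rng.normal(100, 10, 500), [900.0]),
    "rx": np.append(rng.normal(50, 5, 500), [-400.0]),
})
after = before.drop(index=500)  # stand-in for the file returned by the model

columns = ["pcp_cap", "rx"]
print("Left in cleaned data:", count_remaining_outliers(before, after, columns))
print("Left in original data:", count_remaining_outliers(before, before, columns))
```

A result of zero for the cleaned data, together with the expected row count, is a stronger signal that the model removed the rows you intended.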

# Final Thoughts

Removing outliers can improve your machine learning model’s performance and make your analysis more robust, because extreme values distort the statistics your analysis relies on. Where do these outliers come from? They can be simple typing mistakes, or they can be genuine values that are not representative of the broader population, such as a 7-foot player like Shaquille O’Neal.

To remove outliers, you can apply these techniques directly in Python or go one step further and bring AI into the process with your own prompts. Either way, be careful: your dataset might have specifics that AI cannot understand at first glance, like “$” signs.

Nate Rosidi is a data scientist who also works in product strategy. He’s also an adjunct professor teaching analytics, and is the founder of StrataScratch, a platform helping data scientists prepare for their interviews with real interview questions from top companies. Nate writes on the latest trends in the career market, gives interview advice, shares data science projects, and covers everything SQL.
