Tuesday, December 16, 2025

Prompt Engineering for Outlier Detection

Image by Author

 

Introduction

 
Outliers in a given dataset represent extreme values. They are so extreme that they can ruin your analysis by heavily distorting statistics like the mean. For example, in a player height dataset, 12 feet is an outlier even for NBA players and would significantly pull the mean upward.
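To make the distortion concrete, here is a minimal sketch using made-up player heights (the values are hypothetical, not taken from any dataset in this article). It compares the mean with and without the 12-foot entry and shows that the median barely reacts.

```python
import statistics

# Hypothetical player heights in feet; the last value is a data-entry error.
heights = [6.3, 6.5, 6.6, 6.7, 6.8, 12.0]

mean_with = statistics.mean(heights)          # mean including the 12-foot value
mean_without = statistics.mean(heights[:-1])  # mean after dropping it
median_with = statistics.median(heights)      # median including the 12-foot value

print(f"Mean with outlier:    {mean_with:.2f}")
print(f"Mean without outlier: {mean_without:.2f}")
print(f"Median with outlier:  {median_with:.2f}")
```

A single extreme value shifts the mean by almost a foot here, while the median stays close to the typical heights.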

How do we handle them? We will answer this question by working through a real-life data project that Physician Partners assigns during its data scientist recruitment process.

We will first explore detection methods and define outliers, then craft prompts to execute the process.

 

What Are Outlier Detection & Removal Methods?

 
Outlier detection depends on the dataset you have. How?

For instance, if your dataset distribution is normal, you can use the standard deviation or the Z-score to detect them. However, if your dataset does not follow a normal distribution, you can use the Percentile Method, Principal Component Analysis (PCA), or the Interquartile Range (IQR) method.

You can check this article to see how to detect outliers using a box plot.

In this section, we will discover methodologies and Python code to apply these techniques.

 

// Standard Deviation Method

In this method, we can define outliers by measuring how much each value deviates from the mean.

For example, in the graph below, you can see the normal distribution and \( \pm3 \) standard deviations from the mean.

 
[Image: normal distribution curve with ±3 standard deviations marked around the mean]
 

To use this method, first measure the mean and calculate the standard deviation. Next, determine the threshold by adding and subtracting three standard deviations from the mean, and filter the dataset to keep only the values within this range. Here is the Pandas code that performs this operation.

import pandas as pd
import numpy as np

col = df['column']

mean = col.mean()
std = col.std()

lower = mean - 3 * std
upper = mean + 3 * std

# Keep values within the 3 std dev range
filtered_df = df[(col >= lower) & (col <= upper)]

 

We make one assumption: the dataset should follow a normal distribution. What is a normal distribution? It means that the data follows a balanced, bell-shaped distribution. Here is an example:

 
[Image: example of a bell-shaped normal distribution]
 

If the data is normally distributed, this method flags about 0.3% of it as outliers, since the range within 3 standard deviations of the mean covers about 99.7% of the data.
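As a quick check of that figure, the sketch below applies the 3 standard deviation rule to synthetic data. The sample is generated with NumPy under an assumed mean of 50 and standard deviation of 10, so it only illustrates the rule and says nothing about the project dataset.

```python
import numpy as np
import pandas as pd

# Synthetic, normally distributed sample (hypothetical data, fixed seed).
rng = np.random.default_rng(42)
values = pd.Series(rng.normal(loc=50, scale=10, size=100_000))

mean, std = values.mean(), values.std()
lower, upper = mean - 3 * std, mean + 3 * std

# Boolean mask: True where a value lies outside mean +/- 3 std.
is_outlier = (values < lower) | (values > upper)
flagged_share = is_outlier.mean()

print(f"Flagged {is_outlier.sum()} of {len(values)} values ({flagged_share:.2%})")
```

On data that is not normally distributed, the flagged share can differ noticeably from 0.3%.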
 
 

 

// IQR

The Interquartile Range (IQR) is the spread of the middle 50% of your data, that is, the range between the first quartile (Q1) and the third quartile (Q3), as shown in the graph below.

 
[Image: the IQR covering the middle 50% of the data]
 

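To detect outliers using IQR, first calculate the IQR. In the following code, we compute the first quartile (the 0.25 quantile) and the third quartile (the 0.75 quantile), then subtract the first from the third to find the IQR. Together, the two quartiles bracket the middle 50% of the data (\( 0.75 - 0.25 = 0.5 \)).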

Q1 = df['column'].quantile(0.25)
Q3 = df['column'].quantile(0.75)

IQR = Q3 - Q1

 

Once you have the IQR, define the boundaries for the filter.

lower = Q1 - 1.5 * IQR
upper = Q3 + 1.5 * IQR

 

Any value outside these bounds will be flagged as an outlier.

filtered_df = df[(df['column'] >= lower) & (df['column'] <= upper)]

 

As you can see from the image below, the IQR represents the box in the middle. You can clearly see the boundaries we have defined (\( Q1 - 1.5 \times \text{IQR} \) and \( Q3 + 1.5 \times \text{IQR} \)).
 
[Image: box plot showing the IQR box and the 1.5 × IQR boundaries]
 

You can apply IQR to any distribution, but it works best if the distribution is not highly skewed.
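The IQR steps can be wrapped into a small helper. This is a minimal sketch: the function name iqr_bounds, the parameter k, and the sample values are all illustrative assumptions rather than part of the project.

```python
import pandas as pd

def iqr_bounds(series: pd.Series, k: float = 1.5) -> tuple[float, float]:
    """Return (lower, upper) bounds: Q1 - k * IQR and Q3 + k * IQR."""
    q1 = series.quantile(0.25)
    q3 = series.quantile(0.75)
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr

# Hypothetical sample with one obviously extreme value.
sample = pd.Series([10, 12, 13, 14, 15, 16, 18, 95])

lower, upper = iqr_bounds(sample)
outliers = sample[(sample < lower) | (sample > upper)]

print(f"Bounds: [{lower}, {upper}]")
print("Flagged values:", outliers.tolist())
```

Keeping k as a parameter makes it easy to try a stricter or looser rule than the conventional 1.5.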

 

// Percentile

The Percentile Method removes values that fall outside chosen percentile cutoffs.

Thresholds that remove the most extreme 1% to 5% of the data are commonly used, because that tail usually contains the outliers.

We already used the same quantile() function in the last section to calculate the quartiles for the IQR, like this:

Q1 = df['column'].quantile(0.25)
Q3 = df['column'].quantile(0.75)

 

For instance, let’s treat values below the 1st percentile or above the 99th percentile as outliers.

lower_p = df['column'].quantile(0.01)
upper_p = df['column'].quantile(0.99)

 

Finally, filter the dataset based on these boundaries.

filtered_df = df[(df['column'] >= lower_p) & (df['column'] <= upper_p)]

 

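This method does not rely on distributional assumptions, unlike the standard deviation method (normal distribution) and the IQR method (a distribution that is not highly skewed). The trade-off is that it always flags a fixed share of the data, about 2% with the cutoffs above, whether or not those values are truly extreme.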
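To see how the three methods behave when their assumptions do not hold, the sketch below runs them on synthetic right-skewed data. The lognormal sample and the helper share_flagged are assumptions made for illustration only.

```python
import numpy as np
import pandas as pd

# Hypothetical right-skewed data (lognormal), fixed seed for repeatability.
rng = np.random.default_rng(0)
skewed = pd.Series(rng.lognormal(mean=0.0, sigma=1.0, size=10_000))

def share_flagged(s: pd.Series, lower: float, upper: float) -> float:
    """Fraction of values strictly outside [lower, upper]."""
    return float(((s < lower) | (s > upper)).mean())

mean, std = skewed.mean(), skewed.std()
q1, q3 = skewed.quantile(0.25), skewed.quantile(0.75)
iqr = q3 - q1

shares = {
    "std": share_flagged(skewed, mean - 3 * std, mean + 3 * std),
    "iqr": share_flagged(skewed, q1 - 1.5 * iqr, q3 + 1.5 * iqr),
    "percentile": share_flagged(skewed, skewed.quantile(0.01), skewed.quantile(0.99)),
}

for method, share in shares.items():
    print(f"{method:>10}: {share:.2%} flagged")
```

The percentile share stays at about 2% by construction, while the other two depend on the shape of the data. On a long right tail like this one, the IQR rule tends to flag far more values than it would on symmetric data.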

 

Outlier Detection Data Project From Physician Partners

 
Physician Partners is a healthcare group that helps doctors coordinate patient care more effectively. In this data project, they asked us to create an algorithm that can find outliers in the data in one or several columns.

First, let’s explore the dataset using this code.

sfrs = pd.read_csv('sfr_test.csv')
sfrs.head()

 

Here is the output:

 

member_unique_id  gender  dob         eligible_year  eligible_month  affiliation_type  pbp_group  plan_name             npi  line_of_business
1                 F       21/06/1990  2020           202006          Affiliate         NON-SNP    MEDICARE – CAREFREE   1    HMO
2                 M       02/01/1948  2020           202006          Affiliate         NON-SNP    NaN                   1    HMO
3                 M       14/06/1948  2020           202006          Affiliate         NON-SNP    MEDICARE – CAREFREE   1    HMO
4                 M       10/02/1954  2020           202006          Affiliate         D-SNP      MEDICARE – CARENEEDS  1    HMO
5                 M       31/12/1953  2020           202006          Affiliate         NON-SNP    NaN                   1    HMO

 

However, the dataset has more columns than the head() output shows. To list all of them, let’s use the info() method.

sfrs.info()

 

And let’s see the output.
 
[Image: output of sfrs.info()]
 

This dataset contains synthetic healthcare and financial information, including demographics, plan details, clinical flags, and financial columns used to identify unusually high-spending members.

Here are those columns and their explanations.

 

Column            Explanation
member_unique_id  member’s ID
gender            member’s gender
dob               member’s date of birth
eligible_year     year
eligible_month    month
affiliation_type  doctor’s type
pbp_group         health plan group
plan_name         health plan name
npi               doctor’s ID
line_of_business  health plan type
esrd              True if the patient is on dialysis
hospice           True if the patient is in hospice

 

As you can see from the project data description, there is a catch: some data points include a dollar sign (“$”), so this needs to be taken care of.

 
[Image: excerpt from the project data description]
 

Let’s view this column closely.

 

Here is the output.

 
[Image: column values containing dollar signs and comma separators]
 

The dollar signs and these commas need to be addressed so we can perform proper data analysis.
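In pandas, this cleanup can be sketched as follows. The helper name to_numeric_currency and the sample strings are hypothetical. The real financial columns may contain other formatting quirks that this does not cover.

```python
import pandas as pd

def to_numeric_currency(series: pd.Series) -> pd.Series:
    """Strip '$' and ',' from string amounts and convert them to floats.

    Anything that still cannot be parsed becomes NaN (errors='coerce').
    """
    cleaned = (
        series.astype(str)
        .str.replace(r"[$,]", "", regex=True)  # drop dollar signs and commas
        .str.strip()                           # drop surrounding whitespace
    )
    return pd.to_numeric(cleaned, errors="coerce")

# Hypothetical raw values in the style described above.
raw = pd.Series(["$1,250.50", "300", " $0.00 ", None, "$12,000"])
amounts = to_numeric_currency(raw)

print(amounts)
```

Using errors="coerce" turns unparseable entries into NaN instead of raising, so they can be handled together with the other missing values.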

 

Prompt Crafting for Outlier Detection

 
Now we are aware of the specifics of the dataset. It is time to write two different prompts: one to detect outliers and a second to remove them.

 

// Prompt to Detect Outliers

We have learned three different techniques, so we should include them in the prompt.

Also, as you can see from the info() method output, the dataset has NaNs (missing values): most columns have 10,530 entries, but some columns have missing values (e.g., the plan_name column with 6,606 non-null values). This should be taken care of.

Here is the prompt:

You are a data analysis assistant. I have attached a dataset. Your task is to detect outliers using three methods: Standard Deviation, IQR, and Percentile.

Follow these steps:

1. Load the attached dataset and remove both the “$” sign and any comma separators (“,”) from financial columns, then convert them to numeric.

2. Handle missing values by removing rows with NA in the numeric columns we analyze.

3. Apply the three methods to the financial columns:

Standard Deviation Method: flag values outside mean +/- 3 * std

IQR Method: flag values outside Q1 - 1.5 * IQR and Q3 + 1.5 * IQR

Percentile Method: use the 1st and 99th percentiles as cutoffs

4. Instead of listing all results for each column, compute and output only:

– the total number of outliers detected across all financial columns for each method
– the average number of outliers per column for each method

Additionally, save the row indices of the detected outliers into three separate CSV files:
– sd_outlier_indices.csv
– iqr_outlier_indices.csv
– percentile_outlier_indices.csv

Output only the summary counts and save the indices to CSV.

financial_columns = [
"ipa_funding",
"ma_premium",
"ma_risk_score",
"mbr_with_rx_rebates",
"partd_premium",
"pcp_cap",
"pcp_ffs",
"plan_premium",
"prof",
"reinsurance",
"risk_score_partd",
"rx",
"rx_rebates",
"rx_with_rebates",
"rx_without_rebates",
"spec_cap"
]

 

The prompt above first loads the dataset, strips the “$” signs and comma separators from the financial columns, and handles missing values by removing the affected rows. Next, it outputs the number of outliers found in the financial columns and creates three CSV files, one per technique, containing the row indices of the detected outliers.
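For comparison, here is a rough pandas sketch of what the prompt asks for. It is one possible reading of the steps, not the code ChatGPT actually runs. The function names, the demo DataFrame, and its values are hypothetical, and the columns are assumed to be numeric already.

```python
import pandas as pd

def outlier_mask(col: pd.Series, method: str) -> pd.Series:
    """Return a boolean mask that is True where `col` is flagged."""
    if method == "sd":
        lower = col.mean() - 3 * col.std()
        upper = col.mean() + 3 * col.std()
    elif method == "iqr":
        q1, q3 = col.quantile(0.25), col.quantile(0.75)
        lower, upper = q1 - 1.5 * (q3 - q1), q3 + 1.5 * (q3 - q1)
    elif method == "percentile":
        lower, upper = col.quantile(0.01), col.quantile(0.99)
    else:
        raise ValueError(f"unknown method: {method}")
    return (col < lower) | (col > upper)

def detect_outliers(df: pd.DataFrame, columns: list[str]) -> dict:
    """Summarize flagged values per method and collect the affected row indices."""
    results = {}
    for method in ("sd", "iqr", "percentile"):
        total = 0
        indices = set()
        for name in columns:
            col = df[name].dropna()  # step 2: skip missing values
            mask = outlier_mask(col, method)
            total += int(mask.sum())
            indices.update(col[mask].index.tolist())
        results[method] = {
            "total": total,
            "average_per_column": total / len(columns),
            "indices": sorted(indices),
        }
    return results

def save_indices(results: dict, out_dir: str = ".") -> None:
    """Write one <method>_outlier_indices.csv file per method."""
    for method, res in results.items():
        path = f"{out_dir}/{method}_outlier_indices.csv"
        pd.Series(res["indices"], name="index").to_csv(path, index=False)

# Hypothetical demo data using two of the financial column names.
demo = pd.DataFrame({
    "pcp_cap": [float(v) for v in range(1, 20)] + [1000.0],
    "rx": [-500.0] + [float(v) for v in range(101, 119)] + [float("nan")],
})

results = detect_outliers(demo, ["pcp_cap", "rx"])
for method, res in results.items():
    print(method, res["total"], res["average_per_column"], res["indices"])
```

Note that the totals count flagged values per column, while the saved indices are rows, so a row flagged in several columns appears only once in the index list. Calling save_indices(results) would write the three files named in the prompt.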

 

// Prompt to Remove the Outliers

After finding the indices, the next step is to remove the corresponding rows. To do that, we will write a second prompt.

You are a data analysis assistant. I have attached a dataset along with a CSV file that contains the row indices of the outliers.

Your task is to remove these outliers and return a clean version of the dataset.

1. Load the dataset.
2. Remove all given outliers using the given indices.
3. Confirm how many values were removed.
4. Return the cleaned dataset.

This prompt first loads the dataset and removes the outliers using the given indices.
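The equivalent operation in pandas is a row drop by index label. The sketch below uses a hypothetical helper remove_outliers and a made-up five-row DataFrame. It assumes the indices refer to the same row labels as the loaded dataset.

```python
import pandas as pd

def remove_outliers(df: pd.DataFrame, outlier_indices) -> pd.DataFrame:
    """Drop the rows whose index labels appear in `outlier_indices`."""
    # Keep only labels that exist in df, so stray indices do not raise.
    to_drop = df.index.intersection(pd.Index(outlier_indices).unique())
    cleaned = df.drop(index=to_drop)
    print(f"Removed {len(to_drop)} of {len(df)} rows")
    return cleaned

# Hypothetical data: the row with label 2 holds an extreme value.
demo = pd.DataFrame({
    "member_unique_id": [1, 2, 3, 4, 5],
    "rx": [10.0, 12.0, 900.0, 11.0, 13.0],
})

cleaned = remove_outliers(demo, [2])
print(cleaned)
```

With the real files, the indices could be read with something like pd.read_csv("sd_outlier_indices.csv").iloc[:, 0]. This only works if the detection step kept the original row labels, for example by not resetting the index after dropping missing values.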

 

Testing Prompts

 
Let’s test how those prompts work. First, download the dataset.

 

// Outlier Detection Prompt

Now, attach the dataset you have to ChatGPT (or the Large Language Model (LLM) of your choice). Paste the prompt to detect outliers after attaching the dataset. Let’s see the output.

 
[Image: ChatGPT output for the outlier detection prompt]
 

The output shows how many outliers each method detected, the average per column, and, as requested, the CSV files containing the row indices of these outliers.

We then ask it to make all CSVs downloadable with this prompt:

Prepare the cleaned CSVs for download

 

Here is the output with links.

 
[Image: ChatGPT output with download links for the CSV files]

 

// Outlier Removal Prompt

This is the final step. Select the method you want to use to remove outliers, then copy the outlier removal prompt. Attach the CSV with this prompt and send it.

 
[Image: ChatGPT output for the outlier removal prompt]
 

We removed the outliers. Now, let’s validate it using Python. The following code will read the cleaned dataset and compare the shapes to show the before-and-after.

cleaned = pd.read_csv("/cleaned_dataset.csv")

print("Before:", sfrs.shape)
print("After :", cleaned.shape)
print("Removed rows:", sfrs.shape[0] - cleaned.shape[0])

 

Here is the output.

 
[Image: output of the shape comparison]
 

This confirms that we removed 791 rows containing outliers, using the Standard Deviation method with ChatGPT.

 

Final Thoughts

 
Removing outliers can not only increase your machine learning model’s efficiency but also make your analysis more robust, because extreme values may distort your results. Where do these outliers come from? They can be simple typing mistakes, or they can be genuine values that are not representative of the wider population, like the height of a 7-foot player such as Shaquille O’Neal.

To remove outliers, you can apply these techniques in Python or go one step further and bring AI into the process with your own prompts. Always be careful, because your dataset might have specifics that AI cannot understand at first glance, like “$” signs.
 
 

Nate Rosidi is a data scientist who also works in product strategy. He is an adjunct professor teaching analytics and the founder of StrataScratch, a platform helping data scientists prepare for their interviews with real interview questions from top companies. Nate writes on the latest trends in the career market, gives interview advice, shares data science projects, and covers everything SQL.


