Monday, August 4, 2025

5 Routine Tasks That ChatGPT Can Handle for Data Scientists

Image by Author | Canva

 

According to the data science report by Anaconda, data scientists spend nearly 60% of their time on cleaning and organizing data. These routine, time-consuming tasks are ideal candidates for ChatGPT to take over.

In this article, we will explore five routine tasks that ChatGPT can handle if you use the right prompts, including cleaning and organizing data. We’ll use a real data project from Gett, a London-based black-cab ride-hailing app similar to Uber, used in their recruitment process, to show how it works in practice.

 

Case Study: Analyzing Failed Ride Orders from Gett

 
In this data project, Gett asks you to analyze failed ride orders by examining key matching metrics to understand why some customers did not successfully get a car.

Here is the data description.

 
Analyzing Failed Ride Orders from Gett
 

Now, let’s explore it by uploading the data to ChatGPT.

In the next five steps, we will walk through the routine tasks that ChatGPT can handle in a data project. The steps are shown below.

 
Analyzing Failed Ride Orders from Gett
 

Step 1: Data Exploration and Analysis

In data exploration, we use the same functions every time, like head, info, or describe.

When we ask ChatGPT, we’ll include the key functions in the prompt. We’ll also paste the project description and attach the dataset.

 
Data Exploration and Analysis
 

We will use the prompt below. Just replace the text inside the square brackets with the project description.

Here is the data project description: [paste here ] 
Perform basic EDA, show head, info, and summary stats, missing values, and correlation heatmap.

 

Here is the output.

 
Data Exploration and Analysis
 

As you can see, ChatGPT summarizes the dataset by highlighting key columns and missing values, and then creates a correlation heatmap to explore relationships.
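If you prefer to run the exploration yourself, the functions the prompt asks for boil down to a few pandas calls. This is a minimal sketch using a toy stand-in for the Gett orders data; in practice you would load the real file (e.g. with `pd.read_csv`), and the column names here are taken from the project description.

```python
import io
import pandas as pd

# Toy stand-in for the Gett orders dataset; replace with
# pd.read_csv("data_orders.csv") for the real project file.
csv = io.StringIO(
    "order_gk,order_status_key,m_order_eta\n"
    "1,4,60\n"
    "2,9,\n"
    "3,4,120\n"
)
df = pd.read_csv(csv)

print(df.head())                    # first rows
df.info()                           # dtypes and non-null counts
print(df.describe())                # summary statistics
print(df.isna().sum())              # missing values per column
print(df.corr(numeric_only=True))   # correlation matrix (heatmap input)
```

These are exactly the outputs ChatGPT produces in the chat, so you can cross-check its summary against your own run.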

 

Step 2: Data Cleaning

Both datasets contain missing values.

 
Data Cleaning
 

Let’s write a prompt to work on this.

Clean this dataset: identify and handle missing values appropriately (e.g., drop or impute based on context). Provide a summary of the cleaning steps.

 

Here is the summary of what ChatGPT did:

 
Data Cleaning
 

ChatGPT converted the date column, dropped invalid orders, and imputed the missing values in the m_order_eta column.
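The same cleaning steps can be reproduced in a few lines of pandas. This is a hedged sketch on a hypothetical slice of the data; the column names follow the Gett description, and the choice between dropping and imputing always depends on context (for instance, m_order_eta is only defined when a driver was assigned).

```python
import pandas as pd

# Hypothetical slice of the orders data, for illustration only.
df = pd.DataFrame({
    "order_datetime": ["18:08:07", "18:08:07", "bad", "12:07:50"],
    "order_status_key": [4, 9, 4, 4],
    "m_order_eta": [60.0, None, 120.0, None],
})

# 1) Convert the time column; unparseable rows become NaT.
df["order_datetime"] = pd.to_datetime(
    df["order_datetime"], format="%H:%M:%S", errors="coerce"
)

# 2) Drop orders with an invalid timestamp.
df = df.dropna(subset=["order_datetime"])

# 3) Impute remaining missing ETAs with the column median.
df["m_order_eta"] = df["m_order_eta"].fillna(df["m_order_eta"].median())

print(df.isna().sum())  # every column should now be complete
```

Asking ChatGPT to "provide a summary of the cleaning steps," as in the prompt above, makes it easy to audit decisions like these.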

 

Step 3: Generate Visualizations

To make the most of your data, it is important to visualize the right things. Instead of generating random plots, we can guide ChatGPT by providing a link to a trusted source, a simple form of Retrieval-Augmented Generation (RAG).

We will use this article. Here is the prompt:

Before generating visualizations, read this article on choosing the right plots for different data types and distributions: [LINK]. Then, show the most suitable visualizations for this dataset, explain why each was selected, and produce the plots in this chat by running code on the dataset.

 

Here is the output.

 
Generate Visualizations
 

We have six different graphs that we produced with ChatGPT.

 
Generate Visualizations
 

For each graph, you will see why it was selected, the plot itself, and an explanation of what it shows.
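The plot-selection principles from the linked article map directly onto code: histograms for numerical distributions, bar plots for categorical counts. Here is a minimal sketch with illustrative values (the real plots should come from the actual dataset), rendered off-screen so it runs anywhere.

```python
import matplotlib
matplotlib.use("Agg")  # render off-screen, no display needed
import matplotlib.pyplot as plt
import pandas as pd

# Illustrative values only; the real project uses the Gett dataset.
df = pd.DataFrame({
    "order_status_key": [4, 4, 9, 9, 4],     # categorical code
    "m_order_eta": [60, 300, 120, 90, 240],  # numerical (seconds)
})

fig, axes = plt.subplots(1, 2, figsize=(8, 3))

# Histogram: the right choice for a numerical distribution.
axes[0].hist(df["m_order_eta"], bins=5)
axes[0].set_title("ETA distribution")

# Bar plot: the right choice for a categorical distribution.
counts = df["order_status_key"].value_counts()
axes[1].bar(counts.index.astype(str), counts.values)
axes[1].set_title("Order status counts")

fig.savefig("eda_plots.png")
```

ChatGPT applies the same mapping when it picks the six plots for this dataset, which is why grounding it in the article improves the output.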

 

Step 4: Make Your Dataset Ready for Machine Learning

Now that we have handled missing values and explored the dataset, the next step is to prepare it for machine learning. This involves steps like encoding categorical variables and scaling numerical features.

Here is our prompt.

Prepare this dataset for machine learning: encode categorical variables, scale numerical features, and return a clean DataFrame ready for modeling. Briefly explain each step.

 

Here is the output.

 
Make your Dataset Ready for Machine Learning
 

Now your features have been scaled and encoded, so your dataset is ready to apply a machine learning model.
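Under the hood, the preparation ChatGPT performs amounts to one-hot encoding plus standard scaling. A minimal sketch, assuming column names from the Gett description and toy values:

```python
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Toy rows; column names follow the Gett data description.
df = pd.DataFrame({
    "is_driver_assigned_key": [0, 1, 1, 0],  # categorical (binary)
    "order_status_key": [4, 9, 4, 9],        # target codes, left as-is
    "m_order_eta": [60.0, 300.0, 120.0, 240.0],
})

# One-hot encode categorical features (drop_first avoids redundancy).
df = pd.get_dummies(df, columns=["is_driver_assigned_key"], drop_first=True)

# Scale numerical features to zero mean and unit variance.
scaler = StandardScaler()
df[["m_order_eta"]] = scaler.fit_transform(df[["m_order_eta"]])

print(df.head())
```

Keeping the target column unencoded, as here, matters because the next step will predict it directly.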

 

Step 5: Applying a Machine Learning Model

Let’s move on to machine learning modeling. We will use the following prompt structure to apply a basic machine learning model.

Use this dataset to predict [target variable]. Apply [model type] and report machine learning evaluation metrics like [accuracy, precision, recall, F1-score]. Use only the 5 most relevant features and explain your modeling steps.

 

Let’s update this prompt based on our project.

Use this dataset to predict order_status_key. Apply a multiclass classification model (e.g., Random Forest), and report evaluation metrics like accuracy, precision, recall, and F1-score. Use only the 5 most relevant features and explain your modeling steps.

 

Now, paste this into the ongoing conversation and review the output.

Here is the output.

 
Applying Machine Learning Model
 

As you can see, the model performed well; perhaps too well. Near-perfect scores on a handful of features often hint at data leakage, so it is worth double-checking which features the model was given.
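The modeling step ChatGPT runs follows a standard scikit-learn recipe. This sketch uses synthetic data (the real run uses the five features selected from the Gett dataset), with a Random Forest and the metrics the prompt asks for:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report

# Synthetic stand-in: 5 features and a two-class target mimicking
# order_status_key; only the first feature carries real signal.
rng = np.random.default_rng(42)
X = rng.normal(size=(300, 5))
y = np.where(X[:, 0] + rng.normal(scale=0.5, size=300) > 0, 4, 9)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42, stratify=y
)

model = RandomForestClassifier(n_estimators=200, random_state=42)
model.fit(X_train, y_train)

# Accuracy, precision, recall, and F1 per class in one report.
print(classification_report(y_test, model.predict(X_test)))
```

Holding out a test split, as above, is also the quickest sanity check against suspiciously perfect scores.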

 

Bonus: Gemini CLI

 
Gemini has launched an open-source agent that you can interact with from your terminal, with a generous free tier (60 model requests per minute and 1,000 requests per day at no charge).

Besides ChatGPT, you can also use Gemini CLI to handle routine data science tasks, such as cleaning, exploration, and even building a dashboard to automate these tasks.

The Gemini CLI provides a straightforward command-line interface. Let’s start by installing it using the command below.

sudo npm install -g @google/gemini-cli

 

After running the command above, open your terminal and type the following to start building with it:

gemini

Once you run the commands above, you’ll see the Gemini CLI as shown in the screenshot below.

 
Gemini CLI
 

Gemini CLI lets you run code, ask questions, or even build apps directly from your terminal. In this case, we will use Gemini CLI to build a Streamlit app that automates everything we’ve done so far: EDA, cleaning, visualization, and modeling.

To build a Streamlit app, we will use a prompt that covers all steps. It’s shown below.

Build a Streamlit app that automates EDA and data cleaning, creates automatic data visualizations, prepares the dataset for machine learning, and applies a machine learning model after the user selects a target variable.

Step 1 – Basic EDA:
• Display .head(), .info(), and .describe()
• Show missing values per column
• Show correlation heatmap of numerical features
Step 2 – Data Cleaning:
• Detect columns with missing values
• Handle missing data appropriately (drop or impute)
• Display a summary of cleaning actions taken
Step 3 – Auto Visualizations
• Before plotting, use these visualization principles:
• Use histograms for numerical distributions
• Use bar plots for categorical distributions
• Use boxplots or violin plots to compare categories
• Use scatter plots for numerical relationships
• Use correlation heatmaps for multicollinearity
• Use line plots for time series (if applicable)
• Generate the most relevant plots for this dataset
• Explain why each plot was chosen
Step 4 – Machine Learning Preparation:
• Encode variables
• Scale numerical features
• Return a clean DataFrame ready for modeling
Step 5 – Apply Machine Learning Model:
• Offer the target variable to the user.
• Apply multiple machine learning models.
• Report evaluation metrics.
Each step should display in a different tab. Run the Streamlit app after you build it.
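The pipeline the prompt describes can be sketched as plain functions, which is roughly what Gemini CLI generates and then wraps in Streamlit tabs. The helper names below are hypothetical, and the demo frame is a toy; the real app operates on whatever dataset the user uploads.

```python
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

def explore(df):
    """Step 1: return missing-value counts as a quick EDA summary."""
    return df.isna().sum()

def clean(df):
    """Step 2: impute numeric NaNs with the column median."""
    out = df.copy()
    num = out.select_dtypes("number").columns
    out[num] = out[num].fillna(out[num].median())
    return out

def prepare(df, target):
    """Step 4: one-hot encode features, leaving the target untouched."""
    X = pd.get_dummies(df.drop(columns=[target]))
    return X, df[target]

def model(X, y):
    """Step 5: fit a baseline classifier and report training accuracy."""
    clf = RandomForestClassifier(random_state=0).fit(X, y)
    return clf.score(X, y)

# Tiny demo run on a toy frame (the app would use the uploaded data).
df = clean(pd.DataFrame({
    "eta": [60.0, None, 120.0, 90.0],
    "assigned": ["yes", "no", "yes", "no"],
    "status": [4, 9, 4, 9],
}))
X, y = prepare(df, "status")
print(model(X, y))
```

In the generated app, each function feeds one tab, and a `st.selectbox` would let the user pick the target column before the modeling step runs.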

 

It will ask for permission when it creates directories or runs code in your terminal.

 
Gemini CLI
 

After a few approval steps, the Streamlit app will be ready, as shown below.

 
Gemini CLI
 

Now, let’s test it.

 
Gemini CLI

 

Final Thoughts

 
In this article, we first used ChatGPT to handle routine tasks such as data cleaning, exploration, and data visualization. Next, we went one step further, using it to prepare our dataset for machine learning and to apply a machine learning model.

Finally, we used Gemini CLI to create a Streamlit dashboard that performs all of these steps with just a click.

To demonstrate all of this, we have used a data project from Gett. Although AI is not yet entirely reliable for every task, you can leverage it to handle routine tasks, saving you a lot of time.
 
 

Nate Rosidi is a data scientist and product strategist. He’s also an adjunct professor teaching analytics, and is the founder of StrataScratch, a platform helping data scientists prepare for their interviews with real interview questions from top companies. Nate writes on the latest trends in the career market, gives interview advice, shares data science projects, and covers everything SQL.


