
# Introduction
When you think about data cleaning, you probably think of spinning up a Jupyter Notebook. But here’s something that might surprise you: some of the most powerful data cleaning can happen right in your terminal, using tools that are already installed on your system.
In this article, you will learn how to use basic command-line utilities to clean, transform, and explore data files. No installations are required; just your terminal and some CSV files.
Before we get started cleaning data at the command line, let’s talk about why this matters:
- Command-line tools are easy to use, fast, and efficient, especially for large files.
- These tools are built into Linux/macOS and are available on Windows.
- They are great for getting a first look at data before loading and analyzing it with Python.
- It is easy to chain commands together in scripts and use these scripts for automation.
Now, let’s start coding!
Note: You can find all the commands in this Bash script on GitHub. To get the most out of this article, I encourage you to open your terminal and code along.
# Setting Up Sample Data
Let’s create a messy CSV file to work with. This simulates real-world data issues you will likely run into.
cat > messy_data.csv << 'EOF'
name,age,salary,department,email
John Lee,32,50000,Engineering,john@example.com
Jane Smith,28,55000,Marketing,jane@example.com
Bob Davis ,35,60000,Engineering,bob@example.com
Alice Williams,29,,Marketing,alice@example.com
Charlie Brown,45,70000,Sales,charlie@example.com
Dave Wilson,31,52000,Engineering,
Emma Davis,,58000,Marketing,emma@example.com
Frank Miller,38,65000,Sales,frank@example.com
John Lee,32,50000,Engineering,john@example.com
Grace Lee,27,51000,Engineering,grace@example.com
EOF
This dataset has several common issues: leading and trailing whitespace, missing values, and duplicate rows. It is great for learning!
# 1. Exploring Your Data with head, tail, and wc
Before cleaning your data, you need to understand what you are working with. Let’s start with the basics.
# See the first 5 rows (including header)
head -n 5 messy_data.csv
# See the last 3 rows
tail -n 3 messy_data.csv
# Count total rows (including header)
wc -l messy_data.csv
Here is what is happening:
- head -n 5 shows the first 5 lines, giving you a quick preview.
- tail -n 3 shows the last 3 lines (useful for checking if data is complete).
- wc -l counts lines; subtract 1 for the header to get your record count.
Output:
name,age,salary,department,email
John Lee,32,50000,Engineering,john@example.com
Jane Smith,28,55000,Marketing,jane@example.com
Bob Davis ,35,60000,Engineering,bob@example.com
Alice Williams,29,,Marketing,alice@example.com
Frank Miller,38,65000,Sales,frank@example.com
John Lee,32,50000,Engineering,john@example.com
Grace Lee,27,51000,Engineering,grace@example.com
11 messy_data.csv
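Since wc -l counts the header too, you can combine tail and wc to count only the data rows (a minimal sketch on the same file):
# Count data rows, excluding the header (prints 10 for this file)
tail -n +2 messy_data.csv | wc -l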
# 2. Viewing Specific Columns with cut
You do not always need to see all the columns. Let’s extract only the names and departments.
cut -d',' -f1,4 messy_data.csv
Breaking it down:
- cut is a tool for extracting sections from each line.
- -d',' sets the delimiter to a comma (for CSV files).
- -f1,4 selects fields (columns) 1 and 4.
- You can also use ranges: -f1-3 selects columns 1 through 3 (see the example after the output below).
Here is the output:
name,department
John Lee,Engineering
Jane Smith,Marketing
Bob Davis ,Engineering
Alice Williams,Marketing
Charlie Brown,Sales
Dave Wilson,Engineering
Emma Davis,Marketing
Frank Miller,Sales
John Lee,Engineering
Grace Lee,Engineering
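As mentioned above, cut also accepts ranges; here is a quick sketch that keeps only the first three columns:
# Select columns 1 through 3 (name, age, salary)
cut -d',' -f1-3 messy_data.csv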
# 3. Removing Duplicate Rows with sort and uniq
Notice that “John Lee” appears twice in our dataset. Let’s fix that.
# Save the header first
head -n 1 messy_data.csv > cleaned_data.csv
# Remove duplicates from the data (excluding header)
tail -n +2 messy_data.csv | sort | uniq >> cleaned_data.csv
Here is what each command does: head -n 1 grabs just the header row, and tail -n +2 gets everything starting from line 2 (skipping the header). sort orders the lines, which matters because uniq only removes adjacent duplicate lines. Finally, >> appends to the file (versus >, which overwrites).
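If you would rather keep the original row order instead of sorting, a common awk idiom drops duplicates as well (a minimal sketch; deduped_in_order.csv is just an example name):
# Keep only the first occurrence of each line, preserving order
# (the header is kept automatically since it appears only once)
awk '!seen[$0]++' messy_data.csv > deduped_in_order.csv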
# 4. Searching and Filtering with grep
Let’s now do some searching and filtering operations. Want to find all engineers or filter out rows with missing data? grep comes in handy for all such tasks.
# Find all engineers
grep "Engineering" messy_data.csv
# Find rows with empty fields (two consecutive commas)
grep ",," messy_data.csv
# Exclude rows with missing data
grep -v ",," messy_data.csv > no_missing.csv
Here, grep "pattern" searches for lines containing that pattern. grep -v inverts the match (shows lines that DO NOT match). This is a quick way to filter out incomplete records, provided the missing value results in a double comma (,,).
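grep can also quantify problems before you fix them; here is a small sketch that counts matches and searches case-insensitively:
# Count how many rows contain an empty middle field
grep -c ",," messy_data.csv

# Case-insensitive search for a department
grep -i "engineering" messy_data.csv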
# 5. Trimming Whitespace with sed
See how the record of “Bob Davis” has extra spaces? Let’s clean that up.
sed 's/^[ \t]*//; s/[ \t]*$//' messy_data.csv > trimmed_data.csv
Now let’s understand the command: sed is a stream editor for text transformation, and s/pattern/replacement/ is its substitution syntax. ^[ \t]* matches spaces/tabs at the start of a line, while [ \t]*$ matches spaces/tabs at the end of a line. The semicolon separates the two operations (trim the line start, then trim the line end). One caveat: the stray space in the “Bob Davis ” row sits just before a comma in the middle of the line, so trimming only the line start and end will not remove it; the field-level trim sketched below takes care of that.
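To also trim spaces around every comma (which is what actually cleans the “Bob Davis ” field), you can add one more substitution; a sketch, with trimmed_fields.csv as an example output name:
# Trim line ends, plus spaces/tabs on either side of each comma
sed 's/^[ \t]*//; s/[ \t]*$//; s/[ \t]*,[ \t]*/,/g' messy_data.csv > trimmed_fields.csv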
# 6. Replacing Values with sed
Sometimes you need to standardize values or fix typos. Let’s try to replace all occurrences of “Engineering” with “Tech”.
# Replace all "Engineering" with "Tech"
sed 's/Engineering/Tech/g' messy_data.csv
Next, let’s fill empty email fields (denoted by a comma at the end of the line) with a default email value.
# Replace empty email fields with "no-email@example.com"
sed 's/,$/,no-email@example.com/' messy_data.csv
Run the above commands and observe the output. I have excluded the output here to avoid being repetitive.
Breaking it down:
- The g flag means “global”: replace all occurrences on each line.
- ,$ matches a comma at the end of a line (indicating an empty last field).
- You can chain multiple replacements with ; between them, as shown in the sketch below.
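Here is what chaining both replacements from this section into a single sed call looks like (a sketch; standardized.csv is just an example output name):
# Standardize the department name and fill empty emails in one pass
sed 's/Engineering/Tech/g; s/,$/,no-email@example.com/' messy_data.csv > standardized.csv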
# 7. Counting and Summarizing with awk
awk is super useful for field-based operations. Let’s do some basic analysis.
# Count records by department
tail -n +2 messy_data.csv | cut -d',' -f4 | sort | uniq -c
# Calculate average age (excluding header and empty values)
tail -n +2 messy_data.csv | awk -F',' '{if($2) sum+=$2; if($2) count++} END {print "Average age:", sum/count}'
In this awk command, -F',' sets the field separator to a comma, and $2 refers to the second field (age). The condition if($2) ensures only non-empty values are processed, while sum += $2 accumulates the total. Finally, the END block executes after all lines are read to calculate and print the average age.
Output:
5 Engineering
3 Marketing
2 Sales
Average age: 33
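awk can also aggregate per group using associative arrays; for example, here is a sketch that computes the average salary per department, skipping rows with an empty salary field:
# Average salary by department (field 3 = salary, field 4 = department)
tail -n +2 messy_data.csv | awk -F',' '$3 != "" {sum[$4]+=$3; count[$4]++} END {for (d in count) print d, sum[d]/count[d]}'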
# 8. Combining Commands with Pipes
You get more useful processing when you chain these command-line tools together.
# Get unique departments, sorted alphabetically
tail -n +2 messy_data.csv | cut -d',' -f4 | sort | uniq
# Find engineers with salary > 55000
tail -n +2 messy_data.csv | grep "Engineering" | awk -F',' '$3 > 55000' | cut -d',' -f1,3
# Count employees per department with counts
tail -n +2 messy_data.csv | cut -d',' -f4 | sort | uniq -c | sort -rn
Here, each | passes the output of one command as input to the next. This lets you build complex data transformations step by step. The last step sorts by count in reverse numerical order (-rn).
This outputs:
Engineering
Marketing
Sales
Bob Davis ,60000
5 Engineering
3 Marketing
2 Sales
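Pipes also make quick “top N” queries easy; here is a sketch that lists the three highest salaries along with the matching names:
# Three highest salaries (sort numerically on field 3, descending)
tail -n +2 messy_data.csv | sort -t',' -k3 -rn | cut -d',' -f1,3 | head -n 3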
# 9. Converting Data Formats
Sometimes you need to work with different delimiters. Here, we convert the file to use tabs instead of commas, and also add a new column along the way.
# Convert CSV to TSV (tab-separated)
sed 's/,/\t/g' messy_data.csv > data.tsv
# Add a new column with a fixed value
awk -F',' 'BEGIN{OFS=","} {print $0, "2024"}' messy_data.csv > data_with_year.csv
In this awk command, BEGIN{OFS=","} sets the output field separator to a comma. $0 represents the entire input line, and print $0, "2024" appends “2024” as a new column to each line of output.
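Going the other way is just as easy: tr translates one character into another, so converting the TSV back to CSV is a one-liner (a sketch; data_back.csv is just an example name):
# Convert TSV back to CSV
tr '\t' ',' < data.tsv > data_back.csv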
# 10. A Complete Cleaning Pipeline
Let’s put it all together into one useful command that cleans our messy data:
# Save header
head -n 1 messy_data.csv > final_clean.csv
# Clean the data: remove duplicates, trim whitespace, exclude missing values
tail -n +2 messy_data.csv | \
sed 's/^[ \t]*//; s/[ \t]*$//' | \
grep -v ",," | \
sort | \
uniq >> final_clean.csv
echo "Cleaning complete! Check final_clean.csv"
This pipeline first saves the header to preserve the column names, then skips it while processing the data rows. It trims leading and trailing whitespace from each line, removes rows that contain empty middle fields (double commas), sorts the data, and eliminates duplicate entries. Finally, it appends the cleaned data to the output file. One caveat: a row whose last field is empty ends in a single trailing comma rather than a double comma, so the grep -v ",," filter does not catch it; adding a second pattern (i.e. grep -v -e ",," -e ",$") would handle that case as well.
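If you want to reuse this pipeline on other files, you could wrap it in a small script that takes the input and output filenames as arguments (a sketch; clean_csv.sh and the variable names are hypothetical):
#!/usr/bin/env bash
# Usage: ./clean_csv.sh input.csv output.csv
in_file="$1"
out_file="$2"

# Preserve the header, then clean and deduplicate the data rows
head -n 1 "$in_file" > "$out_file"
tail -n +2 "$in_file" | \
  sed 's/^[ \t]*//; s/[ \t]*$//' | \
  grep -v ",," | \
  sort | \
  uniq >> "$out_file"

echo "Cleaning complete! Check $out_file"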
# Conclusion
Command-line data cleaning is a powerful yet underrated skill for data scientists. These tools are fast and reliable. While you will still use Python for complex analysis, mastering these basics will make you more efficient and give you options when Python isn’t ideal.
The best part is that these skills transfer to data engineering, DevOps, and system administration roles. Learning to manipulate data at the command line makes you a more versatile developer.
Start practicing with your own datasets, and you will be surprised how often you reach for these tools instead of spinning up Python notebooks. Happy data cleaning!
Bala Priya C is a developer and technical writer from India. She likes working at the intersection of math, programming, data science, and content creation. Her areas of interest and expertise include DevOps, data science, and natural language processing. She enjoys reading, writing, coding, and coffee! Currently, she’s working on learning and sharing her knowledge with the developer community by authoring tutorials, how-to guides, opinion pieces, and more. Bala also creates engaging resource overviews and coding tutorials.

