
# Introduction
When you think about data cleaning, you probably think of spinning up a Jupyter Notebook. But here’s something that might surprise you: some of the most powerful data cleaning can happen right in your terminal, using tools that are already installed on your system.
In this article, you will learn how to use basic command-line utilities to clean, transform, and explore data files. No installations are required; just your terminal and some CSV files.
Before we get started cleaning data at the command line, let’s talk about why this matters:
- Command-line tools are easy to use, fast, and efficient, especially for large files.
- These tools are built into Linux/macOS and are available on Windows.
- They are great for getting a first look at data before loading and analyzing it with Python.
- It is easy to chain commands together in scripts and use these scripts for automation.
Now, let’s start coding!
Note: You can find all the commands in this Bash script on GitHub. To get the most out of this article, I encourage you to open your terminal and code along.
# Setting Up Sample Data
Let’s create a messy CSV file to work with. This simulates real-world data issues you will likely run into.
cat > messy_data.csv << 'EOF'
name,age,salary,department,email
John Lee,32,50000,Engineering,john@example.com
Jane Smith,28,55000,Marketing,jane@example.com
Bob Davis ,35,60000,Engineering,bob@example.com
Alice Williams,29,,Marketing,alice@example.com
Charlie Brown,45,70000,Sales,charlie@example.com
Dave Wilson,31,52000,Engineering,
Emma Davis,,58000,Marketing,emma@example.com
Frank Miller,38,65000,Sales,frank@example.com
John Lee,32,50000,Engineering,john@example.com
Grace Lee,27,51000,Engineering,grace@example.com
EOF
This dataset has several common issues: leading and trailing whitespace, missing values, and duplicate rows. It is great for learning!
# 1. Exploring Your Data with head, tail, and wc
Before cleaning your data, you need to understand what you are working with. Let’s start with the basics.
# See the first 5 rows (including header)
head -n 5 messy_data.csv
# See the last 3 rows
tail -n 3 messy_data.csv
# Count total rows (including header)
wc -l messy_data.csv
Here is what is happening:
- head -n 5 shows the first 5 lines, giving you a quick preview.
- tail -n 3 shows the last 3 lines (useful for checking if data is complete).
- wc -l counts lines; subtract 1 for the header to get your record count.
Output:
name,age,salary,department,email
John Lee,32,50000,Engineering,john@example.com
Jane Smith,28,55000,Marketing,jane@example.com
Bob Davis ,35,60000,Engineering,bob@example.com
Alice Williams,29,,Marketing,alice@example.com
Frank Miller,38,65000,Sales,frank@example.com
John Lee,32,50000,Engineering,john@example.com
Grace Lee,27,51000,Engineering,grace@example.com
11 messy_data.csv
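Since wc -l counts the header too, you can combine tail and wc to count only the data rows (a minimal sketch on the same file):
# Count data rows, excluding the header (prints 10 for this file)
tail -n +2 messy_data.csv | wc -l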
# 2. Viewing Specific Columns with cut
You do not always need to see all the columns. Let’s extract only the names and departments.
cut -d',' -f1,4 messy_data.csv
Breaking it down:
- cut is a tool for extracting sections from each line.
- -d',' sets the delimiter to a comma (for CSV files).
- -f1,4 selects fields (columns) 1 and 4.
- You can also use ranges: -f1-3 selects columns 1 through 3 (see the example after the output below).
Here is the output:
name,department
John Lee,Engineering
Jane Smith,Marketing
Bob Davis ,Engineering
Alice Williams,Marketing
Charlie Brown,Sales
Dave Wilson,Engineering
Emma Davis,Marketing
Frank Miller,Sales
John Lee,Engineering
Grace Lee,Engineering
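As mentioned above, cut also accepts ranges; here is a quick sketch that keeps only the first three columns:
# Select columns 1 through 3 (name, age, salary)
cut -d',' -f1-3 messy_data.csv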
# 3. Removing Duplicate Rows with sort and uniq
Notice that “John Lee” appears twice in our dataset. Let’s fix that.
# Save the header first
head -n 1 messy_data.csv > cleaned_data.csv
# Remove duplicates from the data (excluding header)
tail -n +2 messy_data.csv | sort | uniq >> cleaned_data.csv
Here is what each command does: head -n 1 grabs just the header row, and tail -n +2 gets everything starting from line 2 (skipping the header). sort orders the lines, which matters because uniq only removes adjacent duplicate lines. Finally, >> appends to the file (versus >, which overwrites).
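If you would rather keep the original row order instead of sorting, a common awk idiom drops duplicates as well (a minimal sketch; deduped_in_order.csv is just an example name):
# Keep only the first occurrence of each line, preserving order
# (the header is kept automatically since it appears only once)
awk '!seen[$0]++' messy_data.csv > deduped_in_order.csv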
# 4. Searching and Filtering with grep
Let’s now do some searching and filtering operations. Want to find all engineers or filter out rows with missing data? grep comes in handy for all such tasks.
# Find all engineers
grep "Engineering" messy_data.csv
# Find rows with empty fields (two consecutive commas)
grep ",," messy_data.csv
# Exclude rows with missing data
grep -v ",," messy_data.csv > no_missing.csv
Here, grep "pattern" searches for lines containing that pattern. grep -v inverts the match (shows lines that DO NOT match). This is a quick way to filter out incomplete records, provided the missing value results in a double comma (,,).
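grep can also quantify problems before you fix them; here is a small sketch that counts matches and searches case-insensitively:
# Count how many rows contain an empty middle field
grep -c ",," messy_data.csv

# Case-insensitive search for a department
grep -i "engineering" messy_data.csv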
# 5. Trimming Whitespace with sed
See how the record of “Bob Davis” has extra spaces? Let’s clean that up.
sed 's/^[ \t]*//; s/[ \t]*$//' messy_data.csv > trimmed_data.csv
Now let’s understand the command: sed is a stream editor for text transformation, and s/pattern/replacement/ is its substitution syntax. ^[ \t]* matches spaces/tabs at the start of a line, while [ \t]*$ matches spaces/tabs at the end of a line. The semicolon separates the two operations (trim the line start, then trim the line end). One caveat: the stray space in the “Bob Davis ” row sits just before a comma in the middle of the line, so trimming only the line start and end will not remove it; the field-level trim sketched below takes care of that.
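To also trim spaces around every comma (which is what actually cleans the “Bob Davis ” field), you can add one more substitution; a sketch, with trimmed_fields.csv as an example output name:
# Trim line ends, plus spaces/tabs on either side of each comma
sed 's/^[ \t]*//; s/[ \t]*$//; s/[ \t]*,[ \t]*/,/g' messy_data.csv > trimmed_fields.csv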
# 6. Replacing Values with sed
Sometimes you need to standardize values or fix typos. Let’s try to replace all occurrences of “Engineering” with “Tech”.
# Replace all "Engineering" with "Tech"
sed 's/Engineering/Tech/g' messy_data.csv
Next, let’s fill empty email fields (denoted by a comma at the end of the line) with a default email value.
# Replace empty email fields with "no-email@example.com"
sed 's/,$/,no-email@example.com/' messy_data.csv
Run the above commands and observe the output. I have excluded the output here to avoid being repetitive.
Breaking it down:
- The g flag means “global”: replace all occurrences on each line.
- ,$ matches a comma at the end of a line (indicating an empty last field).
- You can chain multiple replacements with ; between them, as shown in the sketch below.
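Here is what chaining both replacements from this section into a single sed call looks like (a sketch; standardized.csv is just an example output name):
# Standardize the department name and fill empty emails in one pass
sed 's/Engineering/Tech/g; s/,$/,no-email@example.com/' messy_data.csv > standardized.csv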
# 7. Counting and Summarizing with awk
awk is super useful for field-based operations. Let’s do some basic analysis.
# Count records by department
tail -n +2 messy_data.csv | cut -d',' -f4 | sort | uniq -c
# Calculate average age (excluding header and empty values)
tail -n +2 messy_data.csv | awk -F',' '{if($2) sum+=$2; if($2) count++} END {print "Average age:", sum/count}'
In this awk command, -F',' sets the field separator to a comma, and $2 refers to the second field (age). The condition if($2) ensures only non-empty values are processed, while sum += $2 accumulates the total. Finally, the END block executes after all lines are read to calculate and print the average age.
Output:
5 Engineering
3 Marketing
2 Sales
Average age: 33
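awk can also aggregate per group using associative arrays; for example, here is a sketch that computes the average salary per department, skipping rows with an empty salary field:
# Average salary by department (field 3 = salary, field 4 = department)
tail -n +2 messy_data.csv | awk -F',' '$3 != "" {sum[$4]+=$3; count[$4]++} END {for (d in count) print d, sum[d]/count[d]}'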
# 8. Combining Commands with Pipes
You get more useful processing when you chain these command-line tools together.
# Get unique departments, sorted alphabetically
tail -n +2 messy_data.csv | cut -d',' -f4 | sort | uniq
# Find engineers with salary > 55000
tail -n +2 messy_data.csv | grep "Engineering" | awk -F',' '$3 > 55000' | cut -d',' -f1,3
# Count employees per department with counts
tail -n +2 messy_data.csv | cut -d',' -f4 | sort | uniq -c | sort -rn
Here, each | passes the output of one command as input to the next. This lets you build complex data transformations step by step. The last step sorts by count in reverse numerical order (-rn).
This outputs:
Engineering
Marketing
Sales
Bob Davis ,60000
5 Engineering
3 Marketing
2 Sales
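Pipes also make quick “top N” queries easy; here is a sketch that lists the three highest salaries along with the matching names:
# Three highest salaries (sort numerically on field 3, descending)
tail -n +2 messy_data.csv | sort -t',' -k3 -rn | cut -d',' -f1,3 | head -n 3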
# 9. Converting Data Formats
Sometimes you need to work with different delimiters. Here, we convert the file to use tabs instead of commas, and also add a new column along the way.
# Convert CSV to TSV (tab-separated)
sed 's/,/\t/g' messy_data.csv > data.tsv
# Add a new column with a fixed value
awk -F',' 'BEGIN{OFS=","} {print $0, "2024"}' messy_data.csv > data_with_year.csv
In this awk command, BEGIN{OFS=","} sets the output field separator to a comma. $0 represents the entire input line, and print $0, "2024" appends “2024” as a new column to each line of output.
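Going the other way is just as easy: tr translates one character into another, so converting the TSV back to CSV is a one-liner (a sketch; data_back.csv is just an example name):
# Convert TSV back to CSV
tr '\t' ',' < data.tsv > data_back.csv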
# 10. A Complete Cleaning Pipeline
Let’s put it all together into one useful command that cleans our messy data:
# Save header
head -n 1 messy_data.csv > final_clean.csv
# Clean the data: remove duplicates, trim whitespace, exclude missing values
tail -n +2 messy_data.csv | \
sed 's/^[ \t]*//; s/[ \t]*$//' | \
grep -v ",," | \
sort | \
uniq >> final_clean.csv
echo "Cleaning complete! Check final_clean.csv"
This pipeline first saves the header to preserve the column names, then skips it while processing the data rows. It trims leading and trailing whitespace from each line, removes rows that contain empty middle fields (double commas), sorts the data, and eliminates duplicate entries. Finally, it appends the cleaned data to the output file. One caveat: a row whose last field is empty ends in a single trailing comma rather than a double comma, so the grep -v ",," filter does not catch it; adding a second pattern (i.e. grep -v -e ",," -e ",$") would handle that case as well.
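If you want to reuse this pipeline on other files, you could wrap it in a small script that takes the input and output filenames as arguments (a sketch; clean_csv.sh and the variable names are hypothetical):
#!/usr/bin/env bash
# Usage: ./clean_csv.sh input.csv output.csv
in_file="$1"
out_file="$2"

# Preserve the header, then clean and deduplicate the data rows
head -n 1 "$in_file" > "$out_file"
tail -n +2 "$in_file" | \
  sed 's/^[ \t]*//; s/[ \t]*$//' | \
  grep -v ",," | \
  sort | \
  uniq >> "$out_file"

echo "Cleaning complete! Check $out_file"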
# Conclusion
Command-line data cleaning is a powerful yet underrated skill for data scientists. These tools are fast and reliable. While you will still use Python for complex analysis, mastering these basics will make you more efficient and give you options when Python isn’t ideal.
The best part is that these skills transfer to data engineering, DevOps, and system administration roles. Learning to manipulate data at the command line makes you a more versatile developer.
Start practicing with your own datasets, and you will be surprised how often you reach for these tools instead of spinning up Python notebooks. Happy data cleaning!
Bala Priya C is a developer and technical writer from India. She likes working at the intersection of math, programming, data science, and content creation. Her areas of interest and expertise include DevOps, data science, and natural language processing. She enjoys reading, writing, coding, and coffee! Currently, she’s working on learning and sharing her knowledge with the developer community by authoring tutorials, how-to guides, opinion pieces, and more. Bala also creates engaging resource overviews and coding tutorials.

