Text Processing¶
Linux provides a rich set of tools for searching, transforming, and analyzing text. These commands are designed to work with streams and pipelines, making them composable building blocks for data processing.
grep¶
grep searches for lines matching a pattern. It's one of the most frequently used commands on any Linux system.
Basic Usage¶
grep "error" /var/log/syslog # lines containing "error"
grep "error" file1.txt file2.txt # search multiple files
grep "error" *.log # search files matching a glob
Common Options¶
| Option | Effect |
|---|---|
| -i | Case-insensitive search |
| -v | Invert match (lines that don't contain the pattern) |
| -c | Print count of matching lines instead of the lines |
| -n | Show line numbers |
| -l | Print only filenames containing a match |
| -L | Print only filenames NOT containing a match |
| -o | Print only the matched portion of each line |
| -w | Match whole words only |
| -x | Match whole lines only |
| -r | Recursive search through directories |
| -F | Treat pattern as a fixed string (no regex) |
| -E | Use extended regular expressions |
| -P | Use Perl-compatible regular expressions |
The most used flags in practice:
- -r to search an entire project tree
- -l when you just need to know which files contain something, not the matching lines themselves
- -o to extract just the matched text (useful for pulling values out of structured data)
- -c to count matches per file without seeing the actual lines
- -F when your search string contains regex metacharacters like . or * and you want to search for them literally (see the example below)
- -P when you need advanced features like lookahead or non-greedy matching that basic and extended regex can't do
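For instance, -F makes a pattern full of metacharacters match literally (notes.txt is just a placeholder file):
grep -F "3.14*2" notes.txt # finds the literal string; . and * are not treated as regex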
Context Lines¶
Show lines surrounding each match:
grep -A 3 "error" log.txt # 3 lines After each match
grep -B 2 "error" log.txt # 2 lines Before each match
grep -C 2 "error" log.txt # 2 lines of Context (before and after)
Recursive Search¶
grep -r "TODO" src/ # search all files under src/
grep -rn "TODO" src/ # same, with line numbers
grep -r --include="*.py" "import" . # only search .py files
grep -r --exclude-dir=".git" "pattern" . # skip .git directory
Practical Examples¶
# Count occurrences in each file
grep -c "function" *.js
# Find files that contain a pattern
grep -rl "deprecated" src/
# Extract email addresses
grep -oE '[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}' contacts.txt
# Show processes matching a name (excluding the grep itself)
ps aux | grep "[n]ginx"
Regular Expressions¶
Regular expressions (regex) describe search patterns. grep supports three flavors.
Basic Regular Expressions (BRE)¶
This is grep's default mode. Some metacharacters require backslash escaping.
| Pattern | Matches |
|---|---|
| . | Any single character |
| ^ | Start of line |
| $ | End of line |
| * | Zero or more of the preceding element |
| \+ | One or more of the preceding element |
| \? | Zero or one of the preceding element |
| \{n\} | Exactly n of the preceding element |
| \{n,m\} | Between n and m of the preceding element |
| [abc] | Any one character in the set |
| [^abc] | Any one character NOT in the set |
| [a-z] | Any character in the range |
| \(group\) | Capture group |
| \1 | Backreference to first group |
Extended Regular Expressions (ERE)¶
Use grep -E. Metacharacters work without backslashes. (egrep is deprecated - use grep -E instead.)
| Pattern | Matches |
|---|---|
| + | One or more |
| ? | Zero or one |
| {n,m} | Repetition |
| (group) | Capture group |
| \| | Alternation (no backslash needed in ERE) |
POSIX Character Classes¶
These work inside bracket expressions and are locale-aware:
| Class | Equivalent | Matches |
|---|---|---|
| [:digit:] | [0-9] | Digits |
| [:alpha:] | [a-zA-Z] | Letters |
| [:alnum:] | [a-zA-Z0-9] | Letters and digits |
| [:upper:] | [A-Z] | Uppercase letters |
| [:lower:] | [a-z] | Lowercase letters |
| [:space:] | [ \t\n\r\f\v] | Whitespace |
| [:blank:] | [ \t] | Space and tab |
| [:punct:] | | Punctuation characters |
Note the double brackets - the outer [] is the bracket expression, the inner [:class:] is the class name:
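grep '[[:digit:]]' file.txt # lines containing at least one digit
grep '^[[:upper:]]' file.txt # lines starting with an uppercase letter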
POSIX character classes exist because of locale awareness. The range [a-z] doesn't always mean what you'd expect - in some locales (like en_US.UTF-8), the sort order interleaves upper and lowercase, so [a-z] can match uppercase letters too. [:lower:] always means 'lowercase letters in the current locale,' regardless of sorting rules. If you're writing scripts that will run on systems with different locale settings, POSIX classes are the safe choice. For quick interactive use where you know your locale, [a-z] is fine.
Backreferences¶
Capture a group and match it again later:
# Find repeated words (GNU grep - \b, \w, and \1 in ERE are GNU extensions)
grep -E '\b(\w+)\s+\1\b' document.txt
# Find lines where the first and last word match (\+ in BRE is a GNU extension)
grep '^\([[:alpha:]]\+\).*\1$' file.txt
Perl-Compatible Regular Expressions¶
Use grep -P for Perl-compatible regex (PCRE), which adds shorthand classes and advanced features not available in BRE or ERE:
# Shorthand character classes
grep -P '\d{3}-\d{4}' phones.txt # \d = digit (like [0-9])
grep -P '\bword\b' file.txt # \b = word boundary
grep -P '\s+' file.txt # \s = whitespace
# Lookahead and lookbehind
grep -P '(?<=price: )\d+' catalog.txt # match digits preceded by "price: "
grep -P '\d+(?= dollars)' catalog.txt # match digits followed by " dollars"
# Non-greedy matching
grep -oP '".*?"' data.json # match shortest quoted strings
-P is a GNU grep extension and may not be available on all systems (notably macOS, where you can install GNU grep via Homebrew as ggrep).
sed¶
sed (stream editor) transforms text line by line. It reads input, applies editing commands, and writes the result.
Substitute¶
The most common sed operation is substitution:
sed 's/old/new/' file.txt # replace first occurrence on each line
sed 's/old/new/g' file.txt # replace all occurrences on each line
sed 's/old/new/gI' file.txt # case-insensitive, all occurrences (GNU sed only)
sed 's/old/new/3' file.txt # replace only the 3rd occurrence on each line
The delimiter doesn't have to be /. Use any character to avoid escaping:
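sed 's|/usr/local|/opt|g' paths.txt # | as delimiter - no need to escape the slashes
sed 's#http://#https://#' urls.txt # # works just as well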
Addresses¶
Sed commands can target specific lines:
sed '5s/old/new/' file.txt # only line 5
sed '1,10s/old/new/' file.txt # lines 1 through 10
sed '/pattern/s/old/new/' file.txt # lines matching pattern
sed '1,/END/s/old/new/' file.txt # line 1 through first line matching END
Delete¶
sed '5d' file.txt # delete line 5
sed '/^#/d' file.txt # delete comment lines
sed '/^$/d' file.txt # delete empty lines
sed '1,5d' file.txt # delete lines 1-5
Insert and Append¶
sed '3i\New line before line 3' file.txt # insert before line 3
sed '3a\New line after line 3' file.txt # append after line 3
sed '/pattern/a\Added after matching line' file.txt
Print¶
sed -n '5p' file.txt # print only line 5
sed -n '10,20p' file.txt # print lines 10-20
sed -n '/start/,/end/p' file.txt # print between patterns (inclusive)
The -n flag suppresses default output, so only explicitly printed lines appear.
In-Place Editing¶
sed -i 's/old/new/g' file.txt # edit file directly (GNU sed)
sed -i '' 's/old/new/g' file.txt # macOS/BSD sed (a backup suffix argument is required; '' means no backup)
sed -i.bak 's/old/new/g' file.txt # create backup before editing
Extended Regex¶
Use -E for extended regex (same as grep -E):
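sed -E 's/(foo|bar)+/baz/g' file.txt # groups and + without backslashes
sed -E 's/[0-9]{4}-[0-9]{2}-[0-9]{2}/DATE/g' log.txt # mask ISO dates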
Multiple Commands¶
sed -e 's/foo/bar/g' -e 's/baz/qux/g' file.txt
# Or use a semicolon
sed 's/foo/bar/g; s/baz/qux/g' file.txt
Practical Examples¶
# Remove trailing whitespace
sed 's/[[:space:]]*$//' file.txt
# Add a prefix to every line
sed 's/^/PREFIX: /' file.txt
# Extract text between markers
sed -n '/BEGIN/,/END/p' file.txt
# Replace the 2nd line
sed '2c\This replaces the second line' file.txt
# Number all non-empty lines
sed '/./=' file.txt | sed 'N; s/\n/ /'
awk¶
awk is a pattern-scanning and text-processing language. It excels at working with structured, column-based data.
Basic Structure¶
An awk program is a series of pattern-action pairs:
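# General form: pattern { action }
awk '/error/ { print $1 }' log.txt # print field 1 of lines matching /error/
awk '/error/' log.txt # pattern only: print matching lines
awk '{ print $1 }' log.txt # action only: runs on every line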
If no pattern is given, the action runs on every line. If no action is given, the matching line is printed.
Fields¶
Awk automatically splits each line into fields. By default, the separator is whitespace.
| Variable | Meaning |
|---|---|
| $0 | The entire line |
| $1, $2, ... | Individual fields |
| $NF | The last field |
| NF | Number of fields on the current line |
| NR | Current line number (across all files) |
| FNR | Line number in the current file |
# Print the second column
awk '{ print $2 }' data.txt
# Print the last column
awk '{ print $NF }' data.txt
# Print line number and first field
awk '{ print NR, $1 }' data.txt
Field Separator¶
Use -F to set the field separator:
awk -F: '{ print $1, $7 }' /etc/passwd # username and shell
awk -F, '{ print $2 }' data.csv # second column of CSV
awk -F'\t' '{ print $3 }' data.tsv # tab-separated
Patterns¶
# Lines matching a regex
awk '/error/' log.txt
# Lines where a field matches
awk '$3 > 100' data.txt
# Lines where a specific field matches a pattern
awk '$1 ~ /^server/' config.txt
# Lines where a field does NOT match
awk '$1 !~ /^#/' config.txt
# Range patterns
awk '/START/,/END/' file.txt
BEGIN and END¶
BEGIN runs before processing any input. END runs after all input is processed.
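For example:
awk 'BEGIN { print "Summing column 2..." } { sum += $2 } END { print "Total:", sum }' data.txt
awk 'END { print NR }' file.txt # count lines (like wc -l)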
Built-in Functions and printf¶
# String length
awk '{ print length($0) }' file.txt
# Substring
awk '{ print substr($1, 1, 3) }' file.txt
# Formatted output
awk '{ printf "%-20s %10.2f\n", $1, $3 }' data.txt
Variables and Arithmetic¶
# Sum a column
awk '{ sum += $3 } END { print sum }' data.txt
# Average
awk '{ sum += $3; count++ } END { print sum/count }' data.txt
# Track maximum
awk '$3 > max { max = $3; line = $0 } END { print line }' data.txt
Practical Examples¶
# Print lines longer than 80 characters
awk 'length > 80' file.txt
# Print unique values in column 1
awk '!seen[$1]++' data.txt
# Swap first two columns
awk '{ temp = $1; $1 = $2; $2 = temp; print }' data.txt
# Sum file sizes from ls -l
ls -l | awk 'NR > 1 { sum += $5 } END { print sum }'
# Process /etc/passwd: list users with bash shell
awk -F: '$7 == "/bin/bash" { print $1 }' /etc/passwd
# Print every other line
awk 'NR % 2 == 1' file.txt
cut¶
cut extracts specific columns or character positions from each line.
By Field¶
cut -f1 data.tsv # first field (tab-delimited by default)
cut -f1,3 data.tsv # first and third fields
cut -f2- data.tsv # second field through end
cut -d',' -f1,3 data.csv # comma-delimited, fields 1 and 3
cut -d: -f1,7 /etc/passwd # colon-delimited, fields 1 and 7
By Character Position¶
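Select a fixed range of characters with -c:
cut -c1-5 file.txt # characters 1 through 5 of each line
cut -c1,3,5 file.txt # characters 1, 3, and 5
cut -c10- file.txt # character 10 through end of line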
cut is fast but limited. For complex field extraction, use awk.
When to use cut vs awk: cut is simpler and faster for extracting fixed-position fields from cleanly delimited data. Use awk when you need conditional logic, multiple delimiters, or field reordering. One gotcha: cut defaults to tab as its delimiter, not spaces. If your data is space-separated, you need -d' ', but cut can't handle multiple consecutive spaces as a single delimiter the way awk does. Also, cut has no way to handle quoted CSV fields - a field containing a comma inside quotes will be split incorrectly. For anything beyond simple TSV or single-character-delimited data, use awk.
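A quick illustration of the consecutive-space gotcha:
printf 'a   b\n' | cut -d' ' -f2 # prints an empty field - every space starts a new field
printf 'a   b\n' | awk '{ print $2 }' # prints "b" - runs of whitespace count as one separator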
sort¶
sort orders lines of text.
sort file.txt # alphabetical sort
sort -n file.txt # numeric sort
sort -r file.txt # reverse order
sort -u file.txt # remove duplicate lines
sort -k2 file.txt # sort by second field
sort -k2,2n file.txt # sort by second field, numerically
sort -k3,3 -k1,1 file.txt # sort by field 3, then field 1 as tiebreaker
sort -t, -k2 data.csv # comma-delimited, sort by field 2
sort -s -k2 file.txt # stable sort (preserves original order for equal elements)
sort -h file.txt # human-numeric sort (handles 1K, 2M, 3G)
Practical Examples¶
# Sort du output by size
du -sh * | sort -h
# Sort /etc/passwd by UID
sort -t: -k3 -n /etc/passwd
# Find the 10 largest files
du -a /var/log | sort -rn | head -10
The -s (stable sort) flag preserves the original order of lines that compare as equal. This matters when sorting by multiple keys in separate passes - without stable sort, the second sort might scramble the ordering you established in the first. For example, to sort a list of employees by department and then by name within each department, a stable sort on name followed by a stable sort on department gives you the right result.
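A sketch of that two-pass approach, assuming a hypothetical employees.csv with the department in field 1 and the name in field 2:
sort -t, -k2,2 employees.csv | sort -t, -s -k1,1 # by department, names stay ordered within each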
Human-numeric sort (-h) understands unit suffixes, which makes it essential for sorting du output:
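du -sh * | sort -h # directory sizes, smallest to largest
printf '2M\n100K\n512K\n' | sort -n # 2M, 100K, 512K - wrong order
printf '2M\n100K\n512K\n' | sort -h # 100K, 512K, 2M - correct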
Without -h, sort -n would rank '2M' below '100K' because it compares only the leading numbers (2 vs. 100) and ignores the unit suffixes. With -h, it correctly understands that 2M is larger than 100K.
uniq¶
uniq filters adjacent duplicate lines. Input must be sorted first (or use sort -u instead).
sort file.txt | uniq # remove duplicates
sort file.txt | uniq -c # count occurrences
sort file.txt | uniq -d # show only duplicated lines
sort file.txt | uniq -u # show only unique lines (no duplicates)
Practical Examples¶
# Top 10 most common lines
sort access.log | uniq -c | sort -rn | head -10
# Count unique IP addresses
awk '{ print $1 }' access.log | sort | uniq -c | sort -rn
Why must input be sorted? uniq only compares each line against the one directly before it. If duplicate lines appear in different parts of the file with other lines between them, uniq won't detect them. Sorting brings identical lines together so uniq can find them. If sorting would destroy meaningful ordering, use sort -u (which deduplicates as it sorts) or awk '!seen[$0]++' (which deduplicates while preserving original order).
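To see the adjacency rule in action:
printf 'alpha\nbeta\nalpha\n' | uniq # alpha, beta, alpha - the duplicates aren't adjacent, so nothing is removed
printf 'alpha\nbeta\nalpha\n' | sort | uniq # alpha, beta
printf 'alpha\nbeta\nalpha\n' | awk '!seen[$0]++' # alpha, beta - original order preserved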
tr¶
tr translates, squeezes, or deletes characters. It reads from STDIN only (no file arguments).
Translate¶
echo "hello" | tr 'a-z' 'A-Z' # HELLO
echo "hello" | tr 'aeiou' '*' # h*ll*
cat file.txt | tr '\t' ' ' # tabs to spaces
Squeeze¶
Replace repeated characters with a single instance:
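echo "too    many    spaces" | tr -s ' ' # too many spaces
tr -s '\n' < file.txt # squeeze runs of newlines (removes blank lines)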
Delete¶
echo "Hello, World! 123" | tr -d '[:digit:]' # Hello, World!
echo "Hello, World! 123" | tr -d '[:punct:]' # Hello World 123
Character Classes¶
tr '[:lower:]' '[:upper:]' < file.txt # uppercase everything
tr -d '[:space:]' < file.txt # remove all whitespace
tr -dc '[:print:]' < file.txt # keep only printable characters
tr works on a one-to-one character mapping model. Each character in SET1 maps to the character at the same position in SET2. If SET1 is longer than SET2, the last character of SET2 is repeated to fill the gap. That's why tr 'aeiou' '*' replaces all five vowels with * - the single * gets repeated for each remaining position.
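The padding is easy to see when SET1 is longer than SET2:
echo "abcdef" | tr 'abcd' 'xy' # xyyyef - y fills in for c and d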
Practical Examples¶
# Convert Windows line endings to Unix
tr -d '\r' < windows.txt > unix.txt
# Generate a random password
tr -dc 'A-Za-z0-9' < /dev/urandom | head -c 32; echo
# Remove all non-alphanumeric characters
tr -dc '[:alnum:]\n' < file.txt
wc¶
wc (word count) counts lines, words, and characters.
wc file.txt # lines, words, characters (all three)
wc -l file.txt # line count only
wc -w file.txt # word count only
wc -c file.txt # byte count
wc -m file.txt # character count (handles multibyte)
# Count files in a directory
ls | wc -l
# Count matching lines
grep -c "error" log.txt # more efficient than grep | wc -l
The difference between -c (bytes) and -m (characters) matters with multibyte encodings like UTF-8. An ASCII file will show the same count for both, but a file containing non-ASCII characters (accented letters, emoji, CJK characters) will have more bytes than characters. Use -c when you care about file size on disk, and -m when you care about the number of human-readable characters.
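For example, in a UTF-8 locale:
echo "café" | wc -c # 6 - é takes two bytes, plus the trailing newline
echo "café" | wc -m # 5 - five characters, including the newline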
head and tail¶
head¶
head prints the first N lines of a file (default 10):
head file.txt # first 10 lines
head -n 5 file.txt # first 5 lines
head -n -5 file.txt # all lines EXCEPT the last 5
head -c 100 file.txt # first 100 bytes
tail¶
tail prints the last N lines (default 10):
tail file.txt # last 10 lines
tail -n 5 file.txt # last 5 lines
tail -n +5 file.txt # from line 5 to end
The + prefix reverses tail's logic. Instead of 'show the last N lines,' tail -n +N means 'start from line N.' This is useful for skipping headers: tail -n +2 data.csv gives you everything except the first line.
Following Files¶
tail -f watches a file for new content in real time:
tail -f /var/log/syslog # follow log updates
tail -f /var/log/syslog | grep "error" # follow with filtering
tail -F /var/log/syslog # follow even if file is rotated
-F is like -f but handles log rotation (file is deleted and recreated).
tee¶
tee is covered in Streams and Redirection. In brief, it reads from STDIN and writes to both STDOUT and one or more files, making it useful for saving intermediate pipeline results while still passing data to the next stage.
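For example, to save grep's matches to a file while still counting them:
grep "error" app.log | tee errors.txt | wc -l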
Combining Tools¶
The real power of text processing tools comes from combining them in pipelines.
# Top 10 most common words in a file
tr -s '[:space:]' '\n' < file.txt | tr '[:upper:]' '[:lower:]' | sort | uniq -c | sort -rn | head -10
# Extract and count HTTP status codes from access logs
awk '{ print $9 }' access.log | sort | uniq -c | sort -rn
# Find all unique file extensions in a directory tree
find . -type f | sed 's/.*\.//' | sort -u
# Compare two sorted CSV columns
diff <(cut -d, -f1 old.csv | sort) <(cut -d, -f1 new.csv | sort)
# Sum numbers from a file, one per line
paste -sd+ numbers.txt | bc
# Convert CSV to a formatted table
column -t -s, data.csv
Breaking Down Pipelines¶
Here's how to read these pipelines stage by stage.
Top 10 most common words in a file:
tr -s '[:space:]' '\n' < file.txt | tr '[:upper:]' '[:lower:]' | sort | uniq -c | sort -rn | head -10
- tr -s '[:space:]' '\n' - split text into one word per line by converting all whitespace (spaces, tabs, newlines) to newlines, squeezing consecutive whitespace into one
- tr '[:upper:]' '[:lower:]' - lowercase everything so 'The' and 'the' count as the same word
- sort - sort alphabetically so identical words are adjacent (required for uniq)
- uniq -c - collapse adjacent identical lines and prefix each with its count
- sort -rn - sort numerically in reverse so the highest counts come first
- head -10 - show only the first 10 results
Extract and count HTTP status codes:
- awk '{ print $9 }' - pull the 9th whitespace-delimited field from each line (the HTTP status code in standard access log format)
- sort - sort the status codes so identical ones are adjacent
- uniq -c - count consecutive identical lines
- sort -rn - show most frequent codes first
Further Reading¶
- GNU Grep Manual - pattern matching with regular expressions
- GNU Sed Manual - stream editing reference
- GAWK Manual - the GNU awk programming language
- GNU Coreutils Manual - documentation for sort, cut, uniq, tr, wc, head, tail, and other core utilities
- Linux man-pages Project - comprehensive manual pages for Linux commands and system calls
- regular-expressions.info - regex tutorial and reference across flavors