Regular Expressions¶
Regular expressions are pattern-matching rules used to search, match, and manipulate text. On Linux, they're everywhere: grep uses them to search files, sed uses them for substitutions, awk uses them for pattern matching, find supports them through -regex (its -name test uses shell globs, not regex), and bash's [[ ]] supports them natively through the =~ operator. The Text Processing guide uses regex patterns throughout; this guide teaches the language itself.
Metacharacters¶
Most characters in a regex match themselves literally. The character a matches the letter "a", and hello matches the string "hello". But certain characters have special meaning:
. ^ $ * + ? { } [ ] \ | ( )
These are metacharacters. To match one literally, escape it with a backslash: \. matches an actual period, \$ matches a dollar sign.
# Match lines containing a literal period
grep '\.' /etc/hosts
# Match lines containing a dollar sign
grep '\$' script.sh
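As a small sketch of why escaping matters, the snippet below counts matching lines in some made-up sample text (the `sample` variable and its contents are hypothetical). It also shows `grep -F`, which treats the whole pattern as a fixed string and so needs no escaping at all.

```shell
# Hypothetical sample input, one entry per line
sample=$(printf '%s\n' 'version 1.5' 'version 105' 'cost: $5')

# Unescaped dot matches any character, so "1.5" and "105" both match
unescaped=$(printf '%s\n' "$sample" | grep -c '1.5')

# Escaped dot matches only a literal period
escaped=$(printf '%s\n' "$sample" | grep -c '1\.5')

# -F disables regex interpretation entirely
fixed=$(printf '%s\n' "$sample" | grep -cF '$5')

echo "unescaped=$unescaped escaped=$escaped fixed=$fixed"
# prints: unescaped=2 escaped=1 fixed=1
```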
Anchors¶
Anchors match positions, not characters.
| Anchor | Matches |
|---|---|
| `^` | Start of line |
| `$` | End of line |
| `\b` | Word boundary (PCRE; also a GNU extension in grep's BRE and ERE modes) |
| `\B` | Non-word boundary |
# Lines starting with "error"
grep '^error' /var/log/syslog
# Lines ending with a number
grep '[0-9]$' data.txt
# The word "port" (not "export" or "transport")
grep -w 'port' config.txt # -w matches whole words only
grep '\bport\b' config.txt # equivalent with \b (a GNU extension; works in GNU grep's default, -E, and -P modes)
# Empty lines
grep '^$' file.txt
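To see how each anchor narrows a search, here is a minimal sketch that runs the same word through several patterns. The sample lines and the `count` helper are made up for illustration, and `-w` is assumed to behave as in GNU grep.

```shell
# Hypothetical sample input (note the empty second line)
lines=$(printf '%s\n' 'port 8080' '' 'export PATH' 'transport layer' 'open port')

# Helper: count the sample lines that match the given grep arguments
count() { printf '%s\n' "$lines" | grep -c "$@"; }

anywhere=$(count 'port')       # substring match: 4 lines
at_start=$(count '^port')      # only "port 8080"
at_end=$(count 'port$')        # only "open port"
whole_word=$(count -w 'port')  # "port 8080" and "open port"
empty=$(count '^$')            # the one empty line

echo "anywhere=$anywhere at_start=$at_start at_end=$at_end whole_word=$whole_word empty=$empty"
# prints: anywhere=4 at_start=1 at_end=1 whole_word=2 empty=1
```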
Character Classes¶
A character class matches any single character from a set.
Basic Syntax¶
| Pattern | Matches |
|---|---|
| `[abc]` | a, b, or c |
| `[a-z]` | Any lowercase letter |
| `[A-Z]` | Any uppercase letter |
| `[0-9]` | Any digit |
| `[a-zA-Z0-9]` | Any alphanumeric character |
| `[^abc]` | Any character except a, b, or c |
| `[^0-9]` | Any non-digit |
The caret ^ at the start of a class negates it. A hyphen - between characters defines a range.
# Lines containing a digit
grep '[0-9]' file.txt
# Lines starting with a vowel
grep -i '^[aeiou]' words.txt
# Lines containing non-alphanumeric characters
grep '[^a-zA-Z0-9 ]' file.txt
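The following sketch applies the three patterns above to a few made-up words (the `words` variable is hypothetical). `LC_ALL=C` is set here as an assumption to keep range behavior predictable, since ranges like a-z can depend on the locale.

```shell
# Hypothetical sample input
words=$(printf '%s\n' 'apple' 'Banana' 'cherry7' 'kiwi!' 'Orange')

# Lines starting with a vowel, case-insensitive: apple, Orange
vowel_start=$(printf '%s\n' "$words" | LC_ALL=C grep -ci '^[aeiou]')

# Lines containing a digit: cherry7
has_digit=$(printf '%s\n' "$words" | LC_ALL=C grep -c '[0-9]')

# Lines containing something other than letters, digits, or spaces: kiwi!
odd=$(printf '%s\n' "$words" | LC_ALL=C grep '[^a-zA-Z0-9 ]')

echo "vowel_start=$vowel_start has_digit=$has_digit odd=$odd"
# prints: vowel_start=2 has_digit=1 odd=kiwi!
```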
POSIX Character Classes¶
POSIX defines named character classes that work across locales:
| Class | Equivalent | Matches |
|---|---|---|
| `[:alpha:]` | `[a-zA-Z]` | Letters |
| `[:digit:]` | `[0-9]` | Digits |
| `[:alnum:]` | `[a-zA-Z0-9]` | Letters and digits |
| `[:upper:]` | `[A-Z]` | Uppercase letters |
| `[:lower:]` | `[a-z]` | Lowercase letters |
| `[:space:]` | `[ \t\n\r\f\v]` | Whitespace characters |
| `[:blank:]` | `[ \t]` | Space and tab only |
| `[:punct:]` | | Punctuation characters |
| `[:print:]` | | Printable characters (including space) |
| `[:graph:]` | | Printable characters (excluding space) |
POSIX classes go inside a character class: [[:digit:]] (the outer brackets are the character class, the inner [:digit:] is the named class).
# Lines containing only digits (use + rather than *, which would also match empty lines)
grep -E '^[[:digit:]]+$' file.txt
# Lines starting with uppercase
grep '^[[:upper:]]' file.txt
# Lines containing punctuation
grep '[[:punct:]]' file.txt
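A quick way to check how these classes behave is to count matches on a small sample. The data below is made up. The sketch also compares `*` with `+` on the "only digits" pattern, because a zero-or-more quantifier between `^` and `$` accepts empty lines too.

```shell
# Hypothetical sample input (the second line is empty)
data=$(printf '%s\n' '12345' '' '12a45' 'Hello, world')

# With *, both "12345" and the empty line match
star=$(printf '%s\n' "$data" | grep -c '^[[:digit:]]*$')

# With + (ERE), at least one digit is required
plus=$(printf '%s\n' "$data" | grep -cE '^[[:digit:]]+$')

upper=$(printf '%s\n' "$data" | grep -c '^[[:upper:]]')  # "Hello, world"
punct=$(printf '%s\n' "$data" | grep -c '[[:punct:]]')   # the comma

echo "star=$star plus=$plus upper=$upper punct=$punct"
# prints: star=2 plus=1 upper=1 punct=1
```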
Shorthand Classes (PCRE)¶
When using grep -P (Perl-compatible regex), shorthand classes are available:
| Shorthand | Equivalent | Matches |
|---|---|---|
| `\d` | `[0-9]` | Digit |
| `\D` | `[^0-9]` | Non-digit |
| `\w` | `[a-zA-Z0-9_]` | Word character |
| `\W` | `[^a-zA-Z0-9_]` | Non-word character |
| `\s` | `[ \t\n\r\f\v]` | Whitespace |
| `\S` | `[^ \t\n\r\f\v]` | Non-whitespace |
# Match IP-like patterns (PCRE shorthand)
grep -P '\d+\.\d+\.\d+\.\d+' access.log
# Match non-whitespace sequences
grep -P '\S+' file.txt
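When `grep -P` is not available, the POSIX class is a portable stand-in for the shorthand. This sketch, using a made-up log line and assuming a grep built with PCRE support for the first command, extracts the same numbers both ways.

```shell
# Hypothetical log line
line='GET /index.html 200 1532'

# PCRE shorthand: whole numbers delimited by word boundaries
pcre=$(printf '%s\n' "$line" | grep -oP '\b\d+\b' | paste -sd ' ' -)

# Portable ERE equivalent using a POSIX class
posix=$(printf '%s\n' "$line" | grep -oE '[[:digit:]]+' | paste -sd ' ' -)

echo "pcre=$pcre"
echo "posix=$posix"
# both print: 200 1532
```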
Quantifiers¶
Quantifiers specify how many times the preceding element must match.
| Quantifier | Meaning | Example |
|---|---|---|
| `*` | Zero or more | `ab*c` matches ac, abc, abbc, abbbc |
| `+` | One or more | `ab+c` matches abc, abbc, but not ac |
| `?` | Zero or one | `colou?r` matches color and colour |
| `{n}` | Exactly n | `a{3}` matches aaa |
| `{n,}` | n or more | `a{2,}` matches aa, aaa, aaaa |
| `{n,m}` | Between n and m | `a{2,4}` matches aa, aaa, aaaa |
# Lines with three or more consecutive digits
grep -E '[0-9]{3,}' file.txt
# Optional "s" for plural
grep -E 'files?' file.txt
# One or more whitespace characters
grep -E '[[:space:]]+' file.txt
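To compare the quantifiers side by side, the sketch below tests each one against the same four candidate strings. The `-x` flag requires the whole line to match, so the counts reflect the quantifier alone. The candidate list is made up.

```shell
# Hypothetical candidates: "a", zero to three "b"s, then "c"
candidates=$(printf '%s\n' ac abc abbc abbbc)

star=$(printf '%s\n' "$candidates" | grep -cxE 'ab*c')       # all four
plus=$(printf '%s\n' "$candidates" | grep -cxE 'ab+c')       # all but "ac"
opt=$(printf '%s\n' "$candidates" | grep -cxE 'ab?c')        # "ac" and "abc"
range=$(printf '%s\n' "$candidates" | grep -cxE 'ab{2,3}c')  # "abbc" and "abbbc"

echo "star=$star plus=$plus opt=$opt range=$range"
# prints: star=4 plus=3 opt=2 range=2
```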
The Dot (.)¶
The dot matches any single character except newline:
# Three-letter words starting with "c" and ending with "t"
grep -E '\bc.t\b' /usr/share/dict/words
# Matches: cat, cot, cut, etc.
# Match a double-quoted string (greedy: spans from the first quote to the last one on the line)
grep -E '".*"' file.txt
Greedy vs Lazy Matching¶
By default, quantifiers are greedy - they match as much text as possible. Adding ? after a quantifier makes it lazy (match as little as possible). Lazy quantifiers require PCRE (grep -P).
# Greedy: matches from first < to LAST >
echo '<b>bold</b> and <i>italic</i>' | grep -oP '<.*>'
# Output: <b>bold</b> and <i>italic</i>
# Lazy: matches from first < to NEXT >
echo '<b>bold</b> and <i>italic</i>' | grep -oP '<.*?>'
# Output: <b>
# </b>
# <i>
# </i>
Greedy matching is the #1 regex surprise
When .* matches too much, the fix is usually one of: make it lazy (.*?), use a negated character class ([^>]* instead of .*), or be more specific about what you're matching. Negated character classes work in all regex flavors, not just PCRE.
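The negated-class fix can be sketched without PCRE. Using the same sample string as the greedy example, plain ERE with `[^>]*` yields one tag per match, while the greedy pattern swallows the whole line.

```shell
html='<b>bold</b> and <i>italic</i>'

# Greedy: a single match spanning the entire string
greedy=$(printf '%s\n' "$html" | grep -oE '<.*>')

# Negated class: each match stops at the first ">"
negated=$(printf '%s\n' "$html" | grep -oE '<[^>]*>' | paste -sd ' ' -)

echo "greedy:  $greedy"
echo "negated: $negated"
# negated prints: <b> </b> <i> </i>
```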
Alternation and Grouping¶
Alternation¶
The pipe | matches either the pattern on the left or the right:
# Match "error" or "warning"
grep -E 'error|warning' /var/log/syslog
# Match file extensions
grep -E '\.(jpg|png|gif)$' filelist.txt
Grouping¶
Parentheses () group patterns together. This is useful for applying quantifiers to multi-character sequences and for alternation scoping:
# Repeat a group: match "abcabc"
grep -E '(abc){2}' file.txt
# Scope alternation: "pre-" followed by "fix" or "set" or "view"
grep -E 'pre-(fix|set|view)' file.txt
# Without grouping, "pre-fix|set|view" matches "pre-fix" OR "set" OR "view"
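The effect of scoping is easy to measure on a small word list. In this sketch (the words are made up), the grouped pattern matches only the three "pre-" words, while the ungrouped one also picks up anything containing "set" or "view".

```shell
# Hypothetical sample input
words=$(printf '%s\n' 'pre-fix' 'pre-set' 'pre-view' 'reset' 'overview')

# Alternation confined to the group
grouped=$(printf '%s\n' "$words" | grep -cE 'pre-(fix|set|view)')

# Alternation spans the whole pattern: "pre-fix" OR "set" OR "view"
ungrouped=$(printf '%s\n' "$words" | grep -cE 'pre-fix|set|view')

echo "grouped=$grouped ungrouped=$ungrouped"
# prints: grouped=3 ungrouped=5
```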
Capture Groups and Backreferences¶
Parentheses also capture the matched text for later use. Backreferences (\1, \2) refer to what the first, second, etc. group captured:
# Find repeated words ("the the", "is is")
grep -E '\b(\w+)\s+\1\b' document.txt
# Swap first and last names with sed (capture groups)
echo "Doe, Jane" | sed -E 's/(\w+), (\w+)/\2 \1/'
# Output: Jane Doe
# Match repeated characters (e.g., "aaa", "bbb")
grep -E '(.)\1{2}' file.txt
# Match HTML tags with matching close tag
grep -E '<([a-z]+)>.*</\1>' page.html
In sed, backreferences are \1, \2. In the replacement string, & refers to the entire match:
# Wrap each line in quotes
sed 's/.*/"&"/' file.txt
# Add "line: " prefix to lines containing numbers
sed -E 's/^(.*[0-9].*)$/line: \1/' file.txt
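Here is a self-contained sketch of backreferences and `&` on made-up input. Two assumptions are worth flagging: `\w`, `\s`, and `\b` in `grep -E` are GNU extensions, and the sed command uses the POSIX class `[[:alnum:]_]` in place of `\w` so it should also work with non-GNU sed.

```shell
# Hypothetical sample input
text=$(printf '%s\n' 'this is is a test' 'nothing repeated here')

# \1 must repeat exactly what group 1 captured: only "is is" qualifies
repeated=$(printf '%s\n' "$text" | grep -cE '\b(\w+)\s+\1\b')

# Swap two captured fields in the replacement
swapped=$(printf '%s\n' 'Doe, Jane' | sed -E 's/([[:alnum:]_]+), ([[:alnum:]_]+)/\2 \1/')

# & stands for the entire match
quoted=$(printf '%s\n' 'hello' | sed 's/.*/"&"/')

echo "repeated=$repeated swapped=$swapped quoted=$quoted"
# prints: repeated=1 swapped=Jane Doe quoted="hello"
```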
BRE vs ERE vs PCRE¶
Linux uses three regex flavors. The differences are mainly about which metacharacters need escaping.
Basic Regular Expressions (BRE)¶
Used by: grep (default), sed (default)
In BRE, several metacharacters are literal unless escaped:
| Character | BRE | ERE |
|---|---|---|
| `+` | Literal + | One or more |
| `?` | Literal ? | Zero or one |
| `{ }` | Literal { } | Quantifier |
| `( )` | Literal ( ) | Group |
| Pipe | Literal pipe (alternation needs a backslash before the pipe) | Alternation |
# BRE: need to escape +, ?, {}, ()
grep 'ab\+c' file.txt # one or more b
grep 'colou\?r' file.txt # optional u
grep '\(abc\)\{2\}' file.txt # group + quantifier
# BRE: pipe needs backslash
grep 'error\|warning' file.txt
Extended Regular Expressions (ERE)¶
Used by: grep -E (or the deprecated egrep), sed -E, awk
In ERE, metacharacters work without escaping:
# ERE: no escaping needed
grep -E 'ab+c' file.txt
grep -E 'colou?r' file.txt
grep -E '(abc){2}' file.txt
grep -E 'error|warning' file.txt
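The escaping rules flip between the two flavors, which this sketch tries to make concrete with two made-up lines. It assumes GNU grep, where `\+` in BRE is an extension meaning "one or more".

```shell
# Hypothetical sample input: one line with repeated b, one with a literal plus
input=$(printf '%s\n' 'abbc' 'ab+c')

bre_plain=$(printf '%s\n' "$input" | grep 'ab+c')       # + is literal in BRE
bre_escaped=$(printf '%s\n' "$input" | grep 'ab\+c')    # \+ is a quantifier in BRE
ere_plain=$(printf '%s\n' "$input" | grep -E 'ab+c')    # + is a quantifier in ERE
ere_escaped=$(printf '%s\n' "$input" | grep -E 'ab\+c') # \+ is literal in ERE

echo "BRE: plain=$bre_plain escaped=$bre_escaped"
echo "ERE: plain=$ere_plain escaped=$ere_escaped"
# prints: BRE: plain=ab+c escaped=abbc
#         ERE: plain=abbc escaped=ab+c
```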
Perl-Compatible Regular Expressions (PCRE)¶
Used by: grep -P, Perl, Python, PHP, JavaScript (mostly compatible)
PCRE adds features not available in BRE/ERE:
- Shorthand classes (\d, \w, \s)
- Lazy quantifiers (*?, +?)
- Lookahead and lookbehind
- Non-capturing groups (?:...)
- Named groups (?P<name>...)
# Only works with grep -P
grep -P '\d{3}-\d{3}-\d{4}' contacts.txt # phone numbers
grep -P '(?<=price: )\d+' catalog.txt # lookbehind
Quick Reference¶
| Feature | BRE | ERE | PCRE |
|---|---|---|---|
| `. * ^ $ []` | Yes | Yes | Yes |
| `+ ?` | `\+ \?` | `+ ?` | `+ ?` |
| `{n,m}` | `\{n,m\}` | `{n,m}` | `{n,m}` |
| `()` groups | `\(\)` | `()` | `()` |
| Alternation | Backslash + pipe | Bare pipe | Bare pipe |
| `\d` | No | No | Yes |
| `\w \s` | GNU extension | GNU extension | Yes |
| Lazy quantifiers | No | No | Yes |
| Lookahead/behind | No | No | Yes |
| Backreferences | `\1` | `\1` | `\1` (`$1` in Perl replacement text) |
| grep flag | (default) | `-E` | `-P` |
| sed flag | (default) | `-E` | N/A |
ERE is the practical default
Unless you have a specific reason to use BRE, use grep -E and sed -E for everything. The unescaped syntax is cleaner and less error-prone. Use grep -P when you need PCRE-only features like \d, lazy quantifiers, or lookaround.
Lookahead and Lookbehind¶
Lookaround assertions match a position based on what's ahead or behind, without consuming characters. They require PCRE (grep -P).
| Syntax | Name | Matches |
|---|---|---|
| `(?=...)` | Positive lookahead | Position followed by pattern |
| `(?!...)` | Negative lookahead | Position NOT followed by pattern |
| `(?<=...)` | Positive lookbehind | Position preceded by pattern |
| `(?<!...)` | Negative lookbehind | Position NOT preceded by pattern |
# Extract numbers that are followed by "GB"
grep -oP '\d+(?=GB)' disk-report.txt
# Input: "500GB" -> Output: "500"
# Extract numbers NOT followed by "MB" (the \d in the lookahead stops the match from backtracking to a shorter number)
grep -oP '\d+(?!\d|MB)' report.txt
# Extract values after "price: "
grep -oP '(?<=price: )\d+\.\d{2}' catalog.txt
# Input: "price: 49.99" -> Output: "49.99"
# Extract words NOT preceded by "un"
grep -oP '(?<!un)happy' text.txt
# Matches "happy" but not "unhappy"
Lookaround is especially useful with -o (print only the matching part), because the lookaround context isn't included in the output.
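One subtlety with a negative lookahead placed after a quantifier: the engine may backtrack to a shorter match that satisfies the assertion. The sketch below, on a made-up report line and assuming a grep with PCRE support, shows `\d+(?!MB)` extracting "50" out of "500MB", and one possible guard that also forbids a following digit.

```shell
# Hypothetical report line
report='used 500MB of 2GB, 42 files'

# Positive lookahead: only the number directly before "GB"
gb=$(printf '%s\n' "$report" | grep -oP '\d+(?=GB)')

# Naive negative lookahead: backtracks from "500" to "50", which is followed by "0M"
naive=$(printf '%s\n' "$report" | grep -oP '\d+(?!MB)' | paste -sd ' ' -)

# Guarded: the number may not be followed by another digit or by "MB"
guarded=$(printf '%s\n' "$report" | grep -oP '\d+(?!\d|MB)' | paste -sd ' ' -)

echo "gb=$gb"
echo "naive=$naive"
echo "guarded=$guarded"
# prints: gb=2, naive=50 2 42, guarded=2 42
```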
Practical Patterns¶
IP Addresses¶
# Basic IP pattern (matches 0-999 per octet - good enough for log extraction)
grep -E '[0-9]+\.[0-9]+\.[0-9]+\.[0-9]+' access.log
# Strict IP pattern (0-255 per octet, PCRE)
grep -P '\b(?:(?:25[0-5]|2[0-4]\d|[01]?\d\d?)\.){3}(?:25[0-5]|2[0-4]\d|[01]?\d\d?)\b' access.log
# Extract IPs with -o
grep -oE '[0-9]+\.[0-9]+\.[0-9]+\.[0-9]+' access.log
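The difference between the loose and strict patterns shows up on out-of-range octets. This sketch runs both over made-up log lines. The strict pattern is the one given above and assumes a grep with PCRE support.

```shell
# Hypothetical log lines: a valid address, an out-of-range one, and a version number
log=$(printf '%s\n' 'client 192.168.1.10 connected' 'bad address 999.300.1.1' 'version 1.2.3')

# Loose: any four dot-separated digit runs
loose=$(printf '%s\n' "$log" | grep -oE '[0-9]+\.[0-9]+\.[0-9]+\.[0-9]+' | paste -sd ' ' -)

# Strict: each octet limited to 0-255
strict=$(printf '%s\n' "$log" |
  grep -oP '\b(?:(?:25[0-5]|2[0-4]\d|[01]?\d\d?)\.){3}(?:25[0-5]|2[0-4]\d|[01]?\d\d?)\b')

echo "loose=$loose"
echo "strict=$strict"
# prints: loose=192.168.1.10 999.300.1.1
#         strict=192.168.1.10
```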
Email Addresses¶
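A minimal sketch of an email pattern, assuming a loose "local part, @, domain with at least one dot" shape rather than full RFC 5322 validity. The pattern and the sample addresses are illustrative only. It should be adequate for pulling addresses out of text, not for validating user input.

```shell
# Hypothetical sample text
text=$(printf '%s\n' 'contact: alice@example.com' \
  'not an email: bob@localhost' \
  'cc carol.smith+lists@mail.example.org')

# Local part, "@", domain labels, then a dot and an alphabetic TLD of 2+ letters
emails=$(printf '%s\n' "$text" |
  grep -oE '[[:alnum:]._%+-]+@[[:alnum:].-]+\.[[:alpha:]]{2,}' | paste -sd ' ' -)

echo "$emails"
# prints: alice@example.com carol.smith+lists@mail.example.org
```

Note that bob@localhost is skipped because the pattern requires a dot in the domain.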
Dates¶
# YYYY-MM-DD
grep -E '[0-9]{4}-[0-9]{2}-[0-9]{2}' logfile.txt
# MM/DD/YYYY or DD/MM/YYYY
grep -E '[0-9]{2}/[0-9]{2}/[0-9]{4}' logfile.txt
# Syslog date format (Mar 25 14:30:01)
grep -E '^[A-Z][a-z]{2} [0-9 ][0-9] [0-9]{2}:[0-9]{2}:[0-9]{2}' /var/log/syslog
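To check the three date patterns against each other, this sketch feeds them a few made-up lines. The syslog pattern's `[0-9 ][0-9]` day field is what lets it accept the space-padded "Mar  5" form.

```shell
# Hypothetical sample lines covering each date format
log=$(printf '%s\n' \
  'Mar 25 14:30:01 host sshd[1]: ok' \
  'Mar  5 09:00:00 host cron[2]: run' \
  '2024-03-25 deploy finished' \
  'released on 03/25/2024')

iso=$(printf '%s\n' "$log" | grep -oE '[0-9]{4}-[0-9]{2}-[0-9]{2}')
slash=$(printf '%s\n' "$log" | grep -oE '[0-9]{2}/[0-9]{2}/[0-9]{4}')
syslog=$(printf '%s\n' "$log" |
  grep -cE '^[A-Z][a-z]{2} [0-9 ][0-9] [0-9]{2}:[0-9]{2}:[0-9]{2}')

echo "iso=$iso slash=$slash syslog_lines=$syslog"
# prints: iso=2024-03-25 slash=03/25/2024 syslog_lines=2
```

These patterns check shape only, so a string like 9999-99-99 would also match the first one.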
Log Lines¶
# Extract the protocol and status code from access logs (output looks like: HTTP/1.1" 200)
grep -oE 'HTTP/[0-9.]+" [0-9]{3}' access.log
# Find 5xx errors
grep -E '" 5[0-9]{2} ' access.log
# Extract key=value pairs
grep -oP '\w+=\S+' config.log
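The sketch below applies these patterns to a made-up access-log entry and a made-up key=value line. A second grep narrows the first extraction down to the bare status code. The key=value output also shows a limit of `\S+`: a quoted value containing a space is cut at the space.

```shell
# Hypothetical access-log entry
entry='10.0.0.7 - - [25/Mar/2024:14:30:01 +0000] "GET /api HTTP/1.1" 503 512'

with_proto=$(printf '%s\n' "$entry" | grep -oE 'HTTP/[0-9.]+" [0-9]{3}')
status=$(printf '%s\n' "$with_proto" | grep -oE '[0-9]{3}$')  # keep only the code
is_5xx=$(printf '%s\n' "$entry" | grep -cE '" 5[0-9]{2} ')

# Hypothetical key=value line; \S+ stops at the first space
pairs=$(printf '%s\n' 'level=warn user=alice msg="disk full"' |
  grep -oP '\w+=\S+' | paste -sd ' ' -)

echo "status=$status is_5xx=$is_5xx"
echo "pairs=$pairs"
# prints: status=503 is_5xx=1
#         pairs=level=warn user=alice msg="disk
```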
Testing and Debugging Regex¶
Build Patterns Incrementally¶
Start simple and add complexity one piece at a time:
# Step 1: Find lines with "error"
grep -i 'error' logfile.txt
# Step 2: Add a timestamp pattern before it
grep -E '[0-9]{2}:[0-9]{2}:[0-9]{2}.*error' logfile.txt
# Step 3: Capture the timestamp and error message
grep -oE '[0-9]{2}:[0-9]{2}:[0-9]{2}.*error[^"]*' logfile.txt
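The same three steps can be tried on inline sample data instead of a real log file. The lines below are made up. Watching the match count shrink at each step is a simple way to confirm that every added piece does what was intended.

```shell
# Hypothetical log lines
log=$(printf '%s\n' '14:30:01 error disk full' '14:31:12 info all good' 'error without timestamp')

# Step 1: any line mentioning "error"
step1=$(printf '%s\n' "$log" | grep -ci 'error')

# Step 2: require a timestamp before it
step2=$(printf '%s\n' "$log" | grep -cE '[0-9]{2}:[0-9]{2}:[0-9]{2}.*error')

# Step 3: print just the matched portion
step3=$(printf '%s\n' "$log" | grep -oE '[0-9]{2}:[0-9]{2}:[0-9]{2}.*error[^"]*')

echo "step1=$step1 step2=$step2"
echo "step3=$step3"
# prints: step1=2 step2=1
#         step3=14:30:01 error disk full
```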
Use Color Highlighting¶
# Highlight matches in the terminal
grep --color=auto -E 'pattern' file.txt
# Make it permanent
alias grep='grep --color=auto'
Common Mistakes¶
| Mistake | Problem | Fix |
|---|---|---|
| `grep 'a+b'` | BRE treats + as literal | `grep -E 'a+b'` |
| `grep -E '1.2.3.4'` | Dots match any character | `grep -E '1\.2\.3\.4'` |
| `grep -E '<.*>'` | Greedy match spans too far | `grep -P '<.*?>'` or `grep -E '<[^>]*>'` |
| `grep -E '[0-9]+' <<< "abc"` | No output and exit code 1 | Expected: there are no digits in the input |
| `sed 's/old/new/'` | Only replaces the first match on each line | `sed 's/old/new/g'` for global |
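Two of these mistakes are easy to reproduce in a sketch: the exit status grep returns when nothing matches, and sed's first-match-only default. The variable names are illustrative. The `|| var=$?` form is used so the snippet also behaves under `set -e`.

```shell
# grep exits 0 when at least one line matches, 1 when none do
no_match=0
printf '%s\n' 'abc' | grep -qE '[0-9]+' || no_match=$?

match=0
printf '%s\n' 'abc123' | grep -qE '[0-9]+' || match=$?

# Without g, sed replaces only the first occurrence on each line
first=$(printf '%s\n' 'old old old' | sed 's/old/new/')
all=$(printf '%s\n' 'old old old' | sed 's/old/new/g')

echo "no_match=$no_match match=$match"
echo "first=$first"
echo "all=$all"
# prints: no_match=1 match=0
#         first=new old old
#         all=new new new
```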
Further Reading¶
- regex101.com - interactive regex tester with explanation of each token
- Regular-Expressions.info - comprehensive tutorial covering all regex flavors
- grep man page - grep options and regex support reference
- POSIX Regular Expressions - official BRE and ERE specification
- perlre man page - Perl regex reference (the basis for PCRE)
- Arch Wiki: Regular Expressions - practical quick reference