Regular Expressions¶

Intermediate | 40 min | Prereqs: text-processing | cli, regex, text-processing

Learning outcomes

Distinguish between basic, extended, and Perl-compatible regex flavors
Construct patterns using metacharacters, quantifiers, and character classes
Use capture groups and backreferences for matching and substitution
Apply anchors and lookaround assertions for precise pattern matching

Regular expressions are pattern-matching rules used to search, match, and manipulate text. On Linux, they're everywhere: grep uses them to search files, sed uses them for substitutions, awk uses them for pattern matching, find accepts them through its -regex test (its -name test uses shell globs, not regex), and bash's [[ ]] supports them natively through the =~ operator. The Text Processing guide uses regex patterns throughout; this guide teaches the language itself.
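
As a quick illustration of the native bash support mentioned above, here is a minimal sketch using the =~ operator, which matches against an ERE and stores captures in the BASH_REMATCH array. The sample string and variable names are made up for illustration.

```shell
#!/usr/bin/env bash
# Hypothetical sample line; any "key=value" string would do.
line='port=8080'

# Keep the pattern in a variable: quoting the right-hand side of =~
# makes bash treat it as a literal string instead of a regex.
pattern='^([a-z]+)=([0-9]+)$'

if [[ $line =~ $pattern ]]; then
    key=${BASH_REMATCH[1]}     # text captured by the first group
    value=${BASH_REMATCH[2]}   # text captured by the second group
    echo "key=$key value=$value"
fi
```

BASH_REMATCH[0] holds the entire match, and the numbered entries follow the order of the opening parentheses.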

Metacharacters¶

Most characters in a regex match themselves literally. The character a matches the letter "a", hello matches the string "hello". But certain characters have special meaning:

.  ^  $  *  +  ?  {  }  [  ]  (  )  |  \

These are metacharacters. To match them literally, escape them with a backslash: \. matches an actual period, \$ matches a dollar sign.

# Match lines containing a literal period
grep '\.' /etc/hosts

# Match lines containing a dollar sign
grep '\$' script.sh
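
When the whole pattern is meant literally, escaping every metacharacter gets tedious. One alternative, sketched here with invented sample input, is grep -F, which treats the pattern as a fixed string rather than a regex.

```shell
# Hypothetical input: one line with a literal "1.5", one where "." could match "x"
sample=$(printf '%s\n' 'version 1.5' 'version 1x5')

# Unescaped dot matches any character, so both lines match
unescaped=$(printf '%s\n' "$sample" | grep -c '1.5')

# Escaped dot matches only a literal period
escaped=$(printf '%s\n' "$sample" | grep -c '1\.5')

# -F turns off regex interpretation entirely
fixed=$(printf '%s\n' "$sample" | grep -cF '1.5')

echo "unescaped=$unescaped escaped=$escaped fixed=$fixed"
# Prints: unescaped=2 escaped=1 fixed=1
```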

Anchors¶

Anchors match positions, not characters.

Anchor	Matches
`^`	Start of line
`$`	End of line
`\b`	Word boundary (PCRE; GNU extension in BRE/ERE)
`\B`	Non-word boundary

# Lines starting with "error"
grep '^error' /var/log/syslog

# Lines ending with a number
grep '[0-9]$' data.txt

# The word "port" (not "export" or "transport")
grep -w 'port' config.txt        # -w is shorthand for word boundaries
grep '\bport\b' config.txt       # near-equivalent with \b (a GNU extension; GNU grep needs no -E or -P for it)

# Empty lines
grep '^$' file.txt
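
To see the anchors at work on known input, the sketch below counts matches against four invented lines (one of them empty). It also uses grep -x, which behaves as if the pattern were wrapped in ^ and $.

```shell
# Hypothetical sample: four lines, the third one empty
sample=$(printf '%s\n' 'error: disk full' 'no error here' '' 'error')

starts=$(printf '%s\n' "$sample" | grep -c '^error')    # lines starting with "error"
exact=$(printf '%s\n' "$sample" | grep -c '^error$')    # lines that are exactly "error"
exact_x=$(printf '%s\n' "$sample" | grep -cx 'error')   # same result using -x
blank=$(printf '%s\n' "$sample" | grep -c '^$')         # empty lines

echo "starts=$starts exact=$exact exact_x=$exact_x blank=$blank"
# Prints: starts=2 exact=1 exact_x=1 blank=1
```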

Character Classes¶

A character class matches any single character from a set.

Basic Syntax¶

Pattern	Matches
`[abc]`	a, b, or c
`[a-z]`	Any lowercase letter
`[A-Z]`	Any uppercase letter
`[0-9]`	Any digit
`[a-zA-Z0-9]`	Any alphanumeric character
`[^abc]`	Any character except a, b, or c
`[^0-9]`	Any non-digit

The caret ^ at the start of a class negates it. A hyphen - between characters defines a range.

# Lines containing a digit
grep '[0-9]' file.txt

# Lines starting with a vowel
grep -i '^[aeiou]' words.txt

# Lines containing non-alphanumeric characters
grep '[^a-zA-Z0-9 ]' file.txt

POSIX Character Classes¶

POSIX defines named character classes that work across locales:

Class	Equivalent (C locale)	Matches
`[:alpha:]`	`[a-zA-Z]`	Letters
`[:digit:]`	`[0-9]`	Digits
`[:alnum:]`	`[a-zA-Z0-9]`	Letters and digits
`[:upper:]`	`[A-Z]`	Uppercase letters
`[:lower:]`	`[a-z]`	Lowercase letters
`[:space:]`	`[ \t\n\r\f\v]`	Whitespace characters
`[:blank:]`	`[ \t]`	Space and tab only
`[:punct:]`		Punctuation characters
`[:print:]`		Printable characters (including space)
`[:graph:]`		Printable characters (excluding space)

POSIX classes go inside a character class: [[:digit:]] (the outer brackets are the character class, the inner [:digit:] is the named class).

# Lines containing only digits (at least one; a bare * would also match empty lines)
grep '^[[:digit:]][[:digit:]]*$' file.txt

# Lines starting with uppercase
grep '^[[:upper:]]' file.txt

# Lines containing punctuation
grep '[[:punct:]]' file.txt
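
POSIX classes also work inside bash's =~ operator, which makes them handy for input validation. A minimal sketch follows; is_number is a hypothetical helper name, not part of any tool.

```shell
#!/usr/bin/env bash
# is_number (hypothetical helper): succeeds only if $1 is one or more digits.
is_number() {
    [[ $1 =~ ^[[:digit:]]+$ ]]
}

is_number 12345 && echo "12345: number"
is_number 12a45 || echo "12a45: not a number"
is_number ''    || echo "empty string: not a number"
```

The + quantifier (rather than *) is what rejects the empty string.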

Shorthand Classes (PCRE)¶

When using grep -P (Perl-compatible regex), shorthand classes are available. GNU grep also accepts \w, \W, \s, and \S in BRE and ERE as extensions, but \d requires -P:

Shorthand	Equivalent	Matches
`\d`	`[0-9]`	Digit
`\D`	`[^0-9]`	Non-digit
`\w`	`[a-zA-Z0-9_]`	Word character
`\W`	`[^a-zA-Z0-9_]`	Non-word character
`\s`	`[ \t\n\r\f\v]`	Whitespace
`\S`	`[^ \t\n\r\f\v]`	Non-whitespace

# Match IP-like patterns (PCRE shorthand)
grep -P '\d+\.\d+\.\d+\.\d+' access.log

# Match non-whitespace sequences
grep -P '\S+' file.txt

Quantifiers¶

Quantifiers specify how many times the preceding element must match.

Quantifier	Meaning	Example
`*`	Zero or more	`ab*c` matches ac, abc, abbc, abbbc
`+`	One or more	`ab+c` matches abc, abbc, but not ac
`?`	Zero or one	`colou?r` matches color and colour
`{n}`	Exactly n	`a{3}` matches aaa
`{n,}`	n or more	`a{2,}` matches aa, aaa, aaaa
`{n,m}`	Between n and m	`a{2,4}` matches aa, aaa, aaaa

# Lines with three or more consecutive digits
grep -E '[0-9]{3,}' file.txt

# Optional "s" for plural
grep -E 'files?' file.txt

# One or more whitespace characters
grep -E '[[:space:]]+' file.txt
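
The quantifier table can be checked directly. The sketch below runs four quantifiers against the same four candidate strings and counts whole-line matches with -x; the candidate list is taken from the table examples.

```shell
# Candidate strings, one per line
candidates=$(printf '%s\n' ac abc abbc abbbc)

star=$(printf '%s\n' "$candidates" | grep -cxE 'ab*c')       # zero or more b
plus=$(printf '%s\n' "$candidates" | grep -cxE 'ab+c')       # one or more b: "ac" drops out
opt=$(printf '%s\n' "$candidates" | grep -cxE 'ab?c')        # zero or one b: ac, abc
range=$(printf '%s\n' "$candidates" | grep -cxE 'ab{2,3}c')  # two or three b: abbc, abbbc

echo "star=$star plus=$plus opt=$opt range=$range"
# Prints: star=4 plus=3 opt=2 range=2
```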

The Dot (.)¶

The dot matches any single character except newline:

# Three-letter words starting with "c" and ending with "t"
grep -E '\bc.t\b' /usr/share/dict/words
# Matches: cat, cot, cut, etc.

# Match a double-quoted string (greedy: spans from the first quote to the last one on the line)
grep -E '".*"' file.txt

Greedy vs Lazy Matching¶

By default, quantifiers are greedy - they match as much text as possible. Adding ? after a quantifier makes it lazy (match as little as possible). Lazy quantifiers require PCRE (grep -P).

# Greedy: matches from first < to LAST >
echo '<b>bold</b> and <i>italic</i>' | grep -oP '<.*>'
# Output: <b>bold</b> and <i>italic</i>

# Lazy: matches from first < to NEXT >
echo '<b>bold</b> and <i>italic</i>' | grep -oP '<.*?>'
# Output: <b>
#         </b>
#         <i>
#         </i>

Greedy matching is the #1 regex surprise

When .* matches too much, the fix is usually one of: make it lazy (.*?), use a negated character class ([^>]* instead of .*), or be more specific about what you're matching. Negated character classes work in all regex flavors, not just PCRE.
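
The negated-character-class fix can be sketched with the same HTML sample as above. It needs only ERE, so it also works where grep -P is unavailable.

```shell
html='<b>bold</b> and <i>italic</i>'

# [^>]* cannot cross a ">", so each match stops at the nearest closing bracket
tags=$(printf '%s\n' "$html" | grep -oE '<[^>]*>')

printf '%s\n' "$tags"
# Prints:
# <b>
# </b>
# <i>
# </i>
```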

Alternation and Grouping¶

Alternation¶

The pipe | matches either the pattern on the left or the right:

# Match "error" or "warning"
grep -E 'error|warning' /var/log/syslog

# Match file extensions
grep -E '\.(jpg|png|gif)$' filelist.txt

Grouping¶

Parentheses () group patterns together. This is useful for applying quantifiers to multi-character sequences and for alternation scoping:

# Repeat a group: match "abcabc"
grep -E '(abc){2}' file.txt

# Scope alternation: "pre-" followed by "fix" or "set" or "view"
grep -E 'pre-(fix|set|view)' file.txt
# Without grouping, "pre-fix|set|view" matches "pre-fix" OR "set" OR "view"
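
The difference in alternation scope is easy to see by counting matches. This sketch uses an invented word list in which only three entries start with "pre-".

```shell
# Hypothetical word list
words=$(printf '%s\n' pre-fix pre-set pre-view sunset view)

# Grouped: "pre-" must be followed by one of the alternatives
grouped=$(printf '%s\n' "$words" | grep -cE 'pre-(fix|set|view)')

# Ungrouped: "pre-fix" OR "set" OR "view" anywhere in the line
ungrouped=$(printf '%s\n' "$words" | grep -cE 'pre-fix|set|view')

echo "grouped=$grouped ungrouped=$ungrouped"
# Prints: grouped=3 ungrouped=5
```

The ungrouped pattern also picks up "sunset" and "view", because "set" and "view" are standalone alternatives.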

Capture Groups and Backreferences¶

Parentheses also capture the matched text for later use. Backreferences (\1, \2) refer to what the first, second, etc. group captured:

# Find repeated words ("the the", "is is")
grep -E '\b(\w+)\s+\1\b' document.txt

# Swap first and last names with sed (capture groups)
echo "Doe, Jane" | sed -E 's/(\w+), (\w+)/\2 \1/'
# Output: Jane Doe

# Match repeated characters (e.g., "aaa", "bbb")
grep -E '(.)\1{2}' file.txt

# Match HTML tags with matching close tag
grep -E '<([a-z]+)>.*</\1>' page.html

In sed, backreferences are \1, \2. In the replacement string, & refers to the entire match:

# Wrap each line in quotes
sed 's/.*/"&"/' file.txt

# Add "line: " prefix to lines containing numbers
sed -E 's/^(.*[0-9].*)$/line: \1/' file.txt
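
As a further sketch of capture groups in a substitution, the example below reorders an ISO date and uses & to bracket every number. The sample values are invented. It uses | as the s command's delimiter so the slashes in the replacement need no escaping.

```shell
iso='2024-03-25'

# Three groups capture year, month, and day; the replacement reorders them
dmy=$(printf '%s\n' "$iso" | sed -E 's|^([0-9]{4})-([0-9]{2})-([0-9]{2})$|\3/\2/\1|')
echo "$dmy"
# Prints: 25/03/2024

# & stands for the whole match, so no group is needed here
bracketed=$(printf '%s\n' 'a1 b22' | sed -E 's/[0-9]+/[&]/g')
echo "$bracketed"
# Prints: a[1] b[22]
```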

BRE vs ERE vs PCRE¶

Linux tools use three regex flavors. BRE and ERE differ mainly in which metacharacters need escaping; PCRE adds features that neither of them supports.

Basic Regular Expressions (BRE)¶

Used by: grep (default), sed (default)

In BRE, several metacharacters are literal unless escaped:

Character	BRE	ERE
`+`	Literal `+`	One or more
`?`	Literal `?`	Zero or one
`{` `}`	Literal `{` `}`	Quantifier
`(` `)`	Literal `(` `)`	Group
`|`	Literal `|` (`\|` alternates in GNU tools)	Alternation

# BRE: need to escape +, ?, {}, ()
grep 'ab\+c' file.txt          # one or more b
grep 'colou\?r' file.txt       # optional u
grep '\(abc\)\{2\}' file.txt   # group + quantifier

# BRE: pipe needs backslash
grep 'error\|warning' file.txt

Extended Regular Expressions (ERE)¶

Used by: grep -E (egrep is a deprecated alias), sed -E, awk

In ERE, metacharacters work without escaping:

# ERE: no escaping needed
grep -E 'ab+c' file.txt
grep -E 'colou?r' file.txt
grep -E '(abc){2}' file.txt
grep -E 'error|warning' file.txt
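
The same pattern text can mean different things in the two flavors. A small sketch with two invented input lines shows how a+b selects a different line depending on whether -E is given.

```shell
# Hypothetical input: a literal "a+b" and a run of a's followed by b
sample=$(printf '%s\n' 'a+b' 'aab')

bre=$(printf '%s\n' "$sample" | grep 'a+b')      # BRE: + is a literal plus sign
ere=$(printf '%s\n' "$sample" | grep -E 'a+b')   # ERE: + means "one or more a"

echo "BRE matched: $bre"
echo "ERE matched: $ere"
# Prints:
# BRE matched: a+b
# ERE matched: aab
```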

Perl-Compatible Regular Expressions (PCRE)¶

Used by: grep -P and PHP (both built on the PCRE library); Perl, Python, and JavaScript have their own engines with mostly compatible syntax

PCRE adds features not available in BRE/ERE:

Shorthand classes (\d, \w, \s)
Lazy quantifiers (*?, +?)
Lookahead and lookbehind
Non-capturing groups (?:...)
Named groups (?P<name>...)

# Only works with grep -P
grep -P '\d{3}-\d{3}-\d{4}' contacts.txt     # phone numbers
grep -P '(?<=price: )\d+' catalog.txt          # lookbehind

Quick Reference¶

Feature	BRE	ERE	PCRE
`.` `*` `^` `$` `[]`	Yes	Yes	Yes
`+` `?`	`\+` `\?`	`+` `?`	`+` `?`
`{n,m}`	`\{n,m\}`	`{n,m}`	`{n,m}`
`()` groups	`\(\)`	`()`	`()`
`|` alternation	`\|`	`|`	`|`
`\d`	No	No	Yes
`\w` `\s`	GNU extension	GNU extension	Yes
Lazy quantifiers	No	No	Yes
Lookahead/behind	No	No	Yes
Backreferences	`\1`	`\1`	`\1` (`$1` in Perl replacement strings)
grep flag	(default)	`-E`	`-P`
sed flag	(default)	`-E`	N/A

ERE is the practical default

Unless you have a specific reason to use BRE, use grep -E and sed -E for everything. The unescaped syntax is cleaner and less error-prone. Use grep -P when you need PCRE-only features like \d, lazy quantifiers, or lookaround.

Lookahead and Lookbehind¶

Lookaround assertions match a position based on what's ahead or behind, without consuming characters. They require PCRE (grep -P).

Syntax	Name	Matches
`(?=...)`	Positive lookahead	Position followed by pattern
`(?!...)`	Negative lookahead	Position NOT followed by pattern
`(?<=...)`	Positive lookbehind	Position preceded by pattern
`(?<!...)`	Negative lookbehind	Position NOT preceded by pattern

# Extract numbers that are followed by "GB"
grep -oP '\d+(?=GB)' disk-report.txt
# Input: "500GB" -> Output: "500"

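# Extract numbers NOT followed by "MB"
# (the \d alternative stops the engine from backtracking to a shorter number, e.g. "50" in "500MB")
grep -oP '\d+(?!\d|MB)' report.txt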

# Extract values after "price: "
grep -oP '(?<=price: )\d+\.\d{2}' catalog.txt
# Input: "price: 49.99" -> Output: "49.99"

# Extract "happy" only where it is NOT preceded by "un"
grep -oP '(?<!un)happy' text.txt
# Matches "happy" but not "unhappy"

Lookaround is especially useful with -o (print only the matching part), because the lookaround context isn't included in the output.
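
Where grep -P is not available, a capture group in sed can often stand in for a lookbehind. The sketch below extracts the same price value with ERE only; the sample line is invented.

```shell
# Hypothetical catalog line
line='item: widget price: 49.99 stock: 12'

# -n suppresses default output; the p flag prints only lines where the
# substitution succeeded. The group keeps the value and drops the context.
price=$(printf '%s\n' "$line" | sed -nE 's/.*price: ([0-9]+\.[0-9]{2}).*/\1/p')

echo "$price"
# Prints: 49.99
```

Unlike grep -o, this prints at most one value per line, so it suits inputs with a single field of interest per line.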

Practical Patterns¶

IP Addresses¶

# Basic IP pattern (any run of digits per octet, no range check - good enough for log extraction)
grep -E '[0-9]+\.[0-9]+\.[0-9]+\.[0-9]+' access.log

# Strict IP pattern (0-255 per octet, PCRE)
grep -P '\b(?:(?:25[0-5]|2[0-4]\d|[01]?\d\d?)\.){3}(?:25[0-5]|2[0-4]\d|[01]?\d\d?)\b' access.log

# Extract IPs with -o
grep -oE '[0-9]+\.[0-9]+\.[0-9]+\.[0-9]+' access.log
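
For validating a single value rather than scanning a file, the strict pattern can be rewritten in ERE and used with bash's =~ operator. This is a sketch under assumptions: is_ipv4 is a hypothetical helper name, and this variant rejects leading zeros such as "01", which the PCRE pattern above accepts.

```shell
#!/usr/bin/env bash
# is_ipv4 (hypothetical helper): succeeds if $1 is a dotted quad with octets 0-255.
is_ipv4() {
    # One octet: 250-255, 200-249, 100-199, or 0-99 without a leading zero
    local octet='(25[0-5]|2[0-4][0-9]|1[0-9]{2}|[1-9]?[0-9])'
    # First octet, then three repetitions of ".octet", anchored at both ends
    local re="^${octet}(\.${octet}){3}\$"
    [[ $1 =~ $re ]]
}

for candidate in 192.168.1.1 256.1.1.1 1.2.3; do
    if is_ipv4 "$candidate"; then
        echo "$candidate: valid"
    else
        echo "$candidate: invalid"
    fi
done
# Prints:
# 192.168.1.1: valid
# 256.1.1.1: invalid
# 1.2.3: invalid
```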

Email Addresses¶

# Simple email pattern
grep -E '[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}' contacts.txt

Dates¶

# YYYY-MM-DD
grep -E '[0-9]{4}-[0-9]{2}-[0-9]{2}' logfile.txt

# MM/DD/YYYY or DD/MM/YYYY
grep -E '[0-9]{2}/[0-9]{2}/[0-9]{4}' logfile.txt

# Syslog date format (Mar 25 14:30:01)
grep -E '^[A-Z][a-z]{2} [0-9 ][0-9] [0-9]{2}:[0-9]{2}:[0-9]{2}' /var/log/syslog

Log Lines¶

# Extract the protocol and status code from access logs (output looks like: HTTP/1.1" 200)
grep -oE 'HTTP/[0-9.]+" [0-9]{3}' access.log

# Find 5xx errors
grep -E '" 5[0-9]{2} ' access.log

# Extract key=value pairs
grep -oP '\w+=\S+' config.log
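
Extraction patterns like these are often piped into a tally. The sketch below counts status codes across four invented log lines in common log format; the awk variable names are arbitrary.

```shell
# Hypothetical access-log lines
log=$(printf '%s\n' \
  '10.0.0.1 - - [25/Mar/2024:14:30:01 +0000] "GET / HTTP/1.1" 200 512' \
  '10.0.0.2 - - [25/Mar/2024:14:30:02 +0000] "GET /x HTTP/1.1" 404 0' \
  '10.0.0.1 - - [25/Mar/2024:14:30:03 +0000] "POST /api HTTP/1.1" 500 17' \
  '10.0.0.3 - - [25/Mar/2024:14:30:04 +0000] "GET / HTTP/1.1" 200 512')

# grep isolates the closing quote plus the three-digit code; awk counts the
# second field (the code); sort gives a stable order.
summary=$(printf '%s\n' "$log" \
  | grep -oE '" [0-9]{3} ' \
  | awk '{ count[$2]++ } END { for (code in count) print code, count[code] }' \
  | sort)

printf '%s\n' "$summary"
# Prints:
# 200 2
# 404 1
# 500 1
```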

Testing and Debugging Regex¶

Build Patterns Incrementally¶

Start simple and add complexity one piece at a time:

# Step 1: Find lines with "error"
grep -i 'error' logfile.txt

# Step 2: Add a timestamp pattern before it
grep -E '[0-9]{2}:[0-9]{2}:[0-9]{2}.*error' logfile.txt

# Step 3: Print only the matched text (timestamp through the error message) with -o
grep -oE '[0-9]{2}:[0-9]{2}:[0-9]{2}.*error[^"]*' logfile.txt
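
One way to keep incremental changes honest is to check each version of a pattern against strings that should and should not match. The sketch below is one possible harness; check_pattern is a hypothetical helper and the sample strings are invented.

```shell
#!/usr/bin/env bash
# check_pattern PATTERN EXPECT SAMPLE   (hypothetical helper)
# EXPECT is "match" or "nomatch". Prints one line per check; returns 1 on a mismatch.
check_pattern() {
    local pattern=$1 expect=$2 sample=$3 actual=nomatch
    if printf '%s\n' "$sample" | grep -qE -- "$pattern"; then
        actual=match
    fi
    if [[ $actual == "$expect" ]]; then
        echo "ok:   '$sample' -> $actual"
    else
        echo "FAIL: '$sample' -> $actual (expected $expect)"
        return 1
    fi
}

# The step 2 pattern: a timestamp followed later by "error"
ts='[0-9]{2}:[0-9]{2}:[0-9]{2}.*error'
check_pattern "$ts" match   '14:30:01 disk error'
check_pattern "$ts" nomatch 'error at 14:30:01'
check_pattern "$ts" nomatch '1:30:01 disk error'
```

Keeping the negative cases next to the positive ones makes it obvious when a later change loosens the pattern by accident.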

Use Color Highlighting¶

# Highlight matches in the terminal
grep --color=auto -E 'pattern' file.txt

# Make it permanent by adding this line to ~/.bashrc
alias grep='grep --color=auto'

Common Mistakes¶

Mistake	Problem	Fix
`grep 'a+b'`	BRE treats `+` as literal	`grep -E 'a+b'`
`grep -E '1.2.3.4'`	Dots match any character	`grep -E '1\.2\.3\.4'`
`grep -E '<.*>'`	Greedy match spans too far	`grep -P '<.*?>'` or `grep -E '<[^>]*>'`
`grep -E '[0-9]+' <<< "abc"`	No output and exit code 1	Expected - grep exits 1 when nothing matches
`sed 's/old/new/'`	Only replaces first match	`sed 's/old/new/g'` for global
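
Two rows of the table lend themselves to a short demonstration: grep's exit status and sed's g flag. The sketch below uses invented input and tests the exit status with -q, which suppresses output.

```shell
# grep exit status: 0 = a line matched, 1 = nothing matched, 2 = an error occurred
if printf '%s\n' 'abc' | grep -qE '[0-9]+'; then
    result='has digits'
else
    result='no digits'
fi
echo "$result"
# Prints: no digits

# Without g, sed replaces only the first match on each line
first=$(printf '%s\n' 'old old old' | sed 's/old/new/')
all=$(printf '%s\n' 'old old old' | sed 's/old/new/g')
echo "$first"
echo "$all"
# Prints:
# new old old
# new new new
```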