
Regular Expressions

Regular expressions are pattern-matching rules used to search, match, and manipulate text. On Linux, they're everywhere: grep uses them to search files, sed uses them for substitutions, awk uses them for pattern matching, find uses them for path matching with -regex (its -name test takes shell globs, not regex), and bash's [[ ]] supports them natively through the =~ operator. The Text Processing guide uses regex patterns throughout; this guide teaches the language itself.


Metacharacters

Most characters in a regex match themselves literally. The character a matches the letter "a", hello matches the string "hello". But certain characters have special meaning:

.  ^  $  *  +  ?  {  }  [  ]  (  )  |  \

These are metacharacters. To match them literally, escape them with a backslash: \. matches an actual period, \$ matches a dollar sign.

# Match lines containing a literal period
grep '\.' /etc/hosts

# Match lines containing a dollar sign
grep '\$' script.sh
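
To see why escaping matters, here is a minimal sketch that runs the same search with and without the backslash. The two sample strings are made up for illustration.

```shell
# Hypothetical input: one real version string and one lookalike
sample=$(printf '%s\n' 'v1.2' 'v1x2')

# Unescaped dot matches any character, so both lines count
unescaped=$(printf '%s\n' "$sample" | grep -c 'v1.2')

# Escaped dot matches only a literal period
escaped=$(printf '%s\n' "$sample" | grep -c 'v1\.2')

echo "unescaped: $unescaped, escaped: $escaped"
# prints: unescaped: 2, escaped: 1
```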

Anchors

Anchors match positions, not characters.

Anchor  Matches
^       Start of line
$       End of line
\b      Word boundary (PCRE; also a GNU extension in BRE/ERE)
\B      Non-word boundary (same availability as \b)

# Lines starting with "error"
grep '^error' /var/log/syslog

# Lines ending with a number
grep '[0-9]$' data.txt

# The word "port" (not "export" or "transport")
grep -w 'port' config.txt        # -w is shorthand for word boundaries
grep '\bport\b' config.txt       # same idea with explicit \b (a GNU extension outside -P)

# Empty lines
grep '^$' file.txt
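
The difference between a substring match and a whole-word match is easy to check on a few lines. This is a small sketch on made-up input, not output from a real config file.

```shell
# Hypothetical config lines: only the first contains "port" as a whole word
words=$(printf '%s\n' 'port 80' 'export PATH' 'transport layer')

# -w keeps only lines where "port" stands alone as a word
whole_word=$(printf '%s\n' "$words" | grep -w 'port')

# Without -w, "port" also matches inside "export" and "transport"
substring=$(printf '%s\n' "$words" | grep -c 'port')

echo "$whole_word"                      # port 80
echo "substring matches: $substring"    # substring matches: 3
```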

Character Classes

A character class matches any single character from a set.

Basic Syntax

Pattern       Matches
[abc]         a, b, or c
[a-z]         Any lowercase letter
[A-Z]         Any uppercase letter
[0-9]         Any digit
[a-zA-Z0-9]   Any alphanumeric character
[^abc]        Any character except a, b, or c
[^0-9]        Any non-digit

The caret ^ at the start of a class negates it. A hyphen - between characters defines a range.

# Lines containing a digit
grep '[0-9]' file.txt

# Lines starting with a vowel
grep -i '^[aeiou]' words.txt

# Lines containing non-alphanumeric characters
grep '[^a-zA-Z0-9 ]' file.txt

POSIX Character Classes

POSIX defines named character classes that work across locales:

Class      Equivalent (C locale)  Matches
[:alpha:]  [a-zA-Z]               Letters
[:digit:]  [0-9]                  Digits
[:alnum:]  [a-zA-Z0-9]            Letters and digits
[:upper:]  [A-Z]                  Uppercase letters
[:lower:]  [a-z]                  Lowercase letters
[:space:]  [ \t\n\r\f\v]          Whitespace characters
[:blank:]  [ \t]                  Space and tab only
[:punct:]  (none)                 Punctuation characters
[:print:]  (none)                 Printable characters (including space)
[:graph:]  (none)                 Printable characters (excluding space)

POSIX classes go inside a character class: [[:digit:]] (the outer brackets are the character class, the inner [:digit:] is the named class).

# Lines containing only digits (at least one; a bare * would also match empty lines)
grep '^[[:digit:]][[:digit:]]*$' file.txt

# Lines starting with uppercase
grep '^[[:upper:]]' file.txt

# Lines containing punctuation
grep '[[:punct:]]' file.txt
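
Because * also accepts zero repetitions, a digits-only pattern built with * alone matches empty lines too. The sketch below compares the two forms on made-up input.

```shell
# Hypothetical input: a number, an empty line, and a mixed line
lines=$(printf '%s\n' '123' '' '12a')

# * allows zero digits, so the empty line is counted as well
star_count=$(printf '%s\n' "$lines" | grep -c '^[[:digit:]]*$')

# Requiring one digit before the * excludes the empty line
strict_count=$(printf '%s\n' "$lines" | grep -c '^[[:digit:]][[:digit:]]*$')

echo "with *: $star_count, at least one digit: $strict_count"
# prints: with *: 2, at least one digit: 1
```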

Shorthand Classes (PCRE)

When using grep -P (Perl-compatible regex), shorthand classes are available:

Shorthand  Equivalent        Matches
\d         [0-9]             Digit
\D         [^0-9]            Non-digit
\w         [a-zA-Z0-9_]      Word character
\W         [^a-zA-Z0-9_]     Non-word character
\s         [ \t\n\r\f\v]     Whitespace
\S         [^ \t\n\r\f\v]    Non-whitespace

# Match IP-like patterns (PCRE shorthand)
grep -P '\d+\.\d+\.\d+\.\d+' access.log

# Match non-whitespace sequences
grep -P '\S+' file.txt

Quantifiers

Quantifiers specify how many times the preceding element must match.

Quantifier  Meaning          Example
*           Zero or more     ab*c matches ac, abc, abbc, abbbc
+           One or more      ab+c matches abc, abbc, but not ac
?           Zero or one      colou?r matches color and colour
{n}         Exactly n        a{3} matches aaa
{n,}        n or more        a{2,} matches aa, aaa, aaaa
{n,m}       Between n and m  a{2,4} matches aa, aaa, aaaa

# Lines with three or more consecutive digits
grep -E '[0-9]{3,}' file.txt

# Optional "s" for plural
grep -E 'files?' file.txt

# One or more whitespace characters
grep -E '[[:space:]]+' file.txt
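
A bounded quantifier is easiest to understand with -o, which prints each match on its own line. In this sketch (the input string is invented), a lone "a" is too short to match and a run of five is cut off at four.

```shell
# a{2,4} needs at least two and takes at most four consecutive "a" characters
matches=$(echo 'a aa aaa aaaaa' | grep -oE 'a{2,4}')

echo "$matches"
# prints:
# aa
# aaa
# aaaa
```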

The Dot (.)

The dot matches any single character except newline:

# Three-letter words starting with "c" and ending with "t"
grep -E '\bc.t\b' /usr/share/dict/words
# Matches: cat, cot, cut, etc.

# Match any character between quotes
grep -E '".*"' file.txt

Greedy vs Lazy Matching

By default, quantifiers are greedy - they match as much text as possible. Adding ? after a quantifier makes it lazy (match as little as possible). Lazy quantifiers require PCRE (grep -P).

# Greedy: matches from first < to LAST >
echo '<b>bold</b> and <i>italic</i>' | grep -oP '<.*>'
# Output: <b>bold</b> and <i>italic</i>

# Lazy: matches from first < to NEXT >
echo '<b>bold</b> and <i>italic</i>' | grep -oP '<.*?>'
# Output: <b>
#         </b>
#         <i>
#         </i>

Greedy matching is the #1 regex surprise

When .* matches too much, the fix is usually one of: make it lazy (.*?), use a negated character class ([^>]* instead of .*), or be more specific about what you're matching. Negated character classes work in all regex flavors, not just PCRE.
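
The negated-class fix can be tried without PCRE. A minimal sketch using the same sample string as above:

```shell
# [^>]* cannot cross a ">", so each tag is matched separately even though * is greedy
tags=$(echo '<b>bold</b> and <i>italic</i>' | grep -oE '<[^>]*>')

echo "$tags"
# prints:
# <b>
# </b>
# <i>
# </i>
```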


Alternation and Grouping

Alternation

The pipe | matches either the pattern on the left or the right:

# Match "error" or "warning"
grep -E 'error|warning' /var/log/syslog

# Match file extensions
grep -E '\.(jpg|png|gif)$' filelist.txt

Grouping

Parentheses () group patterns together. This is useful for applying quantifiers to multi-character sequences and for alternation scoping:

# Repeat a group: match "abcabc"
grep -E '(abc){2}' file.txt

# Scope alternation: "pre-" followed by "fix" or "set" or "view"
grep -E 'pre-(fix|set|view)' file.txt
# Without grouping, "pre-fix|set|view" matches "pre-fix" OR "set" OR "view"
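
Counting matches on a few made-up words shows how much the parentheses change the result. This is an illustrative sketch only.

```shell
# Hypothetical word list
list=$(printf '%s\n' 'pre-fix' 'pre-set' 'reset' 'view')

# Grouped: "pre-" must come before one of the three alternatives
grouped=$(printf '%s\n' "$list" | grep -cE 'pre-(fix|set|view)')

# Ungrouped: "set" and "view" match on their own, anywhere in the line
ungrouped=$(printf '%s\n' "$list" | grep -cE 'pre-fix|set|view')

echo "grouped: $grouped, ungrouped: $ungrouped"
# prints: grouped: 2, ungrouped: 4
```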

Capture Groups and Backreferences

Parentheses also capture the matched text for later use. Backreferences (\1, \2) refer to what the first, second, etc. group captured:

# Find repeated words ("the the", "is is")
grep -E '\b(\w+)\s+\1\b' document.txt

# Swap first and last names with sed (capture groups)
echo "Doe, Jane" | sed -E 's/(\w+), (\w+)/\2 \1/'
# Output: Jane Doe

# Match repeated characters (e.g., "aaa", "bbb")
grep -E '(.)\1{2}' file.txt

# Match HTML tags with matching close tag
grep -E '<([a-z]+)>.*</\1>' page.html

In sed, backreferences are \1, \2. In the replacement string, & refers to the entire match:

# Wrap each line in quotes
sed 's/.*/"&"/' file.txt

# Add "line: " prefix to lines containing numbers
sed -E 's/^(.*[0-9].*)$/line: \1/' file.txt
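
Numbered backreferences and & can be combined in one replacement. The sketch below uses an invented key=value string to show which part each one expands to.

```shell
# \1 is the key, \2 is the value, & is the whole matched text
result=$(echo 'port=8080' | sed -E 's/([a-z]+)=([0-9]+)/\2 is the \1 (was: &)/')

echo "$result"
# prints: 8080 is the port (was: port=8080)
```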

BRE vs ERE vs PCRE

Linux tools use three main regex flavors. The differences are mainly about which metacharacters need escaping and which extra features are available.

Basic Regular Expressions (BRE)

Used by: grep (default), sed (default)

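In BRE, several metacharacters are literal unless escaped. The escaped forms \+, \?, and \| are GNU extensions rather than part of POSIX BRE:

Character  BRE                             ERE
+          Literal (\+ is one or more)     One or more
?          Literal (\? is zero or one)     Zero or one
{ }        Literal (\{ \} is a quantifier) Quantifier
( )        Literal (\( \) is a group)      Group
|          Literal (\| is alternation)     Alternation

# BRE: need to escape +, ?, {}, ()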
grep 'ab\+c' file.txt          # one or more b
grep 'colou\?r' file.txt       # optional u
grep '\(abc\)\{2\}' file.txt   # group + quantifier

# BRE: pipe needs backslash
grep 'error\|warning' file.txt

Extended Regular Expressions (ERE)

Used by: grep -E, sed -E, awk (egrep is an older spelling of grep -E that recent GNU grep releases flag as obsolescent)

In ERE, metacharacters work without escaping:

# ERE: no escaping needed
grep -E 'ab+c' file.txt
grep -E 'colou?r' file.txt
grep -E '(abc){2}' file.txt
grep -E 'error|warning' file.txt
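
The same pattern text can mean different things in the two flavors. The following sketch assumes GNU grep (the \+ form is a GNU extension to BRE) and uses made-up input lines.

```shell
# Hypothetical input: repeated b, a literal plus sign, and no b at all
input=$(printf '%s\n' 'abbc' 'ab+c' 'ac')

# BRE, unescaped: + is a literal plus sign
bre_literal=$(printf '%s\n' "$input" | grep 'ab+c')

# BRE, escaped: \+ means "one or more" (GNU extension)
bre_escaped=$(printf '%s\n' "$input" | grep 'ab\+c')

# ERE: + means "one or more" without escaping
ere=$(printf '%s\n' "$input" | grep -E 'ab+c')

echo "BRE 'ab+c': $bre_literal"     # BRE 'ab+c': ab+c
echo "BRE 'ab\\+c': $bre_escaped"   # BRE 'ab\+c': abbc
echo "ERE 'ab+c': $ere"             # ERE 'ab+c': abbc
```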

Perl-Compatible Regular Expressions (PCRE)

Used by: grep -P and PHP's preg_* functions. Perl, Python, and JavaScript have their own engines with largely compatible syntax.

PCRE adds features not available in BRE/ERE:

  • Shorthand classes (\d, \w, \s)
  • Lazy quantifiers (*?, +?)
  • Lookahead and lookbehind
  • Non-capturing groups (?:...)
  • Named groups (?P<name>...)

# Only works with grep -P
grep -P '\d{3}-\d{3}-\d{4}' contacts.txt     # phone numbers
grep -P '(?<=price: )\d+' catalog.txt          # lookbehind

Quick Reference

Feature            BRE                  ERE                  PCRE
. * ^ $ []         Yes                  Yes                  Yes
+ ?                \+ \?                + ?                  + ?
{n,m}              \{n,m\}              {n,m}                {n,m}
() groups          \(\)                 ()                   ()
Alternation        \|                   |                    |
\d                 No                   No                   Yes
\w \s              GNU extension only   GNU extension only   Yes
Lazy quantifiers   No                   No                   Yes
Lookahead/behind   No                   No                   Yes
Backreferences     \1                   \1                   \1 or $1
grep flag          (default)            -E                   -P
sed flag           (default)            -E                   N/A

ERE is the practical default

Unless you have a specific reason to use BRE, use grep -E and sed -E for everything. The unescaped syntax is cleaner and less error-prone. Use grep -P when you need PCRE-only features like \d, lazy quantifiers, or lookaround.


Lookahead and Lookbehind

Lookaround assertions match a position based on what's ahead or behind, without consuming characters. They require PCRE (grep -P).

Syntax     Name                 Matches
(?=...)    Positive lookahead   Position followed by pattern
(?!...)    Negative lookahead   Position NOT followed by pattern
(?<=...)   Positive lookbehind  Position preceded by pattern
(?<!...)   Negative lookbehind  Position NOT preceded by pattern

# Extract numbers that are followed by "GB"
grep -oP '\d+(?=GB)' disk-report.txt
# Input: "500GB" -> Output: "500"

# Extract numbers NOT followed by "MB"
# (the \d alternative stops \d+ from backtracking to a shorter number, e.g. "50" in "500MB")
grep -oP '\d+(?!\d|MB)' report.txt

# Extract values after "price: "
grep -oP '(?<=price: )\d+\.\d{2}' catalog.txt
# Input: "price: 49.99" -> Output: "49.99"

# Extract "happy" only when it is NOT preceded by "un"
grep -oP '(?<!un)happy' text.txt
# Matches "happy" but not the "happy" inside "unhappy"

Lookaround is especially useful with -o (print only the matching part), because the lookaround context isn't included in the output.
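
One pitfall with a negative lookahead after a greedy quantifier: the engine can backtrack to a shorter match that satisfies the assertion. The sketch below, on made-up input and assuming a grep built with -P support, compares a plain \d+(?!MB) with a variant that also forbids a following digit.

```shell
# Hypothetical report line
report='500MB 250GB'

# Naive: \d+ gives back a digit, so "50" (followed by "0", not "MB") slips through
naive=$(echo "$report" | grep -oP '\d+(?!MB)')

# Stricter: the match may not be followed by another digit or by "MB"
strict=$(echo "$report" | grep -oP '\d+(?!\d|MB)')

echo "$naive"
# prints:
# 50
# 250
echo "$strict"
# prints: 250
```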


Practical Patterns

IP Addresses

# Basic IP pattern (matches 0-999 per octet - good enough for log extraction)
grep -E '[0-9]+\.[0-9]+\.[0-9]+\.[0-9]+' access.log

# Strict IP pattern (0-255 per octet, PCRE)
grep -P '\b(?:(?:25[0-5]|2[0-4]\d|[01]?\d\d?)\.){3}(?:25[0-5]|2[0-4]\d|[01]?\d\d?)\b' access.log

# Extract IPs with -o
grep -oE '[0-9]+\.[0-9]+\.[0-9]+\.[0-9]+' access.log
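
A common follow-up to extraction is listing each address once. This is a minimal sketch on invented log lines; the addresses come from the reserved documentation ranges.

```shell
# Hypothetical access-log lines
log=$(printf '%s\n' \
  '192.0.2.10 - - "GET / HTTP/1.1" 200 512' \
  '198.51.100.7 - - "GET /about HTTP/1.1" 200 128' \
  '192.0.2.10 - - "GET /missing HTTP/1.1" 404 0')

# -o prints one address per line; sort -u removes the duplicates
unique_ips=$(printf '%s\n' "$log" | grep -oE '[0-9]+\.[0-9]+\.[0-9]+\.[0-9]+' | sort -u)

echo "$unique_ips"
# prints:
# 192.0.2.10
# 198.51.100.7
```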

Email Addresses

# Simple email pattern
grep -E '[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}' contacts.txt

Dates

# YYYY-MM-DD
grep -E '[0-9]{4}-[0-9]{2}-[0-9]{2}' logfile.txt

# MM/DD/YYYY or DD/MM/YYYY
grep -E '[0-9]{2}/[0-9]{2}/[0-9]{4}' logfile.txt

# Syslog date format (Mar 25 14:30:01)
grep -E '^[A-Z][a-z]{2} [0-9 ][0-9] [0-9]{2}:[0-9]{2}:[0-9]{2}' /var/log/syslog

Log Lines

# Extract the protocol and status code from access logs (e.g. HTTP/1.1" 200)
grep -oE 'HTTP/[0-9.]+" [0-9]{3}' access.log

# Find 5xx errors
grep -E '" 5[0-9]{2} ' access.log

# Extract key=value pairs
grep -oP '\w+=\S+' config.log
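
The -o pattern for status codes prints the protocol along with the code. When only the code is wanted, a sed capture group is one option. The log line below is invented for illustration.

```shell
# Hypothetical access-log line
line='192.0.2.10 - - [25/Mar/2024:14:30:01 +0000] "GET / HTTP/1.1" 200 512'

# -n with the p flag prints only lines where the substitution succeeded;
# \1 keeps just the three-digit code captured after the closing quote
status=$(printf '%s\n' "$line" | sed -nE 's/.*HTTP\/[0-9.]+" ([0-9]{3}).*/\1/p')

echo "$status"
# prints: 200
```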

Testing and Debugging Regex

Build Patterns Incrementally

Start simple and add complexity one piece at a time:

# Step 1: Find lines with "error"
grep -i 'error' logfile.txt

# Step 2: Add a timestamp pattern before it (keep -i so the match set stays comparable)
grep -iE '[0-9]{2}:[0-9]{2}:[0-9]{2}.*error' logfile.txt

# Step 3: Extract the timestamp and error message with -o
grep -oiE '[0-9]{2}:[0-9]{2}:[0-9]{2}.*error[^"]*' logfile.txt

Use Color Highlighting

# Highlight matches in the terminal
grep --color=auto -E 'pattern' file.txt

# Make it the default (add this line to ~/.bashrc to keep it across sessions)
alias grep='grep --color=auto'

Common Mistakes

Mistake                        Problem                                 Fix
grep 'a+b'                     BRE treats + as literal                 grep -E 'a+b'
grep -E '1.2.3.4'              Dots match any character                grep -E '1\.2\.3\.4'
grep -E '<.*>'                 Greedy match spans too far              grep -P '<.*?>' or grep -E '<[^>]*>'
grep -E '[0-9]+' <<< "abc"     No output and exit code 1               Expected: no digits in input
sed 's/old/new/'               Only replaces first match on each line  sed 's/old/new/g' for global
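
The exit code in the table is what makes grep usable in conditionals: 0 means at least one line matched, 1 means none did, and 2 signals an error. A short sketch on made-up input:

```shell
# -q suppresses output; only the exit status is used
status=0
printf '%s\n' 'abc' | grep -qE '[0-9]+' || status=$?

if [ "$status" -eq 0 ]; then
  result='has digits'
else
  result='no digits'
fi

echo "exit status: $status, result: $result"
# prints: exit status: 1, result: no digits
```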
