Regular Expressions¶
Pattern Matching and Text Transformation¶
Version: 1.0 Year: 2026
Copyright Notice¶
Copyright (c) 2025-2026 Ryan Thomas Robson / Robworks Software LLC. Licensed under CC BY-NC-ND 4.0. You may share this material for non-commercial purposes with attribution, but you may not distribute modified versions.
Most languages treat regular expressions as a library feature - you import a module, create a pattern object, and call methods on it. Perl is different. Regular expressions are woven directly into the language syntax. They have their own operators (=~, !~), their own quoting constructs (m//, s///, qr//), and their own set of special variables ($1, $&, $+{name}). This is not an accident - Perl was originally designed for text processing, and regex is its native tongue.
Perl's regex syntax is so influential that it became the model for PCRE (Perl-Compatible Regular Expressions), the library behind regex support in PHP, R, and many other tools. Python, Java, JavaScript, and .NET ship their own engines, but their syntax is largely borrowed from Perl. When you learn Perl regex, you are learning the dialect that shaped most modern regex implementations.
This guide takes you from basic matching through advanced features like lookaround assertions and compiled patterns. Every concept builds on the previous one, so read it in order if this is your first pass through Perl regex.
Matching with m//¶
The =~ binding operator connects a string to a pattern. The basic form tests whether a string matches:
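A minimal sketch of such a test, assuming a variable named $str that holds a sample greeting (the string itself is chosen only for illustration):

```perl
use strict;
use warnings;

my $str = "Hello, World!";   # sample input, chosen for illustration

# =~ is true here because "World" occurs somewhere inside $str
my $found = $str =~ /World/ ? 1 : 0;
print "Found it\n" if $found;
```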
The /World/ part is a regex pattern. The =~ operator binds $str to that pattern and returns true if the pattern matches anywhere in the string.
The m// Operator¶
The slashes are shorthand for m// (match). You can use m with any delimiter:
# These are all equivalent
$str =~ /World/
$str =~ m/World/
$str =~ m{World}
$str =~ m|World|
$str =~ m!World!
Alternate delimiters are useful when your pattern contains slashes - matching a file path like /usr/local/bin is cleaner as m{/usr/local/bin} than as /\/usr\/local\/bin/.
Negation with !~¶
The !~ operator is the opposite of =~. It returns true when the pattern does not match:
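As a small illustration, assuming a hypothetical log line stored in $line:

```perl
use strict;
use warnings;

my $line = "status: ok";   # hypothetical log line

# !~ is true because "error" does NOT occur in $line
my $clean = $line !~ /error/ ? 1 : 0;
print "No errors found\n" if $clean;
```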
Default Variable Matching¶
When you omit the =~ operator, Perl matches against $_, the default variable:
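A short sketch of implicit matching, using made-up log lines; both grep and for alias each element to $_:

```perl
use strict;
use warnings;

my @lines = ("error: disk full", "info: started", "error: timeout");

# Inside grep, each element is aliased to $_, so a bare /.../ tests it
my @errors = grep { /^error/ } @lines;

# The same implicit match inside a for loop
for (@lines) {
    print "$_\n" if /timeout/;   # prints "error: timeout"
}
```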
This is idiomatic in loops and grep/map blocks where $_ is set automatically.
Match Variables¶
After a successful match, Perl sets several special variables:
| Variable | Contains |
|---|---|
| $& | The entire matched text |
| $` | Everything before the match |
| $' | Everything after the match |
my $str = "The quick brown fox";
if ($str =~ /quick/) {
print "Before: '$`'\n"; # "The "
print "Match: '$&'\n"; # "quick"
print "After: '$''\n"; # " brown fox"
}
Performance Cost of $&
Before Perl 5.20, using $&, $`, or $' anywhere in your program forced Perl to save this data for every regex match in the entire program - even matches that never use these variables. The /p modifier (Perl 5.10+) with ${^MATCH}, ${^PREMATCH}, and ${^POSTMATCH} avoids that cost; from Perl 5.20 on, copy-on-write removes the penalty. Or better yet, use captures (covered below).
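A sketch of the /p form, reusing the sample string from the example above. The variable names $pre, $match, and $post are only for illustration:

```perl
use strict;
use warnings;

my $str = "The quick brown fox";

my ($pre, $match, $post);
if ($str =~ /quick/p) {
    # /p makes these three variables available for this match
    ($pre, $match, $post) = (${^PREMATCH}, ${^MATCH}, ${^POSTMATCH});
    print "Before: '$pre'\n";    # 'The '
    print "Match:  '$match'\n";  # 'quick'
    print "After:  '$post'\n";   # ' brown fox'
}
```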
Character Classes¶
A character class matches one character from a defined set. Square brackets define a custom class:
/[aeiou]/ # matches any single vowel
/[0-9]/ # matches any digit
/[A-Za-z]/ # matches any ASCII letter
/[a-zA-Z0-9_]/ # matches word characters
Ranges and Negation¶
Hyphens inside brackets define ranges. A caret ^ at the start negates the class:
/[^aeiou]/ # matches any character that is NOT a vowel
/[^0-9]/ # matches any non-digit
/[a-fA-F0-9]/ # matches a hexadecimal digit
Literal Hyphen in Character Classes
To include a literal hyphen, place it first, last, or escape it: [-abc], [abc-], or [a\-c]. Placing it between characters creates a range.
POSIX Character Classes¶
POSIX classes use a double-bracket syntax inside a character class:
/[[:alpha:]]/ # alphabetic characters
/[[:digit:]]/ # digits (same as [0-9])
/[[:alnum:]]/ # alphanumeric
/[[:space:]]/ # whitespace characters
/[[:upper:]]/ # uppercase letters
/[[:lower:]]/ # lowercase letters
/[[:punct:]]/ # punctuation characters
POSIX classes follow the character-set rules in effect (Unicode, or the current locale under use locale), which makes them more portable than hardcoded ranges for internationalized text. In practice, most Perl code uses the shorthand escapes below.
Shorthand Character Classes¶
Perl provides single-character shortcuts for the most common classes:
| Shorthand | Matches | Equivalent |
|---|---|---|
| \d | A digit | [0-9] (ASCII) or Unicode digit |
| \D | A non-digit | [^0-9] |
| \w | A "word" character | [a-zA-Z0-9_] |
| \W | A non-word character | [^a-zA-Z0-9_] |
| \s | Whitespace | [ \t\n\r\f] (plus vertical tab since Perl 5.18) |
| \S | Non-whitespace | [^ \t\n\r\f] |
| . | Any character except newline | [^\n] (unless /s modifier) |
# Match a simple date format
if ($str =~ /\d{4}-\d{2}-\d{2}/) {
print "Looks like a date\n";
}
# Match a Perl variable name
if ($str =~ /[\$\@\%]\w+/) {
print "Looks like a variable\n";
}
Unicode and \d
Unless the /a modifier is in effect, \d matches any Unicode decimal digit - including Arabic-Indic, Devanagari, and digits from other scripts. For strict ASCII digits, use [0-9] explicitly or add /a.
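The difference can be checked with a single non-ASCII digit. This sketch assumes Perl 5.14 or later (for /a) and writes the character as an escape so the source file stays ASCII:

```perl
use strict;
use warnings;

# U+0663 ARABIC-INDIC DIGIT THREE
my $arabic_three = "\x{0663}";

my $default = $arabic_three =~ /^\d$/    ? 1 : 0;  # Unicode-aware \d
my $ascii   = $arabic_three =~ /^\d$/a   ? 1 : 0;  # /a restricts \d to 0-9
my $class   = $arabic_three =~ /^[0-9]$/ ? 1 : 0;  # explicit ASCII range

print "default=$default ascii=$ascii class=$class\n";  # default=1 ascii=0 class=0
```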
Quantifiers¶
Quantifiers control how many times a preceding element must appear for the pattern to match.
Basic Quantifiers¶
| Quantifier | Meaning | Example |
|---|---|---|
| * | 0 or more | /bo*/ matches "b", "bo", "boo", "booo" |
| + | 1 or more | /bo+/ matches "bo", "boo", "booo" (not "b") |
| ? | 0 or 1 | /colou?r/ matches "color" and "colour" |
| {n} | Exactly n | /\d{4}/ matches exactly 4 digits |
| {n,} | n or more | /\d{2,}/ matches 2 or more digits |
| {n,m} | Between n and m | /\d{2,4}/ matches 2 to 4 digits |
# Phone number: 3 digits, separator, 3 digits, separator, 4 digits
if ($phone =~ /\d{3}[-.\s]\d{3}[-.\s]\d{4}/) {
print "Valid phone format\n";
}
# One or more whitespace-separated words
if ($line =~ /\w+(\s+\w+)*/) {
print "Contains words\n";
}
Greedy vs Non-Greedy¶
By default, quantifiers are greedy - they match as much text as possible while still allowing the overall pattern to succeed. Adding ? after a quantifier makes it non-greedy (also called lazy) - it matches as little as possible.
my $html = "<b>bold</b> and <i>italic</i>";
# Greedy: .* grabs as much as possible
$html =~ /<.*>/;
# $& is "<b>bold</b> and <i>italic</i>"
# Non-greedy: .*? grabs as little as possible
$html =~ /<.*?>/;
# $& is "<b>"
This is one of the most common sources of regex bugs. If your pattern matches more text than expected, a greedy quantifier is usually the cause.
| Greedy | Non-greedy | Meaning |
|---|---|---|
| * | *? | 0 or more (prefer fewer) |
| + | +? | 1 or more (prefer fewer) |
| ? | ?? | 0 or 1 (prefer 0) |
| {n,m} | {n,m}? | n to m (prefer n) |
Possessive Quantifiers¶
Possessive quantifiers (Perl 5.10+) add + after the quantifier. They behave like greedy quantifiers but never backtrack - once they consume characters, they do not give them back:
# Possessive: \d++ grabs all digits and refuses to backtrack
"12345abc" =~ /\d++\d/; # FAILS - \d++ takes all digits, nothing left for \d
# Greedy: \d+ grabs all digits, then backtracks one for the final \d
"12345abc" =~ /\d+\d/; # Matches "12345"
Possessive quantifiers are a performance optimization. Use them when you know backtracking would be pointless - the engine fails faster instead of trying every possible combination.
How the Engine Backtracks¶
The following diagram shows how Perl's regex engine processes the greedy pattern a+b against the string "aaac". The engine matches greedily, then backtracks when it cannot find b:
flowchart TD
S[Start: match a+b against 'aaac'] --> A1[a+ matches 'aaa' - greedy, takes all]
A1 --> B1{Next char is 'c' - does it match b?}
B1 -->|No| BT1[Backtrack: a+ gives back one 'a' - now matches 'aa']
BT1 --> B2{Next char is 'a' - does it match b?}
B2 -->|No| BT2[Backtrack: a+ gives back another - now matches 'a']
BT2 --> B3{Next char is 'a' - does it match b?}
B3 -->|No| BT3[Backtrack: a+ has minimum 1 - cannot give back more]
BT3 --> FAIL[Match fails at position 0]
FAIL --> NEXT[Advance start position and retry]
NEXT --> NOPE[No match found in string]
This backtracking behavior is what makes greedy quantifiers expensive on long strings with no match. Possessive quantifiers (a++b) would fail immediately at the first step - a++ takes all three a characters and refuses to backtrack, so the engine knows instantly that b cannot match.
Anchors¶
Anchors match a position in the string, not a character. They constrain where a pattern can match without consuming any text.
| Anchor | Position |
|---|---|
| ^ | Start of string (or start of line with /m) |
| $ | End of string or before a final newline (or end of line with /m) |
| \b | Word boundary (between \w and \W) |
| \B | Not a word boundary |
| \A | Absolute start of string (ignores /m) |
| \z | Absolute end of string (ignores /m) |
| \Z | End of string or before final newline (ignores /m) |
# Match only if the entire string is digits
if ($str =~ /^\d+$/) {
print "All digits\n";
}
# Match 'cat' only as a whole word, not inside 'concatenate'
if ($str =~ /\bcat\b/) {
print "Found the word 'cat'\n";
}
# Validate that a string starts with a letter
if ($str =~ /\A[a-zA-Z]/) {
print "Starts with a letter\n";
}
^ and $ with Multiline Mode¶
By default, ^ matches only at the start of the string and $ only at the end (or just before a trailing newline). With the /m modifier, they match at the start and end of each line:
my $text = "first line\nsecond line\nthird line\n";
# Without /m - matches only at string start
my @first = ($text =~ /^(\w+)/g);
# @first = ("first")
# With /m - matches at start of each line
my @each = ($text =~ /^(\w+)/gm);
# @each = ("first", "second", "third")
When you need to anchor to the absolute start or end regardless of /m, use \A and \z:
# Always matches absolute start - even with /m
if ($str =~ /\A#!/) {
print "Starts with shebang\n";
}
Word Boundaries¶
The \b anchor matches the zero-width position between a word character (\w) and a non-word character (\W), or between \w and the start/end of the string.
my $str = "caterpillar has a cat in it";
$str =~ /cat/; # matches "cat" in "caterpillar" (first occurrence)
$str =~ /\bcat\b/; # matches "cat" as a standalone word
# Practical: highlight whole words only
$str =~ s/\bcat\b/[CAT]/g;
# "caterpillar has a [CAT] in it"
Captures and Backreferences¶
Parentheses in a regex do two things: they group sub-patterns and they capture the matched text into numbered variables.
Numbered Captures¶
Each pair of parentheses creates a capture group. After a successful match, $1 holds the text matched by the first group, $2 the second, and so on:
my $date = "2025-01-15";
if ($date =~ /(\d{4})-(\d{2})-(\d{2})/) {
print "Year: $1\n"; # 2025
print "Month: $2\n"; # 01
print "Day: $3\n"; # 15
}
Captures are numbered by the position of their opening parenthesis, counting from left to right:
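Nesting makes the rule visible. In this sketch (the date string is reused from the example above), the outer group opens first and so becomes $1, even though it closes after $2 and $3:

```perl
use strict;
use warnings;

# Opening parentheses, left to right:
#   1: ((\d{4})-(\d{2}))   2: (\d{4})   3: (\d{2})   4: final (\d{2})
my @groups = "2025-01-15" =~ /((\d{4})-(\d{2}))-(\d{2})/;

# Prints $1 = 2025-01, $2 = 2025, $3 = 01, $4 = 15
printf "\$%d = %s\n", $_ + 1, $groups[$_] for 0 .. $#groups;
```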
Captures in List Context¶
In list context, a match returns all captured groups as a list:
my ($year, $month, $day) = ("2025-01-15" =~ /(\d{4})-(\d{2})-(\d{2})/);
print "$month/$day/$year\n"; # 01/15/2025
Combined with /g, this extracts all occurrences:
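A brief sketch with invented input strings. With one capture group, /g returns one element per match; with several groups, the groups from every match are flattened into a single list, which conveniently fills a hash:

```perl
use strict;
use warnings;

my $text  = "Released 2024-11-02, patched 2025-01-15";
my @dates = $text =~ /(\d{4}-\d{2}-\d{2})/g;
print "$_\n" for @dates;                     # 2024-11-02, then 2025-01-15

# Two groups per match: the flattened list is (key, value, key, value)
my %settings = "host=db port=5432" =~ /(\w+)=(\S+)/g;
print "$settings{host}:$settings{port}\n";   # db:5432
```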
Named Captures¶
Named captures (Perl 5.10+) use (?<name>...) syntax and store results in $+{name}:
my $line = "192.168.1.100 - admin [15/Jan/2025:09:23:45] GET /index.html";
if ($line =~ /(?<ip>[\d.]+)\s+-\s+(?<user>\w+)\s+\[(?<time>[^\]]+)\]\s+(?<method>\w+)\s+(?<path>\S+)/) {
print "IP: $+{ip}\n"; # 192.168.1.100
print "User: $+{user}\n"; # admin
print "Time: $+{time}\n"; # 15/Jan/2025:09:23:45
print "Method: $+{method}\n"; # GET
print "Path: $+{path}\n"; # /index.html
}
Named captures make complex patterns self-documenting. The numbered variables ($1, $2) still work alongside named ones.
Backreferences¶
Backreferences refer to a previously captured group within the same pattern. Use \1, \2, etc. inside the pattern itself:
# Match repeated words ("the the", "is is")
if ($text =~ /\b(\w+)\s+\1\b/i) {
print "Duplicate word: $1\n";
}
# Match matching quotes
if ($str =~ /(["']).*?\1/) {
print "Found quoted string\n";
}
Named backreferences use \k<name>:
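For instance, the repeated-word check above might be rewritten with a named group. The group name word and the sample sentence are arbitrary choices for this sketch:

```perl
use strict;
use warnings;

my $text = "This is is a test";
my $dup;

# \k<word> must match the same text that (?<word>...) captured
if ($text =~ /\b(?<word>\w+)\s+\k<word>\b/i) {
    $dup = $+{word};
    print "Duplicate word: $dup\n";   # Duplicate word: is
}
```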
Non-Capturing Groups¶
When you need grouping for alternation or quantifiers but do not need the captured text, use (?:...):
# Capturing: wastes a capture slot on something we don't need
if ($url =~ /(https?)(:\/\/.+)/) { ... }
# Non-capturing: groups without capturing
if ($url =~ /(?:https?):\/\/(.+)/) {
print "Host and path: $1\n"; # $1 is now the useful part
}
Non-capturing groups are a habit worth building. They keep your capture numbering clean and avoid unnecessary work.
Alternation and Grouping¶
The | operator means "or" - it matches the pattern on the left or the pattern on the right:
# Match any of these keywords
if ($line =~ /error|warning|critical/) {
print "Problem detected\n";
}
Precedence¶
Alternation has low precedence - lower than concatenation. This means /abc|def/ matches "abc" or "def", not "ab(c or d)ef". Use grouping to control scope:
# Without grouping: each alternative is a complete word
/gray|grey/
# With grouping: the alternation covers only the letter that differs
/gr(?:a|e)y/
# Common pattern: match file extensions
/\.(?:jpg|jpeg|png|gif|webp)$/i
Alternation Ordering¶
The regex engine tries alternatives left to right and takes the first match. This matters when alternatives overlap:
# "catfish" matches "cat" (first alternative wins)
"catfish" =~ /cat|catfish/; # $& is "cat"
# Put longer alternatives first
"catfish" =~ /catfish|cat/; # $& is "catfish"
Alternation vs Character Class
For single characters, [abc] is generally more efficient than a|b|c. A character class is tested in a single step, while alternation may require the engine to try each branch in turn.
Substitution with s///¶
The s/// operator finds a pattern and replaces it:
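A minimal sketch, with a sample string chosen for illustration:

```perl
use strict;
use warnings;

my $greeting = "Hello, World!";

# s/// modifies $greeting in place and returns the number of replacements
my $count = ($greeting =~ s/World/Perl/);

print "$greeting\n";   # Hello, Perl!
```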
The left side is a regex pattern. The right side is a replacement string (not a pattern). Captures from the left side are available in the replacement:
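As a sketch, swapping a "Last, First" name into "First Last" order (the name is just sample data):

```perl
use strict;
use warnings;

my $name = "Lovelace, Ada";

# $1 and $2 in the replacement refer to the two capture groups
$name =~ s/(\w+),\s*(\w+)/$2 $1/;

print "$name\n";   # Ada Lovelace
```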
Common s/// Modifiers¶
# /g - replace ALL occurrences (not just the first)
my $str = "aaa bbb aaa";
$str =~ s/aaa/zzz/g;
# "zzz bbb zzz"
# /i - case-insensitive match
$str =~ s/hello/Hi/gi;
# /r - return modified copy, leave original unchanged (5.14+)
my $original = "Hello, World!";
my $modified = $original =~ s/World/Perl/r;
# $original is still "Hello, World!"
# $modified is "Hello, Perl!"
# /e - evaluate replacement as Perl code
my $text = "price: 100 dollars";
$text =~ s/(\d+)/$1 * 1.1/e;
# "price: 110 dollars"
The /r modifier is particularly valuable in pipelines where you want to transform without mutating:
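One way this can look in a map block, assuming Perl 5.14+ and some untidy sample input. Because /r returns a copy, the elements of @raw (aliased as $_) are left untouched:

```perl
use strict;
use warnings;

my @raw = ("  alice ", "BOB", " Carol");   # sample data for illustration

# Trim both ends on a copy, then lowercase the copy
my @tidy = map { lc(s/^\s+|\s+$//gr) } @raw;

print join(",", @tidy), "\n";   # alice,bob,carol
```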
Chaining Substitutions¶
Multiple s/// calls can be chained, especially with /r:
my $clean = $input
=~ s/^\s+//r # trim leading whitespace
=~ s/\s+$//r # trim trailing whitespace
=~ s/\s+/ /gr; # collapse internal whitespace
Modifiers¶
Pattern modifiers change how the regex engine interprets a pattern. You have already seen several - here are the ones you will use most often:
| Modifier | Effect |
|---|---|
| /i | Case-insensitive matching |
| /g | Global - match/replace all occurrences |
| /m | Multiline - ^ and $ match line boundaries |
| /s | Single-line - . matches \n |
| /x | Extended - ignore whitespace, allow comments |
| /e | Evaluate replacement as code (s/// only) |
| /r | Return modified copy (s/// only, 5.14+) |
| /p | Preserve match variables (${^MATCH}, etc.) |
| /a | ASCII - \d, \w, \s match ASCII only |
The /x Modifier for Readable Patterns¶
Complex patterns become unreadable fast. The /x modifier lets you add whitespace and comments:
# Without /x - good luck reading this
my $email_re_compact = qr/^[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}$/;
# With /x - same pattern, now readable
my $email_re = qr/
^ # start of string
[a-zA-Z0-9._%+-]+ # local part (before @)
@ # literal @ sign
[a-zA-Z0-9.-]+ # domain name
\. # literal dot
[a-zA-Z]{2,} # TLD (2+ letters)
$ # end of string
/x;
With /x, whitespace in the pattern is ignored (use a backslash followed by a space, [ ], or \s to match a literal space) and # starts a comment that runs to end of line. It pays off for any pattern longer than about 30 characters.
Combining Modifiers¶
Modifiers can be stacked:
# Case-insensitive global substitution
$str =~ s/error/WARNING/gi;
# Multiline, extended - parse a multi-line log block
$block =~ m/
^ # start of line (thanks to /m)
(\d{4}-\d{2}-\d{2}) # date
\s+
(\w+) # log level
\s+
(.+) # message
$ # end of line
/xm;
# Single-line mode - match across newlines
$html =~ m/<div.*?<\/div>/s;
Confusing Names: /m and /s
/m (multiline) changes ^ and $ to match line boundaries. /s (single-line) changes . to match newlines. Despite their names, they are independent - you can use both at once (/ms) when you need . to cross lines AND ^/$ to match per-line.
Lookahead and Lookbehind¶
Lookaround assertions check whether a pattern exists before or after the current position without including it in the match. They are zero-width - they assert a condition without consuming characters.
Lookahead¶
Positive lookahead (?=...) succeeds if the pattern ahead matches:
# Match "Perl" only if followed by a space and a version number
"Perl 5.40" =~ /Perl(?=\s\d)/;
# Matches "Perl" (not "Perl 5")
Negative lookahead (?!...) succeeds if the pattern ahead does NOT match:
# Match a number not followed by a percent sign
"50 items, 30% discount" =~ /\d+(?!%)/;
# Matches "50". Caution: the engine can backtrack inside \d+, so with /g this
# pattern would also find "3" in "30%". Use /\d+(?![\d%])/ to prevent that.
Lookbehind¶
Positive lookbehind (?<=...) succeeds if the pattern behind matches:
# Match a number preceded by a dollar sign
# (single quotes keep Perl from interpolating $42 as a variable)
'Price: $42.99' =~ /(?<=\$)\d+/;
# Matches "42" (the $ is not part of the match)
Negative lookbehind (?<!...) succeeds if the pattern behind does NOT match:
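A sketch that collects numbers that are not prices, using an invented input string. The lookbehind excludes a preceding digit as well as a dollar sign; with (?<!\$) alone, the engine would skip the "4" but still match the "2" of "$42", because "2" is preceded by "4", not by "$":

```perl
use strict;
use warnings;

my $text = 'cost $42, qty 17';   # single quotes: no interpolation of $42

# Match a run of digits not preceded by "$" or by another digit
my @plain = $text =~ /(?<![\$\d])\d+/g;

print "@plain\n";   # 17
```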
Lookbehind Length Restriction
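Before Perl 5.30, a lookbehind must match a fixed number of characters, so quantifiers such as *, +, and {1,10} are not allowed inside it: (?<=\d+) is rejected, while (?<=\d) and (?<=\d{3}) are fine. Perl 5.30 added experimental support for variable-length lookbehind when the length is bounded (up to 255 characters), which permits (?<=\d{1,10}) but still rejects * and +. Fixed-length is safest.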
Practical Lookaround Examples¶
# Add commas to a number: 1234567 -> 1,234,567
my $num = "1234567";
$num =~ s/(\d)(?=(\d{3})+(?!\d))/$1,/g;
print $num; # "1,234,567"
# Extract values from key=value pairs, only if key is "host"
my $config = "host=db.example.com port=5432 host=cache.local";
my @hosts = ($config =~ /(?<=host=)\S+/g);
# @hosts = ("db.example.com", "cache.local")
# Password validation: 8+ characters with at least one digit and one uppercase letter
if ($pass =~ /\A(?=.*\d)(?=.*[A-Z]).{8,}\z/) {
print "Password meets requirements\n";
}
The password example stacks two positive lookaheads at position 0, pinned there by \A. Each one scans forward independently to verify a condition, then the final .{8,} actually consumes the string. This "stacked lookahead" technique is common for checking several conditions in a single pattern.
split with Regex¶
The split function divides a string into a list using a regex as the delimiter:
my @fields = split /,/, "Alice,30,Engineer";
# @fields = ("Alice", "30", "Engineer")
my @words = split ' ', " hello world ";
# @words = ("hello", "world")
# A single-space string skips leading whitespace; split /\s+/ would return a leading empty field
The Limit Parameter¶
A third argument limits how many fields are returned:
my @parts = split /:/, "one:two:three:four", 3;
# @parts = ("one", "two", "three:four")
# Third element contains the unsplit remainder
Special Cases¶
# split with no arguments splits $_ on whitespace (like awk)
for (" Alice 30 Engineer ") {
my @fields = split;
# @fields = ("Alice", "30", "Engineer")
}
# An empty pattern splits the string into characters
my @chars = split //, "hello";
# @chars = ("h", "e", "l", "l", "o")
# Split on a literal dot (escaped, because . is a regex metacharacter)
my @parts = split /\./, "www.example.com";
# @parts = ("www", "example", "com")
Capturing Separators¶
When the split pattern contains captures, the captured separators are included in the output list:
my @tokens = split /(\s+)/, "hello world";
# @tokens = ("hello", " ", "world")
# The captured whitespace appears between the fields
# Useful for preserving formatting
my @parts = split /(,\s*)/, "a, b,c, d";
# @parts = ("a", ", ", "b", ",", "c", ", ", "d")
split vs Regex Match
Use split when you know the delimiters and want the content between them. Use a regex match with captures when you know the field format and want to extract specific parts. For a simple comma-separated line with no quoted fields, split /,/ is simpler. For extracting a timestamp from a log line, a pattern with captures is clearer.
Compiled Patterns with qr//¶
The qr// operator compiles a regex pattern into a reusable object. This is useful when you need to store patterns in variables, build them dynamically, or reuse them across multiple matches:
my $date_re = qr/\d{4}-\d{2}-\d{2}/;
my $time_re = qr/\d{2}:\d{2}:\d{2}/;
if ($log_line =~ /^$date_re\s+$time_re/) {
print "Starts with a timestamp\n";
}
Compiled Patterns as Building Blocks¶
Complex patterns become manageable when assembled from named parts:
my $ip_octet = qr/(?:25[0-5]|2[0-4]\d|[01]?\d\d?)/;
my $ipv4 = qr/$ip_octet\.$ip_octet\.$ip_octet\.$ip_octet/;
my $port = qr/(?::\d{1,5})?/;
if ($addr =~ /^$ipv4$port$/) {
print "Valid IPv4 address (with optional port)\n";
}
Modifiers on qr//¶
Modifiers applied to qr// travel with the compiled pattern:
my $word = qr/hello/i; # case-insensitive
"HELLO WORLD" =~ /$word/; # matches, /i is baked in
my $verbose_date = qr/
(\d{4}) # year
-
(\d{2}) # month
-
(\d{2}) # day
/x;
Dynamic Pattern Construction¶
Build patterns from runtime data - but be careful about metacharacters:
# DANGEROUS - user input could contain regex metacharacters
my $search = "file.txt";
$str =~ /$search/; # The . matches ANY character
# SAFE - quotemeta escapes metacharacters
$str =~ /\Q$search\E/; # \Q...\E treats content as literal
# Building a pattern from a list of words
my @keywords = qw(error warning critical fatal);
my $pattern = join '|', map { quotemeta } @keywords;
my $re = qr/\b(?:$pattern)\b/i;
if ($log =~ $re) {
print "Found a problem keyword\n";
}
The \Q...\E escape sequence (or the quotemeta function) is essential when interpolating user-supplied strings into patterns. Without it, a search for "file.txt" would match "filextxt" because . is a regex metacharacter.
Common Patterns and Pitfalls¶
Regex is a powerful tool, but it has well-known failure modes. Knowing where regex breaks down is as important as knowing how to write patterns.
Email Validation - Do Not Roll Your Own¶
The "simple" email regex everyone writes is wrong:
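A typical example of such a pattern is sketched below. This is an illustrative stand-in, not a pattern taken from any particular codebase, and $naive_email is a hypothetical name:

```perl
use strict;
use warnings;

# Illustrative "simple" pattern: word characters, @, a domain, a dot, a suffix
my $naive_email = qr/^[\w.+-]+\@[\w-]+\.[\w.-]+$/;

my %verdict;
for my $addr ('user@example.com',
              '"quoted string"@example.com',
              'user+tag@[192.168.1.1]') {
    $verdict{$addr} = $addr =~ $naive_email ? "accepted" : "rejected";
    print "$addr: $verdict{$addr}\n";
}
```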
This rejects valid addresses like "quoted string"@example.com and user+tag@[192.168.1.1]. The RFC 5322 email spec is notoriously complex. Use Email::Valid instead:
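A minimal sketch, assuming Email::Valid is installed from CPAN. The require is wrapped in eval so the script still runs (and says so) when the module is missing; the candidate addresses are made up:

```perl
use strict;
use warnings;

my @candidates = ('user@example.com', 'not-an-address');
my %ok;

if (eval { require Email::Valid; 1 }) {
    for my $addr (@candidates) {
        # address() returns the address if it is valid, undef otherwise
        $ok{$addr} = defined Email::Valid->address($addr) ? 1 : 0;
        print "$addr: ", ($ok{$addr} ? "valid" : "invalid"), "\n";
    }
}
else {
    print "Email::Valid is not installed\n";
}
```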
IP Address Validation¶
A naive IP pattern matches invalid addresses:
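For illustration, a digits-and-dots pattern of the kind often seen ($naive_ip is a hypothetical name) accepts octets far outside 0-255:

```perl
use strict;
use warnings;

# Checks the shape (four dot-separated groups of 1-3 digits), not the range
my $naive_ip = qr/^\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3}$/;

my $accepts_bogus = "999.999.999.999" =~ $naive_ip ? 1 : 0;
print "999.999.999.999 accepted: $accepts_bogus\n";   # 999.999.999.999 accepted: 1
```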
A correct pattern validates octet ranges:
my $octet = qr/(?:25[0-5]|2[0-4]\d|[01]?\d\d?)/;
my $ipv4 = qr/^$octet\.$octet\.$octet\.$octet$/;
# But even better - use a module
use Data::Validate::IP;
Catastrophic Backtracking¶
Certain patterns cause the regex engine to try an exponential number of paths. The classic example:
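A sketch of the pattern shape in question, (a+)+b, next to an atomic-group variant. The input is kept short so the risky version finishes quickly, and Perl's optimizer may short-circuit simple cases like this one, so treat it as an illustration of the shape rather than a benchmark:

```perl
use strict;
use warnings;

my $risky = qr/^(a+)+b/;      # nested quantifiers: many ways to split a run of a's
my $safer = qr/^(?>a+)+b/;    # atomic group: a run of a's is never re-split

# Almost matches: a run of a's with no b after it
my $almost = ("a" x 12) . "c";

my $risky_hit = $almost =~ $risky ? 1 : 0;
my $safer_hit = $almost =~ $safer ? 1 : 0;
print "risky=$risky_hit safer=$safer_hit\n";     # risky=0 safer=0

# Both still accept input that really does contain the b
my $both_ok = ("aaab" =~ $risky && "aaab" =~ $safer) ? 1 : 0;
print "both match 'aaab': $both_ok\n";           # both match 'aaab': 1
```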
On a string like "aaaaaaaaaaaaaaaaac", the engine tries every way to partition the a characters between the inner and outer + quantifiers before concluding there is no b. Each additional a doubles the work.
Regex Denial of Service
Never use untrusted user input as a regex pattern without sanitizing it. A carefully crafted pattern can hang your program. Always use \Q...\E or quotemeta() when interpolating user strings into patterns.
Signs of backtracking trouble:
- Nested quantifiers: (a+)+, (a*)*, (\w+\s*)+
- Overlapping alternatives: (a|a)+
- Long strings that almost-but-do-not-quite match
Fixes:
- Use possessive quantifiers: (a++) instead of (a+)+
- Use atomic groups: (?>a+) prevents backtracking into the group
- Restructure the pattern to eliminate ambiguity
When to Use Modules Instead of Hand-Rolled Regex¶
| Task | Module | Why |
|---|---|---|
| Email validation | Email::Valid | RFC 5322 is too complex for a single regex |
| URL parsing | URI | Handles schemes, encoding, relative paths |
| HTML parsing | HTML::Parser, Mojo::DOM | HTML is not a regular language |
| CSV parsing | Text::CSV | Handles quoting, escaping, edge cases |
| JSON parsing | JSON::PP, Cpanel::JSON::XS | Regex cannot handle nested structures |
| Date parsing | Time::Piece, DateTime | Calendar math needs more than pattern matching |
The Right Tool
Regex is perfect for pattern matching in strings - extracting fields, validating simple formats, search-and-replace. It is the wrong tool for parsing recursive structures (HTML, JSON, XML) or validating complex business rules. When you find yourself writing a regex longer than two lines, consider whether a module would be more maintainable.
Putting It All Together¶
Regular expressions in Perl are not just a feature - they are a design philosophy. The language was built around the idea that text processing should be concise and expressive. Here is what you have covered:
- =~ and m// bind strings to patterns and test for matches
- Character classes define sets of characters to match against
- Quantifiers control repetition - greedy, non-greedy, and possessive
- Anchors constrain where patterns match without consuming text
- Captures extract parts of the match into numbered or named variables
- Alternation provides or-logic within patterns
- s/// replaces matched text with new content
- Modifiers (/i, /g, /m, /s, /x) change how the engine operates
- Lookaround assertions check context without consuming characters
- split breaks strings apart using regex delimiters
- qr// compiles and stores patterns for reuse
- Defensive practices protect against backtracking, metacharacter injection, and hand-rolled validation
The key to writing good regex is the same as writing good code: clarity over cleverness. Use /x for complex patterns. Use named captures for self-documentation. Use modules when the problem outgrows a single pattern. And always test your patterns against both matching and non-matching input.
Further Reading¶
- perlre - Perl Regular Expressions - complete reference for Perl regex syntax and features
- perlretut - Perl Regular Expressions Tutorial - official tutorial that walks through regex fundamentals
- perlreref - Perl Regular Expressions Reference - concise quick-reference card for regex syntax
- perlop - Quote-Like Operators - documentation for m//, s///, qr//, and tr///
- Mastering Regular Expressions, 3rd Edition - Jeffrey Friedl's definitive book on regex engines and optimization
- PCRE2 Specification - the Perl-compatible regex library used by PHP, R, and many other tools