PerlPMTutorial
Pattern Matching In Perl
Perl is a great language for working with large amounts of data. Pattern matching provides an extremely powerful way of analysing, extracting and transforming data. Best of all, in Perl pattern matching is integrated with the rest of the language, not just hacked on to it as an extension.
Writing Patterns
A pattern generally goes /between two slashes/, but it doesn't have to. For example, a pattern to match the word "hello" would look like:-
/hello/
But what do we mean by "a pattern to match"? This is where the =~ operator is brought into the arena.
The =~ Operator
This operator is one way we can bind a variable to a pattern. Examine this code:-
my $test = "Oh, hello world!";
if ($test =~ /hello/) {
print "Matched!\n";
}
It will print "Matched!". The Perl pattern matching engine has looked at $test to see whether the pattern exists anywhere in it. It so happened that it did; hello was found starting at the 5th character. Therefore, ($test =~ /hello/) evaluated true, and the code in the if block was executed.
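As a small sketch of the point about where the match was found: Perl records the offset of the last successful match in the special array @-. The variable name $position is only illustrative and is not part of the original example.

```perl
use strict;
use warnings;

my $test = "Oh, hello world!";
my $position;    # 1-based character position of the match, if any

if ($test =~ /hello/) {
    # $-[0] holds the 0-based offset at which the last successful match began,
    # so adding 1 gives the character position counted from 1.
    $position = $-[0] + 1;
    print "Matched at character $position\n";
}
```

If the pattern does not match, $position stays undefined and nothing is printed.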
Character Classes
So far we've not really done anything that a standard InString style function (e.g. InStr in Visual Basic) couldn't do. Things get more powerful when we can match any character of a certain type. This is where character classes come in. You can use a pre-defined character class or define your own. Here are some simple examples.
- . Matches any character apart from newline (it will match newline in Perl 6).
- \d Matches any digit.
- \w Matches any word character: a letter, a digit or an underscore (understands Unicode too).
- \s Matches any whitespace, e.g. tab, space.
- \N Matches anything that's not a newline (it will be the Perl 6 equivalent of . in Perl 5).
- /w.n/ Matches a w, then any character, then an n. So anything containing win or won would match, for example.
- /\w\d/ Matches any word character followed by a digit. For example, d2 and A3 would match.
- /w[io]n/ Matches win or won. Better than /w.n/, which would have matched wun as well.
- /[A-Z][a-z][0-9]/ Matches an uppercase letter between A and Z, followed by a lowercase letter between a and z, followed by a digit between 0 and 9. Inside square brackets a hyphen denotes a range, so [0123456789] means the same as [0-9].
You can negate a character class by putting ^ as the first character within it; this causes it to match any character not in it.
NOTE: If you're writing security checking patterns that validate data for invalid characters, make sure you state the allowed characters, not the disallowed ones. You might miss one out and open up a security hole.
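The note above can be sketched with a negated character class: list the characters you allow, negate the class, and reject the input if anything matches. The helper name is_safe_name and the choice of allowed characters are assumptions made for illustration.

```perl
use strict;
use warnings;

# Hypothetical helper: accept only non-empty strings made up entirely of
# ASCII letters, digits and underscores.
sub is_safe_name {
    my ($candidate) = @_;
    return 0 unless defined($candidate) && length($candidate);
    # [^A-Za-z0-9_] matches any character NOT in the allowed set, so a
    # successful match means the input contains something we did not allow.
    return $candidate =~ /[^A-Za-z0-9_]/ ? 0 : 1;
}

for my $input ("report_2024", "../etc/passwd", "name; rm -rf") {
    printf "%-15s %s\n", $input, is_safe_name($input) ? "ok" : "rejected";
}
```

Because the check looks for a single disallowed character anywhere in the string, a character you forgot about is rejected by default rather than let through.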
A Couple Of Zero Width Assertions
A zero width assertion is where we match a position rather than a character. ^ matches the start of the text we want to match, and $ matches the end of the text we want to match. Generally, we'll only want to put these at the start and end of a pattern.
"abcde" =~ /abc/; # Will match.
"abcde" =~ /^abc$/; # Will not match.
"abcde" =~ /^ab/; # Will match.
"abcde" =~ /e$/; # Will match.
Capturing
Sometimes you will want to capture bits of data from a string by matching it against a pattern. Let's take the string "Is Perl Just Another Hack?". Imagine we wanted to grab all the required bits out of it to generate "Just Another Perl Hacker".
"Is Perl Just Another Hack" =~ /\w+ (\w+) (\w+) (\w+)(er) (\w+)/;
print $1; # Prints Perl - the thing captured by the first brackets.
print $2; # Prints Just - the thing captured by the second brackets.
print $3; # Prints Anoth - the thing captured by the third brackets.
print $4; # Prints er - the thing captured by the fourth brackets.
print $5; # Prints Hack - the thing captured by the fifth brackets.
print "$2 $3$4 $1 $5$4"; # Prints "Just Another Perl Hacker".
As you can see from the above example, we capture part of a pattern by putting what we want to capture in brackets. The first thing captured gets stored in the scalar $1, the second in $2, and so on. The next successful match sets these "capture scalars" afresh, so if you want to keep anything you'd best assign it to another variable at some point!
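One way to keep captured values, sketched here, is to assign the match in list context: a successful match then returns the captured substrings, which can be copied straight into named variables. The variable names are illustrative.

```perl
use strict;
use warnings;

my $sentence = "Is Perl Just Another Hack?";

# In list context a successful match returns the captured substrings.
my ($lang, $just, $anoth, $er, $hack) =
    $sentence =~ /\w+ (\w+) (\w+) (\w+)(er) (\w+)/;

# A later successful match replaces $1, but our copies are unaffected.
if ("something else" =~ /(\w+)/) {
    print "\$1 is now: $1\n";
}

my $result = "$just $anoth$er $lang $hack$er";
print "$result\n";
```

If the first match fails, the list is empty and all five variables are undefined, so real code would normally check for that before using them.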
Alternatives
Sometimes you want to match something OR something else. Inside a pattern, or is spelt | (not too different from the || which you normally use for or). You can put as many alternatives as you want, separating them with |s.
print "it's a dog" if ("dog" =~ /cat|dog/);
print "it's a cat" if ("cat" =~ /cat|dog/);
That prints both "it's a dog" and "it's a cat". Sometimes you want to limit the scope of the or, in which case you need to bracket it off:-
"a dog" =~ /a (cat|dog)/;
The problem is that this captures cat or dog, which isn't always what we want. And the solution?
Clustering
If we want to cluster (group) a set of alternatives together then we use brackets. However, we may only want to cluster, and not capture. If we want a pair of brackets to cluster instead of capture, we simply write them as (?:...) instead of (...).
"a dog" =~ /a (?:cat|dog)/;
In that code cat or dog isn't captured, only matched against.
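The difference between the two kinds of brackets can be sketched as follows. The sample strings and variable names are made up for illustration.

```perl
use strict;
use warnings;

# With capturing brackets, the alternative that matched is captured.
my ($captured) = "a dog" =~ /a (cat|dog)/;

# With (?:...) the alternatives are grouped but not captured, so the first
# (and only) capture here is the one around \w+.
my ($name) = "a dog called Rex" =~ /a (?:cat|dog) called (\w+)/;

print "captured: $captured\n";
print "name: $name\n";
```

Using (?:...) keeps the numbering of $1, $2 and so on limited to the brackets you actually care about.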
Backreferences
What if you've already captured something in an earlier set of brackets and want to see if it occurs again? Backreferences are the answer. You simply put a backslash, followed by the number identifying the capture, e.g. if it is the first you'd use \1. It works just like $1 does outside the pattern. For example, to find duplicate words we might do this:-
if ($text =~ / (\w+) \1[ \.]/) {
print "Duplicate word $1 found.\n";
}
Remember that this will only print the first duplicate word found by the pattern. You may be wondering what the [ \.] is all about. It matches a space or a full stop, so that we can handle duplicate words at the end of sentences. Notice the \ before the . to stop it meaning any character. Inside a character class the . is already literal, so the backslash is not strictly needed there, but it does no harm.
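To report every duplicate rather than just the first, one option is to match repeatedly with the g modifier in a while loop. This is a sketch under two assumptions not in the example above: it uses the zero width assertion \b (a word boundary) in place of the surrounding space and [ \.], and the sample text is invented.

```perl
use strict;
use warnings;

my $text = "This is is a test of the the duplicate finder finder.";
my @duplicates;

# With /g in scalar context, each pass of the loop continues searching from
# where the previous match ended. \b stops "This is" from counting as "is is".
while ($text =~ /\b(\w+) \1\b/g) {
    push @duplicates, $1;
}

print "Duplicate words: @duplicates\n";
```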
Substitutions
Sometimes we want to match a string against a pattern and do a find and replace. Perl gives us real power in that we can build the replacement from things we've captured earlier on. For example, imagine that we wanted to write a Wiki rendering engine like this very one. We might do something like this:-
$input =~ s/</&lt;/g;
$input =~ s/>/&gt;/g;
$input =~ s/\[b\](.*?)\[\/b\]/<b>$1<\/b>/g;
$input =~ s/\[i\](.*?)\[\/i\]/<i>$1<\/i>/g;
# Assumes the opening tag has the form [color=NAME].
$input =~ s/\[color=(.*?)\](.*?)\[\/color\]/<font color="$1">$2<\/font>/g;
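If we bind something using the =~ operator and write something in the form s/.../.../ then Perl matches the pattern between the first and second slashes, and replaces what it matched with what is between the second and third slashes. The g on the end makes it replace every occurrence rather than just the first.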
The first two are very simple. We find a < and replace it with &lt;, or find a > and replace it with &gt; (a good idea for security). The other three capture various sections, and store them (like in a match) in $1, $2 etc. We then build the replacement in terms of what we found. Note how we use .*? rather than .* to tell it to do minimal matching. That way it only grabs what's between a [b] and the nearest [/b], not everything up to the last [/b] in the text (matches are greedy by default and grab as much as they can).
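The effect of minimal matching is easiest to see side by side. The following sketch wraps the escaping and [b] rules in a hypothetical render function and compares the result with a greedy version of the same rule; the sample input is made up.

```perl
use strict;
use warnings;

# Hypothetical renderer covering only the escaping and [b] rules.
sub render {
    my ($input) = @_;
    $input =~ s/</&lt;/g;
    $input =~ s/>/&gt;/g;
    $input =~ s/\[b\](.*?)\[\/b\]/<b>$1<\/b>/g;    # minimal: stops at nearest [/b]
    return $input;
}

my $minimal = render("[b]one[/b] and [b]two[/b] <script>");

# The same rule with a greedy .* for comparison: it runs on to the last [/b].
my $greedy = "[b]one[/b] and [b]two[/b]";
$greedy =~ s/\[b\](.*)\[\/b\]/<b>$1<\/b>/g;

print "$minimal\n";    # <b>one</b> and <b>two</b> &lt;script&gt;
print "$greedy\n";     # <b>one[/b] and [b]two</b>
```

Note that the escaping rules run before the tag rules, so the < and > that the renderer itself produces are not escaped again.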
Is That It?
There is more than this, but hopefully this guide will get you started. Maybe one day someone (maybe even me) will add more.
The canonical book on regular expressions is Mastering Regular Expressions by Jeffrey Friedl.
