Matching text with Perl Regular Expressions

Uncopyrighted by Jan Labanowski in Oct. 2005. You can do whatever you want with it, and even put your name on it, if you think that it will make you look good or bring you money.

If you read this document, please read it twice, since some elements are introduced earlier than they are explained (otherwise, it would be much longer). This text is primarily intended to be used with the Perl regular expression exercise page at: http://www.ccl.net/cgi-bin/ccl/regexp/test_re.pl. As you read, you should use this form and enter the examples given below to see your matches highlighted. Yes, it will take time, but you will learn the power of regular expressions, which have many practical uses. Moreover, while regular expressions are similar in spirit to UNIX shell globs, the similarity is superficial. Regular expressions are much more complicated and have a different syntax. E.g., the asterisk * in a shell glob like: ls -l *.doc would have to be represented in a regular expression as .* which means: "zero or more repeats (*) of any character (.)". The whole shell glob *.doc would correspond to the regular expression /^.*\.doc$/ (the ^ and $ anchors, explained later, tie the match to the whole name; a bare /.*\.doc/ would also match "my.docs.txt").

Perl is a popular scripting language for processing text. For this reason, it is often used for writing Web applications. The processing of entries from Web forms is frequently accomplished with the Perl interpreter due to its very powerful regular expression support. Perl is also a very convenient tool for converting input files from one format to another, for extracting needed data from large output files, and is an essential tool in bioinformatics/genomics. If you still do not know Perl, and do computing, it is time to learn it.

In this short presentation only the basic syntax of regular expressions will be covered, and the substitution of matched text (search and replace) will be mentioned only in passing. Likewise, I will not deal with Unicode. But hopefully, the background presented here can be a good starting point and encouragement to study the Camel Book, as it is called by Perl aficionados, a primary reference for the Perl language: Programming Perl by Larry Wall, Tom Christiansen, and Jon Orwant, published by O'Reilly (currently in its 3rd edition). After you read this tutorial, please complete practical exercises at the following link: http://www.ccl.net/cgi-bin/ccl/regexp/test_re.pl

Regular expressions are the UNIX way of specifying flexible text matches. I will simplify some things here, since the general form and possible variants would take a book to describe (and indeed there are books on this topic alone). The popular form of using the flexible match in Perl is:

 $some_string =~ /some_regular_expression/some_modifiers

where some_regular_expression is matched against $some_string. The matching can be additionally modified by some_modifiers. If the regular expression matches the string, the =~ relation is TRUE, otherwise it is FALSE. There is also a negated form of the relation (where = is replaced by !):

 $some_string !~ /some_regular_expression/some_modifiers

In this case the relation is TRUE if the regular expression does not match the string, and FALSE, if it matches. For example, if $some_string contains the text:
The spring is late this year
The relation:

  $some_string =~ /the/i

will be TRUE (the modifier i makes the match case insensitive). However, the relation:

  $some_string !~ /spring/i

will be FALSE due to the negated match relation !~. In this simple case, you can match the words, and this is what is done most often. But it is sometimes not enough. Let us say that $some_string contains text justified with (multiple) spaces and new lines, as in my famous poem The Spring and the Deeper Meaning of Life:

   The spring came late
      and this is our fate.
   The mood is not good
      and there is no food.
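
The two relations shown before the poem can be tried in a short script. This is only a minimal sketch: the sub names contains_the and lacks_spring are made up for this illustration and are not part of Perl.

```perl
use strict;
use warnings;

# 1 if the text contains "the" in any letter case, 0 otherwise
sub contains_the {
    my ($text) = @_;
    return $text =~ /the/i ? 1 : 0;
}

# 1 if the text does NOT contain "spring" in any letter case, 0 otherwise
sub lacks_spring {
    my ($text) = @_;
    return $text !~ /spring/i ? 1 : 0;
}

my $some_string = "The spring is late this year";
print "contains 'the': ", contains_the($some_string), "\n";   # 1
print "lacks 'spring': ", lacks_spring($some_string), "\n";   # 0
```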

If we wanted to check if some specific word (i.e., a series of alphanumeric characters surrounded by white space or punctuation marks) is present in the text, we would need something more sophisticated. For example, if we wanted to check if the string contains the word the, the regular expression:

  $some_string =~ /the/i

would also match the word there. We would find the word the in the string:

   Spring came late
      and this is our fate.
   Mood is not good
      and there is no food.

where it is not present. You could, of course, try the expression:

  $some_string =~ / the /i

i.e., put spaces around it. But this approach would fail for the word fate (it would not be found, since a period, rather than a space, follows it), or for the word late (not a space, but a NEWLINE character follows it). Here we enter the magic world of special symbols (called metacharacters) in regular expressions. There are a dozen of them: \, |, (, ), [, {, ^, $, *, +, ? and the period (.). I sometimes use the word metacharacter in a loose way, meaning: "a character or a sequence of characters that means something other than what is written". Check the links at the bottom of this page to find more.

You can find words in a number of ways:

  $some_string =~ /\Wthe\W/i

Here, the \W is a symbol saying: Match any "nonword" character (i.e., character other than a letter, digit or underscore). Equally good (or even better, since the previous example does not account for the beginning or the end of the string) we could do with:

  $some_string =~ /\bthe\b/i

Here, the \b means: Match a word boundary, i.e., a location rather than a character (it is also overloaded with the meaning Match the backspace character, i.e., the CTRL-H, but only inside a character class such as [\b]). We could also use the untidy:

  $some_string =~ /\sthe\s/i

i.e., match the word between two whitespace (\s) characters, but then we would not catch the words followed by punctuation marks, like a period or a comma (whitespace is ANY space character, i.e., new lines, TABs, form feeds, carriage returns, etc). We could also try to list all popular characters that are found before and after the word, like:

  $some_string =~ /[\s.;:?!]the[\s.;:?!]/i

but obviously the list is much longer (Note: the notation [abc] means: match a character that is either a or b or c, while the notation [^abc] means: match a character that is neither a nor b nor c). We could finally do:

  $some_string =~ /[^a-z]the[^a-z]/i

i.e., the word the surrounded by nonletters (the hyphen means a range of characters, e.g., [a-zA-Z0-9] means all lower and upper case letters and digits). Latin letters and digits inside a regular expression usually mean what they stand for (do not quote them with the backslash \, since doing so often changes them to some special symbols, as you saw above!). Many punctuation marks and other nonalphanumeric characters (but not all!!!), however, have a special meaning in regular expressions. They are called metacharacters. Moreover, some characters have a special meaning in Perl itself and you have to quote them, even when they do not have a special meaning in regular expressions (it is a small oversimplification, but it will have to do here). For example, if you look for an e-mail address that contains the @ character, you need to quote it as \@ since Perl uses this character to specify lists (arrays) of values [caveat: in some cases, e.g., when a regular expression is given as a Perl variable at execution time, the @ is not expanded into a list and keeps its literal meaning]. The regular expression:

  $some_string =~ /jlabanow@ccl/

in the Perl script would match the string "jlabanow@aol.com" or "jlabanowANYTHING" if the list @ccl was not defined or was empty in the Perl script (or even worse: if the list had some entries, your matching would be quite surprising and quite unpredictable, since this varies with Perl releases). Conclusion: Test it before you use it! I will list a few important examples of special characters or character sequences, but there are scores of them. You can learn about all of them by checking the appropriate man pages on a UNIX system on which Perl is installed, namely:
   man perlrequick      for a short tutorial
   man perlretut        for a longer tutorial
   man perlre           for a reference manual for Perl regexp
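
The advice about quoting the @ character can be checked with a sketch like the one below. The sub names and the sample strings are my own inventions for this illustration. Note the single quotes around the sample strings: they stop Perl from expanding @ccl and @aol there, too.

```perl
use strict;
use warnings;

# \@ keeps the at sign literal inside the pattern
sub mentions_jlabanow_at_ccl {
    my ($text) = @_;
    return $text =~ /jlabanow\@ccl/ ? 1 : 0;
}

# a pattern that arrives in a variable at execution time:
# the @ inside its value is not expanded into a list
sub matches_pattern_from_variable {
    my ($text, $pattern) = @_;
    return $text =~ /$pattern/ ? 1 : 0;
}

print mentions_jlabanow_at_ccl('jlabanow@ccl.net'), "\n";                       # 1
print mentions_jlabanow_at_ccl('jlabanow@aol.com'), "\n";                       # 0
print matches_pattern_from_variable('jlabanow@ccl.net', 'jlabanow@ccl'), "\n";  # 1
```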

Repeated pattern matching and greedy vs. non-greedy match

There are often situations when you need to match a series of repetitions of characters or character sets. For example, if you want to match valid decimal numbers (without exponent) like: -12.123 you could use the expression: /[0-9.+-]+/ (match digits, period, plus and minus). The + means one or more times. But obviously, such an expression would also match strings like: 1-234.2.4 that are not valid numbers. An expression like: /([+-]?[0-9]+\.?[0-9]*)|([+-]?[0-9]*\.?[0-9]+)/ or equivalent versions like: /([+-]?\d+\.?\d*)|([+-]?\d*\.?\d+)/ or /((\+|-)?\d+\.?\d*)|((\+|-)?\d*\.?\d+)/ would match only valid numbers like: +123, +123., .123, 1.23, +1.23, etc. (Since these expressions are not anchored, they will still find a valid number inside a longer string like 1-234.2.4; to test that the whole string is a number, anchor the expression with the ^ and $ described later.) Here, the ? means: zero or one time, * means: zero or more times, the | denotes the alternative, parentheses enclose groupings of characters and the period ., being a metacharacter, needs to be protected with a backslash to retain its literal meaning. Note that metacharacters do not have to be backslashed (usually) within the square brackets (i.e., character classes). Note also that to retain the original meaning of the minus sign - it needs to be specified as the first or the last character within square brackets (or quoted with a backslash), or it would be interpreted as a range.
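
To check whether a whole string is a valid decimal number, one possible approach is to anchor the expression above with ^ and $. A minimal sketch follows; the sub name is_decimal_number is hypothetical, and the outer parentheses are added so that the anchors apply to both alternatives.

```perl
use strict;
use warnings;

# ^ and $ tie the match to the whole string; the outer parentheses
# make both anchors apply to each of the two alternatives
sub is_decimal_number {
    my ($text) = @_;
    return $text =~ /^([+-]?\d+\.?\d*|[+-]?\d*\.?\d+)$/ ? 1 : 0;
}

for my $candidate ('-12.123', '+123', '+123.', '.123', '1.23', '1-234.2.4', '.') {
    printf "%-10s %s\n", $candidate,
        is_decimal_number($candidate) ? 'number' : 'not a number';
}
```

Only 1-234.2.4 and the lone period are reported as not a number.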

We often have situations when it is important that the matched string is the shortest or the longest possible. The expression /Bo+/ in the string Booting can match either Bo or Boo. By default, the repetition operators (+, *, ? and {n,m}) are greedy, i.e., they will try to match the longest possible string. In the example above, the expression /Bo+/ will match Boo. To make them non-greedy (i.e., to make them match the shortest possible string), follow them by a ?. The expression /Bo+?/ will match the Bo. While the greediness of the regular expression is mostly important when patterns are used for substitutions, it can also be important in searching. For example, if you search a valid HTML document that starts with the <html> tag and ends with the </html> tag, the expression /<.+>/s will match the whole HTML document, while the expression /<.+?>/s will match only the opening <html> tag.
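
The difference between greedy and non-greedy matching is easiest to see when you print what was actually matched. The sketch below assumes a helper sub, matched_text (my invention), that wraps the pattern in parentheses and returns the captured piece through the Perl variable $1, which is not otherwise covered in this tutorial. The tiny HTML document is also made up.

```perl
use strict;
use warnings;

# return the piece of $text matched by $pattern (undef if nothing matches);
# the added parentheses capture the matched piece into $1
sub matched_text {
    my ($text, $pattern) = @_;
    return $text =~ /($pattern)/s ? $1 : undef;
}

my $html = "<html>\n<body>Spring came late</body>\n</html>";

print matched_text('Booting', 'Bo+'),  "\n";   # Boo  (greedy)
print matched_text('Booting', 'Bo+?'), "\n";   # Bo   (non-greedy)
print matched_text($html, '<.+?>'),    "\n";   # <html>
print "greedy match is the whole document\n"
    if matched_text($html, '<.+>') eq $html;
```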

Popular metacharacters and special character sequences

Find examples of popular metacharacters below. They are essential for flexible pattern matching, and hopefully, after analyzing my poetry presented earlier, you will cherish their usefulness.

NEWLINE
This is a mess... The vendors of operating systems worked hard to make text files from one system look like junk on the other, so you are stuck. Of course, you are not stuck, you are only punished by their greed (or egos, as the case may be). The Internet standards and Microsoft use two ASCII characters to mark the end of the line: CTRL-M followed by CTRL-J. UNIX uses CTRL-J, and the classic Mac (before Mac OS X) uses CTRL-M. The CTRL-M (carriage return, \r) is octal 15, decimal 13, and hex 0D. The CTRL-J (line feed, \n) is octal 12, decimal 10, and hex 0A. UNIX often automatically converts the text from files with the UNIX newline convention (\n) to the Internet/MS-DOS convention (\r\n) before feeding it to electronic mail or serving web pages. If you look for the new lines in regular expressions, use $ and the m modifier, or be a guru and search for (\r?\n|\r). In fact, if you made your data file on a PC under Windows or on the Mac and you want to feed it to a program on UNIX, check if the NEWLINEs adhere to the UNIX convention, or your input can be rejected. Use the dos2unix command or write a Perl script with the $my_data =~ s/(\r?\n|\r)/\n/g substitution, but check with the od -c myfile command first to see if you really have a problem.
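
The substitution mentioned above can be wrapped in a small sub. This is a sketch only: to_unix_newlines is a hypothetical name, the sample text is made up, and I assume the script runs on a system where \n is the line feed.

```perl
use strict;
use warnings;

# convert DOS (\r\n) and classic Mac (\r) line ends to the UNIX convention (\n)
sub to_unix_newlines {
    my ($text) = @_;
    $text =~ s/(\r?\n|\r)/\n/g;
    return $text;
}

my $mixed = "dos line\r\nunix line\nmac line\rlast line";
my $clean = to_unix_newlines($mixed);
my $count = () = $clean =~ /\n/g;   # count the new line characters
print "new lines after conversion: $count\n";                                  # 3
print "carriage returns left: ", ($clean =~ /\r/ ? "yes" : "no"), "\n";         # no
```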
 
/
the forward slash is not really a special character, but it is used as a default character for delimiting regular expressions (i.e., marking the start and the end of the regular expression). For this reason it has to be quoted with a backslash within most regular expressions, so that Perl is not confused about where the regular expression starts and ends. You can use a special syntax and specify some other character as a delimiter for regular expressions, but usually, we just quote the slash with a backslash to recover its natural meaning within the regular expression. So to find the local subdirectory in the string, say: "/usr/local/bin" you would use:
    $a_string =~ /\/local\//;
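
The special syntax for other delimiters is the m operator followed by a delimiter of your choice, for example m{...}. The sketch below compares it with the backslash-quoted version; the two sub names are hypothetical.

```perl
use strict;
use warnings;

# default delimiters: every slash inside the pattern must be quoted
sub has_local_quoted_slashes {
    my ($path) = @_;
    return $path =~ /\/local\// ? 1 : 0;
}

# m{...} uses braces as delimiters, so the slashes need no quoting
sub has_local_with_braces {
    my ($path) = @_;
    return $path =~ m{/local/} ? 1 : 0;
}

print has_local_quoted_slashes("/usr/local/bin"), "\n";   # 1
print has_local_with_braces("/usr/local/bin"), "\n";      # 1
print has_local_with_braces("/usr/locale/bin"), "\n";     # 0
```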
 
\
the backslash is used for quoting. Many characters have special meaning within regular expressions. When you want to match the original character, you quote it with a backslash, like in the UNIX shell. E.g., to find a period (which is a metacharacter) you would look for \. rather than a bare period. To match the backslash in the text give it as \\. But do not quote the letters and digits, since it would usually assign special meaning to them. For example: \A and \z match the beginning, and the end of the string, respectively (always, irrespective of the m and s modifiers), while \1, \2, etc. match again the text that was matched by the first, second, etc. piece of the regular expression enclosed in parentheses (e.g., the expression /\b(\w+)\s+\1/ will match the repeated word in the text, while the /(\d+)\1/ will match the repeated digit or a series of digits in a number, and if you change it to /(\d+)\1+/ it will match the whole sequence of repetitions. If you are a genomics person, and you want to find the initiation codon followed by methionine(s) you would search for /(AUG)\1+/i, but it does not make sense, does it?). The backslash is also used to specify character codes in the regular expression. The \nnn specifies an octal code for a character, while the \xNN a hexadecimal code. For example: \141 and \x61 represent the code for lowercase a. Some popular character codes have escape sequences assigned for convenience:
\a -- bell char, BEL, CTRL-G (\007)
\b -- backspace, BS, CTRL-H (\010), but only inside square brackets, e.g., [\b]
\t -- horizontal tab, HT, CTRL-I (\011)
\n -- line feed, LF, CTRL-J (\012)
\f -- form feed, FF, CTRL-L (\014)
\r -- carriage return, CR, CTRL-M (\015)
\e -- ASCII escape, ESC, CTRL-[ (\033).
Other control characters can be entered as \cX, e.g., the line feed, CTRL-J, can be entered also as \cJ, while the ESC code as \c[.
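
A few of the backslash sequences described above can be tried out as follows. This is a sketch under my own assumptions: the sub names are invented, and the script is assumed to run on a system where \n is the line feed.

```perl
use strict;
use warnings;

# \1 must repeat exactly what the first pair of parentheses matched
sub has_repeated_word {
    my ($text) = @_;
    return $text =~ /\b(\w+)\s+\1/ ? 1 : 0;
}

# the octal code \141 and the hex code \x61 both stand for lowercase a
sub is_lowercase_a {
    my ($char) = @_;
    return ($char =~ /\A\141\z/ && $char =~ /\A\x61\z/) ? 1 : 0;
}

# \cJ (CTRL-J), \n and \012 all stand for the line feed
sub is_line_feed {
    my ($char) = @_;
    return ($char =~ /\A\cJ\z/ && $char =~ /\A\n\z/ && $char =~ /\A\012\z/) ? 1 : 0;
}

print has_repeated_word("the spring is is late"), "\n";   # 1
print has_repeated_word("the spring is late"), "\n";      # 0
print is_lowercase_a("a"), "\n";                          # 1
print is_line_feed("\n"), "\n";                           # 1
```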
 
.
period matches any single character. This is not that simple however, when the string contains new line characters. There are two modifiers (the stuff that follows the closing slash / of the regular expression), namely: m and s, that affect how NEWLINEs are treated. They are independent of each other, and neither of them is on by default. The s modifier tells Perl to treat the string as a single line: the period then matches a NEWLINE character as well (we lied to Perl that there are no NEWLINEs). Without the s modifier, the period matches any character except the NEWLINE. The m modifier tells Perl that the string has multiple lines of text: the NEWLINEs then denote ends of lines -- special spots in the text -- and the metacharacters ^ (beginning of string) and $ (end of string) refer not only to the real beginning and end of the string, but also to spots just after the NEWLINE, and just before the NEWLINE, respectively. The m modifier does not change the meaning of the period, and the s modifier does not change the meaning of ^ and $. When you want to match a period verbatim, you have to quote it as \. with a backslash. For example, the /B.T/si will match "BLT", "100 MBits", "Ubot sunk", "The b.tch is a dog too.", and even "Rob\nTom" (where \n denotes a new line), while the expression /B.T/m will not match the "RoB\nTom". The expressions /B\.T/s and /B\.T/m will only match a string like "whateverB.Twhatever".
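
How the period reacts to the s and m modifiers can be checked with the sketch below (the sub names are hypothetical). In Perl, only the s modifier lets the period match a NEWLINE; with no modifier, or with m alone, the period stops at it.

```perl
use strict;
use warnings;

sub dot_default { my ($text) = @_; return $text =~ /B.T/  ? 1 : 0; }  # . stops at NEWLINE
sub dot_s       { my ($text) = @_; return $text =~ /B.T/s ? 1 : 0; }  # . matches NEWLINE too
sub dot_m       { my ($text) = @_; return $text =~ /B.T/m ? 1 : 0; }  # m leaves . unchanged
sub literal_dot { my ($text) = @_; return $text =~ /B\.T/ ? 1 : 0; }  # \. is a real period

for my $text ("BLT", "RoB\nTom", "whateverB.Twhatever") {
    (my $shown = $text) =~ s/\n/\\n/g;   # show the new line as \n when printing
    printf "%-22s default=%d s=%d m=%d literal=%d\n",
        $shown, dot_default($text), dot_s($text), dot_m($text), literal_dot($text);
}
```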
 
[ and ]
square brackets are used to specify character lists (as in examples above) for matching. You need to quote them with a backslash (i.e., as \[ or \]) to have them match themselves in the regexp. Many special characters (metacharacters) lose their meaning within the brackets. The period ., alternative |, parentheses ( or ), braces { or }, asterisk *, question mark ? and plus + stand for their literal meaning. The escaped letters that denote single characters (e.g., whitespace \s, nonwhitespace \S, digit \d, nondigit \D, word char \w, nonword char \W, line feed \n, carriage return \r, control char \cX, hex code \xNN, or octal code \nnn) retain their special meaning. Two characters get special meaning: ^ after opening [ means negation while - used between two characters denotes a character range. Usually, you do not need to quote special characters within the brackets, e.g., the period . is just a real period (but it is usually safe to quote special characters -- when in doubt, quote everything besides letters and digits). You need to quote some characters, however. Obviously, you need to quote square brackets themselves, if you want them to stand for themselves, or Perl would get confused about where the character list starts and ends. The ^ (the caret) as the first character after the opening bracket [ means do not match those that follow me, e.g., /[^0-9]+/ means: match one or more nondigit characters. The - (hyphen) between two characters means a range of character codes, but you really would have to look at the ASCII table to know what the codes are (under UNIX, just type: man ascii to list character codes). It is safe though to use it with letters or digits (e.g., /[a-g]/ will match all notes in the C major scale in English (but not in German where b is h), while [0-7] represents all digits in the octal notation). When you put - before the closing bracket, or quote it with a backslash, it has its natural meaning.
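
The rules for character lists can be tried with a few one-line subs. This is only a sketch; all sub names and sample strings are made up for the illustration.

```perl
use strict;
use warnings;

sub is_octal_number { my ($t) = @_; return $t =~ /^[0-7]+$/ ? 1 : 0; }  # range of digits
sub has_nondigit    { my ($t) = @_; return $t =~ /[^0-9]/   ? 1 : 0; }  # ^ negates the list
sub has_dot_or_sign { my ($t) = @_; return $t =~ /[.+-]/    ? 1 : 0; }  # . and + are literal, - is last
sub has_bracket     { my ($t) = @_; return $t =~ /[\[\]]/   ? 1 : 0; }  # brackets must be quoted

print is_octal_number("0755"), is_octal_number("0785"), "\n";   # 10
print has_nondigit("12a"),     has_nondigit("123"),     "\n";   # 10
print has_dot_or_sign("a.b"),  has_dot_or_sign("axb"),  "\n";   # 10
print has_bracket("a[1]"),     has_bracket("a(1)"),     "\n";   # 10
```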
 
|
denotes alternative. The expression /a|b|c/ means: match a or b or c. For example, it will match "Matt", "D'Ambrosia", and "Jean-Christophe" but will not match "Anthony" (but /a|b|c/i would). This notation is slightly confusing when you match alternatives that are longer than one character. It is probably best to enclose them in parentheses that make groupings atomic. For example, expressions: /apples|oranges/, /(apples|oranges)/, /(apples)|(oranges)/ or /((apples)|(oranges))/ all match the "oranges" in the string "Apples and oranges" while the expression /apple(s)|(r)/i would match "Apples" in the same string.
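
Why the parentheses matter with longer alternatives shows up as soon as you add the anchors ^ and $ (described below). In the sketch that follows, with hypothetical sub names, /^apples|oranges$/ reads as "apples at the start OR oranges at the end", which is probably not what was intended.

```perl
use strict;
use warnings;

sub has_abc        { my ($t) = @_; return $t =~ /a|b|c/  ? 1 : 0; }
sub has_abc_nocase { my ($t) = @_; return $t =~ /a|b|c/i ? 1 : 0; }

# without parentheses, each anchor belongs to one alternative only
sub loose_fruit  { my ($t) = @_; return $t =~ /^apples|oranges$/   ? 1 : 0; }
# with parentheses, both anchors apply to both alternatives
sub strict_fruit { my ($t) = @_; return $t =~ /^(apples|oranges)$/ ? 1 : 0; }

print has_abc("Anthony"), has_abc_nocase("Anthony"), "\n";                     # 01
print loose_fruit("apples and pears"), strict_fruit("apples and pears"), "\n"; # 10
print strict_fruit("oranges"), "\n";                                           # 1
```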
 
( and )
parentheses mark groupings. For example: /(Frod|Drog|Bilb)o/ would match "Frodo", "Drogo", or "Bilbo". If you want to look for verbatim parentheses you need to quote them with a backslash (i.e., write them as \( or \) in the expression). The expression /(CO)/ would match strings: "CO", "ACORN", "Fe(CO)6" but not "C0" (with a zero instead of O). Expression /\(CO\)/ would only match the "Fe(CO)6". The parentheses make a group of characters atomic, i.e., behave like a whole. For example: the /bo+/ will match the string "booo", while /(bo)+/ will match "bobobobo". Parentheses also mark the backreferences to which you can refer in the regular expression or in the replacement string (that we do not discuss here). For references, you count the opening parentheses from the left as 1, 2, 3, ... and refer to the content that matched them as \1, \2, \3..., for example: /([a-z]+)([0-9]+)\1\2/ would match "_abc123abc123_" and "#a1a1#", but not "#a1a2#". You can also nest them: /(([a-z]+)([0-9]+))\2\1\3/ will match "#ab123abab123123#".
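
The grouping and backreference examples above can be verified with a sketch like this one. The sub names are my own, and in the first two subs I added the anchors ^ and $ (described below) so that the difference between /bo+/ and /(bo)+/ becomes visible.

```perl
use strict;
use warnings;

sub is_b_and_many_os { my ($t) = @_; return $t =~ /^bo+$/   ? 1 : 0; }  # + repeats only the o
sub is_repeated_bo   { my ($t) = @_; return $t =~ /^(bo)+$/ ? 1 : 0; }  # + repeats the group bo

# \1 and \2 repeat what the first and second pair of parentheses matched
sub repeats_both   { my ($t) = @_; return $t =~ /([a-z]+)([0-9]+)\1\2/     ? 1 : 0; }
# nested groups are numbered by their opening parentheses, from the left
sub repeats_nested { my ($t) = @_; return $t =~ /(([a-z]+)([0-9]+))\2\1\3/ ? 1 : 0; }

print is_b_and_many_os("booo"), is_b_and_many_os("bobobobo"), "\n";   # 10
print is_repeated_bo("booo"),   is_repeated_bo("bobobobo"),   "\n";   # 01
print repeats_both("#a1a1#"),   repeats_both("#a1a2#"),       "\n";   # 10
print repeats_nested("#ab123abab123123#"), "\n";                      # 1
```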
 
?
means match zero or one occurrence of a character (or a group in parentheses), e.g., /Bo?t/ will match Bt and Bot, but not Boot, or But. The /Many (thanks )?/ will match "Many ", "Many x", "Many thanks ", "Many thanks thanks " and "Many thanks x" but not "Many". If you want to look for a question mark, quote it as \? in your regexp. For example, the /Ab\?/ will match "Ab?" but will not match the "Ab".
 
*
means zero or more times. E.g., /Bo*t/ will match "Bt", "Bot", and even "Booooooooot". You need a backslash quote \* to look for a plain asterisk. The /2\**5/ will match "1256", "12*52", "2**573" and even "---2********5----" while the /2*5/ will match "5", and "25", "12256", etc. In a string that contains new line characters, the expressions /.*/ and /.*/m will match the first line (without the ending new line character), while the expression /.*/s will always match the whole damned string, even when it is empty!!!
 
+
means: at least once. For example, the expression /Bo+t/ will match "Bot", "Boot", and even "Boooooooot". Quote the \+ with a backslash to look for a plain + in the text. The /.+/ and /.+/m will not match a string containing only new line characters, while /.+/s will match a string that contains only a single new line character and nothing else. Neither /.+/ nor /.+/m nor /.+/s will match the empty string, though.
 
{n} or {n,m}
specifies how many repetitions to match. For example: /Bo{2}t/ will match only Boot, while /Bo{1,2}/ will match "UBot" or "My Boots". There is also a /Bo{2,}/ which means 2 or more times, e.g., "The Boot failed", "Uh... Booot...", "Boooot", etc.
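
The repetition operators ?, *, + and {n} can be compared side by side. The sketch below uses a hypothetical sub, matching_patterns, that lists which of five patterns match a given string (I added the closing t to the {2,} pattern so that all five are comparable).

```perl
use strict;
use warnings;

# return the patterns (as text, separated by spaces) that match $text
sub matching_patterns {
    my ($text) = @_;
    my @hits;
    push @hits, 'Bo?t'    if $text =~ /Bo?t/;      # zero or one o
    push @hits, 'Bo*t'    if $text =~ /Bo*t/;      # zero or more o
    push @hits, 'Bo+t'    if $text =~ /Bo+t/;      # one or more o
    push @hits, 'Bo{2}t'  if $text =~ /Bo{2}t/;    # exactly two o
    push @hits, 'Bo{2,}t' if $text =~ /Bo{2,}t/;   # two or more o
    return join ' ', @hits;
}

for my $text ('Bt', 'Bot', 'Boot', 'Booot') {
    printf "%-6s matched by: %s\n", $text, matching_patterns($text);
}
```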
 
^
the meaning depends on the m modifier. Without m (with or without the s modifier), the caret matches only the beginning of the string. If m is used, the caret matches the beginning of the string and the spot after each new line character (if it is present in the string). For example, the /^The/is will match "the dog" and "the dog\n" (\n denotes a new line character) but will match neither "\nthe dog" nor "my cat likes\nthe dog", while /^The/im will match all four: "the dog", "\nthe dog", "the dog\n" and "my cat likes\nthe dog".
 
$
Dollar sign, as an anchor, is used mostly at the end of the regular expression (if you look for $ itself, quote it with the backslash). Since the $ also marks the beginning of a variable in Perl, you have to be careful when you put it in the middle of the regular expression, as Perl would try to replace it with the variable (scalar, as they call it). Therefore, in the expression /Many $s/ Perl will try to put the value of $s into the regular expression, and if it does not exist, it will put nothing there, and the above expression can match strings like: "Many " or "Many things" (under use strict, an undeclared $s is a compile-time error instead). Like the caret, it is interpreted differently depending on the m modifier. Without m (with or without the s modifier), $ matches at the end of the string, and also just before a new line that ends the string. If m is used, it matches the spot before each new line and the end of the string. For example: /time$/s will match "It's time" and "It's time\n" but will not match "It's time\nto go home" while /time$/m will match all three strings.
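
The effect of the m modifier on ^ and $ can be checked with the sketch below. The sub names are hypothetical. In Perl it is the m modifier alone that switches the behavior of the two anchors; the s modifier does not affect them.

```perl
use strict;
use warnings;

sub string_starts_with_the { my ($t) = @_; return $t =~ /^The/i  ? 1 : 0; }  # start of string only
sub a_line_starts_with_the { my ($t) = @_; return $t =~ /^The/im ? 1 : 0; }  # start of any line
sub string_ends_with_time  { my ($t) = @_; return $t =~ /time$/  ? 1 : 0; }  # end of string only
sub a_line_ends_with_time  { my ($t) = @_; return $t =~ /time$/m ? 1 : 0; }  # end of any line

my $cat  = "my cat likes\nthe dog";
my $home = "It's time\nto go home";

print string_starts_with_the($cat),  a_line_starts_with_the($cat), "\n";   # 01
print string_ends_with_time($home),  a_line_ends_with_time($home), "\n";   # 01
print string_ends_with_time("It's time\n"), "\n";                          # 1
```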
 

The letters quoted with a backslash often have special meaning. For this reason, avoid quoting letters, unless you know what you are doing. I will give here only a few examples. Check the URLs below, and the Perl man pages given earlier to learn all of them.

\d
Match a digit. It is a shortcut for [0-9].
 
\D
Match a nondigit character. It is a shortcut for [^0-9].
 
\s
Match any whitespace character (space, tab, newline, form feed, carriage return, etc).
 
\S
Match a nonwhite space character (i.e., a character that uses ink on your printer).
 
\w
Match any word character (i.e., a letter, a digit, or an underscore -- now you know that the Perl is for programmers since these entities represent valid characters in the variable name). The \w can be replaced with [a-zA-Z0-9_] if you want to make your regular expression look fancy.
 
\W
Match any nonword character (i.e., anything that is not a letter, a digit, or an underscore).
 
\b
Match a virtual boundary between the word and the nonword character that precedes or follows it. Why is this needed when we have both \w and \W? Because matches are used for substitutions, and then it is important that the matched piece of text does not include the surrounding space or punctuation. It is also convenient in cases when we want to match the whole word rather than a piece of it, as in the example given earlier. Incidentally, inside square brackets (e.g., [\b]) the same \b stands for the backspace character.
 
\B
By now, you should suspect that it is the opposite of \b, i.e., it does not match the positions around the word. So the /\B\./ will find a period that does not follow the word, like in "You do not put period after space  . like this".
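
The word boundary can be tried on the version of my poem that does not contain the word the. A minimal sketch, with hypothetical sub names:

```perl
use strict;
use warnings;

sub has_the_anywhere        { my ($t) = @_; return $t =~ /the/i       ? 1 : 0; }
sub has_the_as_word         { my ($t) = @_; return $t =~ /\bthe\b/i   ? 1 : 0; }
sub has_the_between_nonword { my ($t) = @_; return $t =~ /\Wthe\W/i   ? 1 : 0; }
sub has_dot_after_nonword   { my ($t) = @_; return $t =~ /\B\./       ? 1 : 0; }

my $poem = join "\n",
    "   Spring came late",
    "      and this is our fate.",
    "   Mood is not good",
    "      and there is no food.";

print has_the_anywhere($poem), has_the_as_word($poem), "\n";          # 10 (only "there" is present)
# \W needs a real character on each side, so it fails at the start of the string
print has_the_as_word("The end"), has_the_between_nonword("The end"), "\n";   # 10
print has_dot_after_nonword("after space  . like this"), "\n";        # 1
print has_dot_after_nonword("like this."), "\n";                      # 0
```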
 
\n
Matches a newline character. The problem with this is that on a UNIX machine the newline is a New Line Character (NL, CTRL-J), on the classic Mac it is a Carriage Return Character (CR, CTRL-M), and under DOS/Windows, Web, Email, etc, it is a two-character sequence of CR followed by NL. We are impatiently waiting for the discovery of the NL CR sequence for the new line character to make text files even more incompatible. Of course, we could always match the new line as /\r?\n?\r?/ but then we would also match multiple empty lines coming from DOS.
 
\r
Match the Carriage Return character (usually CR, i.e., CTRL-M, but the classic Macs are special, and there it matches NL, that is CTRL-J).
 
\A
Matches only the beginning of the string/text, irrespective of the m and s modifiers.
 
\z
Matches only the end of the string/text, irrespective of the m and s modifiers.
 
\Z
Matches before a NEWLINE character that ends the string/text, or at the very end of the string/text, irrespective of the m and s modifiers.
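
The difference between $, \Z and \z (and between ^ and \A) is easiest to see when the m modifier is switched on, since m changes only ^ and $. A sketch with hypothetical sub names:

```perl
use strict;
use warnings;

sub dollar_m  { my ($t) = @_; return $t =~ /time$/m  ? 1 : 0; }  # before any new line, or at the end
sub big_z_m   { my ($t) = @_; return $t =~ /time\Z/m ? 1 : 0; }  # before a final new line, or at the end
sub small_z_m { my ($t) = @_; return $t =~ /time\z/m ? 1 : 0; }  # at the very end only
sub caret_m   { my ($t) = @_; return $t =~ /^The/im  ? 1 : 0; }  # start of any line
sub big_a_m   { my ($t) = @_; return $t =~ /\AThe/im ? 1 : 0; }  # start of the string only

my $home = "It's time\nto go home";
print dollar_m($home), big_z_m($home), small_z_m($home), "\n";            # 100
print dollar_m("It's time\n"), big_z_m("It's time\n"),
      small_z_m("It's time\n"), "\n";                                     # 110
print caret_m("my cat likes\nthe dog"), big_a_m("my cat likes\nthe dog"), "\n";   # 10
```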
 

There are also other quoted letters, and special quantifiers, etc. Describing them would take a lot of space, and basically, if someone needs to use them, he/she has to go through the boot camp of the Camel Book first. After you read this tutorial, please complete practical exercises at the following link: http://www.ccl.net/cgi-bin/ccl/regexp/test_re.pl

Check the links:
http://www.troubleshooters.com/codecorn/littperl/perlreg.htm
http://virtual.park.uga.edu/humcomp/perl/regex2a.html
http://www.comp.leeds.ac.uk/Perl/matching.html
and study.

Jan K. Labanowski