
Crash course in regular expressions

So we were discussing regular expressions and how they can be completely unreadable, but I really love this language. Mostly because it is a single-purpose language, made for string matching, and it serves that purpose extremely well. Here's a crash course in regular expressions.

Pattern matching

Consider the following example. The pattern matches every place where its text occurs in the string.

# match a substring
String: Uncle is my mother's brother
Pattern: other

It will match "other" twice in the sentence, once in "mother" and once in "brother". That's easy enough but not very useful. \w means word character, \s means whitespace, and + means one or more of the preceding element. What do we write if we would like to match words?

# match words
String: The quick brown fox jumps over the lazy dog
Pattern: \w+

This makes a very inefficient string splitter.
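Here is a minimal sketch of how the two patterns above could be run in C#, assuming .NET's System.Text.RegularExpressions and a project that allows top-level statements. The variable names are my own and only there for illustration.

```csharp
using System;
using System.Linq;
using System.Text.RegularExpressions;

// Literal pattern: finds every occurrence of the text "other".
var literalMatches = Regex.Matches("Uncle is my mother's brother", "other");
Console.WriteLine(literalMatches.Count); // prints 2

// \w+ matches each run of one or more word characters, i.e. each word.
var words = Regex.Matches("The quick brown fox jumps over the lazy dog", @"\w+")
    .Cast<Match>()
    .Select(m => m.Value)
    .ToArray();
Console.WriteLine(string.Join("|", words));
```

If all you need is the words, string.Split is the simpler tool. The regex version earns its keep once the definition of a "word" gets more complicated.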

# match one or more word characters or dashes
String: Lars-Åke Bengtsson
Pattern: [\w-]+
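A small sketch of the character class in C#, assuming the default .NET options where \w is Unicode-aware (so "Å" counts as a word character). The names array is just an illustrative variable.

```csharp
using System;
using System.Linq;
using System.Text.RegularExpressions;

// [\w-]+ : one or more characters that are either a word character or a dash.
var names = Regex.Matches("Lars-Åke Bengtsson", @"[\w-]+")
    .Cast<Match>()
    .Select(m => m.Value)
    .ToArray();

// The space is neither a word character nor a dash, so it separates the matches.
Console.WriteLine(string.Join(", ", names));
```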

The beginning of the string is matched with ^ and the end of the string with $. You can also use a|b to match a or b.

# match first and last word of string
String: Luke, I am your father
Pattern: ^\w+|\w+$

Watch out, because $ can mean end of string or end of line, depending on the options you pass to the regex parser.
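To make the difference concrete, here is a sketch in C# that runs the same pattern with and without RegexOptions.Multiline, which is the .NET option that switches ^ and $ to per-line anchors. The two-line input string is made up for the illustration.

```csharp
using System;
using System.Linq;
using System.Text.RegularExpressions;

var pattern = @"^\w+|\w+$";

// Default options: ^ and $ anchor to the start and end of the whole string.
var wholeString = Regex.Matches("Luke, I am your father", pattern)
    .Cast<Match>().Select(m => m.Value).ToArray();

// RegexOptions.Multiline: ^ and $ also match at the start and end of each line.
var perLine = Regex.Matches("Luke, I am\nyour father", pattern, RegexOptions.Multiline)
    .Cast<Match>().Select(m => m.Value).ToArray();

Console.WriteLine(string.Join(", ", wholeString)); // prints "Luke, father"
Console.WriteLine(string.Join(", ", perLine));     // prints "Luke, am, your, father"
```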

Match groups

A match group captures a part of the match that you would like to pick out separately from the rest of the match.

# Distinguish element name
String: <b>I can haz hatz!</b>
Pattern: </?(\w+)>
Matching group #1: b (captured from both <b> and </b>)

The question mark `?` indicates that the preceding character could exist in the match, but does not have to. We can name the matching groups like this.
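As a sketch of how a numbered group could be read in C#, assuming the standard Match.Groups indexer where index 0 is the whole match and index 1 is the first pair of parentheses:

```csharp
using System;
using System.Text.RegularExpressions;

// /? makes the slash optional, so the pattern matches both <b> and </b>.
// The parentheses capture the element name as group 1.
var tags = Regex.Matches("<b>I can haz hatz!</b>", @"</?(\w+)>");

foreach (Match tag in tags)
{
    // tag.Value is the whole match, tag.Groups[1].Value only the element name.
    Console.WriteLine("{0} -> {1}", tag.Value, tag.Groups[1].Value);
}
```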

# get parts of an e-mail address
String: spam@litemedia.se
Pattern: (?<username>.+)@(?<server>.+)
Matching group #username: spam
Matching group #server: litemedia.se

The dot `.` matches any character (except a newline, by default). `.+` means match any character at least once.
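A minimal sketch of reading named groups in C#, assuming the address spam@litemedia.se from the example. The variable names are illustrative only.

```csharp
using System;
using System.Text.RegularExpressions;

var email = Regex.Match("spam@litemedia.se", @"(?<username>.+)@(?<server>.+)");

// Named groups are read by name through the Groups indexer.
var username = email.Groups["username"].Value;
var server = email.Groups["server"].Value;

Console.WriteLine("{0} at {1}", username, server); // prints "spam at litemedia.se"
```

This is only meant to show named groups. A pattern this loose accepts a lot of strings that are not valid e-mail addresses.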

Lazy and greedy

Consider the following match where we would like to find the format string in the expression.

# Get the first argument, format string in the expression
String: string.Format("I would like some {0}", "Bananas");
Pattern: ".*"

The pattern matches everything up to the last quote, whereas we would only like it to match up to the quote that closes the first argument. This is because * is greedy by default. We can change it to lazy with a question mark.

# Get the first argument, format string in the expression
String: string.Format("I would like some {0}", "Bananas");
Pattern: ".*?"
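Side by side, the greedy and the lazy version could look like this in C#. It is a sketch using Regex.Match, which returns only the first match, and the variable names are mine.

```csharp
using System;
using System.Text.RegularExpressions;

var code = "string.Format(\"I would like some {0}\", \"Bananas\");";

// Greedy: .* grabs as much as possible, so the match runs to the last quote.
var greedy = Regex.Match(code, "\".*\"").Value;

// Lazy: .*? grabs as little as possible, so the match stops at the next quote.
var lazy = Regex.Match(code, "\".*?\"").Value;

Console.WriteLine(greedy); // prints "I would like some {0}", "Bananas"
Console.WriteLine(lazy);   // prints "I would like some {0}"
```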

Look-ahead and look-behind

You can match things that come before or after another expression, without including that expression in the match itself.

# Find first letter of sentence (positive look-behind)
String: Tree you are. Moss you are. You are violets with wind above them.
Pattern: (?<=^|\.\s*)\w

This matches a word character that comes first in the string or after a dot and any whitespace.
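Here is a sketch of the look-behind in C#. Note that the \s* makes the look-behind variable in length, which .NET accepts. If you move the pattern to another regex engine, treat that as an assumption to check, since several engines only allow fixed-length look-behind.

```csharp
using System;
using System.Linq;
using System.Text.RegularExpressions;

var sentence = "Tree you are. Moss you are. You are violets with wind above them.";

// (?<=...) only checks what precedes the match; it is not part of the match.
var initials = Regex.Matches(sentence, @"(?<=^|\.\s*)\w")
    .Cast<Match>()
    .Select(m => m.Value)
    .ToArray();

Console.WriteLine(string.Join(" ", initials)); // prints "T M Y"
```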

# Any digit not followed by a 1 (negative look-ahead)
String: 11011001
Pattern: \d(?!1)
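The negative look-ahead is easier to follow when you see which positions match. This sketch collects Match.Index (zero-based) for every hit. The positions variable is just for illustration.

```csharp
using System;
using System.Linq;
using System.Text.RegularExpressions;

// \d(?!1) : a digit, as long as the next character is not a 1.
// The last digit also matches, because nothing at all follows it.
var positions = Regex.Matches("11011001", @"\d(?!1)")
    .Cast<Match>()
    .Select(m => m.Index)
    .ToArray();

Console.WriteLine(string.Join(",", positions)); // prints "1,4,5,7"
```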

Backreference

You can refer back to a previously defined group.

# Match content within tags
String: <q>Hardware: The parts of a computer </system> that can be kicked.</q>
Pattern: (?<=<(?<el>\w+)>).*?(?=</\k<el>)
Matching group #el: q

The content should be preceded by an opening tag and followed by a closing tag with the same element name.
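A quick sketch of the backreference at work in C#, showing that the stray </system> does not end the match because \k<el> insists on the name captured by the opening tag. It relies on .NET keeping groups captured inside a look-behind, and the variable names are mine.

```csharp
using System;
using System.Text.RegularExpressions;

var quote = "<q>Hardware: The parts of a computer </system> that can be kicked.</q>";

// \k<el> must match exactly what the named group "el" captured, here "q".
var content = Regex.Match(quote, @"(?<=<(?<el>\w+)>).*?(?=</\k<el>)");

Console.WriteLine(content.Groups["el"].Value); // prints "q"
Console.WriteLine(content.Value);
```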

Usage in C#

This is how you would use a regular expression in C#.

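using System;
using System.Text.RegularExpressions;

// Match the content between an opening tag and the closing tag with the same name.
var expression = new Regex(@"(?<=<(?<el>\w+)>).*?(?=</\k<el>)");
var data = "<li>Tulip</li><li>Lily</li><li>Duffydil</li>";

// Matches returns every non-overlapping match in the input.
var matches = expression.Matches(data);
foreach (Match match in matches)
{
    Console.WriteLine("Flower: {0}", match.Value);
}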

This will print out all the flower names to the console window. One of the best resources for regular expressions I've found is regular-expressions.info. Now, go along and have fun!
