Crash course in regular expressions
So we were discussing regular expressions and how they can be completely unreadable, but I really love this language, mostly because it is a single-purpose language, string matching, and it serves that purpose extremely well. Here's a crash course in regular expressions.
Pattern matching
Consider the following example, where the pattern is matched against the string.
# match a substring
String: Uncle is my mother's brother
Pattern: other
It will match "other" twice in the sentence. That's easy enough but not very useful. \w means word character and \s means whitespace. What do we write if we would like to match words?
# match words
String: The quick brown fox jumps over the lazy dog
Pattern: \w+
This makes a very inefficient string splitter.
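As a rough sketch of how the word pattern could be applied from C#, the snippet below collects every `\w+` match into an array. The class name `WordMatcher` and the method `Words` are made up for this illustration.

```csharp
using System;
using System.Linq;
using System.Text.RegularExpressions;

// Hypothetical helper, named for this example only.
public static class WordMatcher
{
    // Returns every run of word characters found in the input, in order.
    public static string[] Words(string input)
    {
        return Regex.Matches(input, @"\w+")
                    .Cast<Match>()
                    .Select(m => m.Value)
                    .ToArray();
    }

    public static void Main()
    {
        // Prints one word per line.
        foreach (string word in Words("The quick brown fox jumps over the lazy dog"))
        {
            Console.WriteLine(word);
        }
    }
}
```

For plain whitespace splitting, `string.Split` would normally be the simpler tool; the regex version is mainly useful once the definition of a "word" gets more involved.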
# match one or more word characters or dashes
String: Lars-Åke Bengtsson
Pattern: [\w-]+
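A small sketch of the character class in C#, assuming the default .NET behavior where `\w` also covers non-ASCII letters such as Å. The names `NameMatcher` and `Names` are invented for the example.

```csharp
using System;
using System.Linq;
using System.Text.RegularExpressions;

// Hypothetical helper, named for this example only.
public static class NameMatcher
{
    // A dash placed last inside [...] is a literal dash, so hyphenated
    // names stay together as one match.
    public static string[] Names(string input)
    {
        return Regex.Matches(input, @"[\w-]+")
                    .Cast<Match>()
                    .Select(m => m.Value)
                    .ToArray();
    }

    public static void Main()
    {
        foreach (string name in Names("Lars-Åke Bengtsson"))
        {
            Console.WriteLine(name);
        }
    }
}
```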
^ matches the beginning of the string and $ matches the end. You can also use a|b to match either a or b.
# match first and last word of string
String: Luke, I am your father
Pattern: ^\w+|\w+$
Watch out, because $ can mean end of string or end of line, depending on the options you pass to the regex parser.
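In .NET the option in question is `RegexOptions.Multiline`. The sketch below runs the same pattern with and without it; the class `AnchorDemo`, the method `EdgeWords`, and the two-line sample string are assumptions made for the illustration.

```csharp
using System;
using System.Linq;
using System.Text.RegularExpressions;

// Hypothetical helper, named for this example only.
public static class AnchorDemo
{
    // Matches a word at the start (^) or at the end ($) of the input.
    // With RegexOptions.Multiline, ^ and $ apply to every line instead.
    public static string[] EdgeWords(string input, RegexOptions options)
    {
        return Regex.Matches(input, @"^\w+|\w+$", options)
                    .Cast<Match>()
                    .Select(m => m.Value)
                    .ToArray();
    }

    public static void Main()
    {
        Console.WriteLine(string.Join(", ", EdgeWords("Luke, I am your father", RegexOptions.None)));

        // Illustrative two-line input to show the effect of Multiline.
        string twoLines = "first line\nsecond line";
        Console.WriteLine(string.Join(", ", EdgeWords(twoLines, RegexOptions.None)));
        Console.WriteLine(string.Join(", ", EdgeWords(twoLines, RegexOptions.Multiline)));
    }
}
```

Without the option only the first and last word of the whole string match; with it, the first and last word of each line do.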
Match groups
A match group marks a part of the match that you would like to pick out separately from the rest.
# Distinguish element name
String: <b>I can haz hatz!</b>
Pattern: </?(\w+)>
Matching group #1: b
The question mark `?` indicates that the preceding character may appear in the match, but does not have to. We can name the matching groups like this.
# get parts of an e-mail address
String: spam@litemedia.se
Pattern: (?<username>.+)@(?<server>.+)
Matching group #username: spam
Matching group #server: litemedia.se
The dot `.` matches any character (by default, any character except a newline). `.+` means match one or more of them.
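Both kinds of group can be read back through `Match.Groups` in C#, by number or by name. This is only a sketch; the class `GroupDemo` and its methods `ElementNames` and `TrySplitAddress` are hypothetical names chosen for the example.

```csharp
using System;
using System.Linq;
using System.Text.RegularExpressions;

// Hypothetical helper, named for this example only.
public static class GroupDemo
{
    // Numbered group: group 1 holds the element name of each tag.
    public static string[] ElementNames(string markup)
    {
        return Regex.Matches(markup, @"</?(\w+)>")
                    .Cast<Match>()
                    .Select(m => m.Groups[1].Value)
                    .ToArray();
    }

    // Named groups: returns false when the address does not match,
    // in which case both out values are empty strings.
    public static bool TrySplitAddress(string address, out string username, out string server)
    {
        Match match = Regex.Match(address, @"(?<username>.+)@(?<server>.+)");
        username = match.Groups["username"].Value;
        server = match.Groups["server"].Value;
        return match.Success;
    }

    public static void Main()
    {
        Console.WriteLine(string.Join(", ", ElementNames("<b>I can haz hatz!</b>")));

        string username, server;
        if (TrySplitAddress("spam@litemedia.se", out username, out server))
        {
            Console.WriteLine(username);
            Console.WriteLine(server);
        }
    }
}
```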
Lazy and greedy
Consider the following match where we would like to find the format string in the expression.
# Get the first argument, format string in the expression
String: string.Format("I would like some {0}", "Bananas");
Pattern: ".*"
The pattern matches everything up to the last quote, whereas we would like it to match only up to the next quote. This is because * is greedy by default. We can make it lazy by adding a question mark.
# Get the first argument, format string in the expression
String: string.Format("I would like some {0}", "Bananas");
Pattern: ".*?"
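To compare the greedy and the lazy quantifier side by side, a sketch like the following could be used. `QuantifierDemo` and `FirstMatch` are illustrative names, not part of any library.

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical helper, named for this example only.
public static class QuantifierDemo
{
    // Returns the text of the first match, or an empty string if none.
    public static string FirstMatch(string input, string pattern)
    {
        return Regex.Match(input, pattern).Value;
    }

    public static void Main()
    {
        string code = @"string.Format(""I would like some {0}"", ""Bananas"");";

        // Greedy: runs on to the last quote in the input.
        Console.WriteLine(FirstMatch(code, "\".*\""));

        // Lazy: stops at the first quote that completes a match.
        Console.WriteLine(FirstMatch(code, "\".*?\""));
    }
}
```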
Look-ahead and look-behind
You can match things that come before another expression, or after.
# Find first letter of sentence (positive look-behind)
String: Tree you are. Moss you are. You are violets with wind above them.
Pattern: (?<=^|\.\s*)\w
This matches a word character that comes first in the string or after a dot and optional whitespace.
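A sketch of the look-behind in C#. It relies on .NET accepting a variable-length expression such as `\.\s*` inside a look-behind, which many other regex flavors reject. The names `SentenceDemo` and `FirstLetters` are assumptions for the example.

```csharp
using System;
using System.Linq;
using System.Text.RegularExpressions;

// Hypothetical helper, named for this example only.
public static class SentenceDemo
{
    // Returns the first word character of every sentence. The look-behind
    // only checks what precedes the letter; it is not part of the match.
    public static string[] FirstLetters(string text)
    {
        return Regex.Matches(text, @"(?<=^|\.\s*)\w")
                    .Cast<Match>()
                    .Select(m => m.Value)
                    .ToArray();
    }

    public static void Main()
    {
        string text = "Tree you are. Moss you are. You are violets with wind above them.";
        Console.WriteLine(string.Join(", ", FirstLetters(text)));
    }
}
```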
# Any digit not followed by a 1
String: 11011001
Pattern: \d(?!1)
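The negative look-ahead can be tried out with a sketch like this one, which reports the position of each matching digit. `LookaheadDemo` and `MatchPositions` are hypothetical names.

```csharp
using System;
using System.Linq;
using System.Text.RegularExpressions;

// Hypothetical helper, named for this example only.
public static class LookaheadDemo
{
    // Returns the zero-based index of every digit that is not followed by a 1.
    // The last digit qualifies too, since nothing follows it.
    public static int[] MatchPositions(string input)
    {
        return Regex.Matches(input, @"\d(?!1)")
                    .Cast<Match>()
                    .Select(m => m.Index)
                    .ToArray();
    }

    public static void Main()
    {
        Console.WriteLine(string.Join(", ", MatchPositions("11011001")));
    }
}
```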
Backreference
You can refer back to a previously defined group.
# Match content within tags
String: <q>Hardware: The parts of a computer </system> that can be kicked.</q>
Pattern: (?<=<(?<el>\w+)>).*?(?=</\k<el>)
Matching group #el: q
The content should be preceded by an opening tag and followed by a closing tag with the same element name.
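A sketch of the backreference in C#, using the pattern exactly as given. It assumes the .NET behavior where a group captured inside a look-behind remains available to `\k<el>` later in the pattern. `BackreferenceDemo` and `FindContent` are names invented for the example.

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical helper, named for this example only.
public static class BackreferenceDemo
{
    // The look-behind captures the element name as "el"; the look-ahead
    // then requires "</" followed by that same name.
    private const string Pattern = @"(?<=<(?<el>\w+)>).*?(?=</\k<el>)";

    public static Match FindContent(string input)
    {
        return Regex.Match(input, Pattern);
    }

    public static void Main()
    {
        Match match = FindContent("<q>Hardware: The parts of a computer </system> that can be kicked.</q>");
        Console.WriteLine(match.Groups["el"].Value);
        Console.WriteLine(match.Value);
    }
}
```

The stray `</system>` does not end the match, because its name differs from the captured `q`.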
Usage in C#
This is how you would use a regular expression in C#.
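A minimal sketch of such usage follows. The sample text, the pattern, and the names `FlowerDemo` and `FlowerNames` are all assumptions made for illustration.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

// Hypothetical example class; text and pattern are illustrative.
public static class FlowerDemo
{
    // Collects the value of the named group "flower" from every match.
    public static List<string> FlowerNames(string text)
    {
        var regex = new Regex(@"The (?<flower>\w+) is");
        var names = new List<string>();
        foreach (Match match in regex.Matches(text))
        {
            names.Add(match.Groups["flower"].Value);
        }
        return names;
    }

    public static void Main()
    {
        string text = "The rose is red. The violet is blue. The tulip is yellow.";
        foreach (string name in FlowerNames(text))
        {
            Console.WriteLine(name);
        }
    }
}
```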
This will print out all the flower names to the console window.
One of the best resources for regular expressions I've found is regular-expressions.info. Now, go along and have fun!