Regular expressions

Regular expressions are a powerful search language for when you can’t rely on the data being at a set position, in a given structure or containing a marker you can look for. At their most basic, entering a word will search for that word, but you get the same with the find method. The power of regular expressions comes from their special characters. You may be used to the wildcards ? and * (_ and % in SQL); the same can be achieved with . and .* in regular expressions. The dot matches any character (except, by default, a newline) and an asterisk matches the previous character zero or more times, hence .* matches anything. These can be built up into very complex matches.
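To make the wildcard comparison concrete, here is a minimal sketch using Python's re module. The sample string is made up purely for illustration.

```python
import re

# Made-up sample text, purely for illustration.
text = "cat cot cut coat"

# "." stands for exactly one character, much like the "?" wildcard.
single = re.findall(r"c.t", text)
print(single)   # ['cat', 'cot', 'cut']

# ".*" stands for any run of characters (including none), much like "*".
# It is greedy, so it stretches from the first c to the last t.
greedy = re.findall(r"c.*t", text)
print(greedy)   # ['cat cot cut coat']
```

Note how greedy .* is: it swallows the whole string rather than stopping at the first t, which is one of the first surprises people hit.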

Python supports regular expressions with the re module. The Python docs provide a brief summary of regular expressions and the methods Python provides, but they do not attempt to teach regular expressions, and neither does this short blog post. Entire books have been dedicated to the subject, but there are plenty of decent tutorials on the web to get you started. If you have a particular favourite you want to share, put it in the comments.

As explained in the docs, unless you want to type in a lot of backslashes, use the raw string format, r'...' or r"...", when entering regular expressions in Python.
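A small sketch of why raw strings matter. The patterns and file names here are invented for the example; the point is only how Python treats the backslashes before the regular expression engine ever sees them.

```python
import re

# Without a raw string, every backslash meant for the regex must be doubled.
plain = "\\w+\\.txt"
# With a raw string, you type the pattern exactly as the regex engine sees it.
raw = r"\w+\.txt"
same = plain == raw
print(same)                                  # True

files = re.findall(raw, "notes.txt and todo.txt")
print(files)                                 # ['notes.txt', 'todo.txt']

# Forgetting the r can fail silently: in a normal string "\b" is a
# backspace character, not the regex word-boundary marker.
without_r = re.findall("\bcat\b", "cat")
with_r = re.findall(r"\bcat\b", "cat")
print(without_r, with_r)                     # [] ['cat']
```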

If you want to test whether your regular expression works, try regexr.com: put the regular expression in the top box and the sample text you want it to search through in the bottom box. It uses the Perl-style notation, where a regular expression starts with a slash (/) and continues to the last slash. You can then specify options (expression flags) after the last slash to control how the search works. So /Python/ is the regular expression to search for the word Python. At the time of writing, regexr.com defaulted to the expression ([A-Z])\w+, which basically matches all words that begin with a capital letter (more on this later), storing the capital letter in question in group 1.
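Python has no slash notation, so flags are passed as a separate argument instead. As a rough sketch, assuming the trailing i flag (ignore case) as in /Python/i, the equivalent in Python would look something like this. The sample text is made up.

```python
import re

# Made-up sample text, purely for illustration.
text = "python, Python and PYTHON"

# Like /Python/ : case-sensitive by default.
exact = re.findall(r"Python", text)
print(exact)      # ['Python']

# Like /Python/i : the flag goes in the flags argument rather than
# after a closing slash.
any_case = re.findall(r"Python", text, flags=re.IGNORECASE)
print(any_case)   # ['python', 'Python', 'PYTHON']
```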

You can get a similar result to regexr.com in Python with the following code.

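import re

text = "Welcome to Python's Regular Expressions. I hope you enjoy what you F1nD."
regex = r"([A-Z])\w+"
# finditer yields one match object per non-overlapping match, left to right
for matchobj in re.finditer(regex, text):
    # group() is the whole match; groups() is a tuple of the captured groups
    print('Matched %s with groups [%r]' % (matchobj.group(), matchobj.groups()))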

I’ve skipped over what a match object is and the methods it provides, but this should match Welcome, Python, Regular, Expressions and F1nD. Not quite what you were expecting when I said it matches all words that begin with a capital? The \w class doesn’t include any punctuation, so it stops at the apostrophe in Python’s. But \w does include digits and the underscore, which is why it matches F1nD. Finally, because + requires one or more \w characters after the capital, it doesn’t match the capital I at the start of the second sentence.
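For the curious, here is a brief sketch of a few of the methods a match object offers, using the same text and expression. It only covers group(), groups() and span(); the docs list the rest.

```python
import re

text = "Welcome to Python's Regular Expressions. I hope you enjoy what you F1nD."
regex = r"([A-Z])\w+"

# re.search returns the first match object (or None if nothing matches).
first = re.search(regex, text)
print(first.group())    # 'Welcome'  the whole match
print(first.group(1))   # 'W'        the first captured group
print(first.groups())   # ('W',)     all captured groups as a tuple
print(first.span())     # (0, 7)     start and end offsets in text

# Collecting every match confirms the list given above.
words = [m.group() for m in re.finditer(regex, text)]
print(words)   # ['Welcome', 'Python', 'Regular', 'Expressions', 'F1nD']
```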

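If I wanted to include words with apostrophes and single-letter words, I might be tempted to try ([A-Z])[a-z']* but this matches any single capital letter, so it will also include the F and D from F1nD (try it). As the only single-letter words are A and I, a solution that works for this text would be ([AI])|([A-Z])[a-z']+. Be aware that alternatives are tried from left to right, so a word such as Apple would match only its A; putting the longer alternative first, ([A-Z])[a-z']+|([AI]), avoids that.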
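If you would rather try it in Python than on regexr.com, the sketch below runs these patterns over the sample text. The helper matched_words and the second test string "An Apple" are my own additions for illustration; the latter shows that the order of the alternatives matters, because Python tries them left to right.

```python
import re

text = "Welcome to Python's Regular Expressions. I hope you enjoy what you F1nD."

def matched_words(pattern, text):
    """Hypothetical helper: return the full text of every match."""
    return [m.group() for m in re.finditer(pattern, text)]

# The tempting version: * allows a lone capital, so F and D sneak in.
star = matched_words(r"([A-Z])[a-z']*", text)
print(star)  # ['Welcome', "Python's", 'Regular', 'Expressions', 'I', 'F', 'D']

# Single-letter words limited to A and I; everything else needs + .
alt = matched_words(r"([AI])|([A-Z])[a-z']+", text)
print(alt)   # ['Welcome', "Python's", 'Regular', 'Expressions', 'I']

# Caveat: with ([AI]) first, a word starting with A or I stops after one letter.
short_first = matched_words(r"([AI])|([A-Z])[a-z']+", "An Apple")
long_first = matched_words(r"([A-Z])[a-z']+|([AI])", "An Apple")
print(short_first)  # ['A', 'A']
print(long_first)   # ['An', 'Apple']
```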

Hopefully this shows the power and the pitfalls of regular expressions and why getting matches to work can sometimes be harder than it first seems.
