Yielding files

Time for a challenge, so I’m going to try 12 blogs of Christmas. The aim is to write 12 blog entries in December (or at least by 5th January, which is the 12th day of Christmas). That is one blog entry every 3 days. A catchy title always helps a challenge (think Movember) and this one could be used for almost anything. The 12 runs of Christmas does sound nice, but I’ve twisted my ankle so I doubt I’ll be running. Yes, it is the 4th already, so not a good start.

After the last post I’ve been thinking of other examples where a generator would be useful that are more in keeping with the theme of this blog (sys administration with Python, in case you’ve forgotten). Iterating through system calls or an API would be a good candidate, but I’ve not been using anything recently that fitted the bill. Another case that sprang to mind was file searching.

A reasonable way to do this would be to build a list of matching files and return it. But why use the memory to create the list if the caller is unlikely to need one? If they do, they can call list() or use a list comprehension to create it anyway. So this should make a good generator example.
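To make the memory argument concrete, here is a minimal sketch using a made-up generator (numbered_names is hypothetical and exists only for illustration). Nothing is produced until the caller asks for it, and a caller who does want a list can still build one.

```python
def numbered_names(count):
    """Hypothetical generator: yields one name at a time instead of building a list."""
    for number in range(1, count + 1):
        yield "file%d.txt" % number

names = numbered_names(3)    # no names have been created yet
first = next(names)          # the first name is produced on demand
remaining = list(names)      # the caller builds a list only if it wants one

print(first)
print(remaining)
```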

Some of the work is done already by os.walk; this iterates through the directory tree, giving you the list of files and folders in each directory. Normally when you are looking for files you would specify a wildcard pattern, so I’m going to use regular expressions and return any file that matches using yield. I’ve covered regular expressions a few times before so I’ll skip any explanation and just present the code, which takes a directory and a file pattern and yields all the matching files.
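If you haven’t used os.walk before, this small sketch shows what it hands back. The folder layout (a.txt and sub/b.exe) is invented for the example and built in a temporary directory so it can run anywhere.

```python
import os
import tempfile

with tempfile.TemporaryDirectory() as root:
    # Build a tiny, made-up tree: root/a.txt and root/sub/b.exe
    os.makedirs(os.path.join(root, "sub"))
    for name in ("a.txt", os.path.join("sub", "b.exe")):
        with open(os.path.join(root, name), "w") as handle:
            handle.write("x")

    walked = []
    # os.walk yields one (directory, subfolders, files) tuple per directory
    for parent, dirs, files in os.walk(root):
        relative = os.path.relpath(parent, root)
        walked.append((relative, sorted(dirs), sorted(files)))

for entry in walked:
    print(entry)
```

Each tuple covers one directory, so the filenames still need joining to the parent path before you can open them.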

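import os
import re

def filesearch(root, pattern, exact=True):
    # Compile once so the pattern is not re-parsed for every file
    searchre = re.compile(pattern)
    for parent, dirs, files in os.walk(root):
        for filename in files:
            if exact:
                # match anchors the pattern at the start of the filename
                res = searchre.match(filename)
            else:
                # search accepts a match anywhere in the filename
                res = searchre.search(filename)
            if res:
                yield os.path.join(parent, filename)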

for filename in filesearch(r"C:\Temp",r".*\.exe"):
    print("%s has size %d" % (filename,os.path.getsize(filename)))

The only thing to note is I added a third option so you can do a match (the regular expression must match from the start of the filename) or a search (the regular expression can match anywhere in the filename). This defaults to True, which is a match. Be aware that match only anchors the start, so if the pattern has to cover the whole filename, end it with $.
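The difference between the two modes is easy to get wrong, so here is a quick sketch comparing re.match and re.search, plus re.fullmatch (Python 3.4 and later) for comparison. The filenames are made up for the example.

```python
import re

pattern = re.compile(r".*\.exe")
name = "setup.exe.bak"   # hypothetical filename

at_start = bool(pattern.match(name))     # True: "setup.exe" matches from the start
anywhere = bool(pattern.search(name))    # True: search also finds "setup.exe"
whole = bool(pattern.fullmatch(name))    # False: ".bak" is left over

bare = re.compile(r"\.exe")
bare_match = bool(bare.match("setup.exe"))    # False: the name does not start with ".exe"
bare_search = bool(bare.search("setup.exe"))  # True: ".exe" appears later in the name

print(at_start, anywhere, whole, bare_match, bare_search)
```

So a pattern such as .*\.exe will also pick up setup.exe.bak unless it ends with $ or is checked with fullmatch.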

The example should find any executables in the C:\Temp folder and its subfolders. Regular expressions are very powerful but not quite as simple as using *.exe. Instead the asterisk becomes .* (match any character 0 or more times) and the dot has to be escaped as it is a special character. I’ve just printed the filename and size out, but you could equally delete the file if it was bigger than a certain size, etc.
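If you would rather keep typing *.exe, one option (not what the function above does, just an alternative I’m assuming you might like) is the standard library’s fnmatch.translate, which turns a shell-style wildcard into a regular expression string. The filenames here are invented for the example.

```python
import fnmatch
import re

# Convert a shell-style wildcard into a regular expression and compile it
wildcard_re = re.compile(fnmatch.translate("*.exe"))

names = ["setup.exe", "notes.txt", "setup.exe.bak"]   # hypothetical filenames
hits = [name for name in names if wildcard_re.match(name)]

print(hits)
```

The exact string translate returns varies between Python versions, so treat it as something to pass to re.compile rather than something to read.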

And that’s my first post of 12 blogs of Christmas. Let’s see if I can get all 12 done in time.


Regular expressions

Regular expressions are a powerful search language for when you can’t rely on the data being at a set position, in a given structure or containing a marker you can look for. At their most basic, entering a word will search for that word, but you get the same with the find method. The power of regular expressions comes from their special characters. You may be used to the wildcards ? and * (_ and % in SQL); the same can be achieved with . and .* in regular expressions. The dot matches any character (except a newline, by default) and an asterisk matches the previous character zero or more times, hence .* matches anything. These can be built up into very complex matches.
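As a small sketch of those basics, the snippet below compares a plain word search with the find method, then tries the dot and the asterisk. The sample strings are made up.

```python
import re

sentence = "I like Python"

# A plain word behaves like str.find: both report where the word starts
find_position = sentence.find("Python")
regex_position = re.search(r"Python", sentence).start()

# The dot stands in for exactly one character
one_char = re.findall(r"c.t", "cat cot cut coat")

# The asterisk repeats the previous character zero or more times
repeats = re.findall(r"ab*", "a ab abbb")

# .* matches anything, and is greedy, so it grabs as much as it can
greedy = re.findall(r"c.*t", "cat coat")

print(find_position, regex_position, one_char, repeats, greedy)
```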

Python supports regular expressions with the re module. The Python docs provide a brief summary of regular expressions and the methods Python provides but do not attempt to teach regular expressions, and neither does this short blog post. Entire books have been dedicated to this, but there are plenty of decent tutorials on the web to get you started. If you have a particular favourite you want to share, put it in the comments.

As explained in the docs, unless you want to type in a lot of backslashes, use the raw string format, r'' or r"", when entering regular expressions in Python.
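A short sketch of why the raw string format matters: without it, Python interprets some backslash sequences itself before the re module ever sees them.

```python
import re

# Without a raw string every backslash has to be doubled
doubled = "\\w+\\.exe"
raw = r"\w+\.exe"
same_pattern = (doubled == raw)

# "\b" in a normal string is a backspace character, not a word boundary
backspace_result = re.search("\bcat", "a cat")
boundary_result = re.search(r"\bcat", "a cat")

print(same_pattern, backspace_result, boundary_result.group())
```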

If you want to test whether your regular expression works, try using regexr.com: put the regular expression in the top box and the sample text you want it to search through in the bottom box. It uses the Perl syntax, where a regular expression starts with a slash (/) and continues to the last slash. You can then specify options (expression flags) after the last slash to control how the search works. So /Python/ is the regular expression to search for the word Python. At the time of writing regexr.com defaulted to the expression ([A-Z])\w+, which basically matches all words that begin with a capital letter (more later), storing the capital letter in question in group 1.
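Python has no /…/ syntax, so the options that follow the last slash are passed as flags instead. As a rough sketch (the mapping to Perl-style flags is my own reading, so treat it as an assumption): a trailing i corresponds to re.IGNORECASE, and finding every match rather than just the first is a matter of calling findall or finditer instead of search.

```python
import re

text = "Python, python and PYTHON"   # made-up sample text

# Roughly /Python/ : only the exact capitalisation matches
case_sensitive = re.findall(r"Python", text)

# Roughly /Python/i : the flag is passed as an argument
ignoring_case = re.findall(r"Python", text, flags=re.IGNORECASE)

# search stops at the first match; findall and finditer keep going
first_only = re.search(r"Python", text, flags=re.IGNORECASE).group()

print(case_sensitive, ignoring_case, first_only)
```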

You can get a similar result to regexr.com in Python with the following code.

import re
# The default expression from regexr.com, as a raw string
regex = r"([A-Z])\w+"
text = "Welcome to Python's Regular Expressions. I hope you enjoy what you F1nD."
for matchobj in re.finditer(regex, text):
    print('Matched %s with groups [%r]' % (matchobj.group(), matchobj.groups()))

I’ve skipped over what a match object is and the methods it provides, but this should match Welcome, Python, Regular, Expressions and F1nD. Not quite what you were expecting when I said it matches all words that begin with a capital? The \w class doesn’t include any punctuation, so it stops at the apostrophe in Python’s. But \w does include numbers and the underscore, which is why it matches F1nD. Finally, + requires one or more characters in the \w class after the capital, so it doesn’t match the capital I at the start of the second sentence.
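For anyone who hasn’t met match objects, here is a brief sketch of the methods used most often. The sample string is made up; the method names are the ones the re module provides.

```python
import re

matchobj = re.search(r"([A-Z])\w+", "hello World")

whole_match = matchobj.group()     # the full text that matched
first_group = matchobj.group(1)    # just the part captured by the first ( )
all_groups = matchobj.groups()     # every captured group, as a tuple
position = matchobj.span()         # (start, end) indexes of the match

# search returns None when nothing matches, so test before using the result
no_match = re.search(r"([A-Z])\w+", "all lower case")

print(whole_match, first_group, all_groups, position, no_match)
```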

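If I wanted to include words with apostrophes and single letters, you might be tempted to try ([A-Z])[a-z']* but this matches any single capital letter so will also include F and D at the end (try it). As the only single-letter words are A and I, a working solution for this text would be ([AI])|([A-Z])[a-z']+. Be aware that alternatives are tried from left to right, so this version would match only the A of a word such as And. Putting the longer alternative first and adding word boundaries, ([A-Z])[a-z']+|\b([AI])\b, avoids that.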
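Here is a sketch you can run to compare these attempts. The second sample sentence and the reordered pattern with \b word boundaries are my own additions for illustration; they show that the order of the alternatives matters once a word starts with A or I.

```python
import re

text = "Welcome to Python's Regular Expressions. I hope you enjoy what you F1nD."
other = "An Apple a day. I agree."   # made-up text with words starting with A

def matched(pattern, sample):
    """Return the full text of every match of pattern in sample."""
    return [m.group() for m in re.finditer(pattern, sample)]

# Zero or more lowercase letters: also picks up the stray F and D
too_loose = matched(r"([A-Z])[a-z']*", text)

# Single-letter alternative first: fine on the sample text...
single_first = matched(r"([AI])|([A-Z])[a-z']+", text)
# ...but it stops after the A of "An" and "Apple"
single_first_other = matched(r"([AI])|([A-Z])[a-z']+", other)

# Longer alternative first, with word boundaries around the single letter
reordered_other = matched(r"([A-Z])[a-z']+|\b([AI])\b", other)

print(too_loose)
print(single_first)
print(single_first_other)
print(reordered_other)
```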

Hopefully this shows the power and the pitfalls of regular expressions and why getting matches to work can sometimes be harder than it first seems.