Regular expressions

Using regular expressions to search / pattern matching

Yielding files

Time for a challenge, so I’m going to try the 12 blogs of Christmas. The aim is to write 12 blog entries in December (or at least by 5th January, which is the 12th day of Christmas). That is one blog entry every 3 days. It’s a catchy title for a challenge (which always helps, think Movember) and could be used for any challenge. I’ve twisted my ankle so I doubt I’ll be running, although the 12 runs of Christmas does sound nice. Yes, it is the 4th already, so not a good start.

After the last post I’ve been thinking of other examples of where a generator would be useful that are more in keeping with the theme of this blog (sys administration with Python, in case you’ve forgotten). Iterating through system calls or an API would be a good candidate, but I’ve not been using anything recently that fitted the bill. Another case that sprang to mind was file searching.

A reasonable way to do this would be to create a list, but why use the memory to create the list if the caller is unlikely to need one, and can use a list comprehension to create a list anyway? So this should make a good generator example.

Some of the work is done already by os.walk; this will iterate through each directory giving you a list of files and folders. Normally when you are looking for files you would specify a wildcard pattern, so I’m going to use regular expressions and return any file that matches using yield. I’ve covered regular expressions a few times before so I’ll skip any explanation and just present the code, which takes a directory and a file pattern and yields all the matching files.

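import os, re

def filesearch(root, pattern, exact=True):
    # Compile once; the pattern is reused for every file name
    searchre = re.compile(pattern)
    for parent, dirs, files in os.walk(root):
        for filename in files:
            if exact:
                # match() only succeeds at the start of the file name
                res = searchre.match(filename)
            else:
                # search() succeeds anywhere in the file name
                res = searchre.search(filename)
            if res:
                yield os.path.join(parent, filename)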

for filename in filesearch(r"C:\Temp",r".*\.exe"):
    print("%s has size %d" % (filename,os.path.getsize(filename)))

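The only thing to note is I added a third option so you can do a match (the regular expression must match from the start of the filename) or a search (the regular expression can match anywhere in the filename). This defaults to True, which is the anchored match. Be aware that match does not anchor the end of the name, so finish the pattern with $ if trailing text should be rejected.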
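To make the difference concrete, here is a small sketch comparing match, search and fullmatch on a few made-up file names. The names are only for illustration.

```python
import re

pattern = re.compile(r".*\.exe")

# match() anchors at the start only, so trailing text is still accepted
match_plain = bool(pattern.match("setup.exe"))          # True
match_trailing = bool(pattern.match("setup.exe.bak"))   # True
# fullmatch() requires the whole string to match the pattern
full_trailing = bool(pattern.fullmatch("setup.exe.bak"))  # False

short = re.compile(r"\.exe")
# search() finds the pattern anywhere; match() only at the start
search_short = bool(short.search("setup.exe"))  # True
match_short = bool(short.match("setup.exe"))    # False

print(match_plain, match_trailing, full_trailing, search_short, match_short)
```

If a whole-name match is what you want, either end the pattern with $ or use fullmatch in place of match.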

The example should find any executables in the C:\Temp folder. Regular expressions are very powerful but not quite as simple as using *.exe. Instead the asterisk becomes .* (match any character 0 or more times) and the dot has to be escaped as it is a special character. I’ve just printed the filename and size out, but you could equally delete the file if it was bigger than a certain size, etc.
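If you would rather keep typing wildcards, the standard library can do the conversion for you. This is a sketch of an alternative, not part of the function above: fnmatch.translate turns a shell-style pattern into a regular expression string, which can then be compiled as usual. The file names are made up.

```python
import fnmatch
import re

# translate() returns a regex string equivalent to the wildcard pattern
wildcard_re = re.compile(fnmatch.translate("*.exe"))

names = ["setup.exe", "notes.txt", "setup.exe.bak"]
matches = [name for name in names if wildcard_re.match(name)]
print(matches)
```

The translated pattern is anchored at the end, so setup.exe.bak is not returned here.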

And that’s my first post of 12 blogs of Christmas. Let’s see if I can get all 12 done in time.


									

Regular expression substitutions

Following on from my introduction to regular expressions in Python, it is time to substitute the match with something more useful. This is done with the sub method. This takes at least 3 parameters: the regular expression, the replacement and the text to search. At its most basic you have the following:

re.sub("PERL","Python","I program in PERL!")

This is not very exciting; the replace method on a string does exactly the same. But this basic example hides two powerful features: the first parameter is a regular expression and the second parameter can also be a function. Put this together with the example I used when introducing regular expressions and we have:

import re
def toupper (matchobj):
    return matchobj.group().upper()
text="Welcome to Python's Regular Expressions. I hope you enjoy what you F1nD."
regex=r"([A-Z])\w+"
print(re.sub(regex,toupper,text))

This matches the same words as previously but this time changes them to uppercase. The regex I’ve covered in some detail but the function parameter needs a bit more explanation. The function is passed the match object for each match, and whatever the function returns is what is substituted into the text.

In the example above I’ve used the group method with no parameters to return the entire string that was matched. I simply turned this to uppercase, so you can see something happening, before returning it to the sub method. It is not much of a stretch to go from this to basic template functionality.

I am going to look through the template for any substitution variables enclosed in double braces, {{ and }}, and replace each one with the result of a function. My first decision is how to get the name out of the matched string. I know it is two characters in from both ends, so I could use matchobj.group()[2:-2], but this would hard code the pattern. Instead I’ll use the grouping option of regular expressions: enclose the variable name in parentheses and get it using matchobj.group(1). This way, if I want to change the double braces to something else I can just change the regex pattern.

Then I need a way to map the variable name to the output I want. For this example I will just create a dictionary with the variable names as the key and the function to call as the value. This way if the variable name exists in the dictionary I can simply return the result of the function back.

To demonstrate I’ve created this example. I’ve included the template as a variable to make the example self-contained. It should be self-explanatory from the text the template contains what is happening. The only other thing to mention is that I change the matched string to lowercase to make the substitution case insensitive.
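The original example is not reproduced on this page, so what follows is a minimal sketch of the approach described above rather than the original code. The variable names (hostname, today), the template text and the choice to leave unknown variables untouched are all assumptions made for illustration.

```python
import platform
import re
from datetime import date

# Hypothetical variable names mapped to the functions that produce their values
variables = {
    "hostname": platform.node,
    "today": lambda: date.today().isoformat(),
}

# Group 1 captures the variable name between the double braces
template_re = re.compile(r"\{\{(\w+)\}\}")

def substitute(matchobj):
    name = matchobj.group(1).lower()  # lowercase for a case insensitive lookup
    func = variables.get(name)
    if func is None:
        return matchobj.group()  # assumption: leave unknown variables as they are
    return str(func())

template = "Report for {{HostName}} generated on {{today}}. Unknown: {{missing}}."
result = template_re.sub(substitute, template)
print(result)
```

Changing the delimiters only means changing the pattern passed to re.compile, because the function reads the name from group 1 rather than slicing the matched text.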

Regular expressions

Regular expressions are a powerful search language for when you can’t rely on the data being at a set position, in a given structure or containing a marker you can look for. At their most basic, entering a word will search for that word, but you get the same with the find method. The power of regular expressions comes from their special characters. You may be used to the wildcards ? and * (_ and % in SQL); the same can be achieved with . and .* in regular expressions. The dot matches any character and an asterisk matches the previous character zero or more times, hence .* matches anything. These can be built up into very complex matches.

Python supports regular expressions with the re module. The Python docs provide a brief summary of regular expressions and the methods Python provides but do not attempt to teach regular expressions, and neither does this short blog post. Entire books have been dedicated to this, but there are plenty of decent tutorials on the web to get you started. If you have a particular favourite you want to share, put it in the comments.

As explained in the docs, unless you want to type in a lot of backslashes use the raw string format, r'' or r"", when entering regular expressions in Python.
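A short sketch of why this matters, using the word boundary \b as the example. In a normal string Python treats \b as a backspace character before the re module ever sees it.

```python
import re

# Without a raw string, each backslash meant for the regex must be doubled
plain = "\\bPython\\b"
raw = r"\bPython\b"
print(plain == raw)  # True

# "\b" in a normal string is a backspace character, not a word boundary
no_match = re.search("\bPython\b", "I like Python")
print(no_match)  # None

found = re.search(raw, "I like Python").group()
print(found)  # Python
```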

If you want to test whether your regular expression works try using regexr.com: put the regular expression in the top box and the sample text you want it to search through in the bottom box. It uses the PERL syntax, where a regular expression starts with a slash (/) and continues to the last slash. You can then specify options (expression flags) after the last slash to control how the search works. So /Python/ is the regular expression to search for the word Python. At the time of writing regexr.com defaulted to ([A-Z])\w+, which basically matches all words that begin with a capital letter (more later), storing the capital letter in question in group 1.

You can get a similar result to regexr.com in Python with the following code.

import re
text="Welcome to Python's Regular Expressions. I hope you enjoy what you F1nD."
regex=r"([A-Z])\w+"
for matchobj in re.finditer(regex,text):
    print('Matched %s with groups [%r]' % (matchobj.group(), matchobj.groups()))

I’ve skipped over what a match object is and the methods it provides, but this should match Welcome, Python, Regular, Expressions and F1nD. Not quite what you were expecting when I said it matches all words that begin with a capital? The \w group doesn’t include any punctuation, so the match stops at the apostrophe in Python’s. But \w does include numbers and the underscore, which is why it matches F1nD. Finally, + requires one or more characters in the \w group after the capital, so it doesn’t match the capital I at the start of the second sentence.

If you wanted to include words with apostrophes and single letters you might be tempted to try ([A-Z])[a-z']* but this matches any single capital letter so will also include F and D at the end (try it). As the only single letter words are A and I, a working solution for this text would be ([AI])|([A-Z])[a-z']+
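Here is a quick sketch to try both patterns against the sample text. The last two lines are my own addition: they show a caveat with the second pattern on text that contains longer words starting with A or I, and one possible way around it (putting the longer alternative first).

```python
import re

text = "Welcome to Python's Regular Expressions. I hope you enjoy what you F1nD."

def matches_in(pattern, sample):
    # group() with no arguments returns the whole match, ignoring the groups
    return [m.group() for m in re.finditer(pattern, sample)]

first_try = matches_in(r"([A-Z])[a-z']*", text)
print(first_try)   # ['Welcome', "Python's", 'Regular', 'Expressions', 'I', 'F', 'D']

second_try = matches_in(r"([AI])|([A-Z])[a-z']+", text)
print(second_try)  # ['Welcome', "Python's", 'Regular', 'Expressions', 'I']

# Caveat: the [AI] alternative is tried first, so it cuts longer words short
caveat = matches_in(r"([AI])|([A-Z])[a-z']+", "An Apple")
print(caveat)      # ['A', 'A']

# Trying the longer alternative first avoids this
reordered = matches_in(r"([A-Z])[a-z']+|([AI])", "An Apple")
print(reordered)   # ['An', 'Apple']
```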

Hopefully this shows the power and the pitfalls of regular expressions and why getting matches to work can sometimes be harder than it first seems.

Processing file names

I seem to have spent a lot of time working with files this month. My task today was to summarise the output from an inventory tool. This tool (well, a VBScript) had created a lot of text files with the computer name and a date serial in the name. What I wanted was this information in a CSV file to compare to our asset list.

There are lots of ways of doing this, but as the computer name is variable length and the rest of the file name has a known format I used a regular expression. Regular expressions can get complicated but in this case I’m just looking for any letter, number, underscore or dash followed by .example.com (for my computer name), then an underscore followed by a 12 digit datetime serial. The rest of the name is irrelevant.

I am also using grouping by enclosing the computer name and datetime serial in parentheses. This allows me to return the matched details using group().

For the code below to work you would need to define a function formatdate to turn the datetime serial into something more readable. I’ve left this out to improve the clarity of the example.

import os, re, csv
# The dots are escaped so they only match a literal "."
prog = re.compile(r"([a-zA-Z0-9_\-]+)\.example\.com_([0-9]{12})")

def getinfo ( instr ):
    res = prog.match(instr)
    if res:
        return (res.group(1), formatdate(res.group(2)))

with open('names.csv','w', newline='') as csvfile:
    csvwriter = csv.writer(csvfile)
    csvwriter.writerow(['Computer','Date']) # header
    for file in os.listdir(r'\\server\path\to\inventories'):
        fileinfo = getinfo(file)
        if fileinfo:
            csvwriter.writerow(fileinfo)
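The formatdate function is left undefined in the listing above. One possible version is sketched here, assuming the 12 digit serial is laid out as YYYYMMDDHHMM. That layout is my assumption, so adjust the format string to whatever your tool actually writes.

```python
from datetime import datetime

def formatdate(serial):
    """Convert a 12 digit serial, assumed to be YYYYMMDDHHMM, to a readable string."""
    parsed = datetime.strptime(serial, "%Y%m%d%H%M")
    return parsed.strftime("%Y-%m-%d %H:%M")

example = formatdate("201312041530")
print(example)  # 2013-12-04 15:30
```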

The natural progression would be to use a list comprehension to create a list and then write this all out in one go using writerows instead. However, getting this to handle cases where the match failed resulted in ugly code. If I work out a way around this I’ll include it.
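One way this might be done, offered as a sketch rather than the finished answer, is to pass writerows a generator expression that drops the failed matches. To keep it runnable anywhere, the file names are a made-up list standing in for os.listdir, the output goes to an in-memory buffer instead of names.csv, and the date is written unformatted.

```python
import csv
import io
import re

prog = re.compile(r"([a-zA-Z0-9_\-]+)\.example\.com_([0-9]{12})")

def getinfo(name):
    res = prog.match(name)
    if res:
        return (res.group(1), res.group(2))
    return None  # no match: filtered out below

# Hypothetical file names standing in for the os.listdir() output
names = ["pc-01.example.com_201312041530.txt", "readme.txt"]

out = io.StringIO()  # stands in for the open CSV file
csvwriter = csv.writer(out)
csvwriter.writerow(["Computer", "Date"])  # header
# Skip the None results from failed matches while writing everything in one call
csvwriter.writerows(info for info in map(getinfo, names) if info)
print(out.getvalue())
```

Because it is a generator expression rather than a list comprehension, no intermediate list is built.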

If you needed to include sub-directories as well then you could use os.walk instead of os.listdir, then loop over the files list returned for each directory.
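A minimal sketch of that change is below. The helper name iter_filenames is hypothetical, and a temporary directory is used so the example runs without the network share.

```python
import os
import tempfile

def iter_filenames(root):
    """Yield every file name under root, including sub-directories."""
    for parent, dirs, files in os.walk(root):
        for name in files:
            yield name

# Build a small throwaway tree to walk
with tempfile.TemporaryDirectory() as root:
    os.makedirs(os.path.join(root, "sub"))
    for path in ("a.txt", os.path.join("sub", "b.txt")):
        open(os.path.join(root, path), "w").close()
    found = sorted(iter_filenames(root))

print(found)  # ['a.txt', 'b.txt']
```

Each name yielded could then be passed to getinfo exactly as the names from os.listdir were.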

Export AD group membership to a file

When I put in a network license server for AutoCAD a couple of years ago the challenge was not the installation but enabling the management to be done by others. Autodesk uses FLEXlm, which is lightweight and uses a text file to define the options. Users who are able to use the software or a feature are listed by username in the text file. However, logging on to a server and editing text files is not something you want a Service Desk doing.

As we already use Active Directory for user management, why not use it to control who can log on and borrow a license? All you need is an automated way to export members of a group to a (correctly structured) text file. There are lots of ways of doing this but after a couple of years I have settled on the following.

To export the members of a group I have used the walk method in the active_directory module. This works the same way as os.walk in that it recursively walks through the groups from the given root group. We can then process the list of users returned in each tuple to get the username (or sAMAccountName as it is referred to in AD). As it is plausible the same user may appear more than once I have used a set to store the users; if the same user appears in groups further down the tree they will be silently ignored.

Just writing this out in a file is not enough; the file must be structured in a way the license manager will understand. For this I will use a template file and the sub method from the regular expression (re) module. I’ve created a separate blog post on how this works. This will replace any occurrence of AD{group_name} with the members of that group.
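A rough sketch of the substitution step is shown here. To keep it runnable without a domain, the group membership is a hard-coded dictionary standing in for the sets built from the AD walk. The group names, user names and template lines are all made up, and the \w+ pattern assumes group names contain only letters, digits and underscores.

```python
import re

# Hypothetical membership; the real script would collect sAMAccountName
# values into a set while walking the AD group
members = {
    "acad_users": {"asmith", "bjones"},
    "acad_borrow": {"asmith"},
}

# Illustrative template; the real layout depends on the license manager
template = """GROUP users AD{acad_users}
GROUP borrowers AD{acad_borrow}
"""

def expand_group(matchobj):
    group_name = matchobj.group(1)
    # Sorted for stable output; unknown groups expand to an empty string here
    return " ".join(sorted(members.get(group_name, ())))

options = re.sub(r"AD\{(\w+)\}", expand_group, template)
print(options)
```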

Finally you need to be able to tell the license manager that the options have been changed. This will be dependent on the license manager; FLEXlm allows you to do this by running lmutil from the command prompt. To do this in Python you can use the subprocess module, which I’ll cover in the next post.

If you are still awake after reading that, try out the example program. If you are in a domain just change the group name to one that exists and you will see the results. In reality the template file would be a text file read by the script but in the example file I’ve embedded it in a variable. I’ve used three groups: one for people who can run the program, another for users who can borrow a license and a group for IT so they can test the software runs.