
Markdown

I’ve successfully completed the 12-blog challenge I set myself (with just hours to go), even if 4 of those posts were editorial rather than code. It has reminded me how difficult coming up with regular content is.

Another problem related to content creation is how to format it. The most universal format is of course the plain text file. Plain text has great portability and diff tools work well for seeing changes between versions, but it is hardly pleasing on the eye. A universal format which allows formatting would naturally be HTML. HTML files retain the portability of text files but are far more difficult to write by hand and don’t work well with diff tools.

There are of course plenty of platforms (wikis and blogging platforms for example) that give you a near-WYSIWYG (or visual) editor to hide the HTML and cleverly store the pages so as to be able to generate a diff between versions. However this can lock you into the platform, losing the portability.

One solution that aims to give portability and basic formatting while still working with diff tools is Markdown. Those of you who, like me, grew up with email in the 90s will recognise Markdown as the way text-only email was formatted. Its main advantage is that it doesn’t require any markup as such (formatting is mostly contextual) but has a direct relationship to basic HTML.
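
As a quick illustration (my own snippet rather than anything from a real document), the following covers most of the Markdown you need day to day:

# A heading

Some *emphasised* text and some **bold** text.

* a bullet point
* another bullet point with a [link](https://example.com)

which maps directly to an h1 element, em and strong tags, an unordered list and an anchor.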

With these advantages it is no wonder Markdown has found a home in version control systems, which are used to dealing with text files and displaying changes between versions. If you create a readme file, sites like BitBucket and GitHub will automatically display this file along with the directory contents. Make this a readme.md (md is the common extension for a Markdown document) and it will be displayed with the formatting applied.

There is a Python Markdown module which takes Markdown text and converts it to HTML. It also supports extensions to add extra functionality. With this module installed (pip install markdown) conversion is just a method call away.

import markdown
md = markdown.Markdown()
print(md.convert("""# Hello
Welcome to a Markdown world!"""))

As a more useful example I have used this with the bottle web micro framework to create a program that allows you to view all the Markdown documents in a folder through a web browser. If you go to the root it will list all the Markdown documents and you can view one by clicking on it. Simply run the Python program from the directory you wish to view.
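
The program itself is in the repository; a minimal sketch of the idea (the routes and HTML here are my own illustration, not the exact code) looks something like this:

import os
import markdown
from bottle import route, run

md = markdown.Markdown()

@route('/')
def index():
    # list every Markdown document in the current directory as a link
    links = ['<a href="/view/%s">%s</a>' % (name, name)
             for name in os.listdir('.') if name.endswith('.md')]
    return '<br>'.join(links)

@route('/view/<name>')
def view(name):
    # convert the requested document to HTML and return it
    with open(name) as mdfile:
        return md.convert(mdfile.read())

run(host='localhost', port=8080)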


Testing coverage

Python already has a unittest module based on the Java JUnit library. You can create a series of tests by creating a class that inherits from unittest.TestCase – each method whose name starts with test then becomes a test. You can check a condition is true using the standard assert statement. To run the tests just call the unittest.main method.
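
As a minimal (made-up) example, saving the following as test_example.py and running it directly will run the one test:

import unittest

class TestStrings(unittest.TestCase):
    def test_upper(self):
        # a trivial check just to show the shape of a test
        assert 'hello'.upper() == 'HELLO'

if __name__ == '__main__':
    unittest.main()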

Straightforward, but this gets difficult to manage when the tests are split over multiple files. To help with this there is the nose module, which can be installed with pip in the usual way. This removes the need for the boilerplate code. It will search through all the Python code looking not only for classes derived from unittest.TestCase but also for any function or class that matches its test regex – basically anything containing test at a word boundary, inside a module whose name also matches.

As a simple demonstration I have created this repository. It contains stack.py – the most primitive stack implementation I could come up with. I want to ensure this works as expected so let’s come up with some tests. I don’t want test code littering the main code so I’ve created a directory called test for all my testing. In here I’ve created a file called test-stack.py which contains two methods: one for testing the stack when empty, and one to test that when I push values onto the stack I get them back in the correct order.
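
The actual tests are in the repository; as a rough sketch they look something like the following (I’m assuming a stack module exposing push and pop here – check the repository for the real interface):

import stack

def test_empty():
    # assumption: popping an empty stack returns nothing
    assert stack.pop() is None

def test_push_pop():
    stack.push(1)
    stack.push(2)
    # last in, first out
    assert stack.pop() == 2
    assert stack.pop() == 1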

Even though I’ve not created any boilerplate code to run these tests, if I enter the following command from the main directory it will indeed find and run these two tests. Hopefully both should pass.

nosetests

Obviously this is a trivial example but hopefully it shows how quickly unit tests can be set up. There is a lot more to be said on testing which will have to wait for another blog post.

So you have written some tests for your code. How do you know your tests are testing all of the code? This is known as code coverage and there is another good module for this called coverage.py which can be installed with pip as usual. The reason for choosing these two is they work together. Once installed I can include code coverage just by adding the following parameters to nosetests

nosetests --with-coverage --cover-erase

Now as well as running the tests it will show me the amount of code the tests have executed. This is fine as a metric, but if the code coverage is not 100% how do you know what the tests are missing? Add another parameter to the command, --cover-html, and coverage will create an HTML report (inside the cover sub-directory).
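
The full command now becomes

nosetests --with-coverage --cover-erase --cover-html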

Load index.html into a browser to see a summary similar to what is displayed on the screen at the end of the tests. Click on a module name and this shows you the module code, marking which lines were executed by the tests and which lines were not. A thin green bar at the start of a line of code indicates the line was executed; a red bar indicates no code execution.

Typing all these parameters in each time will get a little tedious. Thankfully nose supports an ini file for configuration. For some reason I could not get this to be automatically detected so I had to specify it at the command line with

nosetests -c nose.cfg
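
The file itself is just an ini file with a [nosetests] section; the long option names go in without the leading dashes, so mine looks something like this (adjust to taste):

[nosetests]
with-coverage=1
cover-erase=1
cover-html=1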

As a final note, coverage is just a metric on how much of your code is being tested. It does not imply anything about the quality of the tests. You can have 100% coverage with worthless tests just the same as you can have really thorough tests that only cover a small section of the code. At least if you follow the above you will know what your tests are missing.

Virtual environments in Visual Studio

A virtual environment in Python is a folder with everything needed to set up a local configuration isolated from the rest of the system. This allows you to have modules installed locally which are different from, or do not exist in, the global Python configuration. If you have used Node.js then you can think of virtual environments as npm’s default way of working – creating a local install of a package rather than a global one (pip’s default).

If you have multiple versions of Python installed on your machine then you can also specify which version of Python the virtual environment should use. This gives you the ability to test your code against multiple versions of Python just by creating multiple virtual environments.

There are already plenty of good posts out there on virtual environments so the aim of this blog post is not to rehash why you should use them (see here for a good introductory blog post) or to be a quick setup guide (see the Hitchhikers Guide to Python post). It is a quick guide to using virtual environments within Visual Studio. If you have not used virtual environments before it is worth giving those posts a quick read before continuing.

As an aside, Python 3.3 introduced the venv module as an alternative for creating lightweight virtual environments (although the original wrapper script pyvenv has already been deprecated in Python 3.6). While this is the correct way going forward, Visual Studio uses the older virtualenv method which is what I am concentrating on here.
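
For reference, creating an environment with the built-in module is a one-liner from the command line (assuming the python on your path is 3.3 or later):

python -m venv env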

Once you have created your Python solution expand it until you get to Python Environments. Right-click on this and choose Add Virtual Environment… from the menu as shown below.

 

You can change the name of the folder (defaults to env), which is also used as the name of the virtual environment, and the version of Python to use. Click Create to finish and you are ready to go (easy wasn’t it). If you expand the Python Environments node you should see the virtual environment appear.

In the background this has created a folder (the virtual environment) in your working directory with the name given. In case you are unsure, your working directory is the location of the solution, which defaults to X:\Users\me\Documents\VS20xx\Projects\Project Name\Solution Name\ (tip: change the default location). This could have been done manually by changing into the working directory and entering the following command (where X:\Python_xx is the installation directory for the version of Python you want to use and env is the name of the folder / virtual environment – if you just want your default version of Python then just pass the name of the folder).

virtualenv -p X:\Python_xx\python.exe env

To install a module into the virtual environment from Visual Studio just right-click on the virtual environment and select Install Python Package… from the menu or if you have a requirements.txt file you can select Install from requirements.txt. If you expand the virtual environment node you will see the modules installed. Once you have all the modules installed you can generate the requirements.txt file from the same menu and it will add the requirements.txt to your project for portability.

What if you want to use this virtual environment from the command line? Inside of the virtual environment is a Scripts directory with a script to make the necessary changes; the trick is to run the correct script from the working directory. The script to run depends upon whether you are running inside a PowerShell console (my recommendation) or from a command prompt. Change into the working directory and type in the following command (where env is the virtual environment folder)

PowerShell: .\env\Scripts\activate.ps1
Command prompt: env\Scripts\activate.bat

The prompt will change to show the name of the virtual environment, indicating activation has succeeded. You can do everything you would normally do from the command line but now you are running against the virtual environment. To confirm only the modules you have specified are installed, type ‘pip list’; to confirm the version of Python is the one you specified, type ‘python -V’.

Update: It appears I’m not the only one to be looking at virtual environments today, see this article if you want a similar introduction but from the command prompt only.

Storing data between runs

Sometimes there will be a need for your Python scripts to store information between runs. While configuration information is best kept in human readable files (like ini or yaml files), if it is data used internally then a better approach is needed. Unless the data to be stored is just plain text it is usually better in the long term to use a database for this. On first read, that last statement might seem overkill, but a few years down the line when the script has stopped working or you need to extract the data stored within, you will curse the decision to go with your own home-rolled file format.

We are only looking for ease of use here, not concurrent access or performance. With this in mind, SQLite is a very mature product and the sqlite3 module has been included in the Python standard library since v2.5, so there is nothing to install. It is powerful enough to be used as the data store for the Skype client. A good introduction to the SQLite module in Python (assuming you know a little SQL) can be found on Python Central.

Putting the snippets into a slightly more complete example, we can create the clichéd task list example with surprisingly little code. I’ve tried to choose the SQL statements to show off a few features you would probably want with your CRUD operations but this is not meant as a SQL tutorial.

import sqlite3

sqlcmds = ["CREATE TABLE IF NOT EXISTS persons (pid INTEGER PRIMARY KEY, givennames VARCHAR, surname VARCHAR);",
    "CREATE TABLE IF NOT EXISTS tasks (tid INTEGER PRIMARY KEY, pid INTEGER, name VARCHAR, percentcomplete INTEGER DEFAULT 0);",
    "INSERT INTO persons (pid,givennames,surname) VALUES (1,'Tom','Smith'),(2,'Dick','Jones'),(3,'Harry','Hat');",
    "INSERT INTO tasks (pid,name) VALUES (1,'Checktest'),(1,'Report Checktest'),(3,'Submit Report');",
    "UPDATE tasks SET percentcomplete = 25 WHERE name = 'Checktest';"]
sqlselect = """SELECT tasks.name, persons.givennames, persons.surname, tasks.percentcomplete
    FROM tasks inner join persons on tasks.pid = persons.pid"""

with sqlite3.connect(r"c:\path\to\sqlite.db") as dbconnection:
    cursor = dbconnection.cursor()
    for statement in sqlcmds:
        cursor.execute(statement)
        dbconnection.commit()
    cursor.execute(sqlselect)
    tasklist = ['"%s" with %s %s (%d%% complete)' % row for row in cursor]

print('\n'.join(tasklist))

Why is using SQLite an advantage? For starters you can easily view the data and execute arbitrary SQL with the DB Browser for SQLite utility. When unexpected characters or a dodgy record breaks the code you now have a fighting chance of fixing it. Also SQLite doesn’t create large files; the example above is a mere 3 KB file.

An interesting take on this is y_serial, which basically pickles your Python object and stores it with a label in a SQLite table. I’ve not tested this, so I make no recommendation for using it, but it is a novel way to store objects between runs without any SQL knowledge or installing any additional modules.

When I get a chance to finish testing and uploading there will be a bigger example of this. I want to monitor scheduled tasks for failures. There is a command line interface for this called schtasks which I can run using the code for running shell commands. As I am only interested in cases where a task has just failed I need to store the state of the last run, and this is where SQLite will come in.

TBC

 

PDF extraction

PDF files have been a standard for formatted documents for a long time. Back in the early days of the Internet they were about the only way to send a file that could be both viewed on screen and printed out the same. As a file to work with they are a bit of a nightmare – the PDF format is basically the Postscript printer language and associated resources packaged into a single file.

This makes extracting content from a PDF file tricky. What you get will depend a lot on the process used to create the PDF. If it has come from a spreadsheet and been created by a virtual printer driver then it is likely each cell is a separate item, and trying to recreate the text (let alone work with it) is going to be tricky. Where possible you want to use the data before it was converted to PDF.

When that is not possible there are PDF parsers written in Python. A popular one for data extraction is PDFMiner. While the original library only supports Python 2.6 & 2.7, there is a fork that uses the six module to support both Python 2 & 3. Install it using the usual pip command

pip install pdfminer.six

The official documentation is not great. It doesn’t even work as a quick start guide – see this blog post for a better review and quick start guide. Instead, creating a basic text extractor is more enlightening.

The first thing to create is a PDFResourceManager instance. This is basically just a dictionary used to cache fonts and similar shared resources. We don’t use this directly, we just pass a reference to it in a few method calls. Next we want the layout settings, which come from the slightly confusingly named LAParams class. You can just use the default settings from the constructor. Although we don’t change any of the settings, you cannot simply leave this out of the method call, as that defaults to no layout analysis at all and produces some strange results.

With these couple of objects we are ready to load the PDF file. The file is opened as normal and read in with the PDFPage.get_pages method. Once loaded we turn the data into our required output using an interpreter. Calling it an interpreter is again a confusing name, as it relies on a separate converter to produce the output in the desired format; the interpreter just handles the procedural side, which would be the same regardless of the output being produced.

PDFMiner comes with three converters: for text only, HTML and XML output. For this example we just want the text converter, which simply ignores any non-text items. All three converters take an output stream as the second parameter (after the resource manager). This could be a file object, but for this example I want the text in memory so I’m going to use the io module. With all the text now in this variable I can print it after suitably encoding any unicode characters.
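
Pulling those steps together, the extractor ends up looking roughly like this – a sketch of the approach described above rather than the exact file from Bitbucket:

import io
from pdfminer.pdfinterp import PDFResourceManager, PDFPageInterpreter
from pdfminer.converter import TextConverter
from pdfminer.layout import LAParams
from pdfminer.pdfpage import PDFPage

def pdf_to_text(path):
    resources = PDFResourceManager()   # shared font / resource cache
    output = io.StringIO()             # some pdfminer.six versions want io.BytesIO here instead
    converter = TextConverter(resources, output, laparams=LAParams())
    interpreter = PDFPageInterpreter(resources, converter)
    with open(path, 'rb') as pdffile:
        for page in PDFPage.get_pages(pdffile):
            interpreter.process_page(page)
    converter.close()
    return output.getvalue()

print(pdf_to_text(r"c:\path\to\document.pdf"))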

I’ve put the example code on Bitbucket.

Yielding files

Time for a challenge, so I’m going to try 12 blogs of Christmas. The aim is to write 12 blog entries in December (or at least by 5th January, which is the 12th day of Christmas). That is one blog entry every 3 days. It’s a catchy title for a challenge (always helps – think Movember) and could be used for any challenge; I’ve twisted my ankle so I doubt I’ll be running, even though the 12 runs of Christmas does sound nice. Yes, it is the 4th already, so not a good start.

After the last post I’ve been thinking of other examples where a generator would be useful that are more in keeping with the theme of this blog (sys administration with Python, in case you’ve forgotten). Iterating through system calls or an API would be a good candidate, but I’ve not been using anything recently that fitted the bill. Another case that sprang to mind was file searching.

A reasonable way to do this would be to return a list, but why use the memory to build a list if the caller is unlikely to need one – and they can always use a list comprehension to create a list from the generator anyway. So this should make a good generator example.

Some of the work is done already by os.walk; this will iterate through each directory giving you a list of files and folders. Normally when you are looking for files you would specify a wildcard pattern, so I’m going to use regular expressions and return any file that matches using yield. I’ve covered regular expressions a few times before so I’ll skip any explanation and just present the code, which takes a directory and a file pattern and returns all the matching files.

import os, re

def filesearch(root, pattern, exact=True):
    searchre = re.compile(pattern)
    for parent, dirs, files in os.walk(root):
        for filename in files:
            if exact:
                res = searchre.match(filename)
            else:
                res = searchre.search(filename)
            if res:
                yield os.path.join(parent, filename)

for filename in filesearch(r"C:\Temp",r".*\.exe"):
    print("%s has size %d" % (filename,os.path.getsize(filename)))

The only thing to note is that I added a third parameter so you can do a match (the regular expression must match from the start of the filename) or a search (the regular expression only needs to match part of the filename). This defaults to True, which uses match.

The example should find any executables in the C:\Temp folder. Regular expressions are very powerful but not quite as simple as using *.exe. Instead the asterisk becomes .* (match any character zero or more times) and the dot has to be escaped as it is a special character. I’ve just printed the filename and size out, but you could equally delete the file if it was bigger than a certain size, etc.

And that’s my first post of the 12 blogs of Christmas. Let’s see if I can get all 12 done in time.


									

YAML

I have been looking into Ansible for automation, which has meant looking at YAML files again. I’ve looked at YAML before; in theory it offers benefits over ini, JSON and XML based files, but the lack of a built-in module has resulted in me choosing ini or JSON formats. I have recently been favouring JSON in projects and, as this is a subset of YAML, it is time to take another look at the format.

For a long time the de facto standard library has been PyYAML. There are a bunch of Windows installers on the page which can be used to get up and running. However this module has received few updates in recent years and doesn’t appear to support the later YAML 1.2 standard. Up has stepped Anthon van der Neut, who has used it as the basis for his own ruamel.yaml module. This can be installed using pip in the usual way.

Whichever one you choose, you can use safe_load to load a YAML file (or convert a string) and dump to create a YAML string

try:
    import ruamel.yaml as yaml
except ImportError:
    import yaml

pydic = yaml.safe_load("""
# example yaml (this is a comment)
name: test
version: 1.0
inlinelist: [ "tinker", "tailor", "soldier", "spy" ]
yamllist:
- first item
- second item
dictionary:
    name: subdictionary
    usage: anything
""")
print(pydic)
with open('test.yml','w') as ymlfile:
    ymlfile.writelines(yaml.dump(pydic))

There is a limit on how readable the output from the dump method is, as you will see from the above example. If you are writing YAML files for configuration by hand you can be much more verbose and use whitespace. There is a full reference card of the YAML 1.1 spec on a single page on the yaml.org website.
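
If you want the dumped output in the more readable block style rather than the inline flow style, the dump method takes a default_flow_style argument (this is PyYAML’s interface, which ruamel.yaml also supports in its compatibility mode; the newer ruamel API works differently):

print(yaml.dump(pydic, default_flow_style=False))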

Finally, if you need to pass information to a JavaScript program (either in a browser or to node – which would usually force going with JSON) there is even a port of PyYAML to JavaScript.

Unzip a file in memory

The zipfile module is fairly flexible, but there are occasions when you cannot pass it a filename (as a string) or a file-like object; for example the open method on AWS S3 buckets does not return a suitable object. So what do you do if you can read the zip file into memory? Writing it to disk just to read it back in again seems a waste.

Python, as is often the case, already has a module to solve this problem, in this case StringIO. This allows you to treat a string, or in this case an entire file held in memory, as if it were a file.

This allows us to write our unzip procedure compactly as

# S3 connection omitted for brevity (and beyond scope)
import zipfile
import StringIO  # Python 2; on Python 3 use io.BytesIO instead

s3file = s3connection.get_bucket(bucketname).get_key(filename)
if s3file:
    s3file.open()
    zipdata = s3file.read()
    s3file.close()
    zipped = zipfile.ZipFile(StringIO.StringIO(zipdata))
    zipped.extractall()

Writing lists to a file

After creating my list from the previous post I needed to save it to a text file to send back to the person who had requested the information. I just wrote a standard loop over the list, but this got me thinking about whether there is a better way to do this as it’s a common task. As it turns out, not really, but it did bring up a few titbits worth mentioning.

If you are from a Perl background you might be tempted to use a single write and a join to turn the list into a string as follows

file.write("\n".join(mylist))

Don’t. This creates a new string in memory from the list and then writes that. OK if it is a small list, but a huge waste of memory for larger ones.
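
The straightforward loop – roughly what I originally wrote – is just

with open(r'x:\path\to\file.txt','w') as txtfile:
  for item in mylist:
    txtfile.write("%s\n" % item)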

For the more experienced Python user, using a generator expression produces arguably cleaner syntax and saves a line of typing. As with all generator expressions, you can make it conditional by adding an if clause to the end.

with open(r'x:\path\to\file.txt','w') as txtfile:
  txtfile.writelines( "%s\n" % item for item in mylist )

That covers the text file, but there may be a better option depending upon what is going to be done with the resulting file. I’ll finish with the following couple of examples.

# if the file is going to Excel, maybe a CSV file
import csv
with open(r'x:\path\to\file.csv','w') as csvfile:
  csvwriter = csv.writer(csvfile)
  for item in mylist:
    csvwriter.writerow([item])

# or how about json if it is going to another program
import json # Python 2.5 use simplejson
with open(r'x:\path\to\file.json','w') as jsonfile:
  json.dump(mylist,jsonfile)

See the Python docs for the csv module and json module.

Processing file names

I seem to have spent a lot of time working with files this month. My task today was to summarise the output from an inventory tool. This tool (well, a VBScript) had created a lot of text files with the computer name and a date serial in the name. What I wanted was this information in a CSV file to compare to our asset list.

There are lots of ways of doing this, but as the computer name is variable length while the rest of the file name is known, I used a regular expression. Regular expressions can get complicated, but in this case I’m just looking for any letter, number, underscore or dash, followed by .example.com (for my computer name), then an underscore followed by a 12 digit datetime serial. The rest of the name is irrelevant.

I am also using grouping by enclosing the computer name and datetime serial in parentheses. This allows me to return the matched details using group().

For the code below to work you would need to define a function formatdate to turn the datetime serial into something more readable. I’ve left this out to improve the clarity of the example.
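
If you want something to drop in, a formatdate along these lines would do – this assumes the 12-digit serial is of the form YYYYMMDDHHMM, which may not match your inventory tool:

from datetime import datetime

def formatdate(serial):
    # assumption: serial looks like 201612250930 (YYYYMMDDHHMM)
    return datetime.strptime(serial, '%Y%m%d%H%M').strftime('%d/%m/%Y %H:%M')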

import os, re, csv
prog = re.compile(r"([a-zA-Z0-9_\-]+)\.example\.com_([0-9]{12})")

def getinfo(instr):
    res = prog.match(instr)
    if res:
        return (res.group(1), formatdate(res.group(2)))

with open('names.csv','w', newline='') as csvfile:
    csvwriter = csv.writer(csvfile)
    csvwriter.writerow(['Computer','Date']) # header
    for file in os.listdir(r'\\server\path\to\inventories'):
        fileinfo = getinfo(file)
        if fileinfo:
            csvwriter.writerow(fileinfo)

The natural progression would be to use a list comprehension to create a list and then write this all out in one go using writerows instead. However, getting this to handle cases where the match failed resulted in ugly code. If I work out a way around this I’ll include it.
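
For what it is worth, one way to do it is to filter out the failed matches with a second generator – whether this is any less ugly is a matter of taste:

# generate (computer, date) tuples, skipping files where the match failed
allinfo = (getinfo(file) for file in os.listdir(r'\\server\path\to\inventories'))
csvwriter.writerows(info for info in allinfo if info)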

If you needed to include sub-directories as well then you could use os.walk instead of os.listdir and then loop over the files list returned.