Files

Log Parser GUI

Log Parser is an old but still incredibly useful utility which I covered way back in this blog post. If you are fighting log files then I still recommend you give the post a read.

Since that post, v2 of a GUI for Log Parser has been released. For those who are more accustomed to using SSMS or similar to write queries, this may be more to your taste. It can be downloaded from here. See this Microsoft blog post for a summary of what has been added in v2.

There is already a decent tutorial from Lizard Labs on using the GUI, but it is not very clear about where the options are, so refer to the image below if you struggle to get started.

[Image: LogParserGUI]

A little aside for Windows 8 / Server 2012 and above when accessing the event log files: don't try to open the event logs directory (%SystemRoot%\system32\winevt\logs by default) directly. You will probably be unable to open it because the folder does not have All Application Packages in its security permissions.

There is no need to do this anyway. Log Parser already knows how to access the event logs; just use the event log name (Application, Security or System) as shown in the tutorial and the example above.
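
For example, a query to pull the ten most recent System log entries (usable from the GUI or the command-line tool) might look like this; the field names come from Log Parser's EVT input format:

LogParser -i:EVT "SELECT TOP 10 TimeGenerated, SourceName, EventID FROM System ORDER BY TimeGenerated DESC"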


AsciiDoc and DocBook

I’ve covered Markdown (.md) in other posts, but another text format gaining popularity is AsciiDoc, which is a plain text representation of DocBook XML. These files generally use the verbose .asciidoc file extension, but you will sometimes see them using the text file (.txt) extension.
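
To give a flavour of the syntax, a minimal AsciiDoc document looks something like this:

    = Document Title

    == First Section

    Some *bold* and _italic_ text, plus a list:

    * first item
    * second item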

The main AsciiDoc program is written in Python, but there is no pip install method; instead you need to get it from GitHub directly. It was also written for Python v2, although a fork exists for Python v3. For Python v2, clone the repository using a Git client from GitHub (Python v2), or alternatively use the Download ZIP option from the Clone or download button and unpack the zip file. Once downloaded you can build the documentation for the AsciiDoc program with the following command:

python asciidoc.py doc\asciidoc.txt

For Python v3, either clone or download and unzip from the Python v3 GitHub site. The programs have all gained a 3 suffix to their names, so the equivalent build command for the AsciiDoc documentation is:

python asciidoc3.py doc\asciidoc.txt

This also acts as a way to test the basic setup. If all goes well you should see no error messages and it should create a doc\asciidoc.html file which you can open with any web browser.

To get from AsciiDoc to most other formats, the program converts the text file to the DocBook XML format and then acts as a wrapper around DocBook to create the necessary file. DocBook is not aimed at Windows users, so getting it installed is not straightforward. Thankfully, combining this blog post and this SO post gives us the installation steps below.

First go to DocBook's SourceForge site and download the zip file. Unpack it to the C: drive (or wherever you want it) and optionally rename the directory to docbook-xsl, that is, remove the version number from the folder name. Add this folder to your PATH environment variable.

Now you need libxml2, libxslt, libxmlsec, zlib and iconv. Windows builds of all of these can be obtained from ftp://ftp.zlatkovic.com/libxml/. Download the latest zip file for each library and extract the contents of each zip's bin directory into the docbook-xsl directory created above.

To avoid calling several programs to create the other formats, a2x.py is also provided. This is a wrapper around the various programs that need to be called. To create an epub ebook of the documentation above, the command becomes:

python a2x.py -L -f epub -v doc\asciidoc.txt

or for the v3 fork:

python a2x3.py -L -f epub -v doc\asciidoc.txt

Markdown

I’ve successfully completed the 12 blogs of Christmas challenge I set myself (with just hours to go), even if 4 of those posts were editorial rather than code. It has reminded me how difficult coming up with regular content is.

Another problem related to content creation is how to format it. The most universal format is of course the plain text file. It has great portability, and diff tools work well for seeing changes between versions, but it is hardly pleasing on the eye. A universal format which allows formatting would naturally be HTML. HTML files retain the portability of text files but are far more difficult to write by hand and don’t work well with diff tools.

There are of course plenty of platforms (wikis and blogging, for example) that give you a near WYSIWYG (or visual) editor to hide the HTML code and cleverly store the pages so as to be able to generate a diff of versions. However, this can lock you in to the platform, losing the portability.

One solution that tries to offer portability and basic formatting while working with diff tools is Markdown. Those of you who, like me, grew up with email in the 90s will recognise Markdown as the way text-only email was formatted. Its main advantage is that it doesn’t require any markup as such (formatting is mostly contextual) but has a direct relationship to basic HTML.

With these advantages it is no wonder Markdown has found a home in version control systems, which are used to dealing with text files and displaying changes between versions. If you create a readme file, sites like Bitbucket and GitHub will automatically display this file along with the directory contents. Make it a readme.md (md is the common extension for a Markdown document) and it will be formatted correctly.

There is a Python Markdown module which takes Markdown text and converts it to HTML. It also supports extensions to add extra functionality. With this module installed (pip install markdown), conversion is just a method call away.

import markdown
md = markdown.Markdown()
print(md.convert("""# Hello
Welcome to a Markdown world!"""))

As a more useful example, I have used this with the bottle web micro framework to create a program that allows you to view all the Markdown documents in a folder through a web browser. If you go to the root it will list all the Markdown documents and you can view one by clicking on it. Simply run the Python program from the directory you wish to view.
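
The real program is in the repository, but a minimal sketch of the idea looks something like this (the port and the assumption that the files use the .md extension are mine):

import os
import markdown
from bottle import route, run

@route('/')
def index():
    # list all the Markdown documents in the current directory as links
    files = [f for f in os.listdir('.') if f.endswith('.md')]
    return '<br>'.join('<a href="/%s">%s</a>' % (f, f) for f in files)

@route('/<name>')
def show(name):
    # convert the requested document to HTML on the fly
    with open(name) as mdfile:
        return markdown.markdown(mdfile.read())

run(host='localhost', port=8080)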

Testing coverage

Python already has a unittest module based on the Java JUnit library. You can create a series of tests by creating a class that inherits from unittest.TestCase – each method in this class then becomes a test. You can check a condition is true using the standard assert command. To run the tests just call the unittest.main method.
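
As a minimal illustration (the class and test here are made up):

import unittest

class TestArithmetic(unittest.TestCase):
    def test_addition(self):
        # each method on the TestCase subclass becomes a test
        assert 1 + 1 == 2

if __name__ == '__main__':
    unittest.main()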

Straightforward, but this gets difficult to manage when the tests are split over multiple files. To help with this there is the nose module, which can be installed with pip in the usual way. This removes the need for the boilerplate code: it will search through all the Python code looking not only for classes derived from unittest.TestCase but also for any method or class that matches its regex – basically anything containing test at a word boundary in a module called test.

As a simple demonstration I have created this repository. It contains stack.py – the most primitive stack implementation I could come up with. I want to ensure this works as expected, so let's come up with some tests. I don't want test code littering the main code, so I've created a directory called test for all my testing. In there I've created a file called test-stack.py which contains two methods: one to test the stack when empty, and one to test that when I push to the stack, I get the correct values back in the correct order.
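
The actual tests are in the repository, but they are along these lines (the Stack class and its methods are assumed here for illustration):

from stack import Stack

def test_empty_stack():
    # popping an empty stack should not return a value
    assert Stack().pop() is None

def test_push_pop_order():
    # values should come back in last-in, first-out order
    s = Stack()
    s.push(1)
    s.push(2)
    assert s.pop() == 2
    assert s.pop() == 1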

Even though I’ve not created any boilerplate code to run these tests, if I enter the following command from the main directory it will indeed find and run these two tests. Hopefully both should pass.

nosetests

Obviously this is a trivial example, but hopefully it shows how quickly unit tests can be set up. There is a lot more to be said on testing, which will have to wait for another blog post.

So you have written some tests for your code, but how do you know they are testing all of the code? This is known as code coverage, and there is another good module for this called coverage.py, which can be installed with pip as usual. The reason for choosing these two is that they work together. Once installed, I can include code coverage just by adding the following parameters to nosetests:

nosetests --with-coverage --cover-erase

Now as well as running the tests it will show me how much of the code the tests have executed. This is fine as a metric, but if the code coverage is not 100%, how do you know what the tests are missing? Add another parameter to the command, --cover-html, and coverage will create an HTML report (inside the cover sub-directory).

Load index.html into a browser to see a summary similar to what is displayed on the screen at the end of the tests. Click on the module name and it shows you the module code, indicating which lines were executed by the tests and which were not. A thin green bar at the start of a line of code indicates the line was executed; a red bar indicates it was not.

Typing all these parameters in each time will get a little tedious. Thankfully nose supports an ini file for configuration. For some reason I could not get this to be automatically detected, so I had to specify it at the command line with

nosetests -c nose.cfg
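
The configuration file just mirrors the long command line options, so a nose.cfg covering the flags used above would look something like this (option names as per the nose documentation):

[nosetests]
with-coverage = 1
cover-erase = 1
cover-html = 1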

As a final note, coverage is just a metric of how much of your code is being tested. It does not imply anything about the quality of the tests. You can have 100% coverage with worthless tests just the same as you can have really thorough tests that only cover a small section of the code. At least if you follow the above you will know what your tests are missing.

Virtual environments in Visual Studio

A virtual environment in Python is a folder with everything needed to set up a local configuration isolated from the rest of the system. This allows you to have modules installed locally which differ from, or do not exist in, the global Python configuration. If you have used Node.js then you can think of virtual environments as npm's default way of working – creating a local install of a package rather than a global one (pip's default).

If you have multiple versions of Python installed on your machine then you can also specify which version of Python the virtual environment should use. This gives you the ability to test your code against multiple versions of Python just by creating multiple virtual environments.

There are already plenty of good posts out there on virtual environments, so the aim of this blog post is not to rehash why you should use them (see here for a good introductory post) or to be a quick setup guide (see the Hitchhiker's Guide to Python post). It is a quick guide to using virtual environments within Visual Studio. If you have not used virtual environments before, it is worth giving those posts a quick read before continuing.

As an aside, Python 3.3 introduced the venv module as an alternative for creating lightweight virtual environments (although the original pyvenv wrapper has already been deprecated in Python 3.6). While this is the correct way going forward, Visual Studio uses the older virtualenv method, which is what I am concentrating on here.
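
For reference, creating a virtual environment with the newer built-in module is just:

python -m venv env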

Once you have created your Python solution, expand it until you get to Python Environments. Right-click on this and choose Add Virtual Environment… from the menu, as shown below.

[Image: vsve1]

You can change the name of the folder (it defaults to env), which is also used as the name of the virtual environment, and the version of Python to use. Click Create to finish and you are ready to go (easy, wasn't it?). If you expand the Python Environments node you should see the virtual environment appear.

In the background this has created a folder (the virtual environment) in your working directory with the name given. In case you are unsure, your working directory is the location of the solution, which defaults to X:\Users\me\Documents\VS20xx\Projects\Project Name\Solution Name\ (tip: change the default location). This could have been done manually by changing into the working directory and entering the following command, where X:\Python_xx is the installation directory for the version of Python you want to use and env is the name of the folder / virtual environment (if you just want your default version of Python then just pass the name of the folder):

virtualenv -p X:\Python_xx\python.exe env

To install a module into the virtual environment from Visual Studio, just right-click on the virtual environment and select Install Python Package… from the menu, or if you have a requirements.txt file you can select Install from requirements.txt. If you expand the virtual environment node you will see the modules installed. Once you have all the modules installed you can generate the requirements.txt file from the same menu; it will add requirements.txt to your project for portability.

What if you want to use this virtual environment from the command line? Inside the virtual environment is a Scripts directory with a script to make the necessary changes; the trick is to run the correct script from the working directory. The script to run depends on whether you are inside a PowerShell console (my recommendation) or a command prompt. Change into the working directory and type the following command (where env is the virtual environment folder):

PowerShell: .\env\Scripts\activate.ps1
Command prompt: env\Scripts\activate.bat

The prompt will change to the name of the virtual environment to show activation has succeeded. You can do everything you would normally do from the command line, but now you are running against the virtual environment. To confirm that only the modules you specified are installed, type ‘pip list’; to confirm the version of Python is the one you chose, use ‘python -V’.

Update: It appears I’m not the only one to be looking at virtual environments today, see this article if you want a similar introduction but from the command prompt only.

Storing data between runs

Sometimes there will be a need for your Python scripts to store information between runs. While configuration information is best kept in human readable files (like ini or yaml files), if the data is used internally then a better approach is needed. Unless the data to be stored is just plain text, it is usually better in the long term to use a database. On first read that might seem overkill, but a few years down the line, when the script has stopped working or you need to extract the data stored within, you will curse the decision to go with your own home-rolled file format.

We are only looking for ease of use here, not concurrent access or performance. With this in mind, SQLite is a very mature product and the DLL has been included in the Python standard library since v2.5, so there is nothing to install. It is powerful enough to be used as the data store for the Skype client. A good introduction to the SQLite module in Python (assuming you know a little SQL) can be found on Python Central.

Putting the snippets into a slightly more complete example, we can create the clichéd task list example with surprisingly little code. I've tried to choose the SQL statements to show off a few features you would probably want with your CRUD operations, but this is not meant as a SQL tutorial.

import sqlite3

sqlcmds = ["CREATE TABLE IF NOT EXISTS persons (pid INTEGER PRIMARY KEY, givennames VARCHAR, surname VARCHAR);",
    "CREATE TABLE IF NOT EXISTS tasks (tid INTEGER PRIMARY KEY, pid INTEGER, name VARCHAR, percentcomplete INTEGER DEFAULT 0);",
    "INSERT INTO persons (pid,givennames,surname) VALUES (1,'Tom','Smith'),(2,'Dick','Jones'),(3,'Harry','Hat');",
    "INSERT INTO tasks (pid,name) VALUES (1,'Checktest'),(1,'Report Checktest'),(3,'Submit Report');",
    "UPDATE tasks SET percentcomplete = 25 WHERE name = 'Checktest';"]
sqlselect = """SELECT tasks.name, persons.givennames, persons.surname, tasks.percentcomplete
    FROM tasks inner join persons on tasks.pid = persons.pid"""

with sqlite3.connect(r"c:\path\to\sqlite.db") as dbconnection:
    cursor = dbconnection.cursor()
    for statement in sqlcmds:
        cursor.execute(statement)
    dbconnection.commit()  # commit once after all the statements have run
    cursor.execute(sqlselect)
    tasklist = ['"%s" with %s %s (%d%% complete)' % row for row in cursor]

print('\n'.join(tasklist))

Why is using SQLite an advantage? For starters, you can easily view the data and execute arbitrary SQL with the DB Browser for SQLite utility. When unexpected characters or a dodgy record breaks the code, you now have a fighting chance of fixing it. Also, SQLite doesn't create large files; the example above is a mere 3 KB file.

An interesting take on this is y_serial, which basically pickles your Python object and stores it with a label in the SQLite table. I’ve not tested this, so make no recommendation for using it, but it is a novel way to store objects between runs without any SQL knowledge or installing any additional modules.

When I get a chance to finish testing and uploading, there will be a bigger example of this. I want to monitor the scheduled tasks for failures. There is a command line interface for this called schtasks which I can run using the code for running shell commands. As I am only interested in cases where a task has just failed, I need to store the state of the last run, and this is where SQLite will come in.
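
As a taster, grabbing the raw data should be something like this (flags per the schtasks documentation; parsing the CSV is left for the full example):

import subprocess

# /query lists the tasks, /v adds the verbose columns (including the last
# result code) and /fo CSV makes the output easy to parse with the csv module
output = subprocess.check_output(['schtasks', '/query', '/v', '/fo', 'CSV'])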

TBC


PDF extraction

PDF files have been a standard for formatted documents for a long time. Back in the early days of the Internet they were about the only way to send a file that could be both viewed on screen and printed out the same. As a file to work with they are a bit of a nightmare – the PDF format is basically the Postscript printer language and associated resources packaged into a single file.

This makes extracting content from a PDF file tricky. What you get will depend a lot on the process used to create the PDF. If it has come from a spreadsheet and was created by a virtual printer driver, then it is likely each cell is a separate item, and trying to recreate the text (let alone work with it) is going to be tricky. Where possible you want to use the data from before it was turned into a PDF.

When that is not possible, there are PDF parsers written in Python. A popular one for data extraction is PDFMiner. While the original library only supports Python 2.6 and 2.7, there is a fork that uses the six module to support both Python v2 and v3. Install it using the usual pip command:

pip install pdfminer.six

The official documentation is not great. It doesn't even work as a quick start guide – see this blog post for a better review and quick start guide. Instead, creating a basic text extractor is more enlightening.

The first thing to create is a PDFResourceManager instance. This is basically just a dictionary to cache fonts and similar shared resources in. We don't use this directly, just pass a reference to it in a few method calls. Next we want the layout settings, which come from the slightly confusingly named LAParams class. You can just use the default settings from the constructor. Although we don't change any of the settings, you cannot just leave this out of the method call, as the parameter defaults to no settings at all, which produces some strange results.

With these couple of objects we are ready to load the PDF file. The file is opened as normal and read in with the PDFPage.get_pages method. Once loaded, we turn the data into our required output using an interpreter. Calling it an interpreter is again a confusing name, as it relies on a separate converter to produce the output in the desired format; the interpreter just handles the procedural side, which would be the same regardless of the output being produced.

PDFMiner comes with three converters: for text only, HTML and XML output. For this example we just want the text converter, which ignores any non-text items. All three converters take an output stream as the second parameter (after the resource manager). This could be a file object, but for this example I want the text in memory, so I am going to use the io module. With all the text now in this variable, I can print it after suitably encoding any unicode characters.
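
Putting those steps together gives something like the following sketch (based on the pdfminer.six layout at the time of writing; the filename is a placeholder):

from io import StringIO
from pdfminer.pdfinterp import PDFResourceManager, PDFPageInterpreter
from pdfminer.converter import TextConverter
from pdfminer.layout import LAParams
from pdfminer.pdfpage import PDFPage

rsrcmgr = PDFResourceManager()     # shared font/resource cache
output = StringIO()                # collect the extracted text in memory
device = TextConverter(rsrcmgr, output, laparams=LAParams())
interpreter = PDFPageInterpreter(rsrcmgr, device)

with open('example.pdf', 'rb') as pdffile:
    for page in PDFPage.get_pages(pdffile):
        interpreter.process_page(page)

device.close()
print(output.getvalue())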

I’ve put the example code on Bitbucket.

Yielding files

Time for a challenge, so I'm going to try 12 blogs of Christmas. The aim is to write 12 blog entries in December (or at least by 5th January, which is the 12th day of Christmas). That is one blog entry every 3 days. It's a catchy title (always helps, think Movember) which could be used for any challenge, and as I've twisted my ankle I doubt I'll be running; the 12 runs of Christmas does sound nice though. Yes, it is the 4th already, so not a good start.

After the last post I've been thinking of other examples where a generator would be useful that were more in keeping with the theme of this blog (sys administration with Python, in case you've forgotten). Iterating through system calls or an API would be a good candidate, but I've not been using anything recently that fitted the bill. Another case that sprang to mind was file searching.

A reasonable way to do this would be to build and return a list, but why use the memory to create the list if the caller is unlikely to need one, and they can use a list comprehension to create a list anyway? So this should make a good generator example.

Some of the work is done already by os.walk; this will iterate through each directory giving you a list of files and folders. Normally when you are looking for files you would specify a wildcard pattern, so I'm going to use regular expressions and return any file that matches using yield. I've covered regular expressions a few times before, so I'll skip any explanation and just present the code, which takes a directory and a file pattern and yields all the matching files.

import os, re

def filesearch(root, pattern, exact=True):
    searchre = re.compile(pattern)
    for parent, dirs, files in os.walk(root):
        for filename in files:
            if exact:
                res = searchre.match(filename)
            else:
                res = searchre.search(filename)
            if res:
                yield os.path.join(parent, filename)

for filename in filesearch(r"C:\Temp", r".*\.exe"):
    print("%s has size %d" % (filename, os.path.getsize(filename)))

The only thing to note is that I added a third parameter so you can do a match (the regular expression must match from the start of the filename) or a search (the regular expression only needs to match part of the filename). This defaults to True, which gives an exact match.

The example should find any executables in the C:\Temp folder. Regular expressions are very powerful but not quite as simple as using *.exe: the asterisk becomes .* (match any character zero or more times) and the dot has to be escaped, as it is a special character. I've just printed the filename and size out, but you could equally delete the file if it was bigger than a certain size, etc.

And that’s my first post of the 12 blogs of Christmas. Let's see if I can get all 12 done in time.



YAML

I have been looking into Ansible for automation, which has meant looking at YAML files again. I've looked at YAML before; in theory it offers benefits over ini, JSON and XML based files, but the lack of a built-in module has resulted in me choosing ini or JSON formats. I have recently been favouring JSON in projects, and as this is a subset of YAML it is time to take another look at the format.

For a long time the de facto standard library has been PyYAML. There are a bunch of Windows installers on the page which can be used to get up and running. However, this module has received few updates in recent years and doesn't appear to support the later YAML 1.2 standard. Up has stepped Anthon van der Neut, who has used it as the basis for his own ruamel.yaml module. This can be installed using pip in the usual way.

Whichever one you choose, you can use safe_load to load a YAML file (or convert a string) and dump to create a YAML string:

try:
    import ruamel.yaml as yaml
except ImportError:
    import yaml

pydic = yaml.safe_load("""
# example yaml (this is a comment)
name: test
version: 1.0
inlinelist: [ "tinker", "tailor", "soldier", "spy" ]
yamllist:
- first item
- second item
dictionary:
    name: subdictionary
    usage: anything
""")
print(pydic)
with open('test.yml','w') as ymlfile:
    ymlfile.write(yaml.dump(pydic))

There is a limit on how readable the output from the dump method is, as you will see from the above example. If you are using YAML files for configuration you can be much more verbose and use whitespace. There is a full reference card of the YAML 1.1 spec on a single page on the yaml.org website.
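
One small improvement: the dump method in both PyYAML and ruamel.yaml's legacy interface accepts a default_flow_style argument, which forces the more readable block style:

ymltext = yaml.dump(pydic, default_flow_style=False)  # block style output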

Finally, if you need to pass information to a JavaScript program (either in a browser or to node – which would usually force going with JSON) there is even a port of PyYAML to JavaScript.

Unzip a file in memory

The zipfile module is fairly flexible, but there are occasions when you cannot pass it a filename (as a string) or a file-like object; for example, the open method on AWS S3 buckets does not return a suitable object. What to do if you can read the zip file into memory? Writing it to disk just to read it back in again seems a waste.

Python, as is often the case, already has a module to solve this problem, in this case StringIO. This allows you to treat a string (in this case the entire file in memory) as if it were a file.

This allows us to write our unzip procedure compactly as

# module imports and S3 connection omitted for brevity (and beyond scope)
s3file = s3connection.get_bucket(bucketname).get_key(filename)
if s3file:
    s3file.open()
    zipdata = s3file.read()   # read the whole zip file into memory
    s3file.close()
    zipobj = zipfile.ZipFile(StringIO.StringIO(zipdata))
    zipobj.extractall()
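
As an aside, under Python 3 the equivalent in-memory file is io.BytesIO, since zip data is binary rather than text:

import io, zipfile
zipobj = zipfile.ZipFile(io.BytesIO(zipdata))
zipobj.extractall()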