# Raspberry Pi projects – Raspi Boy

Python and the Raspberry Pi go hand-in-handand if you want to learn Python from something other than sys admin and server jobs then a Raspberry Pi is good choice. Being a board only, others have also used it as the brains behind their projects.

One of these Kickstarter projects that has caught my eye is the Raspi Boy – basically a screen, battery and case that looks like a GameBoy. As well as a great way to play all those games from my childhood it will also be a fully portable Python system. I have made my pledge on Kickstarter and I’ll give you a review of it when it turns up.

# Markdown

I’ve successfully completed the 12 blog challenge I set myself (with just hours to go) even if 4 of those blogs were editorial rather than code. It has reminded me how difficult coming up with regular content is.

Another problem related to content creation is how to format it. The most universal format is of course the plain text file. This has great portability and diff tools work well to see version changes but they are hardly pleasing on the eye. A universal format which allows formatting would naturally be HTML. These retain the portable advantage of text files but are far more difficult to write by hand and don’t work well with diff tools.

There are of course plenty of platforms (wikis and blogging for example) that give you a near WYSIWYG (or visual) editor to hide the HTML code and cleverly store the pages so as to be able to generate a diff of versions. However this can lock you in to the platform losing the portability.

One solution that tries to give portability, basic formatting and works with diff tools is Markdown. Those of you like me who grew up with email in the 90s will recognise Markdown as the way text only email was formatted. It’s main advantage is it doesn’t require any markup as such (formatting is mostly contextual) but has a direct relationship to basic HTML.

With these advantages it is no wonder Markdown has found a home in version control systems which are used to dealing with text files and displaying changes between versions. If you create a readme file, sites like BitBucket and GitHub will automatically display this file along with the directory contents. Make this a readme.md (md is the common extension for a Markdown document) and this will be formatted correctly.

There is a Python Markdown module with takes Markdown text and converts it to HTML. It also supports extensions to add extra functionality. With this module installed (pip install markdown) conversion is just a method call away.

import markdown
md = markdown.Markdown()
print(md.convert("""# Hello
Welcome to a Markdown world!""")


As a more useful example I have used this with the bottle web micro framework to create a program that allow you to view all the markdown documents in a folder through a web browser. If you go to the root it will list all the Markdown documents and you view one by clicking on it. Simply run the Python program from the directory you wish to view.

# Testing coverage

Python already has a unittest module based on the Java JUnit library. You can create a series of tests by creating a class that inherits from unittest.TestCase – each method in this class then becomes a test. You can check a condition is true using the standard assert command. To run the tests just call the unittest.main method.

Straight forward but this gets difficult to manage when the tests are split over multiple files. To help with this there is the nose module which can be installed with pip in the usual way. This removes the need for the boilerplate code. It will search through all the python code looking for not only classes derived from unittest.TestCase but also for any method or class that matches the regex – basically contains test at a word boundary and is in a module called test.

As a simple demonstration I have created this repository. It contains stack.py – the most primitive stack implementation I could come up with. I want to ensure this works as expected so lets come up with some tests. I don’t want test code littering the main code so I’ve created a directory called test for all my testing. In here I’ve created a file called test-stack.py which contains two methods, one for testing the stack when empty and one to test that I push to the stack, I get the correct value back in the correct order.

Even though I’ve not created any boilerplate code to run these tests, if I enter the following command from the main directory it will indeed find and run these two tests. Hopefully both should pass.

nosetests

Obviously this is a trivial example but hopefully it shows how quickly unit tests can be set up. There is a lot more to be said on testings which will have to wait for another blog post.

So you have written some tests for your code. How do you know your tests are testing all of the code? This is known as code coverage and there is another good module for this called coverage.py which can be installed with pip as usual. The reason for choosing these two is they work together. Once installed I can include code coverage just by adding the following parameters to nosetests

nosetests --with-coverage --cover-erase

Now as well as running the tests it will show me the amount of code the tests have executed. This is fine for a metric but if the code coverage is not 100% how do you know what the tests are missing? Add another parameter to the command, –cover-html, and coverage will create an HTML report (inside the cover sub-directory).

Load index.html into a browser to see the summary similar to what is displayed on the screen at the end of the tests. Click on the module name and this shows you the module code but which lines were executed by the tests and which lines was not. A thin green bar at the start of the line of code indicates the line was executed; a red bar indicates no code execution.

Typing all these parameters in each time will get a little tedious. Thankfully nose supports an ini file for configuration. For some reason I could not get this to be automatically detected so I had to specify it at the command line with

nosetests -c nose.cfg

As a final note, coverage is just a metric on how much of you code is being tested. It does not imply anything about the quality of the tests. You can have 100% coverage with worthless tests just the same as you can have really thorough tests that only test a small section of the code. At least if you follow the above you will know what your tests are missing.

# Virtual environments in Visual Studio

A virtual environment in Python is a folder with everything needed to set up local configuration isolated from the rest of the system. This allows you can have modules installed locally which are different or do not exist in the global Python configuration. If you have used Node.js then you can think of virtual environments as npm default way of working – creating a local install of a package rather than a global one (pip’s default).

If you have multiple versions of Python installed on your machine then you can also specify which version of Python the virtual environment should use. This gives you ability to test your code against multiple versions of Python just by creating multiple virtual environments.

There are already plenty of good posts out there on virtual environments so the aim of this blog post is not to rehash why you should use virtual environments (see here for a good introductory blog post here) or as a quick setup guide (see the Hitchhikers Guide to Python post). It is a quick guide to using virtual environments within Visual Studio. If you have not used virtual environments before it is worth giving these posts a quick read before continuing.

As an aside, Python 3.3 introduced the venv module as an alternative for creating lightweight virtual environments (although the original wrapper pyvenv has already be depreciated in Python 3.6). While this is the correct way going forward, Visual Studio uses the older virtualenv method which is what I am concentrating on here.

Once you have created your Python solution expand it until you get to Python Environments. Right-click on this and choose Add Virtual Environment… from the menu list as shown below

You can change the name of the folder (defaults to env) which is also used as the name of the virtual environment and the version of Python to use. Click Create to finish and you are ready to go (easy wasn’t it). If you expand the Python Environments node you should see the virtual environment appear.

In the background this has created a folder (the virtual environment) in your working directory with the name given. In case you are unsure, your working directory is the location is the location of the solution which defaults to X:\Users\me\Documents\VS20xx\Projects\Project Name\Solution Name\ – tip, change the default location). This could have been done manually by changing into the working directory and entering the following command (where X:\Python_xx is the installation directory for the version of Python you want to use and env is the name of the folder / virtual environment – if you just want your default version of Python then just pass the name of the folder).

virtualenv -p X:\Python_xx\python.exe env

To install a module into the virtual environment from Visual Studio just right-click on the virtual environment and select Install Python Package… from the menu or if you have a requirements.txt file you can select Install from requirements.txt. If you expand the virtual environment node you will see the modules installed. Once you have all the modules installed you can generate the requirements.txt file from the same menu and it will add the requirements.txt to your project for portability.

What if you want to use this virtual environment from the command line? Inside of the virtual environment is a Scripts directory with a script to make the necessary changes; the trick is to run the correct script from the working directory. The script to run depends upon whether you are running inside a PowerShell console (my recommendation) or from a command prompt. Change into the working directory and type in the following command (where env is the virtual environment folder)

PowerShell: .\env\Scripts\activate.ps1
Command prompt: env\Scripts\activate.bat

The prompt will change to the name of the virtual environment to show activation has succeeded. You can do everything you would normally do from the command line but now you are running against the virtual environment. To confirm the modules installed are only those you have specified type in ‘pip list’ and the version of Python is the one you specified with ‘python -v’.

Update: It appears I’m not the only one to be looking at virtual environments today, see this article if you want a similar introduction but from the command prompt only.

# Visual Studio 2017

Visual Studio 2017 has reached the release candidate 2 stage and it is good to see the community version has been retained. I’ve covered the VS 2015 community version in this previous post. Everything from that post remains true in the later version expect who qualifies – open source projects, academic research, training and education remain but only a single professional developer now qualifies, a small team would now have to use the more costly Professional version.

Edit: After I originally wrote this blog post at the end of December MS remove Python from the release candidate without warning. Finally in May it is back and Python appears as a development option next to Node.

The installer for 2017 is different but once installed, the main product retains a similar look and feel. The basic installation does not include the Python tools, not unsurprising given the effort to make VS more modular. Additional support can be added if you go back into the installer; see my comment above and the blog post link for details about what has happened to the Python tools – no longer will a separate download required.

Against upgrading once Microsoft has reinstated Python and it is not longer a release candidate; Python 3.6 is built with VS2015 (as is 3.5) so those compiling modules or embedding Python should stay on the previous version for now. Also if you are already using VS2015 then I’ve not seen anything so would make me want to switch.

It is not an either-or choice as the different versions of Visual Studio will install side-by-side so there is nothing stopping you checking out the latest incarnation like I’ve done.

# Storing data between runs

Sometimes there will be a need for your Python scripts to store information between runs. While configuration information is best kept in human readable files (like ini of yaml files) if it is data used internally then a better approach is needed. Unless the data to be stored is just plain text it is usually better long term to use a database for this. On first read, that last statement might seem overkill but a few years down the line when the script has stopped working or you need to extract the data stored within you will curse the decision to go with your own home-rolled file format.

We are only looking for ease of use here, not concurrent access or performance. With this in mind, SQLite is a very mature product and the DLL has been included in the Python standard library since v2.5 so there is nothing to install. It is powerful enough to be used as the data store for the Skype client. A good introduction to the SQLite module in Python (assuming you know a little SQL) can be found on Python Central.

Putting the snippets into a slight more complete example, we can create the cliched task list example with surprisingly little code. I’ve tried to choose the SQL statements to show off a few features you would probably want with your CRUD operations but this is not meant as a SQL tutorial.

import sqlite3

sqlcmds = ["CREATE TABLE IF NOT EXISTS persons (pid INTEGER PRIMARY KEY, givennames VARCHAR, surname VARCHAR);",
"CREATE TABLE IF NOT EXISTS tasks (tid INTEGER PRIMARY KEY, pid INTEGER, name VARCHAR, percentcomplete INTEGER DEFAULT 0);",
"INSERT INTO persons (pid,givennames,surname) VALUES (1,'Tom','Smith'),(2,'Dick','Jones'),(3,'Harry','Hat');",
"INSERT INTO tasks (pid,name) VALUES (1,'Checktest'),(1,'Report Checktest'),(3,'Submit Report');",
"UPDATE tasks SET percentcomplete = 25 WHERE name = 'Checktest';"]

with sqlite3.connect(r"c:\path\to\sqlite.db") as dbconnection:
cursor = dbconnection.cursor()
for statement in sqlcmds:
cursor.execute(statement)
db.commit()
cursor.execute(sqlselect)
tasklist = ['"%s" with %s %s (%d%% complete)' % row for row in cursor]



Why is using SQLite an advantage? For starters you can easily view and execute arbitary SQL with the DB Browser for SQLite utility. When unexpected characters or a dodgy record breaks the code now you have a fighting chance at fixing it. Also SQLite doesn’t create large files, the example above is a mere 3Kb file.

An interesting take on this is y_serial, which basically pickles your Python object and stores it with a label in the SQLite table. I’ve not tested this, so make no recommendation for using it, but it is a novel way to store objects between runs without any SQL knowledge or installing any additional modules.

When I get chance to finish testing and uploading there will be a bigger example of this. I want to monitor the scheduled tasks for failures. There is a command line interface for this called schtasks which I can run using the code for running shell commands. As I am only interested in cases where the task has just failed I need to store the state on the last run and this is where SQLite will come in.

TBC

# PDF extraction

PDF files have been a standard for formatted documents for a long time. Back in the early days of the Internet they were about the only way to send a file that could be both viewed on screen and printed out the same. As a file to work with they are a bit of a nightmare – the PDF format is basically the Postscript printer language and associated resources packaged into a single file.

This makes extracting content from a PDF file tricky. What you get will depend a lot on the process used to create the PDF. If it has come from a spreadsheet and created by a virtual printer driver then it is likely each cell is a separate item and trying to recreate the text (let alone work with it) is going to be tricky. Where possible you want to use the data before it was

When that is not possible there are PDF parser’s written in Python. A popular one for data extraction is PDFMiner. While the original library only supports Python 2.6 & 2.7, there is a fork that uses the six module to support both Python v2 & v3. Install it using the usual pip command

pip install pdfminer.six

The official documentation is not great. It doesn’t even work as an quick start guide – see this blog post for a better review and quick start guide. Instead creating the basic text extractor is more enlightening.

First thing to create is a PDFResourceManager instance. This is basically just a dictionary to cache fonts and similar shared resources in. We don’t use this directly, just pass a reference to it in a few method calls. Next we want to layout settings which come from the slightly confusingly named LAParams class. You can just use the default settings from the constructor. Although we don’t change any of the settings you cannot just miss this out of the method call as it defaults to no settings which results in some strange results.

With these couple of objects we are ready to load the PDF file. The file is opened as normal and read in with the PDFPage.get_pages method. Once loaded we turn the data into our required output using an interpreter. Calling it an interpreter is again a confusing name as it relies on a separate converter to process the output in the desired format, it just handles the procedural side which would be the same regardless of the output being produced.

PDFMiner comes with three converters, for text only, html and xml output. For this example we just want the text converter which just ignores any non-text items. All three converters take an output stream as the second parameter (after the resource manager). This could be a file object but for this example I want the text in memory so I going to use the io module. With all the text now in this variable I can print it after suitably encoding any unicode characters.

I’ve put the example code on Bitbucket.

# Python 3.6

Almost like a Christmas present, Python 3.6 has been released, just fifteen months after 3.5 was released (compared to an 18 month average for the 3.x branch). You can see the official what’s new page here, or if dry lists of features are not your thing, try this summary of the improvements. If nothing else the speed improvements might end one of the arguments for staying on v2.

What will be interesting to see is the take up of the asynchronous additions of Python 3.5 which have further been improved in 3.6. Node.js has shown just how efficient asynchronous programming can be and hopefully async / await can make this just as accessible in Python. If you’ve need seen these new keywords before, see this blog post for a decent introduction.

I’m hoping to investigate asynchronous programming in the New Year so there should be a post on here in the near future.

# Pip requirements

You should be used to installing new modules using pip. You have probably used a requirements.txt file to install multiple modules together with the command.

pip install -r requirements.txt

But what about if you need more flexibility. Why would you ever need more flexibility? If you look at my introduction to YAML post, the code supports either the yaml or ruamel.yaml module. There is no way to add conditional logic to a requirements.txt file so a different strategy is needed.

pip is just a module so it can be imported like any other module. This not only gives you access to the main method, which takes an argument list just as if you were calling pip from the command line, but also to its various methods and classes. One of these is the WorkingSet class which creates a collection of the installed modules (or active distributions as the documentation calls them). Using this we can create the conditional logic needed to ensure one of the yaml modules is installed as below.

import pip
package_names = [ ws.project_name for ws in pip._vendor.pkg_resources.WorkingSet() ]
if ('yaml' not in package_names) and ('ruamel.yaml' not in package_names):
pip.main(['install','ruamel.yaml'])


WorkingSet returns a few other useful properties and methods apart from the package_name. The location property returns the path to where the module is installed and the version property naturally returns the version installed. The requires method returns a list of dependencies.

As with most modules, if you’re interested in finding out more dig around in the source code.

With Linux instances, the public key of the key pair you specify when creating the instance is placed in the .ssh/authorized_keys directory. When you SSH in it encrypts the initial communication details with your public key so that only someone with the corresponding private key can decrypt the details and complete the connection.

Windows instances do not work in the same way. Instead when the instance is created a random password is created. This password is then encrypted with the public key.  You can request this encrypted password but you then need the private key to decrypt it. This can be done through the AWS console but if you are going to use boto to automate AWS then you really want a Python solution.

I have seen a couple of solutions using the PyCrypto module but I wanted a pure Python solution. Luckily there is an rsa module (pip install rsa) which is written in pure Python. With that and the boto module you can decrypt the password with the following code.

import rsa, boto, base64
instance_id = 'i-0123456789abcdef'
key_path = r'C:\path\to\private.pem'

ec2 = boto.connect_ec2() #access_key,secret_key