Visual Studio 2017

Visual Studio 2017 has reached the release candidate 2 stage and it is good to see the community version has been retained. I covered the VS 2015 community version in this previous post. Everything from that post remains true in the later version except who qualifies – open source projects, academic research, training and education remain, but only a single professional developer now qualifies; a small team would now have to use the more costly Professional version.

Edit: After I originally wrote this blog post at the end of December, MS removed Python from the release candidate without warning. Finally, in May, it is back and Python appears as a development option next to Node.

The installer for 2017 is different but, once installed, the main product retains a similar look and feel. The basic installation does not include the Python tools, which is not surprising given the effort to make VS more modular. Additional support can be added by going back into the installer; see my comment above and the blog post link for details about what has happened to the Python tools – a separate download is no longer required.

There are still reasons against upgrading, even now that Microsoft has reinstated Python and it is no longer a release candidate: Python 3.6 is built with VS2015 (as is 3.5), so those compiling modules or embedding Python should stay on the previous version for now. Also, if you are already using VS2015 then I've not seen anything that would make me want to switch.

It is not an either-or choice as the different versions of Visual Studio will install side-by-side so there is nothing stopping you checking out the latest incarnation like I’ve done.

Storing data between runs

Sometimes there will be a need for your Python scripts to store information between runs. While configuration information is best kept in human-readable files (like ini or yaml files), data used internally needs a better approach. Unless the data to be stored is just plain text, it is usually better in the long term to use a database for this. On first read, that last statement might seem overkill, but a few years down the line, when the script has stopped working or you need to extract the data stored within, you will curse the decision to go with your own home-rolled file format.

We are only looking for ease of use here, not concurrent access or performance. With this in mind, SQLite is a very mature product and the DLL has been included in the Python standard library since v2.5 so there is nothing to install. It is powerful enough to be used as the data store for the Skype client. A good introduction to the SQLite module in Python (assuming you know a little SQL) can be found on Python Central.

Putting the snippets into a slightly more complete example, we can create the clichéd task list example with surprisingly little code. I've tried to choose the SQL statements to show off a few features you would probably want with your CRUD operations, but this is not meant as a SQL tutorial.

import sqlite3

sqlcmds = ["CREATE TABLE IF NOT EXISTS persons (pid INTEGER PRIMARY KEY, givennames VARCHAR, surname VARCHAR);",
    "CREATE TABLE IF NOT EXISTS tasks (tid INTEGER PRIMARY KEY, pid INTEGER REFERENCES persons, name VARCHAR, percentcomplete INTEGER DEFAULT 0);",
    "INSERT INTO persons (pid,givennames,surname) VALUES (1,'Tom','Smith'),(2,'Dick','Jones'),(3,'Harry','Hat');",
    "INSERT INTO tasks (pid,name) VALUES (1,'Checktest'),(1,'Report Checktest'),(3,'Submit Report');",
    "UPDATE tasks SET percentcomplete = 25 WHERE name = 'Checktest';"]
sqlselect = """SELECT tasks.name, persons.givennames, persons.surname, tasks.percentcomplete
    FROM tasks INNER JOIN persons ON tasks.pid = persons.pid;"""

with sqlite3.connect(r"c:\path\to\sqlite.db") as dbconnection:
    cursor = dbconnection.cursor()
    for statement in sqlcmds:
        cursor.execute(statement)
    cursor.execute(sqlselect)
    tasklist = ['"%s" with %s %s (%d%% complete)' % row for row in cursor]
    print('\n'.join(tasklist))


Why is using SQLite an advantage? For starters, you can easily view the data and execute arbitrary SQL with the DB Browser for SQLite utility. When unexpected characters or a dodgy record breaks the code, you now have a fighting chance of fixing it. Also, SQLite doesn't create large files; the example above is a mere 3Kb file.

An interesting take on this is y_serial, which basically pickles your Python object and stores it with a label in the SQLite table. I’ve not tested this, so make no recommendation for using it, but it is a novel way to store objects between runs without any SQL knowledge or installing any additional modules.
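The label-plus-pickle idea can be sketched with nothing but the standard library. This is only an illustration of the concept, not y_serial's actual API (the table and function names here are my own):

```python
import pickle
import sqlite3

# an in-memory database keeps the sketch self-contained;
# use a file path to persist between runs
db = sqlite3.connect(':memory:')
db.execute("CREATE TABLE IF NOT EXISTS objects (label TEXT PRIMARY KEY, data BLOB)")

def store(label, obj):
    # REPLACE inserts the row, or overwrites an existing row with this label
    db.execute("REPLACE INTO objects (label, data) VALUES (?, ?)",
               (label, pickle.dumps(obj)))
    db.commit()

def fetch(label):
    row = db.execute("SELECT data FROM objects WHERE label = ?",
                     (label,)).fetchone()
    return pickle.loads(row[0]) if row else None

store('laststate', {'task': 'Checktest', 'failed': False})
print(fetch('laststate'))  # {'task': 'Checktest', 'failed': False}
```

Any picklable Python object can be stored this way, at the cost of the data only being readable from Python.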

When I get a chance to finish testing and uploading, there will be a bigger example of this. I want to monitor the scheduled tasks for failures. There is a command line interface for this called schtasks, which I can run using the code for running shell commands. As I am only interested in cases where a task has just failed, I need to store the state of the last run, and this is where SQLite will come in.



PDF extraction

PDF files have been a standard for formatted documents for a long time. Back in the early days of the Internet they were about the only way to send a file that could be both viewed on screen and printed out the same. As a file to work with they are a bit of a nightmare – the PDF format is basically the Postscript printer language and associated resources packaged into a single file.

This makes extracting content from a PDF file tricky. What you get will depend a lot on the process used to create the PDF. If it has come from a spreadsheet and was created by a virtual printer driver then it is likely each cell is a separate item, and trying to recreate the text (let alone work with it) is going to be tricky. Where possible you want to use the data before it was converted to PDF.

When that is not possible, there are PDF parsers written in Python. A popular one for data extraction is PDFMiner. While the original library only supports Python 2.6 & 2.7, there is a fork that uses the six module to support both Python v2 & v3. Install it using the usual pip command

pip install pdfminer.six

The official documentation is not great – it doesn't even work as a quick start guide; see this blog post for a better review and quick start guide. Instead, creating a basic text extractor yourself is more enlightening.

The first thing to create is a PDFResourceManager instance. This is basically just a dictionary used to cache fonts and similar shared resources. We don't use it directly, just pass a reference to it in a few method calls. Next we want the layout settings, which come from the slightly confusingly named LAParams class. You can just use the default settings from the constructor. Although we don't change any of the settings, you cannot simply leave this out of the method call, as that defaults to no layout analysis, which produces some strange results.

With these two objects we are ready to load the PDF file. The file is opened as normal and read in with the PDFPage.get_pages method. Once loaded, we turn the data into our required output using an interpreter. Interpreter is again a slightly confusing name, as it relies on a separate converter to produce the output in the desired format; the interpreter just handles the procedural side, which is the same regardless of the output being produced.

PDFMiner comes with three converters, for text-only, HTML and XML output. For this example we want the text converter, which simply ignores any non-text items. All three converters take an output stream as the second parameter (after the resource manager). This could be a file object, but for this example I want the text in memory, so I'm going to use the io module. With all the text now in this variable, I can print it after suitably encoding any unicode characters.

I’ve put the example code on Bitbucket.

Python 3.6

Almost like a Christmas present, Python 3.6 has been released, just fifteen months after 3.5 (compared to an 18-month average for the 3.x branch). You can see the official what's new page here, or if dry lists of features are not your thing, try this summary of the improvements. If nothing else, the speed improvements might end one of the arguments for staying on v2.

What will be interesting to see is the take-up of the asynchronous additions from Python 3.5, which have been further improved in 3.6. Node.js has shown just how efficient asynchronous programming can be, and hopefully async / await can make it just as accessible in Python. If you've not seen these new keywords before, see this blog post for a decent introduction.
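To give a minimal flavour of the syntax (the example is my own, not from the linked post):

```python
import asyncio

async def greet(name, delay):
    # await suspends this coroutine so the event loop can run others
    await asyncio.sleep(delay)
    return 'Hello ' + name

async def main():
    # both greetings run concurrently: total wait is ~0.2s, not 0.3s
    return await asyncio.gather(greet('Tom', 0.2), greet('Dick', 0.1))

loop = asyncio.new_event_loop()
results = loop.run_until_complete(main())
loop.close()
print(results)  # ['Hello Tom', 'Hello Dick']
```

gather returns the results in the order the coroutines were passed, regardless of which finished first.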

I’m hoping to investigate asynchronous programming in the New Year so there should be a post on here in the near future.

Pip requirements

You should be used to installing new modules using pip, and you have probably used a requirements.txt file to install multiple modules together with the command.

pip install -r requirements.txt

But what if you need more flexibility? Why would you ever need more flexibility? If you look at my introduction to YAML post, the code supports either the yaml or ruamel.yaml module. There is no way to add conditional logic to a requirements.txt file, so a different strategy is needed.

pip is just a module, so it can be imported like any other. This not only gives you access to the main method, which takes an argument list just as if you were calling pip from the command line, but also to its various other methods and classes. One of these is the WorkingSet class, which builds a collection of the installed modules (or active distributions, as the documentation calls them). Using this we can create the conditional logic needed to ensure one of the yaml modules is installed, as below.

import pip
package_names = [ws.project_name for ws in pip._vendor.pkg_resources.WorkingSet()]
if ('yaml' not in package_names) and ('ruamel.yaml' not in package_names):
    pip.main(['install', 'ruamel.yaml'])

WorkingSet provides a few other useful properties and methods apart from project_name. The location property returns the path where the module is installed, and the version property naturally returns the version installed. The requires method returns a list of dependencies.

As with most modules, if you’re interested in finding out more dig around in the source code.

Decrypting AWS Windows passwords

With Linux instances, the public key of the key pair you specify when creating the instance is added to the .ssh/authorized_keys file. When you SSH in, the initial communication details are encrypted with your public key so that only someone with the corresponding private key can decrypt them and complete the connection.

Windows instances do not work in the same way. Instead, when the instance is created, a random password is generated and then encrypted with the public key. You can request this encrypted password, but you then need the private key to decrypt it. This can be done through the AWS console, but if you are going to use boto to automate AWS then you really want a Python solution.

I have seen a couple of solutions using the PyCrypto module but I wanted a pure Python solution. Luckily there is an rsa module (pip install rsa) which is written in pure Python. With that and the boto module you can decrypt the password with the following code.

import rsa, boto, base64
instance_id = 'i-0123456789abcdef'
key_path = r'C:\path\to\private.pem'

ec2 = boto.connect_ec2() #access_key,secret_key
passwd = base64.b64decode(ec2.get_password_data(instance_id))
if passwd:
    with open(key_path, 'rb') as privkeyfile:
        priv = rsa.PrivateKey.load_pkcs1(privkeyfile.read())
    key = rsa.decrypt(passwd, priv).decode('utf-8')
else:
    key = 'Wait at least 4 minutes after creation before the admin password is available'
print(key)


You just need to specify the instance_id and the private key file location (2nd and 3rd lines). The connect_ec2 method will use the credentials in the .aws file in your home directory if it exists. Alternatively you can specify the access key and secret key as parameters to the method. Assuming you haven’t jumped the gun this should print out the admin password.

Email configuration

Sending an email when a system breaks, a warning level is reached or simply when a job completes is standard stuff. The smtplib module handles the sending, and the email module makes building even a multipart MIME email straightforward. If you have not used these libraries to send an email before, there are plenty of articles on the Internet with examples, such as Matt's blog post, for a quick overview of sending plain text and then text + html multipart emails.

What can be surprising is the slight variations in what different SMTP gateways will require in order to work. The following table gives you the different settings needed to get the most common gateways working.

Service      Server                             Port  SSL (TLS)  Username
Exchange     your dns entry                     25    no         n/a *
Exchange     your dns entry                     587   yes        Windows username w/ domain
Office 365   smtp.office365.com                 587   yes        email address
Gmail        smtp.gmail.com                     587   yes        email address !
Yahoo Mail   smtp.mail.yahoo.com                587   yes        email address !
Amazon SES   email-smtp.<region>.amazonaws.com  587   yes        Access key +

* Exchange will use the credentials of the user running the Python command to determine what rights they have to send email. You do not need to log in when using this method.
! You will have to allow access for less secure apps. See here for Gmail and here for Yahoo Mail.
+ Password is the secret access key. Verify email address before sending with this guide.

For a local Exchange server where you control the network (the top option in the table above) you can use the code from Matt's blog post above without change. However, where the server is remote, or you want to specify login credentials for Exchange, you should look at using the Extended SMTP commands. This can be done by changing the sendmail code as follows.

s.ehlo()
if s.has_extn('STARTTLS'):
    s.starttls()

As a final note, I’ve included the ehlo command (Extended Hello) in the example above although all of the gateways listed will work without this. This is best practice as it informs the gateway you want to use extended commands and will also return a list of commands accepted. I’ve used this to check the gateway supports STARTTLS (I don’t want my username and password sent unencrypted to the server).
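Pulling the table and the STARTTLS check together, a complete send might look like the sketch below. The server name, addresses and credentials are placeholders (swap in a row from the table above), and the helper function names are my own:

```python
import smtplib
from email.mime.text import MIMEText

def build_message(subject, body, sender, recipient):
    # a simple plain-text email; see the linked blog post for multipart
    msg = MIMEText(body)
    msg['Subject'] = subject
    msg['From'] = sender
    msg['To'] = recipient
    return msg

def send_message(msg, server, port=587, username=None, password=None):
    # connect, upgrade to TLS if the gateway offers it, then send
    s = smtplib.SMTP(server, port)
    s.ehlo()  # announce extended SMTP and discover supported commands
    if s.has_extn('STARTTLS'):
        s.starttls()
        s.ehlo()  # re-identify over the now-encrypted channel
        if username:
            s.login(username, password)
    s.sendmail(msg['From'], [msg['To']], msg.as_string())
    s.quit()

msg = build_message('Job complete', 'The overnight job finished successfully.',
                    'sender@example.com', 'recipient@example.com')
# send_message(msg, 'smtp.example.com', 587, 'sender@example.com', 'password')
```

Note that login is only attempted after the connection has been upgraded, so the credentials are never sent in the clear.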