Visual Studio 2017

Visual Studio 2017 has reached the release candidate 2 stage and it is good to see the community version has been retained. I’ve covered the VS 2015 community version in this previous post. Everything from that post remains true in the later version except who qualifies: open source projects, academic research, training and education remain, but only a single professional developer now qualifies. A small team would now have to use the more costly Professional version.

Edit: After I originally wrote this blog post at the end of December, MS removed Python from the release candidate without warning. Finally in May it is back and Python appears as a development option next to Node.

The installer for 2017 is different but once installed, the main product retains a similar look and feel. The basic installation does not include the Python tools, which is not surprising given the effort to make VS more modular. Additional support can be added if you go back into the installer; see my comment above and the blog post link for details about what has happened to the Python tools. A separate download will no longer be required.

There is an argument against upgrading even once Microsoft has reinstated Python and it is no longer a release candidate: Python 3.6 is built with VS2015 (as is 3.5), so those compiling modules or embedding Python should stay on the previous version for now. Also, if you are already using VS2015 then I’ve not seen anything that would make me want to switch.

It is not an either-or choice as the different versions of Visual Studio will install side-by-side so there is nothing stopping you checking out the latest incarnation like I’ve done.

Storing data between runs

Sometimes there will be a need for your Python scripts to store information between runs. While configuration information is best kept in human readable files (like ini or yaml files), if it is data used internally then a better approach is needed. Unless the data to be stored is just plain text it is usually better long term to use a database for this. On first read, that last statement might seem overkill, but a few years down the line when the script has stopped working or you need to extract the data stored within, you will curse the decision to go with your own home-rolled file format.

We are only looking for ease of use here, not concurrent access or performance. With this in mind, SQLite is a very mature product and the DLL has been included in the Python standard library since v2.5 so there is nothing to install. It is powerful enough to be used as the data store for the Skype client. A good introduction to the SQLite module in Python (assuming you know a little SQL) can be found on Python Central.

Putting the snippets into a slightly more complete example, we can create the cliched task list example with surprisingly little code. I’ve tried to choose the SQL statements to show off a few features you would probably want with your CRUD operations but this is not meant as a SQL tutorial.

import sqlite3

sqlcmds = ["CREATE TABLE IF NOT EXISTS persons (pid INTEGER PRIMARY KEY, givennames VARCHAR, surname VARCHAR);",
    "CREATE TABLE IF NOT EXISTS tasks (tid INTEGER PRIMARY KEY, pid INTEGER, name VARCHAR, percentcomplete INTEGER DEFAULT 0);",
    "INSERT INTO persons (pid,givennames,surname) VALUES (1,'Tom','Smith'),(2,'Dick','Jones'),(3,'Harry','Hat');",
    "INSERT INTO tasks (pid,name) VALUES (1,'Checktest'),(1,'Report Checktest'),(3,'Submit Report');",
    "UPDATE tasks SET percentcomplete = 25 WHERE name = 'Checktest';"]
sqlselect = """SELECT tasks.name, persons.givennames, persons.surname, tasks.percentcomplete
    FROM tasks inner join persons on tasks.pid = persons.pid"""

with sqlite3.connect(r"c:\path\to\sqlite.db") as dbconnection:
    cursor = dbconnection.cursor()
    for statement in sqlcmds:
        cursor.execute(statement)
        dbconnection.commit()
    cursor.execute(sqlselect)
    tasklist = ['"%s" with %s %s (%d%% complete)' % row for row in cursor]

print('\n'.join(tasklist))

Why is using SQLite an advantage? For starters you can easily view and execute arbitrary SQL with the DB Browser for SQLite utility. When unexpected characters or a dodgy record breaks the code, you now have a fighting chance of fixing it. Also SQLite doesn’t create large files; the example above is a mere 3 KB file.

An interesting take on this is y_serial, which basically pickles your Python object and stores it with a label in the SQLite table. I’ve not tested this, so make no recommendation for using it, but it is a novel way to store objects between runs without any SQL knowledge or installing any additional modules.
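The general idea can be sketched with nothing more than the pickle and sqlite3 modules. This is only an illustration of the approach and not y_serial's own API; the table name objects and the helpers open_store, save_object and load_object are made up for the example, and it assumes Python 3. As with any use of pickle, only load data you wrote yourself.

```python
import pickle
import sqlite3

def open_store(path):
    """Open (or create) the database and make sure the objects table exists."""
    conn = sqlite3.connect(path)
    conn.execute("CREATE TABLE IF NOT EXISTS objects (label TEXT PRIMARY KEY, data BLOB)")
    return conn

def save_object(conn, label, obj):
    """Pickle obj and store it under label, replacing any earlier value."""
    conn.execute("INSERT OR REPLACE INTO objects (label, data) VALUES (?, ?)",
                 (label, pickle.dumps(obj)))
    conn.commit()

def load_object(conn, label, default=None):
    """Return the object stored under label, or default if there is none."""
    row = conn.execute("SELECT data FROM objects WHERE label = ?", (label,)).fetchone()
    return pickle.loads(row[0]) if row else default

# An in-memory database keeps the demo self-contained; use a file path to persist between runs
conn = open_store(":memory:")
save_object(conn, "last_run", {"status": "ok", "failed_tasks": ["backup"]})
restored = load_object(conn, "last_run")
print(restored)
conn.close()
```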

When I get a chance to finish testing and uploading there will be a bigger example of this. I want to monitor the scheduled tasks for failures. There is a command line interface for this called schtasks which I can run using the code for running shell commands. As I am only interested in cases where the task has just failed, I need to store the state from the last run and this is where SQLite will come in.
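Until then, here is a rough sketch of how the "only tell me when it has just failed" part could work. It is a guess at the shape of the solution rather than the finished code: the taskstate table, the function names and the "ok" / "failed" status strings are all invented for the example, and the parsing of the schtasks output is left out entirely.

```python
import sqlite3

def open_state_db(path):
    """Open the state database, creating the table on first use."""
    conn = sqlite3.connect(path)
    conn.execute("CREATE TABLE IF NOT EXISTS taskstate (name TEXT PRIMARY KEY, status TEXT)")
    return conn

def newly_failed(conn, current):
    """Return tasks that are failed now but were not failed on the last run.

    current is a dict of task name -> status; it is saved as the new
    'last run' state before returning.
    """
    previous = dict(conn.execute("SELECT name, status FROM taskstate"))
    failed = [name for name, status in sorted(current.items())
              if status == "failed" and previous.get(name) != "failed"]
    conn.executemany("INSERT OR REPLACE INTO taskstate (name, status) VALUES (?, ?)",
                     list(current.items()))
    conn.commit()
    return failed

conn = open_state_db(":memory:")
first = newly_failed(conn, {"Backup": "ok", "Cleanup": "failed"})
second = newly_failed(conn, {"Backup": "failed", "Cleanup": "failed"})
print(first)   # ['Cleanup']
print(second)  # ['Backup']
```

On the second call Cleanup is still failing but was already reported, so only Backup comes back.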

TBC

 

PDF extraction

PDF files have been a standard for formatted documents for a long time. Back in the early days of the Internet they were about the only way to send a file that could be both viewed on screen and printed out the same. As a file to work with they are a bit of a nightmare – the PDF format is basically the Postscript printer language and associated resources packaged into a single file.

This makes extracting content from a PDF file tricky. What you get will depend a lot on the process used to create the PDF. If it has come from a spreadsheet and was created by a virtual printer driver then it is likely each cell is a separate item, and trying to recreate the text (let alone work with it) is going to be tricky. Where possible you want to use the data before it was converted to a PDF.

When that is not possible there are PDF parsers written in Python. A popular one for data extraction is PDFMiner. While the original library only supports Python 2.6 & 2.7, there is a fork that uses the six module to support both Python v2 & v3. Install it using the usual pip command

pip install pdfminer.six

The official documentation is not great. It doesn’t even work as a quick start guide – see this blog post for a better review and quick start guide. Instead, creating the basic text extractor is more enlightening.

The first thing to create is a PDFResourceManager instance. This is basically just a dictionary to cache fonts and similar shared resources in. We don’t use this directly, just pass a reference to it in a few method calls. Next we want the layout settings, which come from the slightly confusingly named LAParams class. You can just use the default settings from the constructor. Although we don’t change any of the settings you cannot just leave this out of the method call, as it defaults to no settings, which gives some strange results.

With these couple of objects we are ready to load the PDF file. The file is opened as normal and read in with the PDFPage.get_pages method. Once loaded we turn the data into our required output using an interpreter. Calling it an interpreter is again a confusing name as it relies on a separate converter to process the output in the desired format, it just handles the procedural side which would be the same regardless of the output being produced.

PDFMiner comes with three converters, for text only, html and xml output. For this example we just want the text converter, which just ignores any non-text items. All three converters take an output stream as the second parameter (after the resource manager). This could be a file object but for this example I want the text in memory so I am going to use the io module. With all the text now in this variable I can print it after suitably encoding any unicode characters.
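Pulling those steps together, a minimal sketch might look like the following. It assumes Python 3 with pdfminer.six installed, the function name pdf_to_text is made up for illustration, and the imports sit inside the function only so that the snippet still loads on a machine without the library.

```python
import io

def pdf_to_text(path):
    """Return the text content of the PDF at path as a single string."""
    # Imported here so this file can be loaded even where pdfminer.six is missing
    from pdfminer.converter import TextConverter
    from pdfminer.layout import LAParams
    from pdfminer.pdfinterp import PDFPageInterpreter, PDFResourceManager
    from pdfminer.pdfpage import PDFPage

    resource_manager = PDFResourceManager()   # cache for fonts and other shared resources
    layout_params = LAParams()                # default layout settings
    output = io.StringIO()                    # the text is collected in memory
    converter = TextConverter(resource_manager, output, laparams=layout_params)
    interpreter = PDFPageInterpreter(resource_manager, converter)

    with open(path, "rb") as pdf_file:
        for page in PDFPage.get_pages(pdf_file):
            interpreter.process_page(page)    # the converter writes into output

    converter.close()
    return output.getvalue()
```

Calling print(pdf_to_text("example.pdf")) with a path of your own would then print the extracted text.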

I’ve put the example code on Bitbucket.

Python 3.6

Almost like a Christmas present, Python 3.6 has been released, just fifteen months after 3.5 was released (compared to an 18 month average for the 3.x branch). You can see the official what’s new page here, or if dry lists of features are not your thing, try this summary of the improvements. If nothing else the speed improvements might end one of the arguments for staying on v2.

What will be interesting to see is the take up of the asynchronous additions of Python 3.5 which have been further improved in 3.6. Node.js has shown just how efficient asynchronous programming can be and hopefully async / await can make this just as accessible in Python. If you’ve not seen these new keywords before, see this blog post for a decent introduction.
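To give a flavour of the two keywords, here is a tiny sketch using only the asyncio module from the standard library. The coroutine names and the delays are arbitrary; asyncio.sleep stands in for real work such as a network call.

```python
import asyncio

async def pretend_job(name, delay):
    """Wait without blocking the event loop, then report back."""
    await asyncio.sleep(delay)
    return "%s done" % name

async def main():
    # Both jobs wait at the same time; gather returns results in argument order
    return await asyncio.gather(pretend_job("a", 0.2), pretend_job("b", 0.1))

# run_until_complete works on Python 3.5 and 3.6 as well as later versions
loop = asyncio.new_event_loop()
try:
    results = loop.run_until_complete(main())
finally:
    loop.close()
print(results)
```

The whole thing takes about 0.2 seconds rather than 0.3 because the two waits overlap.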

I’m hoping to investigate asynchronous programming in the New Year so there should be a post on here in the near future.

Pip requirements

You should be used to installing new modules using pip. You have probably used a requirements.txt file to install multiple modules together with the command.

pip install -r requirements.txt

But what if you need more flexibility? Why would you ever need more flexibility? If you look at my introduction to YAML post, the code supports either the yaml or ruamel.yaml module. There is no way to add this kind of either-or logic to a requirements.txt file so a different strategy is needed.

pip is just a module so it can be imported like any other module. This not only gives you access to the main method, which takes an argument list just as if you were calling pip from the command line, but also to its various methods and classes. One of these is the WorkingSet class which creates a collection of the installed modules (or active distributions as the documentation calls them). Using this we can create the conditional logic needed to ensure one of the yaml modules is installed as below.

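import pip
# Note: pip.main() is only available in pip 9 and earlier
package_names = [ ws.project_name for ws in pip._vendor.pkg_resources.WorkingSet() ]
# The yaml module is distributed under the project name PyYAML
if ('PyYAML' not in package_names) and ('ruamel.yaml' not in package_names):
    pip.main(['install','ruamel.yaml'])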

Each entry in the WorkingSet has a few other useful properties and methods apart from project_name. The location property returns the path to where the module is installed and the version property naturally returns the version installed. The requires method returns a list of dependencies.
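Since pip 10 the main function is no longer exposed as pip.main, and the pip documentation suggests running pip as a subprocess rather than importing it. A sketch of the same either-or check on a newer setup, assuming Python 3.8 or later for importlib.metadata, could look like this. The helper names installed_projects and ensure_one_of are invented for the example.

```python
import subprocess
import sys
from importlib import metadata  # standard library from Python 3.8

def installed_projects():
    """Return the lower-cased project names of all installed distributions."""
    names = set()
    for dist in metadata.distributions():
        name = dist.metadata.get("Name")
        if name:
            names.add(name.lower())
    return names

def ensure_one_of(candidates, default):
    """Install default with pip unless one of candidates is already present.

    Returns True if an install was run, False if nothing was needed.
    """
    present = installed_projects()
    if any(name.lower() in present for name in candidates):
        return False
    # Run pip with the same interpreter that is running this script
    subprocess.check_call([sys.executable, "-m", "pip", "install", default])
    return True
```

For the YAML case the call would be ensure_one_of(["PyYAML", "ruamel.yaml"], "ruamel.yaml"), bearing in mind that the yaml module is distributed as PyYAML.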

As with most modules, if you’re interested in finding out more dig around in the source code.

Decrypting AWS Windows passwords

With Linux instances, the public key of the key pair you specify when creating the instance is placed in the .ssh/authorized_keys file. When you SSH in it encrypts the initial communication details with your public key so that only someone with the corresponding private key can decrypt the details and complete the connection.

Windows instances do not work in the same way. Instead when the instance is created a random password is created. This password is then encrypted with the public key.  You can request this encrypted password but you then need the private key to decrypt it. This can be done through the AWS console but if you are going to use boto to automate AWS then you really want a Python solution.

I have seen a couple of solutions using the PyCrypto module but I wanted a pure Python solution. Luckily there is an rsa module (pip install rsa) which is written in pure Python. With that and the boto module you can decrypt the password with the following code.

import rsa, boto, base64
instance_id = 'i-0123456789abcdef'
key_path = r'C:\path\to\private.pem'

ec2 = boto.connect_ec2() #access_key,secret_key
passwd = base64.b64decode(ec2.get_password_data(instance_id))
if (passwd):
    with open (key_path,'r') as privkeyfile:
        priv = rsa.PrivateKey.load_pkcs1(privkeyfile.read())
    key = rsa.decrypt(passwd,priv)
else:
    key = 'Wait at least 4 minutes after creation before the admin password is available'

print(key)

You just need to specify the instance_id and the private key file location (2nd and 3rd lines). The connect_ec2 method will use the credentials in the .aws file in your home directory if it exists. Alternatively you can specify the access key and secret key as parameters to the method. Assuming you haven’t jumped the gun this should print out the admin password.

Email configuration

Sending an email when a system breaks, when a warning level is reached or simply when a job completes is standard stuff. The smtplib module handles the sending and the email module makes building even a multipart mime email straightforward. If you have not used these libraries to send an email before then there are lots of other articles on the Internet with examples; see Matt’s blog post for a quick overview of sending plain text and then text + html multipart emails.

What can be surprising is the slight variations in what different SMTP gateways will require in order to work. The following table gives you the different settings needed to get the most common gateways working.

Service | Server | Port | SSL (TLS) | Username
Exchange | your dns entry | 25 | no | n/a *
Exchange | your dns entry | 587 | yes | Windows username w/ domain
Office 365 | smtp.office365.com | 587 | yes | email address
Outlook.com | smtp.live.com | 587 | yes | email address
Gmail | smtp.gmail.com | 587 | yes | email address !
Yahoo mail | smtp.mail.yahoo.com | 587 | yes | email address !
AWS SES | email-smtp.us-east-1.amazonaws.com, email-smtp.us-west-2.amazonaws.com or email-smtp.eu-west-1.amazonaws.com | 587 or 2587 | yes | Access key +

Notes:
* Exchange will use the credentials of the user running the Python command to determine what rights they have to send email. You do not need to log in when using this method.
! You will have to allow access to less secure apps. See here for Gmail and here for YahooMail.
+ Password is the secret access key. Verify the email address before sending with this guide.

For a local Exchange server where you control the network (top option in the table above) you can use the code from Matt’s blog post above without change. However where the server is remote or you want to specify login credentials for Exchange you should look to using the Extended SMTP commands. This can be done by replacing the sendmail code with the following.

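s=smtplib.SMTP(Server,Port)
s.ehlo()                      # extended hello; smtplib records the commands the gateway accepts
if s.has_extn('STARTTLS'):    # only carry on if the connection can be encrypted
    s.starttls()
    s.login(Username,Password)
    s.sendmail(...)
s.quit()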

As a final note, I’ve included the ehlo command (Extended Hello) in the example above although all of the gateways listed will work without this. This is best practice as it informs the gateway you want to use extended commands and will also return a list of commands accepted. I’ve used this to check the gateway supports STARTTLS (I don’t want my username and password sent unencrypted to the server).
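For completeness, here is a sketch that joins the message building and the sending into two small functions. It assumes Python 3.6 or later (for EmailMessage); the function names, and the choice to refuse a login when STARTTLS is not offered, are mine for the example and can be changed to suit. The server, port and username would come from the table above.

```python
import smtplib
from email.message import EmailMessage

def build_message(sender, recipient, subject, body):
    """Create a plain text email ready for sending."""
    msg = EmailMessage()
    msg["From"] = sender
    msg["To"] = recipient
    msg["Subject"] = subject
    msg.set_content(body)
    return msg

def send_message(msg, server, port, username=None, password=None):
    """Send msg through the gateway, logging in only over an encrypted connection."""
    with smtplib.SMTP(server, port) as smtp:
        smtp.ehlo()
        if smtp.has_extn("STARTTLS"):
            smtp.starttls()
            smtp.ehlo()  # say hello again now the connection is encrypted
        elif username:
            raise RuntimeError("Gateway does not offer STARTTLS, not sending credentials")
        if username:
            smtp.login(username, password)
        smtp.send_message(msg)

message = build_message("me@example.com", "you@example.com",
                        "Backup finished", "The nightly backup completed.")
print(message["Subject"])
```

Nothing is sent until send_message is called, for example send_message(message, "smtp.office365.com", 587, username, password) with your own credentials.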

Microsoft Azure courses

I’ve mentioned Microsoft Virtual Academy before as a great resource for free online courses. As it is a Microsoft site the courses are either on Microsoft products or are presented with a Microsoft slant. There are even a few Python ones on there.

Microsoft has now also launched Microsoft Learning, with Open edX based courses (the same technology used to provide the Mongo University). At present there are only Azure courses on there, we’ll see if Microsoft keep it that way or add additional courses for other technologies over time. If you want to quickly learn about Microsoft’s cloud platform the courses appear clear and concise from bits I’ve done so far.

Yielding files

Time for a challenge, so I’m going to try 12 blogs of Christmas. The aim is to write 12 blog entries in December (or at least by 5th January which is the 12th day of Christmas). That is one blog entry every 3 days. It’s a catchy title for a challenge (always helps, think Movember) which could be used for any challenge and I’ve twisted my ankle so I doubt I’ll be running; the 12 runs of Christmas does sound nice anyway. Yes it is the 4th already so not a good start.

After the last post I’ve been thinking of other examples of where a generator would be useful that was more in keeping with the theme of this blog (sys administration with Python in case you’ve forgotten). Iterating through system calls or an API would be a good candidate but I’ve not been using anything recently that fitted the bill. Another case that sprang to mind was file searching.

A reasonable way to do this would be to create a list but why use the memory to create the list if the caller is unlikely to need a list and they can use list comprehension to create a list anyway. So this should make a good generator example.

Some of the work is done already by os.walk; this will iterate through each directory giving you a list of files and folders. Normally when you are looking for files you would specify a wildcard pattern, so I’m going to use regular expressions and return any file that matches using yield. I’ve covered regular expressions a few times before so I’ll skip any explanation and just present the code, which takes a directory and a file pattern and returns all the matching files.

import os, re

def filesearch (root, pattern, exact=True):
    searchre = re.compile(pattern)
    # os.walk visits root and every folder below it
    for parent, dirs, files in os.walk(root):
        for filename in files:
            if exact:
                # match only succeeds at the start of the filename
                res = searchre.match(filename)
            else:
                # search succeeds anywhere in the filename
                res = searchre.search(filename)
            if res:
                yield os.path.join(parent,filename)

for filename in filesearch(r"C:\Temp",r".*\.exe"):
    print("%s has size %d" % (filename,os.path.getsize(filename)))

The only thing to note is I added a third option so you can do a match (the regular expression must match from the start of the filename; add a trailing $ to the pattern if it must also reach the end) or a search (the regular expression only needs to match part of the filename). This defaults to True, which uses match.

The example should find any executables in the C:\Temp folder. Regular expressions are very powerful but not quite as simple as using *.exe. Instead the asterisk becomes .* (match any character 0 or more times) and the dot has to be escaped as it is a special character. I’ve just printed the filename and size out but you could equally delete the file if it was bigger than a certain size etc.
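If you would rather keep typing *.exe, the fnmatch module in the standard library can translate a wildcard into a regular expression for you. The sketch below is a variation on the generator above and not a replacement for it; the name wildcardsearch is made up, and I have assumed a case insensitive match is wanted as that is the Windows behaviour. The demo builds a throwaway folder so it can be run anywhere.

```python
import fnmatch
import os
import re
import tempfile

def wildcardsearch(root, wildcard):
    """Yield the path of every file under root whose name matches wildcard."""
    # fnmatch.translate turns a pattern such as *.exe into a regular expression
    pattern = re.compile(fnmatch.translate(wildcard), re.IGNORECASE)
    for parent, dirs, files in os.walk(root):
        for filename in files:
            if pattern.match(filename):
                yield os.path.join(parent, filename)

# Build a small temporary folder tree to search
with tempfile.TemporaryDirectory() as demo_root:
    os.mkdir(os.path.join(demo_root, "sub"))
    for name in ("a.exe", "b.txt", os.path.join("sub", "c.EXE")):
        open(os.path.join(demo_root, name), "w").close()
    found = sorted(os.path.basename(path) for path in wildcardsearch(demo_root, "*.exe"))

print(found)  # ['a.exe', 'c.EXE']
```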

And that’s my first post of 12 blogs of Christmas. Let’s see if I can get all 12 done in time.


									

Generators and yield

A source of confusion for a lot of people new to Python and for anyone who has not used them for a while is the yield keyword. As this must be the third time I’ve had to relearn generators I thought I’d make a few notes.

My way of visualizing a generator is a function that returns (or should that be generates?) an iterator. Having the yield keyword in a function is enough to turn it into a generator. Once you have called the generator to get the iterator you can use it as you would any other iterator. Jeff Knupp has a much fuller explanation on his blog so give it a read and then return.

For an example I created a Fibonacci number generator with the following code along with a few examples using an iterator.

def fibonacci(a = 0,b = 1,maxiter=-1):
    while True:
        yield a
        a,b = b,a+b
        if maxiter > 0:
           maxiter -= 1
           if maxiter < 1:
              return

print([f for f in fibonacci(maxiter=10)])
for f in fibonacci(3,5,20):
    print(f)

First a few notes on the generator itself. You can specify the starting numbers a and b (naturally defaulting to 0 and 1) and a maximum number of Fibonacci numbers (or iterations to perform) with maxiter when you call the generator. Without setting maxiter the iterator will continue indefinitely and could not be used for list comprehension (the first example) and the for loop would be an infinite loop.
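An alternative to building the limit into the generator is to leave it infinite and let the caller decide how much to take, using islice from the itertools module. The cut-down fibonacci below drops maxiter purely to show the idea.

```python
from itertools import islice

def fibonacci(a=0, b=1):
    """Infinite Fibonacci generator; the caller decides when to stop."""
    while True:
        yield a
        a, b = b, a + b

# islice stops asking for values after the first ten
first_ten = list(islice(fibonacci(), 10))
print(first_ten)  # [0, 1, 1, 2, 3, 5, 8, 13, 21, 34]
```

This keeps the generator simple, at the cost of making every caller remember that it never ends on its own.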

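The use of return in a generator ends the iteration and, as far as the caller is concerned, is equivalent to a StopIteration exception being raised (see PEP 255). In older versions of Python you could replace the return keyword with raise StopIteration to prove it. Since Python 3.7 (PEP 479) a StopIteration raised inside the generator body is converted into a RuntimeError, so stick with return.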
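You can see the exception from the caller's side by asking for one value too many. The countdown generator here is just a small stand-in for the example.

```python
def countdown(n):
    """Yield n, n-1, ... 1 and then finish."""
    while n > 0:
        yield n
        n -= 1
    return  # ends the iteration; the caller sees StopIteration

it = countdown(2)
values = [next(it), next(it)]

try:
    next(it)          # nothing left to yield
    finished = False
except StopIteration:
    finished = True

print(values, finished)  # [2, 1] True

# Passing a default to next() avoids the exception altogether
print(next(countdown(0), "empty"))  # empty
```

A for loop or a list comprehension catches the StopIteration for you, which is why you normally never see it.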

Lastly, if you are wondering about the line a,b=b,a+b it is just a compact (and I think elegant) way of writing:

temp = a + b
a = b
b = temp

Behind the scenes the loop is calling next on the iterator to get each value. There is nothing to stop you doing this manually with the built-in next function as shown below. Also the generator will create a new iterator each time it is called. Each iterator will encapsulate its own values for a, b and maxiter as shown below.

i = fibonacci()
j = fibonacci(13,21)
print("variable i is %s\nvariable j is %s" % (i,j))
# next(i) works in Python 2.6+ and Python 3; i.next() is Python 2 only
print("First 3 from i: %d , %d , %d" % (next(i),next(i),next(i)))
print("First 3 from j: %d , %d , %d" % (next(j),next(j),next(j)))

Hopefully Jeff’s explanation and my example above go some way to explaining how generators work.