Month: December 2012

Pre-processing a file with a generator

While answering a forum post about a function that processed a list, I got thinking about how it would run in a real-life situation. Rather than a list, the input would probably be a file. Passing the file in directly almost worked, except that each line arrived with its line return still attached and I needed those stripped out. I was hoping to find an elegant solution and I did: a generator.

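If you have not used generators before this wiki post is a good starting point. If you have used list comprehensions then the syntax is the same, just with round brackets instead of square ones; the difference is that a generator produces its items one at a time as they are asked for rather than building the whole list up front. To demonstrate I’ll use a small counting function that does the same job as collections.Counter(); those using a Python version earlier than 2.7 will need to create their own function anyway.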
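As a rough illustration of how the two forms differ, here is a minimal sketch. The sample data and variable names are made up for the example, and print is written with brackets so the snippet should run on both Python 2 and 3.

```python
# Hypothetical sample data, used only for illustration.
raw = ['1\n', '2\n', '2\n']

# List comprehension: square brackets, builds the whole list immediately.
as_list = [line.strip() for line in raw]

# Generator expression: round brackets, yields one value at a time on demand.
as_gen = (line.strip() for line in raw)

first = next(as_gen)      # pulls only the first item
rest = list(as_gen)       # consumes whatever is left
leftover = list(as_gen)   # a generator can only be walked once, so this is empty

print(as_list)
print(first)
print(rest)
print(leftover)
```

The one-pass behaviour is the thing to watch for: once a generator has been consumed there is nothing left to iterate over a second time.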

First an example with a list which acts as the starting point:

def basicCounter(mylist):
  # Count how many times each item appears in mylist (any iterable will do).
  # Python 2.7+ users could use collections.Counter instead
  retdic = dict()
  for item in mylist:
    retdic[item] = retdic.get(item, 0) + 1
  return retdic

mylist = ['1','2','2','3','3','3']
counted = basicCounter(mylist)
print counted
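For anyone on Python 2.7 or later, the collections.Counter route mentioned in the comment might look something like this. It is only a sketch of the equivalent call, reusing the same sample list.

```python
from collections import Counter

mylist = ['1', '2', '2', '3', '3', '3']

# Counter is a dict subclass that maps each item to the number of times it occurs.
counted = Counter(mylist)

# Converting to a plain dict gives the same mapping basicCounter would return.
as_dict = dict(counted)
print(as_dict)

# Unlike a plain dict, a Counter returns 0 for items it has never seen.
missing = counted['4']
```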

Now let's create a generator to process the lines in a file, removing the surrounding whitespace and line returns. The strip() method does this for a single string; we just need to apply it to every line in the file. This gives us our generator: (line.strip() for line in file).

Add a bit of code for opening the file and we have our version of the above which uses the contents of a file for the input instead.

# basicCounter as before
# Python 2.5 users need the following line
# from __future__ import with_statement
with open(r'C:\path\to\file.txt') as myfile:
  counted = basicCounter(line.strip() for line in myfile)
print counted
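To see what the generator actually changes, the following sketch writes a small throwaway file and counts it both ways. The temporary file and its contents are assumptions made purely so the example can run on its own; basicCounter is repeated for the same reason.

```python
import os
import tempfile

def basicCounter(items):
    # Same counting logic as the function above, repeated so this runs on its own.
    retdic = dict()
    for item in items:
        retdic[item] = retdic.get(item, 0) + 1
    return retdic

# Write a small throwaway file (hypothetical sample data) to read back.
fd, path = tempfile.mkstemp(suffix='.txt')
with os.fdopen(fd, 'w') as tmp:
    tmp.write('1\n2\n2\n3\n3\n3\n')

try:
    # Passing the file object directly: every key keeps its line return.
    with open(path) as myfile:
        raw_counts = basicCounter(myfile)
    # Passing the generator: each line is stripped just before it is counted.
    with open(path) as myfile:
        clean_counts = basicCounter(line.strip() for line in myfile)
finally:
    os.remove(path)

print(raw_counts)
print(clean_counts)
```

Note that the counting has to happen inside the with block, because the generator reads from the file lazily and the file must still be open at that point.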

There is nothing to stop you making the processing much more complex; simply create your function and replace line.strip() with yourfunction(line). You can also make the processing conditional by adding an if clause at the end.
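As a sketch of both ideas together, the example below swaps line.strip() for a custom function and adds an if clause. The normalise function, the sample text and the filtering rule are all hypothetical, and io.StringIO stands in for an open file so the snippet is self-contained.

```python
import io
from collections import Counter

def normalise(line):
    # Hypothetical processing step: trim whitespace and lower-case the text.
    return line.strip().lower()

# io.StringIO stands in for an open file; the contents are made-up sample data.
myfile = io.StringIO(u'Apple\napple\n\n# a comment\nPear\n')

# The if clause skips blank lines and lines starting with '#'.
counted = Counter(
    normalise(line)
    for line in myfile
    if line.strip() and not line.startswith('#')
)

as_dict = dict(counted)
print(as_dict)
```

The if clause is evaluated on the raw line before the function is applied, so anything filtered out never reaches normalise at all.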

Windows binaries

One annoyance of using Python in a Windows environment is finding a really useful library only to find out you need to compile everything from source. Building from source is not a strong point of Windows.

A good resource is the Unofficial Windows Binaries for Python Extension Packages maintained by Christoph Gohlke. Chris has done the hard work of compiling the libraries and creating installers. All you have to do is download and run the correct one (there is often a separate installer for each Python version, in both 32- and 64-bit builds).
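If you are not sure which installer matches your setup, a quick check along these lines should tell you. It is a minimal sketch using only the standard library, and the variable names are my own. What matters is the Python interpreter's version and bitness, which may differ from that of Windows itself.

```python
import struct
import sys

# Major and minor version as a tuple, which the installer needs to match.
py_version = tuple(sys.version_info[:2])

# Size of a pointer in bits: 32 on a 32-bit interpreter, 64 on a 64-bit one.
py_bits = struct.calcsize('P') * 8

print('Python %d.%d, %d-bit' % (py_version[0], py_version[1], py_bits))
```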

I came across this page after looking at lxml.html. Once you have downloaded and installed the correct library you’ll be able to run the following script, which displays all the links on a page:

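import lxml.html
# parse() accepts a filename, file object or URL and returns an ElementTree.
# lxml's built-in URL loader may not handle https; if the page fails to load,
# fetch it with urllib2.urlopen() and pass the response object to parse().
htmlpage = lxml.html.parse("https://quackajack.wordpress.com")
# iter("a") walks the tree and yields only the <a> elements
for item in htmlpage.iter("a"):
  print "%s=%s" % (item.text, item.values())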