PDF extraction

PDF files have been a standard for formatted documents for a long time. Back in the early days of the Internet they were about the only way to send a file that would look the same on screen as it did when printed. As a file to work with they are a bit of a nightmare: the PDF format is basically the PostScript printer language and its associated resources packaged into a single file.

This makes extracting content from a PDF file tricky. What you get will depend a lot on the process used to create the PDF. If it came from a spreadsheet and was created by a virtual printer driver then it is likely that each cell is a separate item, and trying to recreate the text (let alone work with it) is going to be hard. Where possible you want to use the data before it was converted to PDF.

When that is not possible there are PDF parsers written in Python. A popular one for data extraction is PDFMiner. While the original library only supports Python 2.6 and 2.7, there is a fork that uses the six module to support both Python 2 and 3. Install it using the usual pip command:

pip install pdfminer.six

The official documentation is not great. It doesn’t even work as a quick start guide; see this blog post for a better review and quick start guide. Working through a basic text extractor is more enlightening.

The first thing to create is a PDFResourceManager instance. This is basically just a dictionary used to cache fonts and similar shared resources. We don’t use it directly, we just pass a reference to it in a few calls. Next we want the layout settings, which come from the slightly confusingly named LAParams class. You can just use the default settings from the constructor. Although we don’t change any of the settings, you cannot leave this argument out of the call: it defaults to no layout settings at all, which gives some strange results.

With these two objects we are ready to load the PDF file. The file is opened as normal and its pages are read with the PDFPage.get_pages method. Once loaded, we turn the data into our required output using an interpreter. Calling it an interpreter is again confusing, as it relies on a separate converter to produce the output in the desired format. The interpreter just handles the procedural side, which is the same regardless of the output being produced.

PDFMiner comes with three converters, for text-only, HTML and XML output. For this example we just want the text converter, which ignores any non-text items. All three converters take an output stream as the second parameter (after the resource manager). This could be a file object, but for this example I want the text in memory so I am going to use the io module. With all the text now held in that in-memory stream I can print it after suitably encoding any Unicode characters.

I’ve put the example code on Bitbucket.

