March 17, 2007

Tagging pdf documents with python

Tagging as a method of organizing and searching is commonly used for music files, pictures and favourite websites. For documents, the traditional method of searching has been based on indexing the content. All the modern desktop search tools will index your pdf files. But tags have their own advantages, and people go to great pains to add them manually to each file as metadata, or to organize their files in iTunes!

My personal itch is the need to search my collection of journal articles saved as pdf files. The problem with full text indexing of these files is that the long reference list at the end of each article misleads any attempt to search for a particular author, journal or title. I have struggled with this for some time and looked at applying tags to each file as a solution. The tags could be added as extended attributes in a Linux file system or as metadata in a Windows environment and used for searching. Even more useful might be applications that allow tagging and searching by tags. Tracker and leaftag come to mind for this purpose in Linux.

The main hurdle, however, is that adding these tags manually is too tedious. So I experimented with ways to get the information for each pdf file from pubmed. The Biopython module provides a simple interface to the pubmed database from Python. So all one has to do is convert the pdf to text and then parse the text for some information that will allow correct identification at pubmed. The first step is relatively easy. Xpdf provides tools for pdf to text conversion, though I preferred to use beagle's data extraction tool since I already had beagle installed.

import commands
convertcommand = "beagle-extract-content \"" +pdffile +"\""
pdftext = commands.getoutput(convertcommand)
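The shell command above breaks if the file name itself contains a double quote. As a minimal sketch in Python 3 syntax (the snippets in this post are Python 2), the same step can be done with Xpdf's pdftotext and an argument list, so no shell quoting is needed. The function names pdftotext_command and pdf_to_text are made up for this example, and it assumes pdftotext is installed and on the PATH.

```python
import subprocess


def pdftotext_command(pdffile, last_page=None):
    """Build the argument list for pdftotext; the trailing '-' sends the text to stdout."""
    command = ["pdftotext"]
    if last_page is not None:
        # -l limits the conversion to the pages up to last_page
        command += ["-l", str(last_page)]
    command += [pdffile, "-"]
    return command


def pdf_to_text(pdffile, last_page=None):
    """Run pdftotext without a shell and return the extracted text."""
    result = subprocess.run(pdftotext_command(pdffile, last_page),
                            capture_output=True, check=True)
    return result.stdout.decode("utf-8", errors="replace")
```

Since the doi and the citation line are normally on the first page, calling pdf_to_text(pdffile, last_page=1) should be enough for most articles and keeps the reference list out of the text altogether.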

The second part is more difficult because there is no consistent formatting of contents between different publishers. The best way to do it turned out to be using the doi. The doi, or Digital Object Identifier, is a unique name given to a digital object and is usually printed in the publication. Parsing it was a matter of searching for 'DOI:' or 'doi:' in the text.

# take what follows the first 'doi:' and keep the first whitespace-delimited token
doi = pdftext.lower().split('doi:')[1].split()[0]
searchstring = doi +"[AID]"
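The one-liner raises an IndexError when the text has no 'doi:' label, and it keeps any full stop that follows the doi. A slightly more defensive sketch, in Python 3 syntax, could use a regular expression instead. The names find_doi and doi_query are invented for this example, and the pattern assumes the doi starts with "10." followed by a numeric prefix and a slash, which is how dois are normally written.

```python
import re

# "doi", an optional colon, then the identifier itself (10.<prefix>/<suffix>)
DOI_LABEL = re.compile(r"doi\s*:?\s*(10\.\d{4,9}/\S+)", re.IGNORECASE)


def find_doi(text):
    """Return the first labelled doi in the text, or None if there is none."""
    match = DOI_LABEL.search(text)
    if match is None:
        return None
    # drop punctuation that belongs to the sentence rather than to the doi
    return match.group(1).rstrip(".,;)")


def doi_query(doi):
    """Build the pubmed search string for a doi."""
    return doi + "[AID]"
```

Returning None instead of raising makes it easy to switch to another way of identifying the article when no doi is found.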

The searchstring is constructed by adding the tag [AID] for Article Identifier. Searching pubmed with this string turns up the pubmed ID (pmid) for the article. This allows retrieval of the formatted article information including title, authors, journal name, year of publication, volume and page numbers.

from Bio import PubMed
from Bio import Medline

rec_parser = Medline.RecordParser()
medline_dict = PubMed.Dictionary(parser = rec_parser)

pmid = PubMed.search_for(searchstring)[0]

record = medline_dict[pmid]

print "title is ", record.title

print "author is ", record.authors
print "source = ", record.source
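If the Bio.PubMed interface used above is not available in your Biopython version, the same search can be sent to NCBI's E-utilities esearch service directly. The following is only a sketch in Python 3 syntax: it builds the request URL and reads the pubmed IDs out of an esearch XML reply, but does not make the network call itself. The function names esearch_url and pmids_from_xml are made up for this example.

```python
from urllib.parse import urlencode
import xml.etree.ElementTree as ET

ESEARCH = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"


def esearch_url(term):
    """Build the esearch URL for a pubmed query such as '<doi>[AID]'."""
    return ESEARCH + "?" + urlencode({"db": "pubmed", "term": term})


def pmids_from_xml(xml_text):
    """Extract the list of pubmed IDs from an esearch XML reply."""
    root = ET.fromstring(xml_text)
    return [node.text for node in root.findall("./IdList/Id")]
```

The URL can be fetched with urllib.request.urlopen and the reply passed to pmids_from_xml. An empty list means pubmed found nothing for the search string.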

Some journals, however, still do not print the doi in the publication, so we need something to fall back on. I chose the citation information that is usually published on the first page in a standard format: year;volume:first page-last page (for example, 2007;41:272-275). There may sometimes be a space after the semicolon or the colon, so the regex search looks like this.

import re
texttosearch = pdftext[:6000]
pattern = "[0-9]{4,};[ ]*[0-9]+:[ ]*[0-9]+"
m = re.search(pattern, texttosearch)
info = m.group()
yr = info.split(';')[0]
(vol,pg) = info.split(';')[1].split(':')
searchstring = vol +"[volume] AND "+ pg +"[page] AND "+ yr +"[pdat]"
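The fallback can also be written as one function that returns None when no citation line is found, so the caller can decide what to do with the file. This is a sketch in Python 3 syntax. The name citation_query is invented for this example, and the pattern is a slightly stricter variant of the one above: exactly four digits for the year, and any whitespace allowed after the semicolon and the colon.

```python
import re

# year;volume:first page, e.g. "2007;41:272-275" or "2007; 41: 272-275"
CITATION = re.compile(r"\b(?P<year>\d{4});\s*(?P<volume>\d+):\s*(?P<page>\d+)")


def citation_query(text, limit=6000):
    """Build a pubmed search string from the citation line near the start of the text."""
    match = CITATION.search(text[:limit])
    if match is None:
        return None
    return "{volume}[volume] AND {page}[page] AND {year}[pdat]".format(
        **match.groupdict())
```

Named groups avoid the chained split calls and also strip the optional spaces, so the volume and page go into the search string without stray whitespace.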

For testing, I chose 10 journals comprising the prominent publications in medicine and cardiology and randomly picked one article from each for Sep 2005 and one for Feb-Mar 2007. Correct information was obtained for all 20 articles. Here is part of the output.

In conclusion, this seems a promising approach to automatically obtain information for each pdf file in my library. This information could be added to the file as extended attributes or used as tags for the file. Like Ruby, Python also has an xattr library, and adding the attributes automatically would be easy. The automatic retrieval will fail for some files, but the information could be added manually in such cases.
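As a rough sketch of that last step, the retrieved fields could be written as extended attributes with os.setxattr from the Python 3 standard library, which is available on Linux only. This is an alternative to the xattr library mentioned above, not the same thing. The function names and the attribute names (user.title, user.authors, user.source) are my own choices for this example; only the "user." prefix is required by Linux for attributes set by ordinary users.

```python
import os


def citation_xattrs(title, authors, source):
    """Map citation fields to extended-attribute names and byte values."""
    fields = {"title": title, "authors": "; ".join(authors), "source": source}
    return {"user." + name: value.encode("utf-8")
            for name, value in fields.items()}


def tag_file(path, attrs):
    """Write each attribute to the file; the file system must support user xattrs."""
    for name, value in attrs.items():
        os.setxattr(path, name, value)
```

Keeping the mapping separate from the writing step makes it possible to check the tags for a file before anything is changed on disk.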