March 17, 2007

Tagging pdf documents with python

Tagging as a method of organizing and searching is commonly used for music files, pictures and favourite websites. For documents, the traditional method of searching has been based on indexing the content, and all the modern desktop search tools will index your pdf files. But tags have their own advantages: people go to great pains to add tags manually to each file, whether as metadata or for organizing with iTunes!

My personal itch is the need to search my collection of journal articles saved as pdf files. The problem with full-text indexing of these files is that the long reference list at the end of each article misleads any attempt to search for a particular author, journal or title. I have struggled with this for some time and looked at applying tags to each file as a solution. The tags could be added as extended attributes in a Linux file system or as metadata in a Windows environment and used for searching. Even more useful might be applications that allow tagging and searching by tags; Tracker and leaftag come to mind for this purpose in Linux. The main hurdle, however, is that adding these tags manually is too tedious.

So I experimented with ways to get the information for each pdf file from PubMed. The Biopython module provides a simple interface to the PubMed database from Python. So all one has to do is convert the pdf to text and then parse the text for some information that will allow correct identification at PubMed. The first step is relatively easy. Xpdf provides tools for pdf to text conversion, though I preferred to use beagle's data extraction tool since I already had beagle installed.

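import subprocess
# pdffile holds the path to the pdf; an argument list avoids shell quoting problems
convertcommand = ["beagle-extract-content", pdffile]
pdftext = subprocess.check_output(convertcommand, universal_newlines=True)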
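For anyone without beagle, the same step could probably be done with pdftotext from Xpdf. This is only a sketch: the two function names are made up for illustration, and it assumes pdftotext is installed and on the PATH. Passing '-' as the output file makes pdftotext write the text to standard output, and '-enc UTF-8' fixes the output encoding so it can be decoded reliably.

```python
import subprocess


def pdftotext_command(pdf_path):
    """Build the argument list for pdftotext (hypothetical helper)."""
    # "-" as the output file name sends the extracted text to stdout.
    return ["pdftotext", "-enc", "UTF-8", pdf_path, "-"]


def pdf_to_text(pdf_path):
    """Run pdftotext and return the extracted text as a string."""
    # An argument list (no shell) copes with spaces and quotes in file names.
    raw = subprocess.check_output(pdftotext_command(pdf_path))
    return raw.decode("utf-8", "replace")
```

Calling pdf_to_text("article.pdf") would then give the text that the later parsing steps work on.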


The second part is more difficult because there is no consistent formatting of contents between different publishers. The best way to do it turned out to be using the DOI. The DOI, or Digital Object Identifier, is a unique name given to any digital object and is usually included in the publication. Parsing it was a matter of searching for 'DOI:' or 'doi:' in the text.

# Lower-casing matches both 'DOI:' and 'doi:'; split() also stops at newlines and tabs
doi = pdftext.lower().split('doi:')[1].split()[0]
searchstring = doi + "[AID]"
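The one-liner above fails with an IndexError when the text contains no 'doi:' marker. A more forgiving version of this step might look like the sketch below. The helper names and the regular expression are assumptions for illustration; the pattern relies only on the fact that DOIs start with '10.', followed by a registrant number, a slash and a suffix. The DOI in the usage note is a made-up example.

```python
import re

# "doi" with an optional colon and optional spaces, then 10.<registrant>/<suffix>.
DOI_RE = re.compile(r"doi:?\s*(10\.\d{4,9}/\S+)", re.IGNORECASE)


def find_doi(text):
    """Return the first DOI found in text, or None if there is none."""
    match = DOI_RE.search(text)
    if match is None:
        return None
    # Drop punctuation that often trails a DOI at the end of a sentence.
    return match.group(1).rstrip(".,;)")


def doi_query(doi):
    """Build the PubMed search string for a DOI using the [AID] tag."""
    return doi + "[AID]"
```

For example, find_doi("Published online. DOI: 10.1000/example.123.") picks out 10.1000/example.123, and find_doi returns None for text without a DOI so the caller can try something else.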


The searchstring is constructed by adding the tag [AID] for Article Identifier. Searching PubMed with this string turns up the PubMed ID (PMID) for the article. This allows retrieval of the formatted article information including title, authors, journal name, year of publication, volume and page numbers.

from Bio import PubMed
from Bio import Medline

# Parse each fetched entry into a Medline record
rec_parser = Medline.RecordParser()
medline_dict = PubMed.Dictionary(parser = rec_parser)

# Take the first PubMed ID that matches the search string
pmid = PubMed.search_for(searchstring)[0]
record = medline_dict[pmid]

print "title is ", record.title
print "author is ", record.authors
print "source = ", record.source


Some journals, however, still do not provide the DOI in the publication, so we need something to fall back on. I chose to use the article information which is usually published on the first page in a standard format like this: year;volume:first page-last page (for example, 2007;41:272-275). There may sometimes be a space after the semicolon or the colon, so the search with a regex looks like this.

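import re

# The citation line is normally on the first page, so search only the start of the text
texttosearch = pdftext[:6000]
# year;volume:first page, with optional spaces after ';' and ':'
pattern = r"([0-9]{4});[ ]*([0-9]+):[ ]*([0-9]+)"
m = re.search(pattern, texttosearch)
if m is None:
    raise ValueError("no year;volume:page citation found")
(yr, vol, pg) = m.groups()
searchstring = (vol + "[volume] AND " + pg + "[page] AND " + yr +
                "[pdat] AND English[lang]")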
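The two strategies can be combined so that the citation pattern is only used when no DOI is found. The following is a sketch of that fallback, not the script used for the tests described here: the function name and both compiled patterns are assumptions, and the sample strings in the usage note are invented.

```python
import re

# "doi" marker followed by 10.<registrant>/<suffix>
DOI_RE = re.compile(r"doi:?\s*(10\.\d{4,9}/\S+)", re.IGNORECASE)
# year;volume:first page, allowing spaces after ';' and ':' (e.g. 2007;41:272-275)
CITATION_RE = re.compile(r"\b(\d{4}); *(\d+): *(\d+)")


def build_search_string(pdftext):
    """Return a PubMed query for the article, or None if nothing usable is found."""
    doi = DOI_RE.search(pdftext)
    if doi is not None:
        return doi.group(1).rstrip(".,;)") + "[AID]"
    # Fall back to the citation line, which is normally on the first page.
    citation = CITATION_RE.search(pdftext[:6000])
    if citation is not None:
        yr, vol, pg = citation.groups()
        return "%s[volume] AND %s[page] AND %s[pdat] AND English[lang]" % (vol, pg, yr)
    return None
```

With the text "J Example 2007; 41: 272-275" this gives "41[volume] AND 272[page] AND 2007[pdat] AND English[lang]". A None result marks the files whose details have to be added by hand.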

For testing, I chose 10 journals comprising the prominent publications in medicine and cardiology and, for each journal, randomly picked one article from Sep 2005 and one from Feb-Mar 2007. Correct information was obtained for all 20 articles. Here is part of the output.

In conclusion, this seems a promising approach to automatically obtain information for each pdf file in my library. This information could be added to the file as extended attributes or used as tags for the file. Like Ruby, Python also has an xattr library, so adding the attributes automatically would be easy. The automatic retrieval will fail for some files, but the information could be added manually in such cases.
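As a rough idea of what writing the attributes could look like, here is a sketch that uses os.setxattr from the standard library (Linux only, Python 3.3 or later) in place of the separate xattr library. The function names and the attribute names such as 'user.title' are arbitrary choices for this example; only the 'user.' prefix is required by Linux for user-defined attributes.

```python
import os


def record_to_xattrs(title, authors, source):
    """Map article details to extended-attribute names and byte values."""
    # Attribute values must be bytes; the names after "user." are made up here.
    return {
        "user.title": title.encode("utf-8"),
        "user.authors": "; ".join(authors).encode("utf-8"),
        "user.source": source.encode("utf-8"),
    }


def tag_file(path, attrs):
    """Write each attribute to the file (needs a file system with xattr support)."""
    for name, value in attrs.items():
        os.setxattr(path, name, value)
```

The attributes written this way can be read back with os.getxattr, or from the shell with getfattr.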


March 4, 2007

The User Interface of the future?

Arguing whether the Command Line Interface (CLI) is superior to the Graphical User Interface (GUI) or vice versa strikes me as a futile exercise. It quickly becomes clear to anyone who has used both that the best way to interact with the computer is to use both methods. A CLI + GUI interface is vastly more powerful than a plain GUI interface. As you use both, you find more and more uses for the former, where it clearly surpasses the GUI. Why, you even start ordering your pizzas from the command line.

However, the difficulty many face in approaching the CLI initially is what Eric Raymond calls the 'mnemonic load'. Don Norman writes that the next major UI breakthrough should be in the CLI. He anticipates the development of a more flexible command line language, with more resemblance to natural language and not requiring strict adherence to an idiosyncratic syntax.

An alternative to an entirely new language for the command line is a user-friendly intermediate layer which translates the user input into the syntax that the command line understands. This is analogous to the common Unix practice of a GUI that provides a friendly front end but actually uses command line tools underneath.

What you see below is a working prototype of such a program written in Python. In its main loop, it collects user input and outputs a command to the terminal. When the input is a valid command line input, it is passed unchanged. But when it is not, it is 'translated' into a valid one. I call it the Genie. Here are a few examples of the genie in action.


The prompt includes a battery status monitor. I find it useful, and it shows that the prompt is easily customizable. 'Normal' shell commands are interpreted directly, like 'ls -l | grep 2007-03' in the example. Navigation to usual places is easy: just 'go home' or 'go desk'. 'space' functions like an alias mapping to 'df -h / /home'. Note the 'genie says' line, which lets you know the command that is passed to the shell, so that a newcomer also learns shell syntax along the way.
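The translation step described here can be pictured with a small lookup. This is a sketch of the idea, not the genie's actual source: the tables and the function name are hypothetical, and the directory behind 'desk' is a guess. Only the 'space' mapping to 'df -h / /home' is taken from the example.

```python
# Hypothetical tables; a real genie would let the user extend them.
PLACES = {"home": "~", "desk": "~/Desktop"}
ALIASES = {"space": "df -h / /home"}


def translate(line):
    """Turn a genie phrase into a shell command; pass anything else through."""
    words = line.split()
    # 'go <place>' becomes a cd to the stored location.
    if len(words) == 2 and words[0] == "go" and words[1] in PLACES:
        return "cd " + PLACES[words[1]]
    # Single-word aliases expand to a full command.
    if line.strip() in ALIASES:
        return ALIASES[line.strip()]
    # Everything else is assumed to be valid shell input already.
    return line
```

Printing the result of translate before running it would give the kind of 'genie says' feedback described above.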


When you want to install an application just 'install '. If the app is found in the apt-cache, installation is started. Otherwise you are allowed to choose from the matches in the apt-cache.
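The decision between installing straight away and offering a choice could be sketched as below. The function name is made up, the list of package names is assumed to have been collected beforehand (for instance from the output of 'apt-cache pkgnames'), and using 'sudo apt-get install' for the installation is a guess at what the genie runs.

```python
def plan_install(app, package_names):
    """Decide what to do with an install request, given the known package names."""
    if app in package_names:
        # Exact match: installation can start straight away.
        return ("install", ["sudo", "apt-get", "install", app])
    # Otherwise offer every package whose name contains the requested text.
    matches = sorted(name for name in package_names if app in name)
    return ("choose", matches)
```

The first element of the result tells the caller whether to run the command or to show the list of matches and ask again.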


Any calculations are automatically recognized and passed on to bc. Finally, a listing of directory names is stored and searchable. So if you want to navigate to the Python site-packages folder and don't remember where it is, genie can help you.
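Both features are easy to picture in a few lines. This is a sketch under assumptions, not the prototype's code: the rule for what counts as a calculation, the function names and the shape of the directory index (a plain list of paths) are all invented for the example.

```python
import re

# Digits, whitespace, arithmetic operators, parentheses and decimal points only.
CALC_RE = re.compile(r"^[\d\s+\-*/^%().]+$")


def is_calculation(line):
    """True if the line looks like plain arithmetic rather than a command."""
    has_digit = any(ch.isdigit() for ch in line)
    has_operator = any(op in line for op in "+-*/^%")
    return bool(CALC_RE.match(line)) and has_digit and has_operator


def to_bc(line):
    """Wrap an arithmetic expression in a shell pipeline that feeds it to bc."""
    return 'echo "%s" | bc -l' % line.strip()


def find_dirs(index, term):
    """Return stored directory paths whose last component contains term."""
    return [path for path in index
            if term in path.rstrip("/").split("/")[-1]]
```

So '2+3*4' would be handed to bc, while 'ls -l' is left alone because it contains letters, and a search for 'site-pack' in the stored listing returns the matching paths to choose from.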

Obviously, the possibilities are almost endless. The genie can be taught to understand new commands as you desire. It may be necessary to be able to carry genie around so that you have your custom genie on any computer you have to use. But finally, will the added ease of use help introduce new users to the command line, or will a simple interface like this keep users from learning shell commands, so that they never make full use of it? Comments are welcome.