Reach beyond Grasp: research

March 17, 2007

Tagging pdf documents with python

Tagging is a common way to organize and search music files, pictures and favourite websites. For documents, the traditional method of searching has been based on indexing the content, and all modern desktop search tools will index your pdf files. But tags have obvious advantages: people go to great pains to add them manually to each file as metadata, or to organize their files in iTunes!

My personal itch is the need to search my collection of journal articles saved as pdf files. The problem with full-text indexing of these files is that the long reference list at the end of each article misleads any attempt to search for a particular author, journal or title. I have struggled with this for some time and looked at applying tags to each file as a solution. The tags could be added as extended attributes in a Linux file system, or as metadata in a Windows environment, and used for searching. Even more useful might be applications that allow tagging and searching by tags; Tracker and leaftag come to mind for this purpose in Linux.

The main hurdle, however, is that adding these tags manually is too tedious. So I experimented with ways to get the information for each pdf file from PubMed. The Biopython module provides a simple interface to the PubMed database from Python, so all one has to do is convert the pdf to text and then parse the text for some information that allows correct identification at PubMed.

The first step is relatively easy. Xpdf provides tools for pdf to text conversion, though I preferred to use beagle's data extraction tool since I already had beagle installed.

import commands
convertcommand = "beagle-extract-content \"" +pdffile +"\""
pdftext = commands.getoutput(convertcommand)
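
The snippet above relies on the Python 2 commands module. As a rough sketch for a Python 3 interpreter, the same step could be done with subprocess and Xpdf's pdftotext, which writes to standard output when the output file is given as "-". The function names and the two-page limit are my own assumptions here, not part of the original script.

```python
import subprocess


def pdftotext_command(pdffile, last_page=2):
    """Build the argument list for Xpdf's pdftotext (hypothetical helper)."""
    # "-l N" stops the conversion after page N; "-" sends the text to stdout.
    return ["pdftotext", "-l", str(last_page), pdffile, "-"]


def pdf_to_text(pdffile, last_page=2):
    """Return the text of the first pages of a pdf file."""
    # Passing a list (no shell) avoids quoting problems with odd file names.
    result = subprocess.run(
        pdftotext_command(pdffile, last_page),
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout
```

Only the first pages are converted because the identifying information is normally printed there; raise last_page if your articles put it elsewhere.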

The second part is more difficult because there is no consistent formatting of contents between different publishers. The best way to do it turned out to be using the doi. The doi, or Digital Object Identifier, is a unique name given to a digital object and is usually printed in the publication. Parsing it was a matter of searching for 'DOI:' or 'doi:' in the text.

doi = pdftext.lower().split('doi:')[1].strip().split(' ')[0]
searchstring = doi +"[AID]"
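
The one-line split above raises an IndexError when the text contains no 'doi:' and keeps any punctuation that trails the identifier. A slightly more defensive sketch with a regular expression might look like the following; the pattern, the function names and the sample identifier are assumptions made for illustration.

```python
import re

# "doi", an optional colon, optional whitespace, then the identifier itself.
# DOIs start with "10.", a registrant number, a slash and a suffix.
DOI_PATTERN = re.compile(r"doi:?\s*(10\.\d{4,9}/\S+)", re.IGNORECASE)


def extract_doi(text):
    """Return the first doi found in the text, or None if there is none."""
    match = DOI_PATTERN.search(text)
    if match is None:
        return None
    # Drop sentence punctuation that may follow the doi in running text.
    return match.group(1).rstrip(".,;)")


def doi_search_term(doi):
    """Build the pubmed search string for a doi."""
    return doi + "[AID]"


sample = "Example Journal 2007. DOI: 10.1234/example.2007.001. Published online"
doi = extract_doi(sample)
if doi is not None:
    print(doi_search_term(doi))
```

Returning None instead of raising makes it easy to move on to a fallback search when the publisher does not print a doi.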

The searchstring is constructed by adding the tag [AID] for Article Identifier. Searching PubMed with this string turns up the PubMed ID (pmid) for the article. This allows retrieval of the formatted article information including title, authors, journal name, year of publication, volume and page numbers.

from Bio import PubMed
from Bio import Medline
rec_parser = Medline.RecordParser()
medline_dict = PubMed.Dictionary(parser = rec_parser)

pmid = PubMed.search_for(searchstring)[0]
record = medline_dict[pmid]

print "title is ", record.title
print "author is ", record.authors
print "source = ", record.source
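
The code above uses the Biopython interface as it was in 2007. If that interface is not available to you, the ID lookup can also be sketched with the standard library alone by calling NCBI's E-utilities ESearch service directly. The helper names are hypothetical, and the ID in the usage note is made up.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

ESEARCH_URL = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"


def esearch_url(searchstring):
    """Build an ESearch URL that asks pubmed for matching IDs as JSON."""
    query = urlencode({"db": "pubmed", "term": searchstring, "retmode": "json"})
    return ESEARCH_URL + "?" + query


def parse_pmids(response_text):
    """Pull the list of pubmed IDs out of an ESearch JSON response."""
    return json.loads(response_text)["esearchresult"]["idlist"]


def search_pubmed(searchstring):
    """Run the search over the network and return the matching pubmed IDs."""
    with urlopen(esearch_url(searchstring)) as response:
        return parse_pmids(response.read().decode("utf-8"))
```

A call such as search_pubmed("10.1234/example.2007.001[AID]") would return a list of ID strings, empty when nothing matches, so check the length before taking the first element.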

Some journals, however, still do not provide the doi in the publication. So we need something to fall back on. I chose to use the article information which is usually published on the first page in a standard format like this - yr;volume:first page - last page (example 2007;41:272-275). There may sometimes be a space after the semicolon or the colon, so the search with regex looks like this.

import re
texttosearch = pdftext[:6000]
# year;volume:first page, with optional spaces after ';' and ':'
pattern = r"([0-9]{4});[ ]*([0-9]+):[ ]*([0-9]+)"
m = re.search(pattern, texttosearch)
(yr, vol, pg) = m.groups()
searchstring = vol + "[volume] AND " + pg + "[page] AND " + yr + "[pdat] AND English[lang]"
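
As a self-contained sketch of this fallback, the regex search can be wrapped in a function that returns None when no citation line is found, so the caller can flag the file for manual tagging. The function name and the choice to join every term with AND are assumptions of this sketch.

```python
import re

# year;volume:first page, allowing spaces after the semicolon and the colon
CITATION = re.compile(r"([0-9]{4});[ ]*([0-9]+):[ ]*([0-9]+)")


def citation_search_term(pdftext, limit=6000):
    """Build a pubmed search string from the citation line, or return None."""
    # Only the start of the text is searched, to stay clear of the
    # reference list at the end of the article.
    match = CITATION.search(pdftext[:limit])
    if match is None:
        return None
    yr, vol, pg = match.groups()
    return "%s[volume] AND %s[page] AND %s[pdat] AND English[lang]" % (vol, pg, yr)


print(citation_search_term("Example Journal 2007;41:272-275"))
print(citation_search_term("Example Journal 2007; 41: 272-275"))
```

Both calls print the same search string, because the capture groups leave out the optional spaces.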

For testing, I chose 10 journals comprising the prominent publications in medicine and cardiology and randomly picked one article from each, first from Sep 2005 and then from Feb-Mar 2007. Correct information was obtained for all 20 articles. Here is part of the output.

In conclusion, this seems a promising approach to automatically obtain information for each pdf file in my library. This information could be added to the file as extended attributes or used as tags for the file. Like Ruby, Python also has an xattr library, so adding the attributes automatically would be easy. The automatic retrieval will fail for some files, but the information could be added manually in such cases.
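
As a rough idea of what that last step could look like, here is a sketch that uses os.setxattr from the Python 3 standard library on Linux instead of a separate xattr library. The attribute names (user.title and so on) and both function names are my own choices for illustration, and the file system has to support user extended attributes.

```python
import os


def tag_attributes(title, authors, source):
    """Map article details to extended attribute names and byte values."""
    # Ordinary users can only write attributes in the "user." namespace.
    return {
        "user.title": title.encode("utf-8"),
        "user.authors": "; ".join(authors).encode("utf-8"),
        "user.source": source.encode("utf-8"),
    }


def write_tags(pdffile, attributes):
    """Store each attribute on the file (Linux only)."""
    for name, value in attributes.items():
        os.setxattr(pdffile, name, value)
```

Keeping the mapping separate from the write makes it possible to inspect the tags for a file before anything is changed on disk.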

February 24, 2007

Hide and show panels with a keyboard shortcut in Ubuntu

I usually keep the default top and bottom panels in gnome but, especially on my laptop, I place great value on the screen real estate that I can get by hiding them. The usual way I do this is by using autohide and then setting the hidden size to 0 or 1. Mostly I don't need the panels because I can launch applications with Alt-F2 or with shortcuts. But when I need to access something from the panel, I have to mouse over the hidden panel to bring it up. The other minor irritation is that the panel may spring out when you don't really want it if your mouse wanders close to it.

In a recent post in the Ubuntu forums, it was suggested that it would be nice to have a keyboard shortcut to show the panels when needed. Since the gconf settings can be read and changed from the command line with gconftool-2, it was easy to write a script to toggle the hide status of the panels. Here is a short how-to if someone is interested.

Copy the script below, save it as /home/<username>/.toggle.sh and make the file executable:

chmod +x ~/.toggle.sh

#!/bin/bash

# find the current state of the top panel
state=$(gconftool-2 --get "/apps/panel/toplevels/top_panel_screen0/auto_hide")

# if autohide is on, turn it off for both panels
if [ "$state" = "true" ]; then
    gconftool-2 --set "/apps/panel/toplevels/top_panel_screen0/auto_hide" --type bool "false"
    gconftool-2 --set "/apps/panel/toplevels/bottom_panel_screen0/auto_hide" --type bool "false"
fi

# if autohide is off, turn it on for both panels
if [ "$state" = "false" ]; then
    gconftool-2 --set "/apps/panel/toplevels/top_panel_screen0/auto_hide" --type bool "true"
    gconftool-2 --set "/apps/panel/toplevels/bottom_panel_screen0/auto_hide" --type bool "true"
fi

Now open gconf-editor and, in /apps/metacity/keybinding_commands, change the value of command_1 (or any other unused command) to /home/<username>/.toggle.sh.

Then go to /apps/metacity/global_keybindings and change the value of run_command_1 (if you mapped the script to command_1) to <Control>F12 or any other key combination you choose. Close gconf-editor and try it out! I tested this with both metacity and beryl and it works perfectly.

February 17, 2007

Keeping up with literature is now easy. Just read the feed !

Everyone in medicine knows very well the difficulty of keeping up with current literature. More than 6 million articles are published each year and even the fraction of these that an individual physician has to read can be overwhelming. Of course, today I don't need to go down to the library to scan through the recently published articles. But even looking at each journal's website for the abstract of the latest articles can be a daunting task. In this article I will show you how I use RSS today to keep up with the journals I am interested in.

Step 1: Pick an RSS reader to use. I use Google Reader and therefore use it in my examples, but there is a wealth of RSS readers for you to choose from.

Step 2: Subscribe to RSS feeds from the websites of journals that offer them. For example, this shows the RSS link from Heart Online. All you have to do is copy the link and add it as a subscription in Google Reader.

Step 3: For the many journals that still do not provide RSS, you can set up an RSS feed of their contents from PubMed. Set up a search with the limits set to the journal of your interest. Then select 'send to RSS' from the dropdown menu to get the link.

Now Google Reader (or any other reader you choose to use) becomes a central location for you to review the titles and abstracts of the latest publications in the journals you have chosen. Note how I have organized the journals under one folder in Google Reader. I also mark items I want to read in detail later with a star. Using Google Reader and RSS has given me an immense advantage in keeping abreast of the latest literature in my field of interest. Try it out and let me know how you find it!
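
If you would rather skim the titles from a script than from a reader, a feed is just XML. The sketch below assumes a plain RSS 2.0 document (RSS 1.0 feeds use XML namespaces and would need different lookups), and the sample feed is invented for the example.

```python
import xml.etree.ElementTree as ET


def feed_items(rss_text):
    """Return (title, link) pairs for every item in an RSS 2.0 document."""
    root = ET.fromstring(rss_text)
    items = []
    for item in root.iter("item"):
        title = item.findtext("title", default="").strip()
        link = item.findtext("link", default="").strip()
        items.append((title, link))
    return items


# Invented sample feed, standing in for a journal's real RSS document.
SAMPLE = """<rss version="2.0"><channel>
<title>Example journal</title>
<item><title>First article</title><link>http://example.org/1</link></item>
<item><title>Second article</title><link>http://example.org/2</link></item>
</channel></rss>"""

for title, link in feed_items(SAMPLE):
    print(title, link)
```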

More Info:
Creating RSS for a feedless journal
RSS from pubmed search
