I accumulate a lot of PDF documents; at last glance, a few thousand files. Unfortunately, most of them have rather obscure names, which makes it annoying to find what I want, or what is interesting. For example, these are the documents I’ve recently downloaded:
wodet3-paper12.pdf
jong_afst.pdf
tut_gpu_2012_03.pdf
lecture1-1.pdf
natella_binary_sfi_edcc_2012.pdf
TR-Farrukh-58.pdf
730959.pdf
NLSEmagic_Paper.pdf
M23584378H1770Q2.pdf
G89T37P10W263075.pdf
journal_online.pdf
manus_Jour-INFORMATION-Camera.pdf
12011.VitekJan.Paper.pdf
R3X8722476T2X278.pdf
1203.0321.pdf
I previously tried to organize everything using something like Papers, which is a lovely product, but it still required effort from me and isn’t very useful now that I no longer have a Mac.
I’ve also made half-hearted attempts to rectify this situation by running pdftotext and grabbing the first 10 words of text, but more often than not I was left with more incomprehensible garbage.
Today I had some spare time and far too much interest in this problem, and I managed to come up with an easy and fairly effective solution. It also resembles a Rube Goldberg machine. After digging around for various PDF conversion utilities, I discovered that pdftohtml not only generates reasonable output, but can also be set to output an easily parsed XML format. From there it was a simple bit of BeautifulSoup to get nice titles for most of my documents:
[gist id=2078056]
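
The gist above has the actual script. As a rough illustration of the idea, here is a minimal sketch that is not the gist's code. It rests on several assumptions: the title heuristic (take whatever text on the first page is set in the largest font) is just one plausible choice, the function names `guess_title` and `pdf_to_xml` are made up for this example, and it uses the standard library's `xml.etree.ElementTree` instead of BeautifulSoup to avoid a dependency. ElementTree is stricter than BeautifulSoup and will raise on malformed XML, so real-world output may need the more forgiving parser. The `pdftohtml` flags are the ones I believe poppler's version accepts; check `pdftohtml -h` on your system.

```python
import subprocess
import xml.etree.ElementTree as ET


def guess_title(xml_text):
    """Return the first-page text set in the largest font, or None.

    Assumes pdftohtml-style XML: <fontspec id=".." size=".."/> elements
    declare fonts, and <text font=".."> elements carry the page text.
    """
    root = ET.fromstring(xml_text)
    page = root.find("page")  # first <page> element only
    if page is None:
        return None

    # Map font id -> size. Fonts are collected from the whole document
    # in case a font used on page one is declared elsewhere.
    sizes = {f.get("id"): float(f.get("size", 0)) for f in root.iter("fontspec")}

    lines = []
    for node in page.iter("text"):
        # itertext() also picks up text inside nested <b>/<i> tags;
        # split/join collapses runs of whitespace.
        text = " ".join("".join(node.itertext()).split())
        if text:
            lines.append((sizes.get(node.get("font"), 0.0), text))
    if not lines:
        return None

    # Heuristic (an assumption): the title is everything in the largest font.
    largest = max(size for size, _ in lines)
    return " ".join(text for size, text in lines if size == largest)


def pdf_to_xml(path):
    """Run pdftohtml on the first page of `path` and return the XML as a string."""
    result = subprocess.run(
        ["pdftohtml", "-xml", "-i", "-stdout", "-l", "1", path],
        capture_output=True,
        check=True,
    )
    return result.stdout.decode("utf-8", errors="replace")
```

Something like `guess_title(pdf_to_xml("1203.0321.pdf"))` would then give a candidate title for one file. The largest-font rule can be fooled by journal banners or decorative headers, so it seems wise to look over the suggested names before renaming anything.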