Running a PDF crawler with Heritrix

I’ve used the Heritrix web crawler quite a few times in the past.  It’s a great piece of software, and has enough features to handle most crawling tasks with ease.
Recently, I wanted to crawl a whole bunch of PDFs, and since I didn’t know where the PDFs were going to come from, Heritrix seemed like a natural fit to help me out.  I’ll go over some of the less intuitive steps:

Download the right version of the crawler

That is to say, version 1.*.  Version 2 seems to have been dropped, and version 3 does not yet have all of the features from version 1 implemented (not to mention, the user interface seems to have gone downhill).

For your convenience, here’s a link to the download page.

Make sure you’re rejecting almost everything

You almost certainly don’t want all the web has to offer.  You only want a tiny fraction of it.  For instance, I use a MatchesRegexpDecideRule to drop any media content with the following expression:

.*\.(jpg|jpeg|gif|png|mpg|mpeg|txt|css|js|ppt|JPG|tar\.gz|flv|MPG|zip|exe|avi|tvd)$
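A quick way to sanity-check a rule before a long crawl is to replay it against a few URLs. This is a Python sketch (Heritrix itself uses Java's regex engine, but `re.fullmatch` gives the same whole-string semantics for these patterns); the URLs are made up for illustration. Note that the dots before the extensions need escaping, since a bare `.` matches any character:

```python
import re

# Media-reject rule with the dots escaped; an unescaped "." before the
# extension group would also match URLs like "http://example.com/Xjpg".
MEDIA_REJECT = re.compile(
    r".*\.(jpg|jpeg|gif|png|mpg|mpeg|txt|css|js|ppt|JPG|tar\.gz"
    r"|flv|MPG|zip|exe|avi|tvd)$"
)

# fullmatch mirrors Heritrix's whole-URL matching behavior.
print(bool(MEDIA_REJECT.fullmatch("http://example.com/banner.jpg")))  # rejected
print(bool(MEDIA_REJECT.fullmatch("http://example.com/paper.pdf")))   # kept
```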

Similarly, you’ll want to drop pesky calendar-like applications:

.*(calendar|/api|lecture).*

And any dynamic pages that want to suck up your bandwidth:

.*\?.*

Save only what you need

Heritrix has a nice property of allowing decision rules to be placed almost anywhere, including just before a file gets written to disk. To avoid writing files you’re uninterested in, you can request that only certain mimetypes are allowed through: add a default reject rule, and then accept only the files you want, in my case PDF or PostScript files:

.*(pdf|postscript).*
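The effect of the default-reject plus accept rule can be sketched like this (Python, with illustrative Content-Type values; the actual check in Heritrix is configured on the writer processor):

```python
import re

# Accept rule applied to the response mimetype; everything else is
# rejected by default, so only PDF and PostScript responses get written.
MIME_ACCEPT = re.compile(r".*(pdf|postscript).*")

for mime in ("application/pdf", "application/postscript", "text/html"):
    verdict = "write" if MIME_ACCEPT.fullmatch(mime) else "skip"
    print(mime, "->", verdict)
```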

Regular expressions are full, not partial matches

You need to ensure your regular expression matches the entire item, not just part of it. This means prepending and appending

.*

to your normal patterns.
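This trips people up because most regex tooling searches for partial matches by default. In full-match semantics, a bare pattern like `pdf` matches nothing, since the whole URL has to match (Python sketch using `re.fullmatch`, which behaves like Heritrix's matching):

```python
import re

url = "http://example.com/thesis.pdf"

# A bare pattern fails: the *entire* URL must match the expression.
print(bool(re.fullmatch(r"pdf", url)))      # no match

# Wrapping it in ".*" on both sides makes it match anywhere in the URL.
print(bool(re.fullmatch(r".*pdf.*", url)))  # matches
```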

If you’re feeling lazy, you can download the crawl order I used and use it as a base for your crawl. Good luck!
