I’ve been using the pdftohtml tool recently to convert PDF documents into a convenient XML form. Unfortunately, about 10% of the time, the output XML isn’t quite XML and can’t be parsed (normally it’s the result of some stray HTML tag that’s been left in to cause trouble).
Initially, I was just catching these errors and tossing the documents, but that was throwing out a lot of good with the bad. The tagsoup library provides an easy way around this: you can plug it into the normal Scala XML framework, and voilà, all your parsing issues go away. (Well, you might end up with some crazy malformed document trees, but it’s a lossy business.)
It’s as simple as adding the tagsoup dependency:
"org.ccil.cowan.tagsoup" % "tagsoup" % "1.2.1"
and then swapping the default parser setup for:
val parser = new org.ccil.cowan.tagsoup.jaxp.SAXFactoryImpl().newSAXParser()
val adapter = new scala.xml.parsing.NoBindingFactoryAdapter
adapter.loadXML(Source.fromString(stripDtd(document)), parser)
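The `stripDtd` helper above isn’t shown; a minimal sketch of what it might look like, assuming the goal is just to drop any DOCTYPE declaration so the parser doesn’t try to fetch or validate against the DTD:

```scala
// Hypothetical helper (not part of tagsoup or scala-xml): strips a
// leading DOCTYPE declaration, case-insensitively, before parsing.
def stripDtd(document: String): String =
  document.replaceAll("(?is)<!DOCTYPE[^>]*>", "")
```

Anything along these lines will do; the point is only that the DTD reference is gone before the string reaches the parser.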
And that’s it! You’ve now gone from accepting correct XML to accepting damn-near anything. Others might be inclined to call this a bad thing, but at the same time, you have to work with what you’re given. And given the choice between some slightly funky XML and the pain of understanding PDFs directly, I’ll take the quasi-XML any day.