Mountain View

I’m in Mountain View for the summer for an MSR internship. I lived in the Bay Area for 5 years, but somehow I had forgotten how far apart everything is here.

A fruit fly drowned itself in my cup of coffee this morning. Hopefully that wasn’t an omen, just a morbid insect.

backtracking parsers

This may be obvious in retrospect, but I was having some issues with my citation parser – it wasn’t backtracking through disjunctions properly:

lazy val authors: PackratParser[Authors] = (
      author ~ "," ~ authors |
      author ~ "and" ~ author <~ endOfAuthors |
      author ~ "," ~ etal <~ endOfAuthors |
      author ~ etal <~ endOfAuthors
)

It turns out that it’s a bad idea to mix Packrat and normal parsers if you don’t know what you’re doing (aka, if you’re me). I had a parser definition for names that I was using:

lazy val word: Parser[String] = "[\\p{L}0-9]+".r

Simply forcing this to be a PackratParser cleared things up:

lazy val word: PackratParser[String] = regex("[\\p{L}0-9]+".r)
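
For reference, here is a minimal, self-contained sketch of a grammar where every rule is a PackratParser, so the disjunction can backtrack from the "," branch to the "and" branch. It is a simplification, not my actual citation parser: the two-word author rule, the List[String] result and the names AuthorListParser and parseAuthors are all made up for illustration, and on newer Scala versions it assumes the scala-parser-combinators module is on the classpath.

```scala
import scala.util.parsing.combinator.{PackratParsers, RegexParsers}

// Hypothetical, simplified grammar: an author is exactly two words.
object AuthorListParser extends RegexParsers with PackratParsers {
  // Every rule is declared as a PackratParser, including the leaf regex.
  lazy val word: PackratParser[String] = regex("[\\p{L}0-9]+".r)

  lazy val author: PackratParser[String] =
    word ~ word ^^ { case first ~ last => first + " " + last }

  // If the "," branch fails, the parser backtracks and tries the "and" branch.
  lazy val authors: PackratParser[List[String]] = (
      author ~ ("," ~> authors) ^^ { case a ~ rest => a :: rest }
    | author ~ ("and" ~> author) ^^ { case a ~ b => List(a, b) }
    | author ^^ { a => List(a) }
  )

  // Returns None when the whole input cannot be parsed as an author list.
  def parseAuthors(input: String): Option[List[String]] =
    parseAll(authors, input) match {
      case Success(result, _) => Some(result)
      case _                  => None
    }
}
```

Calling AuthorListParser.parseAuthors("Jane Doe, John Smith and Ann Lee") walks the "," branch once, fails on it for the second author, and falls through to the "and" branch.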

version numbers

Ah, Hadoop-land, where version numbers signify much more than a version:

  • 1.0.X – current stable version, 1.0 release
  • 1.1.X – current beta version, 1.1 release
  • 0.23.X – current alpha version, MR2
  • 0.22.X – does not include security
  • 0.20.203.X – legacy stable version
  • 0.20.X – legacy version

In other words: 0.20 is legacy, 0.23 is alpha, 0.22 is (legacy?) without security, 1.0 is stable, and 1.1 is beta.

scala type woes

You can sometimes encounter compiler errors like:

not enough arguments for method toArray: (implicit evidence$1: ClassManifest[B])Array[B].
[error] Unspecified value parameter evidence$1.
[error] lazy val randomKeys = shuffle(myCollection.toArray)
[error] ^

in relation to code like this:

def shuffle[T](v : Array[T]) : Array[T] = {...}

The error is a bit misleading here – what the compiler is actually trying to say is that it doesn’t have enough information to infer the element type of the array being passed to shuffle, so you have to help it with an explicit type annotation.
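
A minimal sketch of the kind of annotation that helps. The Set[String] collection, the ShuffleSketch wrapper and the shuffle body (a plain Fisher-Yates shuffle on a copy) are assumptions for illustration, not my actual code:

```scala
import scala.util.Random

object ShuffleSketch {
  // Hypothetical shuffle body: Fisher-Yates on a copy of the input array.
  def shuffle[T](v: Array[T]): Array[T] = {
    val out = v.clone()
    for (i <- out.length - 1 to 1 by -1) {
      val j = Random.nextInt(i + 1)
      val tmp = out(i)
      out(i) = out(j)
      out(j) = tmp
    }
    out
  }

  // Hypothetical collection standing in for myCollection.
  val myCollection: Set[String] = Set("a", "b", "c")

  // Option 1: name the element type on toArray.
  lazy val randomKeys: Array[String] = shuffle(myCollection.toArray[String])

  // Option 2: name the type parameter on shuffle instead.
  lazy val randomKeysAlt: Array[String] = shuffle[String](myCollection.toArray)
}
```

Either form pins down the element type before the compiler has to look for the implicit evidence.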


To the best of my understanding, the compiler can’t infer the proper types here because of how the JVM handles parameterized types (via erasure): building an Array requires knowing the element type at runtime, which is what the ClassManifest evidence supplies. The explicit type annotation isn’t necessary if, for instance, shuffle is defined to take a Scala List[T]:

def shuffle[T](v : List[T]) : List[T] = {...}

In this case no ClassManifest is involved at all, as far as I can tell: a List is an ordinary generic class, so toList can build one without knowing the element type at runtime.

parsing (bad) xml in scala

I’ve been using the pdftohtml tool recently to convert PDF documents into a convenient XML form. Unfortunately, about 10% of the time, the output XML isn’t quite XML and can’t be parsed (normally it’s the result of some kind of HTML tag that’s been left to cause trouble).

Initially, I was just catching these errors and tossing the documents, but that was throwing out a lot of good with the bad. The tagsoup library provides an easy way around this: you can plug it into the normal Scala XML framework, and voilà, all your parsing issues go away. (Well, you might end up with crazy malformed document trees, but it’s a lossy business.)

It’s as simple as adding the tagsoup dependency:

"org.ccil.cowan.tagsoup" % "tagsoup" % "1.2.1"

and then changing your parsing code to:

  import scala.xml.Source

  // stripDtd is a helper defined elsewhere (not shown).
  val parser = new org.ccil.cowan.tagsoup.jaxp.SAXFactoryImpl().newSAXParser()
  val adapter = new scala.xml.parsing.NoBindingFactoryAdapter
  adapter.loadXML(Source.fromString(stripDtd(document)), parser)

And that’s it! You’ve now gone from accepting correct XML to accepting damn-near anything. Others might be inclined to call this a bad thing, but at the same time, you have to work with what you’re given. And given the choice between some slightly funky XML and the pains of understanding PDFs directly, I’ll take the quasi-XML any day.

sending large messages

Somehow it took me a long time to find this, even though it’s listed in the Akka configuration reference quite plainly.

If you send messages with large payloads and see TooLongFrameException errors like this:

Error[org.jboss.netty.handler.codec.frame.TooLongFrameException:Adjusted frame length exceeds 1048576: 1292205 - discarded

you can raise the maximum frame size from its 1 MiB default using the following config:

akka {
    remote.netty.message-frame-size = 100 MiB
}

It would be really nice if the error message dumped out when a frame is overfilled mentioned this…

distributed workers with akka

I’ve been playing with the Akka framework for Scala recently, and have found it mostly enjoyable, despite a few kinks and quirks. I find it doesn’t take too much time for me to wrap my head around the usage, which implies that it must be very well written indeed :).

I did run into one issue recently with respect to remoting – I wanted to start a number of remote workers and have them bind on their local address, but the documentation and how-tos generally refer to running on a single machine. It turns out it’s fairly simple – just use an empty hostname in your application.conf:

akka.remote {
  transport = "akka.remote.netty.NettyRemoteTransport"
  netty {
    hostname = ""
    port = 9999
  }
}

Now your servers will bind to the wildcard (*) address. When addressing them, you have to use the IP address directly; unfortunately, they won’t respond to hostname-based addresses, e.g.:

val addr = "akka://SYSTEM@IP_ADDRESS:9999/user/search-worker" // works
val addr = "akka://SYSTEM@HOSTNAME:9999/user/search-worker" // refuses messages
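
If you only have hostnames for your workers, one way around this is to resolve them to IP addresses yourself before building the actor address. A minimal sketch using only java.net.InetAddress; the WorkerAddress object, the actorPath helper and the "workers" system name are hypothetical, not part of Akka:

```scala
import java.net.InetAddress

object WorkerAddress {
  // Resolve the host to an IP address first, then build the actor address
  // string from it. An IP literal is passed through unchanged.
  def actorPath(system: String, host: String, port: Int, path: String): String = {
    val ip = InetAddress.getByName(host).getHostAddress
    "akka://%s@%s:%d/%s".format(system, ip, port, path)
  }
}
```

For example, WorkerAddress.actorPath("workers", someHost, 9999, "user/search-worker") gives an address string in the form that worked for me above.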

yet another start script plugin for sbt

My mental powers aren’t what they used to be, or at least that’s what I start feeling like when I try to use sbt for anything more than the compile step. It’s either breathtakingly brilliant or incredibly obfuscated, or both. I’m really not sure yet.

In any case, I wanted to generate scripts to launch a few simple tasks from my project. My solution, in case it’s useful:

import sbt._
import Keys._

object GenerateScripts extends Plugin {
  val scriptClasses = SettingKey[List[String]](
    "script-classes", "List of classes to generate start scripts for.")

  val genScripts = TaskKey[Unit](
    "generate-scripts", "Generate start scripts.")

  // Assumed minimal template: the first %s is the classpath, the second the main class.
  val scriptTemplate = """#!/bin/sh
java -cp "%s" %s "$@"
"""

  def genScriptsTask = (streams, target, scriptClasses, fullClasspath in Runtime) map {
    (stream, target, scripts, cp) => {
      val log = stream.log
      for (f <- scripts) {
        // Name each script after the unqualified class name.
        val scriptName = f.split('.').last
        val targetFile = (target / scriptName).asFile
        val classPath = cp.map(_.data).mkString(":")
        log.info("Generating script for %s".format(f))
        IO.write(targetFile, scriptTemplate.format(classPath, f))
      }
    }
  }

  override val settings = Seq(
    scriptClasses := List(),
    genScripts <<= genScriptsTask
  )
}

Add this to your project/Build.scala, and add:

scriptClasses := List("", "")

to your build.sbt.

If someone would like to explain to me the difference between TaskKeys and Commands, as well as how to add this as a dependency of the compile step, I’d be much obliged.