scala type woes

You can sometimes encounter errors like:

not enough arguments for method toArray: (implicit evidence$1: ClassManifest[B])Array[B].
[error] Unspecified value parameter evidence$1.
[error] lazy val randomKeys = shuffle(myCollection.toArray)
[error] ^

in relation to code like this:

def shuffle[T](v : Array[T]) : Array[T] = {...}
shuffle(myCollection.toArray)

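The error is a bit misleading here. What the compiler is actually saying is that toArray needs an implicit ClassManifest for its element type B, and because shuffle’s own type parameter T hasn’t been fixed yet, there is nothing to pin B to. So you have to help it by naming the element type explicitly (MyCollection stands in for that type here):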

shuffle[MyCollection](myCollection.toArray)

To the best of my understanding, this comes down to how the JVM handles parameterized types. Generic types are erased, but arrays are not: building an Array[B] requires the runtime class of B, which is what the implicit ClassManifest supplies. The explicit type annotation isn’t necessary if, for instance, shuffle is defined to take a Scala List[T]:

def shuffle[T](v : List[T]) : List[T] = {...}
shuffle(myCollection.toList)

In this case, toList doesn’t need a ClassManifest at all: List[T] is erased, so a list can be built without knowing the runtime class of its elements.
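On newer Scala versions (2.10 and later), ClassManifest was deprecated in favour of scala.reflect.ClassTag, but the idea is the same. Here is a minimal sketch of the Array variant with the explicit type argument. The shuffle body is a placeholder built on scala.util.Random (the original body is elided), and ShuffleSketch is just a wrapper name for illustration:

```scala
import scala.reflect.ClassTag
import scala.util.Random

object ShuffleSketch {
  // The ClassTag context bound carries the runtime element class,
  // which is what allows a fresh Array[T] to be built at the end.
  def shuffle[T: ClassTag](v: Array[T]): Array[T] =
    Random.shuffle(v.toSeq).toArray

  // Naming the element type explicitly, as in the workaround above.
  val shuffled: Array[Int] = shuffle[Int](List(1, 2, 3, 4).toArray)
}
```

With the element type spelled out, toArray knows which ClassTag to look for and the call compiles.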

parsing (bad) xml in scala

I’ve been using the pdftohtml tool recently to convert PDF documents into a convenient XML form. Unfortunately, about 10% of the time, the output XML isn’t quite XML and can’t be parsed (normally it’s the result of some kind of HTML tag that’s been left to cause trouble).

Initially, I was just catching these errors and tossing the documents, but that was throwing out a lot of good with the bad. The tagsoup library provides an easy way around this: you can plug it into the normal Scala XML framework, and voilà, all your parsing issues go away. (Well, you might end up with crazy malformed document trees, but it’s a lossy business.)

It’s as simple as adding the tagsoup dependency:

"org.ccil.cowan.tagsoup" % "tagsoup" % "1.2.1"

and then changing from:

XML.loadString(myDocument)

to

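  import scala.xml.Source  // scala.xml.Source, not scala.io.Source
  // stripDtd is a helper defined elsewhere (not shown)
  val parser = new org.ccil.cowan.tagsoup.jaxp.SAXFactoryImpl().newSAXParser()
  val adapter = new scala.xml.parsing.NoBindingFactoryAdapter
  adapter.loadXML(Source.fromString(stripDtd(myDocument)), parser)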

And that’s it! You’ve now gone from accepting correct XML to accepting damn-near anything. Others might be inclined to call this a bad thing, but at the same time, you have to work with what you’re given. And given the choice between some slightly funky XML and the pains of understanding PDFs directly, I’ll take the quasi-XML any day.
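The snippet above calls a stripDtd helper without showing it. The following is only a guess at what such a helper might look like, not the original: a minimal sketch that drops the first DOCTYPE declaration, assuming it has no internal subset (no "[ ... ]" part). DtdSketch is a hypothetical wrapper name.

```scala
object DtdSketch {
  // Matches a DOCTYPE declaration up to the first '>', case-insensitively.
  // The negated character class also spans line breaks.
  // Assumption: no internal subset, which would contain '>' characters of its own.
  private val Doctype = "(?i)<!DOCTYPE[^>]*>".r

  def stripDtd(xml: String): String = Doctype.replaceFirstIn(xml, "")
}
```

Input without a DOCTYPE passes through unchanged, so the helper is safe to apply to every document.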

distributed workers with akka

I’ve been playing with the Akka framework for Scala recently, and have found it mostly enjoyable, despite a few kinks and quirks. I find it doesn’t take too much time for me to wrap my head around the usage, which implies that it must be very well written indeed :).

I did run into one issue recently with respect to remoting: I wanted to start a number of remote workers and have them bind on their local address, but the documentation and how-tos generally refer to running on a single machine. It turns out it’s fairly simple. Just use an empty hostname in your application.conf:

akka.remote {
  transport = "akka.remote.netty.NettyRemoteTransport"
  netty {
    hostname = ""
    port = 9999
  }
}

Now your servers will bind to the wildcard address (0.0.0.0). When addressing them, you have to use the IP address directly; unfortunately, they won’t respond to hostname-based addresses. For example:

      
val addrByIp = "akka://<system>@<ip-address>:9999/user/search-worker" // works
val addrByHost = "akka://<system>@<hostname>:9999/user/search-worker" // refuses messages
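Since the workers only answer to IP-based addresses, one option is to resolve the hostname yourself before building the actor path. This is a minimal sketch using java.net.InetAddress; workerAddress, WorkerAddressSketch and the "search" system name in the usage note are hypothetical, and IPv6 addresses (which would need brackets) are not handled:

```scala
import java.net.InetAddress

object WorkerAddressSketch {
  // Resolves `host` to an IP address and builds an actor path from it.
  def workerAddress(system: String, host: String, port: Int, actorPath: String): String = {
    val ip = InetAddress.getByName(host).getHostAddress
    s"akka://$system@$ip:$port/$actorPath"
  }
}
```

For example, WorkerAddressSketch.workerAddress("search", "127.0.0.1", 9999, "user/search-worker") yields an address in the "works" form above.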

yet another start script plugin for sbt

My mental powers aren’t what they used to be, or at least that’s what I start feeling like when I try to use sbt for anything more than the compile step. It’s either breathtakingly brilliant or incredibly obfuscated, or both. I’m really not sure yet.

In any case, I wanted to generate scripts to launch a few simple tasks from my project. My solution, in case it’s useful:

import sbt._
import Keys._

object GenerateScripts extends Plugin {
val scriptClasses = SettingKey[List[String]](
  "script-classes", "List of classes to generate start scripts for.")

val genScripts = TaskKey[Unit](
  "generate-scripts", "Generate start scripts."
)

// The shebang has to be the very first line of the generated file.
val scriptTemplate = """#!/bin/bash

CLASSPATH="%s"
MAINCLASS="%s"
java -cp "$CLASSPATH" "$MAINCLASS"
"""

def genScriptsTask = (streams, target, scriptClasses, fullClasspath in Runtime) map {
(stream, target, scripts, cp) => {
  val log = stream.log
  for (f <- scripts) {
    val scriptName = f.split('.').last
    val targetFile = (target / scriptName).asFile
    val classPath = cp.map(_.data).mkString(":")
    log.info("Generating script for %s".format(f))
    IO.write(targetFile, scriptTemplate.format(classPath, f))
    targetFile.setExecutable(true)
  }
}
}

override val settings = Seq(
  scriptClasses := List(),
  genScripts <<= genScriptsTask
)
 
}

Add this to your project/Build.scala, and add:

scriptClasses := List("a.b.foo", "b.c.bar")

to your build.sbt.
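The string formatting inside genScriptsTask can be tried out without sbt. This is a minimal sketch of that step; renderScript and ScriptTemplateSketch are hypothetical names that are not part of the plugin, and the classpath entries are plain strings rather than sbt’s attributed files:

```scala
object ScriptTemplateSketch {
  // The plugin's template, written so the shebang lands on the very first line.
  val scriptTemplate: String =
    """#!/bin/bash
      |
      |CLASSPATH="%s"
      |MAINCLASS="%s"
      |java -cp "$CLASSPATH" "$MAINCLASS"
      |""".stripMargin

  // Joins the classpath entries with ':' and fills both placeholders.
  def renderScript(classPath: Seq[String], mainClass: String): String =
    scriptTemplate.format(classPath.mkString(":"), mainClass)
}
```

Calling ScriptTemplateSketch.renderScript(Seq("a.jar", "b.jar"), "a.b.foo") gives the text that would be written to target/foo.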

If someone would like to explain to me the difference between TaskKeys and Commands, as well as how to add this as a dependency of the compile step, I’d be much obliged.

experiences in serialization

The short of it: use Jerkson — you can easily serialize case classes:

import com.codahale.jerkson.Json

case class Author(name: String)
case class Document(title: String, data: String, authors: List[Author])
...
val encoded = Json.generate(doc)
val decoded = Json.parse[Document](encoded)
decoded should equal (doc)

I’ve been working some more with Scala, and found that I needed to serialize some data (I’m working with Hadoop).

Unfortunately, as with most things in life, I’ve been presented with all too many choices:

I first tried Avro. Given that it’s part of the Hadoop project, I thought it would be the most seamless way of getting things working.

Sadly, this was not to be. Avro-generated classes do not implement Hadoop’s Writable interface, which would have allowed them to be dropped in. Instead, you’re required to change all of your mappers/reducers to take in AvroKey/AvroValue-wrapped items, and to set your output/input via the AvroInput/OutputFormats. This, while tedious, would be fine, except that I hit a Scala compiler bug when trying to get it all working.

My other thought was to simply convert the Avro object instances into strings myself, and then output strings from Hadoop. Hacky, but hey, it would work.

Except: After digging through the Avro documentation I couldn’t find a way of just turning my Avro structures into a serialized string. I could send them off to another server via an RPC, but dumping them to a file myself was out of the question. Sigh.

A note to serialization designers: please, please, please — give me an easy way to turn an object into a string.

I then started puttering around with the various JSON projects for Scala. Since there is no standard way of doing it, there are a lot of cobbled-together options that I had to try before finding one that worked. Jerkson, despite the odd name, “just works”. I specify my objects as case classes, and like magic, they can be serialized.

So now I’ve gone from outputting nice typed structures from Hadoop to just dumping strings and interpreting them myself. But I’m okay with that – it works.

the buddhist nature of swimming

I’ve been reading The Empty Mirror. It’s a fascinating little book about a Dutchman and a year he spent in a Japanese Zen monastery a few years after WWII. It’s a really quick read, and I recommend it – I just happened upon it in the local used bookstore, but I think it’s worth the Amazon price too.

The book, as expected, focuses on the author’s time in the monastery, his interactions with the monks, and the trials, stresses and achievements he gets out of the whole thing. Since I’ve also been swimming more frequently these days, it was natural for me to think about how the activities are somewhat related. In some sense, swimming is my form of meditation: it’s an activity where you can completely empty your mind. It’s especially true when you’re working hard and the pain from your muscles wipes everything else out.

I’ve never really done an extensive survey on this, but from personal experience, swimmers are pretty mellow people. Maybe it’s due to their extensive Buddhist training? I should arrange for a conference between some monks and swimmers to investigate more. But first I should probably practice meditating (or swimming) some more.

fun with shared libraries — version GLIBC_2.14 not found

The shared library system for Linux, especially when it comes to libc, is (choose one): archaic|complex|awesome. Today, in trying to run some software on our local cluster, I encountered this lovely error:

/lib/x86_64-linux-gnu/libc.so.6: version `GLIBC_2.14' not found

Why was this occurring? For some reason, glibc 2.14 bumped the symbol version of the memcpy function. How memcpy’s interface could possibly have changed in a non-backward-compatible fashion, I don’t know. Still, since my local machine has a later libc than our cluster machines, I’m effectively boned.

There is a way around this, namely, I can ask the compiler to emit references to a specific, older version:

__asm__(".symver memcpy,memcpy@GLIBC_2.2.5");

Unfortunately, this directive has to appear before any other use of memcpy! Rather than update a whole lot of code, I took advantage of gcc’s -include option. I simply created a file with my desired symbol override:

// gcc-preinclude.h

__asm__(".symver memcpy,memcpy@GLIBC_2.2.5");

and forced the build (here, autotools) to always include that before any other code:

./configure CC="gcc -include /path/to/gcc-preinclude.h" ...

And away we go. A full rebuild and I’m back to a usable program again. Hurray!
