email habits over time

I was curious if my sleeping/waking habits had really changed over the years – I definitely don’t feel I work as late now as when I was 22, but it’s hard to tell. To test this, I looked over all of the timestamps of mail I’ve sent in the past few years and tried to make a pretty graph.

I’m not sure how meaningful it is, but thanks to ggplot, it is pretty, at least.

The plotting code is straightforward, so try it out!
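For the data-gathering half, here is a rough Python sketch of how the raw `Date:` headers of sent mail could be turned into time-of-day values grouped by year, ready to hand to a plotting library. It is not the plotting code mentioned above; the function name `hours_by_year` and the idea of feeding it a plain list of header strings are assumptions made for illustration.

```python
from collections import defaultdict
from email.utils import parsedate_to_datetime


def hours_by_year(date_headers):
    """Group send times (as fractional hours, in the sender's own UTC offset) by year."""
    by_year = defaultdict(list)
    for raw in date_headers:
        try:
            sent = parsedate_to_datetime(raw)
        except (TypeError, ValueError):
            # Unparseable header; the exception type depends on the Python version.
            continue
        by_year[sent.year].append(sent.hour + sent.minute / 60)
    return dict(by_year)


if __name__ == "__main__":
    sample = [
        "Mon, 12 Mar 2012 23:30:00 -0700",
        "Tue, 05 Jan 2010 02:15:00 -0800",
    ]
    for year, hours in sorted(hours_by_year(sample).items()):
        print(year, hours)
```

Keeping the offset from the header (rather than converting to UTC) matters here, since the question is what the clock on the wall said when the mail was sent.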

robust pdf title extraction

I end up with a lot of PDF documents lying around – at last glance, this amounted to a few thousand files. Unfortunately, most of these documents end up with rather obscure names, which makes it annoying to find what I want, or what is interesting. For example, these are the documents I’ve recently downloaded:


I previously tried to organize everything using something like Papers, which is a lovely product, but still required effort from me and isn’t very useful now that I no longer have a Mac.

I’ve also tried to rectify this situation via half-hearted attempts at using pdftotext and grabbing the first 10 words of text, but more often than not I was left with more incomprehensible garbage.

Today I had some spare time and far too much interest in this problem, and I managed to come up with an easy and fairly effective solution, even if it resembles a Rube Goldberg machine. After digging around for various PDF conversion utilities, I discovered that pdftohtml not only generates reasonable output, but can also be set to emit an easily parsed XML format. From there it was a simple bit of BeautifulSoup to get nice titles for most of my documents:

[gist id=2078056]
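To give a feel for the approach without opening the gist, here is a minimal sketch of one way to pull a title out of pdftohtml's XML output. It is not the gist's code: it uses the standard library's ElementTree instead of BeautifulSoup so it runs with no extra installs, and the heuristic (take the text set in the largest font on the first page) is an assumption on my part, as are the names `guess_title` and `SAMPLE`. The sample input mimics the `<page>`, `<fontspec>` and `<text>` elements that pdftohtml emits in XML mode.

```python
import xml.etree.ElementTree as ET


def guess_title(xml_text):
    """Guess a title from pdftohtml XML: the largest-font text on the first page."""
    root = ET.fromstring(xml_text)
    page = root.find("page")
    if page is None:
        return None
    # Map each font id to its size, as declared by the <fontspec> elements.
    sizes = {f.get("id"): float(f.get("size", 0)) for f in page.iter("fontspec")}
    lines = []
    for t in page.iter("text"):
        # itertext() also picks up text nested in <b>/<i>; split/join collapses whitespace.
        content = " ".join("".join(t.itertext()).split())
        if content:
            lines.append((sizes.get(t.get("font"), 0.0), content))
    if not lines:
        return None
    biggest = max(size for size, _ in lines)
    # A title often wraps over several lines, so join everything at the top size.
    return " ".join(text for size, text in lines if size == biggest)


# Hand-written sample in the shape of pdftohtml's XML output (assumed, not real data).
SAMPLE = """<pdf2xml>
<page number="1" position="absolute" top="0" left="0" height="1188" width="918">
<fontspec id="0" size="24" family="Times" color="#000000"/>
<fontspec id="1" size="10" family="Times" color="#000000"/>
<text top="100" left="200" width="500" height="30" font="0"><b>Robust Title</b></text>
<text top="135" left="200" width="500" height="30" font="0">Extraction</text>
<text top="200" left="100" width="700" height="12" font="1">Abstract. Body text here.</text>
</page>
</pdf2xml>"""

if __name__ == "__main__":
    print(guess_title(SAMPLE))
```

In practice the XML would come from something like `pdftohtml -xml -stdout -f 1 -l 1 file.pdf`. Documents whose biggest text is a journal banner or a logo will fool this heuristic, so treat the result as a suggestion rather than a guaranteed title.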

tcp timelines with ggplot2

I’ve come across the need to analyze TCP flows from time to time, and while scripts like flowtime and EasyTimeline are nice, they aren’t really, well, pretty. ggplot2, on the other hand, is, and it turns out to be really easy to get nice, somewhat useful plots.

Here’s an example conversation between my local browser and (warning, gigantic)

You can easily see the importance of fast DNS resolution, with almost 2 seconds spent idle waiting for the first resolver hit. Then we see a large number of connections opened up, as modern browsers and sites try to work around the small TCP initial congestion window. Finally, the connections peter out and the final FIN packets arrive as the browser finishes the page.

It’s at least slightly more informative than staring at wireshark dumps, and it provides another excuse to practice my R. The code is pretty straightforward, and mostly dedicated to munging the tshark field output to make streams show up in a reasonable way:

[gist id=2141140]
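The munging step is the part worth sketching. The gist does this in R; what follows is a Python approximation of the same idea, not a port of it. It assumes the capture was dumped as tab-separated fields with `frame.time_relative` first and `tcp.stream` second (for example via `tshark -T fields -e frame.time_relative -e tcp.stream`), and the function name `stream_spans` is made up for this example. It collapses the per-packet rows into one (stream, start, end) span per TCP stream, which is the shape a timeline plot needs.

```python
def stream_spans(lines):
    """Collapse per-packet rows into (stream, first_seen, last_seen) tuples.

    Each row is assumed to be "<relative time>\t<tcp stream index>".
    """
    spans = {}
    for line in lines:
        parts = line.rstrip("\n").split("\t")
        if len(parts) < 2 or not parts[0] or not parts[1]:
            # Non-TCP packets (DNS over UDP, ARP, ...) have an empty tcp.stream field.
            continue
        t, stream = float(parts[0]), int(parts[1])
        start, end = spans.get(stream, (t, t))
        spans[stream] = (min(start, t), max(end, t))
    # Order by start time so streams stack top to bottom in the order they opened.
    return sorted(
        ((s, start, end) for s, (start, end) in spans.items()),
        key=lambda row: row[1],
    )


if __name__ == "__main__":
    sample = [
        "0.000000\t",      # a non-TCP packet, skipped
        "1.950000\t0",
        "2.010000\t1",
        "2.400000\t0",
        "5.200000\t1",
    ]
    for stream, start, end in stream_spans(sample):
        print(stream, start, end)
```

Each resulting tuple maps directly onto one horizontal segment in the plot: the stream index on the y axis, with the segment running from `start` to `end` on the x axis.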