scidb performance

We searched around for any reasonable comparison of SciDB performance to existing systems (specifically, we're looking at doing straightforward bulk-parallel operations like logistic regression and k-means). So far, we've found the performance to be very poor: an order of magnitude or two worse than single-machine runs or systems like Spark.

Any ideas what we’re doing wrong? The code is below. We’re running this across 4 machines with 32GB of memory each. The example code is just trying to do the regression on 100000 samples with 10 dimensions.

The

insert(multiply(X, W))

query itself seems to take several seconds. What's going on here? On a single machine, this operation takes less than a millisecond; even accounting for disk reads and network overhead, I'd expect this to be a hundred times faster than it is.

Trying to run with more data causes the system to run out of memory.

#!/bin/bash

function create_randoms(){
  x=$(($1 - 1))
  y=$(($2 - 1))
  echo "Creating random array $3[$1, $2]..."
  iquery -anq "store(build([x=0:$x,$CHUNK_SIZE,0,
y=0:$y,$CHUNK_SIZE,0], double(1)/(random() % 10 + 1)), $3)" >/dev/null
}

function create_zeros(){
  x=$(($1 - 1))
  y=$(($2 - 1))
  echo "Creating zero array $3[$1, $2]..."
  iquery -anq "store(build([x=0:$x,$CHUNK_SIZE,0,y=0:$y,$CHUNK_SIZE,0],0),$3)" >/dev/null
}

function create_template(){
  x=$(($1 - 1))
  iquery -nq "create array Template [x=0:$x,$CHUNK_SIZE,0]" >/dev/null
}

function exe(){
  iquery -o csv+ -q  "$1"
}

function exe_silent(){
  iquery -nq "$1" >/dev/null
}

function clean(){
  exe_silent "drop array W"
  exe_silent "drop array X"
  exe_silent "drop array Y"
  exe_silent "drop array Pred"
  exe_silent "drop array Diff"
  exe_silent "drop array Grad"
  exe_silent "drop array Temp"
  exe_silent "drop array Template"
}

function init(){
  create_randoms $N $D X
  create_randoms $N 1 Y
  create_randoms $D 1 W
  create_zeros $N 1 Pred
  create_zeros $D 1 Grad
  create_zeros $N 1 Diff
  create_template $N
}

D=10 #Dimensions.
N=100000 #Points.
CHUNK_SIZE=10000

clean
init #Create arrays.

# Gradient-descent iterations: Pred = sigmoid(X*W); Diff = Pred - Y; Grad = X' * Diff; W = W + 0.001 * Grad.
for i in {1..5}
do
iquery <<HERE
set no fetch;

set lang afl;
insert(multiply(X, W), Pred);

set lang aql;
update Pred set val= double(1)/(double(1) + exp(-val));
select Pred.val - Y.val into Diff from Pred, Y;
select sum(new_val) as val into Temp from (select X.val * R.val as new_val from X join reshape(Diff, Template) as R on X.x = R.x) group by y;
select val into Grad from reshape(substitute(Temp, build(Temp, 0)), Grad);

set fetch;
select W.val + 0.001 * Grad.val into W from W, Grad;
HERE
done

clean
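
For reference, here's roughly the kind of single-machine NumPy loop we're comparing against. This is just a sketch that mirrors the script's update rule; the random initialization only approximates the build(...random()...) arrays above:

import numpy as np

N, D = 100000, 10
X = 1.0 / (np.random.randint(0, 10, size=(N, D)) + 1)
Y = 1.0 / (np.random.randint(0, 10, size=(N, 1)) + 1)
W = 1.0 / (np.random.randint(0, 10, size=(D, 1)) + 1)

for _ in range(5):
  pred = 1.0 / (1.0 + np.exp(-X.dot(W)))  # insert(multiply(X, W)) plus the sigmoid update
  diff = pred - Y                         # Pred.val - Y.val
  grad = X.T.dot(diff)                    # sum(X.val * Diff.val) group by y
  W = W + 0.001 * grad                    # W.val + 0.001 * Grad.val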

a simple watchdog for subprocesses

A problem I often encounter when writing code that deals with subprocesses is the issue of orphans. Orphan processes are perfectly useful in many situations, such as keeping a program running via disown. They're also a real annoyance when you're trying to fork worker processes: every time your parent process has a bug, you end up with orphaned workers that you have to terminate by hand.

For instance, I might want to launch a bunch of worker processes, each listening on a separate port:

import subprocess

for i in range(num_procs):
  subprocess.Popen(['python', 'worker.py', str(9999 + i)])

The orphan problem crops up if the parent dies without terminating the worker processes (gracefully or otherwise). After the parent process is killed, the workers are left running in the background, holding onto both resources and their ports. We then have to terminate them manually in order to try running again:

pkill -9 -f worker.py

You can use the atexit functionality available in most languages to try to clean up, but this doesn't help if your process segfaults or meets some other ignominious end.
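
In Python that approach looks something like this (a sketch; workers is a hypothetical list holding the Popen handles from above):

import atexit

@atexit.register
def kill_workers():
  for proc in workers:  # hypothetical list of subprocess.Popen handles
    proc.terminate()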

In a “real” environment, this is typically solved by having a cluster manager which schedules jobs and can start and kill processes as needed (for example, SLURM or Torque). But often we're running in an environment where this is unavailable (either our local machine, or we just haven't bothered to set anything up). What do we do?

The solution here is a watchdog timer that terminates the workers if the master process dies. This can be accomplished in a number of ways, including via the RPC system, but I've found a simple mechanism that works well for me: polling the stdin channel.

import os
import select
import threading
import time

class FileWatchdog(threading.Thread):
  """Watchdog for a file (typically sys.stdin).
 
  When the file closes, terminate the process.
  (This typically occurs when the parent process is lost.)
  """
  def __init__(self, file_handle):
    threading.Thread.__init__(self, name='WatchdogThread')
    self.setDaemon(True)
    self.file_handle = file_handle
    # self.log = open('/tmp/watchdog.%d' % os.getpid(), 'w')
 
  def run(self):
    f = [self.file_handle]
    while 1:
      r, w, x = select.select(f, f, f, 1.0)
      # print >>self.log, 'Watchdog running: %s %s %s' % (r,w,x)
      # self.log.flush()
      if w:
        # print >>self.log, 'Watchdog: file closed.  Shutting down.'
        # self.log.flush()
        os._exit(1)
      time.sleep(1)

I just have my worker processes spawn a watchdog thread at startup, and they terminate themselves as soon as they lose their connection to the master process. This simple mechanism falls apart if you're using stdin to communicate with your child processes, but it works well for most other simple needs.
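
For example, a worker's entry point might look something like this (a sketch: the watchdog module name is hypothetical, and the loop stands in for the worker's real work):

# worker.py (sketch)
import sys
import time
from watchdog import FileWatchdog  # hypothetical module holding the class above

if __name__ == '__main__':
  FileWatchdog(sys.stdin).start()  # terminates this process when stdin closes
  while True:                      # stand-in for the worker's real event loop
    time.sleep(1)

Note that this only works if the parent launches the workers with stdin=subprocess.PIPE, so that the worker's stdin actually closes when the parent goes away.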

thread profiling in Python

Python has accumulated a lot of… character over the years.  We've got no fewer than three profiling libraries for single-threaded execution, and a multi-threaded profiler with an incompatible interface (Yappi).  Since many applications use more than one thread, this can be a bit annoying.

Yappi works most of the time, except that it can sometimes cause your application to hang for unknown reasons (I blame signals, personally). The other issue is that Yappi doesn't have a way of collecting call-stack information. (I don't necessarily care that memcpy takes all of the time; I want to know who called memcpy.) That matters because the lovely gprof2dot can take in pstats dumps and output a very nice profile graph.

To address this for my own use, I glom together cProfile runs from multiple threads. In case it might be useful to other people, I wrote a quick gist illustrating how to do it. To make it easy to drop in, I monkey-patch the Thread.run method, but you can use a more maintainable approach if you like (I create a ProfileThread subclass in my applications).

from threading import Thread
 
import cProfile
import pstats
 
def enable_thread_profiling():
  '''Monkey-patch Thread.run to enable global profiling.

  Each thread creates a local profiler; statistics are pooled
  to the global stats object on run completion.'''
  Thread.stats = None
  thread_run = Thread.run
  
  def profile_run(self):
    self._prof = cProfile.Profile()
    self._prof.enable()
    thread_run(self)
    self._prof.disable()
    
    if Thread.stats is None:
      Thread.stats = pstats.Stats(self._prof)
    else:
      Thread.stats.add(self._prof)
  
  Thread.run = profile_run
  
def get_thread_stats():
  stats = getattr(Thread, 'stats', None)
  if stats is None:
    raise ValueError('Thread profiling was not enabled, '
                     'or no threads finished running.')
  return stats
 
if __name__ == '__main__':
  enable_thread_profiling()
  import time
  t = Thread(target=time.sleep, args=(1,))
  t.start()
  t.join()
  
  get_thread_stats().print_stats()
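
To get the profile graph mentioned above, dump the pooled stats in pstats format and feed the file to gprof2dot (the filenames here are arbitrary):

get_thread_stats().dump_stats('threads.prof')
# then, from a shell:
#   gprof2dot -f pstats threads.prof | dot -Tpng -o threads.png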

Swig+Directors = Subclassing from Python!

Swig is a fabulous tool — I generally rely on it to extricate myself from the holes I’ve managed to dig myself into using C++.  Swig parses C++ code and generates wrappers for a whole bunch of target languages — I normally use it to build Python interfaces to my C++ code.

A cool feature that I've never made use of before is “directors”: these let you write subclasses of your C++ classes in Python (or whatever target language you desire).  In particular, this provides a relatively easy mechanism for writing callbacks in Python.  Here's a quick example:

// rpc.h
#include <string>

class Request;
class Response;

// Callback interface: subclass this and implement fire().
class RPCHandler {
public:
  virtual ~RPCHandler() {}
  virtual void fire(const Request& req, Response* resp) = 0;
};

class RPC {
public:
  void register_handler(const std::string& name, RPCHandler* handler);
};

Normally, I’d make a subclass of RPCHandler in C++ and register it with my RPC server. But with SWIG, I can actually write this using Python:

class MyHandler(wrap.RPCHandler):
  def fire(self, req, resp):
    resp.write('Hello world!')
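
Registering the Python handler with the server then looks something like this (a sketch; it assumes the interface also pulls in SWIG's std_string.i so the name argument maps from a Python string):

handler = MyHandler()
server = wrap.RPC()
server.register_handler('hello', handler)

Since the C++ side only stores a pointer, it's worth keeping a Python reference to the handler alive for as long as it's registered.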

It's relatively straightforward to set up. I write an interface file describing my application:

// wrap.swig
// Our output module will be called 'wrap'; enable director support.
%module(directors="1") wrap
%feature("director") RPCHandler;

// Generate wrappers for our RPC code
%include "rpc.h"

// When compiling the wrapper code, include our original header.
%{
#include "rpc.h"
%}

That’s it! Now we can run swig:

 swig -c++ -python -O -o wrap.cc wrap.swig 

Swig will generate wrap.cc (which we compile and link into our application), and a wrap.py file, which we can use from Python.
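
One way to build the generated wrapper into a loadable module is a small distutils script (a sketch; rpc.cc stands for whatever implements rpc.h in your project):

# setup.py (sketch)
from distutils.core import setup, Extension

setup(name='wrap',
      ext_modules=[Extension('_wrap', sources=['wrap.cc', 'rpc.cc'])])

Running python setup.py build_ext --inplace then produces the _wrap extension next to the generated wrap.py.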

ah, latex

I thought I wanted to customize the layout of my document a little bit, so I started looking at the various styles that are available.

Then I came upon the memoir package (which is supposed to help with these things); the 550-page manual that comes with it has so effectively scared the crap out of me that I'm now looking at my crappy-looking document and thinking: “heck, it looks good enough”.

I will gladly leave typography to the typographers.

vim

I’m a frequent (some might say avid) user of the Vim text editor. I’ve been using it off and on for the past 15 years, and it’s frequently saved me quite a bit of time with the handy macro system.

Now, if you’ve ever used Vim before you’re probably familiar with this intro screen:

[screenshot: the Vim startup screen, with its “Help poor children in Uganda!” message]

Somehow, amazingly, and I’m sure like everyone else, I had managed to go on for all this time without really ever typing

:help uganda

Last night, while installing Vim on my new laptop, I finally did. I saw Bram's visit report, and I was really touched by how much they were doing with relatively limited resources. And so I finally made a small donation to ICCF; I even got a personal email from Bram in reply, which was nice.

Anyway, if you’re a Vim user, I encourage you to do the same.

kde desktop magic

Something miraculous that I just realized: middle clicking (aka pasting the X buffer) on the KDE desktop creates a “post-it” note with the contents of the clipboard.

Whether I will ever make use of this feature is unclear, but it's definitely the “right” behavior and a nice thought on the developers' part.

Big Gummy Bears

Normally, when confronted with (inevitably weird and annoying) YouTube commercials, I’m hovering over the “Skip Ad” link, waiting for it to be enabled.

But today, I saw an advertisement so odd that I was forced to watch the whole thing just to figure out whether it was a parody. It wasn't. Congratulations, Vat19, on a successful commercial: if I'm ever in the market for giant gummy bears, I'll come your way.

Who thinks up these things?

Fiber to the people

My home internet sucks, relatively speaking. Anytime something shows up as “HD”, I know it's not going to work out for me. This is not at all surprising, given that I have only one choice (Time Warner), and they continually send me advertisements offering to charge me $100 a month for the bandwidth I'm supposed to be getting for $50. The sad thing is that if you look at the wireless routers visible from my apartment (> 50), everyone in the building (and the nearby buildings) has the same problem. If only we could all share one good connection, we'd be so much happier.

So here’s what I think should happen.

Apartment buildings should have fiber run to the building, run Ethernet/wireless to each floor, and charge $40 a month for access. Why? Because they'd make money off of it, that's why. After the initial cost of running the fiber, they could contract with Cogent, Level 3, or AT&T to provide transit. Based on my crude knowledge of connection costs from five years ago, it would cost about $5 a month to give every user 10 Mb/s of dedicated service.

And everyone in the building would get 10 times the bandwidth of Time Warner to boot. Hurray!

I've been daydreaming about starting a company that contracts to do just this (drag fiber to buildings and contract for support). I know, I know, a lot of companies like this already exist, but still… let me daydream.