scidb performance

We searched around trying to find any reasonable comparison of SciDB performance to existing systems (specifically, we're looking at doing straightforward bulk-parallel operations like logistic regression and k-means). So far, the performance we've seen is very poor: an order of magnitude or two worse than single-machine runs or systems like Spark.

Any ideas what we're doing wrong? The code is below. We're running this across 4 machines with 32 GB of memory each. The example code just tries to do the regression on 100,000 samples with 10 dimensions.

The insert(multiply(X, W)) query by itself seems to take several seconds. What's going on here? On a single machine this operation takes less than a millisecond; even accounting for disk reads and network overhead, I'd expect it to be a hundred times faster than it is.
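
To see this in isolation, one way to time just that statement (assuming the arrays from the script below have already been created) is:

time iquery -anq "insert(multiply(X, W), Pred)"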

Trying to run with more data causes the system to run out of memory.
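
For concreteness, each pass of the loop in the script below is meant to compute one logistic-regression update: Pred = 1/(1 + exp(-X·W)), Diff = Pred - Y, Grad = X^T·Diff, and then W = W + 0.001·Grad.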


#!/bin/bash

function create_randoms(){
  x=$(($1 - 1))
  y=$(($2 - 1))
  echo "Creating random array $3[$1, $2]..."
  iquery -anq "store(build([x=0:$x,$CHUNK_SIZE,0,
y=0:$y,$CHUNK_SIZE,0], double(1)/(random() % 10 + 1)), $3)" >/dev/null
}

function create_zeros(){
  x=$(($1 - 1))
  y=$(($2 - 1))
  echo "Creating zero array $3[$1, $2]..."
  iquery -anq "store(build([x=0:$x,$CHUNK_SIZE,0,y=0:$y,$CHUNK_SIZE,0],0),$3)" >/dev/null
}

function create_template(){
  x=$(($1 - 1))
  iquery -nq "create array Template [x=0:$x,$CHUNK_SIZE,0]" >/dev/null
}

function exe(){
  iquery -o csv+ -q  "$1"
}

function exe_silent(){
  iquery -nq "$1" >/dev/null
}

function clean(){
  exe_silent "drop array W"
  exe_silent "drop array X"
  exe_silent "drop array Y"
  exe_silent "drop array Pred"
  exe_silent "drop array Diff"
  exe_silent "drop array Grad"
  exe_silent "drop array Temp"
  exe_silent "drop array Template"
}

function init(){
  create_randoms $N $D X
  create_randoms $N 1 Y
  create_randoms $D 1 W
  create_zeros $N 1 Pred
  create_zeros $D 1 Grad
  create_zeros $N 1 Diff
  create_template $N
}

D=10 #Dimensions.
N=100000 #Points.
CHUNK_SIZE=10000

clean
init #Create arrays.

for i in {1..5}
do
iquery <<HERE
set no fetch;

set lang afl;
insert(multiply(X, W), Pred);

set lang aql;
update Pred set val= double(1)/(double(1) + exp(-val));
select Pred.val - Y.val into Diff from Pred, Y;
select sum(new_val) as val into Temp
  from (select X.val * R.val as new_val
        from X join reshape(Diff, Template) as R on X.x = R.x)
  group by y;
select val into Grad from reshape(substitute(Temp, build(Temp, 0)), Grad);

set fetch;
select W.val + 0.001 * Grad.val into W from W, Grad;
HERE
done

clean


4 thoughts on “scidb performance”

  1. Hi!

    SciDB Architect de-cloaking. Just a thought … http://www.scidb.net/forum is a good place for a question like this.

    1. Terribly sorry. Your script as quoted above confused me a bit at first.

    Your script’s “store(build([x=0:$x,$CHUNK_SIZE,0, y=0:$y,$CHUNK_SIZE,0], double(1)/(random() % 10 + 1)), $3)” won’t work. You need to specify the array’s attributes in build(…).
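
    For example, with a single double attribute named val (which is what the rest of the script expects), that first store would look something like:

    store(build(<val:double>[x=0:$x,$CHUNK_SIZE,0, y=0:$y,$CHUNK_SIZE,0], double(1)/(random() % 10 + 1)), $3)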

    Anyway … I substituted in <val:double> for luck, and got it all working.

    2. Now … you’ve got a value in that script you’re calling CHUNK_SIZE, and I suspect there’s a slight misconception here. The size of a chunk is the *product* of the chunk’s lengths along each dimension. In this case you’re setting that chunk *size* variable to 10,000 … and then in the CREATE ARRAY you’re using it as what we call the chunk *length* along both dimensions. Which means the size of a single *chunk* is 10,000 x 10,000 cells (times 8 bytes per value), which yields roughly 1 GB per chunk. The underlying RLE encoding will reduce the size of the smaller chunks, e.g. in ‘W’.

    We recommend that folks try to keep chunks between 2 MB and 64 MB. Chunks of 1 GB are a bit big, and our executor is not going to do very well with them; the arithmetic is spelled out below.
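
    To make that concrete (back-of-the-envelope, one 8-byte double per cell):

    10,000 x 10,000 cells/chunk x 8 bytes ≈ 800 MB per chunk (your script)
    1,000 x 1,000 cells/chunk x 8 bytes = 8 MB per chunk (the revised script below)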

    3. Changes I made to your script:

    i. Renamed your CHUNK_SIZE variable to CHUNK_LENGTH and set it to 1000, since it really is a per-dimension length. We advise folks to aim for chunk sizes in the 2 MB to 64 MB range. Because SciDB supports both dense and “ragged” arrays, you need to be mindful of the chunk lengths (along each dimension) and the sparsity to get chunk sizes somewhere close to that; there’s a rough sizing sketch after the timings below.

    ii. Added a couple of echo "" statements to the script. Mostly they were there so I could see what was going on, but I left ‘em in …

    iii. Ran the whole thing with 1 thread, 1 GB of RAM, and local disk, on my Mac laptop under VirtualBox.

    $ time ./Bug.sh

    real 3m35.100s
    user 0m0.341s
    sys 0m0.162s
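
    If you’d rather work backwards from a target chunk size, here’s a rough sketch for a dense 2-D array of doubles (the variable names are purely illustrative and not part of either script):

    # Aim a square 2-D chunk of doubles at roughly 8 MB (inside the 2M-64M guidance).
    TARGET_CHUNK_BYTES=$((8 * 1024 * 1024))
    BYTES_PER_CELL=8                          # one double attribute per cell
    CELLS_PER_CHUNK=$((TARGET_CHUNK_BYTES / BYTES_PER_CELL))
    CHUNK_LENGTH=$(awk -v c="$CELLS_PER_CHUNK" 'BEGIN { printf "%d", sqrt(c) }')   # ~1024
    echo "CHUNK_LENGTH=$CHUNK_LENGTH"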

    4. On another note? SciDB is going to suck pigs through iron-pipe plumbing on 100,000 x 10 arrays like yours, relative to almost anything else. At those sizes, everything fits comfortably in memory on a single machine. We start to be useful at around 10,000 x 100,000 dense arrays. And talk to us! The SciDB built-in stuff isn’t going to be particularly terrific.

    If you want (say) SVD or GEMM over 1,000,000 x 10,000 arrays? Talk to us.
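
    (For scale, in raw doubles: 100,000 x 10 is about 8 MB, 10,000 x 100,000 is about 8 GB, and 1,000,000 x 10,000 is about 80 GB.)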

    plumber

    
    #!/bin/bash
    
    function create_randoms(){
      echo "RANDOM: Creating array ${3}, at {$1} x ${2} with val = random() "
      x=$(($1 - 1))
      y=$(($2 - 1))
      echo "Creating random array $3[$1, $2] with Chunk Length = ${CHUNK_LENGTH} ..."
      iquery -anq "store(build(  [x=0:$x,$CHUNK_LENGTH,0,
     y=0:$y,$CHUNK_LENGTH,0], double(1)/(random() % 10 + 1)), $3)" >/dev/null
    }
    
    function create_zeros(){
      echo "ZEROES: Creating array ${3}, at {$1} x ${2} with val = 0 "
      x=$(($1 - 1))
      y=$(($2 - 1))
      echo "Creating zero array $3[$1, $2] with Chunk Length = ${CHUNK_LENGTH} ..."
      iquery -anq "store(build( [x=0:$x,$CHUNK_LENGTH,0,y=0:$y,$CHUNK_LENGTH,0],0),$3)" >/dev/null
    }
    
    function create_template(){
      echo "TEMPLATE: Template with ${1}"
      x=$(($1 - 1))
      iquery -nq "create array Template  
    [x=0:$x,$CHUNK_LENGTH,0]" >/dev/null
    }
    
    function exe(){
      iquery -o csv+ -q  "$1"
    }
    
    function exe_silent(){
      iquery -nq "$1" >/dev/null
    }
    
    function clean(){
      echo "Removing W, X, Y, Pred, Diff, Grad, Temp and Template"
      exe_silent "drop array W"
      exe_silent "drop array X"
      exe_silent "drop array Y"
      exe_silent "drop array Pred"
      exe_silent "drop array Diff"
      exe_silent "drop array Grad"
      exe_silent "drop array Temp"
      exe_silent "drop array Template"
      echo "Done Clean"
    }
    
    function scidb_init(){
      create_randoms $N $D X
      create_randoms $N 1 Y
      create_randoms $D 1 W
      create_zeros $N 1 Pred
      create_zeros $D 1 Grad
      create_zeros $N 1 Diff
      create_template $N
    }
    
    D=10 #Dimensions.
    N=100000 #Points.
    CHUNK_LENGTH=1000
    
    clean
    scidb_init #Create arrays.
    
    for i in {1..5}
    do
    iquery <<HERE
    set no fetch;
    
    set lang afl;
    insert(multiply(X, W), Pred);
    
    set lang aql;
    update Pred set val= double(1)/(double(1) + exp(-val));
    select Pred.val - Y.val into Diff from Pred, Y;
    select sum(new_val) as val into Temp
      from (select X.val * R.val as new_val
            from X join reshape(Diff, Template) as R on X.x = R.x)
      group by y;
    select val into Grad from reshape(substitute(Temp, build(Temp, 0)), Grad);
    
    set fetch;
    select W.val + 0.001 * Grad.val into W from W, Grad;
    HERE
    done
    
    clean
    
    
    • Thanks for the feedback (and yes, the less-than greater-than issue is a giant PITA). We’ll definitely use the forum if we run into any more issues; I’m not sure why that didn’t occur to me this time.

      We were just playing with the idea of using SciDB as a backend for another project, but understandably, you’re targeting a different use case than the one we’re looking at.

  2. It occurs to me that SGML’s use of enclosing “less-than” and “greater-than” symbols will be the death of mathematical communication.

    It further seems that using “less-than … greater-than” as delimiters in our own declarative language might have been a mistake.

  3. SciDB might or might not be useful.

    Have you tried the ‘R’ front end to SciDB? We often find ourselves acting as the big, red “GO BIG!” button for ‘R’ once you can’t get enough juice out of one physical box …
