thinking sysadmin

qstat -u aleonard -s z

On SPRINT: A new parallel framework for R


As a sysadmin who supports multiple R users, I took notice of a post on InsideHPC late last year – Parallel framework for statistical analysis package “R”.  The creators of the Simple Parallel R INTerface (SPRINT) have “designed and built a prototype framework that allows the addition of parallelised functions to R to enable the easy exploitation of HPC systems.” (paper, source code)  In other words, it is a system that lets R users run on a cluster without learning parallel programming.

One of the biggest challenges of my job hasn’t been building cluster resources – in an era of open source queuing systems like Sun Grid Engine and distributions like Rocks, setting up a cluster is pretty much as easy as you want it to be.  Rather, the challenge has been convincing users to actually run their jobs on the cluster.  One user of mine preferred to use some home-built Perl to dispatch jobs instead of investing half an hour in learning to write job scripts for SGE.  Some of my R users will run jobs on the cluster, but not via a job script – they use qrsh instead, launching the job interactively and then leaving their terminal idle until it completes.  Oftentimes, the cluster-aware software that my users need leaves a bad taste in their mouths as well.  I’ve seen a software developer who is chronically unclear on the difference between high throughput and high performance computing, “grid-aware” software products that hard code queue names, and a product that chose a backwards programming model and wrote its own bug-laden queueing system (I’m looking at you, TurboSEQUEST).
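For anyone wondering what that half hour buys, here is a minimal sketch of an SGE job script for an R job. The job name and the file names (analysis.R, r_job.sh) are placeholders I made up for illustration, and your site may require extra options such as a queue or a parallel environment.

```sh
#!/bin/bash
# Minimal SGE job script sketch. Job and file names are hypothetical.

# Name the job so it is easy to spot in qstat.
#$ -N r_analysis
# Run from the directory the job was submitted from.
#$ -cwd
# Merge stderr into the stdout file.
#$ -j y
# Use bash to interpret this script.
#$ -S /bin/bash

# Run the R script non-interactively (analysis.R is a placeholder).
Rscript analysis.R
```

Submitted with `qsub r_job.sh`, the job runs unattended and its output lands in a file in the submission directory, so nobody has to leave a terminal idle while it finishes.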

So SPRINT caught my eye as a possible solution to get more computational resources to more of my users more easily.  I recognize that it’s not a mature solution, or really even much more than an idea of how a mature solution might look one day.  I sent the link for the paper to my heaviest R user – his initial reaction was “interesting, but not very useful” – not because he couldn’t see the potential, but because a lot of heavy lifting has yet to be done on the project to get it to a state where it would benefit his work.

I don’t know if SPRINT will eventually live up to its promise – it has a long way to go: it is only a 0.0.3 release, and it doesn’t scale particularly well right now – but I’m encouraged by it because I see it as a step in the right direction.  Single-threaded performance doesn’t look set to improve much any time soon; if we want to keep analyzing larger and larger datasets, the problem that SPRINT addresses is one we’ll have to keep thinking about.

Written by Andy

January 16th, 2009 at 8:55 am

Posted in hpc

