Parallel Computing using R on statcluster.ucsf.edu

This document outlines what you need to do to run parallel computations within R, using statcluster.ucsf.edu as a twelve-CPU cluster. This procedure uses the LAM/MPI message passing interface" to communicate between computers ("nodes") comprising the cluster, and the "snow" library within R to distribute computations. LAM/MPI works by using rsh/ssh to distribute computations between cluster nodes and return the results to the calling process. After the initial set-up in step 1, you need only follow step 2 below to do a parallel computation in R.

Note that this approach is designed for "embarrassingly parallel" computations like simulations and the bootstrap. Also, the examples require a good working knowledge of R or S-PLUS."

Remember that is important to properly shut down LAM/MPI processes after you're finished (steps 2-d and 2-e below). Leaving things running will tie up resources that will be unavailable for others to use.

See the article by Rossini et al and Luke Tierney's pages on the snow() library for more details about parallel computations using R. Finally, note that custom R functions using clusters can be written using primitives in the Rmpi library. Further, LAM/MPI can be used to implement parallel programs in FORTRAN and C that run outside of the R environment. (See www.lam-mpi.org for information on how to do this.)


How to Run a Simple Parallel R Job using LAM/MPI

Get an account on statcluster.ucsf.edu from Mark Garey. Note that this machine can only be accessed from Sun and Apple hosts in the Division offices, and that user accounts on the master node (statcluster) are mirrored to the daughter nodes via NFS, so direct login to nodes other than the master is unecessary. However, you'll need to set-up password-less login to the daughter nodes for the LAM/MPI software to work. This is described below:

1.  Use ssh to login in to statcluster from an account on a trusted host among the Division workstations, and generate a file called ".rhosts" in your home directory using a text editor (e.g. vi). The contents of the file should look like the following:

10.0.0.1
10.0.0.2
10.0.0.3
10.0.0.4
10.0.0.5
10.0.0.6
These are the IP addresses of all hosts on the internal ethernet network for the cluster. Set the permissions of this file as follows:
chmod 600 .rhosts
Now, verify that the file permissions are set to allow reading and writing only by you:
statcluster:~ steve$ ls -lrt .rhosts
-rw-------  1 steve  staff  54 28 Oct 12:16 .rhosts
Do not make a .rhosts file on your desktop machine under any circumstances, and never attempt to set-up password-less login on that machine.

2.   Follow the steps below to run a distributed job using R on statcluster as a 12-unit cluster:

a.   Choose a working directory where you want to run your R session. Create a host file listing the hosts you want to use in your cluster. I'll assume this is named "lamhosts". Here are the contents of a file specifying all twelve available cluster CPUs:
statcluster
statcluster
node1
node1
node2
node2
node3
node3
node4
node4
node5
node5
Each host is included twice, once for each processor (CPU). Within the working directory on statcluster, use the following commands to initiate the LAM/MPI process on the specified cluster:
LAMRSH=rsh
export LAMRSH
lamboot -v lamhosts
The first two lines set the LAMRSH environment variable to tell the LAM/MPI interface that you'll be using rsh to communicate bewteen nodes. You should see output like the following if you were successful:
statcluster:~/work/test steve$ lamboot -v lamhosts

LAM 7.0.4/MPI 2 C++/ROMIO - Indiana University

n0<11018> ssi:boot:base:linear: booting n0 (statcluster)
n0<11018> ssi:boot:base:linear: booting n1 (node1)
n0<11018> ssi:boot:base:linear: booting n2 (node2)
n0<11018> ssi:boot:base:linear: booting n3 (node3)
n0<11018> ssi:boot:base:linear: booting n4 (node4)
n0<11018> ssi:boot:base:linear: booting n5 (node5)
n0<11018> ssi:boot:base:linear: finished
b.   Now you are ready to use the cluster from within R. Get into R, and initiate the cluster as indicated below. (Note that the argument to makeMPIcluster() should match the number of hosts in your lamhosts file.)
         
library(snow)
cl <- makeCluster(12, type = "MPI")
You should see output like the following:
Loading required package: Rmpi 

        Rmpi version: 0.4-9 
        Rmpi is an interface (wrapper) to MPI APIs
        with interactive R slave functionalities.
        See `library (help=Rmpi)' for details.
        12 slaves are spawned successfully. 0 failed.
c.   The snow library has versions of standard functions designed to distribute work over the nodes in the cluster and collect the results locally. Below is an example of a simple bootstrap prediction computation with and without using the cluster:
############### single CPU #####################
> source("boot-example.R")
> system.time(boot(nuke.data, nuke.fun, R = 1000, m = 1, fit.pred = new.fit, x.pred = new.data)) 
[1] 11.86  0.53 12.40  0.00  0.00
############### twelve CPUs #####################

Calling with system.time() gives timing information, and the 3rd number returned gives the elapsed time (in seconds) for the process to run. The second call is faster. There is currently only one online help file for snow. You can view it via the help command called with any valid snow function. e.g. "help(parMM)". See the linked page for more details about this example.
d.   Shut down the cluster from within R:
stopCluster(cl)
e.   After quitting R, make sure to permanently shut down the MPI processes:
lamclean -v

A General Approach to Parallel Simulations and Bootstrapping

As demonstrated above, running simulations, bootstraps, etc. in parallel requires a bit of set-up, but results in substantially faster computations. One limitation in using built-in function libraries (e.g. bootstrap and boot as dcescribed above) is that the results returned from the included cluster nodes may need to be "massaged" a bit to be useable for summarization and/or further analyses. For example, if a bootstrap confidence interval based on 1000 samples boot is desired, the cluster call applying boot() to 10 nodes will return a list consisting of 10 lists, each summarizing the bootstrap results for 100 samples. This structure is not directly usable in the boot.ci() function to compute BCa confidence intervals. It is not usually too difficult to extract the desired results. However, until cluster-friendly versions of standard routines are developed, an alternative approach which provides more control (but relies less on "canned" functions) is described below.

The basic idea is to generate a list of all component datasets (or, in the case of bootstraping, a list of indicators of all desired resamples) first, and then using a cluster-aware version of the function lapply() to apply a selected user-supplied function for estimation across cluster nodes. The results are returned as a list that can be easily manipulated.

Bootstrap example

The first example of this approach is the computation of the bootstrap standard error for hazard parameters estimated by a fitting function I wrote in R and FORTRAN. The data consist of a 1508 observations of a binary failure indicator and two time variables used to define the interval of failure (for failed indiv.) or time of right censoring (for survivors). Note that I am assuming that you already have created and saved the functions and data necessary for the computations in a .RData object in a directory in your account, and that you have started up a cluster and are in R as described in step 1 above.

The ingredients needed are

The list of sample indicators is generated in the two statements below:

# generate bootstrap sample indicators
bdat1 <- sample(seq(1,length(icd2[,1])),1000*length(icd2[,1]),replace=TRUE)
# coerce into a list of 1000 components
bdat1 <- split(bdat1.0,rep(seq(1,1000),rep(length(jnk[,1]),1000)))
The first statement uses the R internal function sample() to generate a random sample of 1000*n individuals (with replacement) from the original data. The second splits these into a list of 1000 separate sample selection vectors, each of length n. The resulting list of selection vectors bdat1 will be used again below.

In this example, I chose to pass only the sample selection vectors to the nodes (rather than the all data for each bootstrap sample), and use a simple fitting function which calls the actual estimation routine and returns only the desired estimates. Here is a listing of the R code for this function:

> dcft2
function (idx, sp = 1, tol = 1e-04, maxit = 100) 
{
    n <- length(icd2[idx, 1])
    fit <- dcpl(y = icd2[idx, 1], A = rep(1, n), B = rep(1, n), C = icd2[idx, 2], D = icd2[idx, 3], E = rep(0, n), pri = matrix(1, nrow = n, ncol = 1), thet = c(0.01), phi = 0, lambda = sp, maxit = 100, jfail = 3)
    if (fit$niter == maxit) {
        fit$theta <- NULL
    }
    fit$theta
}
Notice that this just sets up a call to the actual fitting function, and extracts the desired parameter estimate. (I've also included a statement to set to NULL any parameter vectors resulting in models that don't fully converge. The idx argument expects row indicators of indiv. selected for a given bootstrap sample, and is used to subset the original dataset (icd2) in the function body. This dataset and the estimation function must be present in the .RData from a previous session. The contents of this are made available to all participating nodes via a call to clusterEvalQ() prior to running the job (see below).

Here are the cluster start-up statements and timing results from running this problem on both a 12-unit cluster and a single CPU on the master node:

# Startup
> cl <- makeCluster(12, type = "MPI")
        12 slaves are spawned successfully. 0 failed.
> clusterEvalQ(cl, load(".RData"))

# Cluster results
> system.time(bout1 <- parLapply(cl, bdat1, dcft2))
[1]  0.47  0.16 15.84  0.00  0.00

# Single CPU results
> system.time(bout2 <- lapply(bdat1, dcft2))
[1] 150.72   6.94 155.90   0.00   0.00

# Shutdown
stopCluster(cl)
The cluster job runs approximately ten times faster in this case.

The output of the cluster call is a list of 1000 components, each containing a vector of 12 estimates. I can put these into a 1000 X 12 array and get individual standard errors across all bootstrap samples as follows:

bout1 <- matrix(unlist(bout1), ncol = 12, byrow = T)
apply(bout1,2,sd)
Now that the boostrap results are available it's possible to perform further computations (e.g. confidence intervals, etc). NOTE: batch jobs and simulation example

Instructions on compiling FORTRAN/C code for the cluster

For many complex problems, dynamically loaded C or FORTRAN subroutines provide a significant speed boost over writing code in R, especially for optimizations and when a lot of explicit iterations are involved. (The fitting function in the bootstrap std. error computation illustrated above is an example.) The instructions below assume that you have a C or FORTRAN subroutine named subr.c/subr.f, and that you have written a calling R function according to standard procedures. Currently, we only have the GNU C (gcc) and FORTRAN (g77) compilers available. Below are the statements required to use these compile and make a shared object for dynamic loading in R:
###############
# C
###############
# Compile (produces the object file subr.o)
cc -g -O2 -mcpu=970 -mtune=970 -force_cpusubtype_ALL -mpowerpc-gpopt -c subr.c
# Link (produces the shared object file subr.so)
cc -bundle -flat_namespace -undefined suppress -L/usr/local/lib/ -o subr.so subr.o -lcc_dynamic  -Wl,-framework -Wl,vecLib

###############
# FORTRAN
###############
# Compile (produces the object file subr.o)
g77 -g -O2 -mcpu=970 -mtune=970 -force_cpusubtype_ALL -mpowerpc-gpopt -c subr.f
# Link (produces the shared object file subr.so)
cc -bundle -flat_namespace -undefined suppress -L/usr/local/lib/ -o subr.so subr.o -lcc_dynamic  -Wl,-framework -Wl,vecLib
Note that the 3rd through 6th compile options for gcc/g77 are specific to the G5 processor. The "veclib" link option is useful if your subroutine invokes linear algebra routines from BLAS/LAPACK. R is compiled using this framework and it is optimized for the G5 chip. Extending this procedure to several routines is straightforward. To make a single shared object, simply list all the compiled object files together in the link statement (after the ".so" file). The simplest way to handle a large number of routines is to write a makefile and use the make utility. The resulting shared object can now be loaded using a call to dynload within R. (e.g. dynload("subr.so").)