Medical Statistics - Dr Jim Maas

HPC Case Studies

Dr Jim Maas, Research Associate, Norwich Medical School

with Dr. Allan Clark, Prof. Fujian Song, Prof. Max Bachmann, Norwich Medical School

Our research involves meta-analysis: a technique that combines numerical results from several different research trials to estimate the true efficacy of new medical treatments or procedures, compared to current standard procedures and/or other available procedures.

Previous methods depended on frequentist statistical methodology; however, recent developments in Bayesian Markov chain Monte Carlo (MCMC) simulation methods have facilitated more complex but more powerful meta-analysis methods. Meta-analysis results can be seriously compromised by characteristics of the data being analysed. One such problem is bias: the data contain a systematic error that causes the treatment's effectiveness to be over- or under-estimated.

This bias can arise from a number of factors. One is publication bias, where trials in which the new treatment showed advantageous results get published, while those in which it was found to be equivalent or inferior do not. Another possible source is commercial interest in the outcome of licensing trials of new treatments.

The goal of this research was therefore to evaluate how well several different meta-analysis models perform when bias is present in the data sets being analysed.

To accomplish this we simulated data sets with various predefined levels of bias, as well as varying numbers of trials and varying levels of heterogeneity and inconsistency.
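As a minimal sketch of what simulating trials with a predefined level of bias might look like (this is not the data generator used in the study; the function names `simulate_trials` and `pool_fixed`, their parameters, and the choice to add the bias on the log odds ratio scale are all assumptions made for illustration):

```r
# Hypothetical sketch: simulate k two-arm trials with a known true log odds
# ratio, between-trial heterogeneity (tau) and an additive systematic bias.
simulate_trials <- function(k, true_log_or, tau, bias, n_per_arm, p_control) {
  # Trial-specific effects vary around the (biased) overall effect
  theta <- rnorm(k, mean = true_log_or + bias, sd = tau)
  # Treatment-arm risk implied by the control risk and the log odds ratio
  p_treat <- plogis(qlogis(p_control) + theta)
  # Draw the number of events in each arm
  ev_c <- rbinom(k, n_per_arm, p_control)
  ev_t <- rbinom(k, n_per_arm, p_treat)
  # 2x2 table cells with a 0.5 continuity correction
  a  <- ev_t + 0.5
  b  <- n_per_arm - ev_t + 0.5
  cc <- ev_c + 0.5
  d  <- n_per_arm - ev_c + 0.5
  data.frame(log_or = log((a * d) / (b * cc)),
             se     = sqrt(1 / a + 1 / b + 1 / cc + 1 / d))
}

# Fixed-effect inverse-variance pooled estimate of the log odds ratio
pool_fixed <- function(dat) {
  w <- 1 / dat$se^2
  sum(w * dat$log_or) / sum(w)
}

set.seed(1)
unbiased <- simulate_trials(k = 200, true_log_or = 0.5, tau = 0.1, bias = 0,
                            n_per_arm = 200, p_control = 0.3)
biased   <- simulate_trials(k = 200, true_log_or = 0.5, tau = 0.1, bias = 0.3,
                            n_per_arm = 200, p_control = 0.3)

pool_fixed(unbiased)  # pooled estimate without systematic bias
pool_fixed(biased)    # pooled estimate shifted upwards by the bias
```

Because the bias is known in advance, the pooled estimate from any candidate model can be compared directly with the true effect used to generate the data.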

The benefits of using GRACE for High Performance Computing

We chose to write the data generation and analysis routines in R, a programming language designed for statistical computing. We chose it for several reasons: it is open-source and therefore low cost, it runs on virtually any operating system, and its world-wide user base ensures that it is constantly and rapidly being improved and upgraded.

Many packages for specific tasks are published and tested by R users. R scales readily from a single-core PC to large clusters such as the GRACE cluster at UEA. Several robust packages exist for parallelizing calculations across many CPUs on a cluster, dramatically reducing calculation time.

We have found it particularly practical because exactly the same packages can be loaded on a single PC for code development and testing, and the code then simply uploaded and scaled up to run the final simulations on the cluster. Other advantages of the cluster are that it is accessible remotely from anywhere and that jobs run in batch mode when resources are available.

Our work

Our specific application uses a couple of pre-written packages to run efficiently under the LSF batch scheduler on the GRACE cluster. Briefly, to run in parallel on a cluster, R requires a parallel "back end", the part running behind the scenes that co-ordinates the information flow among the many processors, and a parallel "front end" within the R code that determines which jobs to split up and run in parallel.

We use Open MPI, an open-source implementation of the Message Passing Interface, as the back end and the "foreach" looping construct within the R code as the front end. The foreach package also allows parallel loops to be nested within other parallel loops to increase efficiency.

This is an example of the libraries required, within the R file, to perform procedures in parallel:

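library(doMPI)             # MPI back end for foreach

cl <- startMPIcluster()    # start the MPI worker processes
registerDoMPI(cl)          # register them as the %dopar% back end

library(foreach)           # parallel looping front end
library(rjags)             # R interface to JAGS for Bayesian MCMC
library(coda)              # analysis of MCMC output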
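For development on a single PC, the same foreach front end can be pointed at a different back end. The following is a minimal sketch, assuming only the foreach package is installed; `one_replicate` is a hypothetical stand-in for a real simulation step, and `registerDoSEQ()` (the sequential back end that ships with foreach) takes the place of the doMPI registration shown above:

```r
library(foreach)

# Sequential back end for local testing; on the cluster this line would be
# replaced by the doMPI start-up and registration calls.
registerDoSEQ()

# Hypothetical stand-in for one simulation replicate
one_replicate <- function(i) {
  set.seed(i)          # make each replicate reproducible
  mean(rnorm(100))     # placeholder for "generate data, then analyse it"
}

# The loop itself does not mention the back end, so it is unchanged
# when the script is moved from the PC to the cluster.
results <- foreach(i = 1:8, .combine = c) %dopar% {
  one_replicate(i)
}

length(results)
```

Only the registration step differs between the two environments, which is what makes it practical to test on a PC before scaling up.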

This is an example of a nested foreach loop that allows the job to be done across multiple cores on the cluster:

## outer loop over parameter sets, nested (via %:%) with an inner loop over simulations
mpiresults <- foreach (j=1:nrow(inputpars), .combine=rbind) %:%
    foreach (i=1:nsim, .combine=rbind, .final=finalprocess,
             .packages = c("rjags", "gtools", "contrast", "lme4", "coda")) %dopar% {
        ## run the function to produce the actual simulation data sets
        sim.dat <- data.generator (j,OR,P,N.pts,bias)

        ## do the analysis; the value of this last expression is returned to foreach
        nout <- n3(sim.dat$simdata)
    }
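To see how the nested results are combined, here is a small self-contained sketch that can be run on a single PC. It assumes only the foreach package; the toy parameter grid, the value of `nsim`, and the "analysis" (a sample mean) are hypothetical placeholders for the study's `inputpars`, `data.generator` and `n3`:

```r
library(foreach)
registerDoSEQ()   # sequential back end, assumed here because no MPI is needed

# Hypothetical parameter grid: one row per bias scenario
inputpars <- data.frame(bias = c(0, 0.2, 0.4))
nsim <- 4         # replicates per scenario

toy_results <- foreach(j = seq_len(nrow(inputpars)), .combine = rbind) %:%
  foreach(i = seq_len(nsim), .combine = rbind) %dopar% {
    set.seed(1000 * j + i)                       # reproducible replicate
    x <- rnorm(50, mean = inputpars$bias[j])     # toy "data set"
    c(scenario = j, replicate = i, estimate = mean(x))
  }

dim(toy_results)   # one row per scenario-replicate pair, three columns
```

The `%:%` operator merges the two loops into a single stream of tasks, so the back end can spread all scenario-replicate pairs across the available workers rather than parallelizing only one level.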

Our tests produce 2.88 × 10⁵ different data sets and perform 1.15 × 10¹⁰ analyses on those data sets in approximately 4.5 hours, utilizing 404 cores connected by InfiniBand on the GRACE cluster. Running continuously on a single-core PC, the same job would require approximately 80 days to complete.
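The single-core figure can be checked with a rough back-of-the-envelope calculation. This sketch assumes perfect linear scaling and a PC core as fast as a cluster core, neither of which is stated in the text:

```r
wall_hours <- 4.5    # reported wall-clock time on the cluster
cores      <- 404    # reported number of cores used

core_hours       <- wall_hours * cores   # total compute: 1818 core-hours
single_core_days <- core_hours / 24      # the same work on one core

single_core_days   # 75.75
```

Under those assumptions the job would take about 76 days on one core, consistent with the approximate figure of 80 days.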

We found that all results were affected by the level of bias in the data; however, two models were relatively unaffected, while two other commonly used models failed and produced very poor estimates of new treatment effectiveness. One paper has been published and another is currently being submitted for publication.