Overview

Simulation studies in structural equation modeling (SEM) can be computationally demanding. This tutorial explores practical strategies for running SEM simulations efficiently in high-performance computing (HPC) environments using the R package simsem. I was motivated by a simple question: Is it faster to split jobs into chunks and run them separately, or to use the built-in multicore feature in sim()?

To answer this, I developed a new feature and compared it with the existing functionality in simsem. Since I had already conducted this comparison, I turned the results into this tutorial to demonstrate how to use simsem in HPC environments. I consider three approaches:

  1. No parallelization (baseline)
  2. Parallel computation via multicore = TRUE and numProc
  3. Serialization using the new repRun feature (manual splitting)

As a quick preview, parallel computation is more efficient than serialization and much easier to implement. Later, I discuss a situation where serialization might still be useful.

This note focuses on how to run simulations efficiently in HPC environments, rather than on the simulation results themselves.

Simple Illustration of Three Approaches

Let me start with a simple bootstrap example that I will later use for empirical illustration. Consider a two-factor model with seven items per factor, standardized loadings of .7, and a factor correlation of .5:

generatescript <- '
f1 =~ 0.7*x1 + 0.7*x2 + 0.7*x3 + 0.7*x4 + 0.7*x5 + 0.7*x6 + 0.7*x7
f2 =~ 0.7*y1 + 0.7*y2 + 0.7*y3 + 0.7*y4 + 0.7*y5 + 0.7*y6 + 0.7*y7
f1 ~~ 0.5*f2
x1 ~~ 0.51*x1
x2 ~~ 0.51*x2
x3 ~~ 0.51*x3
x4 ~~ 0.51*x4
x5 ~~ 0.51*x5
x6 ~~ 0.51*x6
x7 ~~ 0.51*x7
y1 ~~ 0.51*y1
y2 ~~ 0.51*y2
y3 ~~ 0.51*y3
y4 ~~ 0.51*y4
y5 ~~ 0.51*y5
y6 ~~ 0.51*y6
y7 ~~ 0.51*y7
'
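The residual variances of .51 in the script above are implied by the standardized loadings: each item's residual variance is 1 minus the squared loading. A quick standalone check in base R:

```r
# Residual variance implied by a standardized loading:
# residual variance = 1 - loading^2, so .7 implies .51.
loading <- 0.7
residvar <- 1 - loading^2
residvar
# [1] 0.51
```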

For illustration, I first generate a single dataset and compute the confidence intervals manually before moving to simsem. set.seed(123321) is used for data generation.

library(lavaan)
## Warning: package 'lavaan' was built under R version 4.5.2
## This is lavaan 0.6-21
## lavaan is FREE software! Please report any bugs.
set.seed(123321)
dat <- simulateData(generatescript, sample.nobs=500)

Next, I analyze the data using the bootstrap (se="boot"). Note that iseed = 787221 is specified so that the same bootstrap samples are drawn across runs.

analyzescript <- '
f1 =~ x1 + x2 + x3 + x4 + x5 + x6 + x7
f2 =~ y1 + y2 + y3 + y4 + y5 + y6 + y7
f1 ~~ cov*f2
'
out <- cfa(analyzescript, data=dat, std.lv=TRUE, se="boot", iseed=787221)
out
## lavaan 0.6-21 ended normally after 15 iterations
## 
##   Estimator                                         ML
##   Optimization method                           NLMINB
##   Number of model parameters                        19
## 
##   Number of observations                           500
## 
## Model Test User Model:
##                                                       
##   Test statistic                                27.686
##   Degrees of freedom                                29
##   P-value (Chi-square)                           0.535

I compare three types of confidence intervals (CI) with a confidence level of .95 and 1,000 bootstrap samples (the default in lavaan).

First, the Wald CI: vcov() is used to obtain the variance of the factor-correlation estimate, from which the standard error and then the Wald CI are computed.

est <- coef(out)["cov"]
sewald <- sqrt(vcov(out)["cov", "cov"])
crit <- qnorm(1 - (1 - 0.95)/2)
waldlower <- est - crit*sewald
waldupper <- est + crit*sewald
c(waldlower, waldupper)
##       cov       cov 
## 0.3426860 0.5098541
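In formula form, the Wald interval computed above is

\[
\hat{\theta} \pm z_{1-\alpha/2}\,\widehat{SE}(\hat{\theta}),
\]

where \(\hat{\theta}\) corresponds to est, \(\widehat{SE}(\hat{\theta})\) to sewald, and \(z_{0.975} \approx 1.96\) to crit.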

Second, percentile bootstrap CI

percout <- data.frame(parameterestimates(out, boot.ci.type = "perc", level=0.95))
perclower <- percout[percout$label == "cov", "ci.lower"]
percupper <- percout[percout$label == "cov", "ci.upper"]
c(perclower, percupper)
## [1] 0.3407223 0.5086200

Third, bias-corrected bootstrap CI

bcaout <- data.frame(parameterestimates(out, boot.ci.type = "bca.simple", 
                                        level=0.95))
bcalower <- bcaout[bcaout$label == "cov", "ci.lower"]
bcaupper <- bcaout[bcaout$label == "cov", "ci.upper"]
c(bcalower, bcaupper)
## [1] 0.3473720 0.5123705

These three approaches will later be compared within the simulation framework. I then define a function to automatically obtain all types of CI, which will be used later in simsem.

ciextract <- function(out) {
  est <- coef(out)["cov"]
  sewald <- sqrt(vcov(out)["cov", "cov"])
  crit <- qnorm(1 - (1 - 0.95)/2)
  waldlower <- est - crit*sewald
  waldupper <- est + crit*sewald
  percout <- data.frame(parameterestimates(out, boot.ci.type = "perc", level=0.95))
  perclower <- percout[percout$label == "cov", "ci.lower"]
  percupper <- percout[percout$label == "cov", "ci.upper"]
  bcaout <- data.frame(parameterestimates(out, boot.ci.type = "bca.simple", level=0.95))
  bcalower <- bcaout[bcaout$label == "cov", "ci.lower"]
  bcaupper <- bcaout[bcaout$label == "cov", "ci.upper"]
  c(waldlower=waldlower, waldupper=waldupper,
    perclower=perclower, percupper=percupper,
    bcalower=bcalower, bcaupper=bcaupper)
}
ciextract(out)
## waldlower.cov waldupper.cov     perclower     percupper      bcalower 
##     0.3426860     0.5098541     0.3407223     0.5086200     0.3473720 
##      bcaupper 
##     0.5123705

Standard no-parallel simsem usage

As usual, sim() is used to run the simulation.

library(simsem)
## 
## #################################################################
## This is simsem 0.5.16.914
## simsem is BETA software! Please report any bugs.
## simsem was first developed at the University of Kansas Center for
## Research Methods and Data Analysis, under NSF Grant 1053160.
## #################################################################
## 
## Attaching package: 'simsem'
## The following object is masked from 'package:lavaan':
## 
##     inspect
output_n <- sim(nRep=10, # Set to 10 for illustration
                model=analyzescript, # Analysis model
                generate=generatescript, # Data-generation model
                n=500, # Sample sizes for each generated data
                std.lv=TRUE, # Lavaan option to standardize latent variables (factors)
                lavaanfun="cfa", # Lavaan function used
                se="boot", # Lavaan option to run bootstrap
                iseed=787221, # Lavaan option to set seed for lavaan bootstrap (same row index is used)
                outfun=ciextract, # Extra output extracted from each lavaan results
                seed = 123321 # Set seed for data generation
)
## The RNGkind() was changed from "Mersenne-Twister" to RNGkind("L'Ecuyer-CMRG") in order to enable the multiple streams in the parallel package.
## To set a RNG seed using the previous RNGkind(), you can use
##  RNGkind("Mersenne-Twister","Inversion","Rejection")
##  or 
##  set.seed(seed, kind = "Mersenne-Twister")
## Progress: 1 / 10 
## Progress: 2 / 10 
## Progress: 3 / 10 
## Progress: 4 / 10 
## Progress: 5 / 10 
## Progress: 6 / 10 
## Progress: 7 / 10 
## Progress: 8 / 10 
## Progress: 9 / 10 
## Progress: 10 / 10

Use getExtraOutput() to extract the lower and upper bounds of the CIs.

cibounds_n <- getExtraOutput(output_n)
cibounds_n <- do.call(rbind, cibounds_n)
cibounds_n
##       waldlower.cov waldupper.cov perclower percupper  bcalower  bcaupper
##  [1,]     0.4444868     0.6728764 0.4393079 0.6703178 0.4393079 0.6703178
##  [2,]     0.4254273     0.6217501 0.4251118 0.6207501 0.4359865 0.6328827
##  [3,]     0.4088480     0.5943929 0.4092004 0.5997734 0.4111010 0.6034957
##  [4,]     0.3082559     0.5023283 0.3084828 0.5009681 0.3124582 0.5130160
##  [5,]     0.3936783     0.5805999 0.3900957 0.5782858 0.3977782 0.5806980
##  [6,]     0.3722804     0.5526076 0.3780085 0.5582242 0.3783728 0.5587538
##  [7,]     0.3751542     0.5953503 0.3753101 0.5962607 0.3825973 0.6101467
##  [8,]     0.3897483     0.5596978 0.3924128 0.5658567 0.3953604 0.5683240
##  [9,]     0.3955311     0.5982241 0.3955057 0.5998806 0.4072052 0.6130370
## [10,]     0.2963299     0.4669292 0.2950193 0.4633839 0.2963629 0.4658216

Standard parallel

In this example, I specify multicore = TRUE and numProc = 4.

output_p <- sim(nRep=10, model=analyzescript, generate=generatescript, 
                n=500, std.lv=TRUE, lavaanfun="cfa", se="boot", iseed=787221, 
                outfun=ciextract, seed=123321, 
                multicore=TRUE, # Use parallel processing
                numProc=4 # Use four processors
)
## Progress tracker is not available when 'multicore' is TRUE.

Use getExtraOutput() to extract the lower and upper bounds of the CIs.

cibounds_p <- getExtraOutput(output_p)
cibounds_p <- do.call(rbind, cibounds_p)
cibounds_p
##       waldlower.cov waldupper.cov perclower percupper  bcalower  bcaupper
##  [1,]     0.4444868     0.6728764 0.4393079 0.6703178 0.4393079 0.6703178
##  [2,]     0.4254273     0.6217501 0.4251118 0.6207501 0.4359865 0.6328827
##  [3,]     0.4088480     0.5943929 0.4092004 0.5997734 0.4111010 0.6034957
##  [4,]     0.3082559     0.5023283 0.3084828 0.5009681 0.3124582 0.5130160
##  [5,]     0.3936783     0.5805999 0.3900957 0.5782858 0.3977782 0.5806980
##  [6,]     0.3722804     0.5526076 0.3780085 0.5582242 0.3783728 0.5587538
##  [7,]     0.3751542     0.5953503 0.3753101 0.5962607 0.3825973 0.6101467
##  [8,]     0.3897483     0.5596978 0.3924128 0.5658567 0.3953604 0.5683240
##  [9,]     0.3955311     0.5982241 0.3955057 0.5998806 0.4072052 0.6130370
## [10,]     0.2963299     0.4669292 0.2950193 0.4633839 0.2963629 0.4658216

Manual Serialization

In simsem version 0.5-17 or higher, users can specify repRun so that simsem runs only a subset of replications. For example, I split this simulation into three sim() calls covering replications 1:3, 4:6, and 7:10.

output_s1 <- sim(nRep=10, model=analyzescript, generate=generatescript, 
                n=500, std.lv=TRUE, lavaanfun="cfa", se="boot", iseed=787221, 
                outfun=ciextract, seed=123321, 
                repRun=1:3 # Run 1st, 2nd, 3rd replications
)
## Progress: 1 / 3 
## Progress: 2 / 3 
## Progress: 3 / 3
output_s2 <- sim(nRep=10, model=analyzescript, generate=generatescript, 
                n=500, std.lv=TRUE, lavaanfun="cfa", se="boot", iseed=787221, 
                outfun=ciextract, seed=123321, 
                repRun=4:6 # Run 4, 5, 6 replications
)
## Progress: 1 / 3 
## Progress: 2 / 3 
## Progress: 3 / 3
output_s3 <- sim(nRep=10, model=analyzescript, generate=generatescript, 
                n=500, std.lv=TRUE, lavaanfun="cfa", se="boot", iseed=787221, 
                outfun=ciextract, seed=123321, 
                repRun=7:10 # Run 7, 8, 9, 10 replications
)
## Progress: 1 / 4 
## Progress: 2 / 4 
## Progress: 3 / 4 
## Progress: 4 / 4
output_s <- combineSim(output_s1, output_s2, output_s3)

combineSim() is used to pool results from the manually split simulations. Use getExtraOutput() to extract the lower and upper bounds of the CIs.

cibounds_s <- getExtraOutput(output_s)
cibounds_s <- do.call(rbind, cibounds_s)
cibounds_s
##       waldlower.cov waldupper.cov perclower percupper  bcalower  bcaupper
##  [1,]     0.4444868     0.6728764 0.4393079 0.6703178 0.4393079 0.6703178
##  [2,]     0.4254273     0.6217501 0.4251118 0.6207501 0.4359865 0.6328827
##  [3,]     0.4088480     0.5943929 0.4092004 0.5997734 0.4111010 0.6034957
##  [4,]     0.3082559     0.5023283 0.3084828 0.5009681 0.3124582 0.5130160
##  [5,]     0.3936783     0.5805999 0.3900957 0.5782858 0.3977782 0.5806980
##  [6,]     0.3722804     0.5526076 0.3780085 0.5582242 0.3783728 0.5587538
##  [7,]     0.3751542     0.5953503 0.3753101 0.5962607 0.3825973 0.6101467
##  [8,]     0.3897483     0.5596978 0.3924128 0.5658567 0.3953604 0.5683240
##  [9,]     0.3955311     0.5982241 0.3955057 0.5998806 0.4072052 0.6130370
## [10,]     0.2963299     0.4669292 0.2950193 0.4633839 0.2963629 0.4658216

As expected, all CI bounds are identical across the three approaches.

identical(cibounds_n, cibounds_p) && identical(cibounds_p, cibounds_s)
## [1] TRUE

Empirical Comparison of Computation Time

Simulation Design

I used the bootstrap example described above to run a mock simulation with the following design conditions:

  • Items per factor: 3, 7
  • Factor correlations: 0.2, 0.5
  • Standardized loadings: 0.5, 0.7
  • Sample size: 500, 1000

The total number of conditions is \(2 \times 2 \times 2 \times 2 = 16\). Without parallel processing or splitting, sixteen separate sim() runs are required. Let me briefly show how I set up the simulation workflow. This is just my current style—other approaches would work as well.

Simulation Backbone Setup

  1. List out the design conditions and use expand.grid(). The output is a data frame with 16 rows.
p <- c(3, 7)
faccor <- c(0.2, 0.5)
loading <- c(0.5, 0.7)
samplesizes <- c(500, 1000)

conds <- expand.grid(p, faccor, loading, samplesizes) # 16 conditions
colnames(conds) <- c("p", "faccor", "loading", "samplesizes")
  2. Define the number of replications.
nRep <- 1000
  3. Select the condition row used in sim(). The jobid ranges from 1 to 16, corresponding to the rows of conds, and cond is the row of conds for the current jobid. I show later how jobid is set automatically on the HPC system.
jobid <- 1
cond <- conds[jobid, ]
  4. Script for data generation. This follows the generatescript described above, with string parsing used to adapt it to the number of items per factor, the factor correlation, and the factor loadings.
xs <- paste0("x", 1:cond$p)
ys <- paste0("y", 1:cond$p)

load_x <- paste(paste(cond$loading, xs, sep="*"), collapse = "+")
load_y <- paste(paste(cond$loading, ys, sep="*"), collapse = "+")

resid_x <- paste0(xs, "~~", (1 - cond$loading^2), "*", xs)
resid_y <- paste0(ys, "~~", (1 - cond$loading^2), "*", ys)

generatescript <- paste(
  paste0("f1 =~ ", load_x),
  paste0("f2 =~ ", load_y),
  paste0("f1 ~~ ", cond$faccor, "*f2"),
  paste(resid_x, collapse = "\n"),
  paste(resid_y, collapse = "\n"),
  sep = "\n"
)
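To make the string parsing concrete, the standalone sketch below rebuilds the key strings for a hypothetical condition with 3 items per factor and loadings of .5 (values chosen for illustration only):

```r
# Rebuild the loading and residual strings for one hypothetical
# condition: p = 3 items per factor, standardized loading = .5.
cond <- list(p = 3, loading = 0.5, faccor = 0.2)
xs <- paste0("x", 1:cond$p)

load_x <- paste(paste(cond$loading, xs, sep = "*"), collapse = "+")
load_x
# [1] "0.5*x1+0.5*x2+0.5*x3"

resid_x <- paste0(xs, "~~", (1 - cond$loading^2), "*", xs)
resid_x[1]
# [1] "x1~~0.75*x1"
```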
  5. Script for data analysis. This script is adjusted based on the number of items per factor.
load_x2 <- paste(xs, collapse = "+")
load_y2 <- paste(ys, collapse = "+")

analyzescript <- paste(
  paste0("f1 =~ ", load_x2),
  paste0("f2 =~ ", load_y2),
  "f1 ~~ cov*f2",
  sep = "\n"
)
  6. Define the ciextract() function to obtain the extra outputs (CI bounds from the three methods).
ciextract <- function(out) {
  est <- coef(out)["cov"]
  sewald <- sqrt(vcov(out)["cov", "cov"])
  crit <- qnorm(1 - (1 - 0.95)/2)
  waldlower <- est - crit*sewald
  waldupper <- est + crit*sewald
  percout <- data.frame(parameterestimates(out, boot.ci.type = "perc", level=0.95))
  perclower <- percout[percout$label == "cov", "ci.lower"]
  percupper <- percout[percout$label == "cov", "ci.upper"]
  bcaout <- data.frame(parameterestimates(out, boot.ci.type = "bca.simple", level=0.95))
  bcalower <- bcaout[bcaout$label == "cov", "ci.lower"]
  bcaupper <- bcaout[bcaout$label == "cov", "ci.upper"]
  c(waldlower=waldlower, waldupper=waldupper,
    perclower=perclower, percupper=percupper,
    bcalower=bcalower, bcaupper=bcaupper)
}
  7. Run the simulation. (This script is shown but not evaluated in this document.)
output <- sim(
    nRep=nRep,
    model=analyzescript,
    generate=generatescript,
    n=cond$samplesizes,
    std.lv=TRUE,
    lavaanfun="cfa",
    se="boot",
    iseed=787221, 
    outfun=ciextract,
    seed = 123321
  )
  8. Save the output to storage. (The file-name prefix varies by method; the serialization scripts use simsem_serial_, as shown here.)
name <- paste0("simsem_serial_", cond$p, "_", as.integer(cond$faccor * 10), "_",
               as.integer(cond$loading * 10), "_", cond$samplesizes)
outfile <- paste0(name, ".rds")
saveRDS(output, file = outfile)

No-parallel jobs

For the no-parallel approach, 16 jobs (one per condition) are run on the HPC system. Suppose the following SLURM script is saved as submit_noparallel.sh:

#!/bin/bash -l
#SBATCH -p compute
#SBATCH -A pv911001
#SBATCH -N 1
#SBATCH --cpus-per-task=1
#SBATCH --time=12:00:00
#SBATCH --job-name=simsem_noparallel
#SBATCH --array=1-16
#SBATCH --output=logs/simsem_noparallel_%A_%a.out

module load cray-R/4.4.0

Rscript --vanilla splitsimresult_hpc_noparallel.R

Note that this SLURM script is exactly the one I used on the HPC system available to me (see the acknowledgements). The R script it runs, splitsimresult_hpc_noparallel.R, contains the code described above. The only notable change is in Step 3, which is modified as follows:

jobid <- as.numeric(Sys.getenv("SLURM_ARRAY_TASK_ID"))
cond <- conds[jobid, ]

That is, each HPC job has its own array ID, ranging from 1 to 16. Sys.getenv() reads the environment variable "SLURM_ARRAY_TASK_ID" that SLURM sets for each array job. For example, if the value is 15, cond will be the 15th row of conds.
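As a side note, when the script is tested locally outside SLURM, Sys.getenv() returns an empty string and as.numeric() yields NA. A defensive sketch is to supply a default via the unset argument; the fallback value of "1" here is my own choice, not part of the original script:

```r
# Fall back to condition 1 when SLURM_ARRAY_TASK_ID is not set,
# e.g., when the script is run locally for debugging.
jobid <- as.numeric(Sys.getenv("SLURM_ARRAY_TASK_ID", unset = "1"))
```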

Once submit_noparallel.sh and splitsimresult_hpc_noparallel.R are in the same directory on the HPC system, the sbatch command is used on the head node to submit the jobs as follows:

sbatch submit_noparallel.sh

Note that different HPC systems may have different routines. Also, in the actual HPC script, I check whether the output file already exists, because some jobs may fail before completion; when I resubmit the SLURM script, only the unfinished jobs are rerun.

As another note, users may need to check the R version used in their system. The installation of simsem and lavaan can also be a bit tricky in HPC environments. In practice, users should load the same R version on the head node that will later be used on the compute nodes. After that, run R and install simsem and lavaan in a local directory.

If installation is required, make sure to install the packages in a user-level library, for example:

install.packages("package_name", lib = "~/R/4.4.0")

Feel free to adjust the library path to match your system.

Also note that I explicitly specify the library path at the beginning of the R script:

.libPaths(c("~/R/4.4.0", .libPaths()))

This step ensures that the required packages are available on the compute nodes during job execution.

Parallel jobs

For parallel jobs, 16 jobs for the 16 conditions are run on the HPC system, as in the no-parallel case. Suppose the following SLURM script is saved as submit_parallel.sh:

#!/bin/bash -l
#SBATCH -p compute
#SBATCH -A pv911001
#SBATCH -N 1
#SBATCH --cpus-per-task=8
#SBATCH --time=12:00:00
#SBATCH --job-name=simsem_parallel
#SBATCH --array=1-16
#SBATCH --output=logs/simsem_parallel_%A_%a.out

module load cray-R/4.4.0

Rscript --vanilla splitsimresult_hpc_parallel.R

Note that cpus-per-task is set to 8 to use eight processors. The R script is changed to splitsimresult_hpc_parallel.R. In the R file, these steps are adjusted. First, Step 3 is adjusted:

jobid <- as.numeric(Sys.getenv("SLURM_ARRAY_TASK_ID"))
cond <- conds[jobid, ]

Second, Step 7 is adjusted:

output <- sim(
    nRep=nRep,
    model=analyzescript,
    generate=generatescript,
    n=cond$samplesizes,
    std.lv=TRUE,
    lavaanfun="cfa",
    se="boot",
    iseed=787221, 
    outfun=ciextract,
    seed = 123321, 
    multicore = TRUE, 
    numProc = 8
  )

The sim() call is modified to enable parallel processing, with numProc set to 8 to match cpus-per-task in the SLURM script. As before, sbatch is used to submit the parallel jobs. In this empirical illustration, I also evaluate performance with 16 and 32 processors.

Manual Split Serial Jobs

Using the new feature, the jobs are manually split into 20 chunks of 50 replications each, so \(16 \times 20 = 320\) jobs are run on the HPC system. The SLURM script therefore defines 320 array jobs, each using a single processor:

#!/bin/bash -l
#SBATCH -p compute
#SBATCH -A pv911001
#SBATCH -N 1
#SBATCH --cpus-per-task=1
#SBATCH --time=24:00:00
#SBATCH --job-name=simsem_serial
#SBATCH --array=1-320
#SBATCH --output=logs/simsem_serial_%A_%a.out

module load cray-R/4.4.0

Rscript --vanilla splitsimresult_hpc_serial.R

Note that the array contains 320 jobs. When R reads the array index, it is translated into both the row of conditions and the chunk number. First, Step 2 is modified to account for chunking:

nRep <- 1000
nChunks <- 20
chunkSize <- nRep / nChunks

The number of chunks is defined as 20 and the chunk size is defined as 50. Next, Step 3 is adjusted as follows:

jobid <- as.numeric(Sys.getenv("SLURM_ARRAY_TASK_ID"))
cond_id  <- ((jobid - 1) %/% nChunks) + 1
chunk_id <- ((jobid - 1) %% nChunks) + 1
cond <- conds[cond_id, ]
start <- (chunk_id - 1) * chunkSize + 1
end   <- chunk_id * chunkSize
repRun <- start:end
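As a standalone sanity check of this indexing (using the same formulas), array job 35 should map to condition 2, chunk 15, and replications 701 to 750:

```r
# Check the jobid -> (condition, chunk) -> repRun mapping for job 35.
nChunks <- 20
chunkSize <- 50
jobid <- 35
cond_id  <- ((jobid - 1) %/% nChunks) + 1  # condition index: 2
chunk_id <- ((jobid - 1) %% nChunks) + 1   # chunk index: 15
repRun <- ((chunk_id - 1) * chunkSize + 1):(chunk_id * chunkSize)
range(repRun)
# [1] 701 750
```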

The jobid determines both the condition (cond_id) and the chunk (chunk_id): the integer division gives the condition index, the remainder gives the chunk index, and repRun is then defined from chunk_id. Next, Step 7 is adjusted by adding repRun as follows:

output <- sim(
  nRep=nRep,
  model=analyzescript,
  generate=generatescript,
  n=cond$samplesizes,
  std.lv=TRUE,
  lavaanfun="cfa",
  se="boot",
  iseed=787221, 
  outfun=ciextract,
  seed = 123321, 
  repRun = repRun
)

Finally, Step 8 is adjusted to save a separate output file for each chunk:

name <- paste0("simsem_serial_", cond$p, "_", as.integer(cond$faccor * 10), "_",
               as.integer(cond$loading * 10), "_", cond$samplesizes)
outfile <- sprintf("%s_chunk%02d.rds", name, chunk_id)
saveRDS(output, file = outfile)

Summary of Files

In the GitHub repository, you will see the following files used in these simulations:

simsem/gh-pages/vignettes/hpc-simulation-simsem

Files used in the simulation.
Name           Script                           SLURM
No Parallel    splitsimresult_hpc_noparallel.R  submit_noparallel.sh
Parallel (8)   splitsimresult_hpc_parallel.R    submit_parallel.sh
Parallel (16)  splitsimresult_hpc_parallel16.R  submit_parallel16.sh
Parallel (32)  splitsimresult_hpc_parallel32.R  submit_parallel32.sh
Serialization  splitsimresult_hpc_serial.R      submit_serial.sh

Processing Time Results

Comparison of HPC execution strategies.
Method         Mean Wall Time  Min      Max      Jobs  Normalized Service Hours
No Parallel    7:46:24         6:57:31  9:18:15   16   0.97
Parallel (8)   0:59:34         0:50:57  1:13:39   16   1.00
Parallel (16)  0:30:31         0:26:24  0:38:17   16   1.02
Parallel (32)  0:16:36         0:14:24  0:21:21   16   1.11
Serialization  0:27:51         0:20:26  0:41:08  320   1.20

Parallelization works as expected: increasing the number of processors decreases runtime, with near-linear improvement up to 32 cores. Splitting into 20 chunks is roughly equivalent to using 20 processors. However, it results in higher service usage, suggesting additional overhead from job management and repeated setup. The no-parallel approach uses the lowest service hours but has the longest wall time. Note that 1 service hour corresponds to 128 processor hours.

In other words, if sufficient processors are available, parallel processing is both simpler and more efficient than manual serialization.

Potential Benefits of Serialization

If parallel processing is available, it is generally recommended. However, many researchers may not have access to a large number of processors. For example, some HPC systems provide free access for jobs using only a small number of processors (e.g., 4) and limited wall time (e.g., less than 24 hours). Parallel jobs using 4 processors may exceed the wall-time limit. In this case, manually splitting the jobs and running each with 4 processors can be efficient—so serialization is still a reasonable option.

Users might wonder why I developed repRun. Why not simply use different data-generating seeds for each chunk? The reason is simple: I want the single run and the split run to produce exactly the same results—that’s it. If users are not concerned about strict reproducibility, using different seeds is also a viable option.

Simulation Results

The simulation results are not the main focus of this article. The results from the no-parallel, serialization, and parallel processing are identical. Readers may skip this section and go directly to the conclusions.

See pooloutputs.R to check the results. In this file, I first repeat the design conditions, the number of replications (nRep), and the number of chunks (nchunks).

p <- c(3, 7)
faccor <- c(0.2, 0.5)
loading <- c(0.5, 0.7)
samplesizes <- c(500, 1000)
nRep <- 1000
nchunks <- 20

conds <- expand.grid(p, faccor, loading, samplesizes) 
colnames(conds) <- c("p", "faccor", "loading", "samplesizes")

Next, a for-loop iterates over the rows of conds. For each condition, the RDS file for each method is read in (output_n and its counterparts), and the extra outputs (CI bounds) are extracted into temp_cibounds_* objects. Each is combined into a matrix with rbind, augmented with the design conditions as additional columns, and appended to the accumulating cibounds_* matrices defined outside the loop.

cibounds_n <- NULL
cibounds_p8 <- NULL
cibounds_p16 <- NULL
cibounds_p32 <- NULL
cibounds_s <- NULL
for(i in 1:nrow(conds)) {
  cond <- conds[i,]
  cond_expanded <- as.matrix(cond[rep(1, nRep), ])
  faccor <- cond$faccor
  
  namerow <- paste0(cond$p, "_", as.integer(cond$faccor * 10), "_", 
                    as.integer(cond$loading * 10), "_", cond$samplesizes)
  noparallel <- paste0("outputs//simsem_noparallel_", namerow, ".rds")
  output_n <- readRDS(file = noparallel)
  temp_cibounds_n <- getExtraOutput(output_n)
  temp_cibounds_n <- do.call(rbind, temp_cibounds_n)
  temp_cibounds_n <- cbind(temp_cibounds_n, cond_expanded)
  cibounds_n <- rbind(cibounds_n, temp_cibounds_n)

  parallel8 <- paste0("outputs//simsem_parallel_", namerow, ".rds")
  output_p8 <- readRDS(file = parallel8)
  temp_cibounds_p8 <- getExtraOutput(output_p8)
  temp_cibounds_p8 <- do.call(rbind, temp_cibounds_p8)
  temp_cibounds_p8 <- cbind(temp_cibounds_p8, cond_expanded)
  cibounds_p8 <- rbind(cibounds_p8, temp_cibounds_p8)

  parallel16 <- paste0("outputs//simsem_parallel16_", namerow, ".rds")
  output_p16 <- readRDS(file = parallel16)
  temp_cibounds_p16 <- getExtraOutput(output_p16)
  temp_cibounds_p16 <- do.call(rbind, temp_cibounds_p16)
  temp_cibounds_p16 <- cbind(temp_cibounds_p16, cond_expanded)
  cibounds_p16 <- rbind(cibounds_p16, temp_cibounds_p16)

  parallel32 <- paste0("outputs//simsem_parallel32_", namerow, ".rds")
  output_p32 <- readRDS(file = parallel32)
  temp_cibounds_p32 <- getExtraOutput(output_p32)
  temp_cibounds_p32 <- do.call(rbind, temp_cibounds_p32)
  temp_cibounds_p32 <- cbind(temp_cibounds_p32, cond_expanded)
  cibounds_p32 <- rbind(cibounds_p32, temp_cibounds_p32)

  serial <- paste0("outputs//simsem_serial_", namerow)
  temp_cibounds_s <- NULL
  for(j in 1:nchunks) {
    outfile <- sprintf("%s_chunk%02d.rds", serial, j)
    output_s <- readRDS(file = outfile)
    temptemp_cibounds_s <- getExtraOutput(output_s)
    temp_cibounds_s <- c(temp_cibounds_s, temptemp_cibounds_s)
  }
  temp_cibounds_s <- do.call(rbind, temp_cibounds_s)
  temp_cibounds_s <- cbind(temp_cibounds_s, cond_expanded)
  cibounds_s <- rbind(cibounds_s, temp_cibounds_s)
}

Check whether all CI bounds are identical, which they are.

identical(cibounds_n, cibounds_p8) && identical(cibounds_p8, cibounds_p16) &&
  identical(cibounds_p16, cibounds_p32) && identical(cibounds_p32, cibounds_s)
## [1] TRUE

Because they are all identical, only cibounds_n is used to summarize the outputs. Next, compute the coverage indicator for each replication, that is, whether each CI contains the population factor correlation.

cibounds_n <- as.data.frame(cibounds_n)
cibounds_n$coverage_wald <- with(cibounds_n, waldlower.cov < faccor & faccor < waldupper.cov)
cibounds_n$coverage_perc <- with(cibounds_n, perclower < faccor & faccor < percupper)
cibounds_n$coverage_bca <- with(cibounds_n, bcalower < faccor & faccor < bcaupper)
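The coverage logic can be verified on made-up bounds (the toy values below are hypothetical, chosen so that exactly one of the two intervals contains faccor = 0.5):

```r
# Toy check of the coverage indicator: an interval "covers" when
# the population value lies strictly between its bounds.
toy <- data.frame(perclower = c(0.10, 0.30), percupper = c(0.30, 0.60))
faccor <- 0.5
covered <- with(toy, perclower < faccor & faccor < percupper)
covered
# [1] FALSE  TRUE
mean(covered)  # coverage rate
# [1] 0.5
```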

Next, use aggregate() to find the coverage rate for each design condition.

sumout <- aggregate(cbind(coverage_wald, coverage_perc, coverage_bca) 
                    ~ p + faccor + loading + samplesizes, data=cibounds_n, 
                    FUN=mean)
sumout
##    p faccor loading samplesizes coverage_wald coverage_perc coverage_bca
## 1  3    0.2     0.5         500         0.956         0.957        0.959
## 2  7    0.2     0.5         500         0.953         0.952        0.952
## 3  3    0.5     0.5         500         0.957         0.959        0.958
## 4  7    0.5     0.5         500         0.956         0.960        0.960
## 5  3    0.2     0.7         500         0.950         0.951        0.952
## 6  7    0.2     0.7         500         0.948         0.948        0.950
## 7  3    0.5     0.7         500         0.954         0.956        0.956
## 8  7    0.5     0.7         500         0.954         0.951        0.951
## 9  3    0.2     0.5        1000         0.957         0.956        0.955
## 10 7    0.2     0.5        1000         0.948         0.948        0.947
## 11 3    0.5     0.5        1000         0.950         0.948        0.950
## 12 7    0.5     0.5        1000         0.950         0.954        0.954
## 13 3    0.2     0.7        1000         0.942         0.944        0.941
## 14 7    0.2     0.7        1000         0.946         0.945        0.944
## 15 3    0.5     0.7        1000         0.947         0.949        0.947
## 16 7    0.5     0.7        1000         0.948         0.946        0.946

I then create a figure illustrating the coverage rate for each condition. Please see the R script for the code used to generate the graph.

[Figure: Coverage probability across conditions.]

Conclusions

I illustrated different parallel processing options for simsem in HPC environments. I also introduced the new serialization feature, repRun, added in simsem 0.5-17. However, serialization is generally suboptimal compared to parallel processing via multicore and numProc. I hope this tutorial helps researchers save time when running large-scale simulation studies.

Acknowledgements

This work used computational resources provided by the NSTDA Supercomputer Center (ThaiSC), Thailand, under project code pv911001. I thank the support team for maintaining the system.