Inspired by Grant McDermott’s blog post on efficient simulations in R, I decided to reimplement the same exercise in Julia. This post will not make much sense without having read that excellent post, so I’d recommend doing that first.
I recently switched to doing simulation work in Julia instead of R, because you don’t need many tricks to achieve decent performance on the first try. Compared to R, it is not necessary (at least in this example) to generate all the data at once. Instead, we can implement the simulation algorithm more naturally: generate a dataset, extract the quantities of interest, and repeat this process N times. I find that this leads to more intuitive and readable code, and Julia’s great for this kind of task.
1. Generate the data
We start by implementing a function
gen_data() to generate a DataFrame. The code
is basically a one-to-one translation from R, but we only generate one instance of the data.
using Distributions using DataFrames const std_normal = Normal(0, 1) function gen_data() ## total time periods in the the panel = 500 tt = 500 # x1 and x2 covariates x1_A = 1 .+ rand(std_normal, tt) x1_B = 1/4 .+ rand(std_normal, tt) x2_A = 1 .+ x1_A .+ rand(std_normal, tt) x2_B = 1 .+ x1_B .+ rand(std_normal, tt) # outcomes (notice different slope coefs for x2_A and x2_B) y_A = x1_A .+ 1*x2_A + rand(std_normal, tt) y_B = x1_B .+ 2*x2_B + rand(std_normal, tt) # combine DataFrame( id = vcat(fill(0, length(x1_A)), fill(1, length(x1_B))), x1 = vcat(x1_A, x1_B), x2 = vcat(x2_A, x2_B), x1_dmean = vcat(x1_A .- mean(x1_A), x1_B .- mean(x1_B)), x2_dmean = vcat(x2_A .- mean(x2_A), x2_B .- mean(x2_B)), y = vcat(y_A, y_B)) end
This should hopefully be pretty clear, even if you have never seen any Julia code before.
vcat() is used to concatenate vectors, and
fill() is similar to
rep() in R.
Another thing that might be unusual is having to write
.+ instead of
+ to achieve
While this still trips me up occasionally,
I think it’s one of Julia’s best features.
2. Extract the quantities of interest
Given a dataset, we can now run the two regressions and extract the coefficents. Here’s a function to achieve that using the GLM package:
using GLM function coefs_lm_formula(data) mod_level = lm(@formula(y ~ id + x1 * x2), data) mod_dmean = lm(@formula(y ~ id + x1_dmean * x2_dmean), data) (coef(mod_level)[end], coef(mod_dmean)[end]) end # example data = gen_data() coefs_lm_formula(data)
Now we just need to repeat these last two lines a large number of times, and save the coefficients.
3. Repeat N times
The function below runs the above nsim times and stores the two coefficients in a matrix. I use the BenchmarkTools package to benchmark this function.
function run_simulations(nsim) sims = zeros(nsim, 2); for i in 1:nsim data = gen_data() sims[i, :] .= coefs_lm_formula(data) end sims end using BenchmarkTools n = 20000 @btime run_simulations($n);
5.098 s (23020002 allocations: 16.86 GiB)
Around 5 seconds – not bad at all for this “naive” implementation that doesn’t make any use of particular performance tricks. A simple graph shows that the results are the same as in Grant’s post:
using Plots sims = run_simulations(20000) histogram(sims, label = ["level" "dmean"])
Some performance improvements
While this is a pretty good result already, there are of course numerous ways to speed
this up. Being a novice to Julia, I’m probably not the best person to show this, but here’s
an attempt anyway. One straightforward way to speed this up is to avoid creating the model
matrix using the
@formula call and instead to create the matrix ourselves.
Here’s a way to do this (I’ll simply overwrite the existing function):
function coefs_lm_formula(data) constant = fill(1, nrow(data)) X = Float64[constant data.id data.x1 data.x2 data.x1 .* data.x2] mod_level = fit(LinearModel, X, data.y) X[:, 5] .= data.x1_dmean .* data.x2_dmean mod_dmean = fit(LinearModel, X, data.y) (coef(mod_level)[end], coef(mod_dmean)[end]) end
And benchmark again:
2.016 s (2760002 allocations: 8.30 GiB)
A good speedup for a rather simple change!
A last idea is to fit the model more efficiently. The
fit function from GLM
still computes standard errors and p-values, which are unnecessary for this example.
Here’s a last benchmark implementing this:
using LinearAlgebra fastfit(X, y) = cholesky!(Symmetric(X' * X)) \ (X' * y) function coefs_lm_formula(data) constant = fill(1, nrow(data)) X = Float64[constant data.id data.x1 data.x2 data.x1 .* data.x2] mod_level = fastfit(X, data.y) X[:, 5] .= data.x1_dmean .* data.x2_dmean mod_dmean = fastfit(X, data.y) (mod_level[end], mod_dmean[end]) end @btime run_simulations($n);
1.694 s (2040002 allocations: 4.65 GiB)
Looks like calculating the standard errors and the p-values is not such an expensive operation after all.
The goal of this post was not to squeeze out the last bit of performance for this particular kind of simulation. Instead, I hope that this post shows that the first “naive” implementation in Julia (which would often be extremely slow in R), is often fast enough. There are many other advantages to Julia, but this is a major one for me.