This tutorial introduces power analysis using R. The entire R markdown document for this tutorial can be downloaded here. The basis for the present tutorial is Green and MacLeod (2016) (which you can find here). Green and MacLeod (2016) is a highly recommendable and thorough tutorial on performing power analysis in R. Recommendable literature on this topic are, e.g. Arnold et al. (2011) and Johnson et al. (2015) and this tutorial.
Power analysis is a method primarily used to determine the appropriate sample size for empirical studies.
Power analysis have also been used post-hoc to test if the sample size of studies was sufficient to detect meaningful effects. However, such post-hoc power calculations where the target effect size comes from the data, give misleading results (Hoenig and Heisey (2001)) and should thus be treated with extreme care.
There are different factors that determine if a model finds an effect. The accuracy (i.e., the probability of finding an effect) depends on three main factors:
Now, if a) we dealing with a very big effect, then we need only few participants and few items to accurately find this effect.
Or b) if we dealing with an effect that has low variability (it is observable for all subjects with the same strength), then we need only few participants and few items to accurately find this effect.
Before we conduct a study, we should figure out, what sample we need to detect a small/medium effect with medium variability so that our model is sufficient to detect this kind of effect. In order to do this, we would generate a data set that mirrors the kind of data that we expect to get (with the properties that we expect to get). We can then fit a model to this data and check if a model would be able to detect the expected effect. However, because a single model does not tell us that much (ift could simply be luck that it happened to find the effect), we run many different models on variations of the data and see how many of them find the effect. As a general rule of thumb, we want a data set that allows a model to find a medium sized effect with at least an accuracy of 80 percent (Field et al. 2007).
In the following, we will go through how to determine what sample size we need for an example analysis.
This tutorial is based on R. If you have not installed R or are new to it, you will find an introduction to and more information how to use R here. For this tutorials, we need to install certain packages into the R library on your computer so that the scripts shown below are executed without errors. Before turning to the code below, please install the packages by running the code below this paragraph. If you have already installed the packages mentioned below, then you can skip ahead and ignore this section. To install the necessary packages, simply run the following code - it may take some time (between 1 and 5 minutes to install all of the libraries so you do not need to worry if it takes some time).
# set options
options(stringsAsFactors = F) # no automatic data transformation
options("scipen" = 100, "digits" = 4) # suppress math annotation
# install libraries
install.packages(c("tidyverse", "lme4", "sjPlot", "simr"))
Once you have installed R and RStudio and initiated the session by executing the code shown above, you are good to go.
In order to perform a power analysis, we will start by loading the tidyverse package to process the data and by generating a data that we will use to determine the power of a regression model.
This simulated data set has
# load package
library(tidyverse)
# generate data
simdat <- data.frame(
sub <- rep(paste0("Sub", 1:10), each = 20),
cond <- rep(c(
rep("Control", 10),
rep("Test", 10))
, 10),
itm <- as.character(rep(1:10, 20))
) %>%
dplyr::rename(Subject = 1,
Condition = 2,
Item = 3) %>%
dplyr::mutate_if(is.character, factor)
## Warning in kableExtra::kable_styling(., bootstrap_options = c("striped", :
## Please specify format in kable. kableExtra can customize either HTML or LaTeX
## outputs. See https://haozhu233.github.io/kableExtra/ for details.
Subject | Condition | Item |
---|---|---|
Sub1 | Control | 1 |
Sub1 | Control | 2 |
Sub1 | Control | 3 |
Sub1 | Control | 4 |
Sub1 | Control | 5 |
Sub1 | Control | 6 |
Sub1 | Control | 7 |
Sub1 | Control | 8 |
Sub1 | Control | 9 |
Sub1 | Control | 10 |
Sub1 | Test | 1 |
Sub1 | Test | 2 |
Sub1 | Test | 3 |
Sub1 | Test | 4 |
Sub1 | Test | 5 |
We add a dependent variable (AOI) which represents the dependent variable in the study. In our case, we determine that Condition has a relatively weak effect (the probability of gazing into the area of interest (AOI) is .7 in the test condition compared to .5 in the control condition). In addition, this effect is only present in half of the subjects to reflect the variability in the effect.
simdat <- simdat %>%
dplyr::arrange(Condition) %>%
dplyr::mutate(
dep <- c(sample(c("AOI", "NotAOI"), 50, replace = T, prob = c(.5, .5)),
sample(c("AOI", "NotAOI"), 50, replace = T, prob = c(.5, .5)),
sample(c("AOI", "NotAOI"), 50, replace = T, prob = c(.5, .5)),
sample(c("AOI", "NotAOI"), 50, replace = T, prob = c(.7, .3)))
) %>%
dplyr::mutate_if(is.character, factor) %>%
dplyr::rename(AOI = 4)
The data looks like this.
Now that we have generated some data, we will fit a model to it and perform a power analysis on the observed effects.
Post-hoc power calculations (where the target effect size comes from the data) give misleading results (Hoenig and Heisey (2001)) and should thus be treated with extreme care!
We will fit a first model to the data. Thus, in a first step, we load the lme4
package to create a model, set a seed (to save the results and so that the results can be replicated), and then create an initial mixed-effects model.
# load package
library(lme4)
# set seed for replicability
set.seed(12345)
# fit model
m1 <- glmer(AOI ~ (1|Subject) +(1|Item) + Condition, family="binomial", data=simdat)
# inspect results
summary(m1)
## Generalized linear mixed model fit by maximum likelihood (Laplace
## Approximation) [glmerMod]
## Family: binomial ( logit )
## Formula: AOI ~ (1 | Subject) + (1 | Item) + Condition
## Data: simdat
##
## AIC BIC logLik deviance df.resid
## 279.0 292.1 -135.5 271.0 196
##
## Scaled residuals:
## Min 1Q Median 3Q Max
## -1.2267 -0.8697 -0.6901 0.9611 1.4491
##
## Random effects:
## Groups Name Variance Std.Dev.
## Subject (Intercept) 9.692e-02 3.113e-01
## Item (Intercept) 1.063e-10 1.031e-05
## Number of obs: 200, groups: Subject, 10; Item, 10
##
## Fixed effects:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 0.1640 0.2257 0.727 0.4674
## ConditionTest -0.6222 0.2909 -2.139 0.0324 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Correlation of Fixed Effects:
## (Intr)
## ConditinTst -0.629
## optimizer (Nelder_Mead) convergence code: 0 (OK)
## boundary (singular) fit: see ?isSingular
We now check the effect sizes of the predictors in the model. We can do this by displaying the results of the model using the tab_model
function from the sjPlot
package.
# load package
library(sjPlot)
# tabulate results
sjPlot::tab_model(m1)
AOI | |||
---|---|---|---|
Predictors | Odds Ratios | CI | p |
(Intercept) | 1.18 | 0.76 – 1.83 | 0.467 |
Condition [Test] | 0.54 | 0.30 – 0.95 | 0.032 |
Random Effects | |||
σ^{2} | 3.29 | ||
τ_{00} _{Subject} | 0.10 | ||
τ_{00} _{Item} | 0.00 | ||
ICC | 0.03 | ||
N _{Subject} | 10 | ||
N _{Item} | 10 | ||
Observations | 200 | ||
Marginal R^{2} / Conditional R^{2} | 0.028 / 0.056 |
Now, we perform power analysis on an observed effect. This analysis tells us how likely the model is to find an observed effect given the data.
We use a very low number of simulations (100) and we use the default z-test which is suboptimal for small samples (Bolker et al. (2009)). In a proper study, you shuld incerase the number of simulations (to at least 1000) and you should use a bootstrapping rather than a z-test (see Halekoh, Højsgaard, and others (2014)).
# load package
library(simr)
# set seed for replicability
set.seed(12345)
# perform power analysis for present model
rsim0 <- powerSim(m1, fixed("ConditionTest", "z"), nsim=100)
Now we can inspect the results.
# inspect results
rsim0
## Power for predictor 'ConditionTest', (95% confidence interval):
## 59.00% (48.71, 68.74)
##
## Test: z-test
## Effect size for ConditionTest is -0.62
##
## Based on 100 simulations, (1 warning, 0 errors)
## alpha = 0.05, nrow = 200
##
## Time elapsed: 0 h 0 m 18 s
##
## nb: result might be an observed power calculation
The results of the power analysis show that, given the data at hand, the model would have detected the effect of Conidition:Test with a probability of 59 percent. However, and as stated above, the results of such post-hoc power calculations (where the target effect size comes from the data) give misleading results (Hoenig and Heisey (2001)) and should thus be treated with extreme care!
While the effect of Condidition:Test is rather small (given the small number of subjects and items, the small effect size, and the variability in the effect and we would thus not be surprised that out model is not very accurate in detecting this effect), the target accuracy of finding an effect that one is interested in is commonly 80 percent (see Field, Miles, and Field (2012), Green and MacLeod (2016)). The accuracy (i.e., the probability of finding an effect) depends on three main factors:
We will now check if the sample size is sufficient to detect a small effect (Cohen’s d 0.2). According to Chen, Cohen, and Chen (2010) odds ratios of 1.68, 3.47, and 6.71 are equivalent to Cohen’s d = 0.2 (small), 0.5 (medium), and 0.8 (large) - the traditional scale is 0.2 for a small, 0.5 for medium sized, and 0.8 for a large or strong effect. We need to determine the odds ratios of the fixed effects and then convert them into Cohen’s d values for which we have associations between traditional denominations (small, medium, and large) and effect sife values.
estimatesfixedeffects <- fixef(m1)
exp(estimatesfixedeffects)
## (Intercept) ConditionTest
## 1.17826 0.53678
We can see that the effect is rather small which makes it very hard to detect an effect. We will now change the size of the effect of ConditionTest to represent a truly small effect, i.e. on the brink of being noise but being just strong enough to be considered small. In other words, we will set the effect so that its odds ratio is exactly 1.68.
# set seed for replicability
set.seed(12345)
# perform power analysis for small effect
fixef(m1)["ConditionTest"] <- 0.519
estimatesfixedeffects <- fixef(m1)
exp(estimatesfixedeffects)
## (Intercept) ConditionTest
## 1.178260 1.680346
Now we can test if the sample size of the model is sufficient to find a small effect.
# set seed for replicability
set.seed(12345)
# perform power analysis for present model
rsim1 <- powerSim(m1, fixed("ConditionTest", "z"), nsim=100)
The results are shown below.
# show results
rsim1
## Power for predictor 'ConditionTest', (95% confidence interval):
## 50.00% (39.83, 60.17)
##
## Test: z-test
## Effect size for ConditionTest is 0.52
##
## Based on 100 simulations, (0 warnings, 0 errors)
## alpha = 0.05, nrow = 200
##
## Time elapsed: 0 h 0 m 17 s
The power analysis shows that the data is sufficient to detect a small effect for Condition:Test with 50 percent accuracy.
We will now extend the data to see what sample size is needed to get to the 80 percent accuracy threshold. We begin by increasing the number of items from 10 to 30 to see if this would lead to a sufficient sample size.
# increase sample size
m2 <- extend(m1, along="Item", n=30)
# perform power simulation
rsim2 <- powerSim(m2, fixed("ConditionTest", "z"), nsim=100)
The results are shown below.
# show results
rsim2
## Power for predictor 'ConditionTest', (95% confidence interval):
## 79.00% (69.71, 86.51)
##
## Test: z-test
## Effect size for ConditionTest is 0.52
##
## Based on 100 simulations, (1 warning, 0 errors)
## alpha = 0.05, nrow = 600
##
## Time elapsed: 0 h 0 m 29 s
By increasing the number of items to 30, we would now be able to detect a small effect (d=.2) with an accuracy of 79 percent. This means that we would have to add more items as 30 is not yet sufficient.
We can also check the accuracy for a range of values as shown below. We begin by extending the number of Items.
pc2 <- powerCurve(m2, fixed("ConditionTest", "z"), along = "Item", nsim=100)
The results are shown below.
# show results
print(pc2)
## Power for predictor 'ConditionTest', (95% confidence interval),
## by number of levels in Item:
## 3: 19.00% (11.84, 28.07) - 60 rows
## 6: 29.00% (20.36, 38.93) - 120 rows
## 9: 39.00% (29.40, 49.27) - 180 rows
## 12: 49.00% (38.86, 59.20) - 240 rows
## 15: 59.00% (48.71, 68.74) - 300 rows
## 18: 69.00% (58.97, 77.87) - 360 rows
## 21: 70.00% (60.02, 78.76) - 420 rows
## 24: 75.00% (65.34, 83.12) - 480 rows
## 27: 82.00% (73.05, 88.97) - 540 rows
## 30: 86.00% (77.63, 92.13) - 600 rows
##
## Time elapsed: 0 h 3 m 0 s
In addition, we can plot the results as follows:
plot(pc2)
Instead of increasing the number of Items, we could also increase the number of Subjects. So below, we test check the accuracy for up to 30 subjects.
m3 <- extend(m1, along="Subject", n=30)
# perform power calculation
pc3 <- powerCurve(m3, fixed("ConditionTest", "z"), along="Subject", nsim=100)
The results are shown below.
# print results
print(pc3)
## Power for predictor 'ConditionTest', (95% confidence interval),
## by number of levels in Subject:
## 3: 16.00% ( 9.43, 24.68) - 60 rows
## 6: 25.00% (16.88, 34.66) - 120 rows
## 9: 36.00% (26.64, 46.21) - 180 rows
## 12: 51.00% (40.80, 61.14) - 240 rows
## 15: 60.00% (49.72, 69.67) - 300 rows
## 18: 67.00% (56.88, 76.08) - 360 rows
## 21: 71.00% (61.07, 79.64) - 420 rows
## 24: 76.00% (66.43, 83.98) - 480 rows
## 27: 80.00% (70.82, 87.33) - 540 rows
## 30: 83.00% (74.18, 89.77) - 600 rows
##
## Time elapsed: 0 h 2 m 50 s
Again, we can also visualize the results.
# visualize results
plot(pc3)
The results show that we breach the 80 percent threshold with 30 subjects.
Finally, it may be an option to increase the number of data points within subjects and items (while the number of items and subjects remain constant).
m4 <- extend(m1, within="Item+Subject", n=15)
# perform power calculation
pc4 <- powerCurve(m4, fixed("ConditionTest", "z"), within="Item+Subject", breaks=c(5, 10, 15), nsim=100)
The results are shown below.
# show results
print(pc4)
## Power for predictor 'ConditionTest', (95% confidence interval),
## by number of observations within Item+Subject:
## 5: 82.00% (73.05, 88.97) - 500 rows
## 10: 99.00% (94.55, 99.97) - 1000 rows
## 15: 100.0% (96.38, 100.0) - 1500 rows
##
## Time elapsed: 0 h 1 m 53 s
# show results
plot(pc4)
The results show that we would have a sufficient data set if we had 14 observations per Subject in each Item because with 10 observations, the accuracy breaches the 80 percent level.
Schweinberger, Martin. 2021. Power Analysis in R. Brisbane: The University of Queensland. url: https://slcladal.github.io/clust.html (Version 2021.04.30).
@manual{schweinberger2021pwr,
author = {Schweinberger, Martin},
title = {Power Analysis in R},
note = {https://slcladal.github.io/pwr.html},
year = {2021},
organization = "The University of Queensland, Australia. School of Languages and Cultures},
address = {Brisbane},
edition = {2021.04.30}
}
sessionInfo()
## R version 4.0.3 (2020-10-10)
## Platform: x86_64-w64-mingw32/x64 (64-bit)
## Running under: Windows 10 x64 (build 19041)
##
## Matrix products: default
##
## locale:
## [1] LC_COLLATE=German_Germany.1252 LC_CTYPE=German_Germany.1252
## [3] LC_MONETARY=German_Germany.1252 LC_NUMERIC=C
## [5] LC_TIME=German_Germany.1252
##
## attached base packages:
## [1] stats graphics grDevices utils datasets methods base
##
## other attached packages:
## [1] simr_1.0.5 sjPlot_2.8.6 lme4_1.1-25 Matrix_1.2-18
## [5] forcats_0.5.0 stringr_1.4.0 dplyr_1.0.2 purrr_0.3.4
## [9] readr_1.4.0 tidyr_1.1.2 tibble_3.0.4 ggplot2_3.3.3
## [13] tidyverse_1.3.0 flextable_0.5.11 knitr_1.30 DT_0.16
##
## loaded via a namespace (and not attached):
## [1] TH.data_1.0-10 minqa_1.2.4 colorspace_1.4-1 rio_0.5.16
## [5] ellipsis_0.3.1 sjlabelled_1.1.7 estimability_1.3 parameters_0.12.0
## [9] base64enc_0.1-3 fs_1.5.0 rstudioapi_0.11 fansi_0.4.1
## [13] mvtnorm_1.1-1 lubridate_1.7.9 xml2_1.3.2 codetools_0.2-16
## [17] splines_4.0.3 sjmisc_2.8.5 jsonlite_1.7.1 nloptr_1.2.2.2
## [21] ggeffects_0.16.0 pbkrtest_0.4-8.6 broom_0.7.2 binom_1.1-1
## [25] dbplyr_2.0.0 effectsize_0.4.4 compiler_4.0.3 httr_1.4.2
## [29] sjstats_0.18.0 emmeans_1.5.2-1 backports_1.1.10 assertthat_0.2.1
## [33] cli_2.1.0 htmltools_0.5.0 tools_4.0.3 coda_0.19-4
## [37] gtable_0.3.0 glue_1.4.2 Rcpp_1.0.5 carData_3.0-4
## [41] cellranger_1.1.0 vctrs_0.3.4 nlme_3.1-149 iterators_1.0.13
## [45] crosstalk_1.1.0.1 insight_0.13.1 xfun_0.19 openxlsx_4.2.3
## [49] rvest_0.3.6 lifecycle_0.2.0 statmod_1.4.35 MASS_7.3-53
## [53] zoo_1.8-8 scales_1.1.1 hms_0.5.3 parallel_4.0.3
## [57] sandwich_3.0-0 curl_4.3 yaml_2.2.1 gdtools_0.2.2
## [61] stringi_1.5.3 highr_0.8 bayestestR_0.8.2 plotrix_3.7-8
## [65] boot_1.3-25 zip_2.1.1 rlang_0.4.8 pkgconfig_2.0.3
## [69] systemfonts_0.3.2 evaluate_0.14 lattice_0.20-41 htmlwidgets_1.5.3
## [73] tidyselect_1.1.0 plyr_1.8.6 magrittr_1.5 R6_2.5.0
## [77] generics_0.1.0 multcomp_1.4-14 RLRsim_3.1-6 DBI_1.1.0
## [81] foreign_0.8-80 pillar_1.4.6 haven_2.3.1 withr_2.3.0
## [85] mgcv_1.8-33 abind_1.4-5 survival_3.2-7 performance_0.5.1
## [89] car_3.0-10 modelr_0.1.8 crayon_1.3.4 uuid_0.1-4
## [93] rmarkdown_2.5 officer_0.3.15 grid_4.0.3 readxl_1.3.1
## [97] data.table_1.13.2 reprex_0.3.0 digest_0.6.27 webshot_0.5.2
## [101] xtable_1.8-4 munsell_0.5.0 viridisLite_0.3.0 kableExtra_1.3.1
Arnold, Benjamin F, Daniel R Hogan, John M Colford, and Alan E Hubbard. 2011. “Simulation Methods to Estimate Design Power: An Overview for Applied Research.” BMC Medical Research Methodology 11 (1): 1–10.
Bolker, Benjamin M, Mollie E Brooks, Connie J Clark, Shane W Geange, John R Poulsen, M Henry H Stevens, and Jada-Simone S White. 2009. “Generalized Linear Mixed Models: A Practical Guide for Ecology and Evolution.” Trends in Ecology & Evolution 24 (3): 127–35.
Chen, Henian, Patricia Cohen, and Sophie Chen. 2010. “How Big Is a Big Odds Ratio? Interpreting the Magnitudes of Odds Ratios in Epidemiological Studies.” Communications in Statistics–Simulation and Computation 39 (4): 860–64.
Field, Andy, Jeremy Miles, and Zoe Field. 2012. Discovering Statistics Using R. Sage.
Field, Scott A, Patrick J O’Connor, Andrew J Tyre, and Hugh P Possingham. 2007. “Making Monitoring Meaningful.” Austral Ecology 32 (5): 485–91.
Green, Peter, and Catriona J MacLeod. 2016. “SIMR: An R Package for Power Analysis of Generalized Linear Mixed Models by Simulation.” Methods in Ecology and Evolution 7 (4): 493–98.
Halekoh, Ulrich, Søren Højsgaard, and others. 2014. “A Kenward-Roger Approximation and Parametric Bootstrap Methods for Tests in Linear Mixed Models–the R Package Pbkrtest.” Journal of Statistical Software 59 (9): 1–30.
Hoenig, John M, and Dennis M Heisey. 2001. “The Abuse of Power: The Pervasive Fallacy of Power Calculations for Data Analysis.” The American Statistician 55 (1): 19–24.
Johnson, Paul CD, Sarah JE Barry, Heather M Ferguson, and Pie Müller. 2015. “Power Analysis for Generalized Linear Mixed Models in Ecology and Evolution.” Methods in Ecology and Evolution 6 (2): 133–42.