This is an introduction to linear trend analysis from an estimation perspective, applied to estimating the linear trend among population means. The contents of this introduction are based on Maxwell, Delaney, and Kelley (2017) and Rosenthal, Rosnow, and Rubin (2000). I have taken the (invented) data from Haans (2018). The estimation perspective to statistical analysis is aimed at obtaining point and interval estimates of effect sizes. Here, I will use the frequentist approach of obtaining a point estimate and a 95% confidence interval for the relevant effect size. For linear trend analysis, the relevant effect size is the slope coefficient of the linear trend, so the purpose of the analysis is to estimate the value of the slope and the 95% confidence interval of that estimate. We will use contrast analysis to obtain these estimates.
[Note: A pdf file that differs only slightly from this blog post can be found on my ResearchGate page: here; I suggest Haans (2018) for an easy-to-follow introduction to contrast analysis, which should really help in understanding what is being said below.]
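To preview the kind of result the analysis aims at, here is a minimal R sketch of a linear trend contrast for four group means. The data are simulated stand-ins (the invented Haans (2018) scores are not reproduced here) and the variable names are mine; the logic, estimating the slope as a contrast of the row means with a 95% confidence interval, is what this introduction works out in detail.
# hedged sketch with simulated stand-in data; not the Haans (2018) scores themselves
set.seed(123)
n <- 9                                   # assumed number of scores per row
row <- rep(1:4, each = n)                # seating rows 1 to 4
y <- rnorm(4 * n, mean = 8 - 0.5 * row)  # hypothetical retention scores

lambda <- c(-1.5, -0.5, 0.5, 1.5)        # row number minus its mean
means  <- tapply(y, row, mean)
MSE    <- sum(tapply(y, row, var) * (n - 1)) / (4 * n - 4)  # pooled within-row variance

psi   <- sum(lambda * means)             # linear trend contrast
slope <- psi / sum(lambda^2)             # slope of the linear trend
se    <- sqrt(MSE / (n * sum(lambda^2))) # standard error of the slope estimate
tcrit <- qt(.975, 4 * n - 4)
c(estimate = slope, lower = slope - tcrit * se, upper = slope + tcrit * se)
# with equal group sizes, lm(y ~ row) gives the same slope estimate
# (its confidence interval uses the regression error term instead of MSE)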
Sunday, 14 April 2019
Planning for Precise Contrast Estimates: Introduction and Tutorial (Preprint)
I just finished a preprint of an introduction and tutorial to sample size planning for precision of contrast estimates. The tutorial focuses on single-factor between-subjects and within-subjects designs, and on mixed factorial designs with one within-subjects and one between-subjects factor. The tutorial contains R code for sample size planning in these designs.
The preprint is available on ResearchGate: Click (but I am just as happy to send it to you if you like; just let me know).
Thursday, 4 April 2019
Contrast analysis with R: Tutorial for factorial mixed designs
In this tutorial I will show how contrast estimates can be obtained with R. Previous posts focused on the analyses in factorial between-subjects and within-subjects designs; here I will focus on a mixed design with one between-participants factor and one within-participants factor. I will discuss how to obtain an estimate of an interaction contrast using a dataset provided by Haans (2018).
I will illustrate two approaches: the first is to use transformed scores in combination with one-sample t-tests, and the other uses the univariate mixed model approach. As was explained in the previous tutorial, the first approach tests each contrast against its own error variance, whereas in the mixed model approach a common error variance is used (which requires statistical assumptions that will probably not apply in practice; the advantage of the mixed model approach, if its assumptions apply, is that the Margin of Error of the contrast estimate is somewhat smaller).
Again, our example is taken from Haans (2018; see also this post). It considers the effect of students' seating distance from the teacher on the educational performance of the students: the closer to the teacher the student is seated, the higher the performance. A "theory" explaining the effect is that the effect is mainly caused by the teacher having decreased levels of eye contact with the students sitting farther to the back in the lecture hall.
To test that theory, an experiment was conducted with N = 9 participants in a factorial mixed design (also called a split-plot design), with two fixed factors: the between-participants factor Sunglasses (with or without) and the within-participants factor Location (row 1 through row 4). The dependent variable was the score on a 10-item questionnaire about the contents of the lecture. So, we have a 2 by 4 mixed factorial design, with n = 9 participants in each combination of the factor levels.
We will again focus on obtaining an interaction contrast: we will estimate the extent to which the difference between the mean retention score on the first row and those on the other rows differs between the conditions with and without sunglasses.
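Since only a summary of the post is shown here, the following is a hedged sketch of the mixed model route with simulated stand-in data and my own variable names (not the Haans (2018) scores or the post's exact code): fit the model with lmer and ask emmeans for the custom interaction contrast. Print the estimated marginal means first and check their order before trusting the weight vector.
library(lme4)
library(emmeans)

# hypothetical stand-in data: 9 participants per sunglasses group, rows 1-4 each
set.seed(1)
d <- expand.grid(pp = factor(1:18), location = factor(paste0("row", 1:4)))
d$sunglasses <- factor(ifelse(as.numeric(d$pp) <= 9, "with", "without"))
pp.eff <- rnorm(18, sd = 0.8)
d$score <- 6 + pp.eff[as.numeric(d$pp)] + rnorm(nrow(d))

# univariate mixed model with a random intercept per participant
fit <- lmer(score ~ sunglasses * location + (1 | pp), data = d)

emm <- emmeans(fit, ~ sunglasses * location)
emm  # print first: the weights below assume sunglasses varies fastest within location

# interaction contrast: (row 1 minus the mean of rows 2-4), with minus without
w <- c(1, -1, -1/3, 1/3, -1/3, 1/3, -1/3, 1/3)
confint(contrast(emm, method = list(interaction = w)))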
Friday, 22 March 2019
Planning with assurance, with assurance
Planning for precision requires that we choose a target Margin of Error (MoE; see this post for an introduction to the basic concepts) and a value for assurance, the probability that MoE will not exceed our target MoE. What your exact target MoE will be depends on your research goals, of course.
Cumming and Calin-Jageman (2017, p. 277) propose a strategy for determining target MoE. You can use this strategy if your research goal is to provide strong evidence that the effect size is non-zero. The strategy is to divide the expected value of the difference by two, and to use that result as your target MoE.
Let's restrict our attention to the comparison of two means. If the expected difference between the two means is Cohen's d = .80, the proposed strategy is to set your target MoE at f = .40, which means that your target MoE is set at .40 standard deviations. If you plan for this value of target MoE with 80% assurance, the recommended sample size is n = 55 participants per group. These results are guaranteed to be true, if it is known for a fact that Cohen's d is .80 and all statistical assumptions apply.
But it is generally not known for a fact that Cohen's d has a particular value and so we need to answer a non-trivial question: what effect size can we reasonably expect? And, how can we have assurance that the MoE will not exceed half the unknown true effect size? One of the many options we have for answering this question is to conduct a pilot study, estimate the plausible values of the effect size and use these values for sample size planning. I will describe a strategy that basically mirrors the sample size planning for power approach described by Anderson, Kelley, and Maxwell (2017).
The procedure is as follows. In order to plan with approximately 80% assurance, estimate on the basis of your pilot the 80% confidence interval for the population effect size, and use half the value of its lower limit as the target MoE for sample size planning with 90% assurance. This will give you approximately 81% assurance that the obtained MoE is no larger than half the unknown true effect size.
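Here is a minimal R sketch of both steps, under the assumptions of the example above (two independent groups, normal populations, equal variances, equal n). The helper functions and the pilot numbers are my own illustration, not code from Cumming and Calin-Jageman (2017) or Anderson, Kelley, and Maxwell (2017).
# MoE of the mean difference, in standard deviation units, that will not be
# exceeded with probability 'assurance', for two groups of n participants each
moe.assurance <- function(n, assurance) {
  df <- 2 * n - 2
  qt(.975, df) * sqrt(2 / n) * sqrt(qchisq(assurance, df) / df)
}

# smallest n per group for which the assurance MoE is at most the target
plan.n <- function(target, assurance, n.max = 5000) {
  n <- 2:n.max
  min(n[moe.assurance(n, assurance) <= target])
}

plan.n(target = .40, assurance = .80)  # 55, matching the example above

# pilot-based version: take half the lower limit of the 80% CI for delta from a
# pilot (here a hypothetical pilot with n = 20 per group and d = 0.80),
# then plan with 90% assurance
n.pilot <- 20
d.pilot <- 0.80
t.obs   <- d.pilot / sqrt(2 / n.pilot)
df      <- 2 * n.pilot - 2
ncp.lower <- uniroot(function(ncp) pt(t.obs, df, ncp) - .90, c(-20, 20))$root
d.lower   <- ncp.lower * sqrt(2 / n.pilot)  # lower limit of the 80% CI for delta
plan.n(target = d.lower / 2, assurance = .90)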
Saturday, 5 January 2019
Contrast Analysis with R: Tutorial for Factorial Within Subjects Designs
In this post, I illustrate how to obtain contrast estimates in R for within subjects designs. I will illustrate two approaches. The first is to simply use the one-sample t-test on the transformed scores. This will replicate a contrast analysis done with SPSS GLM Repeated Measures. The second is to make use of mixed linear effects modeling with the lmer-function from the lme4 library.
Conceptually, the major difference between the two approaches is that in the latter approach we make use of a single shared error variance and covariance across conditions (we assume compound symmetry), whereas in the former each contrast has a separate error variance, depending on the specific conditions involved in the contrast (these conditions may have unequal variances and covariances).
As in the previous post (https://the-small-s-scientist.blogspot.com/2018/12/contrast-analysis-with-r-tutorial.html), we will focus our attention on obtaining an interaction contrast estimate.
Again, our example is taken from Haans (2018; see also this post). It considers the effect of students' seating distance from the teacher on the educational performance of the students: the closer to the teacher the student is seated, the higher the performance. A "theory" explaining the effect is that the effect is mainly caused by the teacher having decreased levels of eye contact with the students sitting farther to the back in the lecture hall.
To test that theory, an experiment was conducted with N = 9 participants in a completely within-subjects design (also called a fully crossed design), with two fixed factors: sunglasses (with or without) and location (row 1 through row 4). The dependent variable was the score on a 10-item questionnaire about the contents of the lecture. So, we have a 2 by 4 within-subjects design, with n = 9 participants in each combination of the factor levels.
We will again focus on obtaining an interaction contrast: we will estimate the extent to which the difference between the mean retention score on the first row and those on the other rows differs between the conditions with and without sunglasses.
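Since only a summary of the post is shown here, the following is a hedged sketch of the transformed-scores route with simulated stand-in data and my own column names (not the Haans (2018) scores or the post's exact code): compute the interaction contrast score for each participant and hand those scores to a one-sample t-test; its confidence interval is the interval for the contrast.
# hypothetical stand-in data: 9 participants, 8 cells (2 sunglasses x 4 rows)
set.seed(2)
wide <- as.data.frame(matrix(rnorm(9 * 8, mean = 6), nrow = 9))
names(wide) <- c(paste0("with_row", 1:4), paste0("without_row", 1:4))

# interaction contrast: (row 1 vs mean of rows 2-4), with minus without sunglasses
w <- c(1, -1/3, -1/3, -1/3,
       -1, 1/3, 1/3, 1/3)

psi.scores <- as.matrix(wide) %*% w  # one transformed score per participant
t.test(psi.scores)                   # estimate and 95% CI of the interaction contrast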
Friday, 19 October 2018
Custom contrasts for the one-way repeated measures design using Lmer
Here is some code for doing a one-way repeated measures analysis with lme4 and custom contrasts. We will use a repeated measures design with three conditions of the factor Treat and 20 participants. The contrasts are Helmert contrasts, but they differ from the built-in Helmert contrasts in that the sum of the absolute values of the contrast weights equals 2 for each contrast.
The standard error of each contrast estimate equals the square root of the product of the sum of the squared contrast weights $w_i$ and the residual variance, divided by the number of participants n.
$$\sigma_{\hat{\psi}} = \sqrt{\sum{w_i^2}\sigma^2_e/n}$$
The residual variance equals the within treatment variance times 1 minus the correlation between conditions (which equals the within treatment variance minus the covariance $\rho\sigma^2_{within}$).
$$\sigma^2_e = \sigma^2_{within}(1 - \rho)$$
In the example below, the within treatment variance equals 1 and the covariance 0.5 (so the value of the correlation is .50 as well). The residual variance is therefore equal to .50.
For the first contrast, the weights are equal to {-1, 1, 0}, so the value of the standard error of the contrast estimate should be equal to $\sqrt{2 \times .50/20} = 0.2236$.
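As a quick check of that value in R:
# standard error of the first contrast: sqrt(sum of squared weights * residual variance / n)
sqrt((1^2 + 1^2 + 0^2) * .50 / 20)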
library(MASS)
library(lme4)
# setting up treatment and participant factors
nTreat = 3
nPP = 20
Treat <- factor(rep(1:nTreat, each=nPP))
PP <- factor(rep(1:nPP, nTreat))

# specify means
means = c(0, .20, .50)

# create variance-covariance matrix
# var = 1, cov = .5
Sigma = matrix(rep(.5, 9), 3, 3)
diag(Sigma) <- 1

# generate the data; using empirical = TRUE
# so that variance and covariance are known
# set to FALSE for "real" random data
sco = as.vector(mvrnorm(nPP, means, Sigma, empirical = TRUE))

# setting up custom contrasts for Treatment factor
myContrasts <- rbind(c(-1, 1, 0), c(-.5, -.5, 1))
contrasts(Treat) <- ginv(myContrasts)

# fit linear mixed effects model:
myModel <- lmer(sco ~ Treat + (1|PP))
summary(myModel)
## Linear mixed model fit by REML ['lmerMod']
## Formula: sco ~ Treat + (1 | PP)
## 
## REML criterion at convergence: 157.6
## 
## Scaled residuals: 
##     Min      1Q  Median      3Q     Max 
## -2.6676 -0.4869  0.1056  0.6053  1.9529 
## 
## Random effects:
##  Groups   Name        Variance Std.Dev.
##  PP       (Intercept) 0.5      0.7071  
##  Residual             0.5      0.7071  
## Number of obs: 60, groups:  PP, 20
## 
## Fixed effects:
##             Estimate Std. Error t value
## (Intercept)   0.2333     0.1826   1.278
## Treat1        0.2000     0.2236   0.894
## Treat2        0.4000     0.1936   2.066
## 
## Correlation of Fixed Effects:
##        (Intr) Treat1
## Treat1 0.000        
## Treat2 0.000  0.000
Thursday, 4 October 2018
Distribution of the difference between two binomial variables
For a project I'm working on, I needed to work with the probabilities associated with the difference between two binomial variables. I thought I'd share the code for four functions for calculating the probability mass and the cumulative probabilities.
The functions d.diff.bin (probability mass) and p.diff.bin (cumulative distribution) calculate the distribution of the difference for equal sample sizes N and equal "success" probability. The arguments of the functions are the quantiles k of the distribution of the differences, the number of trials N, and the probability of success p.
The functions d.diff.bin.un (probability mass) and p.diff.bin.un (cumulative distribution) calculate the distribution of the difference for sample sizes and success probabilities that may (or may not) differ between the two populations. The arguments of the functions are the quantiles k of the distribution of the differences, the numbers of trials N.1 and N.2 for the two groups, and the probabilities of success p.1 and p.2 for each group.
# equal N and p:
# probability mass:
d.diff.bin = function(k, N, p) {
  diff = outer(0:N, 0:N, "-")
  prob = outer(dbinom(0:N, N, p, log=TRUE), dbinom(0:N, N, p, log=TRUE), "+")
  p = sum(exp(prob[diff == k]))
  return(p)
}

# cumulative probability:
p.diff.bin = function(k, N, p) {
  diff = outer(0:N, 0:N, "-")
  prob = outer(dbinom(0:N, N, p, log=TRUE), dbinom(0:N, N, p, log=TRUE), "+")
  p = sum(exp(prob[diff <= k]))
  return(p)
}

# examples
d.diff.bin(0, 30, .50)
## [1] 0.1025782
p.diff.bin(0, 30, .50)
## [1] 0.5512891
# mean of distribution:
N = 30
m = sum(sapply(-N:N, d.diff.bin, N = 30, p = .50)*(-N:N))
m
## [1] -3.084642e-20
all.equal(0, m)
## [1] TRUE
# (un)equal N and p:
# probability mass:
d.diff.bin.un = function(k, N.1, N.2, p.1, p.2) {
  diff = outer(0:N.1, 0:N.2, "-")
  prob = outer(dbinom(0:N.1, N.1, p.1, log=TRUE), dbinom(0:N.2, N.2, p.2, log=TRUE), "+")
  p = sum(exp(prob[diff == k]))
  return(p)
}

# cumulative distribution:
p.diff.bin.un = function(k, N.1, N.2, p.1, p.2) {
  diff = outer(0:N.1, 0:N.2, "-")
  prob = outer(dbinom(0:N.1, N.1, p.1, log=TRUE), dbinom(0:N.2, N.2, p.2, log=TRUE), "+")
  p = sum(exp(prob[diff <= k]))
  return(p)
}

# examples
d.diff.bin.un(0, 30, 20, .60, .70)
## [1] 0.05896135
p.diff.bin.un(0, 30, 20, .60, .70)
## [1] 0.1497043
Friday, 16 March 2018
Sample size planning for precision: the basics
In this post, I will introduce some of the ideas underlying sample size planning for precision. The ideas are illustrated with a Shiny application, which can be found here: https://gmulder.shinyapps.io/PlanningApp/. The app illustrates the basic theory of sample size planning for two independent groups. (If the app is no longer available (my allotted active monthly hours on shinyapps.io are limited), contact me and I'll send you the code.)
The basic idea
The basic idea is that we are planning an experiment to estimate the difference in population means of an experimental and a control group. We want to know how many observations per group we have to make in order to estimate the difference between the means with a given target precision.
Our measure of precision is the Margin of Error (MOE). In the app, we specify our target MOE as a fraction (f) of the population standard deviation. However, we do not only specify our target MOE, but also our desired level of assurance. The assurance is the probability that our obtained MOE will not exceed our target MOE. Thus, if the assurance is .80 and our target MOE is f = .50, we have a probability of 80% that our obtained MOE will not exceed f = .50.
The only part of the app you need for sample size planning is the "Sample size planning"-form. Specify f, and the assurance, and the app will give you the desired sample size.
If you do that with the default values f = .50 and Assurance = .80, the app will give you the following results on the Planning Results tab: Sample Size: 36.2175, Expected MOE (f): 0.46. This tells you that you need to sample (at least) 37 participants per group, and that the Expected MOE (the MOE you will get on average) will then equal 0.46 (or even a little less, since you sample more than 36.2175 participants).
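As an aside, these numbers can be reproduced closely outside the app. The sketch below is not the app's code; it assumes the standard model for two independent groups of n participants each, computes the expected MOE from the mean of the estimated standard deviation, and computes the assurance MOE from a chi-square quantile.
# expected MoE and assurance MoE (in sd units) for two independent groups of n each
expected.moe <- function(n) {
  df <- 2 * n - 2
  c4 <- sqrt(2 / df) * exp(lgamma((df + 1) / 2) - lgamma(df / 2))  # E(s/sigma)
  qt(.975, df) * sqrt(2 / n) * c4
}
assurance.moe <- function(n, assurance) {
  df <- 2 * n - 2
  qt(.975, df) * sqrt(2 / n) * sqrt(qchisq(assurance, df) / df)
}

# sample size for target f = .50 with 80% assurance (non-integer solution)
uniroot(function(n) assurance.moe(n, .80) - .50, c(4, 1000))$root  # close to 36.2
expected.moe(37)                                                   # close to 0.46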
The Planning Results tab also gives you a figure for the power of the t-test testing the NHST nil-hypothesis for the effect size (Cohen's d) specified in the "Set population values" form. Note that this form, like the rest of the app, provides details that are not necessary for sample size planning for precision, but that make the theoretical concepts clear. So, let's turn to those details.
The population
Even though it is not at all necessary to specify the population values in detail, considering the population helps to realize the following. The sample size calculations and the figures for expected MOE and power, are based on the assumption that we are dealing with random samples from normal populations with equal variances (standard deviations).
From these three assumptions, all the results follow deductively. The following is important to realize: if these assumptions do not obtain, the truth of the (statistical) conclusions we derive by deduction is no longer guaranteed. (Maybe you have never before realized that sample size planning involves deductive reasoning; deductive reasoning is also required for the calculation of p-values and to prove that 95% confidence intervals contain the value of the population parameter in 95% of the cases; without these assumptions it is uncertain what the true p-value is and whether or not the 95% confidence interval is in fact a 95% confidence interval.)
In general, then, you should try to show (to others, if not to yourself) that it is reasonable to assume normally distributed populations, with equal variances and random sampling, before you decide that the p-value of your t-test, the width of your confidence interval, and the results of sample size calculations are believable.
The populations in the app are normal distributions. By default, the app shows two such distributions. One of the distributions, the one I like to think of as corresponding to the control condition, has μ = 0; the other one has μ = 0.5. Both distributions have a standard deviation of σ = 1. The standardized difference between the means is therefore equal to δ = 0.50.
The default populations are presented in Figure 1 below.
Friday, 15 December 2017
Planning for a precise contrast estimate: the mixed model case
In a previous post (here), we saw how we can determine sample size for obtaining, with assurance, a precise interaction contrast estimate. In that post we considered a 2 x 2 factorial design. In this post, I will extend the discussion to the mixed model case. That is, we will consider sample size planning for a precise interaction estimate in case of a design with two fixed factors and two random factors: participant and stimulus (item). (A pdf version of this post can be found here: view pdf.)
In order to keep things relatively simple, we will focus on a design where both participants and items are nested under condition. So, each treatment condition has a unique sample of participants and items. We will call this design the both-within-condition design (see, for instance, Westfall et al. 2014, for detailed descriptions of this design). We will analyse the 2 x 2 factorial design as a single factor design (the factor has a = 4 levels) and formulate an interaction contrast.
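As a rough illustration of what drives precision in this design, here is a sketch with made-up variance components (my own notation, not the post's code). With p participants and q items per condition, each condition mean has variance $\sigma^2_{participant}/p + \sigma^2_{item}/q + \sigma^2_{residual}/(pq)$, and the contrast variance sums this over the squared contrast weights. The normal quantile is a rough approximation; exact degrees of freedom would require a Satterthwaite-type correction.
# assumed (hypothetical) variance components
var.p <- 0.30   # participants
var.i <- 0.20   # items (stimuli)
var.e <- 0.50   # residual (participant-by-item interaction plus error)

# interaction contrast weights for the 2 x 2 treated as one factor with 4 levels
w <- c(1, -1, -1, 1)

# approximate MoE as a function of participants (p) and items (q) per condition
approx.moe <- function(p, q) {
  var.mean <- var.p / p + var.i / q + var.e / (p * q)  # variance of one condition mean
  1.96 * sqrt(sum(w^2) * var.mean)                     # normal approximation
}

approx.moe(p = 20, q = 10)
approx.moe(p = 40, q = 20)  # doubling both samples shrinks MoE by roughly a factor sqrt(2)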
Saturday, 14 October 2017
The omnibus F-test may be ignored if you use multiple comparison procedures
I think trying to be scientific with a small s involves asking critical questions about common wisdom or common practice. In this post, I would like to focus on multiple comparisons in the context of ANOVA. What does common practice indicate?
Common wisdom suggests doing multiple comparisons only if the F-test is significant
Let's have a look at some practical advice concerning multiple comparisons found on the web (R-bloggers.com) and in Field (2015).
"When we have a statistically significant effect in ANOVA and an independent variable of more than two levels, we typically want to make follow-up comparisons. There are numerous methods for making pairwise comparisons and this tutorial will demonstrate how to execute several different techniques in R." (https://www.r-bloggers.com/r-tutorial-series-anova-pairwise-comparison-methods/)
And have a look at how the textbook I used to use in my statistics course explains it.
"It might seem a a bit unhelpful that an ANOVA doesn't tell you which groups are different from which, given that having gone to the trouble of running an experiment, you probably need to know more than 'there's some difference somewhere or other'. You might wonder, therefore, why we don't just carry out a lot of t-tests, which would tell us very specifically whether pairs of group menas differ. Actually, the reason has already been explained in Section 2.1.6.7: every time you run multiple tests on the same data you inflate the potential Type I errors that you make. However, we'll return to this point in Section 11.5 when we look at how we follow up an ANOVA to discover where the group difference lie." (Field, 2015, p. 442).
Although, in honesty, on p. 459 Field writes:
"The least significance difference (LSD) pairwise comparison makes no attempt to control Type I error and is equivalent to performing multiple t-tests on the data. The only difference is that LSD requires the overall ANOVA to be significant."
This is meant to inform about the relative merits of one post hoc procedure compared to another in terms of Type I and Type II error. Crucially, it is not mentioned that the other post hoc procedures require the overall ANOVA to be significant (as common wisdom seems to suggest). However, his flow chart of the ANOVA procedure (p. 460) clearly suggests that multiple comparison procedures should be used as post hoc procedures (after the ANOVA is significant).
Thus, common "statistical" wisdom seem to suggest that multiple comparison procedures are to be used as post hoc procedures following up a significant omnibus F-test. And the reason is that this two-stepped procedure minimizes the probability of type I errors.
Now, let's ask ourselves whether this common sense is, well, sensible.
Multiple comparisons only after significant F-test affects power negatively
Wilcox (2017) contains some useful information regarding our question. In his discussion of the much used Tukey-HSD procedure (the Tukey-Kramer method), he references Bernhardson (1975), who shows that the probability of at least 1 type I error among pairwise comparisons of estimates of equal population means (i.e. true null-hypotheses) is no longer equal to $\alpha$ if the procedure is only carried out following a significant omnibus test. That is, if we use our beloved two-step procedure.
The consequence of the two-step procedure for the Tukey-HSD is that $\alpha$ is reduced. Thus, if we want the probability of at least one type I error in our multiple comparison procedure to be $\alpha = .05$, using the two-step procedure leads to a lowered $\alpha$. This is, of course, bad news, because in the event that not all of the null-hypotheses are true, lowering $\alpha$ increases $\beta$, the probability of not rejecting when the null-hypothesis is false (keeping the sample size constant, of course). In other words, the two-step procedure decreases the power of the multiple comparison procedure.
In the words of Wilcox (2017):
"In practical terms, when it comes to controlling the probability of at least one type I error, there is no need to first reject with the ANOVA F test to justify using the Tukey-Kramer method. If the Tukey-Kramer method is used only after the F test rejects, power can be reduced. Currently, however, common practice is to use the Tukey-Kramer method only if the F-test rejects. That is, the insight reported by Bernhardson is not yet well known." (p. 385).
In conceptual terms, the fact that the probability of at least one type I error in the multiple comparison procedure is smaller than $\alpha$ if the F-test rejects is pretty clear, at least to me it is. Suppose we reject if the p-value of the F-test is smaller or equal to 5%. This will also be the probability that we conduct the multiple comparison test over repeated replications of the same experiment. Of that 5%, not every application of the procedure will result in at least one type I error. Indeed, a puzzling fact for many beginning researchers is that the F-test is significant while none of the pairwise comparisons is. In other words, some of those 5% of the cases in which we perform the procedure following a significant F-test will probably not reject any of the pairwise null-hypotheses, unless it is guaranteed that at least one type I error per application will be made.
(With no adjustment of $\alpha$ for multiple comparisons, this will happen with high probability, though with no guarantee, if a huge number of pairwise comparisons are made. For instance, with 99 unadjusted multiple comparisons the probability of at least one type I error is 99%. This is why it makes sense to demand that the F-test is significant before testing multiple comparisons with the LSD procedure, although the latter seems to run into trouble with more than 3 groups; see Wilcox, 2017.)
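The 99% figure follows from the usual back-of-the-envelope calculation, which treats the 99 comparisons as independent tests at $\alpha = .05$:
# probability of at least one type I error with 99 unadjusted comparisons
1 - (1 - .05)^99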
A quick simulation study
My hunch is that the two-step procedure is unnecessary for the Tukey-Kramer method as well as for other multiple comparison procedures (the exception being Fisher's LSD procedure, which was designed as a post hoc procedure to be used as a follow-up after a significant F-test, as Field (2015) rightly points out), but I only focused on the Tukey-Kramer method. What I did was a simple simulation study with a four group between subjects design (all $\mu$'s equal), estimating the probability of at least one type I error both with and without using the two-step procedure.
set.seed(456)
# number of groups
ngr = 4
# number of participants
n = 40
# group is a factor
gr <- factor(rep(1:ngr, each=n))
# vector for storing rejections F-test
Reject <- rep(0, 10000)
# vector for storing rejections multiple comparisons
RejectHSD <- rep(0, 10000)

for (i in 1:10000) {
  y = rnorm(ngr*n)
  mod = aov(y ~ gr)
  Reject[i] = anova(mod)$"Pr(>F)"[1] <= .05
  PS <- TukeyHSD(mod)$gr[,4]
  RejectHSD[i] = sum(PS <= .05)
}

# probability type I error F-test
sum(Reject)/length(Reject)
## [1] 0.0515
# probability at least one type I error Tukey HSD
sum(RejectHSD > 0) / length(RejectHSD)
## [1] 0.0503
# probability at least one type I error given the F-test rejects
sum(RejectHSD[Reject==TRUE] > 0) / length(RejectHSD)
## [1] 0.0424
Even though a single (relatively tiny) simulation, which, by the way, takes quite a long time to run, is not necessarily convincing, it does illustrate the main points of this post. First, the probability of at least one incorrect rejection using the TukeyHSD function is close to .05. With this particular random seed it even performs a little better than the ANOVA F-test: .0503 versus .0515. This illustrates that even without considering whether the omnibus test is significant, the main demand of not rejecting too many true null-hypotheses is completely satisfied. So, in practical terms, you can safely ignore the omnibus test if your concerns are about $\alpha$.
Second, the probability of incorrectly rejecting at least one true pairwise null-hypothesis after the ANOVA F-test is significant is estimated to be .0424. This shows that the two-step procedure leads to a larger decrease in the actual type I error probability than is wanted. Even though this may seem good news from the perspective of avoiding type I errors, the downside is that pairwise null-hypotheses that are false (and potentially important) may not be detected.
Conclusion
Common wisdom and practice suggest that multiple comparison procedures should be used only after a significant omnibus test. We have seen that this is not at all necessary if we use a multiple comparison procedure that is designed to control the type I error probability. To my knowledge, most of the procedures conventionally thought of as post hoc tests are designed in this manner, the exception being the LSD procedure, which does require a significant F-test. For practical purposes, then, do not bother with the omnibus test (note the exception) if you are planning to compare all the treatment means pairwise.
This practical advice does not mean, of course, that I am suggesting you spend your time comparing all treatment means. Most of the time, focused comparisons are a more fruitful way of analysing your data. But I'll leave that topic for another time.
References
Field, A. (2013). Discovering Statistics Using IBM SPSS Statistics. 4th Edition. London: Sage.
Wilcox, R. (2017). Understanding and Applying Basic Statistical Methods Using R. Hoboken, NJ: Wiley.
Monday, 9 October 2017
Planning for a precise slope estimate in simple regression
In this post, I will show you a way of determining a sample size for obtaining a precise estimate of the slope $\beta_1$ of the simple linear regression equation $\hat{Y_i} = \beta_0 + \beta_1X_i$. The basic ingredients we need for sample size planning are a measure of the precision, a way to determine the quantiles of the sampling distribution of our measure of precision, and a way to calculate sample sizes.
As our measure of precision we choose the Margin of Error (MOE), which is the half-width of the 95% confidence interval of our estimate (see: Cumming, 2012; Cumming & Calin-Jageman, 2017; see also www.thenewstatistics.com).
The distribution of the margin of error of the regression slope
In the case of simple linear regression, assuming normality and homogeneity of variance, MOE is $t_{.975}\sigma_{\hat{\beta_1}}$, where $t_{.975}$ is the .975 quantile of the central t-distribution with $N - 2$ degrees of freedom, and $\sigma_{\hat{\beta_1}}$ is the standard error of the estimate of $\beta_1$.
An expression of the squared standard error of the estimate of $\beta_1$ is $\frac{\sigma^2_{Y|X}}{\sum{(X_i - \bar{X})}^2}$ (Wilcox, 2017): the variance of Y given X divided by the sum of squared errors of X. The variance $\sigma^2_{Y|X}$ equals $\sigma^2_y(1 - \rho^2_{YX})$, the variance of Y multiplied by 1 minus the squared population correlation between Y and X, and it is estimated with the residual variance $\frac{\sum{(Y - \hat{Y})^2}}{df_e}$, where $df_e = N - 2$.
The estimated squared standard error is given in (1)
$$\hat{\sigma}_{\hat{\beta_{1}}}^{2}=\frac{\sum(Y-\hat{Y})^{2}/df_{e}}{\sum(X-\bar{X})^{2}}. \tag{1} $$
With respect to the sampling distribution of MOE, we first note the following. The distribution of estimates of the residual variance in the numerator of (1) is a scaled $\chi^2$-distribution:
$$\frac{\sum(Y-\hat{Y})^{2}}{\sigma_{y}^{2}(1-\rho^{2})}\sim\chi^{2}(df_{e}),$$
thus
$$\frac{\sum(Y-\hat{Y})^{2}}{df_{e}}\sim\frac{\sigma_{y}^{2}(1-\rho^{2})\chi^{2}(df_{e})}{df_{e}}.$$
Second, we note that
$$\frac{\sum(X-\bar{X})^{2}}{\sigma_{X}^{2}}\sim\chi^{2}(df),$$
where $df = N - 1$, therefore
$$\sum(X-\bar{X})^{2}\sim\sigma_{X}^{2}\chi^{2}(df).$$
Alternatively, since $\sum{(X - \bar{X})^2} = df\hat{\sigma}^2_X$, where $\hat{\sigma}^2_X$ is the sample variance of X, multiplying by 1 ($\frac{df}{df}$) gives
$$df\hat{\sigma}_{X}^{2}\sim df\sigma_{X}^{2}\chi^{2}(df)/df.$$
In terms of the sampling distribution of (1), then, we have the ratio of two (scaled) $\chi^2$ distributions, one with $df_e = N - 2$ degrees of freedom, and one with $df = N - 1$ degrees of freedom. Or something like:
$$ \hat{\sigma}_{\hat{\beta_{1}}}^{2}\sim\frac{\sigma_{y}^{2}(1-\rho^{2})\chi^{2}(df_{e})/df_{e}}{df\sigma_{X}^{2}\chi^{2}(df)/df}=\frac{\sigma_{y}^{2}(1-\rho^{2})}{df\sigma_{X}^{2}}\frac{\chi^{2}(df_{e})/df_{e}}{\chi^{2}(df)/df}=\frac{\sigma_{y}^{2}(1-\rho^{2})F(df_{e,}df)}{df\sigma_{X}^{2}},$$
which means that the sampling distribution of MOE is:
$$ \hat{MOE}\sim t_{.975}(N-2)\sqrt{\frac{\sigma_{y}^{2}(1-\rho^{2})F(N-2,N-1)}{(N-1)\sigma_{X}^{2}}}. \tag{2} $$
This last equation, that is (2), can be used to obtain quantiles of the sampling distribution of MOE, which enables us to determine assurance MOE, that is the value of MOE that under repeated sampling will not exceed a target value with a given probability. For instance, if we want to know the .80 quantile of estimates of MOE, that is, assurance is .80, we determine the .80 quantile of the (central) F-distribution with N - 2 and N - 1 degrees of freedom and fill in (2) to obtain a value of MOE that will not be exceeded in 80\% of replication experiments.
For instance, suppose $\sigma^2_Y = 1$, $\sigma^2_X = 1$, $\rho = .50$, $N = 100$, and assurance is .80, then according to (2), 80\% of estimated MOEs will not exceed the value given by:
vary = 1
varx = 1
rho = .5
N = 100
dfe = N - 2
dfx = N - 1
assu = .80
t = qt(.975, dfe)
MOE.80 = t*sqrt(vary*(1 - rho^2)*qf(assu, dfe, dfx)/(dfx*varx))
MOE.80
## [1] 0.1880535
What does a quick simulation study tell us?
A quick simulation study may be used to check whether this is at all accurate. And, yes, the estimated quantile from the simulation study is pretty close to what we would expect based on (2). If you run the code below, the estimate equals 0.1878628.
library(MASS)
set.seed(355)
m = c(0, 0)
# note: s below is the variance-covariance matrix. In this case,
# rho and the cov(y, x) have the same values
# otherwise: rho = cov(x, y)/sqrt(varY*varX) (to be used in the
# functions that calculate MOE)
# equivalently, cov(x, y) = rho*sqrt(varY*varX) (to be used
# in the specification of the variance-covariance matrix for
# generating bivariate normal variates)
s = matrix(c(1, .5, .5, 1), 2, 2)
se <- rep(0, 10000)
for (i in 1:10000) {
  theData <- mvrnorm(100, m, s)
  mod <- lm(theData[,1] ~ theData[,2])
  se[i] <- summary(mod)$coefficients[4]
}
MOE = qt(.975, 98)*se
quantile(MOE, .80)
## 80% ## 0.1878628
Planning for precision
If we want to plan for precision we can do the following. We start by making a function that calculates the assurance quantile of the sampling distribution of MOE described in (2). Then we formulate a squared cost function, which we will optimize for the sample size using the optimize function in R.
Suppose we want to plan for a target MOE of .10 with 80% assurance. We may do the following.
vary = 1
varx = 1
rho = .5
assu = .80
tMOE = .10

MOE.assu = function(n, vary, varx, rho, assu) {
  varY.X = vary*(1 - rho^2)
  dfe = n - 2
  dfx = n - 1
  t = qt(.975, dfe)
  q.assu = qf(assu, dfe, dfx)
  MOE = t*sqrt(varY.X*q.assu/(dfx * varx))
  return(MOE)
}

cost = function(x, tMOE) {
  cost = (MOE.assu(x, vary=vary, varx=varx, rho=rho, assu=assu) - tMOE)^2
}

# note: samplesize is at least 40, at most 5000.
# note that since we already know that N = 100 is not enough,
# instead of 40 we might just as well set N = 100 as the lower
# limit of the interval
(samplesize = ceiling(optimize(cost, interval=c(40, 5000), tMOE = tMOE)$minimum))
## [1] 321
# check the result:
MOE.assu(samplesize, vary, varx, rho, assu)
## [1] 0.09984381
Let's simulate with the proposed sample size
Let's check it with a simulation study. The estimated .80 quantile of MOE is 0.1007269 (if you run the code below with random seed 355), which is pretty close to what we would expect based on (2).
set.seed(355)
m = c(0, 0)
# note: s below is the variance-covariance matrix. In this case,
# rho and the cov(y, x) have the same values
# otherwise: rho = cov(x, y)/sqrt(varY*varX) (to be used in the
# functions that calculate MOE)
# equivalently, cov(x, y) = rho*sqrt(varY*varX) (to be used
# in the specification of the variance-covariance matrix for
# generating bivariate normal variates)
s = matrix(c(1, .5, .5, 1), 2, 2)
se <- rep(0, 10000)
samplesize = 321
for (i in 1:10000) {
  theData <- mvrnorm(samplesize, m, s)
  mod <- lm(theData[,1] ~ theData[,2])
  se[i] <- summary(mod)$coefficients[4]
}
MOE = qt(.975, 98)*se
quantile(MOE, .80)
## 80% ## 0.1007269
References
Cumming, G. (2012). Understanding the New Statistics: Effect Sizes, Confidence Intervals, and Meta-Analysis. New York: Routledge.
Cumming, G., & Calin-Jageman, R. (2017). Introduction to the New Statistics: Estimation, Open Science, and Beyond. New York: Routledge.
Wilcox, R. (2017). Understanding and Applying Basic Statistical Methods using R. Hoboken, New Jersey: John Wiley and Sons.
Tuesday, 25 July 2017
Planning for a precise interaction contrast estimate
In my previous post (here), I wrote about obtaining a confidence interval for the estimate of an interaction contrast. I demonstrated, for a simple two-way independent factorial design, how to obtain a confidence interval by making use of the information in an ANOVA source table and estimates of the marginal means and how a custom contrast estimate can be obtained with SPSS.
One of the results of the analysis in the previous post was that the 95% confidence interval for the interaction was very wide. The estimate was .77, 95% CI [0.04, 1.49]. Suppose that it is theoretically or practically important to know the value of the contrast to a more precise degree. (That is, some researchers will be content that the CI allows for a directional, qualitative interpretation: there seems to exist a positive interaction effect. Others, more interested in the quantitative questions, may not be so easily satisfied.) Let's see how we can plan the research to obtain a more precise estimate. In other words, let's plan for precision.
Of course, there are several ways in which the precision of the estimate can be increased: for instance, by using measurement procedures that are designed to obtain reliable data, by changing the experimental design (for example, switching to a repeated measures (crossed) design), and/or by increasing the number of observations. An example of the latter would be to increase the number of participants and/or the number of observations per participant. We will only consider the option of increasing the number of participants and keep the independent factorial design, although in reality we would of course also strive for a measurement instrument that generally gives us highly reliable data. (By the way, it is possible to use my Precision application to investigate the effects of changing the experimental design on the expected precision of contrast estimates in studies with 1 fixed factor and 2 random factors.)
The plan for the rest of this post is as follows. We will focus on getting a short confidence interval for our interaction estimate, and we will do that by considering the half-width of the interval, the Margin of Error (MOE). First we will try to find a sample size that gives us an expected MOE (in repeated replication of the experiment with new random samples) no more than a target MOE. Second, we will try to find a sample size that gives a MOE smaller than or equal to our target MOE in a specifiable percentage (say, 80% or 90%) of replication experiments. The latter approach is called planning with assurance.
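To fix ideas, here is a hedged sketch of such a computation for a 2 x 2 between-subjects design with n participants per cell and interaction contrast weights (1, -1, -1, 1). The error variance and target MoE below are made-up illustration values, not the estimates from the previous post; expected MoE uses the mean of the estimated error standard deviation, and assurance MoE uses a chi-square quantile.
# expected and assurance MoE for the interaction contrast in a 2 x 2 between design
w      <- c(1, -1, -1, 1)
sigma2 <- 1     # assumed error variance (illustration value)
target <- 0.36  # assumed target MoE (illustration value)

expected.moe <- function(n) {
  df <- 4 * (n - 1)
  c4 <- sqrt(2 / df) * exp(lgamma((df + 1) / 2) - lgamma(df / 2))  # E(s/sigma)
  qt(.975, df) * sqrt(sum(w^2) * sigma2 / n) * c4
}
assurance.moe <- function(n, assurance) {
  df <- 4 * (n - 1)
  qt(.975, df) * sqrt(sum(w^2) * sigma2 / n) * sqrt(qchisq(assurance, df) / df)
}

n <- 2:2000
min(n[expected.moe(n) <= target])        # n per cell for expected MoE <= target
min(n[assurance.moe(n, .80) <= target])  # n per cell with 80% assurance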
Saturday, 29 April 2017
Planning for precision with samples of participants and items
Many experiments involve the (quasi-)random selection of both participants and items. Westfall et al. (2014) provide a Shiny-app for power-calculations for five different experimental designs with selections of participants and items. Here I want to present my own Shiny-app for planning for precision of contrast estimates (for the comparison of up to four groups) in these experimental designs. The app can be found here: https://gmulder.shinyapps.io/precision/
(Note: I have taken the code of Westfall's app and added code or modified existing code to get precision estimates instead of power; so, without Westfall's app, my own modified version would never have existed.)
The plan for this post is as follows. I will present the general theoretical background (mixed model ANOVA combined with ideas from Generalizability Theory) by considering comparing three groups in a counter balanced design.
Note 1: This post uses MathJax, so it's probably unreadable on mobile devices. A (tidied up) pdf version of this post can be downloaded here: download the pdf.
Note 2: For simulation studies testing the procedure go here: https://the-small-s-scientist.blogspot.nl/2017/05/planning-for-precision-simulation.html
Note 3: I use the terms stimulus and item interchangeably; I still have to correct this to make things more readable and comparable to Westfall et al. (2014).
Note 4: If you do not like the technical details you can skip to an illustration of the app at the end of the post.
The general idea
The focus of planning for precision is to try to minimize the half-width of a 95%-confidence interval for a comparison of means (in our case). Following Cumming's (2012) terminology I will call this half-width the Margin of Error (MOE). The actual purpose of the app is to find required sample sizes for participants and items that have a high probability ('assurance') of obtaining a MOE of some pre-specified value.