This tutorial introduces basic statistical techniques for inferential statistics for hypothesis testing with R. The entire code for the tutorial can be downloaded here. The first part of this tutorial focuses on basic non-parametric tests such as Fisher’s Exact test while the second part introduces parametric tests such as the t-test.

Non-Parametric Tests do not require the data (or the errors of the dependent variable, to be more precise) to be distributed normally. Tests that do not require normal data are referred to as *non-parametric tests* (test that require the data to be distributed normally are analogously called *parametric tests*). We focus on non-parametric tests first as this family of test in frequently used in linguistics. In the later part of this section, we will focus on regression modelling where assumptions of about the data become more important.

**Preparation and session set up**

As all visualizations in this tutorial rely on R, it is necessary to install R and RStudio. If these programs (or, in the case of R, environments) are not already installed on your machine, please search for them in your favorite search engine and add the term “download”. Open any of the first few links and follow the installation instructions (they are easy to follow, do not require any specifications, and are pretty much self-explanatory).

In addition, certain *packages* need to be installed from an R *library* so that the scripts shown below are executed without errors. Before turning to the code below, please install the packages by running the code below this paragraph. If you have already installed the packages mentioned below, then you can skip ahead ignore this section. To install the necessary packages, simply run the following code - it may take some time (between 1 and 5 minutes to install all of the libraries so you do not need to worry if it takes some time).

```
# clean current workspace
rm(list=ls(all=T))
# set options
options(stringsAsFactors = F)
# install libraries
install.packages(c("stringr", "VGAM", "fGarch"))
```

Once you have installed R, R-Studio, and have also initiated the session by executing the code shown above, you are good to go.

Fisher’s Exact test is very useful because it does not rely on distributional assumptions relying on normality. Instead, Fisher’s Exact Test calculates the probabilities of all possible outcomes and uses these to determine significance. To understand how a Fisher’s Exact test, we will use a very simple example.

Imagine you are interested in adjective modification and you want to find out if “very” and “truly” differ in their collocational preferences. So you extract all instances of “cool”, all instances of “very”, and all instances of “truly” from a corpus. Now that you have gathered this data, you want to test if “truly” and “very” differ with respect to their preference to co-occur with “cool”. Accordingly, you tabulate the results and get the following table.

```
library(knitr) # activate library for tabulating
# generate data
coolmatrix <- matrix(c("truly", "very", 5, 17, 40, 41), ncol = 3, byrow = F)
colnames(coolmatrix) <- c("Adverb", "with cool", "with other adjectives")
kable(coolmatrix, caption = "Co-occurrence of cool with adverbs.")
```

Adverb | with cool | with other adjectives |
---|---|---|

truly | 5 | 40 |

very | 17 | 41 |

To perform a Fisher’s Exact test, we first need to create a table with these results in “R”.

```
# create table
coolmx <- matrix(c(5, 17, 40, 41),
# number of rows of the table
nrow = 2,
# def. dimension names
dimnames = list(Adverbs = c("truly", "very"),
Adjectives = c("cool", "other adjective")))
# inspect resulting table
coolmx
```

```
## Adjectives
## Adverbs cool other adjective
## truly 5 40
## very 17 41
```

Once we have the created a matrix with these numbers, we can perform the Fisher’s Exact test to see if “very” and “truly” differ in their preference to co-occur with “cool.”

```
# perform test
fisher.test(coolmx)
```

```
##
## Fisher's Exact Test for Count Data
##
## data: coolmx
## p-value = 0.03024
## alternative hypothesis: true odds ratio is not equal to 1
## 95 percent confidence interval:
## 0.08015294 0.96759831
## sample estimates:
## odds ratio
## 0.3048159
```

The results of the Fisher’s Exact test show that the p-value is lower than .05 and that we are therefor justified in assuming that “very” and “truly” differ in their collocational preferences to co-occur with “cool.”

If the depended variable is neither nominal nor categorical, but ordinal (that is if the dependent variable represents an order factor such as ranks), using a chi-square test is not warranted. In such cases it is advisable to apply tests that are designed to handle ordinal data. In the following, we will therefore briefly touch on bi-variate tests that can handle ordinal dependent variables.

It is a rather frequent case that numeric depend variables are transformed or converted into ordinal variables because the distribution of residuals does not allow the application of a linear regression. Because we are dealing with ordinal data, the application of a chi-square test is unwarranted and we need to use another test. In such cases, the Mann-Whitney U-test (also called Wilcoxon rank sum-test) can be used.

The Mann-Whitney U-test can also be used if the groups under investigation represent identical participants that are tested under two conditions.

Imaging we wanted to determine if two language families differed with respect to the size of their phoneme inventories. You have already ranked the inventory sizes and would now like to now if language family correlates with inventory size. To answer this question, you create the table shown below.

```
# create table
Rank <- c(1,3,5,6,8,9,10,11,17,19, 2,4,7,12,13,14,15,16,18,20)
Groups <- c(rep("Lang.Fam.1", 10), rep("Lang.Fam.2", 10))
Languagefamilytb <- data.frame(Groups, Rank)
# inspect table
Languagefamilytb
```

```
## Groups Rank
## 1 Lang.Fam.1 1
## 2 Lang.Fam.1 3
## 3 Lang.Fam.1 5
## 4 Lang.Fam.1 6
## 5 Lang.Fam.1 8
## 6 Lang.Fam.1 9
## 7 Lang.Fam.1 10
## 8 Lang.Fam.1 11
## 9 Lang.Fam.1 17
## 10 Lang.Fam.1 19
## 11 Lang.Fam.2 2
## 12 Lang.Fam.2 4
## 13 Lang.Fam.2 7
## 14 Lang.Fam.2 12
## 15 Lang.Fam.2 13
## 16 Lang.Fam.2 14
## 17 Lang.Fam.2 15
## 18 Lang.Fam.2 16
## 19 Lang.Fam.2 18
## 20 Lang.Fam.2 20
```

We will also briefly inspect the data visually using a box plot.

```
# inspect data
par(mfrow=c(1,2))
boxplot(Languagefamilytb$Rank ~ Languagefamilytb$Groups, col = c("orange", "darkgrey"), main = "")
par(mfrow=c(1,1))
```

To use Mann-Whitney U test, the dependent variable (Rank) must be ordinal and independent variable (Group) must be a binary factor. We briefly check this by inspecting the structure of the data.

```
# perform test
str(Languagefamilytb)
```

```
## 'data.frame': 20 obs. of 2 variables:
## $ Groups: Factor w/ 2 levels "Lang.Fam.1","Lang.Fam.2": 1 1 1 1 1 1 1 1 1 1 ...
## $ Rank : num 1 3 5 6 8 9 10 11 17 19 ...
```

As the variables are what we need them to be, we can now perform the Mann-Whitney U test on the table.

```
# perform test
wilcox.test(Languagefamilytb$Rank ~ Languagefamilytb$Groups)
```

```
##
## Wilcoxon rank sum test
##
## data: Languagefamilytb$Rank by Languagefamilytb$Groups
## W = 34, p-value = 0.2475
## alternative hypothesis: true location shift is not equal to 0
```

The results of the Wilcoxon rank sum test tell use that the two language families to not differ significantly with respect to their phoneme inventory size as the p-value is greater than .05.

The Wilcoxon rank sum test can also be used with continuity correction. A continuity correction is necessary when both variables represent numeric values that are non-normal. In the following example, we want to test if the reaction time for identifying a word as real is correlated with its token frequency.

For this example, we generate data is deliberately non-normal.

```
# load librarieslibrary(VGAM)
library(fGarch)
# generate non-normal skewed numeric data
r=.1
Frequency=rsnorm(100,0,2,4)
NormalReaction=rsnorm(100,0,2,4)
ReactionTimes=Frequency*r+NormalReaction*sqrt(1-r^2)
# inspect data
par(mfrow=c(1,2))
boxplot(Frequency, col = "orange", main = "Frequency")
boxplot(ReactionTimes, col = "darkgrey", main = "Reaction Times")
```

`par(mfrow=c(1,1))`

Both variables are skewed but we can use the Wilcoxon rank sum test with continuity correction which takes the skewness into account.

```
# perform test
wilcox.test(ReactionTimes, Frequency)
```

```
##
## Wilcoxon rank sum test with continuity correction
##
## data: ReactionTimes and Frequency
## W = 4023, p-value = 0.01703
## alternative hypothesis: true location shift is not equal to 0
```

When performing the Wilcoxon rank sum test with data that represent the same individuals that were tested under two condition, i.e. if the samples are dependent, then the argument “paired” has to be “TRUE”.

In this example, the same individuals had to read tongue twisters when they were sober and when they were intoxicated. A Wilcoxon signed rank test with continuity correction is used to test if the number of errors that occur when reading tongue twisters correlates with being sober/intoxicated. Again, we create fictitious data.

```
# create data
Sober <- sample(0:9, 15, replace = T)
Intoxicated <- sample(3:12, 15, replace = T)
# tabulate data
intoxtb <- data.frame(Sober, Intoxicated)
# inspect data
str(intoxtb)
```

```
## 'data.frame': 15 obs. of 2 variables:
## $ Sober : int 2 6 8 9 8 9 4 3 5 2 ...
## $ Intoxicated: int 5 4 10 6 3 12 11 3 11 4 ...
```

Now, we briefly plot the data.

```
# inspect data
boxplot(intoxtb$Sober, intoxtb$Intoxicated, col = c("orange", "darkgrey"), main = "", ylab = "Number of errors", xlab = "State", data = intoxtb)
```

The boxes indicate a significant difference. Finally, we perform the Wilcoxon signed rank test with continuity correction,

```
# perform test
wilcox.test(intoxtb$Sober, intoxtb$Intoxicated, paired=T)
```

```
##
## Wilcoxon signed rank test with continuity correction
##
## data: intoxtb$Sober and intoxtb$Intoxicated
## V = 26, p-value = 0.1822
## alternative hypothesis: true location shift is not equal to 0
```

The p-value is lower than .05 which means that the number of errors when reading tongue twisters is affected by one’s state (sober/intoxicated) - at least in this fictitious example.

The Kruskal-Wallis rank sum test is a type of ANOVA (Analysis of Variance). For this reason, the Kruskal Wallis Test is also refered to as a “one-way Anova by ranks” which can not only handle numeric but also ordinal data.

In the example below, *uhm* represents the number of filled pauses in a short 5 minute interview while speaker represents whether the speaker was a native speaker or a learner of English. As before, the data is generated and thus artificial.

```
# create data
Uhms <- c(15, 13, 10, 8, 37, 23, 31, 52, 11, 17)
Speaker <- c(rep("Learner", 5), rep("NativeSpeaker", 5))
# create table
uhmtb <- data.frame(Speaker, Uhms)
# inspect table
str(uhmtb)
```

```
## 'data.frame': 10 obs. of 2 variables:
## $ Speaker: Factor w/ 2 levels "Learner","NativeSpeaker": 1 1 1 1 1 2 2 2 2 2
## $ Uhms : num 15 13 10 8 37 23 31 52 11 17
```

Now, we briefly plot the data.

```
# inspect data
boxplot(uhmtb$Uhms ~ uhmtb$Speaker, col = c("orange", "darkgrey"), main = "", ylab = "Number of errors", xlab = "State")
```

Now, we test for statistical significance.

`kruskal.test(uhmtb$Speaker~uhmtb$Uhms) `

```
##
## Kruskal-Wallis rank sum test
##
## data: uhmtb$Speaker by uhmtb$Uhms
## Kruskal-Wallis chi-squared = 9, df = 9, p-value = 0.4373
```

The Kruskal-Wallis test does not report a significant difference for the number of “uhms” produced by native speakers and learners of English in the ficticious data.

The Friedman rank sum test is also called a randomized block design and it is used when the correlation between a numeric dependent variable, a grouping factor and a blocking factor is tested. The Friedman rank sum test assumes that each combination of the grouping factor (Gender) and the blocking factor (Age) occur only once. Thus, imagine that the values of “Uhms” represent the means of the respective groups.

```
# create data
Uhms <- c(7.2, 9.1, 14.6, 13.8)
Gender <- c("Female", "Male", "Female", "Male")
Age <- c("Young", "Young", "Old", "Old")
# create table
uhmtb2 <- data.frame(Gender, Age, Uhms)
# inspect table
uhmtb2
```

```
## Gender Age Uhms
## 1 Female Young 7.2
## 2 Male Young 9.1
## 3 Female Old 14.6
## 4 Male Old 13.8
```

```
# perform Friedman test
friedman.test(Uhms ~ Age | Gender, data = uhmtb2)
```

```
##
## Friedman rank sum test
##
## data: Uhms and Age and Gender
## Friedman chi-squared = 2, df = 1, p-value = 0.1573
```

In our example, age does not affect the use of filled pauses even if we control for gender as the p-value is higher than .05.

Schweinberger, Martin. 2020. *Basic Inferential Statistics with R*. Brisbane: The University of Queensland. url: https://slcladal.github.io/basicstatz.html.