1 Introduction

This tutorial introduces fixed-effects regression modeling using R. The entire code for the sections below can be downloaded here.

Regression models are among the most widely used methods in data analysis because they:

  • are multivariate and can incorporate many predictors in a single model (which allows one to test the effect of one predictor while controlling for the effects of all other predictors)
  • are extremely flexible and can be fitted to different types of predictors and dependent variables
  • provide output that can be easily interpreted
  • are conceptually relatively simple and not overly complex from a mathematical perspective

Fixed-effects regression models are simple additive models, which means that the predicted values represent the intercept plus the effects of the individual predictors. Mixed-effects models, in contrast, are based on more complex matrix multiplications in which the predicted values represent the product of the random effects and the intercept plus the effects of the fixed effects. R offers various ready-made functions with which implementing different types of regression models is very easy.

In the following, we will go over the most relevant and frequently used types of regression models:

  • multiple linear regression
  • multiple binomial logistic regression
  • ordinal regression
  • Poisson regression
  • robust regression

The major difference between these types of models is that they take different types of dependent variables: linear regressions take numeric variables, logistic regressions take nominal variables, ordinal regressions take ordinal variables, and Poisson regressions take dependent variables that reflect counts of (rare) events. Robust regression, in contrast, is a multiple linear regression that is able to handle outliers due to a weighting procedure.
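
To give a first impression of how these model types are specified in R, the sketch below fits each of them to a small simulated toy data set. The data and variable names are made up for illustration; the polr and rlm functions come from the MASS package.

# sketch: the R functions for the five model types (toy data, made-up names)
library(MASS)                          # for polr() and rlm()
set.seed(42)                           # make the simulation reproducible
toy <- data.frame(num.dv = rnorm(100),                       # numeric outcome
                  bin.dv = rbinom(100, 1, .5),               # binary outcome
                  ord.dv = factor(sample(1:3, 100, replace = TRUE),
                                  ordered = TRUE),           # ordinal outcome
                  cnt.dv = rpois(100, 2),                    # count outcome
                  pred = rnorm(100))                         # a predictor
m.lin <- lm(num.dv ~ pred, data = toy)                       # linear regression
m.log <- glm(bin.dv ~ pred, family = binomial, data = toy)   # logistic regression
m.ord <- polr(ord.dv ~ pred, data = toy)                     # ordinal regression
m.poi <- glm(cnt.dv ~ pred, family = poisson, data = toy)    # Poisson regression
m.rob <- rlm(num.dv ~ pred, data = toy)                      # robust regression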

2 Preparation and session set up

As all calculations and visualizations in this tutorial rely on R, it is necessary to install R and RStudio. If these programs (or, in the case of R, environments) are not already installed on your machine, please search for them in your favorite search engine and add the term “download”. Open any of the first few links and follow the installation instructions (they are easy to follow, do not require any specifications, and are pretty much self-explanatory).

In addition, certain packages need to be installed from an R library so that the scripts shown below are executed without errors. Before turning to the code below, please install the packages by running the code below this paragraph. If you have already installed the packages mentioned below, you can skip this section. To install the necessary packages, simply run the following code - it may take some time (between 1 and 5 minutes), so you do not need to worry if it takes a while.

# clean current workspace
rm(list=ls(all=T))
# set options
options(stringsAsFactors = F)         # no automatic data transformation
options("scipen" = 100, "digits" = 4) # supress math annotation
# install libraries
install.packages(c("boot", "car", "caret", "effects", "foreign", "ggplot2", 
                   "Hmisc", "knitr", "MASS", "mlogit", "msm", "QuantPsyc", 
                   "reshape2", "rms", "sandwich", "sfsmisc", "sjPlot", 
                  "stringr",  "vcd", "visreg"))

Once you have installed R and RStudio, and have also initiated the session by executing the code shown above, you are good to go.

3 Multiple Linear Regression

In contrast to simple linear regression, which estimates the effect of a single predictor, multiple linear regression estimates the effects of several predictors (see the equation below). A multiple linear regression can thus test the effects of various predictors simultaneously.

\[\begin{equation} f_{(x)} = \alpha + \beta_{1}x_{1} + \beta_{2}x_{2} + \dots + \beta_{n}x_{n} + \epsilon \end{equation}\]

There exists a wealth of literature focusing on multiple linear regression and the concepts it is based on. For instance, there are Achen (1982), Bortz (2006), Crawley (2005), Faraway (2002), Field, Miles, and Field (2012) (my personal favourite), Gries (2009), Levshina (2015), and Wilcox (2009), to name just a few. Introductions to regression modelling in R are Baayen (2008), Crawley (2012), Gries (2009), or Levshina (2015).

The model diagnostics we are dealing with here are partly identical to the diagnostic methods discussed in the section on simple linear regression. Because of this overlap, diagnostics will only be described in more detail if they have not been described in the section on simple linear regression.

A brief note on the minimum necessary sample or data set size appears necessary here. Although there appears to be a general assumption that 25 data points per group are sufficient, this is not necessarily correct (it is merely a general rule of thumb that is often inaccurate). Such rules of thumb are inadequate because the required sample size depends on the number of variables in a given model, the size of the effect, and the variance of the effect. If a model contains many variables, then this requires a larger sample size than a model which only uses very few predictors. Also, to detect an effect with a very small effect size, one needs a substantially larger sample compared to cases where the effect is very strong. In fact, when dealing with small effects, models require a minimum of 600 cases to reliably detect these effects. Finally, effects that are very robust and do not vary much require a much smaller sample size compared with effects that are spurious and vary substantially. Since the required sample size depends on the effect size and variance as well as the number of variables, there is no one-size-fits-all answer to what the best sample size is.

Another, slightly better but still incorrect, rule of thumb is that the more data, the better. This is not correct because models based on too many cases are prone to overfitting and thus report correlations as significant that are not. However, given that there are procedures that can correct for overfitting, larger data sets are still preferable to data sets that are simply too small to warrant reliable results. In conclusion, it remains true that the sample size depends on the effect under investigation.

Despite there being no ultimate rule of thumb, Field, Miles, and Field (2012, 273-275), based on Green (1991), provide data-driven suggestions for the minimal size of data required for regression models that aim to find medium-sized effects (k = number of predictors; categorical variables with more than two levels should be transformed into dummy variables):

  • If one is merely interested in the overall model fit (something I have not encountered), then the sample size should be at least 50 + 8k (k = number of predictors in model).
  • If one is only interested in the effect of specific variables, then the sample size should be at least 104 + k (k = number of predictors in model).
  • If one is interested in both the overall model fit and the effects of specific variables, then the sample size should be at least the higher value of 50 + 8k or 104 + k (k = number of predictors in model).

You will see in the R code below that there is already a function that tests whether the sample size is sufficient.
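
The rules of thumb themselves are easy to encode; the sketch below simply writes out the two formulas given above (the helper function and its name are made up for illustration).

# sketch: the sample size rules of thumb as a toy helper function
min.sample.size <- function(k) {
  c(model.fit  = 50 + 8 * k,  # minimum n for testing the overall model fit
    predictors = 104 + k)     # minimum n for testing individual predictors
}
min.sample.size(3)            # e.g. for a model with three predictors
##  model.fit predictors 
##         74        107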

3.1 Example: Gifts and Availability

The example we will go through here is taken from Field, Miles, and Field (2012). In this example, the research question is whether the money men spend on presents for women depends on the women’s relationship status and on whether the men are interested in the women. To answer this research question, we will implement a multiple linear regression and start by preparing the R session (activating necessary packages and loading functions).

# load libraries
library(boot)
library(car)
library(caret)
library(effects)
library(foreign)
library(ggplot2)
library(Hmisc)
library(knitr)
library(MASS)
library(mlogit)
library(msm)
library(plyr)
library(QuantPsyc)
library(reshape2)
library(rms)
library(sandwich)
library(sfsmisc)
library(sjPlot)
library(stringr)
library(vcd)
library(visreg)
# load functions
source("https://slcladal.github.io/rscripts/blrsummary.r")
source("https://slcladal.github.io/rscripts/multiplot.r")
source("https://slcladal.github.io/rscripts/mlinrsummary.r")
source("https://slcladal.github.io/rscripts/SampleSizeMLR.r")
source("https://slcladal.github.io/rscripts/ExpR.r")

After preparing the session, we can now load the data and inspect its structure and properties.

# load data
mlrdata <- read.delim("https://slcladal.github.io/data/mlrdata.txt", header = TRUE)
head(mlrdata); str(mlrdata); summary(mlrdata) # inspect data
##         status    attraction money
## 1 Relationship NotInterested 86.33
## 2 Relationship NotInterested 45.58
## 3 Relationship NotInterested 68.43
## 4 Relationship NotInterested 52.93
## 5 Relationship NotInterested 61.86
## 6 Relationship NotInterested 48.47
## 'data.frame':    100 obs. of  3 variables:
##  $ status    : chr  "Relationship" "Relationship" "Relationship" "Relationship" ...
##  $ attraction: chr  "NotInterested" "NotInterested" "NotInterested" "NotInterested" ...
##  $ money     : num  86.3 45.6 68.4 52.9 61.9 ...
##     status           attraction            money       
##  Length:100         Length:100         Min.   :  0.93  
##  Class :character   Class :character   1st Qu.: 49.84  
##  Mode  :character   Mode  :character   Median : 81.73  
##                                        Mean   : 88.38  
##                                        3rd Qu.:121.57  
##                                        Max.   :200.99

The data set consists of three variables stored in three columns. The first column contains the relationship status of the women, the second shows whether the men are interested in the women, and the third represents the money spent on the presents. The data set represents 100 cases and the mean amount of money spent on a present is 88.38 dollars. In a next step, we visualize the data to get a more detailed impression of the relationships between variables.

# create plots
p1 <- ggplot(mlrdata, aes(status, money)) +   # data + x/y-axes
  geom_boxplot(fill=c("grey30", "grey70")) + # def. col.
  theme_set(theme_bw(base_size = 8))+   # black and white theme
  labs(x = "") +                        # x-axis label
  labs(y = "Money spent on present (AUD)", cex = .75) +   # y-axis label
  coord_cartesian(ylim = c(0, 250)) +   # y-axis range
  guides(fill = FALSE) +                # no legend
  ggtitle("Status")                     # title
# plot 2
p2 <- ggplot(mlrdata, aes(attraction, money)) +
  geom_boxplot(fill=c("grey30", "grey70")) +
  theme_set(theme_bw(base_size = 8))+
  labs(x = "") +                              # x-axis label
  labs(y = "Money spent on present (AUD)") +  # y-axis label
  coord_cartesian(ylim = c(0, 250)) +
  guides(fill = FALSE) +
  ggtitle("Attraction")
# plot 3
p3 <- ggplot(mlrdata, aes(x = money)) +
  geom_histogram(aes(y=..density..),    # add density statistic
                 binwidth = 10,         # def. bin width
                 colour = "black",      # def. bar edge colour
                 fill = "white") +      # def. bar col.
    theme_bw() +                        # black-white theme
  geom_density(alpha=.2, fill = "gray50") + # def. col. of overlay
    labs(x = "Money spent on present (AUD)") +
  labs(y = "Density of frequency")
# plot 4
p4 <- ggplot(mlrdata, aes(status, money)) +
  geom_boxplot(notch = F, aes(fill = factor(status))) + # create boxplot
  scale_fill_manual(values = c("grey30", "grey70")) +   # def. col. palette
  facet_wrap(~ attraction, nrow = 1) +  # separate panels for attraction
  theme_set(theme_bw(base_size = 8)) +
  labs(x = "") +
  labs(y = "Money spent on present (AUD)") +
  coord_cartesian(ylim = c(0, 250)) +
  guides(fill = FALSE)
# show plots
multiplot(p1, p3, p2, p4, cols = 2)

The upper left figure consists of a boxplot which shows how much money was spent depending on the relationship status of the women. The figure suggests that men spend more on women who are not in a relationship. The next figure shows the relationship between the money spent on presents and whether or not the men were interested in the women.

The boxplot in the upper right panel suggests that men spend substantially more on women if the men are interested in them. The next figure depicts the distribution of the amounts of money spent on women. In addition, the figure indicates the existence of two outliers (dots in the boxplot).

The histogram in the lower left panel shows that, although the mean amount of money spent on presents is 88.38 dollars, the distribution peaks around 50 dollars, indicating that on average, men spend about 50 dollars on presents. Finally, we will plot the amount of money spent on presents against relationship status by attraction in order to check whether the money spent on presents is affected by an interaction between attraction and relationship status.

The boxplot in the lower right panel confirms the existence of an interaction (a non-additive term) as men only spend more money on single women if the men are interested in the women. If men are not interested in the women, then the relationship status has no effect: they spend an equal amount of money on the women regardless of whether they are in a relationship or not.

We will now start to implement the regression model. In a first step, we create two saturated base-line models that contain all possible predictors (main effects and interactions). The two models are identical but one is generated with the “lm” and the other with the “glm” function as these functions offer different model parameters in their output.

m1.mlr = lm(                      # generate lm regression object
  money ~ 1 + attraction*status,  # def. regression formula (1 = intercept)
  data = mlrdata)                 # def. data
m1.glm = glm(                     # generate glm regression object
  money ~ 1 + attraction*status,  # def. regression formula (1 = intercept)
  family = gaussian,              # def. link function
  data = mlrdata)                 # def. data

After generating the saturated base-line models we can now start with the model fitting. Model fitting refers to a process that aims at finding the model that explains a maximum of variance with a minimum of predictors (see Field, Miles, and Field 2012, 318). Model fitting is therefore based on the principle of parsimony, which is related to Occam’s razor, according to which explanations that require fewer assumptions are more likely to be true.

3.2 Automatic Model Fitting and Why You Should Not Use It

In this section, we will use a step-wise step-down procedure that uses decreases in AIC (Akaike information criterion) as the criterion to minimize the model in a step-wise manner. This procedure aims at finding the model with the lowest AIC values by evaluating - step-by-step - whether the removal of a predictor (term) leads to a lower AIC value.

We use this method here just so that you know it exists and how to implement it, but you should rather avoid using automated model fitting. The reason for avoiding automated model fitting is that the algorithm only checks whether the AIC has decreased but not whether the model is stable or reliable. Thus, with automated model fitting, you can never be sure that the path that led you to the final model is reliable and that all models along the way were indeed stable. Imagine you want to climb down from a rooftop and you have a ladder - the problem is that you do not know if and how many rungs are broken. This is similar to using automated model fitting. In other sections, we will explore better methods to fit models (manual step-wise step-up and step-down procedures, for example).

The AIC is calculated using the equation below. The lower the AIC value, the better the balance between explained variance and the number of predictors. AIC values can and should only be compared for models that are fit on the same dataset with the same (number of) cases (\(LL\) stands for LogLikelihood and \(k\) represents the number of predictors in the model).

\[\begin{equation} -2LL + 2k \label{eq:aic} \end{equation}\]
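
To see that this formula is indeed what R computes, the AIC of a fitted model can be reproduced manually from its log-likelihood. A quick sanity check using the m1.glm model from above; note that R counts all estimated parameters, including the intercept and (for Gaussian models) the residual variance, towards \(k\).

# reproduce the AIC of m1.glm manually from its log-likelihood
ll <- logLik(m1.glm)          # extract the log-likelihood
k <- attr(ll, "df")           # number of estimated parameters
-2 * as.numeric(ll) + 2 * k   # -2LL + 2k
AIC(m1.glm)                   # built-in AIC (should be identical)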

Interactions are evaluated first and only if all interactions have been removed would the procedure start removing main effects. Other model fitting procedures (forced entry, step-wise step-up, hierarchical) are discussed during the implementation of other regression models. We cannot discuss all procedures here as model fitting is rather complex and a discussion of even the most common procedures would be too lengthy and time consuming at this point. It is important to note, though, that there is no perfect model fitting procedure and automated approaches should be handled with care as they are likely to ignore violations of model parameters that can be detected during manual - but time consuming - model fitting procedures. As a general rule of thumb, it is advisable to fit models as carefully and deliberately as possible. We will now begin to fit the model.

# automated AIC based model fitting
step(m1.mlr, direction = "both")
## Start:  AIC=592.5
## money ~ 1 + attraction * status
## 
##                     Df Sum of Sq   RSS AIC
## <none>                           34558 593
## - attraction:status  1     24947 59505 645
## 
## Call:
## lm(formula = money ~ 1 + attraction * status, data = mlrdata)
## 
## Coefficients:
##                          (Intercept)               attractionNotInterested  
##                                 99.2                                 -47.7  
##                         statusSingle  attractionNotInterested:statusSingle  
##                                 57.7                                 -63.2

The automated model fitting procedure informs us that removing predictors has not caused a decrease in the AIC. The saturated model is thus also the final minimal adequate model. We will now inspect the final minimal model and go over the model report.

m2.mlr = lm(                       # generate lm regression object
  money ~ (status + attraction)^2, # def. regression formula
  data = mlrdata)                  # def. data
m2.glm = glm(                      # generate glm regression object
  money ~ (status + attraction)^2, # def. regression formula
  family = gaussian,               # def. link function
  data = mlrdata)                  # def. data
# inspect final minimal model
summary(m2.mlr)
## 
## Call:
## lm(formula = money ~ (status + attraction)^2, data = mlrdata)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -45.08 -14.26   0.46  11.93  44.14 
## 
## Coefficients:
##                                      Estimate Std. Error t value
## (Intercept)                             99.15       3.79   26.13
## statusSingle                            57.69       5.37   10.75
## attractionNotInterested                -47.66       5.37   -8.88
## statusSingle:attractionNotInterested   -63.18       7.59   -8.32
##                                                  Pr(>|t|)    
## (Intercept)                          < 0.0000000000000002 ***
## statusSingle                         < 0.0000000000000002 ***
## attractionNotInterested                 0.000000000000038 ***
## statusSingle:attractionNotInterested    0.000000000000581 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 19 on 96 degrees of freedom
## Multiple R-squared:  0.852,  Adjusted R-squared:  0.847 
## F-statistic:  184 on 3 and 96 DF,  p-value: <0.0000000000000002

The first element of the report is called Call and it reports the regression formula of the model. Then, the report provides the residual distribution (the range, median and quartiles of the residuals) which allows drawing inferences about the distribution of differences between observed and expected values. If the residuals are distributed unevenly, then this is a strong indicator that the model is unstable and unreliable because mathematical assumptions on which the model is based are violated.
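
If needed, the five-number summary of the residuals reported under Residuals can also be produced directly from the fitted model.

# inspect the residual distribution directly
summary(resid(m2.mlr))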

Next, the model summary reports the most important part: a table with model statistics of the fixed-effects structure of the model. The table contains the estimates (coefficients of the predictors), standard errors, t-values, and the p-values which show whether a predictor significantly correlates with the dependent variable that the model investigates.

All main effects (status and attraction) as well as the interaction between status and attraction are reported as significantly correlated with the dependent variable (money). An interaction occurs if the correlation between the dependent variable and a predictor is affected by another predictor.

The topmost term is called the intercept and has a value of 99.15 which represents the base estimate to which all other estimates refer. To exemplify what this means, let us consider what the model would predict a man to spend on a present for a woman who is single and whom he is not interested in: the amount he would spend (based on the model) would be 99.15 dollars (the intercept) plus 57.69 dollars (because she is single) minus 47.66 dollars (because he is not interested in her) minus 63.18 dollars (because of the interaction between status and attraction).

#intercept  Single  NotInterested  Single:NotInterested
99.15     + 57.69  + 0           + 0     # 156.8 single + interested
## [1] 156.8
99.15     + 57.69  - 47.66       - 63.18 # 46.00 single + not interested
## [1] 46
99.15     - 0      + 0           - 0     # 99.15 relationship + interested
## [1] 99.15
99.15     - 0      - 47.66       - 0     # 51.49 relationship + not interested
## [1] 51.49

Interestingly, the model predicts that a man would invest even less money in a woman he is not interested in if she were single than if she were in a relationship! We can derive the same results more easily using the “predict” function.

# make prediction based on the model for original data
prediction <- predict(m2.mlr, newdata = mlrdata)
# inspect predictions
table(round(prediction,2))
## 
##  46.01  51.49  99.15 156.85 
##     25     25     25     25

Below the table of coefficients, the regression summary reports model statistics that provide information about how well the model performs. The difference between these values and the values in the coefficients table is that the model statistics refer to the model as a whole rather than to individual predictors.

The multiple R2-value is a measure of how much variance the model explains. A multiple R2-value of 0 would inform us that the model does not explain any variance, while a value of .852 means that the model explains 85.2 percent of the variance; a value of 1 would inform us that the model explains 100 percent of the variance and that the predictions of the model match the observed values perfectly. Multiplying the multiple R2-value by 100 thus provides the percentage of explained variance. Models that have a multiple R2-value equal to or higher than .05 are deemed substantially significant (see Szmrecsanyi 2006, 55). However, the claim that models should explain a minimum of 5 percent of variance is problematic as it is not uncommon for models to have very low explanatory power while still performing significantly and systematically better than chance. In addition, the total amount of explained variance is negligible in cases where one is interested in very weak but significant effects. It is much more important for a model to perform significantly better than a minimal base-line model because if this is not the case, then the model does not have any predictive and therefore no explanatory power.
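
As a quick check of this interpretation, the multiple R2 of a linear model can be recovered manually: it equals the squared correlation between the observed and the fitted values.

# recover the multiple R2 manually
cor(mlrdata$money, fitted(m2.mlr))^2  # squared correlation of observed and fitted
summary(m2.mlr)$r.squared             # value reported in the model summary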

The adjusted R2-value considers the amount of explained variance in light of the number of predictors in the model (it is thus somewhat similar to the AIC and BIC) and informs us about how well the model would perform if it were applied to the population from which the sample is drawn. Ideally, the difference between the multiple and the adjusted R2-value should be very small as this means that the model is not overfitted. If, however, the difference between multiple and adjusted R2-value is substantial, then this would strongly suggest that the model is unstable and overfitted to the data while being inadequate for drawing inferences about the population. Substantial differences between multiple and adjusted R2-values can indicate that the data contains outliers that cause the distribution of the data on which the model is based to differ from the distribution that the model mathematically requires to provide reliable estimates. The difference between the multiple and adjusted R2-value in our model is very small (85.2 - 84.7 = .5 percentage points) and should not cause concern.
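
The adjustment itself is a simple function of the multiple R2, the number of cases n, and the number of predictors k, as the following sketch shows.

# compute the adjusted R2 from the multiple R2 by hand
r2 <- summary(m2.mlr)$r.squared
n <- nrow(mlrdata)                   # number of cases
k <- length(coef(m2.mlr)) - 1        # number of predictors (excl. intercept)
1 - (1 - r2) * (n - 1)/(n - k - 1)   # should match summary(m2.mlr)$adj.r.squared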

Before continuing, we will calculate the confidence intervals of the coefficients.

# extract confidence intervals of the coefficients
confint(m2.mlr)
##                                       2.5 % 97.5 %
## (Intercept)                           91.62 106.69
## statusSingle                          47.04  68.34
## attractionNotInterested              -58.31 -37.01
## statusSingle:attractionNotInterested -78.24 -48.11
# create and compare baseline- and minimal adequate model
m0.mlr <- lm(money ~1, data = mlrdata)
anova(m0.mlr, m2.mlr)
## Analysis of Variance Table
## 
## Model 1: money ~ 1
## Model 2: money ~ (status + attraction)^2
##   Res.Df    RSS Df Sum of Sq   F              Pr(>F)    
## 1     99 233562                                         
## 2     96  34558  3    199005 184 <0.0000000000000002 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Now, we compare the final minimal adequate model to the base-line model to test whether the final model significantly outperforms the baseline model.

# compare baseline- and minimal adequate model
Anova(m0.mlr, m2.mlr, type = "III")
## Anova Table (Type III tests)
## 
## Response: money
##             Sum Sq Df F value              Pr(>F)    
## (Intercept) 781016  1    2170 <0.0000000000000002 ***
## Residuals    34558 96                                
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

The comparison between the two models confirms that the minimal adequate model performs significantly better (makes significantly more accurate estimates of the outcome variable) than the baseline model.

3.3 Outlier Detection

After implementing the multiple regression, we now need to look for outliers and perform the model diagnostics by testing whether removing data points disproportionately decreases model fit. To begin with, we generate diagnostic plots.

# start plotting
par(mfrow = c(2, 2)) # display plots in 2 rows/2 columns
plot(m2.mlr); par(mfrow = c(1, 1)) # create plots and restore original settings

The plots do not show severe problems such as funnel-shaped patterns or drastic deviations from the diagonal line in the Normal Q-Q plot (have a look at the explanation of what to look for and how to interpret these diagnostic plots in the section on simple linear regression), but data points 52, 64, and 83 are repeatedly indicated as potential outliers.

# determine a cutoff for data points that have D-values higher than 4/(n-k-1)
cutoff <- 4/((nrow(mlrdata)-length(m2.mlr$coefficients)-2))
# start plotting
par(mfrow = c(1, 2))           # display plots in 1 row/2 columns
qqPlot(m2.mlr, main="QQ Plot") # create qq-plot
## [1] 52 83
plot(m2.mlr, which=4, cook.levels = cutoff); par(mfrow = c(1, 1))

The graphs indicate that data points 52, 64, and 83 may be problematic. We will therefore statistically evaluate whether these data points need to be removed. In order to find out which data points require removal, we extract the influence measure statistics and add them to our data set.

# extract influence statistics
infl <- influence.measures(m2.mlr)
# add infl. statistics to data
mlrdata <- data.frame(mlrdata, infl[[1]], infl[[2]])
# annotate too influential data points
remove <- apply(infl$is.inf, 1, function(x) {
  ifelse(x == TRUE, return("remove"), return("keep")) } )
# add annotation to data
mlrdata <- data.frame(mlrdata, remove)
# number of rows before removing outliers
nrow(mlrdata)
## [1] 100
# remove outliers
mlrdata <- mlrdata[mlrdata$remove == "keep", ]
# number of rows after removing outliers
nrow(mlrdata)
## [1] 98

The difference in the number of rows in the data set before and after removing data points indicates that two data points which represented outliers have been removed.

3.4 Rerun Regression

As we have a different data set now, we need to rerun the regression analysis. As the steps are identical to the regression analysis performed above, the steps will not be described in greater detail.

# recreate regression models on new data
m0.mlr = lm(money ~ 1, data = mlrdata)
m0.glm = glm(money ~ 1, family = gaussian, data = mlrdata)
m1.mlr = lm(money ~ (status + attraction)^2, data = mlrdata)
m1.glm = glm(money ~ status * attraction, family = gaussian,
             data = mlrdata)
# automated AIC based model fitting
step(m1.mlr, direction = "both")
## Start:  AIC=570.3
## money ~ (status + attraction)^2
## 
##                     Df Sum of Sq   RSS AIC
## <none>                           30411 570
## - status:attraction  1     21647 52058 621
## 
## Call:
## lm(formula = money ~ (status + attraction)^2, data = mlrdata)
## 
## Coefficients:
##                          (Intercept)                          statusSingle  
##                                 99.2                                  55.9  
##              attractionNotInterested  statusSingle:attractionNotInterested  
##                                -47.7                                 -59.5
# create new final models
m2.mlr = lm(money ~ (status + attraction)^2, data = mlrdata)
m2.glm = glm(money ~ status * attraction, family = gaussian,
             data = mlrdata)
# inspect final minimal model
summary(m2.mlr)
## 
## Call:
## lm(formula = money ~ (status + attraction)^2, data = mlrdata)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -35.76 -13.51  -0.99  10.60  38.77 
## 
## Coefficients:
##                                      Estimate Std. Error t value
## (Intercept)                             99.15       3.60   27.56
## statusSingle                            55.85       5.14   10.87
## attractionNotInterested                -47.66       5.09   -9.37
## statusSingle:attractionNotInterested   -59.46       7.27   -8.18
##                                                  Pr(>|t|)    
## (Intercept)                          < 0.0000000000000002 ***
## statusSingle                         < 0.0000000000000002 ***
## attractionNotInterested                 0.000000000000004 ***
## statusSingle:attractionNotInterested    0.000000000001338 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 18 on 94 degrees of freedom
## Multiple R-squared:  0.857,  Adjusted R-squared:  0.853 
## F-statistic:  188 on 3 and 94 DF,  p-value: <0.0000000000000002
# extract confidence intervals of the coefficients
confint(m2.mlr)
##                                       2.5 % 97.5 %
## (Intercept)                           92.01 106.30
## statusSingle                          45.65  66.06
## attractionNotInterested              -57.76 -37.56
## statusSingle:attractionNotInterested -73.89 -45.03
# compare baseline with final model
anova(m0.mlr, m2.mlr)
## Analysis of Variance Table
## 
## Model 1: money ~ 1
## Model 2: money ~ (status + attraction)^2
##   Res.Df    RSS Df Sum of Sq   F              Pr(>F)    
## 1     97 213227                                         
## 2     94  30411  3    182816 188 <0.0000000000000002 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
# compare baseline with final model
Anova(m0.mlr, m2.mlr, type = "III")
## Anova Table (Type III tests)
## 
## Response: money
##             Sum Sq Df F value              Pr(>F)    
## (Intercept) 760953  1    2352 <0.0000000000000002 ***
## Residuals    30411 94                                
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

3.5 Additional Model Diagnostics

After rerunning the regression analysis on the updated data set, we again create diagnostic plots in order to check whether there are potentially problematic data points.

# start plotting
par(mfrow = c(2, 2)) # display plots in 2 rows/2 columns
plot(m2.mlr)         # plot fitted values

par(mfrow = c(1, 1)) # restore original settings
# determine a cutoff for data points that have
# D-values higher than 4/(n-k-1)
cutoff <- 4/((nrow(mlrdata)-length(m2.mlr$coefficients)-2))
# start plotting
par(mfrow = c(1, 2))           # display plots in 1 row/2 columns
qqPlot(m2.mlr, main="QQ Plot") # create qq-plot
## 84 88 
## 82 86
plot(m2.mlr, which=4, cook.levels = cutoff) # plot Cook's distance

par(mfrow = c(1, 1))           # restore original settings

Although the diagnostic plots indicate that additional points may be problematic, these data points deviate substantially less from the trend than the data points that have already been removed. To make sure that it is acceptable to retain the data points that are deemed potentially problematic by the diagnostic plots, we extract diagnostic statistics and add them to the data.

# add model diagnostics to the data
mlrdata$residuals <- resid(m2.mlr)
mlrdata$standardized.residuals <- rstandard(m2.mlr)
mlrdata$studentized.residuals <- rstudent(m2.mlr)
mlrdata$cooks.distance <- cooks.distance(m2.mlr)
mlrdata$dffit <- dffits(m2.mlr)
mlrdata$leverage <- hatvalues(m2.mlr)
mlrdata$covariance.ratios <- covratio(m2.mlr)
mlrdata$fitted <- m2.mlr$fitted.values

We can now use these diagnostic statistics to create more precise diagnostic plots.

# plot 5
p5 <- ggplot(mlrdata,
             aes(studentized.residuals)) +
  theme(legend.position = "none") +
  theme_set(theme_bw(base_size = 8))+
  geom_histogram(aes(y=..density..),
                 binwidth = 1,
                 colour="black",
                 fill="white") +
  labs(x = "Studentized Residual", y = "Density") +
  stat_function(fun = dnorm,
                args = list(mean = mean(mlrdata$studentized.residuals, na.rm = TRUE),
                            sd = sd(mlrdata$studentized.residuals, na.rm = TRUE)),
                colour = "red", size = 1)
# plot 6
p6 <- ggplot(mlrdata, aes(fitted, studentized.residuals)) +
  geom_point() +
  geom_smooth(method = "lm", colour = "Red")+
    theme_set(theme_bw(base_size = 8))+
  labs(x = "Fitted Values",
       y = "Studentized Residual")
# plot 7
p7 <- qplot(sample = mlrdata$studentized.residuals, stat="qq") +
    theme_set(theme_bw(base_size = 8))+
  labs(x = "Theoretical Values",
       y = "Observed Values")
multiplot(p5, p6, p7, cols = 3)

The new diagnostic plots do not indicate outliers that require removal. With respect to such data points the following parameters should be considered:

  1. Data points with standardized residuals > 3.29 should be removed (Field, Miles, and Field 2012, 269).

  2. If more than 1 percent of data points have standardized residuals exceeding 2.58, then the error rate of the model is unacceptable (Field, Miles, and Field 2012, 269).

  3. If more than 5 percent of data points have standardized residuals exceeding 1.96, then the error rate of the model is unacceptable (Field, Miles, and Field 2012, 269).

  4. In addition, data points with Cook’s D-values > 1 should be removed (Field, Miles, and Field 2012, 269).

  5. Also, data points with leverage values higher than \(3(k + 1)/n\) (k = number of predictors, n = number of cases in model) should be removed (Field, Miles, and Field 2012, 270).

  6. There should not be any autocorrelation among predictors. This means that independent variables must not be correlated with themselves (for instance, because data points come from the same subject). If there is autocorrelation among predictors, then a repeated measures design or a (hierarchical) mixed-effects model should be implemented instead.

  7. Predictors must not substantially correlate with each other (multicollinearity). If a model contains predictors that have variance inflation factors (VIFs) > 10, the model is unreliable (Myers 1990), and predictors causing such VIFs should be removed. Indeed, even VIFs of 2.5 can be problematic (Szmrecsanyi 2006, 215), and Zuur, Ieno, and Elphick (2010) propose that variables with VIFs exceeding 3 should be removed.

  8. Predictors with 1/VIF (tolerance) values \(<\) .1 must be removed (values below .2 are considered problematic) (Menard 1995).

  9. The mean value of the VIFs should not substantially exceed 1 (Bowerman and O’Connell 1990).

# 1: optimal = 0
# (listed data points should be removed)
which(abs(mlrdata$standardized.residuals) > 3.29)
## integer(0)
# 2: the percentage should not exceed 1 percent
# (otherwise the error rate of the model is unacceptable)
stdres_258 <- as.vector(sapply(mlrdata$standardized.residuals, function(x) {
ifelse(abs(x) > 2.58, 1, 0) } ))
(sum(stdres_258) / length(stdres_258)) * 100
## [1] 0
# 3: the percentage should not exceed 5 percent
# (otherwise the error rate of the model is unacceptable)
stdres_196 <- as.vector(sapply(mlrdata$standardized.residuals, function(x) {
ifelse(abs(x) > 1.96, 1, 0) } ))
(sum(stdres_196) / length(stdres_196)) * 100
## [1] 6.122
# 4: optimal = 0
# (listed data points should be removed)
which(mlrdata$cooks.distance > 1)
## integer(0)
# 5: optimal = 0
# (data points with leverage values exceeding three times the mean leverage should be removed)
which(mlrdata$leverage >= (3*mean(mlrdata$leverage)))
## integer(0)
# 6: checking autocorrelation:
# Durbin-Watson test (optimal: large p-value)
dwt(m2.mlr)
##  lag Autocorrelation D-W Statistic p-value
##    1        -0.01433         1.968   0.604
##  Alternative hypothesis: rho != 0
# 7: test multicollinearity 1
vif(m2.mlr)
##                         statusSingle              attractionNotInterested 
##                                 2.00                                 1.96 
## statusSingle:attractionNotInterested 
##                                 2.96
# 8: test multicollinearity 2
1/vif(m2.mlr)
##                         statusSingle              attractionNotInterested 
##                               0.5000                               0.5102 
## statusSingle:attractionNotInterested 
##                               0.3378
# 9: the mean VIF should not substantially exceed 1
mean(vif(m2.mlr))
## [1] 2.307

Except for the mean VIF value (2.307), which should not substantially exceed 1, and the percentage of standardized residuals exceeding 1.96 (6.12 percent, slightly above the 5-percent threshold), the diagnostics are acceptable. We will now test whether the sample size is sufficient for our model, based on the rules of thumb by Field, Miles, and Field (2012, 273-274) and Green (1991) discussed above.

3.6 Evaluation of Sample Size

After performing the diagnostics, we will now test whether the sample size is adequate and what value of R we would expect based on a random distribution, in order to be able to estimate how likely a \(\beta\)-error is given the present sample size (see Field, Miles, and Field 2012, 274). Beta errors (or \(\beta\)-errors) refer to the erroneous assumption that a predictor is not significant (based on the analysis and given the sample) although it does have an effect in the population. In other words, a \(\beta\)-error means overlooking a significant effect because of weaknesses of the analysis. The test statistic ranges between 0 and 1, where lower values are better. If the values approximate 1, then there is serious concern as the model is not reliable given the sample size. In such cases, unfortunately, the best option is to increase the sample size.

# check if sample size is sufficient
smplesz(m2.mlr)
## [1] "Sample too small: please increase your sample by  9  data points"
# check beta-error likelihood
expR(m2.mlr)
## [1] "Based on the sample size expect a false positive correlation of 0.0309 between the predictors and the predicted"

The function “smplesz” reports that the sample size is insufficient by 9 data points according to Green (1991). The likelihood of \(\beta\)-errors, however, is very small (0.0309). As a last step, we summarize the results of the regression analysis.

# tabulate regression results
mlrsummary <- mlinrsummary(m2.mlr, m2.glm, ia = T)
# remove columns with confidence intervals
mlrsummary[,-c(3:4)]
Overview of the final minimal adequate linear fixed-effects regression model.

                                      Estimate   VIF  CI(97.5%)  Std. Error  t value  Pr(>|t|)  Significance
(Intercept)                              99.15          106.21        3.60     27.56         0  p < .001***
statusSingle                             55.85  2.00     65.93        5.14     10.87         0  p < .001***
attractionNotInterested                 -47.66  1.96    -37.69        5.09     -9.37         0  p < .001***
statusSingle:attractionNotInterested    -59.46  2.96    -45.21        7.27     -8.18         0  p < .001***

Model statistics                   Value
Number of cases in model           98
Residual standard error on 94 DF   17.99
Multiple R2                        0.857
Adjusted R2                        0.853
AIC                                850.4
BIC                                863.32
F-statistic                        188.36 (DF: 3 and 94), p-value: 0 (p < .001***)

Although Field, Miles, and Field (2012) suggest that the main effects of the predictors involved in the interaction should not be interpreted, they are interpreted here to illustrate how the results of a multiple linear regression can be reported. Accordingly, the results of the regression analysis performed above can be summarized as follows:

A multiple linear regression was fitted to the data using an automated, step-wise, AIC-based (Akaike’s Information Criterion) procedure. The model fitting arrived at a final minimal model. During the model diagnostics, two outliers were detected and removed. Further diagnostics did not find other issues after the removal.

The final minimal adequate regression model is based on 98 data points and performs highly significantly better than a minimal baseline model (Multiple R2: .857, Adjusted R2: .853, F-statistic (3, 94): 188.36, AIC: 850.4, BIC: 863.32, p<.001\(***\)). The final minimal adequate regression model reports attraction and status as significant main effects. The relationship status of the women correlates highly significantly and positively with the amount of money spent on the women’s presents (SE: 5.14, t-value: 10.87, p<.001\(***\)): men spend about 155 dollars on presents if the women are single, while they spend 99.15 dollars if the women are in a relationship. Whether men are attracted to the women also correlates highly significantly with the money they spend on the women (SE: 5.09, t-value: -9.37, p<.001\(***\)): if men are not interested in the women, they spend 47.66 dollars less on a present compared with women the men are interested in.

Furthermore, the final minimal adequate regression model reports a highly significant interaction between relationship status and attraction (SE: 7.27, t-value: -8.18, p<.001\(***\)): if women are single but the men are not interested in them, men spend 59.46 dollars less on their presents compared with all other constellations.

4 Multiple Binomial Logistic Regression

Logistic regression is a multivariate analysis technique that builds on, and is very similar in its implementation to, linear regression, but logistic regressions take dependent variables that represent nominal rather than numeric scaling (Harrell Jr 2015). This difference requires that the linear regression be modified in certain ways to avoid producing non-sensical outcomes. The most fundamental difference between logistic and linear regression is that logistic regression works on the probabilities of an outcome (the likelihood) rather than on the outcome itself. In addition, these probabilities are transformed into logged odds (logits) so that the model cannot produce predicted values greater than 1 (instance occurs) or smaller than 0 (instance does not occur).
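
The transformation at the heart of this is the logit, i.e. the logged odds. The following lines illustrate how it maps probabilities onto the real number line and how the inverse logistic function maps any real value back into the interval between 0 and 1.

# the logit maps probabilities onto the real number line ...
p <- c(.1, .5, .9)
log(p/(1 - p))   # logged odds: negative below .5, zero at .5, positive above .5
## [1] -2.197  0.000  2.197
# ... and the inverse logistic function maps real values back into [0, 1]
x <- c(-5, 0, 5)
1/(1 + exp(-x))
## [1] 0.006693 0.500000 0.993307
# qlogis() and plogis() are the built-in equivalents of these transformations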

To understand what this means, we will use a very simple example. In this example, we want to see whether the height of men affects their likelihood of being in a relationship. The data we use represent a data set consisting of two variables: bodyheight and relationship.

First six observations of an example data set representing the height and relationship status of a sample of men.
bodyheight relationship
148.8 0
155.9 0
157.8 0
158.7 0
164.5 0
170.2 1

The left panel of the Figure above shows that a linear model would predict non-sensical values for the relationship status, which represents a factor (0 = Not in Relationship and 1 = In Relationship): a value such as 1.1 does not make sense if the only options are 0 or 1. The logistic function shown in the right panel of the Figure above solves this problem by working on the logged probabilities of an outcome rather than on the actual outcome.
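
As a sketch, and assuming the example data above were stored in a data frame called blr_example, the two models contrasted in the Figure could be fitted as follows (commented out because blr_example is hypothetical).

# sketch: the two models contrasted in the figure (blr_example is hypothetical)
# a linear model can predict non-sensical values below 0 or above 1
# m.lin <- lm(relationship ~ bodyheight, data = blr_example)
# a binomial glm works on logged odds and keeps predictions between 0 and 1
# m.log <- glm(relationship ~ bodyheight, family = binomial, data = blr_example)
# predict(m.log, type = "response")  # predicted probabilities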

4.1 Example 1: EH in Kiwi English

To exemplify how to implement a logistic regression in R (see Agresti 1996 and Agresti 2011 for very good and thorough introductions to this topic), we will analyse the use of the discourse particle eh in New Zealand English and test which factors correlate with its occurrence. The data set represents speech units in a corpus that were coded for the speaker who uttered a given speech unit, the gender, ethnicity, and age of that speaker, and whether or not the speech unit contained an eh. To begin with, we clean the current workspace, set options, install and activate relevant packages, load customized functions, and load the example data set.

# load data
blrdata <- read.table("https://slcladal.github.io/data/blrdata.txt",
                      comment.char = "",  # data does not contain comments
                      quote = "",         # data does not contain quotes
                      sep = "\t",         # data is tab separetd
                      header = T)         # variables have headers
# inspect data
str(blrdata)
## 'data.frame':    25821 obs. of  5 variables:
##  $ ID       : chr  "<S1A-001#M>" "<S1A-001#M>" "<S1A-001#M>" "<S1A-001#M>" ...
##  $ Gender   : chr  "Men" "Men" "Men" "Men" ...
##  $ Age      : chr  "Young" "Young" "Young" "Young" ...
##  $ Ethnicity: chr  "Pakeha" "Pakeha" "Pakeha" "Pakeha" ...
##  $ EH       : int  0 1 0 0 1 1 0 0 0 1 ...

The summary of the data shows that the data set contains 25,821 observations of five variables. The variable “ID” contains strings that represent a combination of the file and the speaker of a speech unit. The second variable represents the gender, the third the age, and the fourth the ethnicity of speakers. The fifth variable represents whether or not a speech unit contained the discourse particle eh. The first six lines of the data set are shown in the Table below.

First six line of the blrdata data set.
ID Gender Age Ethnicity EH
<S1A-001#M> Men Young Pakeha 0
<S1A-001#M> Men Young Pakeha 1
<S1A-001#M> Men Young Pakeha 0
<S1A-001#M> Men Young Pakeha 0
<S1A-001#M> Men Young Pakeha 1
<S1A-001#M> Men Young Pakeha 1

Next, we factorize the variables in our data set. In other words, we specify that the strings represent variable levels. We also define new reference levels: by default, R uses the variable level that occurs first in alphabetical ordering as the reference level for each variable, so we redefine the reference levels for Age and Ethnicity.

vrs <- c("Age", "Gender", "Ethnicity", "ID")  # define variables to be factorized
fctr <- which(colnames(blrdata) %in% vrs)     # define vector with variables
blrdata[,fctr] <- lapply(blrdata[,fctr], factor) # factorize variables
blrdata$Age <- relevel(blrdata$Age, "Young") # relevel Age (Young = Reference)
blrdata$Ethnicity <- relevel(                # relevel Ethnicity
  blrdata$Ethnicity, "Pakeha") # define Pakeha as Reference level)

After preparing the data, we will now plot the data to get an overview of potential relationships between variables.

ggplot(blrdata, aes(Age, EH, color = Gender)) +
  facet_wrap(~Ethnicity) +
  stat_summary(fun.y = mean, geom = "point") +
  stat_summary(fun.data = mean_cl_boot, geom = "errorbar", width = 0.2) +
  theme_set(theme_bw(base_size = 10)) +
  theme(legend.position = "top") +
  labs(x = "", y = "Observed Probabilty of eh") +
  scale_color_manual(values = c("gray20", "gray70"))

With respect to main effects, the Figure above indicates that men use eh more frequently than women, that young speakers use it more frequently than old speakers, and that speakers who are descendants of European settlers (Pakeha) use eh more frequently than Maori (the native inhabitants of New Zealand).

The plots in the lower panels do not indicate significant interactions between use of eh and the Age, Gender, and Ethnicity of speakers. In a next step, we will start building the logistic regression model.

4.2 Model Building

As a first step, we need to define contrasts and add a distance matrix to the options. Contrasts define which variable levels should be compared and how, and therefore influence how the results of the regression analysis are presented.
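
For instance, the treatment contrasts set below compare each level of a factor against a reference level; the contrast matrix of a three-level factor illustrates this.

# treatment contrasts compare each level against the (first) reference level
contr.treatment(3)
##   2 3
## 1 0 0
## 2 1 0
## 3 0 1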

# set contrasts
options(contrasts = c("contr.treatment", "contr.poly"))
# create distance matrix
blrdata.dist <- datadist(blrdata)
# include distance matrix in options
options(datadist = "blrdata.dist")

Next, we generate two minimal models that predict the use of eh solely based on the intercept.

# baseline glm model
m0.glm = glm(EH ~ 1, family = binomial, data = blrdata)
# baseline lrm model
m0.lrm = lrm(EH ~ 1, data = blrdata, x = T, y = T)

A few words on “glm” vs “lrm”: Baayen (2008, 196-197) states that “lrm” should be the function of choice in cases where each row contains exactly one success OR failure (1 or 0), while “glm” is preferable if there are two columns holding the number of successes and the number of failures respectively. I have tried it both ways and both functions work fine if each row contains exactly one success or failure, but only “glm” can handle the latter case.
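
To illustrate the latter case, the sketch below shows how “glm” handles aggregated data with one column of successes and one of failures (the counts are made up for illustration).

# sketch: glm with aggregated successes/failures (made-up counts)
agg <- data.frame(Age = c("Young", "Old"),
                  eh = c(5000, 3700),     # speech units containing eh
                  no.eh = c(9000, 8100))  # speech units without eh
m.agg <- glm(cbind(eh, no.eh) ~ Age, family = binomial, data = agg)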

4.3 Model Fitting

We will now start with the model fitting procedure. In the present case, we will use a manual step-wise step-up procedure during which predictors are added to the model if they significantly improve the model fit. In addition, we will perform diagnostics at each step of the model fitting process rather than after the fitting.

We will test two things in particular: whether the data has incomplete information or complete separation and if the model suffers from multicollinearity.

Incomplete information or complete separation means that the data does not contain all combinations of the predictor and the dependent variable. This is important because if the data does not contain cases of all combinations, the model will assume that it has found a perfect predictor. In such cases, the model overestimates the effect of that predictor and the results of that model are no longer reliable. For example, if eh was only used by young speakers in the data, the model would jump on that fact and conclude: “Ha! If there is an old speaker, that speaker will never ever and under no circumstances say eh - I can therefore ignore all other factors!”

Multicollinearity means that predictors correlate and have shared variance. This means that whichever predictor is included first will take all the variance that it can explain, and the remaining part of the variance that is shared will not be attributed to the other predictor. This may lead to reporting that a factor is not significant because all of the variance it can explain is already accounted for. However, if the other predictor were included first, then the original predictor would be returned as insignificant. This means that - depending on the order in which predictors are added - the results of the regression can differ dramatically and the model is therefore not reliable. Multicollinearity is actually a very common problem and there are various ways to deal with it, but it cannot be ignored (at least in regression analyses).

We will start by adding “Age” to the minimal adequate model.

# check incomplete information
ifelse(min(ftable(blrdata$Age, blrdata$EH)) == 0, "not possible", "possible")
## [1] "possible"
# add age to the model
m1.glm = glm(EH ~ Age, family = binomial, data = blrdata)
# check multicollinearity (vifs should have values of 3 or lower for main effects)
ifelse(max(vif(m1.glm)) <= 3,  "vifs ok", "WARNING: high vifs!") # VIFs ok
## [1] "vifs ok"
# check if adding Age significantly improves model fit
anova(m1.glm, m0.glm, test = "Chi")
## Analysis of Deviance Table
## 
## Model 1: EH ~ Age
## Model 2: EH ~ 1
##   Resid. Df Resid. Dev Df Deviance            Pr(>Chi)    
## 1     25819      32377                                    
## 2     25820      33008 -1     -631 <0.0000000000000002 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

As the data does not contain incomplete information, the VIF values are below 3, and adding “Age” has significantly improved the model fit (the p-value of the ANOVA is lower than .05), we proceed with “Age” included.

We continue by adding “Gender”. We add a second ANOVA test to see if including “Gender” affects the significance of other predictors in the model. If this were the case - if adding “Gender” would cause “Age” to become insignificant - then we could change the order in which we include predictors into our model.

ifelse(min(ftable(blrdata$Gender, blrdata$EH)) == 0, "not possible", "possible")
## [1] "possible"
m2.glm <- update(m1.glm, . ~ . +Gender)
ifelse(max(vif(m2.glm)) <= 3,  "vifs ok", "WARNING: high vifs!") # VIFs ok
## [1] "vifs ok"
anova(m2.glm, m1.glm, test = "Chi")
## Analysis of Deviance Table
## 
## Model 1: EH ~ Age + Gender
## Model 2: EH ~ Age
##   Resid. Df Resid. Dev Df Deviance            Pr(>Chi)    
## 1     25818      32140                                    
## 2     25819      32377 -1     -237 <0.0000000000000002 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Anova(m2.glm, test = "LR")
## Analysis of Deviance Table (Type II tests)
## 
## Response: EH
##        LR Chisq Df          Pr(>Chisq)    
## Age         669  1 <0.0000000000000002 ***
## Gender      237  1 <0.0000000000000002 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Again, including “Gender” significantly improves model fit and the data does not contain incomplete information or complete separation. Also, including “Gender” does not affect the significance of “Age”. Now, we include “Ethnicity”.

ifelse(min(ftable(blrdata$Ethnicity, blrdata$EH)) == 0, "not possible", "possible")
## [1] "possible"
m3.glm <- update(m2.glm, . ~ . +Ethnicity)
ifelse(max(vif(m3.glm)) <= 3,  "vifs ok", "WARNING: high vifs!") # VIFs ok
## [1] "vifs ok"
anova(m3.glm, m2.glm, test = "Chi")
## Analysis of Deviance Table
## 
## Model 1: EH ~ Age + Gender + Ethnicity
## Model 2: EH ~ Age + Gender
##   Resid. Df Resid. Dev Df Deviance Pr(>Chi)
## 1     25817      32139                     
## 2     25818      32140 -1   -0.261     0.61

Since adding “Ethnicity” does not significantly improve the model fit, we do not need to test if its inclusion affects the significance of other predictors. We continue without “Ethnicity” and include the interaction between “Age” and “Gender”.

ifelse(min(ftable(blrdata$Age, blrdata$Gender, blrdata$EH)) == 0, "not possible", "possible")
## [1] "possible"
m4.glm <- update(m2.glm, . ~ . +Age*Gender)
ifelse(max(vif(m4.glm)) <= 3,  "vifs ok", "WARNING: high vifs!") # VIFs ok
## [1] "vifs ok"
anova(m4.glm, m2.glm, test = "Chi")
## Analysis of Deviance Table
## 
## Model 1: EH ~ Age + Gender + Age:Gender
## Model 2: EH ~ Age + Gender
##   Resid. Df Resid. Dev Df Deviance Pr(>Chi)
## 1     25817      32139                     
## 2     25818      32140 -1   -0.124     0.72

The interaction between “Age” and “Gender” is not significant, which means that men and women do not behave differently with respect to their use of eh as they age. Also, the data does not contain incomplete information and the model does not suffer from multicollinearity - the predictors are not collinear. We can now check whether there is a significant interaction between “Age” and “Ethnicity”.

ifelse(min(ftable(blrdata$Age, blrdata$Ethnicity, blrdata$EH)) == 0, "not possible", "possible")
## [1] "possible"
m5.glm <- update(m2.glm, . ~ . +Age*Ethnicity)
ifelse(max(vif(m5.glm)) <= 3,  "vifs ok", "WARNING: high vifs!") # VIFs ok
## [1] "vifs ok"
anova(m5.glm, m2.glm, test = "Chi")
## Analysis of Deviance Table
## 
## Model 1: EH ~ Age + Gender + Ethnicity + Age:Ethnicity
## Model 2: EH ~ Age + Gender
##   Resid. Df Resid. Dev Df Deviance Pr(>Chi)
## 1     25816      32136                     
## 2     25818      32140 -2    -3.07     0.22

Again, no incomplete information or multicollinearity and no significant interaction. Now, we test if there exists a significant interaction between “Gender” and “Ethnicity”.

ifelse(min(ftable(blrdata$Gender, blrdata$Ethnicity, blrdata$EH)) == 0, "not possible", "possible")
## [1] "possible"
m6.glm <- update(m2.glm, . ~ . +Gender*Ethnicity)
ifelse(max(vif(m6.glm)) <= 3,  "vifs ok", "WARNING: high vifs!") # VIFs ok
## [1] "vifs ok"
anova(m6.glm, m2.glm, test = "Chi")
## Analysis of Deviance Table
## 
## Model 1: EH ~ Age + Gender + Ethnicity + Gender:Ethnicity
## Model 2: EH ~ Age + Gender
##   Resid. Df Resid. Dev Df Deviance Pr(>Chi)
## 1     25816      32139                     
## 2     25818      32140 -2   -0.272     0.87

As the interaction between “Gender” and “Ethnicity” is not significant, we continue without it. In a final step, we include the three-way interaction between “Age”, “Gender”, and “Ethnicity”.

ifelse(min(ftable(blrdata$Age, blrdata$Gender, blrdata$Ethnicity, blrdata$EH)) == 0, "not possible", "possible")
## [1] "possible"
m7.glm <- update(m2.glm, . ~ . +Age*Gender*Ethnicity)
ifelse(max(vif(m7.glm)) <= 3,  "vifs ok", "WARNING: high vifs!")
anova(m7.glm, m2.glm, test = "Chi")

As the three-way interaction is also not significant, we have arrived at our final minimal adequate model (m2.glm). We now refit this model using the “lrm” function, which provides additional model statistics.

m2.lrm <- lrm(EH ~ Age+Gender, data = blrdata, x = T, y = T, linear.predictors = T)
m2.lrm
## Logistic Regression Model
##  
##  lrm(formula = EH ~ Age + Gender, data = blrdata, x = T, y = T, 
##      linear.predictors = T)
##  
##                         Model Likelihood    Discrimination    Rank Discrim.    
##                               Ratio Test           Indexes          Indexes    
##  Obs         25821    LR chi2     868.21    R2       0.046    C       0.602    
##   0          17114    d.f.             2    g        0.432    Dxy     0.203    
##   1           8707    Pr(> chi2) <0.0001    gr       1.541    gamma   0.302    
##  max |deriv| 3e-10                          gp       0.091    tau-a   0.091    
##                                             Brier    0.216                     
##  
##               Coef    S.E.   Wald Z Pr(>|Z|)
##  Intercept    -0.2324 0.0223 -10.44 <0.0001 
##  Age=Old      -0.8305 0.0335 -24.78 <0.0001 
##  Gender=Women -0.4201 0.0273 -15.42 <0.0001 
## 
anova(m2.lrm)
##                 Wald Statistics          Response: EH 
## 
##  Factor     Chi-Square d.f. P     
##  Age        614.0      1    <.0001
##  Gender     237.7      1    <.0001
##  TOTAL      802.6      2    <.0001

After fitting the model, we validate it to avoid arriving at a final minimal model that is overfitted to the data at hand.

4.4 Model Validation

To validate a model, you can apply the “validate” function to a saturated model. The output of the “validate” function shows how often predictors are retained if the sample is repeatedly re-drawn, with replacement, at its original size. Executing the function requires some patience as it is rather computationally expensive, which is why it is commented out below.

# model validation (remove # to activate: output too long for website)
m7.lrm <- lrm(EH ~ (Age+Gender+Ethnicity)^3, data = blrdata, x = T, y = T, linear.predictors = T)
#validate(m7.lrm, bw = T, B = 200)

The “validate” function shows that retaining two predictors (Age and Gender) is the best option, and it thereby confirms our final minimal adequate model as the best minimal model. In addition, we check whether we need to penalize data points that have too strong an impact on the model fit. To see whether such a penalty is warranted, we apply the “pentrace” function to the final minimal adequate model.

pentrace(m2.lrm, seq(0, 0.8, by = 0.05)) # determine penalty
## 
## Best penalty:
## 
##  penalty    df
##      0.8 1.999
## 
##  penalty    df   aic   bic aic.c
##     0.00 2.000 864.2 847.9 864.2
##     0.05 2.000 864.2 847.9 864.2
##     0.10 2.000 864.2 847.9 864.2
##     0.15 2.000 864.2 847.9 864.2
##     0.20 2.000 864.2 847.9 864.2
##     0.25 2.000 864.2 847.9 864.2
##     0.30 2.000 864.2 847.9 864.2
##     0.35 2.000 864.2 847.9 864.2
##     0.40 2.000 864.2 847.9 864.2
##     0.45 2.000 864.2 847.9 864.2
##     0.50 2.000 864.2 847.9 864.2
##     0.55 1.999 864.2 847.9 864.2
##     0.60 1.999 864.2 847.9 864.2
##     0.65 1.999 864.2 847.9 864.2
##     0.70 1.999 864.2 847.9 864.2
##     0.75 1.999 864.2 847.9 864.2
##     0.80 1.999 864.2 847.9 864.2

The values are so similar that a penalty is unnecessary. In a next step, we rename the final models.

lr.glm <- m2.glm  # rename final minimal adequate glm model
lr.lrm <- m2.lrm  # rename final minimal adequate lrm model

Now, we calculate a Model Likelihood Ratio Test to check if the final model performs significantly better than the initial minimal base-line model. The result of this test is provided as a default if we call a summary of the lrm object.

modelChi <- lr.glm$null.deviance - lr.glm$deviance
chidf <- lr.glm$df.null - lr.glm$df.residual
chisq.prob <- 1 - pchisq(modelChi, chidf)
modelChi; chidf; chisq.prob
## [1] 868.2
## [1] 2
## [1] 0

The code above provides three values: a \(\chi^2\) value, the degrees of freedom, and a p-value. As the p-value is lower than .05, the results of the Model Likelihood Ratio Test confirm that the final minimal adequate model performs significantly better than the initial minimal base-line model.

A handier way to get these statistics is to perform an ANOVA comparing the final minimal adequate model to the minimal base-line model which, used this way, is identical to a Model Likelihood Ratio Test.

anova(m0.glm, lr.glm, test = "Chi") # Model Likelihood Ratio Test
## Analysis of Deviance Table
## 
## Model 1: EH ~ 1
## Model 2: EH ~ Age + Gender
##   Resid. Df Resid. Dev Df Deviance            Pr(>Chi)    
## 1     25820      33008                                    
## 2     25818      32140  2      868 <0.0000000000000002 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

In a next step, we calculate pseudo-R2 values, which represent the amount of residual variance that is explained by the final minimal adequate model. We cannot use the ordinary R2 because the model works on the log-likelihoods rather than on the values of the dependent variable.

# calculate pseudo R^2
# number of cases
ncases <- length(fitted(lr.glm))
R2.hl <- modelChi/lr.glm$null.deviance
R.cs <- 1 - exp ((lr.glm$deviance - lr.glm$null.deviance)/ncases)
R.n <- R.cs /( 1- ( exp (-(lr.glm$null.deviance/ ncases))))
# function for extracting pseudo-R^2
logisticPseudoR2s <- function(LogModel) {
  dev <- LogModel$deviance
    nullDev <- LogModel$null.deviance
    modelN <-  length(LogModel$fitted.values)
    R.l <-  1 -  dev / nullDev
    R.cs <- 1- exp ( -(nullDev - dev) / modelN)
    R.n <- R.cs / ( 1 - ( exp (-(nullDev / modelN))))
    cat("Pseudo R^2 for logistic regression\n")
    cat("Hosmer and Lemeshow R^2  ", round(R.l, 3), "\n")
    cat("Cox and Snell R^2        ", round(R.cs, 3), "\n")
    cat("Nagelkerke R^2           ", round(R.n, 3),    "\n") }
logisticPseudoR2s(lr.glm)
## Pseudo R^2 for logistic regression
## Hosmer and Lemeshow R^2   0.026 
## Cox and Snell R^2         0.033 
## Nagelkerke R^2            0.046

The low pseudo-R2 values show that our model has very low explanatory power as it only accounts for approximately 2.6 to 4.6 percent of the variance in the logged likelihoods (to get the percentages, you simply multiply the pseudo-R2 values by 100). Next, we extract the confidence intervals for the coefficients of the model.

# extract the confidence intervals for the coefficients
confint(lr.glm)
##               2.5 %  97.5 %
## (Intercept) -0.2761 -0.1888
## AgeOld      -0.8965 -0.7651
## GenderWomen -0.4735 -0.3667

Although the model has low explanatory and predictive power, the age and gender of speakers are significant predictors, as the confidence intervals of their coefficients do not overlap with 0.

4.5 Effect Size

In a next step, we compute odds ratios and their confidence intervals. Odds ratios represent a common measure of effect size and can be used to compare effect sizes across models. Odds ratios range between 0 and infinity; a value of 1 indicates that there is no effect, and the further away a value is from 1, the stronger the effect. Values below 1 indicate that the variable level correlates negatively with the occurrence of the outcome (the likelihood decreases) while values above 1 indicate a positive correlation with the occurrence of the outcome (here, the occurrence of EH).

exp(lr.glm$coefficients) # odds ratios
## (Intercept)      AgeOld GenderWomen 
##      0.7926      0.4358      0.6570
exp(confint(lr.glm))     # confidence intervals of the coefficients
##              2.5 % 97.5 %
## (Intercept) 0.7588 0.8280
## AgeOld      0.4080 0.4653
## GenderWomen 0.6228 0.6930

The odds ratios confirm that older speakers use eh significantly less often than younger speakers and that women use eh less frequently than men, as the confidence intervals of the odds ratios do not overlap with 1. In a next step, we calculate the prediction accuracy of the model.
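
The odds ratios and their confidence intervals can also be visualized. Below is a minimal sketch using the “sjPlot” package (installed during the session set up); by default, “plot_model” displays the exponentiated coefficients, i.e. the odds ratios, of a logistic model.

# a sketch: visualize odds ratios and their confidence intervals
sjPlot::plot_model(lr.glm, show.values = TRUE)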

4.6 Accuracy

In order to calculate the prediction accuracy of the model, we rearrange the data so that each row no longer represents a single speech unit but a single speaker, with the number of speech units with and without eh for each speaker. Thus, we transform the data into a per-speaker rather than a per-speech-unit format.

blrdata_byspeaker <- table(blrdata$ID, blrdata$EH)
blrdata_byspeaker <- data.frame(rownames(blrdata_byspeaker), blrdata_byspeaker[, 1], blrdata_byspeaker[, 2])
names(blrdata_byspeaker) <- c("ID", "NOEH", "EH")
rownames(blrdata_byspeaker) <- 1:length(blrdata_byspeaker[,1])

blrdata_byspeaker <- plyr::join(blrdata_byspeaker,  # join by-speaker data and biodata
                          blrdata, by = "ID", # join by ID
                          type = "left",      # only speakers for which bio data is provided
                          match = "first")    #
blrdata_byspeaker$EH <- NULL                  # remove EH column
First six rows of the by-speaker data.

ID            NOEH  Gender  Age    Ethnicity  EH
<S1A-001#F>     95  Women   Young  Pakeha      0
<S1A-001#M>     97  Men     Young  Pakeha      0
<S1A-002#B>     99  Women   Young  Pakeha      1
<S1A-002#Q>     86  Women   Young  Pakeha      0
<S1A-003#B>     58  Men     Young  Pakeha      0
<S1A-003#M>    119  Men     Young  Pakeha      1
# use the by-speaker data to fit another model which we will use to test the model's accuracy
lr.glm.spk <- glm(cbind(EH, NOEH) ~ Age*Gender + Ethnicity + Age:Ethnicity, data = blrdata_byspeaker, family = binomial)
correct <- sum(blrdata_byspeaker$EH * (predict(lr.glm.spk, type = "response") >= 0.5)) + sum(blrdata_byspeaker$NOEH * (predict(lr.glm.spk, type="response") < 0.5))
tot <- sum(blrdata_byspeaker$EH) + sum(blrdata_byspeaker$NOEH)
predict.acc <- (correct/tot)*100
predict.acc
## [1] 99.6

The model predicts 99.6 percent of cases accurately, which appears to be a satisfactory result. However, to evaluate this prediction accuracy, we need to compare it to the accuracy of the minimal base-line model.

# extract prediction accuracy
lr.glm.spk.base <- glm(cbind(EH, NOEH) ~ 1, data = blrdata_byspeaker, family = binomial)
correct.b <- sum(blrdata_byspeaker$EH * (predict(lr.glm.spk.base, type = "response") >= 0.5)) + sum(blrdata_byspeaker$NOEH * (predict(lr.glm.spk.base, type="response") < 0.5))
tot.b <- sum(blrdata_byspeaker$EH) + sum(blrdata_byspeaker$NOEH)
predict.acc.base <- (correct.b/tot.b)*100
# inspect prediction accuracy
predict.acc.base
## [1] 99.6

Both the final minimal adequate model and the minimal base-line model have the same prediction accuracy. This is interesting, and we need to determine why. We extract the predictions of both models to find out why they are identical.

# compare predictions of final and base-line model
which(lr.glm.spk$fitted > .5)
## named integer(0)
which(lr.glm.spk.base$fitted > .5)
## named integer(0)

The reason why both models arrive at the same predictions is that both models always predict the absence of EH.

# create a variable that contains the predictions of the model
blrdata$Prediction <- predict(lr.glm, blrdata, type = "response")
blrdata$Prediction <- ifelse(blrdata$Prediction > .5, 1, 0)
# convert predicted and observed into factors with the same levels
blrdata$Prediction <- factor(blrdata$Prediction, levels = c("0", "1"))
blrdata$EH <- factor(blrdata$EH, levels = c("0", "1"))
# create a confusion matrix that compares observed against predicted values
caret::confusionMatrix(blrdata$Prediction, blrdata$EH)
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction     0     1
##          0 17114  8707
##          1     0     0
##                                              
##                Accuracy : 0.663              
##                  95% CI : (0.657, 0.669)     
##     No Information Rate : 0.663              
##     P-Value [Acc > NIR] : 0.503              
##                                              
##                   Kappa : 0                  
##                                              
##  Mcnemar's Test P-Value : <0.0000000000000002
##                                              
##             Sensitivity : 1.000              
##             Specificity : 0.000              
##          Pos Pred Value : 0.663              
##          Neg Pred Value :   NaN              
##              Prevalence : 0.663              
##          Detection Rate : 0.663              
##    Detection Prevalence : 1.000              
##       Balanced Accuracy : 0.500              
##                                              
##        'Positive' Class : 0                  
## 

We can now plot the effects using the “visreg” package (see Breheny and Burchett 2013).

# create plot
par(mfrow = c(1, 2))
visreg(lr.glm, "Age", xlab = "Age",
       ylab = "Logged Odds (EH)",
       ylim = c(-3, 0))
visreg(lr.glm, "Gender", xlab = "Gender",
       ylab = "Logged Odds (EH)",
       ylim = c(-3, 0)); par(mfrow = c(1, 1))

A more intuitive way to visualize the results is to plot the predicted values against the observed values.

# extract predicted probabilities
blrdata$Predicted <- predict(lr.glm, blrdata, type = "response")
# plot
ggplot(blrdata, aes(Age, Predicted, color = Gender)) +
  stat_summary(fun.y = mean, geom = "point") +
  stat_summary(fun.data = mean_cl_boot, geom = "errorbar", width = 0.2) +
  theme_set(theme_bw(base_size = 10)) +
  theme(legend.position = "top") +
    ylim(0, .75) +
  labs(x = "", y = "Predicted Probabilty of eh") +
  scale_color_manual(values = c("gray20", "gray70"))

4.7 Model Diagnostics

We are now in a position to perform model diagnostics and test if the model violates distributional requirements. In a first step, we test for the existence of multicollinearity.

4.7.1 Multicollinearity

To check whether the final minimal model contains predictors that correlate with each other, we extract variance inflation factors (VIFs). If a model contains predictors with variance inflation factors (VIFs) > 10, the model is unreliable and one cannot claim that multicollinearity is absent (Myers 1990). Predictors causing such VIFs should be removed. Indeed, predictors with VIF values greater than 4 are usually already problematic but, for large data sets, even VIFs greater than 2 can lead to inflated standard errors (Jaeger 2013: http://wiki.bcs.rochester.edu/HlpLab/LSA2013Regression?action=AttachFile&do=view&target=LSA13-Lecture6-CommonIssuesAndSolutions.pdf). Also, VIFs of 2.5 can be problematic (Szmrecsanyi 2006, 215), and Zuur, Ieno, and Elphick (2010) propose that variables with VIFs exceeding 3 should be removed.

vif(lr.glm)
##      AgeOld GenderWomen 
##       1.005       1.005

In addition, predictors with 1/VIF (tolerance) values \(<\) .1 must be removed, and values below .2 are already considered problematic (Menard 1995). Moreover, the mean VIF should not be substantially greater than 1 (Bowerman and O’Connell 1990).
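
As a quick check, we can also inspect the tolerance values (1/VIF) directly - a minimal sketch:

1/vif(lr.glm) # tolerance = 1/VIF; both values are close to 1 and thus unproblematic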

mean(vif(lr.glm))
## [1] 1.005

4.7.2 Outlier detection

In order to detect potential outliers, we will calculate diagnostic parameters and add these to our data set.

infl <- influence.measures(lr.glm) # calculate influence statistics
blrdata <- data.frame(blrdata, infl[[1]], infl[[2]]) # add influence statistics

In a next step, we use these diagnostic parameters to check whether there are data points that unduly affect the model fit and should therefore be removed, as sketched below.
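
A minimal sketch of such a check is shown below; it flags data points whose Cook’s distance exceeds the conventional 4/n cut-off (this cut-off is a common rule of thumb, not part of the original analysis).

# flag data points with a Cook's distance above 4/n
cooks <- cooks.distance(lr.glm)
which(cooks > 4/length(cooks))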

4.7.3 Sample Size

We now check whether the sample size is sufficient for our analysis (Green 1991):
* if you are interested in the overall model: 50 + 8k (k = number of predictors)
* if you are interested in individual predictors: 104 + k
* if you are interested in both: take the higher value!

# function to evaluate sample size
smplesz <- function(x) {
  ifelse((length(x$fitted) < (104 + nrow(summary(x)$coefficients)-1)) == TRUE,
    return(
      paste("Sample too small: please increase your sample by ",
      104 + nrow(summary(x)$coefficients)-1 - length(x$fitted),
      " data points", collapse = "")),
    return("Sample size sufficient")) }
# apply function to model
smplesz(lr.glm)
## [1] "Sample size sufficient"

According to the rule of thumb provided in Green (1991), the sample size is sufficient for our analysis.

4.8 Summarizing Results

As a final step, we summarize our findings in tabulated form.

blrsummary <- blrsummary(lr.glm, lr.lrm, predict.acc) # summarize regression analysis
blrsummary
Summary of the final minimal adequate binomial logistic fixed-effects regression model which was fitted to predictors of eh in New Zealand English.

              Estimate  VIF  OddsRatio  CI(2.5%)  CI(97.5%)  Std. Error  z value  Pr(>|z|)  Significance
(Intercept)      -0.23          0.79      0.76       0.83        0.02     -10.44         0  p < .001***
AgeOld           -0.83    1     0.44      0.41       0.47        0.03     -24.78         0  p < .001***
GenderWomen      -0.42    1     0.66      0.62       0.69        0.03     -15.42         0  p < .001***

Model statistics                 Value
Number of cases in model         25821
Observed misses (0)              17114
Observed successes (1)            8707
Null deviance                 33007.75
Residual deviance             32139.54
R2 (Nagelkerke)                  0.046
R2 (Hosmer & Lemeshow)           0.026
R2 (Cox & Snell)                 0.033
C                                0.602
Somers’ Dxy                      0.203
AIC                           32145.54
Prediction accuracy              99.6%
Model Likelihood Ratio Test      Model L.R.: 868.21, df: 2, p-value: 0, sig: p < .001***

R2 (Hosmer & Lemeshow)

Hosmer and Lemeshow’s R2 “is the proportional reduction in the absolute value of the log-likelihood measure and as such it is a measure of how much the badness of fit improves as a result of the inclusion of the predictor variables. It can vary between 0 (indicating that the predictors are useless at predicting the outcome variable) and 1 (indicating that the model predicts the outcome variable perfectly)” (Field, Miles, and Field 2012, 317).

R2 (Cox & Snell)

"Cox and Snell’s R2 (1989) is based on the deviance of the model (-2LL(new») and the deviance of the baseline model (-2LL(baseline), and the sample size, n […]. However, this statistic never reaches its theoretical maximum of 1.

R2 (Nagelkerke)

Since R2 (Cox & Snell) never reaches its theoretical maximum of 1, Nagelkerke (1991) suggested Nagelkerke’s R2 (Field, Miles, and Field 2012, 317–18).

Somers’ Dxy

Somers’ Dxy is a rank correlation between predicted probabilities and observed responses and ranges between 0 (randomness) and 1 (perfect prediction). Somers’ Dxy should have a value higher than .5 for the model to be meaningful (Baayen 2008, 204).

C

C is an index of concordance between the predicted probability and the observed response. When C takes the value 0.5, the predictions are random; when it is 1, prediction is perfect. A value above 0.8 indicates that the model may have some real predictive capacity (Baayen 2008, 204).

Akaike information criteria (AIC)

Akaike information criteria (AIC = -2LL + 2k) provide a value that reflects a ratio between the number of predictors in the model and the variance that is explained by these predictors. Changes in AIC can serve as a measure of whether the inclusion of a variable leads to a significant increase in the amount of variance that is explained by the model. “You can think of this as the price you pay for something: you get a better value of R2, but you pay a higher price, and was that higher price worth it? These information criteria help you to decide. The BIC is the same as the AIC but adjusts the penalty included in the AIC (i.e., 2k) by the number of cases: BIC = -2LL + 2k x log(n) in which n is the number of cases in the model” (Field, Miles, and Field 2012, 318).
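
For our final model, both criteria can be extracted directly with base R (a quick sketch; the AIC value matches the one reported in the summary table above):

AIC(lr.glm) # Akaike information criterion
BIC(lr.glm) # Bayesian information criterion (heavier, sample-size-dependent penalty)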

5 Ordinal Regression

Ordinal regression is very similar to multiple linear regression but takes an ordinal dependent variable (Agresti 2010). For this reason, ordinal regression is one of the key methods in analysing Likert data.

To see how an ordinal regression is implemented in R, we load and inspect the “ordinaldata” data set. The data set consists of 400 observations of students that were either educated at this school (Internal = 1) or not (Internal = 0). Some of the students have been abroad (Exchange = 1) while others have not (Exchange = 0). In addition, the data set contains the students’ final score on a language test (FinalScore) and the dependent variable, which is the recommendation of a committee for an additional, very prestigious program. The recommendation has three levels (“very likely”, “somewhat likely”, and “unlikely”) and reflects the committee’s assessment of whether the student is likely to succeed in the program.

# load data
ordata <- read.delim("https://slcladal.github.io/data/ordinaldata.txt", sep = "\t", header = T)
colnames(ordata) <- c("Recommend", "Internal", "Exchange", "FinalScore")
# inspect data
str(ordata)
## 'data.frame':    400 obs. of  4 variables:
##  $ Recommend : chr  "very likely" "somewhat likely" "unlikely" "somewhat likely" ...
##  $ Internal  : int  0 1 1 0 0 0 0 0 0 1 ...
##  $ Exchange  : int  0 0 1 0 0 1 0 0 0 0 ...
##  $ FinalScore: num  3.26 3.21 3.94 2.81 2.53 ...

In a first step, we need to relevel the ordinal variable so that it represents an ordinal factor, i.e. a progression from “unlikely” via “somewhat likely” to “very likely”. We will also factorize Internal and Exchange to make the output easier to interpret later on.

# relevel data
ordata <- ordata %>%
dplyr::mutate(Recommend = factor(Recommend, 
                           levels=c("unlikely", "somewhat likely", "very likely"),
                           labels=c("unlikely",  "somewhat likely",  "very likely"))) %>%
  dplyr::mutate(Exchange = ifelse(Exchange == 1, "Exchange", "NoExchange")) %>%
  dplyr::mutate(Internal = ifelse(Internal == 1, "Internal", "External"))

Now that the dependent variable is releveled, we check the distribution of the variable levels by tabulating the data. To get a better understanding of the data we create frequency tables across variables rather than viewing the variables in isolation.

## three way cross tabs (xtabs) and flatten the table
ftable(xtabs(~ Exchange + Recommend + Internal, data = ordata))
##                            Internal External Internal
## Exchange   Recommend                                 
## Exchange   unlikely                       25        6
##            somewhat likely                12        4
##            very likely                     7        3
## NoExchange unlikely                      175       14
##            somewhat likely                98       26
##            very likely                    20       10

We also check the mean and standard deviation of the final score as final score is a numeric variable and cannot be tabulated (unless we convert it to a factor).

summary(ordata$FinalScore); sd(ordata$FinalScore)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    1.90    2.72    2.99    3.00    3.27    4.00
## [1] 0.3979

The lowest score is 1.9 and the highest score is a 4.0 with a mean of approximately 3. Finally, we inspect the distributions graphically.

# visualize data
ggplot(ordata, aes(x = Recommend, y = FinalScore)) +
  geom_boxplot(size = .75) +
  geom_jitter(alpha = .5) +
  facet_grid(Exchange ~ Internal, margins = TRUE) +
  theme(axis.text.x = element_text(angle = 45, hjust = 1, vjust = 1))

We see that only few students have taken part in an exchange program and that there are only few internal students overall. With respect to recommendations, only few students are considered very likely to succeed in the program. We can now start with the modelling, using the “polr” function. To make things easier, we will only consider the main effects here, as this tutorial merely aims to show how an ordinal regression is implemented, not how it should be done in a proper study - there, the model fitting and diagnostic procedures would, of course, have to be carried out rigorously.

## fit ordered logit model and store results 'm'
m <- polr(Recommend ~ Internal + Exchange + FinalScore, data = ordata, Hess=TRUE)
## view a summary of the model
summary(m)
## Call:
## polr(formula = Recommend ~ Internal + Exchange + FinalScore, 
##     data = ordata, Hess = TRUE)
## 
## Coefficients:
##                     Value Std. Error t value
## InternalInternal   1.0477      0.266   3.942
## ExchangeNoExchange 0.0587      0.298   0.197
## FinalScore         0.6157      0.261   2.363
## 
## Intercepts:
##                             Value Std. Error t value
## unlikely|somewhat likely    2.262 0.882      2.564  
## somewhat likely|very likely 4.357 0.904      4.818  
## 
## Residual Deviance: 717.02 
## AIC: 727.02

The results show that having studied here at this school increases the chances of receiving a positive recommendation, whereas having been on an exchange has a negative but (as the p-values below confirm) insignificant effect on the recommendation. The final score also correlates positively with a positive recommendation, but not as strongly as having studied here.

## store table
(ctable <- coef(summary(m)))
##                               Value Std. Error t value
## InternalInternal            1.04766     0.2658   3.942
## ExchangeNoExchange          0.05868     0.2979   0.197
## FinalScore                  0.61574     0.2606   2.363
## unlikely|somewhat likely    2.26200     0.8822   2.564
## somewhat likely|very likely 4.35744     0.9045   4.818

As the regression report does not provide p-values, we have to calculate them separately (after having calculated them, we add them to the coefficient table).

## calculate and store p values
p <- pnorm(abs(ctable[, "t value"]), lower.tail = FALSE) * 2
## combined table
(ctable <- cbind(ctable, "p value" = p))
##                               Value Std. Error t value     p value
## InternalInternal            1.04766     0.2658   3.942 0.000080902
## ExchangeNoExchange          0.05868     0.2979   0.197 0.843819939
## FinalScore                  0.61574     0.2606   2.363 0.018151727
## unlikely|somewhat likely    2.26200     0.8822   2.564 0.010343823
## somewhat likely|very likely 4.35744     0.9045   4.818 0.000001452

As predicted, Exchange does not have a significant effect but FinalScore and Internal both correlate significantly with the likelihood of receiving a positive recommendation.

# extract profiled confidence intervals
ci <- confint(m)
# calculate odds ratios and combine them with profiled CIs
exp(cbind(OR = coef(m), ci))
##                       OR 2.5 % 97.5 %
## InternalInternal   2.851 1.696  4.817
## ExchangeNoExchange 1.060 0.595  1.920
## FinalScore         1.851 1.114  3.098

The odds ratios show that the odds of internal students receiving a positive evaluation are 2.85 times (i.e., 285 percent of) those of external students, and that a 1-point increase in the final score multiplies the odds of a positive recommendation by 1.85 (an 85 percent increase). The effect of an exchange is slightly negative but, as we have seen above, not significant.
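
To make these effects more tangible, we can generate predicted probabilities for a few illustrative student profiles - a sketch in which the profiles (and the name “newstudents”) are made up purely for illustration:

# predicted probabilities for four made-up student profiles
newstudents <- data.frame(Internal = c("Internal", "External", "Internal", "External"),
                          Exchange = "NoExchange",
                          FinalScore = c(3.5, 3.5, 2.5, 2.5))
round(predict(m, newstudents, type = "probs"), 3)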

6 Poisson Regression

Poisson regressions are used to analyse data in which the dependent variable represents counts - in particular, counts of events observed in set intervals, for instance the number of pauses in two-minute-long conversations. Poisson regressions are particularly appealing when dealing with rare events, i.e. when something occurs only very infrequently. In such cases, ordinary linear regressions do not work well because the instances that do occur are automatically treated as outliers. It is therefore useful to check whether the data conform to a Poisson distribution.

However, the tricky thing about Poisson regressions is that the data has to conform to the Poisson distribution which, in my experience, is unfortunately rarely the case. The Gaussian normal distribution is very flexible because it is defined by two parameters, the mean (mu, i.e. \(\mu\)) and the standard deviation (sigma, i.e. \(\sigma\)), which allows it to take very different shapes (for example, very high and slim (compressed) or very wide and flat). The Poisson distribution, in contrast, is defined by only one parameter (lambda, i.e. \(\lambda\)), which is both its mean and its variance (\(\lambda\) = \(\mu\) = \(\sigma^2\)): if the mean is 2, the variance is also 2. This makes the Poisson distribution very rigid and much trickier for natural data.
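
The sketch below visualizes this rigidity by overlaying Poisson probability mass functions for increasing \(\lambda\) values (the values are chosen purely for illustration).

# plot Poisson probability mass functions for different lambda values
x <- 0:20
plot(x, dpois(x, lambda = 1), type = "b", pch = 16,
     xlab = "Count", ylab = "Probability")
lines(x, dpois(x, lambda = 4), type = "b", pch = 17)
lines(x, dpois(x, lambda = 10), type = "b", pch = 18)
legend("topright", legend = c("lambda = 1", "lambda = 4", "lambda = 10"), pch = 16:18)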

As we can see, as \(\lambda\) takes on higher values, the distribution becomes wider and flatter - a compressed distribution with a high mean can therefore not be Poisson-distributed. We will now start by loading the data.

# load data
poissondata <- read.delim("https://slcladal.github.io/data/posdata.txt", sep = "\t", header = T, skipNul = T, quote = "")
# inspect data
summary(poissondata)
##        Id            Pauses       Language            Alcohol    
##  Min.   :  1.0   Min.   :0.00   Length:200         Min.   :33.0  
##  1st Qu.: 50.8   1st Qu.:0.00   Class :character   1st Qu.:45.0  
##  Median :100.5   Median :0.00   Mode  :character   Median :52.0  
##  Mean   :100.5   Mean   :0.63                      Mean   :52.6  
##  3rd Qu.:150.2   3rd Qu.:1.00                      3rd Qu.:59.0  
##  Max.   :200.0   Max.   :6.00                      Max.   :75.0

We will clean the data by factorizing Id which is currently considered a numeric variable rather than a factor.

# process data
poissondata <- poissondata %>%
  mutate(Id = factor(Id, levels = 1:200, labels = 1:200))
# inspect data
summary(poissondata); str(poissondata)
##        Id          Pauses       Language            Alcohol    
##  1      :  1   Min.   :0.00   Length:200         Min.   :33.0  
##  2      :  1   1st Qu.:0.00   Class :character   1st Qu.:45.0  
##  3      :  1   Median :0.00   Mode  :character   Median :52.0  
##  4      :  1   Mean   :0.63                      Mean   :52.6  
##  5      :  1   3rd Qu.:1.00                      3rd Qu.:59.0  
##  6      :  1   Max.   :6.00                      Max.   :75.0  
##  (Other):194
## 'data.frame':    200 obs. of  4 variables:
##  $ Id      : Factor w/ 200 levels "1","2","3","4",..: 45 108 15 67 153 51 164 133 2 53 ...
##  $ Pauses  : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ Language: chr  "German" "Russian" "German" "German" ...
##  $ Alcohol : int  41 41 44 42 40 42 46 40 33 46 ...

First, we check if the conditions for a Poisson regression are met.

# check the goodness of fit of a Poisson distribution
gf = goodfit(poissondata$Pauses,type= "poisson",method= "ML")
summary(gf)
## 
##   Goodness-of-fit test for poisson distribution
## 
##                    X^2 df    P(> X^2)
## Likelihood Ratio 33.01  5 0.000003742

If the p-value is smaller than .05, the data is not Poisson distributed, which means that it differs significantly from a Poisson distribution and is very likely over-dispersed. We will check the divergence from a Poisson distribution visually by plotting the observed counts against the counts we would expect if the data were Poisson distributed.

plot(gf,main="Count data vs Poisson distribution")

Although the goodfit function reported that the data differs significantly from the Poisson distribution, the fit is rather good. We can use an additional Levene’s test to check if variance homogeneity is given.

# check homogeneity
leveneTest(poissondata$Pauses, poissondata$Language, center = mean)
## Levene's Test for Homogeneity of Variance (center = mean)
##        Df F value     Pr(>F)    
## group   2    17.1 0.00000014 ***
##       197                       
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

The Levene’s test indicates that variance homogeneity is also violated. Since both the approximation to a Poisson distribution and variance homogeneity are violated, we should switch either to a quasi-Poisson model or a negative binomial model. However, as we are only interested in how to implement a Poisson model here, we continue despite the fact that this would not be recommended if we were actually interested in accurate results based on a reliable model.
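
For completeness, here is a minimal sketch of how the two recommended alternatives could be fitted (the object names are chosen here for illustration; the models are not evaluated further):

# quasi-Poisson model: estimates a dispersion parameter instead of fixing it at 1
m1.quasi <- glm(Pauses ~ Language + Alcohol, family = "quasipoisson", data = poissondata)
# negative binomial model: adds a parameter that accommodates over-dispersion
m1.nb <- MASS::glm.nb(Pauses ~ Language + Alcohol, data = poissondata)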

In a next step, we summarize the number of pauses across languages by inspecting the means and standard deviations for the individual variable levels.

# extract means and standard deviations
with(poissondata, tapply(Pauses, Language, function(x) {
  sprintf("M (SD) = %1.2f (%1.2f)", mean(x), sd(x))
}))
##                English                 German                Russian 
## "M (SD) = 1.00 (1.28)" "M (SD) = 0.24 (0.52)" "M (SD) = 0.20 (0.40)"

Now, we visualize the data.

# plot data
ggplot(poissondata, aes(Pauses, fill = Language)) +
  geom_histogram(binwidth=.5, position="dodge") +
  scale_fill_manual(values=c("gray30", "gray50", "gray70"))

# fit a poisson regression
m1.poisson <- glm(Pauses ~ Language + Alcohol, family="poisson", data=poissondata)
# inspect model
summary(m1.poisson)
## 
## Call:
## glm(formula = Pauses ~ Language + Alcohol, family = "poisson", 
##     data = poissondata)
## 
## Deviance Residuals: 
##    Min      1Q  Median      3Q     Max  
## -2.204  -0.844  -0.511   0.256   2.680  
## 
## Coefficients:
##                 Estimate Std. Error z value       Pr(>|z|)    
## (Intercept)      -4.1633     0.6629   -6.28 0.000000000337 ***
## LanguageGerman   -0.7140     0.3200   -2.23         0.0257 *  
## LanguageRussian  -1.0839     0.3583   -3.03         0.0025 ** 
## Alcohol           0.0702     0.0106    6.62 0.000000000036 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for poisson family taken to be 1)
## 
##     Null deviance: 287.67  on 199  degrees of freedom
## Residual deviance: 189.45  on 196  degrees of freedom
## AIC: 373.5
## 
## Number of Fisher Scoring iterations: 6

In addition to the estimates for the coefficients, we can also calculate robust standard errors and the corresponding confidence intervals for the coefficients (LL stands for lower limit and UL for upper limit in the table below).

cov.m1 <- vcovHC(m1.poisson, type="HC0")
std.err <- sqrt(diag(cov.m1))
r.est <- cbind(Estimate= coef(m1.poisson), "Robust SE" = std.err,
"Pr(>|z|)" = 2 * pnorm(abs(coef(m1.poisson)/std.err), lower.tail=FALSE),
LL = coef(m1.poisson) - 1.96 * std.err,
UL = coef(m1.poisson) + 1.96 * std.err)
# inspect data
r.est
##                 Estimate Robust SE         Pr(>|z|)      LL       UL
## (Intercept)     -4.16327   0.64809 0.00000000013286 -5.4335 -2.89300
## LanguageGerman  -0.71405   0.29864 0.01680312040112 -1.2994 -0.12871
## LanguageRussian -1.08386   0.32105 0.00073547448242 -1.7131 -0.45460
## Alcohol          0.07015   0.01044 0.00000000001784  0.0497  0.09061
with(m1.poisson, cbind(res.deviance = deviance, df = df.residual,
  p = pchisq(deviance, df.residual, lower.tail=FALSE)))
##      res.deviance  df      p
## [1,]        189.4 196 0.6182
## update m1 model dropping Language
m2.poisson <- update(m1.poisson, . ~ . - Language)
## test model differences with chi square test
anova(m2.poisson, m1.poisson, test="Chisq")
## Analysis of Deviance Table
## 
## Model 1: Pauses ~ Alcohol
## Model 2: Pauses ~ Language + Alcohol
##   Resid. Df Resid. Dev Df Deviance Pr(>Chi)    
## 1       198        204                         
## 2       196        189  2     14.6  0.00069 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
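# use the delta method (from the msm package) to obtain standard errors for the exponentiated coefficients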
s <- deltamethod(list(~ exp(x1), ~ exp(x2), ~ exp(x3), ~ exp(x4)), 
                                                coef(m1.poisson), cov.m1)

## exponentiate old estimates dropping the p values
rexp.est <- exp(r.est[, -3])
## replace SEs with estimates for exponentiated coefficients
rexp.est[, "Robust SE"] <- s

rexp.est
##                 Estimate Robust SE       LL      UL
## (Intercept)      0.01556   0.01008 0.004368 0.05541
## LanguageGerman   0.48966   0.14623 0.272698 0.87923
## LanguageRussian  0.33829   0.10861 0.180304 0.63470
## Alcohol          1.07267   0.01119 1.050955 1.09484
# extract predicted values
(s1 <- data.frame(Alcohol = mean(poissondata$Alcohol),
  Language = factor(1:3, levels = 1:3, labels = names(table(poissondata$Language)))))
##   Alcohol Language
## 1   52.65  English
## 2   52.65   German
## 3   52.65  Russian
predict(m1.poisson, s1, type="response", se.fit=TRUE)
## $fit
##      1      2      3 
## 0.6249 0.3060 0.2114 
## 
## $se.fit
##       1       2       3 
## 0.08628 0.08834 0.07050 
## 
## $residual.scale
## [1] 1
## calculate and store predicted values
poissondata$Predicted <- predict(m1.poisson, type="response")

## order by Language and then by Alcohol
poissondata <- poissondata[with(poissondata, order(Language, Alcohol)), ]
## create the plot
ggplot(poissondata, aes(x = Alcohol, y = Predicted, colour = Language)) +
  geom_point(aes(y = Pauses), alpha=.5, 
             position=position_jitter(h=.2)) +
  geom_line(size = 1) +
  labs(x = "Alcohol (ml)", y = "Expected number of pauses") +
  scale_color_manual(values=c("gray30", "gray50", "gray70"))

7 Robust Regression

Robust regressions are linear regressions with added weights (Rousseeuw and Leroy 2005) and are thus used when a linear model is unduly affected by outliers but the data points should not be removed.

# load data
robustdata <- read.delim("https://slcladal.github.io/data/mlrdata.txt", sep = "\t", header = T)
# inspect data
summary(robustdata)
##     status           attraction            money       
##  Length:100         Length:100         Min.   :  0.93  
##  Class :character   Class :character   1st Qu.: 49.84  
##  Mode  :character   Mode  :character   Median : 81.73  
##                                        Mean   : 88.38  
##                                        3rd Qu.:121.57  
##                                        Max.   :200.99

We now fit an ordinary linear model (although we know from the section on multiple regression that the interaction between status and attraction is significant, we will disregard it for now, as this helps to explain the weighting procedure which is the focus of this section).

# create model
slm <- lm(money ~ status+attraction, data = robustdata)
# inspect model
summary(slm)
## 
## Call:
## lm(formula = money ~ status + attraction, data = robustdata)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -60.87 -15.79  -2.61  13.89  59.94 
## 
## Coefficients:
##                         Estimate Std. Error t value             Pr(>|t|)    
## (Intercept)               114.95       4.29   26.80 < 0.0000000000000002 ***
## statusSingle               26.10       4.95    5.27           0.00000083 ***
## attractionNotInterested   -79.25       4.95  -16.00 < 0.0000000000000002 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 24.8 on 97 degrees of freedom
## Multiple R-squared:  0.745,  Adjusted R-squared:  0.74 
## F-statistic:  142 on 2 and 97 DF,  p-value: <0.0000000000000002

We now check whether the model is well fitted using diagnostic plots.

# create model diagnostic plots
opar <- par(mfrow = c(2,2), oma = c(0, 0, 1.1, 0))
plot(slm, las = 1)

par(opar)

The diagnostic plots indicate that there are three outliers in the data (data points 52, 83 and possibly 64). Therefore, we need to evaluate if the outliers severely affect the fit of the model.

robustdata[c(52, 64, 83), 1:2]
##    status    attraction
## 52 Single NotInterested
## 64 Single NotInterested
## 83 Single    Interested

We can now calculate Cook’s distance and the standardized residuals to check whether the potentially problematic data points have unacceptably high values (-2 < ok < 2).

CooksDistance <- cooks.distance(slm)
StandardizedResiduals <- stdres(slm)
a <- cbind(robustdata, CooksDistance, StandardizedResiduals)
a[CooksDistance > 4/100, ]
##          status    attraction  money CooksDistance StandardizedResiduals
## 1  Relationship NotInterested  86.33       0.04442                 2.076
## 52       Single NotInterested   0.93       0.06419                -2.495
## 65       Single NotInterested  12.12       0.04276                -2.037
## 67       Single NotInterested  13.28       0.04079                -1.989
## 83       Single    Interested 200.99       0.06224                 2.457
## 84       Single    Interested 193.69       0.04800                 2.158
## 88       Single    Interested 193.78       0.04817                 2.162

We will calculate the absolute values of the standardized residuals and order the table by them so that it is easier to check the values.

AbsoluteStandardizedResiduals <- abs(StandardizedResiduals)
a <- cbind(robustdata, CooksDistance, StandardizedResiduals, AbsoluteStandardizedResiduals)
asorted <- a[order(-AbsoluteStandardizedResiduals), ]
asorted[1:10, ]
##          status    attraction  money CooksDistance StandardizedResiduals
## 52       Single NotInterested   0.93       0.06419                -2.495
## 83       Single    Interested 200.99       0.06224                 2.457
## 88       Single    Interested 193.78       0.04817                 2.162
## 84       Single    Interested 193.69       0.04800                 2.158
## 1  Relationship NotInterested  86.33       0.04442                 2.076
## 65       Single NotInterested  12.12       0.04276                -2.037
## 67       Single NotInterested  13.28       0.04079                -1.989
## 78       Single    Interested 188.76       0.03943                 1.956
## 21 Relationship NotInterested  81.90       0.03698                 1.894
## 24 Relationship NotInterested  81.56       0.03644                 1.880
##    AbsoluteStandardizedResiduals
## 52                         2.495
## 83                         2.457
## 88                         2.162
## 84                         2.158
## 1                          2.076
## 65                         2.037
## 67                         1.989
## 78                         1.956
## 21                         1.894
## 24                         1.880

As Cook’s distance and the standardized residuals do have unacceptable values, we re-calculate the linear model as a robust regression and inspect the results.

# create robust regression model
rmodel <- rlm(money ~ status + attraction, data = robustdata)
# inspect model
summary(rmodel)
## 
## Call: rlm(formula = money ~ status + attraction, data = robustdata)
## Residuals:
##    Min     1Q Median     3Q    Max 
##  -61.3  -15.2   -1.3   14.4   62.4 
## 
## Coefficients:
##                         Value   Std. Error t value
## (Intercept)             112.927   4.276     26.409
## statusSingle             25.677   4.938      5.200
## attractionNotInterested -76.345   4.938    -15.462
## 
## Residual standard error: 22.5 on 97 degrees of freedom

We will briefly check the weights to better understand the weighting procedure. The idea of weighting is to down-weight data points that are too influential while not penalizing data points that fit well and are thus less influential. This means that the problematic data points should have lower weights than the other data points (the maximum weight is 1, so data points can only be made “lighter”).

hweights <- data.frame(status = robustdata$status, resid = rmodel$resid, weight = rmodel$w)
hweights2 <- hweights[order(rmodel$w), ]
hweights2[1:15, ]
##          status  resid weight
## 83       Single  62.39 0.4860
## 52       Single -61.33 0.4944
## 88       Single  55.18 0.5495
## 84       Single  55.09 0.5504
## 78       Single  50.16 0.6045
## 65       Single -50.14 0.6047
## 1  Relationship  49.75 0.6094
## 67       Single -48.98 0.6190
## 21 Relationship  45.32 0.6690
## 24 Relationship  44.98 0.6741
## 39 Relationship -43.64 0.6948
## 79       Single  40.79 0.7434
## 58       Single -40.71 0.7447
## 89       Single  39.94 0.7592
## 95       Single  39.79 0.7621

The values of the weights support our assumption that those data points that were deemed too influential are made “lighter” as they now only have weights of .486 and .494 respectively.
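
As a quick side-by-side check (a sketch), we can also compare the ordinary and the robust coefficient estimates; the robust estimates are somewhat smaller because the influential data points have been down-weighted.

# compare ordinary and robust coefficient estimates
round(cbind(OLS = coef(slm), Robust = coef(rmodel)), 2)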

After inspecting the weights, we also want to extract the p-values for the predictors. The p-values have to be calculated separately using the “f.robftest” function from the “sfsmisc” library.

p_status <- f.robftest(rmodel, var = 2)
p_attraction <- f.robftest(rmodel, var = 3)
# inspect results
p_status; p_attraction
## 
##  robust F-test (as if non-random weights)
## 
## data:  from rlm(formula = money ~ status + attraction, data = robustdata)
## F = 27, p-value = 0.000001
## alternative hypothesis: true statusSingle is not equal to 0
## 
##  robust F-test (as if non-random weights)
## 
## data:  from rlm(formula = money ~ status + attraction, data = robustdata)
## F = 239, p-value <0.0000000000000002
## alternative hypothesis: true attractionNotInterested is not equal to 0

The output shows that both status and attraction are significant but, as we have seen above, the effect that really matters is the interaction between status and attraction. That interaction was, however, not the focus of this section, which merely served to introduce the concept of weights and how they can be used in the context of a robust linear regression.

References

Achen, Christopher H. 1982. Interpreting and Using Regression. Vol. 29. Sage.

Agresti, Alan. 1996. An Introduction to Categorical Data Analysis. Hoboken, NJ: John Wiley & Sons.

———. 2010. Analysis of Ordinal Categorical Data. Vol. 656. John Wiley & Sons.

Agresti, Alan, and Maria Kateri. 2011. Categorical Data Analysis. Springer.

Baayen, R Harald. 2008. Analyzing Linguistic Data. A Practical Introduction to Statistics Using R. Cambridge: Cambridge University Press.

Bortz, Jürgen. 2006. Statistik: Für Human-Und Sozialwissenschaftler. Springer-Verlag.

Bowerman, Bruce L, and Richard T O’Connell. 1990. Linear Statistical Models: An Applied Approach. Boston: PWS-Kent.

Breheny, Patrick, and Woodrow Burchett. 2013. “Visualization of Regression Models Using Visreg.” R package.

Crawley, Michael J. 2005. Statistics: An Introduction Using R. Chichester, West Sussex: John Wiley & Sons.

———. 2012. The R Book. John Wiley & Sons.

Faraway, Julian J. 2002. Practical Regression and Anova Using R. University of Bath.

Field, Andy, Jeremy Miles, and Zoe Field. 2012. Discovering Statistics Using R. Sage.

Green, Samuel B. 1991. “How Many Subjects Does It Take to Do a Regression Analysis.” Multivariate Behavioral Research 26 (3): 499–510.

Gries, Stefan Th. 2009. Statistics for Linguistics Using R: A Practical Introduction. Berlin & New York: Mouton de Gruyter.

Harrell Jr, Frank E. 2015. Regression Modeling Strategies: With Applications to Linear Models, Logistic and Ordinal Regression, and Survival Analysis. Springer.

Levshina, Natalia. 2015. How to Do Linguistics with R: Data Exploration and Statistical Analysis. Amsterdam: John Benjamins Publishing Company.

Menard, Scott. 1995. Applied Logistic Regression Analysis: Sage University Series on Quantitative Applications in the Social Sciences. Thousand Oaks, CA: Sage.

Myers, Raymond H. 1990. Classical and Modern Regression with Applications. Vol. 2. Duxbury Press Belmont, CA.

Rousseeuw, Peter J, and Annick M Leroy. 2005. Robust Regression and Outlier Detection. Vol. 589. John Wiley & Sons.

Szmrecsanyi, Benedikt. 2006. Morphosyntactic Persistence in Spoken English: A Corpus Study at the Intersection of Variationist Sociolinguistics, Psycholinguistics, and Discourse Analysis. Berlin & New York: Walter de Gruyter.

Wilcox, Rand R. 2009. Basic Statistics: Understanding Conventional Methods and Modern Insights. Oxford University Press.

Zuur, Alain F., Elena N. Ieno, and Chris S. Elphick. 2010. “A Protocol for Data Exploration to Avoid Common Statistical Problems.” Methods in Ecology and Evolution 1 (1): 3–14.