The General Linear Model

I’m not a fan of how introductory statistics is taught: in decision-tree cookbook fashion, where students have to memorize which analysis is most appropriate for which circumstance. I think a much better way to teach statistics (and a much better way to think about it) is to teach the general linear model (GLM). The GLM doesn’t care whether there is one predictor variable or several, whether the variables are quantitative or qualitative, or how many levels each variable has. All it cares about is which variable is the outcome and which are the predictors. (And the GLM can even handle multiple outcomes!)

Methodologists, as far as I know, have no “name” for an analysis with two categorical predictors and two quantitative predictors. Instead, we simply plug those variables into the equation:

\(Y = A + B + X + Z\)

where A and B are categorical and X and Z are numeric. The computer (and the mathematics) doesn’t care what we call the analysis, and it doesn’t require a decision tree (other than one that specifies which predictors are numeric and which are categorical).
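
In R, for instance, that whole “analysis” is a single call to lm(), which handles factors and numeric predictors alike. Here’s a minimal sketch (the data frame mydata and its variables are hypothetical):

  #### A and B are factors; X and Z are numeric.
  #### lm() doesn't need to know what the analysis is "called"
mod = lm(Y ~ A + B + X + Z, data=mydata)
summary(mod)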

Shouldn’t plotting be like that as well? Shouldn’t we just have to tell a computer our outcome variable (i.e., what’s on the Y axis) and what predictor variables we have? Then shouldn’t the computer figure out for us how to plot it?

That’s the idea behind flexplot. Flexplot is a common language for plotting where the user simply specifies what the outcome is and what the predictors are. The computer then decides the optimal way of plotting the variables. However, the user does have the flexibility to decide which variables go on the X axis, which are plotted as separate lines/symbols/colors, and which are paneled.

In this article, I’m going to show you how to use flexplot to graph the most common sorts of analyses in psychology.

Preliminaries

First, I’m going to load the dataset I’ll use throughout this post. I made these data up a few years ago, and they’re good for showing just about any sort of analysis you might be interested in. Essentially, they simulate a study where participants were randomly assigned to different therapies for weight loss (behaviorist versus cognitive therapy). The dataset also contains other variables, such as motivation scores and income.

So with that, let’s load the fifer and ggplot2 packages, as well as the exercise dataset:

require(fifer)
require(ggplot2)
  ### load the "exercise data" dataset
data("exercise_data")
  ### rename the exercise dataset (to make it easier to call objects within the dataset)
d = exercise_data

Independent T-Test/ANOVA

In this situation, we have a grouping variable (e.g., treatment versus control, male versus female, low/medium/high medication) and we want to see how scores on the dependent variable vary as a function of group. To do so, we can use what I call a “median dot plot” (or a mean dot plot, if you choose to report the mean instead of the median). The median is shown as a large red dot, along with the interquartile range. I prefer the nonparametric versions (i.e., medians/IQRs) to means/standard errors, just in case the data are not normally distributed. The scores for the rewards/no-rewards conditions have been “jittered,” which just means that noise has been added so the points don’t overlap as much.

flexplot(weight.loss~rewards, data=d)
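
If you want the numbers behind those red dots, a quick base-R check (not part of flexplot) will reproduce the same summaries:

  #### medians and IQRs behind the red dots
tapply(d$weight.loss, d$rewards, median)
tapply(d$weight.loss, d$rewards, IQR)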

Simple Regression

The clear “winner” for graphing simple regression is the scatterplot. One nuance I like to add (at least in the early stages of investigation) is to overlay a lowess line, rather than a regression line. A lowess line will “bend” with the data if it feels the need to bend, which prevents my eye from overlooking evidence of nonlinearity.

To do so in flexplot, one simply types:

  #### scatterplot with a lowess line
flexplot(weight.loss~health, data=d, se=F)

Notice the above graphic does bend. Good thing we plotted the lowess line!

ANCOVA

Before demonstrating how to plot an ANCOVA, it is necessary to begin with a little theoretical background. Recall that an ANCOVA attempts to determine whether groups (e.g., treatment versus control) differ on an outcome variable after controlling for a continuous covariate. This becomes a little tricky because we then have to represent graphically what it means to “control” for another variable. One way to do this is to run a simple regression between the outcome and the covariate, export the residuals, then plot the grouping variable on the X axis and the residuals on the Y axis. Unfortunately, this mucks up the scale of the Y axis (it will be centered at zero) and makes the plot harder to interpret. To fix that, I like to add the mean of the outcome back into the residuals. These plots are called “added variable plots” (though added variable plots don’t usually add the mean back in like I do).
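
If you wanted to build that plot by hand, it would look something like this (a sketch of the procedure just described; the object names are mine):

  #### regress the outcome on the covariate...
covariate.model = lm(weight.loss~motivation, data=d)
  #### ...add the mean of the outcome back into the residuals...
d$adjusted.loss = residuals(covariate.model) + mean(d$weight.loss)
  #### ...then plot the adjusted scores against the grouping variable
ggplot(d, aes(therapy.type, adjusted.loss)) + geom_jitter(width=.1)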

But you don’t have to do any of that by hand. Just use the added.plot function:

  #### added variable plot version of ANCOVA
added.plot(weight.loss~motivation + therapy.type, data=d)

Note that this function will control for the first listed variable (motivation in this case).
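
Following that rule, reversing the order should control for therapy.type instead and put motivation on the X axis:

  #### now controlling for therapy.type (per the rule above)
added.plot(weight.loss~therapy.type + motivation, data=d)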

One limitation of added variable plots is that they assume the covariate’s relationship with the outcome is linear and that the covariate does not interact with the grouping variable. To check whether that is the case, it’s best to plot the groups as separate lines and put the covariate on the X axis, as follows:

  #### different groups as separate lines
flexplot(weight.loss~motivation + therapy.type, data=d, se=F)

Or put the lines in separate panels:

  #### different groups in separate panels
flexplot(weight.loss~motivation | therapy.type, data=d)

Notice the notation: to panel a variable, put it to the right of a “|”. Otherwise, categorical variables show up as different lines.

In both cases, there’s a little “bendiness,” but not enough to worry about.

An important “take home” message from these three plots is this: if an analysis can be plotted multiple ways, plot it all ways. Each way provides a different “perspective” that may yield greater insight.

Factorial ANOVA

Recall that a factorial ANOVA has two categorical IVs rather than just one, which allows us to assess the strength of the interaction between the two variables. This is quite easy to do in flexplot:

  #### therapy.type in separate panels
flexplot(weight.loss~rewards | therapy.type, data=d)

It is common, when we have two categorical predictors, to plot “interaction plots” and visually inspect the degree to which the two lines deviate from parallel. I always favor plotting the raw data in addition to the numeric summaries (such as means/standard deviations). For this reason, I prefer to show the lines in two separate panels, which minimizes overlap.

You could also plot them in the same panel:

  #### different groups as separate lines
flexplot(weight.loss~rewards + therapy.type, data=d)

Multiple Regression

Plotting multiple regression gets a little tricky since we humans cannot see beyond 4 dimensions (if you count time as a dimension). Though we could use 3d plots, I find them hard to interpret. I prefer to panel variables as we did with the ANCOVA plot. As before, any variable to the right of “|” will show up in panels (though the function only allows up to two paneled variables). So, to visualize a three-way interaction, we could do the following:

  #### panel both income and satisfaction
flexplot(weight.loss~motivation | income + satisfaction, data=d)
## The following variables are going to be binned: income, satisfaction

In the background, flexplot will “bin” both income and satisfaction and put them in separate panels. Alas, cognitive load is high with these plots (which is why, in a moment, I’ll switch from lowess lines to regression lines). But there are tricks I have learned. If you follow the diagonals (in either direction) and the regression lines (or lowess lines) are not parallel, that provides evidence of an interaction. The exact nature of the interaction may require some mental gymnastics, and it’s important not to read too much into things (especially when the sample size per panel is low, as is the case here).

Let’s try to make the above graphic more interpretable by making fewer “bins” for the paneled variables, adding some nice labels, removing the standard error lines, and replacing the lowess lines with regression lines:

  #### fewer bins, custom labels, regression lines
flexplot(weight.loss~motivation | income + satisfaction, data=d, 
         bins=3, # fewer bins
         labels=list(c("<98K", "98K-102K", ">102K"), c("<4", "4-6", ">6")), # pretty labels
         se=F, # remove standard error lines
         method="lm") # plot regression rather than lowess line
## The following variables are going to be binned: income, satisfaction

That’s better. (And these people are making bank!)

It seems that, in general, weight loss increases with increasing motivation. There’s little evidence of an interaction: most lines are positive and have about the same slope. Any deviations from that pattern may just be noise.

When quantitative variables are binned and put into panels, it’s really tricky to see those paneled variables’ main effects, so it’s always a good idea to try putting other variables on the X axis:

  #### satisfaction on the X axis this time
flexplot(weight.loss~satisfaction | income + motivation, data=d, 
         bins=3, # fewer bins
         labels=list(c("<98K", "98K-102K", ">102K"), c("low", "mid", "high")), # pretty labels
         se=F, # remove standard error lines
         method="lm") # regression line
## The following variables are going to be binned: income, motivation

In general, weight loss and satisfaction are positively correlated. And here’s another perspective:

  #### income on the X axis this time
flexplot(weight.loss~income | motivation + satisfaction, data=d, 
         bins=3, # fewer bins
         labels=list(c("low", "mid", "high"), c("low", "mid", "high")), # pretty labels
         se=F, # remove standard error lines
         method="lm") # regression line
## The following variables are going to be binned: motivation, satisfaction

And it seems income has little to do with weight loss. Seems you can’t buy a flat stomach.

Mixed (Hierarchical) Models

Recall that with mixed models, there’s some “nesting” structure in the data: multiple timepoints within the same person, for example, or multiple students within schools. This sort of clustering of datapoints often violates the assumption of independence, and we must account for that. To do so, we use mixed models. The basic idea behind mixed models is that we (kinda sorta) fit a separate line for each cluster, then average those per-cluster lines into one “fixed effect” line.

That makes it tricky to graph, unfortunately. The way flexplot handles this is that it randomly samples clusters and plots the relationship of interest within these clusters. Suppose, for example, we are interested in how SES predicts Math Achievement, but we have “clustered” data because students are nested within schools. We can use flexplot to look at the relationship between SES and Math for a random sample of schools. But first, let’s load the math dataset:

data(math)
d = math

Then we simply fit our mixed model (I’m using lme4):

require(lme4)
  ### random/fixed effects for both the slope of SES and the intercept
mod = lmer(MathAch~SES + (SES | School), data=d)

Now, we can use the mixed.mod.visual function:

  #### plot a random sample of schools
mixed.mod.visual(MathAch~SES, data=d, n=6, model=mod)

The above image randomly samples 6 schools and illustrates the fit for each sampled school (light gray line), as well as the fixed effects line (in red).

If we weren’t satisfied with only six schools, we could sample another six using the same code (the schools chosen are random):

  #### resample another six schools
mixed.mod.visual(MathAch~SES, data=d, n=6, model=mod)
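
If you’d rather see the lines behind these plots as numbers, lme4 will report both the per-school estimates and the fixed effects (a quick sketch using the model we already fit):

  #### per-school intercepts and slopes (one row per school)...
head(coef(mod)$School)
  #### ...and the fixed effects line they are averaged into
fixef(mod)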

Generalized Linear Models

Generalized linear models are different from general linear models. General linear models assume normality, homoskedasticity, etc. Generalized linear models, on the other hand, assume different distributions, such as the Poisson, the binomial (i.e., logistic regression), or the negative binomial. This fact alone does not complicate the graphics of the model: regardless of whether one assumes the relationship is linear or nonlinear, the scatterplot (for example) remains unchanged. What does change is the fit of the model.

Previously, I’ve shown a lowess line and a regression line. Flexplot, however, can plot any function that generates predicted values. For example, suppose we’re plotting the relationship between mental illness and psychological distress from the NSDUH dataset:

data("nsduh")
d = nsduh
flexplot(distress~MI, data=d, sample=1000, alpha = .2, method="lm", jitter=T)

Here, I’m sampling 1,000 datapoints (the dataset contains over 40,000, which makes the graph a bit messy), making the points mostly transparent, and jittering them to minimize overlap. Notice that the regression line isn’t doing too well: it seems to be pulled by the high values of MI.

Let’s fit a generalized linear model (as well as a standard general linear model), using a Gamma distribution:

d$distress = d$distress + 1 #### make distress positive so GLM can work
model.linear = lm(distress~MI, data=d)
model.gamma = glm(distress~MI, data=d, family=Gamma)

Now we can visually compare the fits of both models using the compare.fits function:

compare.fits(distress~MI, data=d, model.linear, model.gamma, sample=1000, alpha=.2, jitter=T)

Identifying the “best” model would likely require more than a visual analysis, but this is a good (and necessary, IMO) start.
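
For example, since both models were fit to the same outcome, an information criterion offers a quick (if rough) numeric supplement to the plot:

  #### rough numeric comparison of the two fits
AIC(model.linear, model.gamma)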

Future Directions

The plots above cover probably 95% of the analyses I see in psychological research. They are not, however, comprehensive. At this point in time, I can’t think of a good way to visualize structural equation models, for example. Still, flexplot provides a good (and, IMO, necessary) starting point for visualizing data analysis.

