Female instructors should not get bonus points to correct for gender bias

A slew of research has come out in the last few years (and there’s more forthcoming from my collaborators and me) showing that these end-of-semester ratings that students give teachers, usually on a scale from 1 to 5 or so, are significantly biased against female professors. The obvious question is: if not student evaluations of teaching (SET), how should we evaluate instructors? I recently saw this article on Twitter.  It argues that “female faculty should receive an automatic correction” on their SET scores, meaning that the administration would add a fixed number to every female instructor’s score in order to make it comparable to male instructors’ scores. This adjusted score would be used to decide whether the instructor should be rehired to teach, be given tenure, etc.

I don’t believe this can be done, for a number of reasons. There are other biases and confounding variables besides gender that make it impossible to find a single number to add to every female instructor’s score.

  • Biases are not consistent across fields. For example, at Sciences Po in Paris, there is a greater proportion of female instructors in sociology than in economics, and the observed gender bias is smaller in sociology than in economics.  Any correction to SET would have to vary by subject matter.
  • Biases depend on student gender as well. Our research shows that in some schools, male students rate their male instructors significantly higher than their female instructors, while female students tend to rate them the same.  This is a problem for adjusting scores because the gender balance in the class will affect the instructor’s score. For instance, imagine a hypothetical male instructor who teaches two identical classes. On average, his male students give him a rating of 4.5 and his female students a rating of 4.  In the first class, the gender balance is 50/50, so the average rating is 4.25.  In the second class, there are 80 male students and 20 female students, so the average rating will be 4.4.  There’s no one magic number to add to or subtract from this average to cancel out the gender bias when comparing this score to the SET of other instructors.
  • There is some evidence that SET are biased by the instructor’s race and age as well.  We lack data on this, but similar work on bias in hiring decisions has shown that people (men and women alike) comparing identical resumes will tend to prefer job applicants with male, European-sounding names.  Anecdotally, instructors who have accents or are above average age (even as young as mid-thirties in some places!) fare worse on their SET.
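The arithmetic in the class-balance example above is easy to check. Here’s a minimal sketch (in Python, using the hypothetical 4.5/4.0 ratings from the example) showing that class composition alone moves the average score:

```python
# Hypothetical average ratings from the example above: male students
# give this instructor 4.5 on average, female students give 4.0.
MALE_RATING, FEMALE_RATING = 4.5, 4.0

def class_average(n_male, n_female):
    """Class-average SET score for a given gender balance."""
    total = n_male * MALE_RATING + n_female * FEMALE_RATING
    return total / (n_male + n_female)

print(class_average(50, 50))  # 50/50 class -> 4.25
print(class_average(80, 20))  # 80/20 class -> 4.4
```

Same instructor, same hypothetical per-group ratings, two different scores; no single additive correction fixes both classes at once.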

The list could go on — I’m sure there are a ton of other confounding variables, like time of day of the class, difficulty of the course material, etc., which affect how students tend to rate their instructors.  In order to find a correcting factor for each female instructor, you’d have to look at all of these variables and average them out.  In fact, you ought to do that for male instructors too, since gender isn’t the only bias.  This just highlights the fact that SET aren’t measuring teaching effectiveness in the first place; they’re a better measure of how comfortable or satisfied a student is in the class.

Admittedly, the title of this post sounds combative. But it’s not — of course something needs to be done about the pervasive gender bias that’s causing female faculty to lose teaching positions and costing them job promotions.  I’m merely arguing that it is impossible to effectively “correct” for gender bias, and so alternative, more objective means for evaluating teaching effectiveness should be used instead of SET.


Estimation vs. Inference

I went to Marti Anderson’s talk at UC BIDS yesterday, where she introduced a generalization of mixed-model ANOVA that uses a correction factor in the estimation to account for how random your random effects are.  She showed that if your random effects are actually close to fixed, then you gain statistical power by setting the correction factor close to 0.  This was all good until, after the talk, somebody asked a question to the effect of “well, I thought the big difference between fixed and random effect models was that your data don’t have to be normal for fixed effects, so how does that come into play here?”

The answer: it doesn’t.  Nowhere in her talk did Marti mention a normality assumption about the data.  P-values were computed using permutation tests, which make no assumptions about the underlying distribution of your data.  All you need to use a permutation test is an invariance or symmetry assumption about your data under the null hypothesis: for example, if biological sex had no relation to height, then the distributions of male and female heights would be identical, and if we relabeled people male or female at random, the height distributions wouldn’t change.
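To make the relabeling idea concrete, here is a minimal sketch of a permutation test for a difference in group means (a Python sketch; the function name and details are mine, not from the talk):

```python
import numpy as np

rng = np.random.default_rng(0)

def perm_test_mean_diff(x, y, n_perm=10_000):
    """Permutation test for a difference in group means.

    Under the null, the group labels are exchangeable: pool the data,
    relabel at random, and count how often the relabeled difference
    is at least as extreme as the observed one.
    """
    x, y = np.asarray(x), np.asarray(y)
    observed = abs(x.mean() - y.mean())
    pooled = np.concatenate([x, y])
    count = 0
    for _ in range(n_perm):
        shuffled = rng.permutation(pooled)
        diff = abs(shuffled[:len(x)].mean() - shuffled[len(x):].mean())
        count += diff >= observed
    return count / n_perm
```

If the two groups really do have the same distribution, this p-value is roughly uniform; a real shift between the groups pushes it toward 0. Nowhere do we assume normality, only exchangeability of the labels under the null.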

People often confuse the assumptions needed to estimate parameters from their models with the assumptions needed to create confidence intervals and do hypothesis tests.  The most salient example is linear regression.  Even without making any assumptions, you can always fit a line to your data.  That’s because ordinary least squares is just an estimation procedure: given covariates X and outcomes Y, all you need to do is solve a linear equation.  Does that mean your estimate is good?  Not necessarily; you need some assumptions to ensure that it is: that your measurement errors are on average 0, that they have the same variance (homoskedasticity), and that they’re uncorrelated.  Then your regression line will be unbiased and your coefficient estimates will be “best” in a statistical sense (this is the Gauss-Markov theorem).  Of course, for this to work you also need more data points than parameters, no (nearly) duplicated covariates, and a linear relationship between Y and X, but hopefully you aren’t trying to use ordinary least squares if these don’t hold for your data!
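The point that estimation needs no distributional assumptions is easy to see in code: fitting a line is just solving a linear system. A minimal sketch (in Python, though the post’s own code is in R):

```python
import numpy as np

def ols_fit(x, y):
    """Ordinary least squares with an intercept: just solve a linear
    system (the least-squares problem for the design matrix).
    No distributional assumptions are needed to compute this estimate.
    """
    X = np.column_stack([np.ones(len(x)), x])  # add intercept column
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    return beta  # [intercept, slope]
```

Whether that estimate is any *good*, and whether you can attach a confidence interval to it, is where the assumptions come in.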

Notice that no normality assumptions are needed for the Gauss-Markov theorem to hold; you just need to assume a few things about the distribution of your errors that can, to some extent, be checked.  Normal theory only comes into play if you want to use “nice,” well-known distributions to construct confidence intervals.  If you assume your errors are independent and identically distributed Gaussian mean-0 noise with common variance, then you can carry out F-tests for model selection and use the t-distribution to construct confidence intervals and to test whether your coefficients are significantly different from 0.

This is what usual statistical packages spit out at you when you run a linear regression, whether or not the normality assumptions hold.  The estimates of the parameters might be ok regardless, but your confidence intervals and p-values might be too small.  This is where permutation tests come in: you can still test your coefficients using a distribution-free test.

As an example, I’ve simulated data and done ordinary least squares in R, then compared the output of the lm function to the confidence intervals I construct by permutation tests.*  I’m just doing a simple univariate case so it’s easy to visualize.  The permutation test will test the null hypothesis that the regression coefficient is 0 by assuming there is no relationship between X and Y, so the (X, Y) pairs are matched as if at random.  I scramble the order of the Xs to eliminate any relationship between X and Y, then rerun the linear model.  Doing this 1000 times gives an approximate distribution of the regression coefficient if there were no association.
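The scrambling procedure looks roughly like this (a Python sketch of the idea; my actual simulation is in R, and the re-centering step used here to turn the null distribution into an interval is just one simple construction, not necessarily the one in my code):

```python
import numpy as np

rng = np.random.default_rng(1)

def slope(x, y):
    """OLS slope for a univariate regression with intercept."""
    return np.cov(x, y)[0, 1] / np.var(x, ddof=1)

def perm_slope_interval(x, y, n_perm=1000, alpha=0.05):
    """Approximate (1 - alpha) interval for the slope by permutation.

    Scrambling x breaks any association between X and Y, which gives
    the slope's distribution under the null of no relationship;
    re-centering its quantiles at the observed slope gives an interval.
    """
    observed = slope(x, y)
    null_slopes = np.array(
        [slope(rng.permutation(x), y) for _ in range(n_perm)]
    )
    lo, hi = np.quantile(null_slopes, [alpha / 2, 1 - alpha / 2])
    return observed + lo, observed + hi
```

Nothing here requires the errors to be normal, only that the pairing of X with Y is exchangeable under the null.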


I generated 100 random X values, then simulated errors in two ways.  First, I generated independent and identically distributed Gaussian errors e1 and let Y = 2X + e1. I ran a linear model and a permutation test to get the following 95% confidence intervals:

> confint(mod1,2) # t test
      2.5 %   97.5 %
X  1.917913 2.068841
> print(conf_perm1) # permutation test
1.598273 2.388481

When the normality assumptions are true, we get smaller confidence intervals using the t-distribution than using permutation tests. That makes sense – we lose efficiency when we ignore information we have.

Then I generated random double-exponential (heavy-tailed) errors e2 whose spread increases with X, so they break two of the assumptions needed for t-based inference (normality and homoskedasticity), and let Z = 2X + e2.  The least squares fit is very good, judging by the scatterplot of Z against X – the line goes right through the middle of the data.  There’s nothing wrong with the estimate of the coefficient, but the inference based on normal theory can’t be right.
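For concreteness, here is one way such errors could be generated (a Python sketch; this particular construction, Laplace noise whose scale grows with X, is my guess at a setup matching the description, not necessarily the one in my R code):

```python
import numpy as np

rng = np.random.default_rng(2)

# 100 X values, as in the simulation described above
x = rng.uniform(0, 10, size=100)

# Double-exponential (Laplace) errors whose scale grows with x:
# heavy tails break normality, and the x-dependent spread breaks
# homoskedasticity; these are the two t-test assumptions violated.
e2 = rng.laplace(loc=0.0, scale=0.2 * (1 + x))
z = 2 * x + e2
```

The least-squares slope estimated from (x, z) is still close to 2; it’s the normal-theory intervals around it that go wrong.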

> confint(mod2,2) # t test
      2.5 %   97.5 %
X  1.716465 2.348086
> print(conf_perm2) # permutation test
1.531020 2.533531

Again, the t-test confidence interval is smaller, but it is simply wrong.  It is artificially small because we made unfounded assumptions in using the t-based interval; perhaps this points to one reason why so many research findings are false.  For this very reason, it’s important to keep in mind what assumptions you’re making when you carry out statistical inference on estimates you obtain from your data – nonparametric tests may be the way to go.

* My simulation code is on Github.

Cognitive dissonance about research reform

An interesting editorial on research practices came out in PLOS Medicine yesterday.  It’s good to hear about reproducibility and the reforms we need to see in science from a fellow statistician, John Ioannidis over at Stanford.  Each discipline has its own quirks and accepted practices, but statistics is a common factor in every study.  I believe we statisticians have a unique perspective on the problem: we get to play the role of data advisor on other people’s studies and of PI on our own.

Ioannidis cites examples of things that work in several fields, including reproducibility practices, data registration, and stricter statistical analyses.  Then he proposes a new “structure of scientific careers” that doesn’t just favor old men with fancy titles and big grants.  In this framework,

 Resources and power are seen as opportunities, and researchers need to match their output to the opportunities that they have been offered—the more opportunities, the more the expected (replicated and, hopefully, even translated) output. Academic ranks have no value in this model and may even be eliminated: researchers simply have to maintain a non-negative balance of output versus opportunities. In this deliberately provocative scenario, investigators would be loath to obtain grants or become powerful (in the current sense), because this would be seen as a burden.

I got to this part of the article and thought, “Wait, this sounds crazy?”  It almost seems like there would be no incentive to work hard, like any award would come with some negative consequences and you’d be punished if your work didn’t produce results.  Isn’t that exactly what research reforms are trying to get around?  Maybe a greater emphasis on sharing negative results would get around this problem, but I digress.

After reading this the first time and feeling my knee-jerk disagreement, I took a step back and realized that my negative response is precisely due to my being immersed in the current culture of “publish or perish” and academic hierarchies.  I’m so entrenched in this way of thought that it’s hard to see other models for scientific careers.  However, I’m on Ioannidis’s side and I believe we need to seriously rethink the way research is done in order to have more high quality results.

Frankly my commentary on the subject is pretty useless because it’s a hard question and I’m no expert.  You should just go read the article here.