How I create fairer tests for my students

Have you ever had a professor say something along the lines of, “If I had the time, I’d just sit down and talk to each of you 1-on-1 to measure what you learned in this class. But I have to give you all a final exam instead”?

I heard a few of my professors express that sentiment. On one hand, it’s noble — test anxiety is a very real thing. But as I’ve gained more experience over the years as a teacher, I’ve come to view that perspective as a bit misguided.

I dislike it for the same reason I dislike most job interviews. Humans, despite their best intentions, will always have unconscious biases, and it’s very difficult to standardize and structure an interview such that each candidate has the exact same experience.

This is all a long-winded way of saying: I really like tests. They’re not a perfect form of assessment either, but with a little prep, they can be a valid and reliable way to measure knowledge, skills, and abilities.

In this tutorial, I’m going to share the methods I use to measure my students’ knowledge. Even if you’re not an educator, I think you might be able to find something useful to apply to your own line of work.

The assumption

There’s one major assumption I need to acknowledge before we jump into my methodology:

I assume that my tests are all measuring one single underlying construct: knowledge of course material.

The opposite of this would be a history professor who believes that her midterm measures two distinct factors (for instance):

  • Knowledge of the American Revolution
  • Knowledge of the Civil War

The professor in this example is assuming that knowledge of one period of history does not correlate with knowledge of another period.

In my experience, this just isn’t true. Whenever I perform a factor analysis on student responses, there’s clearly only one factor being measured.
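(If you want to run a quick version of that check yourself, here's a rough sketch — not a full factor analysis — that looks at the eigenvalues of the item correlation matrix. It assumes the 0/1 results dataframe we'll build later in this post, with students as rows and questions as columns.)

import numpy as np

# Rough one-factor check: if a single factor dominates, the first eigenvalue
# of the item correlation matrix should be much larger than the rest.
# Zero-variance questions are dropped to avoid NaNs in the correlation matrix.
items = results.loc[:, results.std() > 0]
eigenvalues = np.linalg.eigvalsh(items.corr())[::-1]   # largest first
print('Top eigenvalues:', np.round(eigenvalues[:5], 2))
print('Share of variance on first factor:', round(eigenvalues[0] / eigenvalues.sum(), 2))

If the first eigenvalue dwarfs the rest, a single factor is doing most of the work.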

Furthermore, we know that IQ is a single factor, and I recall reading that SAT math and verbal scores are moderately-to-highly correlated. (I can’t find the exact r anywhere, except for a few unsourced claims. Let me know if you find anything.)

So, we’ll continue on with the assumption that all test items are closely related.

Logistics

Now, a bit about my philosophy and the logistics of testing.

I write multiple-choice tests in Google Forms and download the results as a CSV. Because the tests are delivered online rather than in class, they're all untimed and open-note. Even so, I see enough variance in the scores that I'm no longer concerned the tests might be too easy. I still have to curve the scores at the end; the curve is just sometimes smaller than it would be for an in-class, closed-note test.

A big benefit of this format is that students don’t feel rushed, and it seems to significantly reduce testing anxiety. When my schedule permits, I lower the stakes even more by making the tests more frequent and worth fewer points. Formative assessment is a great tool.

Methodology

Okay, let’s talk about the statistics and Python! First let’s load the csv and remove the columns we don’t need:

import pandas as pd
import numpy as np

data = pd.read_csv('quiz_responses.csv')

# Drop the Google Forms metadata columns we don't need
to_drop = ['Unnamed', 'Timestamp', 'Email Address']
for d in to_drop:
    data = data[[i for i in data.columns if d not in i]]

The list comprehension is technically slower, but I prefer it because this code works even when the to_drop columns don’t exist.
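If you'd rather lean on the built-in pandas string methods, a roughly equivalent filter (same assumptions about the column names) looks like this:

# Keep only the columns whose names don't contain any unwanted substring.
# Like the loop above, this won't complain if a substring never matches.
pattern = '|'.join(to_drop)          # 'Unnamed|Timestamp|Email Address'
data = data.loc[:, ~data.columns.str.contains(pattern)]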

Each column represents a test question, so let’s shorten the column names and then set the index equal to the student’s name.

data.columns = [i[:60].strip() + '...' if len(i) > 60 
                else i for i in data.columns]

# Find the name column and set it as the index
name_col = [i for i in data.columns if 'last name' in i.lower()][0]
data.index = data[name_col]
del data[name_col]

Things get a little weird here. The next thing I do is transpose the dataframe with .T and identify the column that’s my answer key. After that, we can grade by comparing each student’s column to the answer key column.

# Transpose and get answer key
data = data.T
key = [i for i in data.columns if 'answer' in i.lower()][0]

# Create new df called `results` and grade each test
results = pd.DataFrame(index=data.index)
for i in data.drop(key, axis=1).columns:
    results[i] = np.where(data[i] == data[key], 1, 0)
    
results = results.T
totals = results.sum(axis=1)

This code will turn each student’s test into 1’s and 0’s. They get a 1 when they get an answer right, and a 0 when they get it wrong.

Then we transpose the results dataframe again so that the questions are columns again, and then we calculate each student’s total number correct with totals = results.sum(axis=1).

At this point we’ve essentially graded all the tests. We can now look at the raw scores before we start improving the test itself.

curve = 0
total_pts = 0
for i in results.columns:
    total_pts += results[i].max() # [1]

# Convert to percentages    
scores_array = totals/total_pts

# Calculate stats
print('Mean:  ', scores_array.mean())
print('Median:', scores_array.median())
print('SD:    ', round(scores_array.std(ddof=0), 3))
print()

# Print student scores
(totals.sort_index()/(total_pts)) + curve

[1]: By calculating total points this way, I can add additional code to weight some questions to be worth more than others. I won’t get too deep in that process here, but all you have to do is multiply a question column by a scalar.
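For example, if I wanted one question to be worth three points instead of one, a minimal sketch (with a made-up question name) would be:

# Hypothetical weighting: triple one question's value. Do this before
# computing `totals` and `total_pts` so both the numerator and the
# denominator pick up the extra points.
results['Which battle ended the war?...'] *= 3
totals = results.sum(axis=1)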

The code you’ve seen so far makes it really easy to just drop the .csv into my grading directory and run the Jupyter notebook. I could take it a step further and refactor the code into a class, but that really hasn’t been necessary thus far.

Once we run the code, we get a nice printout like this:

Mean:   0.671304347826087
Median: 0.7
SD:     0.179

# (Names are fake)
Adam Gates               0.92
Alexa Hurst              0.44
Amber Moon               0.52
Amy Wilkinson            0.48
Andrew Evans             0.68
Angela Bowen             0.82
Becky Williams           0.80
Bryan Young              0.78
Christina Fitzpatrick    0.18
Christopher Hunter       0.80
Corey Gibson             0.72
Eric Scott               0.72
Madeline Todd            0.80
Marc Martin              0.70
Mary Hernandez           0.70
Michael Kirk             0.62
Roberto Schwartz         0.80
Nathan Lewis             0.32
Robert Harrell           0.66
Ryan Hill                0.88
Scott Mcintosh           0.64
Terry White              0.86
Thomas Martinez          0.60
dtype: float64

…and that’s my grading process. Now let’s cover how I optimize my test for fairness.

Test item selection

I determine if a question is “fair” by examining the correlation between getting the question correct and a student’s overall score on the exam. This works because of our assumption that the test is only measuring one construct/factor.

There are more advanced statistical methods I could use here. Factor analysis is very appropriate (as I mentioned earlier), and Cronbach's alpha would be a good measure of reliability. With that said, correlations alone still give me great results and are the easiest thing to explain to students.
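If you do want that reliability number, Cronbach's alpha only takes a few lines to compute from the same results dataframe. This is just a sketch of the textbook formula, not something baked into my grading notebook:

def cronbach_alpha(items):
    # alpha = k/(k-1) * (1 - sum of item variances / variance of total scores)
    items = items.loc[:, items.std() > 0]   # drop zero-variance questions
    k = items.shape[1]
    item_variances = items.var(axis=0, ddof=1).sum()
    total_variance = items.sum(axis=1).var(ddof=1)
    return (k / (k - 1)) * (1 - item_variances / total_variance)

print('Cronbach alpha:', round(cronbach_alpha(results), 3))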

“Fair” isn’t necessarily the best word to use. The real goal is to see if the question distinguishes between students who know the material, and students who don’t. But, if you string enough of those questions together, and critically evaluate your test for cultural bias, then you can make a strong argument that your test is, in fact, “fair.”

So, how can these correlations inform our decisions?

  • A question that everyone gets right isn’t a good discriminator of knowledge. With no variance in the responses, its correlation with students’ overall scores can’t even be computed.
  • The same is true for a question everyone gets wrong.
  • A question that high performers get right and others get wrong is likely a good question.
  • A question that low performers get right and everyone else gets wrong is probably a bad question. (Actually, it’s more commonly a sign I made a mistake on my answer key!)

Here’s the code for calculating correlations:

from scipy.stats import pearsonr

correlations = []

# Correlate each question (1 = correct, 0 = incorrect) with total scores
for i in results.columns:
    corr, pval = pearsonr(results[i], totals)
    correlations.append((i, pval, round(corr, 3)))
    
correlations = pd.DataFrame(correlations, columns=['question', 'pvalue', 'r'])
correlations['absol'] = abs(correlations['r'])

Next, we can single out the questions that have low correlations with overall grade:

# Filters (subjective, but this is a good starting point)
low_correlation = correlations['r'] < .30
significant = correlations['pvalue'] < 0.1

# Print bad questions
bad_questions = correlations[low_correlation & significant]['question'].tolist()

if bad_questions:
    print('Bad questions:')
    for i in bad_questions:
        print(i)

correlations = correlations.dropna()
correlations.set_index('question', inplace=True)

And then we can use seaborn to make a graphical representation of it:

import matplotlib.pyplot as plt
import seaborn as sns

plt.figure(figsize=(3, len(correlations)//4))

sns.heatmap(correlations[['r']].sort_values('r', ascending=False), annot=True, cbar=False);

Success! The bright numbers with high scores appear to be good questions, while the darker numbers are questions I should consider tossing out. (Sometimes, however, this is a clue that I left students with misconceptions — which means that I should reconsider how I teach something.)

Restructuring the test

Now I know which questions were good measures of knowledge and which questions were bad ones.

Lately, I’ve taken to warning students that several questions will definitely be tossed out come grading time. This tempers their expectations and lets them know that their grade won’t simply be what percentage of questions they got right.

It’s like the SAT, I explain to them: new questions need to be tested to see if they deserve a permanent place in my test bank. In a way, I trade transparency for fairness, but I haven’t gotten any complaints yet!

Now, we can put together all the code we’ve seen to automate the question selection process and grade the tests:

# Grade tests
results = pd.DataFrame(index=data.index)
for i in data.drop(key, axis=1).columns:
    results[i] = np.where(data[i] == data[key], 1, 0)
    
results = results.T
totals = results.sum(axis=1)

results_copy = results.copy()


# Remove questions where everyone answered the same
# (their correlation with the total is undefined, so corrwith returns NaN)
question_corrs = results_copy.corrwith(totals)
everyone_answered_same = question_corrs[question_corrs.isnull()]

for i in everyone_answered_same.index:
    del results_copy[i]


# Sequentially remove questions that don't correlate with overall score
threshold = .30
while True:
    corrs = results_copy.corrwith(totals).sort_values()
    worst, question = corrs.iloc[0], corrs.index[0]
    if worst >= threshold:
        break                     # every remaining question clears the bar
    del results_copy[question]
    totals = results_copy.sum(axis=1)
    print(question)

Now we grade using the same code we saw earlier:

curve = 0

total_pts = 0
for i in results_copy.columns:
    total_pts += results_copy[i].max()

grades = round(totals/total_pts, 2).sort_values()

print('Mean:  ', grades.mean())
print('Median:', grades.median())
print('SD:    ', grades.std(ddof=0))

grades.sort_values() + curve

At this point, I decide what my grading curve will be. A dirty little secret of academia is that professors can use just about any curve they want.

My general rule is that I want the median score to be 78% for lower division classes, 74% for upper division classes, and 70% for my stats class (which is supposed to be hard, IMO).

I rarely see medians above those targets, but if I did, I wouldn’t curve downward.
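In code, the curve is just the gap between the target median and the observed median, floored at zero (a sketch using the grades series from above):

target_median = 0.78          # 0.74 for upper division, 0.70 for stats
curve = max(0, target_median - grades.median())
print('Curve:', round(curve, 3))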

Now that the process is complete, we have test grades that fairly and accurately measure student knowledge!

I could take it a step further and adjust the standard deviation as well, but then my grading system would lose even more transparency, and I’m not willing to make that concession. College students understand:

“I threw out the 5 questions that were the worst measures of your knowledge.”

…but they’ll struggle with:

“And then I curved the test so the average grade was 78%, and then compressed the standard deviation to be 8 points.”

Well… my stats students should understand that, but it’s still an explanation that’s a bit too long.
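For the curious, that compression would just be a z-score rescaling. Here's a sketch of the adjustment I don't actually apply:

# Hypothetical adjustment: shift the class to a target mean and squeeze the
# spread to a target standard deviation (expressed as proportions here)
target_mean, target_sd = 0.78, 0.08
z_scores = (grades - grades.mean()) / grades.std(ddof=0)
compressed_grades = target_mean + z_scores * target_sd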

I might fall just short of a standardized test that’s beyond reproach, but that’s asking too much of a college classroom test. Too many other factors vary from class to class (namely, my lectures), so I don’t think it’s worth spending additional effort on a perfectly uniform grading system. But this still goes a long way in showing me what my students know.

My two favorite machine learning scoring metrics (and why)

I’ll preface this by saying that what works for me won’t necessarily work for everyone else. You might be solving machine learning problems very different from the ones I usually work with. I’m typically working with regression and binary classification problems, so my preferences reflect that. So if you already have a workflow that suits you, I’d say to stick with that.

With that disclaimer out of the way, I almost always use sklearn.metrics.r2_score for regression and sklearn.metrics.roc_auc_score for binary classification.

What these two metrics have in common is that they’re unit-agnostic. It doesn’t matter if you’re predicting the price of a house or the grade of a student; a dating match or an abusive tweet — you’ll always have an intuition of whether your model is doing a good job or not.

R-squared score

R-squared is used for regression problems: given X, predict the value of Y.

First let’s look at r before we square it. Pearson’s r is your correlation coefficient — in this case, the correlation between actual and predicted values.

Pearson’s r accounts for differences in scale. X1 might be measured from 1-10, while X2 might be measured from 6000-10000. Or X1 might be measured from 200-210 while X2 is from 0.01-1.0. It doesn’t matter what scalar you might multiply your measurements by; the correlation is comparing z-scores so it’s always an apples-to-apples comparison.
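A quick demonstration of that scale invariance with toy numbers:

import numpy as np
from scipy.stats import pearsonr

actual = np.array([3.0, 5.0, 2.0, 8.0, 6.0])
predicted = np.array([2.5, 5.5, 3.0, 7.0, 6.5])

# Rescaling one variable (multiply by 1000, add 5) leaves r untouched
r_original = pearsonr(actual, predicted)[0]
r_rescaled = pearsonr(actual * 1000 + 5, predicted)[0]
print(np.isclose(r_original, r_rescaled))   # True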

Now, let’s square r. This is commonly called the coefficient of determination in statistics. Mathematically, it tells you what percentage of variance in the Y variable is predicted from X — or, in other words, what percentage of the variance in the actual values is accounted for in your model’s predicted values.

Since r accounts for differences in scale, r2 does as well. This means that an r2 close to zero is (nearly) always bad and an r2 close to one is (nearly) always good. You don’t need to know what units your Y variable is measured in; it’s irrelevant to r2_score.
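Strictly speaking, sklearn computes r2_score from sums of squares rather than by literally squaring a correlation coefficient. Here's a quick check with toy numbers:

import numpy as np
from sklearn.metrics import r2_score

y_true = np.array([3.0, 5.0, 2.0, 8.0, 6.0])
y_pred = np.array([2.5, 5.5, 3.0, 7.0, 6.5])

# r2_score = 1 - (residual sum of squares / total sum of squares)
ss_res = np.sum((y_true - y_pred) ** 2)
ss_tot = np.sum((y_true - y_true.mean()) ** 2)
print(np.isclose(1 - ss_res / ss_tot, r2_score(y_true, y_pred)))   # True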

The only thing that might change from project to project are the expectations and demands you place on this score. For one project, you might require nothing less than an r2 > 0.70, while for another project, r2 > 0.20 might be considered phenomenal.

But the number always means the same thing. That’s the beauty of it.

Before we move on to ROC-AUC, let me share some rules of thumb for how to interpret your r2:

R2      Description
< 0     Actively detrimental. Worse than random chance. Delete Python and burn your laptop.
0       Useless
0.2     Mediocre / Cusp of usability
0.3     Okay / Satisfactory
0.5     Good
0.7     Very good
0.8     Excellent
0.95    Too good — Are you leaking data?

Remember! These can vary from project to project. But these guidelines have served me well over the years.

(Sidenote: how can r2 be negative? Because sklearn computes r2_score as 1 minus the ratio of the residual sum of squares to the total sum of squares, rather than by literally squaring a correlation, a model whose predictions are worse than simply guessing the mean of Y ends up below zero.)
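A tiny demonstration with made-up numbers:

import numpy as np
from sklearn.metrics import r2_score

y_true = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
bad_pred = np.array([5.0, 4.0, 3.0, 2.0, 1.0])   # perfectly backwards guesses

# Far worse than just predicting the mean (3.0) every time
print(r2_score(y_true, bad_pred))   # -3.0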

ROC-AUC Score

ROC-AUC is used for binary classification problems. It stands for Receiver Operating Characteristic – Area Under the Curve.

The number reflects the relationship between your true positive rate and your false-positive rate — that is, the positive classes you correctly identified vs. the negative classes you misidentified as positives.

Here’s why I love ROC-AUC so much (which I’ll just call AUC from now on):

Classification problems have the consistent dilemma of where to draw the line. Let’s use a smoke detector as an example. Should a smoke detector go off when:

  • There’s a 10% chance of a fire?
  • There’s a 1% chance of a fire?
  • There’s a 0.1% chance of a fire?
  • There’s a 0.01% chance of a fire?

It has to draw the line somewhere, and this same thing happens when you train a classifier to predict classes. Let’s look at some pseudocode:

clf = Classifier().fit(x_train, y_train)
pred = clf.predict(x_test)

print(pred)
>>> [0, 1, 1, 1, 0, 1, 0, 0, 1]

Now, you can try optimizing your classifier for avoiding false positives (precision) or for catching as many true positives as possible (recall), but I’ve never liked using the built-in sklearn scorers for this.

The beauty of AUC is that it works with your model’s class probabilities rather than class predictions. As we’ll see in a moment, this makes it much easier to answer the question of “where do we draw the line?”

Imagine you’re predicting 10 observations and there are 4 positive classes and 6 negative classes. You put your model’s predictions into a dataframe and get something like this:

Predicted   Actual   True positives found   False positives found   True positive % found   False positive % found
0.99        1        1                      0                       25%                     0%
0.95        1        2                      0                       50%                     0%
0.91        0        2                      1                       50%                     16.67%
0.89        1        3                      1                       75%                     16.67%
0.74        0        3                      2                       75%                     33.33%
0.65        0        3                      3                       75%                     50%
0.41        1        4                      3                       100%                    50%
0.35        0        4                      4                       100%                    66.67%
0.28        0        4                      5                       100%                    83.33%
0.09        0        4                      6                       100%                    100%

The last 4 columns (after Predicted and Actual) represent cumulative numbers. You also may find it a little weird to call a 9% prediction a “false positive” — in fact, it sounds like the model is awfully confident that it’s a negative class, right?

Well, you’re not entirely wrong, but remember these are probabilities, and strictly speaking, the model isn’t assigning a class to any of its predictions. So you might prefer to think of this as “this would be a true positive if the model predicted 1,” and “this would be a false positive if the model predicted 1.”
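If you want to build a table like that yourself, here's a rough sketch. It assumes y_test holds the actual classes and preds holds the positive-class probabilities (the same preds we pull out of predict_proba a bit further down):

import numpy as np
import pandas as pd

# Sort from most to least confident, then accumulate the (would-be)
# true positives and false positives as we walk down the list
table = pd.DataFrame({'predicted': np.asarray(preds), 'actual': np.asarray(y_test)})
table = table.sort_values('predicted', ascending=False)

table['true_positives_found'] = (table['actual'] == 1).cumsum()
table['false_positives_found'] = (table['actual'] == 0).cumsum()
table['true_positive_pct'] = table['true_positives_found'] / (table['actual'] == 1).sum()
table['false_positive_pct'] = table['false_positives_found'] / (table['actual'] == 0).sum()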

Maybe you’re starting to see what I mean by “where do we draw the line?” If your model was predicting classes, it would probably assign any probability above 50% a 1, and anything below 50% a 0.

But, now that you have this table, you can decide for yourself whether it’s more important to catch true positives or to avoid false negatives.

Let’s say these predictions represent criminal trials. You absolutely do not want to imprison an innocent person, so you determine that you’ll only predict a 1 if the probability is above 0.95. That’s optimizing for precision.

But then imagine these numbers represent cancer tests. You want to bring people in for additional tests if there’s even a small chance they have cancer. So, based on this data, you might draw the line and predict a 1 if the probability is above 0.41. (In reality, you’d probably go even lower than that.) That’s optimizing for recall.

That’s why I love AUC. You get much more nuance when working with probabilities, and you can easily convert those numbers to 0s and 1s using np.where():

# The index [:, 1] grabs just the probability of a positive class,
# which is exactly what you need
preds = clf.predict_proba(x_test)[:, 1]

# np.where(if this condition, make it equal this, else make it equal that)
pred_classes = np.where(preds > 0.41, 1, 0)

# Alternatively, do a listcomp
pred_classes = [1 if i > 0.41 else 0 for i in preds]

At this point, you can switch back to using a precision or recall score, or creating a confusion matrix.

You also get the benefit of being able to rank your predictions in order of certainty. What if those predictions represent the probabilities of a person buying from you? You could call or email the leads at the top of your list first and work your way down. Then, once you get 10 no’s in a row, or something like that, you might pack up your stuff and go home, because at that point you’ve found all the most obvious customers.
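In code, that ranking is a one-liner. Here leads and x_leads are hypothetical: a dataframe of contact info and its matching feature matrix.

# Attach each lead's purchase probability and call the hottest ones first
leads['buy_probability'] = clf.predict_proba(x_leads)[:, 1]
call_list = leads.sort_values('buy_probability', ascending=False)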

That’s why ROC-AUC is so awesome. But there’s one more reason and it’s what we discussed earlier: the ROC-AUC score, like r-squared, is always measured in the same units. Your ROC-AUC is realistically going to be between 0.5 and 1.0 every time you calculate it. Which means I can give you more guidelines for how to interpret it!

AUC      Description
< 0.50   Actively detrimental. Try doing the exact opposite of what it says and maybe you'll be on to something.
0.50     Useless / Blindfolded monkey throwing darts
0.55     Mediocre / Cusp of usability
0.65     Okay / Satisfactory
0.70     Good
0.75     Very good
0.80     Excellent
0.90     Too good — Are you leaking data?

Another way to think of it is to add 10-15 percentage points to your score and that’s roughly your model’s letter grade. For example, I’d give a “B” to a model that scores 0.70.

But what do those numbers meeeean? ლ(ಠ益ಠლ)

That’s the last thing we’ll talk about before wrapping this thing up!

First, we can take our model’s predicted probabilities and graph the rate at which we find (would-be) true positives and false positives. Just like we saw in the table I created earlier.

Here’s the graph of one of the highest ROC-AUC scores I ever obtained: 0.864 when predicting speed-dating matches:

So when we say “area under the curve,” we really do mean the area under curve. The red dotted line that represents “luck” is an AUC of 0.5. It’s what would happen if we just shuffled our predictions in a random order (rather than in order of descending probability).

Conversely, a perfect AUC of 1.0 would be a blue line that shoots straight up and then cuts across to the right. You’d find every single true positive before encountering a false positive. And, of course, that’s crystal-ball territory that never happens in the real world of machine learning. That’s why I say to check for data leakage once you’re scoring above 0.9. I was extremely skeptical of my score of 0.864 here, but I’ve triple-checked my work and can confirm it’s legit.
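If you want to draw that curve for your own model, sklearn will hand you the points directly. A sketch, assuming the y_test and preds from earlier:

import matplotlib.pyplot as plt
from sklearn.metrics import roc_curve, roc_auc_score

# One (false positive rate, true positive rate) pair per threshold
fpr, tpr, thresholds = roc_curve(y_test, preds)

plt.plot(fpr, tpr, label=f'Model (AUC = {roc_auc_score(y_test, preds):.3f})')
plt.plot([0, 1], [0, 1], 'r--', label='Luck')   # the dotted diagonal
plt.xlabel('False positive rate')
plt.ylabel('True positive rate')
plt.legend()
plt.show()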

Summary

Try using sklearn.metrics.r2_score the next time you’re doing a regression problem, and sklearn.metrics.roc_auc_score the next time you’re performing binary classification.

Once you build up enough experience with these, you’ll develop an intuition of what a good model looks like. You’ll mentally compare each score to the ones you’ve seen before. And that will help you decide whether you’ve hit your target or if there’s more work to be done.