How I create fairer tests for my students

Have you ever had a professor say something along the lines of, “If I had the time, I’d just sit down and talk to each of you 1-on-1 to measure what you learned in this class. But I have to give you all a final exam instead”?

I heard a few of my professors express that sentiment. On one hand, it’s noble — test anxiety is a very real thing. But as I’ve gained more experience over the years as a teacher, I’ve come to view that perspective as a bit misguided.

I dislike it for the same reason I dislike most job interviews. Humans, despite their best intentions, will always have unconscious biases, and it’s very difficult to standardize and structure an interview such that each candidate has the exact same experience.

This is all a long-winded way of saying: I really like tests. They’re not a perfect form of assessment either, but with a little prep, they can be a valid and reliable way to measure knowledge, skills, and abilities.

In this tutorial, I’m going to share the methods I use to measure my students’ knowledge. Even if you’re not an educator, I think you might be able to find something useful to apply to your own line of work.

The assumption

There’s one major assumption I need to acknowledge before we jump into my methodology:

I assume that my tests are all measuring one single underlying construct: knowledge of course material.

The opposite of this would be a history professor who believes that her midterm measures two distinct factors (for instance):

  • Knowledge of the American Revolution
  • Knowledge of the Civil War

The professor in this example is assuming that knowledge of one period of history does not correlate with knowledge of another period.

In my experience, this just isn’t true. Whenever I perform a factor analysis on student responses, there’s clearly only one factor being measured.

Furthermore, IQ is usually modeled as a single general factor, and I recall reading that SAT math and verbal scores are moderately-to-highly correlated. (I can’t find the exact r anywhere, except for a few unsourced claims. Let me know if you find anything.)

So, we’ll continue on with the assumption that all test items are closely related.
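
A rough way to sanity-check this assumption without a full factor analysis is to look at the eigenvalues of the item correlation matrix: if the first eigenvalue dwarfs the rest, a single factor is plausible. The sketch below runs on simulated responses (the `ability`, `difficulty`, and `noise` variables are made up for illustration and are not real student data), so treat it as a demonstration of the idea rather than a formal test.

```python
import numpy as np

rng = np.random.default_rng(0)

# Simulated data (an assumption for illustration): 200 students, 10 items,
# all driven by one latent "ability" score plus item-level noise.
n_students, n_items = 200, 10
ability = rng.normal(size=(n_students, 1))
difficulty = rng.normal(scale=0.5, size=(1, n_items))
noise = rng.normal(size=(n_students, n_items))
responses = (ability - difficulty + noise > 0).astype(int)

# Eigenvalues of the item-by-item correlation matrix, largest first
item_corr = np.corrcoef(responses, rowvar=False)
eigenvalues = np.sort(np.linalg.eigvalsh(item_corr))[::-1]

print('Share of variance on first component:', round(eigenvalues[0] / n_items, 2))
print('Ratio of first to second eigenvalue: ', round(eigenvalues[0] / eigenvalues[1], 2))
```

With real data, `responses` would be the graded 1/0 table, with any question that everyone answered identically removed first (a constant column has no defined correlation).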


Now, a bit about my philosophy and the logistics of testing.

I write multiple-choice tests in Google Forms and download the results as a CSV. Therefore, all my tests are untimed and open-note. I get such a variance of scores in spite of this that I’m no longer concerned that the tests might be too easy. I still have to curve the scores at the end; the curve is just sometimes less than it would be if it were an in-class, closed-note test.

A big benefit of this format is that students don’t feel rushed, and it seems to significantly reduce testing anxiety. When my schedule permits, I lower the stakes even more by making the tests more frequent and worth fewer points. Formative assessment is a great tool.


Okay, let’s talk about the statistics and Python! First, let’s load the CSV and remove the columns we don’t need:

import numpy as np
import pandas as pd

data = pd.read_csv('quiz_responses.csv')

# Drop any column whose name contains one of these strings
to_drop = ['Unnamed', 'Timestamp', 'Email Address']
for d in to_drop:
    data = data[[i for i in data.columns if d not in i]]

The list comprehension is technically slower, but I prefer it because this code works even when the to_drop columns don’t exist.
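
To make that trade-off concrete, here is a small sketch on a made-up export (the `demo` frame and its column names are hypothetical). It compares the substring filter with pandas’ `drop(..., errors='ignore')`, which also tolerates missing columns but only matches full column names.

```python
import pandas as pd

# Hypothetical export: note there is no 'Email Address' column here
demo = pd.DataFrame({
    'Unnamed: 0': [0, 1],
    'Timestamp': ['2020/01/01', '2020/01/02'],
    'Q1': ['a', 'b'],
})

to_drop = ['Unnamed', 'Timestamp', 'Email Address']

# Substring filter: keep columns that contain none of the to_drop strings
kept = [c for c in demo.columns if not any(d in c for d in to_drop)]
filtered = demo[kept]

# Exact-name alternative: errors='ignore' skips labels that don't exist,
# but 'Unnamed: 0' survives because only full names are matched
dropped_exact = demo.drop(columns=to_drop, errors='ignore')

print(list(filtered.columns))
print(list(dropped_exact.columns))
```

The substring version is the one that catches pandas’ auto-generated `Unnamed: 0` style columns, which is why it suits this kind of export.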

Each column represents a test question, so let’s shorten the column names and then set the index equal to the student’s name.

data.columns = [i[:60].strip() + '...' if len(i) > 60 
                else i for i in data.columns]

# Find the name column and set it as the index
name_col = [i for i in data.columns if 'last name' in i.lower()][0]
data.index = data[name_col]
del data[name_col]

Things get a little weird here. The next thing I do is transpose the dataframe with .T and identify the column that’s my answer key. After that, we can grade by comparing each student’s column to the answer key column.

# Transpose and get answer key
data = data.T
key = [i for i in data.columns if 'answer' in i.lower()][0]

# Create new df called `results` and grade each test
results = pd.DataFrame(index=data.index)
for i in data.drop(key, axis=1).columns:
    results[i] = np.where(data[i] == data[key], 1, 0)
results = results.T
totals = results.sum(axis=1)

This code will turn each student’s test into 1’s and 0’s. They get a 1 when they get an answer right, and a 0 when they get it wrong.

Then we transpose the results dataframe again so that the questions are columns again, and then we calculate each student’s total number correct with totals = results.sum(axis=1).
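
If the double transpose feels awkward, a more compact route is to compare every row to the answer key in one step with `DataFrame.eq`. This is a minimal sketch on made-up data, assuming the answer key is stored as a row labeled 'Answer Key' (a hypothetical label). It is offered as an alternative, not as the code I actually run.

```python
import pandas as pd

# Hypothetical responses: rows are respondents, columns are questions,
# and one row holds the answer key
data = pd.DataFrame(
    {'Q1': ['a', 'a', 'b'], 'Q2': ['c', 'd', 'c'], 'Q3': ['b', 'b', 'b']},
    index=['Answer Key', 'Student A', 'Student B'],
)

key = data.loc['Answer Key']
students = data.drop(index='Answer Key')

# Compare each student's row to the key, question by question
results = students.eq(key, axis=1).astype(int)
totals = results.sum(axis=1)

print(results)
print(totals)
```

The output has the same shape as the loop version: students as rows, questions as columns, 1 for correct and 0 for incorrect.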

At this point we’ve essentially graded all the tests. We can now look at the raw scores before we start improving the test itself.

curve = 0
total_pts = 0
for i in results.columns:
    total_pts += results[i].max() # [1]

# Convert to percentages    
scores_array = totals/total_pts

# Calculate stats
print('Mean:  ', scores_array.mean())
print('Median:', scores_array.median())
print('SD:    ', round(scores_array.std(ddof=0), 3))

# Print student scores
(totals.sort_index()/(total_pts)) + curve

[1]: By calculating total points this way, I can add additional code to weight some questions to be worth more than others. I won’t get too deep in that process here, but all you have to do is multiply a question column by a scalar.
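
Here is roughly what that weighting could look like. The `weights` dictionary and the demo data are hypothetical; the only idea taken from the footnote is multiplying a question column by a scalar and letting each column’s max stand in for its point value.

```python
import pandas as pd

# Hypothetical graded results: 1 = correct, 0 = incorrect
results = pd.DataFrame(
    {'Q1': [1, 0, 1], 'Q2': [1, 1, 0], 'Q3': [0, 1, 1]},
    index=['Student A', 'Student B', 'Student C'],
)

# Hypothetical weights: Q3 counts double, everything else counts once
weights = {'Q3': 2}
weighted = results.copy()
for question, weight in weights.items():
    weighted[question] = weighted[question] * weight

# Same total-points logic as the grading code: each column's max is its value
total_pts = sum(weighted[q].max() for q in weighted.columns)
totals = weighted.sum(axis=1)

print(totals / total_pts)
```

One side effect worth knowing: a question nobody answers correctly has a max of 0, so it silently contributes no points to the total.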

The code you’ve seen so far makes it really easy to just drop the .csv into my grading directory and run the Jupyter notebook. I could take it a step further and refactor the code into a class, but that really hasn’t been necessary thus far.

Once we run the code, we get a nice printout like this:

Mean:   0.671304347826087
Median: 0.7
SD:     0.179

# (Names are fake)
Adam Gates               0.92
Alexa Hurst              0.44
Amber Moon               0.52
Amy Wilkinson            0.48
Andrew Evans             0.68
Angela Bowen             0.82
Becky Williams           0.80
Bryan Young              0.78
Christina Fitzpatrick    0.18
Christopher Hunter       0.80
Corey Gibson             0.72
Eric Scott               0.72
Madeline Todd            0.80
Marc Martin              0.70
Mary Hernandez           0.70
Michael Kirk             0.62
Roberto Schwartz         0.80
Nathan Lewis             0.32
Robert Harrell           0.66
Ryan Hill                0.88
Scott Mcintosh           0.64
Terry White              0.86
Thomas Martinez          0.60
dtype: float64

…and that’s my grading process. Now let’s cover how I optimize my test for fairness.

Test item selection

I determine if a question is “fair” by examining the correlation between getting the question correct and a student’s overall score on the exam. This works because of our assumption that the test is only measuring one construct/factor.

There are more advanced statistical methods I could use here. Factor analysis would be very appropriate (as I mentioned earlier), and Cronbach’s alpha would be a good measure of reliability. With that said, correlations alone still give me great results and are the easiest to explain to students.
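
For reference, Cronbach’s alpha is short enough to compute by hand: alpha = k/(k-1) × (1 − sum of item variances / variance of total scores), where k is the number of questions. The sketch below uses a made-up `demo` table; the `cronbach_alpha` helper is my own illustration, not a library function.

```python
import pandas as pd

def cronbach_alpha(items: pd.DataFrame) -> float:
    """Cronbach's alpha for a students-by-questions table of item scores."""
    k = items.shape[1]
    item_variances = items.var(axis=0, ddof=1)
    total_variance = items.sum(axis=1).var(ddof=1)
    return (k / (k - 1)) * (1 - item_variances.sum() / total_variance)

# Hypothetical graded results: 1 = correct, 0 = incorrect
demo = pd.DataFrame({
    'Q1': [1, 1, 1, 0, 0, 0],
    'Q2': [1, 1, 0, 1, 0, 0],
    'Q3': [1, 1, 1, 0, 1, 0],
    'Q4': [1, 0, 1, 0, 0, 0],
})

alpha = cronbach_alpha(demo)
print(round(alpha, 3))
```

Higher values mean the questions hang together more consistently; where to draw the line is a judgment call that depends on the stakes of the test.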

“Fair” isn’t necessarily the best word to use. The real goal is to see if the question distinguishes between students who know the material, and students who don’t. But, if you string enough of those questions together, and critically evaluate your test for cultural bias, then you can make a strong argument that your test is, in fact, “fair.”

So, how can these correlations inform our decisions?

  • A question that everyone gets right isn’t a good discriminator of knowledge. With no variation in the answers, its correlation with a student’s overall score is undefined (it comes out as NaN rather than a number).
  • The same is true for a question everyone gets wrong.
  • A question that high performers get right and others get wrong is likely a good question.
  • A question that low performers get right and everyone else gets wrong is probably a bad question. (Actually, it’s more commonly a sign I made a mistake on my answer key!)
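
A toy example makes these cases easier to see. The data below is invented (six students, listed strongest to weakest, with hypothetical column names), and it uses `DataFrame.corrwith` as a shortcut for computing each question’s correlation with the total.

```python
import pandas as pd

# Hypothetical graded results for six students, strongest to weakest
demo = pd.DataFrame({
    'everyone_right': [1, 1, 1, 1, 1, 1],
    'top_students':   [1, 1, 1, 0, 0, 0],
    'low_students':   [0, 0, 0, 0, 1, 1],
    'filler_1':       [1, 1, 1, 1, 0, 0],
    'filler_2':       [1, 1, 0, 1, 0, 0],
})
totals = demo.sum(axis=1)

# Pearson correlation of each question with the total score
item_total = demo.corrwith(totals)
print(item_total.round(2))
```

The question everyone gets right produces NaN (possibly with a NumPy runtime warning), the one only strong students get right correlates strongly and positively, and the one only weak students get right comes out negative.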

Here’s the code for calculating correlations:

from scipy.stats import pearsonr

correlations = []

for i in results.columns:
    r = pearsonr(results[i], totals)
    corr = r[0]
    pval = r[1]
    corr = round(corr, 3)
    correlations.append((i, pval, corr))
correlations = pd.DataFrame(correlations, columns=['question', 'pvalue', 'r'])
correlations['absol'] = abs(correlations['r'])

Next, we can single out the questions that have low correlations with overall grade:

# Filters (subjective, but this is a good starting point)
low_correlation = correlations['r'] < .30
significant = correlations['pvalue'] < 0.1

# Print bad questions
bad_questions = correlations[low_correlation & significant]['question'].tolist()

if bad_questions:
    print('Bad questions:')
    for i in bad_questions:
        print(i)

correlations = correlations.dropna()
correlations.set_index('question', inplace=True)

And then we can use seaborn to make a graphical representation of it:

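import matplotlib.pyplot as plt
import seaborn as sns

# Scale the figure height with the number of questions (at least 1 inch)
plt.figure(figsize=(3, max(1, len(correlations)//4)))

sns.heatmap(correlations[['r']].sort_values('r', ascending=False), annot=True, cbar=False);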

Success! The bright cells with high correlations appear to be good questions, while the darker cells are questions I should consider tossing out. (Sometimes, however, a low correlation is a clue that I left students with misconceptions, which means that I should reconsider how I teach something.)

Restructuring the test

Now I know which questions were good measures of knowledge and which were bad ones.

Lately, I’ve taken to warning students that several questions will definitely be tossed out come grading time. This tempers their expectations and lets them know that their grade won’t simply be what percentage of questions they got right.

It’s like the SAT, I explain to them: new questions need to be tested to see if they deserve a permanent place in my test bank. In a way, I trade transparency for fairness, but I haven’t gotten any complaints yet!

Now, we can put together all the code we’ve seen to automate the question selection process and grade the tests:

# Grade tests
results = pd.DataFrame(index=data.index)
for i in data.drop(key, axis=1).columns:
    results[i] = np.where(data[i] == data[key], 1, 0)
results = results.T
totals = results.sum(axis=1)

results_copy = results.copy()

# Remove questions where everyone answered the same
# (Assumed completion of this filter: a question with no variance
# has an undefined, NaN, correlation with the total score)
item_total = results_copy.corrwith(totals)
everyone_answered_same = item_total[item_total.isna()]

for i in everyone_answered_same.index:
    del results_copy[i]
totals = results_copy.sum(axis=1)

# Sequentially remove questions that don't correlate with overall score
threshold = .30
while results_copy.shape[1] > 1:
    item_total = results_copy.corrwith(totals).sort_values()
    worst = item_total.iloc[0]
    if worst >= threshold:
        break  # every remaining question clears the threshold
    del results_copy[item_total.index[0]]
    totals = results_copy.sum(axis=1)

Now we grade using the same code we saw earlier:

curve = 0

total_pts = 0
for i in results_copy.columns:
    total_pts += results_copy[i].max()

grades = round(totals/total_pts, 2).sort_values()

print('Mean:  ', grades.mean())
print('Median:', grades.median())
print('SD:    ', grades.std(ddof=0))

grades.sort_values() + curve

At this point, I decide what my grading curve will be. A dirty little secret of academia is that professors can use just about any curve they want.

My general rule is that I want the median score to be 78% for lower division classes, 74% for upper division classes, and 70% for my stats class (which is supposed to be hard, IMO).

I rarely see medians above those targets, but if I did, I wouldn’t curve downward.
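
That rule boils down to a small calculation. In this sketch, `curve_for` is a hypothetical helper and the grades are invented; the only logic taken from the text is "raise the median to the target, never curve downward."

```python
import pandas as pd

def curve_for(grades: pd.Series, target_median: float) -> float:
    """Points to add so the median reaches the target; never negative."""
    return max(0.0, round(target_median - grades.median(), 2))

# Hypothetical uncurved grades on a 0-1 scale
grades = pd.Series([0.55, 0.62, 0.70, 0.74, 0.90])

curve = curve_for(grades, target_median=0.78)  # lower-division target
print(curve)
print(grades + curve)
```

The resulting `curve` value is what I would plug into the `curve = 0` line of the grading code.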

Now that the process is complete, we have test grades that fairly and accurately measure student knowledge!

I could take it a step further and adjust the standard deviation as well, but then my grading system would lose even more transparency, and I’m not willing to make that concession. College students understand:

“I threw out the 5 questions that were the worst measures of your knowledge.”

…but they’ll struggle with:

“And then I curved the test so the average grade was 78%, and then compressed the standard deviation to be 8 points.”

Well… my stats students should understand that, but it’s still an explanation that’s a bit too long.
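
For the curious, the adjustment I’m choosing not to make is a linear rescaling: convert each grade to a z-score, then map it onto the mean and standard deviation you want. This is a sketch on invented grades, with `rescale` as a hypothetical helper and the 78% / 8-point targets borrowed from the example quote above.

```python
import pandas as pd

def rescale(grades: pd.Series, target_mean: float, target_sd: float) -> pd.Series:
    """Linearly rescale grades to a chosen mean and (population) SD."""
    z_scores = (grades - grades.mean()) / grades.std(ddof=0)
    return target_mean + z_scores * target_sd

# Hypothetical uncurved grades on a 0-1 scale
grades = pd.Series([0.44, 0.60, 0.70, 0.80, 0.92])

adjusted = rescale(grades, target_mean=0.78, target_sd=0.08)
print(adjusted.round(2))
```

The rank order of students never changes under this transformation; only the spacing between them does, which is exactly the part that’s hard to explain in one sentence.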

I might fall just short of a standardized test that’s beyond reproach, but that’s asking too much of a college classroom test. Too many other factors vary from class to class (namely, my lectures), so I don’t think it’s worth spending additional effort on a perfectly uniform grading system. But this still goes a long way in showing me what my students know.
