How I create fairer tests for my students

Have you ever had a professor say something along the lines of, “If I had the time, I’d just sit down and talk to each of you 1-on-1 to measure what you learned in this class. But I have to give you all a final exam instead”?

I heard a few of my professors express that sentiment. On one hand, it’s noble — test anxiety is a very real thing. But as I’ve gained more experience over the years as a teacher, I’ve come to view that perspective as a bit misguided.

I dislike it for the same reason I dislike most job interviews. Humans, despite their best intentions, will always have unconscious biases, and it’s very difficult to standardize and structure an interview such that each candidate has the exact same experience.

This is all a long-winded way of saying: I really like tests. They’re not a perfect form of assessment either, but with a little prep, they can be a valid and reliable way to measure knowledge, skills, and abilities.

In this tutorial, I’m going to share the methods I use to measure my students’ knowledge. Even if you’re not an educator, I think you might be able to find something useful to apply to your own line of work.

The assumption

There’s one major assumption I need to acknowledge before we jump into my methodology:

I assume that my tests are all measuring one single underlying construct: knowledge of course material.

The opposite of this would be a history professor who believes that her midterm measures two distinct factors (for instance):

  • Knowledge of the American Revolution
  • Knowledge of the Civil War

The professor in this example is assuming that knowledge of one period of history does not correlate with knowledge of another period.

In my experience, this just isn’t true. Whenever I perform a factor analysis on student responses, there’s clearly only one factor being measured.

Furthermore, intelligence testing generally finds a single dominant underlying factor (the g factor), and I recall reading that SAT math and verbal scores are moderately-to-highly correlated. (I can’t find the exact r anywhere, except for a few unsourced claims. Let me know if you find anything.)

So, we’ll continue on with the assumption that all test items are closely related.
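If you’d like to sanity-check the single-factor assumption on your own data, you don’t need a full factor-analysis package; looking at the eigenvalues of the item correlation matrix (a quick scree-style check) gets you most of the way. Here’s a minimal sketch, with an invented 0/1 response matrix:

```python
import numpy as np

# Invented response matrix: rows = students, columns = questions,
# 1 = correct, 0 = incorrect
responses = np.array([
    [1, 1, 1, 1],
    [1, 1, 1, 0],
    [1, 0, 1, 0],
    [0, 0, 0, 0],
    [1, 1, 0, 1],
    [0, 0, 1, 0],
])

# Correlations between questions, then the eigenvalues of that matrix
corr = np.corrcoef(responses, rowvar=False)
eigenvalues = np.sort(np.linalg.eigvalsh(corr))[::-1]

# A single dominant eigenvalue (much larger than the rest)
# is consistent with one underlying factor
print(eigenvalues)
```

If the first eigenvalue towers over the others, a one-factor model is a reasonable description of the test.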


Now, a bit about my philosophy and the logistics of testing.

I write multiple-choice tests in Google Forms and download the results as a CSV, which means all my tests are untimed and open-note. Even so, I see enough variance in scores that I’m no longer concerned the tests might be too easy. I still have to curve the scores at the end; the curve is just sometimes smaller than it would be for an in-class, closed-note test.

A big benefit of this format is that students don’t feel rushed, and it seems to significantly reduce testing anxiety. When my schedule permits, I lower the stakes even more by making the tests more frequent and worth fewer points. Formative assessment is a great tool.


Okay, let’s talk about the statistics and Python! First, let’s load the CSV and remove the columns we don’t need:

import numpy as np
import pandas as pd

data = pd.read_csv('quiz_responses.csv')

# Drop any columns whose names contain these substrings
to_drop = ['Unnamed', 'Timestamp', 'Email Address']
for d in to_drop:
    data = data[[i for i in data.columns if d not in i]]

The list comprehension is technically slower, but I prefer it because this code works even when the to_drop columns don’t exist.

Each column represents a test question, so let’s shorten the column names and then set the index equal to the student’s name.

data.columns = [i[:60].strip() + '...' if len(i) > 60 
                else i for i in data.columns]

# Find the name column and set it as the index
name_col = [i for i in data.columns if 'last name' in i.lower()][0]
data.index = data[name_col]
del data[name_col]

Things get a little weird here. The next thing I do is transpose the dataframe with .T and identify the column that’s my answer key. After that, we can grade by comparing each student’s column to the answer key column.

# Transpose and get answer key
data = data.T
key = [i for i in data.columns if 'answer' in i.lower()][0]

# Create new df called `results` and grade each test
results = pd.DataFrame(index=data.index)
for i in data.drop(key, axis=1).columns:
    results[i] = np.where(data[i] == data[key], 1, 0)
results = results.T
totals = results.sum(axis=1)

This code will turn each student’s test into 1’s and 0’s. They get a 1 when they get an answer right, and a 0 when they get it wrong.

Then we transpose the results dataframe again so that the questions are columns again, and then we calculate each student’s total number correct with totals = results.sum(axis=1).

At this point we’ve essentially graded all the tests. We can now look at the raw scores before we start improving the test itself.

curve = 0
total_pts = 0
for i in results.columns:
    total_pts += results[i].max() # [1]

# Convert to percentages    
scores_array = totals/total_pts

# Calculate stats
print('Mean:  ', scores_array.mean())
print('Median:', scores_array.median())
print('SD:    ', round(scores_array.std(ddof=0), 3))

# Print student scores
(totals.sort_index()/(total_pts)) + curve

[1]: By calculating total points this way, I can add additional code to weight some questions to be worth more than others. I won’t get too deep in that process here, but all you have to do is multiply a question column by a scalar.
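To make that weighting concrete, here’s a hedged sketch with an invented two-question results dataframe (the question names and scores are made up for illustration):

```python
import pandas as pd

# Invented graded results: rows = students, columns = questions
results = pd.DataFrame({
    'Q1': [1, 0, 1],
    'Q2': [1, 1, 0],
}, index=['Adam', 'Becky', 'Corey'])

# Make Q2 worth 2 points instead of 1
results['Q2'] = results['Q2'] * 2

totals = results.sum(axis=1)                                # Adam now has 3
total_pts = sum(results[c].max() for c in results.columns)  # 1 + 2 = 3
```

Because total_pts is computed from each column’s max, the percentages still come out right after weighting.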

The code you’ve seen so far makes it really easy to just drop the .csv into my grading directory and run the Jupyter notebook. I could take it a step further and refactor the code into a class, but that really hasn’t been necessary thus far.

Once we run the code, we get a nice printout like this:

Mean:   0.671304347826087
Median: 0.7
SD:     0.179

# (Names are fake)
Adam Gates               0.92
Alexa Hurst              0.44
Amber Moon               0.52
Amy Wilkinson            0.48
Andrew Evans             0.68
Angela Bowen             0.82
Becky Williams           0.80
Bryan Young              0.78
Christina Fitzpatrick    0.18
Christopher Hunter       0.80
Corey Gibson             0.72
Eric Scott               0.72
Madeline Todd            0.80
Marc Martin              0.70
Mary Hernandez           0.70
Michael Kirk             0.62
Roberto Schwartz         0.80
Nathan Lewis             0.32
Robert Harrell           0.66
Ryan Hill                0.88
Scott Mcintosh           0.64
Terry White              0.86
Thomas Martinez          0.60
dtype: float64

…and that’s my grading process. Now let’s cover how I optimize my test for fairness.

Test item selection

I determine if a question is “fair” by examining the correlation between getting the question correct and a student’s overall score on the exam. This works because of our assumption that the test is only measuring one construct/factor.

There are more advanced statistical methods I could use here. Factor analysis is very appropriate (as I mentioned earlier), and Cronbach’s alpha would be a good measure of reliability. With that said, correlations alone still give me great results and are the easiest to explain to students.
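For the curious, Cronbach’s alpha is simple to compute by hand from the 0/1 item matrix. Here’s a minimal sketch (the formula is standard, but the toy data is invented):

```python
import numpy as np

def cronbach_alpha(items):
    '''Cronbach's alpha. `items`: rows = students, columns = questions.'''
    items = np.asarray(items, dtype=float)
    k = items.shape[1]
    item_variances = items.var(axis=0, ddof=1).sum()
    total_variance = items.sum(axis=1).var(ddof=1)
    return (k / (k - 1)) * (1 - item_variances / total_variance)

# Items that rise and fall together -> perfectly reliable (1.0)
print(cronbach_alpha([[1, 1], [0, 0], [1, 1], [0, 0]]))
```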

“Fair” isn’t necessarily the best word to use. The real goal is to see if the question distinguishes between students who know the material, and students who don’t. But, if you string enough of those questions together, and critically evaluate your test for cultural bias, then you can make a strong argument that your test is, in fact, “fair.”

So, how can these correlations inform our decisions?

  • A question that everyone gets right isn’t a good discriminator of knowledge. The correlation between that question and a student’s overall score would be zero (strictly speaking, undefined, since the question has no variance).
  • The same is true for a question everyone gets wrong.
  • A question that high performers get right and others get wrong is likely a good question.
  • A question that low performers get right and everyone else gets wrong is probably a bad question. (Actually, it’s more commonly a sign I made a mistake on my answer key!)

Here’s the code for calculating correlations:

from scipy.stats import pearsonr

correlations = []

for i in results.columns:
    r = pearsonr(results[i], totals)
    corr = r[0]
    pval = r[1]
    corr = round(corr, 3)
    correlations.append((i, pval, corr))
correlations = pd.DataFrame(correlations, columns=['question', 'pvalue', 'r'])
correlations['absol'] = abs(correlations['r'])

Next, we can single out the questions that have low correlations with overall grade:

# Filters (subjective, but this is a good starting point)
low_correlation = correlations['r'] < .30
significant = correlations['pvalue'] < 0.1

# Print bad questions
bad_questions = correlations[low_correlation & significant]['question'].tolist()

if bad_questions:
    print('Bad questions:')
    for i in bad_questions:
        print(i)

correlations = correlations.dropna()
correlations.set_index('question', inplace=True)

And then we can use seaborn to make a graphical representation of it:

import matplotlib.pyplot as plt
import seaborn as sns

plt.figure(figsize=(3, len(correlations)//4))
sns.heatmap(correlations[['r']].sort_values('r', ascending=False),
            annot=True, cbar=False);

Success! The bright numbers with high scores appear to be good questions, while the darker numbers are questions I should consider tossing out. (Sometimes, however, this is a clue that I left students with misconceptions — which means that I should reconsider how I teach something.)

Restructuring the test

Now I know which questions were good measures of knowledge and which questions were bad ones.

Lately, I’ve taken to warning students that several questions will definitely be tossed out come grading time. This tempers their expectations and lets them know that their grade won’t simply be what percentage of questions they got right.

It’s like the SAT, I explain to them: new questions need to be tested to see if they deserve a permanent place in my test bank. In a way, I trade transparency for fairness, but I haven’t gotten any complaints yet!

Now, we can put together all the code we’ve seen to automate the question selection process and grade the tests:

# Grade tests
results = pd.DataFrame(index=data.index)
for i in data.drop(key, axis=1).columns:
    results[i] = np.where(data[i] == data[key], 1, 0)
results = results.T
totals = results.sum(axis=1)

results_copy = results.copy()

# Remove questions where everyone answered the same
# (Sorry, this filter is obnoxious, but it works: a zero-variance
# question has an undefined correlation, so corrwith returns NaN)
everyone_answered_same = results_copy.corrwith(totals)
everyone_answered_same = everyone_answered_same[everyone_answered_same.isna()]

for i in everyone_answered_same.index:
    del results_copy[i]

# Sequentially remove questions that don't correlate with overall score
threshold = .30
while True:
    corrs = results_copy.corrwith(totals).sort_values()
    worst, question = corrs.iloc[0], corrs.index[0]
    if worst >= threshold:
        break
    del results_copy[question]
    totals = results_copy.sum(axis=1)

Now we grade using the same code we saw earlier:

curve = 0

total_pts = 0
for i in results_copy.columns:
    total_pts += results_copy[i].max()

grades = round(totals/total_pts, 2).sort_values()

print('Mean:  ', grades.mean())
print('Median:', grades.median())
print('SD:    ', grades.std(ddof=0))

grades.sort_values() + curve

At this point, I decide what my grading curve will be. A dirty little secret of academia is that professors can use just about any curve they want.

My general rule is that I want the median score to be 78% for lower division classes, 74% for upper division classes, and 70% for my stats class (which is supposed to be hard, IMO).

I rarely see medians above those targets, but if I did, I wouldn’t curve downward.
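In code, that rule amounts to a flat shift that lifts the median to the target (and never pushes it down). A hedged sketch with invented grades:

```python
import pandas as pd

grades = pd.Series([0.55, 0.62, 0.70, 0.74, 0.81])  # invented scores

target_median = 0.78                               # lower-division target
curve = max(0.0, target_median - grades.median())  # never curve downward
curved = grades + curve

print(round(curved.median(), 2))  # median now sits at the target
```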

Now that the process is complete, we have test grades that fairly and accurately measure student knowledge!

I could take it a step further and adjust the standard deviation as well, but then my grading system would lose even more transparency, and I’m not willing to make that concession. College students understand:

“I threw out the 5 questions that were the worst measures of your knowledge.”

…but they’ll struggle with:

“And then I curved the test so the average grade was 78%, and then compressed the standard deviation to be 8 points.”

Well… my stats students should understand that, but it’s still an explanation that’s a bit too long.

I might fall just short of a standardized test that’s beyond reproach, but that’s asking too much of a college classroom test. Too many other factors vary from class to class (namely, my lectures), so I don’t think it’s worth spending additional effort on a perfectly uniform grading system. But this still goes a long way in showing me what my students know.

My two favorite machine learning scoring metrics (and why)

I’ll preface this by saying that what works for me won’t necessarily work for everyone else. You might be solving machine learning problems very different from the ones I usually work with. I’m typically working with regression and binary classification problems, so my preferences reflect that. So if you already have a workflow that suits you, I’d say to stick with that.

With that disclaimer out of the way, I almost always use sklearn.metrics.r2_score for regression and sklearn.metrics.roc_auc_score for binary classification.

What these two metrics have in common is that they’re unit-agnostic. It doesn’t matter if you’re predicting the price of a house or the grade of a student; a dating match or an abusive tweet — you’ll always have an intuition of whether your model is doing a good job or not.

R-squared score

R-squared is used for regression problems: given X, predict the value of Y.

First let’s look at r before we square it. Pearson’s r is your correlation coefficient — in this case, the correlation between actual and predicted values.

Pearson’s r accounts for differences in scale. X1 might be measured from 1-10, while X2 might be measured from 6000-10000. Or X1 might be measured from 200-210 while X2 is from 0.01-1.0. It doesn’t matter what scalar you might multiply your measurements by; the correlation is comparing z-scores so it’s always an apples-to-apples comparison.

Now, let’s square r. This is commonly called the coefficient of determination in statistics. Mathematically, it tells you what percentage of variance in the Y variable is predicted from X — or, in other words, what percentage of the variance in the actual values is accounted for in your model’s predicted values.
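That “percentage of variance accounted for” reading translates directly into code: sklearn’s r2_score is 1 - SS_res/SS_tot. Here’s a minimal numpy sketch of the same computation (the data is invented):

```python
import numpy as np

def r2(actual, predicted):
    '''Coefficient of determination: 1 - SS_res / SS_tot.'''
    actual = np.asarray(actual, dtype=float)
    predicted = np.asarray(predicted, dtype=float)
    ss_res = ((actual - predicted) ** 2).sum()      # unexplained variance
    ss_tot = ((actual - actual.mean()) ** 2).sum()  # total variance
    return 1 - ss_res / ss_tot

print(r2([3, 5, 7, 9], [3, 5, 7, 9]))  # perfect predictions -> 1.0
print(r2([3, 5, 7, 9], [6, 6, 6, 6]))  # just predicting the mean -> 0.0
```

A model that does worse than predicting the mean pushes SS_res above SS_tot, which is how the score can go negative.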

Since r accounts for differences in scale, r2 does as well. This means that an r2 close to zero is (nearly) always bad and an r2 close to one is (nearly) always good. You don’t need to know what units your Y variable is measured in; it’s irrelevant to r2_score.

The only thing that might change from project to project are the expectations and demands you place on this score. For one project, you might require nothing less than an r2 > 0.70, while for another project, r2 > 0.20 might be considered phenomenal.

But the number always means the same thing. That’s the beauty of it.

Before we move on to ROC-AUC, let me share some rules of thumb for how to interpret your r2:

  • < 0: Actively detrimental. Worse than random chance. Delete Python and burn your laptop.
  • 0.2: Mediocre / cusp of usability
  • 0.3: Okay / satisfactory
  • 0.7: Very good
  • 0.95: Too good. Are you leaking data?

Remember! These can vary from project to project. But these guidelines have served me well over the years.

(Sidenote: How can r2 be negative? Because sklearn’s r2_score is defined as 1 - SS_res/SS_tot rather than as a literal square, a model that predicts worse than simply guessing the mean of Y scores below zero.)


ROC-AUC score

ROC-AUC is used for binary classification problems. It stands for Receiver Operating Characteristic – Area Under the Curve.

The number reflects the relationship between your true positive rate and your false-positive rate — that is, the positive classes you correctly identified vs. the negative classes you misidentified as positives.

Here’s why I love ROC-AUC so much (which I’ll just call AUC from now on):

Classification problems have the consistent dilemma of where to draw the line. Let’s use a smoke detector as an example. Should a smoke detector go off when:

  • There’s a 10% chance of a fire?
  • There’s a 1% chance of a fire?
  • There’s a 0.1% chance of a fire?
  • There’s a 0.01% chance of a fire?

It has to draw the line somewhere, and this same thing happens when you train a classifier to predict classes. Let’s look at some pseudocode:

clf = Classifier().fit(x_train, y_train)
pred = clf.predict(x_test)

>>> [0, 1, 1, 1, 0, 1, 0, 0, 1]

Now, you can try optimizing your classifier for avoiding false positives (precision) or for catching true positives (recall), but I’ve never liked using the built-in sklearn scorers for this.

The beauty of AUC is that it works with your model’s class probabilities rather than class predictions. As we’ll see in a moment, this makes it much easier to answer the question of “where do we draw the line?”
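There’s a neat probabilistic reading that follows from this: AUC equals the probability that a randomly chosen positive observation is scored higher than a randomly chosen negative one. A small self-contained sketch (the labels and probabilities are invented):

```python
def auc(y_true, y_score):
    '''AUC as a rank statistic: the fraction of (positive, negative)
    pairs where the positive outranks the negative (ties count half).'''
    pos = [s for s, y in zip(y_score, y_true) if y == 1]
    neg = [s for s, y in zip(y_score, y_true) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

print(auc([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8]))  # 0.75
```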

Imagine you’re predicting 10 observations and there are 4 positive classes and 6 negative classes. You put your model’s predictions into a dataframe and get something like this:

[Table: Predicted | Actual | True positives found | False positives found | True positive % found | False positive % found]

The last 4 columns (after Predicted and Actual) represent cumulative numbers. You also may find it a little weird to call a 9% prediction a “false positive” — in fact, it sounds like the model is awfully confident that it’s a negative class, right?

Well, you’re not entirely wrong, but remember these are probabilities, and strictly speaking, the model isn’t assigning a class to any of its predictions. So you might prefer to think of this as “this would be a true positive if the model predicted 1,” and “this would be a false positive if the model predicted 1.”

Maybe you’re starting to see what I mean by “where do we draw the line?” If your model was predicting classes, it would probably assign any probability above 50% a 1, and anything below 50% a 0.

But, now that you have this table, you can decide for yourself whether it’s more important to catch true positives or to avoid false positives.

Let’s say these predictions represent criminal trials. You absolutely do not want to imprison an innocent person, so you determine that you’ll only predict a 1 if the probability is above 0.95. That’s optimizing for precision.

But then imagine these numbers represent cancer tests. You want to bring people in for additional tests if there’s even a small chance they have cancer. So, based on this data, you might draw the line and predict a 1 if the probability is above 0.41. (In reality, you’d probably go even lower than that.) That’s optimizing for recall.

That’s why I love AUC. You get much more nuance when working with probabilities, and you can easily convert those numbers to 0s and 1s using np.where():

# The index [:, 1] grabs just the probability of a positive class,
# which is exactly what you need
preds = clf.predict_proba(x_test)[:, 1]

# np.where(if this condition, make it equal this, else make it equal that)
pred_classes = np.where(preds > 0.41, 1, 0)

# Alternatively, do a listcomp
pred_classes = [1 if i > 0.41 else 0 for i in preds]

At this point, you can switch back to using a precision or recall score, or create a confusion matrix.

You also get the benefit of being able to rank your predictions in order of certainty. What if those predictions represent the probabilities of a person buying from you? You could call or email the leads at the top of your list first and work your way down. Then, once you get 10 no’s in a row, or something like that, you might pack up your stuff and go home, because at that point you’ve found all the most obvious customers.
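That lead-ranking workflow is only a couple of lines (the names and probabilities here are invented):

```python
# Invented predicted purchase probabilities for a few leads
probs = {'Alice': 0.91, 'Bob': 0.42, 'Cara': 0.77, 'Dan': 0.13}

# Work the most promising leads first
call_order = sorted(probs, key=probs.get, reverse=True)
print(call_order)  # ['Alice', 'Cara', 'Bob', 'Dan']
```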

That’s why ROC-AUC is so awesome. But there’s one more reason and it’s what we discussed earlier: the ROC-AUC score, like r-squared, is always measured in the same units. Your ROC-AUC is realistically going to be between 0.5 and 1.0 every time you calculate it. Which means I can give you more guidelines for how to interpret it!

  • < 0.50: Actively detrimental. Try doing the exact opposite of what it says and maybe you’ll be on to something.
  • 0.50: Useless / blindfolded monkey throwing darts
  • 0.55: Mediocre / cusp of usability
  • 0.65: Okay / satisfactory
  • 0.75: Very good
  • 0.90: Too good. Are you leaking data?

Another way to think of it is to add 10-15 percentage points to your score and that’s roughly your model’s letter grade. For example, I’d give a “B” to a model that scores 0.70.

But what do those numbers meeeean? ლ(ಠ益ಠლ)

That’s the last thing we’ll talk about before wrapping this thing up!

First, we can take our model’s predicted probabilities and graph the rate at which we find (would-be) true positives and false positives, just like we saw in the table I created earlier.
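If you’d like to build that graph yourself, its points are just two cumulative sums over predictions sorted from most to least confident. A minimal sketch with invented data:

```python
import numpy as np

y_true = np.array([1, 0, 1, 1, 0, 0])              # invented labels
y_prob = np.array([0.9, 0.8, 0.7, 0.4, 0.3, 0.1])  # invented probabilities

# Walk through predictions from most to least confident
order = np.argsort(-y_prob)
tpr = np.cumsum(y_true[order]) / y_true.sum()            # y-axis
fpr = np.cumsum(1 - y_true[order]) / (1 - y_true).sum()  # x-axis

# Plotting fpr against tpr draws the ROC curve;
# the area under it is the AUC
print(list(zip(fpr, tpr)))
```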

Here’s the graph of one of the highest ROC-AUC scores I ever obtained: 0.864 when predicting speed-dating matches:

So when we say “area under the curve,” we really do mean the area under the curve. The red dotted line that represents “luck” is an AUC of 0.5. It’s what would happen if we just shuffled our predictions into a random order (rather than sorting by descending probability).

Conversely, a perfect AUC of 1.0 would be a blue line that shoots straight up and then cuts across to the right. You’d find every single true positive before encountering a false positive. And, of course, that’s crystal-ball territory that never happens in the real world of machine learning. That’s why I say to check for data leakage once you’re scoring above 0.9. I was extremely skeptical with my score of 0.864 here, but I’ve triple-checked my work and can confirm it’s legit.


Try using sklearn.metrics.r2_score the next time you’re doing a regression problem, and sklearn.metrics.roc_auc_score the next time you’re performing binary classification.

Once you build up enough experience with these, you’ll develop an intuition of what a good model looks like. You’ll mentally compare each score to the ones you’ve seen before. And that will help you decide whether you’ve hit your target or if there’s more work to be done.

I made a machine learning model to predict speed dating matches. Here’s what I learned.

For my Udacity capstone project in machine learning, I decided to challenge myself with a big, messy, and unpredictable dataset: speed dating data from professors Ray Fisman and Sheena Iyengar. This data was used in their paper, Gender Differences in Mate Selection: Evidence From a Speed Dating Experiment.

It’s pretty fun to collect speed dating data as a researcher! You have unlimited access to lonely undergrads, so you tell them you’re throwing a speed dating party with free food, and all they have to do in return is share their data with you. You then get access to human behavior in a high-stakes, naturalistic setting — way better than a psych lab, I’d say!


The dataset

We have a lot of features in this dataset — 195 columns, to be exact. You can read the data dictionary for a full breakdown, but let me give you a summary. Among other things, the researchers recorded:

  • Basic demographic data about each participant, including age, race, and importance of religion
  • The zip code where each person grew up, which can be used to predict their family income and socioeconomic status.
  • Career goals
  • Hobbies
  • How attractive, fun, sincere, ambitious, and intelligent you think you are
  • How much you value those same qualities in a partner

An early challenge I encountered was the fact that this dataset contains information from 21 different speed dating events, and these events often had different rules and rating systems. You can view my notebook to see how I dealt with these problems, but I’ll tell you it required a lot of cross-validation and trial-and-error.

Devising a machine learning model: What’s cheating and what’s not?

Another problem I had to solve was figuring out what data I was allowed to use and what should be considered off-limits.

My purpose for creating this model was to simulate a dating app. Given two users who have filled out their profiles, what is the probability that they would be a match for each other?

First, let’s look at the columns I had to remove. Several were a little too similar to the match variable I was predicting. Some good examples of this are the columns:

  • “How attracted are you to this person?”
  • Correlation between participant’s and partner’s ratings

So the heuristic I used was to remove any column that had a correlation of more than 0.25 with the match column. A correlation coefficient of 0.25 is actually pretty low by most standards, but this was the threshold I needed to remove all the columns I considered “cheating.”

match_corrs = data.select_dtypes(include=[np.number])\
    .corrwith(data['match'])

match_corrs = match_corrs[match_corrs > .25].index

data = data.dropna(subset=['id', 'pid'], axis=0)

for i in match_corrs[1:]:
    del data[i]

# Other columns that are too predictive
del data['int_corr']
del data['them_cal']
del data['you_call']

del data['field'] # redundant

Next, I thought about what sorts of data leakage would be acceptable. What information would a dating app have about a user that might not be apparent at a speed dating event?

The answer: data about a person’s choosiness and desirability. I defined desirability as the percentage of partners who wanted to see this person again, and choosiness as the percentage of partners this person wanted to see again. (Read that carefully! One is about the decisions a person receives; the other is about the decisions they give.)

desirability = data.groupby('iid').mean()['dec_o'].to_dict()
data['desirability'] = data['iid'].map(desirability)

choosiness = data.groupby('iid').mean()['dec'].to_dict()
data['choosiness'] = data['iid'].map(choosiness)

On one hand, I used post-hoc information to calculate this (decision of partner the night of the event), but I’ll argue that this isn’t too different from the user data an app would have. For any given user, we know what percentage of profiles they “swipe right” on, and what percentage of users “swipe right” on them.

So, as long as our sample size is sufficiently large, our machine learning model isn’t really able to peek at the answer while training.

If you’re following along inside my notebook, the next part contains some data wrangling that’s totally uninteresting. I’ll skip it in this blog post and jump to my additional feature engineering.

Feature engineering

One of the most fundamental, repeatable findings of what we call “mate selection” is the principle of homogamy. What it means, essentially, is birds of a feather flock together. If anyone’s ever told you opposites attract, don’t believe them.

We date and befriend people who are similar to us; people who share our values, opinions, hobbies, education level, and personal background. Once you think about this for a bit, it’s almost absurd to believe otherwise.

With this in mind, we need to engineer some features that will be markers for homogamy. In effect, we need to find features that are the same for both the individual and their speed dating partner.

In the notebook, you’ll see a lot of feature engineering at this point, but I’ll just tell you about the features that proved the most useful.

Comparing socioeconomic status. I went about this in several ways, but in general I compared the zip code income data of where someone grew up, to that of their partner’s. Sample code:

def get_partner_data(pid, col):
    '''Looks up the person's partner and returns their value
    for the given column. If the partner ID doesn't exist,
    returns -1.'''
    partner = data[data['iid'] == pid].head(1)
    if len(partner):
        return partner[col].iloc[0]
    return -1

# Income (where income data is available, take the log difference)
data['partner_income'] = data['pid'].apply(
	get_partner_data, col='income')
data['income_difference'] = np.where(
    (data.partner_income == -1) | (data.income == -1),
    -1,  # flag pairs where one side has no income data
    np.log1p(np.abs(data.income - data.partner_income)))

Comparing choosiness and desirability. To put this in slightly crude terms, 9’s tend to date 9’s and 6’s tend to date 6’s. 🙂 So let’s make sure we capture that information with a new feature! Sample code:

data['partner_desirability'] = data['pid'].apply(
	get_partner_data, col='desirability')

data['des_diff'] = data['desirability'] - data['partner_desirability']

data['partner_choosiness'] = data['pid'].apply(
	get_partner_data, col='choosiness')

data['choose_diff'] = data['choosiness'] - data['partner_choosiness']

Expectations. Some of the most important features I engineered measured how well a speed dating partner met someone’s expectations. Among these, the most important turned out to be ambition — whether your date was ambitious enough for you. I actually measured this in several different ways, but here’s some sample code:

data['partner_ambition'] = data['pid'].apply(
	get_partner_data, col='amb3_1')

# Self data & partner data were measured with different scales,
# hence the math you see here
data['amb_expectations'] = (
	10*data['amb1_1']/data['amb1_1'].max()) - data['partner_ambition']

You get the idea. I made many, many new features like this and then used some dimensionality reduction to get rid of the unimportant ones. Let’s move on to the fun part: the findings!

The results

I made a few different visualizations, but this one was my favorite. The stereotype is that men care about looks more than women do, and women care about a man’s career. Does the data support this?

It sure does! But what I find interesting are the handful of men who really prioritize having an ambitious partner. Chill out, Wall Street!

The machine learning model

So, what about the model itself? I tried many different algorithms, and good ol’ XGBoost came out on top. After dozens of hours of grid searching, the best model had a max_depth of 3 and only trained on the top 54% of features.

I used AUC to score my model. If you’re unfamiliar with this metric, the ELI5 is that it orders your model’s predictions from most certain of a match to most certain of a non-match, and then compares that to whether there actually was a match between those two daters. (I love AUC for binary classification.)

My model got a final score of 0.864 on my testing data. With AUC, 1.0 represents perfection and 0.5 is random chance, so that’s some pretty great performance! Let’s visualize it:

That’s with 5 folds of cross-validation, so you can see the model consistently performs well.

So, what features turned out to be the most important in predicting a match?

The feature importances will be different each time we run the model, but the engineered features are consistently accounting for about half of the ones deemed most important.

This means that the matchmaking data is there, it just needed to be transformed into something usable.

The bottom line

With an AUC of 0.864, I think it’s fair to say we’ve cracked the code of matchmaking.

Love is supposed to be this magical, ineffable force, but it turns out computers can probably predict love better than we can ourselves!

I’m getting ahead of myself. Sorry, I get excited when I see results like this.

The algorithm can predict dates, but whether machine learning can predict successful relationships remains an open question. As a psychologist with some background in this subject matter, I’m about 90% sure that a relationship model would have comparable performance. The factors that predict relationship satisfaction and divorce are pretty easily quantifiable. Empathy, communication style, mental health, and socioeconomic status are the big ones.

Future direction

This dataset can easily be adapted to create a recommender system. I’m finishing it up as I type this, and I’ll be sure to share those results as well. Long story short, the recommender system uses collaborative filtering and obtains a similar AUC score. The benefit of this alternative approach is that it’s faster and may not require as much feature engineering.

My success with this project gave me the confidence to launch a social network called Metta. Can a personality test help you find new friends or dates? It turns out it totally can. Join now and get your personality analyzed along 30+ dimensions for free.

Thanks for reading!