For my Udacity capstone project in machine learning, I decided to challenge myself with a big, messy, and unpredictable dataset: speed dating data from professors Ray Fisman and Sheena Iyengar. This data was used in their paper, Gender Differences in Mate Selection: Evidence From a Speed Dating Experiment.
It’s pretty fun to collect speed dating data as a researcher! You have unlimited access to lonely undergrads, so you tell them you’re throwing a speed dating party with free food, and all they have to do in return is share their data with you. You then get access to human behavior in a high-stakes, naturalistic setting — way better than a psych lab, I’d say!
We have a lot of features in this dataset — 195 columns, to be exact. You can read the data dictionary for a full breakdown, but let me give you a summary. Among other things, the researchers recorded:
- Basic demographic data about each participant, including age, race, and importance of religion
- The zip code where each person grew up, which can be used to estimate their family income and socioeconomic status
- Career goals
- How attractive, fun, sincere, ambitious, and intelligent each participant thinks they are
- How much they value those same qualities in a partner
An early challenge I encountered was the fact that this dataset contains information from 21 different speed dating events, and these events often had different rules and rating systems. You can view my notebook to see how I dealt with these problems, but I’ll tell you it required a lot of cross-validation and trial-and-error.
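To give a flavor of what this harmonization involves: some events had participants rate partners on a 1–10 scale, while others had them distribute 100 points across attributes. Here's a minimal sketch of one way to put such ratings on a common footing — the `rescale_ratings` helper and toy data are my own illustration, not the notebook's exact code:

```python
import pandas as pd

def rescale_ratings(df, rating_cols, group_col='wave'):
    """Rescale each event's ratings onto a common 0-10 scale,
    normalizing by the maximum rating observed within each wave."""
    out = df.copy()
    for col in rating_cols:
        group_max = out.groupby(group_col)[col].transform('max')
        out[col] = 10 * out[col] / group_max
    return out

# Toy data: wave 1 used a 1-10 scale, wave 2 distributed 100 points
events = pd.DataFrame({
    'wave': [1, 1, 2, 2],
    'attr': [8, 10, 40, 100],
})
rescaled = rescale_ratings(events, ['attr'])
print(rescaled['attr'].tolist())  # [8.0, 10.0, 4.0, 10.0]
```

After rescaling, an "8" from a 1–10 wave and an "80" from a 100-point wave land in the same place.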
Devising a machine learning model: What’s cheating and what’s not?
Another problem I had to solve was figuring out what data I was allowed to use and what should be considered off-limits.
My purpose for creating this model was to simulate a dating app. Given two users who have filled out their profiles, what is the probability that they would be a match for each other?
First, let’s look at the columns I had to remove. Several were a little too similar to the match variable I was predicting. Good examples include:
- “How attracted are you to this person?”
- Correlation between participant’s and partner’s ratings
So the heuristic I used was to remove any column with a correlation of more than 0.25 with the match column. A correlation coefficient of 0.25 is actually pretty low by most standards, but this was the threshold I needed to remove all the columns I considered “cheating.”
```python
match_corrs = data.select_dtypes(include=[np.number])\
    .corrwith(data.match)\
    .sort_values(ascending=False)
match_corrs = match_corrs[match_corrs > .25].index

data = data.dropna(subset=['id', 'pid'], axis=0)

# Skip the first entry: it's the match column itself (correlation 1.0)
for i in match_corrs[1:]:
    del data[i]

# Other columns that are too predictive
del data['int_corr']
del data['them_cal']
del data['you_call']
del data['field']  # redundant
```
Next, I thought about what sorts of data leakage would be acceptable. What information would a dating app have about a user, that might not be apparent in a speed dating event?
The answer: data about a person’s choosiness and desirability. I defined desirability as the percentage of partners who wanted to see this person again, and choosiness as the percentage of partners this person wanted to see again. (Read that carefully! The two definitions are mirror images of each other, not the same thing.)
```python
desirability = data.groupby('iid').mean()['dec_o'].to_dict()
data['desirability'] = data['iid'].map(desirability)

choosiness = data.groupby('iid').mean()['dec'].to_dict()
data['choosiness'] = data['iid'].map(choosiness)
```
On one hand, I used post-hoc information to calculate this (decision of partner the night of the event), but I’ll argue that this isn’t too different from the user data an app would have. For any given user, we know what percentage of profiles they “swipe right” on, and what percentage of users “swipe right” on them.
So, as long as our sample size is sufficiently large, our machine learning model isn’t really able to peek at the answer while training.
If you’re following along inside my notebook, the next part contains some data wrangling that’s totally uninteresting. I’ll skip it in this blog post and jump to my additional feature engineering.
One of the most fundamental, repeatable findings of what we call “mate selection” is the principle of homogamy. What it means, essentially, is that birds of a feather flock together. If anyone’s ever told you opposites attract, don’t believe them.
We date and befriend people who are similar to us; people who share our values, opinions, hobbies, education level, and personal background. Once you think about this for a bit, it’s almost absurd to believe otherwise.
With this in mind, we need to engineer some features that will be markers for homogamy. In effect, we need to find features that are the same for both the individual and their speed dating partner.
In the notebook, you’ll see a lot of feature engineering at this point, but I’ll just tell you about the features that proved the most useful.
Comparing socioeconomic status. I went about this in several ways, but in general I compared the zip code income data of where someone grew up to that of their partner. Sample code:
```python
def get_partner_data(pid, col):
    '''Looks up the person's partner and returns their value for `col`.
    If the partner ID doesn't exist, returns -1.'''
    try:
        partner = data[data['iid'] == pid].head(1)[col].iloc[0]
        if pd.notna(partner):
            return partner
        else:
            return -1
    except IndexError:
        return -1

# Income (where income data is available, take the log difference)
data['partner_income'] = data['pid'].apply(
    get_partner_data, col='income')
data['income_difference'] = np.where(
    (data.partner_income == -1) | (data.income == -1),
    -1,
    np.log1p(np.abs(data.income - data.partner_income))
)
```
Comparing choosiness and desirability. To put this in slightly crude terms, 9’s tend to date 9’s and 6’s tend to date 6’s. 🙂 So let’s make sure we capture that information with a new feature! Sample code:
```python
data['partner_desirability'] = data['pid'].apply(
    get_partner_data, col='desirability')
data['des_diff'] = data['desirability'] - data['partner_desirability']

data['partner_choosiness'] = data['pid'].apply(
    get_partner_data, col='choosiness')
data['choose_diff'] = data['choosiness'] - data['partner_choosiness']
```
Expectations. Some of the most important features I engineered measured how well a speed dating partner met someone’s expectations. Among these, the most important turned out to be ambition — whether your date was ambitious enough for you. I actually measured this in several different ways, but here’s some sample code:
```python
data['partner_ambition'] = data['pid'].apply(
    get_partner_data, col='amb3_1')

# Self data & partner data were measured on different scales,
# hence the rescaling to a common 0-10 range here
data['amb_expectations'] = (
    10 * data['amb1_1'] / data['amb1_1'].max()) - data['partner_ambition']
```
You get the idea. I made many, many new features like this and then used some dimensionality reduction to get rid of the unimportant ones. Let’s move on to the fun part: the findings!
I made a few different visualizations, but this one was my favorite. The stereotype is that men care about looks more than women do, and women care about a man’s career. Does the data support this?
It sure does! But what I find interesting are the handful of men who really prioritize having an ambitious partner. Chill out, Wall Street!
The machine learning model
So, what about the model itself? I tried many different algorithms, and good ol’ XGBoost came out on top. After dozens of hours of grid searching, the best model had a max_depth of 3 and trained on only the top 54% of features.
I used AUC to score my model. If you’re unfamiliar with this metric, the ELI5 is that it ranks your model’s predictions from most certain of a match to most certain of a non-match, and then measures how consistently the pairs who actually matched end up ranked above the pairs who didn’t. (I love AUC for binary classification.)
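Equivalently, AUC is the probability that a randomly chosen actual match gets a higher predicted score than a randomly chosen non-match. A tiny worked example with made-up scores:

```python
from sklearn.metrics import roc_auc_score

y_true = [1, 0, 1, 0, 0]              # 1 = actual match
y_score = [0.9, 0.8, 0.6, 0.3, 0.1]   # model's predicted match probabilities

# Of the 2 x 3 = 6 (match, non-match) pairs, the match is ranked
# higher in 5, so AUC = 5/6 ≈ 0.833
print(roc_auc_score(y_true, y_score))
```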
My model got a final score of 0.864 on my testing data. With AUC, 1.0 represents perfection and 0.5 is random chance, so that’s some pretty great performance! Let’s visualize it:
That’s with 5 folds of cross-validation, so you can see the model consistently performs well.
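For reference, per-fold AUC scores like these come from a one-liner — again a sketch with a stand-in classifier and synthetic data rather than the actual model:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier  # stand-in for XGBoost
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=300, n_features=20, random_state=0)
model = GradientBoostingClassifier(n_estimators=50, random_state=0)

# One AUC score per fold; low variance across folds = consistent model
fold_aucs = cross_val_score(model, X, y, cv=5, scoring='roc_auc')
print(fold_aucs.round(3))
```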
So, what features turned out to be the most important in predicting a match?
The feature importances vary a bit from run to run, but the engineered features consistently account for about half of those deemed most important.
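Pulling importances out of a fitted gradient-boosting model looks roughly like this (a sketch with synthetic data and made-up feature names; XGBoost’s sklearn wrapper exposes the same `feature_importances_` attribute):

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier

X, y = make_classification(n_samples=300, n_features=6,
                           n_informative=3, random_state=0)
names = [f'feature_{i}' for i in range(6)]  # hypothetical column names

model = GradientBoostingClassifier(random_state=0).fit(X, y)

# Importances are normalized to sum to 1; sort to see the top features
importances = pd.Series(model.feature_importances_, index=names)\
    .sort_values(ascending=False)
print(importances.head())
```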
This means that the matchmaking data is there, it just needed to be transformed into something usable.
The bottom line
With an AUC of 0.864, I think it’s fair to say we’ve cracked the code of matchmaking.
Love is supposed to be this magical, ineffable force, but it turns out computers can probably predict love better than we can ourselves!
I’m getting ahead of myself. Sorry, I get excited when I see results like this.
The algorithm can predict dates, but whether machine learning can predict successful relationships remains an open question. As a psychologist with some background in this subject matter, I’m about 90% sure that a relationship model would have comparable performance. The factors that predict relationship satisfaction and divorce are pretty easily quantifiable. Empathy, communication style, mental health, and socioeconomic status are the big ones.
This dataset can easily be adapted to create a recommender system. I’m finishing it up as I type this, and I’ll be sure to share those results as well. Long story short, the recommender system uses collaborative filtering and obtains a similar AUC score. The benefit of this alternative approach is that it’s faster and may not require as much feature engineering.
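For a flavor of what collaborative filtering means here (a toy sketch, not the actual recommender): treat each participant’s yes/no decisions about partners as a ratings row, then score how similar two participants’ tastes are — similar raters drive each other’s recommendations.

```python
import numpy as np
import pandas as pd

# Toy participant x partner decision matrix (1 = wanted to see again);
# the iid/pid labels mirror the dataset's ID columns
decisions = pd.DataFrame(
    [[1, 0, 1],
     [1, 0, 0],
     [0, 1, 1]],
    index=['iid_1', 'iid_2', 'iid_3'],
    columns=['pid_a', 'pid_b', 'pid_c'],
)

def cosine_similarity(a, b):
    denom = np.linalg.norm(a) * np.linalg.norm(b)
    return float(a @ b / denom) if denom else 0.0

# How similar is participant 1's taste to everyone else's?
sims = {other: cosine_similarity(decisions.loc['iid_1'].values,
                                 decisions.loc[other].values)
        for other in decisions.index if other != 'iid_1'}
print(sims)
```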
My success with this project gave me the confidence to launch a social network called Metta. Can a personality test help you find new friends or dates? It turns out it totally can. Join now and get your personality analyzed along 30+ dimensions for free.
Thanks for reading!