My two favorite machine learning scoring metrics (and why)

I’ll preface this by saying that what works for me won’t necessarily work for everyone else. You might be solving machine learning problems very different from the ones I usually work with. I’m typically working with regression and binary classification problems, so my preferences reflect that. So if you already have a workflow that suits you, I’d say to stick with that.

With that disclaimer out of the way, I almost always use sklearn.metrics.r2_score for regression and sklearn.metrics.roc_auc_score for binary classification.
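
If you haven’t used them, both live in sklearn.metrics and both take the actual values first and the predictions (or predicted probabilities) second. Here’s a quick sketch with made-up numbers, just to show the calls:

from sklearn.metrics import r2_score, roc_auc_score

# Regression: actual values vs. predicted values
print(r2_score([3.0, 2.5, 7.1, 4.2], [2.8, 2.9, 6.5, 4.0]))

# Binary classification: actual labels vs. predicted probabilities
print(roc_auc_score([0, 1, 1, 0, 1], [0.2, 0.8, 0.4, 0.5, 0.9]))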

What these two metrics have in common is that they’re unit-agnostic. It doesn’t matter if you’re predicting the price of a house or the grade of a student, a dating match or an abusive tweet — you’ll always have an intuition of whether your model is doing a good job or not.

R-squared score

R-squared is used for regression problems: given X, predict the value of Y.

First let’s look at r before we square it. Pearson’s r is your correlation coefficient — in this case, the correlation between actual and predicted values.

Pearson’s r accounts for differences in scale. One variable might be measured from 1-10 while the other runs from 6000-10000, or one from 200-210 while the other spans 0.01-1.0. It doesn’t matter what scalar you multiply your measurements by; the correlation compares z-scores, so it’s always an apples-to-apples comparison.
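
To make that concrete, here’s a tiny sketch (toy numbers I made up) showing that stretching one variable onto a completely different scale doesn’t budge Pearson’s r:

import numpy as np

actual = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
predicted = np.array([1.2, 1.9, 3.3, 3.8, 5.1])

# Correlation on the original 1-5 scale...
r_original = np.corrcoef(actual, predicted)[0, 1]

# ...and after stretching one variable onto a 6000-10000 scale
r_rescaled = np.corrcoef(actual * 1000 + 5000, predicted)[0, 1]

print(r_original, r_rescaled)  # identical, up to floating-point noise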

Now, let’s square r. This is commonly called the coefficient of determination in statistics. Mathematically, it tells you what percentage of variance in the Y variable is predicted from X — or, in other words, what percentage of the variance in the actual values is accounted for in your model’s predicted values.
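
In code, that definition boils down to one minus the ratio of leftover (residual) variance to total variance. A quick sketch with toy numbers, checked against sklearn’s r2_score:

import numpy as np
from sklearn.metrics import r2_score

y_true = np.array([3.0, 2.5, 7.1, 4.2, 5.0])
y_pred = np.array([2.8, 2.9, 6.5, 4.0, 5.4])

# Variance the model failed to explain vs. total variance around the mean
ss_res = np.sum((y_true - y_pred) ** 2)
ss_tot = np.sum((y_true - y_true.mean()) ** 2)

print(1 - ss_res / ss_tot)       # coefficient of determination by hand
print(r2_score(y_true, y_pred))  # same number from sklearn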

Since r accounts for differences in scale, r2 does as well. This means that an r2 close to zero is (nearly) always bad and an r2 close to one is (nearly) always good. You don’t need to know what units your Y variable is measured in; it’s irrelevant to r2_score.

The only things that might change from project to project are the expectations and demands you place on this score. For one project, you might require nothing less than an r2 > 0.70, while for another project, r2 > 0.20 might be considered phenomenal.

But the number always means the same thing. That’s the beauty of it.

Before we move on to ROC-AUC, let me share some rules of thumb for how to interpret your r2:

R2 | Description
< 0 | Actively detrimental. Worse than random chance. Delete Python and burn your laptop.
0 | Useless
0.2 | Mediocre / Cusp of usability
0.3 | Okay / Satisfactory
0.5 | Good
0.7 | Very good
0.8 | Excellent
0.95 | Too good — Are you leaking data?

Remember! These can vary from project to project. But these guidelines have served me well over the years.

(Sidenote: How can r2 be negative? It happens when your model’s predictions fit the actual values worse than simply predicting the mean of Y for every observation.)
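
Here’s a tiny demonstration of that, again with toy numbers:

import numpy as np
from sklearn.metrics import r2_score

y_true = np.array([1.0, 2.0, 3.0, 4.0, 5.0])

# Predicting the mean for every observation scores exactly zero...
print(r2_score(y_true, np.full(5, y_true.mean())))             # 0.0

# ...so a model that does worse than the mean goes negative
print(r2_score(y_true, np.array([5.0, 4.0, 3.0, 2.0, 1.0])))   # -3.0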

ROC-AUC Score

ROC-AUC is used for binary classification problems. It stands for Receiver Operating Characteristic – Area Under the Curve.

The number reflects the relationship between your true positive rate and your false-positive rate — that is, the positive classes you correctly identified vs. the negative classes you misidentified as positives.
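
In other words, with some made-up counts just to pin down the arithmetic:

# Made-up confusion-matrix counts, purely for illustration
tp, fn = 30, 10   # actual positives: caught vs. missed
fp, tn = 5, 55    # actual negatives: wrongly flagged vs. correctly cleared

true_positive_rate = tp / (tp + fn)    # 0.75: share of real positives you caught
false_positive_rate = fp / (fp + tn)   # ~0.08: share of real negatives you flagged anyway
print(true_positive_rate, false_positive_rate)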

Here’s why I love ROC-AUC so much (which I’ll just call AUC from now on):

Classification problems have the consistent dilemma of where to draw the line. Let’s use a smoke detector as an example. Should a smoke detector go off when:

  • There’s a 10% chance of a fire?
  • There’s a 1% chance of a fire?
  • There’s a 0.1% chance of a fire?
  • There’s a 0.01% chance of a fire?

It has to draw the line somewhere, and this same thing happens when you train a classifier to predict classes. Let’s look at some pseudocode:

clf = Classifier().fit(x_train, y_train)
pred = clf.predict(x_test)

print(pred)
>>> [0, 1, 1, 1, 0, 1, 0, 0, 1]

Now, you can try optimizing your classifier for avoiding false positives (precision) or for catching as many true positives as possible (recall), but I’ve never liked using the built-in sklearn scorers for this.

The beauty of AUC is that it works with your model’s class probabilities rather than class predictions. As we’ll see in a moment, this makes it much easier to answer the question of “where do we draw the line?”
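
Concretely, you score AUC on the output of predict_proba() rather than predict(). Here’s a self-contained sketch; make_classification and LogisticRegression are just stand-ins I picked for illustration, not anything from a real project:

from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

# Synthetic data and a stand-in classifier
X, y = make_classification(n_samples=500, random_state=0)
x_train, x_test, y_train, y_test = train_test_split(X, y, random_state=0)
clf = LogisticRegression(max_iter=1000).fit(x_train, y_train)

print(clf.predict(x_test)[:5])           # hard 0/1 classes
print(clf.predict_proba(x_test)[:5, 1])  # probabilities of the positive class

# AUC is scored on the probabilities, not the hard classes
print(roc_auc_score(y_test, clf.predict_proba(x_test)[:, 1]))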

Imagine you’re predicting 10 observations and there are 4 positive classes and 6 negative classes. You put your model’s predictions into a dataframe and get something like this:

Predicted | Actual | True positives found | False positives found | True positive % found | False positive % found
0.99 | 1 | 1 | 0 | 25% | 0%
0.95 | 1 | 2 | 0 | 50% | 0%
0.91 | 0 | 2 | 1 | 50% | 16.67%
0.89 | 1 | 3 | 1 | 75% | 16.67%
0.74 | 0 | 3 | 2 | 75% | 33.33%
0.65 | 0 | 3 | 3 | 75% | 50%
0.41 | 1 | 4 | 3 | 100% | 50%
0.35 | 0 | 4 | 4 | 100% | 66.67%
0.28 | 0 | 4 | 5 | 100% | 83.33%
0.09 | 0 | 4 | 6 | 100% | 100%

The last 4 columns (after Predicted and Actual) represent cumulative numbers. You may also find it a little weird to call a 9% prediction a “false positive” — in fact, it sounds like the model is awfully confident that it’s a negative class, right?

Well, you’re not entirely wrong, but remember these are probabilities, and strictly speaking, the model isn’t assigning a class to any of its predictions. So you might prefer to think of this as “this would be a true positive if the model predicted 1,” and “this would be a false positive if the model predicted 1.”

Maybe you’re starting to see what I mean by “where do we draw the line?” If your model was predicting classes, it would probably assign any probability above 50% a 1, and anything below 50% a 0.

But, now that you have this table, you can decide for yourself whether it’s more important to catch true positives or to avoid false positives.

Let’s say these predictions represent criminal trials. You absolutely do not want to imprison an innocent person, so you determine that you’ll only predict a 1 if the probability is above 0.95. That’s optimizing for precision.

But then imagine these numbers represent cancer tests. You want to bring people in for additional tests if there’s even a small chance they have cancer. So, based on this data, you might draw the line and predict a 1 if the probability is above 0.41. (In reality, you’d probably go even lower than that.) That’s optimizing for recall.

That’s why I love AUC. You get much more nuance when working with probabilities, and you can easily convert those numbers to 0s and 1s using np.where():

import numpy as np

# The index [:, 1] grabs just the probability of the positive class,
# which is exactly what you need
preds = clf.predict_proba(x_test)[:, 1]

# np.where(condition, value if the condition is true, value otherwise)
pred_classes = np.where(preds > 0.41, 1, 0)

# Alternatively, do a list comprehension
pred_classes = [1 if p > 0.41 else 0 for p in preds]

At this point, you can switch back to using a precision or recall score, or create a confusion matrix.
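
For example, taking the ten predictions from the table above and drawing the line just under 0.89 (so the top four probabilities become 1s), those sklearn calls look roughly like this:

from sklearn.metrics import confusion_matrix, precision_score, recall_score

# The ten actual labels from the table, and the classes you'd get
# by predicting 1 for the top four probabilities
actual = [1, 1, 0, 1, 0, 0, 1, 0, 0, 0]
pred_classes = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]

print(precision_score(actual, pred_classes))   # 0.75: 3 of the 4 predicted 1s were real
print(recall_score(actual, pred_classes))      # 0.75: 3 of the 4 real 1s were caught
print(confusion_matrix(actual, pred_classes))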

You also get the benefit of being able to rank your predictions in order of certainty. What if those predictions represent the probabilities of a person buying from you? You could call or email the leads at the top of your list first and work your way down. Then, once you get 10 no’s in a row, or something like that, you might pack up your stuff and go home, because at that point you’ve found all the most obvious customers.

That’s why ROC-AUC is so awesome. But there’s one more reason, and it’s what we discussed earlier: the ROC-AUC score, like r-squared, is always on the same scale. Your ROC-AUC is realistically going to land between 0.5 and 1.0 every time you calculate it, which means I can give you more guidelines for how to interpret it!

AUC | Description
< 0.50 | Actively detrimental. Try doing the exact opposite of what it says and maybe you’ll be on to something.
0.50 | Useless / Blindfolded monkey throwing darts
0.55 | Mediocre / Cusp of usability
0.65 | Okay / Satisfactory
0.70 | Good
0.75 | Very good
0.80 | Excellent
0.90 | Too good — Are you leaking data?

Another way to think of it is to add 10-15 percentage points to your score and that’s roughly your model’s letter grade. For example, I’d give a “B” to a model that scores 0.70.

But what do those numbers meeeean? ლ(ಠ益ಠლ)

That’s the last thing we’ll talk about before wrapping this thing up!

First, we can take our model’s predicted probabilities and graph the rate at which we find (would-be) true positives and false positives. Just like we saw in the table I created earlier.
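
If you want to build that curve yourself, sklearn.metrics.roc_curve does the same cumulative bookkeeping as the table. A sketch using those ten toy predictions:

import numpy as np
from sklearn.metrics import roc_auc_score, roc_curve

# The ten predictions and actual labels from the table earlier
probs = np.array([0.99, 0.95, 0.91, 0.89, 0.74, 0.65, 0.41, 0.35, 0.28, 0.09])
actual = np.array([1, 1, 0, 1, 0, 0, 1, 0, 0, 0])

# roc_curve walks down the sorted probabilities and returns the cumulative
# false positive % and true positive % at every possible threshold
fpr, tpr, thresholds = roc_curve(actual, probs, drop_intermediate=False)
print(np.round(tpr, 4))   # the "True positive % found" column, plus a starting 0
print(np.round(fpr, 4))   # the "False positive % found" column, plus a starting 0

print(roc_auc_score(actual, probs))  # about 0.83 for this toy table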

Here’s the graph of one of the highest ROC-AUC scores I ever obtained: 0.864 when predicting speed-dating matches:

So when we say “area under the curve,” we really do mean the area under the curve. The red dotted line that represents “luck” is an AUC of 0.5. It’s what would happen if we just shuffled our predictions into a random order (rather than in order of descending probability).

Conversely, a perfect AUC of 1.0 would be a blue line that shoots straight up and then cuts across to the right. You’d find every single true positive before encountering a false positive. And, of course, that’s crystal-ball territory that never happens in the real world of machine learning. That’s why I say to check for data leakage once you’re scoring above 0.9. I was extremely skeptical of my score of 0.864 here, but I’ve triple-checked my work and can confirm it’s legit.

Summary

Try using sklearn.metrics.r2_score the next time you’re doing a regression problem, and sklearn.metrics.roc_auc_score the next time you’re performing binary classification.

Once you build up enough experience with these, you’ll develop an intuition of what a good model looks like. You’ll mentally compare each score to the ones you’ve seen before. And that will help you decide whether you’ve hit your target or if there’s more work to be done.
