Understanding the F-Beta metric
Overview
scikit-learn describes the F-Beta score “as the weighted harmonic mean of precision and recall, reaching its optimal value at 1 and its worst value at 0” with the “beta parameter [determining] the weight of recall in the combined score.” It is one of the most common metrics used to evaluate the performance of binary, multiclass, and multilabel classifiers.
So what does all that mean?
In a nutshell, this metric tells you how good your classification model is based on the relative importance you ascribe to precision and recall in making that determination. Common beta values are 0.5 (precision is king), 1 (precision and recall are equally important), and 2 (recall is king).
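Concretely, the formula is F-Beta = (1 + beta^2) * (precision * recall) / (beta^2 * precision + recall): with beta below 1 precision dominates the mean, and with beta above 1 recall does. Here's a minimal sketch of how the same predictions score differently as beta changes, using scikit-learn's fbeta_score and some made-up labels:

from sklearn.metrics import fbeta_score

# Hypothetical binary labels: 1 = positive class, 0 = negative class
y_true = [1, 1, 1, 1, 1, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 1, 0, 1, 1, 0, 0, 0]  # one false negative, two false positives

# precision = 4/6 ≈ 0.67, recall = 4/5 = 0.80
print(fbeta_score(y_true, y_pred, beta=0.5))  # ≈ 0.69, leans toward precision
print(fbeta_score(y_true, y_pred, beta=1))    # ≈ 0.73, the familiar F1 score
print(fbeta_score(y_true, y_pred, beta=2))    # ≈ 0.77, leans toward recall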
When you look at the documentation, you’ll notice there are several other interesting arguments you can pass into it, two of the more mysterious being average and sample_weight. We’ll explore what they mean and how you may want to use them based on your dataset.
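As a quick preview, here's a hedged sketch (the labels are made up) of how average changes what gets returned for a multiclass problem, and how sample_weight lets individual examples count more or less:

from sklearn.metrics import fbeta_score

# Hypothetical 3-class labels: 0 = dog, 1 = cat, 2 = bird
y_true = [0, 0, 0, 1, 1, 2, 2, 2, 2, 2]
y_pred = [0, 0, 1, 1, 2, 2, 2, 2, 0, 2]

# average=None returns one score per class
print(fbeta_score(y_true, y_pred, beta=1, average=None))

# 'macro' averages the per-class scores, treating every class equally
print(fbeta_score(y_true, y_pred, beta=1, average='macro'))

# 'weighted' averages the per-class scores weighted by class frequency
print(fbeta_score(y_true, y_pred, beta=1, average='weighted'))

# 'micro' counts TP/FP/FN globally across all classes before scoring
print(fbeta_score(y_true, y_pred, beta=1, average='micro'))

# sample_weight makes some examples count more; here the bird rows count double
weights = [1, 1, 1, 1, 1, 2, 2, 2, 2, 2]
print(fbeta_score(y_true, y_pred, beta=1, average='macro', sample_weight=weights))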
Precision vs. Recall
The two metrics, along with other important terms, are described really well in this post. Let’s imagine a multiclass classification model that tries to determine whether a given photo is a picture of a dog, cat, or bird.
Precision
Definition: When your classifier predicted a label, how often was it correct?
Example: When you predicted ‘cat’, how often were you right?
Formula: True Positive (TP) / PREDICTED Label (TP + False Positive or FP)
# TP = number of cat predictions you got right
tp = 100
# FP = number of cat predictions you got wrong
fp = 10
precision = tp / (tp + fp)  # = 0.91
Recall
Definition: For every actual label in your dataset, how often did your classifier pick the correct one?
Example: When it’s actually ‘cat’, how often did it predict ‘cat’?
Formula: True Positive (TP) / ACTUAL Label (TP + False Negative or FN)
# TP = number of cat predictions you got right
tp = 100
# FN = number of actual cats you predicted as something else
fn = 5
recall = tp / (tp + fn)  # = 0.95
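Putting the two together: the F-Beta score is just the weighted harmonic mean of the numbers we computed above. A quick sketch using the same made-up counts:

tp = 100
fp = 10
fn = 5

precision = tp / (tp + fp)  # = 0.91
recall = tp / (tp + fn)     # = 0.95

# F-Beta = (1 + beta^2) * (precision * recall) / (beta^2 * precision + recall)
for beta in (0.5, 1, 2):
    fbeta = (1 + beta**2) * (precision * recall) / (beta**2 * precision + recall)
    print(beta, round(fbeta, 3))  # 0.5 -> 0.917, 1 -> 0.93, 2 -> 0.943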
Okay, so which one should I use?
This depends on your task.
If your task is to predict whether a patient has cancer given a set of symptoms and test results, it’s going to be far more important to you that all actual cancer patients get flagged, even at the expense of some non-cancer patients being flagged incorrectly. This is recall. In this particular kind of task, you’re also likely to face a dataset where the vast majority of examples are “not cancer,” a case where metrics like precision and accuracy will likely look really good but be completely misleading. Other examples where you want to maximize recall include fraud detection and network anomaly detection.
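To make the misleading-accuracy point concrete, here's a contrived sketch with made-up numbers, where a lazy classifier looks great by accuracy but terrible by recall:

from sklearn.metrics import accuracy_score, recall_score

# Hypothetical imbalanced dataset: 95 "not cancer" (0), 5 "cancer" (1)
y_true = [0] * 95 + [1] * 5
# A classifier that catches only one of the five actual cancer cases
y_pred = [0] * 95 + [1, 0, 0, 0, 0]

print(accuracy_score(y_true, y_pred))  # 0.96, looks great
print(recall_score(y_true, y_pred))    # 0.20, four of five cancer patients missed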
On the other hand, if your task is to predict whether an e-mail is spam or not (1 = spam, 0 = not spam), you recognize that it’s not the end of the world if your user gets the occasional junk e-mail. In fact, it would be worse if a legitimate e-mail got flagged as junk and they never saw it. Missing a few true cases is more acceptable than flagging good e-mail incorrectly. This is precision. Here, you’re more concerned that when your classifier says “spam,” it is actually right.
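With beta = 0.5, the score punishes false positives (real mail sent to junk) more than false negatives (spam slipping through). A hedged sketch with made-up labels:

from sklearn.metrics import fbeta_score

# Hypothetical labels: 1 = spam, 0 = not spam
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
pred_a = [1, 1, 1, 0, 0, 0, 0, 0, 0, 0]  # misses one spam (false negative)
pred_b = [1, 1, 1, 1, 1, 0, 0, 0, 0, 0]  # junks one real e-mail (false positive)

print(fbeta_score(y_true, pred_a, beta=0.5))  # ≈ 0.94
print(fbeta_score(y_true, pred_b, beta=0.5))  # ≈ 0.83, the false positive costs more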
What about our cats, dogs, birds?
Good question; again, it depends on the task. All things being equal, in this case we most likely care more about precision, or we care about both equally. Fortunately, the F-Beta metric gives us the power to evaluate our model however we choose to weight the two.
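For instance, here's a hedged sketch of scoring our hypothetical animal classifier with a precision-leaning beta (the predictions are made up):

from sklearn.metrics import fbeta_score

# Made-up predictions for the dog/cat/bird model
y_true = ['dog', 'dog', 'cat', 'cat', 'cat', 'bird', 'bird', 'bird']
y_pred = ['dog', 'cat', 'cat', 'cat', 'bird', 'bird', 'bird', 'dog']

# beta=0.5 leans toward precision; macro-average treats the classes equally
print(fbeta_score(y_true, y_pred, beta=0.5, average='macro'))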