Model performance metrics for binary models
Written by Ori Sagi
Updated over a week ago

In a binary classification model, results are classified into two mutually exclusive populations that can be labeled (e.g. “converted” vs. “did not convert”). In Pecan, there will be a prediction of either 1 or 0 in the “Label” column for each entity. There can be no other classifications like “null”, “2” or “$75”.

(That being said: behind the scenes, each binary model does indeed produce a score between 0 and 1. However, your model’s threshold setting determines the final prediction – whether each entity is assigned a prediction of 0 or 1.)
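As a rough illustration of how a threshold turns scores into 0/1 predictions, here is a minimal Python sketch. The function name `apply_threshold`, the sample scores, and the choice to treat a score equal to the threshold as a 1 are all assumptions made for this example, not a description of Pecan's internals.

```python
def apply_threshold(scores, threshold=0.5):
    """Convert scores between 0 and 1 into binary predictions.

    Assumption for this sketch: a score equal to the threshold counts as 1.
    """
    return [1 if s >= threshold else 0 for s in scores]


# Hypothetical scores for four entities.
scores = [0.10, 0.45, 0.50, 0.80]

labels_default = apply_threshold(scores)        # threshold of 0.5
labels_strict = apply_threshold(scores, 0.75)   # a more conservative threshold

print(labels_default)  # [0, 0, 1, 1]
print(labels_strict)   # [0, 0, 0, 1]
```

The same scores produce different predictions under different thresholds, which is why the threshold-dependent metrics below change when the threshold moves.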

Below is a list of metrics that are useful for evaluating the results of a binary model in Pecan:

# Base Rate

The base rate describes the percentage of the population that currently exhibits the target behavior (or meets certain criteria). Therefore, in a predictive model, it refers to the rate of target behavior within the training data.

In other words, it indicates the probability that members of the population will carry out the behavior we want to predict in the absence of other information.
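The base rate can be sketched as a simple proportion. In the Python snippet below, `base_rate` and the sample `actuals` list are hypothetical names and data used only for illustration.

```python
def base_rate(actuals):
    """Share of entities that exhibit the target behavior (actual label 1)."""
    return sum(actuals) / len(actuals)


# Hypothetical training labels: 2 of 10 entities show the target behavior.
actuals = [1, 0, 0, 0, 1, 0, 0, 0, 0, 0]

rate = base_rate(actuals)
print(f"{rate:.0%}")  # 20%
```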

# Precision Rate

The precision rate indicates how precise a model was in predicting instances of the target behavior. In other words, it calculates the percentage of positive predictions that turned out to be correct:

Precision rate = Detected correctly / (Detected correctly + Detected incorrectly)

This formula answers the question: “When you predict a positive outcome, how often are you correct?”

For example, if a model predicts that 1,000 people will become High-Value Customers, but only 800 of them actually do so, then the model’s precision rate is 80%.
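The example above can be worked through in a few lines of Python. This is a minimal sketch: `precision_rate` is a hypothetical helper, and returning `None` when there are no positive predictions is a choice made for this example.

```python
def precision_rate(detected_correctly, detected_incorrectly):
    """Share of positive predictions that turned out to be correct."""
    predicted_positive = detected_correctly + detected_incorrectly
    if predicted_positive == 0:
        return None  # precision is undefined without positive predictions
    return detected_correctly / predicted_positive


# 1,000 people predicted to become High-Value Customers; 800 actually did.
precision = precision_rate(detected_correctly=800, detected_incorrectly=200)
print(f"{precision:.0%}")  # 80%
```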

This metric is important in cases where taking action has a significant cost or risk, and we want to make sure we are acting only on the right population. Picking a conservative threshold will improve precision, typically at the expense of detection (see below).

# Detection Rate (a.k.a. Recall Rate)

The detection rate indicates a model’s ability to detect instances of target behavior. In other words, it calculates the percentage of all target behavior that was correctly predicted:

Detection rate = Detected correctly / (Detected correctly + Ignored incorrectly)

This formula answers the question: “How much of a particular outcome did you manage to predict?”

For example, if 1,000 people became High-Value Customers, but the model predicted only 400 of them to do so, then the model’s detection rate is 40%.
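The same example can be expressed as a short Python sketch. The helper name `detection_rate` is hypothetical, and returning `None` when there are no actual positives is an assumption made for this example.

```python
def detection_rate(detected_correctly, ignored_incorrectly):
    """Share of actual target behavior that the model predicted."""
    actual_positive = detected_correctly + ignored_incorrectly
    if actual_positive == 0:
        return None  # detection is undefined without actual positives
    return detected_correctly / actual_positive


# 1,000 people became High-Value Customers; the model predicted 400 of them.
detection = detection_rate(detected_correctly=400, ignored_incorrectly=600)
print(f"{detection:.0%}")  # 40%
```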

This metric is important in cases where not acting has a significant cost or risk, and you want to miss as few positive instances as possible. Picking a liberal threshold will improve this metric, typically at the expense of precision (see above).

# The trade-off between precision and detection

As explained above and as illustrated in the Venn diagram below, adjusting a model’s threshold affects the balance between precision and detection.

A higher precision rate means that your model is correct more often when it predicts the target behavior. However, this typically comes at the expense of the detection rate, since fewer instances of the target behavior will be detected overall.

Conversely, a higher detection rate means that your model will detect more instances of the target behavior overall. But this comes at the expense of the precision rate, since there will be more incorrect detections of the target behavior.
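To make the trade-off concrete, the following Python sketch computes both metrics at three thresholds on a small made-up dataset. The scores, labels, and function name are hypothetical, and the rule "score >= threshold means a prediction of 1" is an assumption of this example. On real data the trend is usually similar, though precision does not always rise strictly with every threshold increase.

```python
def precision_and_detection(scores, actuals, threshold):
    """Return (precision, detection) when scores >= threshold are predicted as 1."""
    predictions = [1 if s >= threshold else 0 for s in scores]
    pairs = list(zip(predictions, actuals))
    detected_correctly = sum(1 for p, a in pairs if p == 1 and a == 1)
    detected_incorrectly = sum(1 for p, a in pairs if p == 1 and a == 0)
    ignored_incorrectly = sum(1 for p, a in pairs if p == 0 and a == 1)
    predicted_positive = detected_correctly + detected_incorrectly
    actual_positive = detected_correctly + ignored_incorrectly
    precision = detected_correctly / predicted_positive if predicted_positive else None
    detection = detected_correctly / actual_positive if actual_positive else None
    return precision, detection


# Hypothetical model scores and actual outcomes for ten entities.
scores = [0.95, 0.90, 0.80, 0.70, 0.60, 0.40, 0.30, 0.20, 0.10, 0.05]
actuals = [1, 1, 0, 1, 0, 1, 0, 0, 0, 0]

results = {t: precision_and_detection(scores, actuals, t) for t in (0.25, 0.50, 0.85)}
for threshold, (precision, detection) in results.items():
    print(f"threshold={threshold:.2f}  precision={precision:.2f}  detection={detection:.2f}")
# threshold=0.25  precision=0.57  detection=1.00
# threshold=0.50  precision=0.60  detection=0.75
# threshold=0.85  precision=1.00  detection=0.50
```

In this toy data, raising the threshold from 0.25 to 0.85 lifts precision from 57% to 100% while detection drops from 100% to 50%.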

# Venn diagram (predicted vs. actual performance)

Pecan uses a Venn diagram to compare your model’s predictions against actual outcomes. It communicates both the precision and detection of your model, illustrating the percentage of correct and incorrect predictions in either direction.

It appears in the dashboard for a binary model.

Here’s how to interpret the colors of the Venn Diagram:

• Detected correctly (purple) – the target behavior was correctly predicted to occur.

• Detected incorrectly (turquoise) – the target behavior was incorrectly predicted to occur.

• Ignored incorrectly (red) – the target behavior occurred, but was not predicted to do so.

• Ignored correctly (gray) – the target behavior was correctly predicted not to occur.

So, for example, if this Venn diagram were for a churn model, “detected correctly” (the purple area) would represent instances where a customer was correctly predicted to churn.

The exact numbers for each result appear on the right side of the diagram.
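The four categories in the diagram can be counted from pairs of predictions and actual outcomes. The Python sketch below uses a hypothetical `venn_counts` helper and made-up data, and it is only meant to show how each entity maps to one category.

```python
def venn_counts(predictions, actuals):
    """Count entities in each of the four predicted-vs-actual categories."""
    counts = {
        "detected_correctly": 0,    # predicted 1, actual 1 (purple)
        "detected_incorrectly": 0,  # predicted 1, actual 0 (turquoise)
        "ignored_incorrectly": 0,   # predicted 0, actual 1 (red)
        "ignored_correctly": 0,     # predicted 0, actual 0 (gray)
    }
    for predicted, actual in zip(predictions, actuals):
        if predicted == 1 and actual == 1:
            counts["detected_correctly"] += 1
        elif predicted == 1 and actual == 0:
            counts["detected_incorrectly"] += 1
        elif predicted == 0 and actual == 1:
            counts["ignored_incorrectly"] += 1
        else:
            counts["ignored_correctly"] += 1
    return counts


# Hypothetical churn predictions and outcomes for eight customers.
predictions = [1, 1, 1, 0, 0, 0, 0, 0]
actuals = [1, 1, 0, 1, 0, 0, 0, 0]

counts = venn_counts(predictions, actuals)
print(counts)
```

Here two customers were correctly predicted to churn, one was incorrectly predicted to churn, one churned without being predicted, and four were correctly predicted to stay.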

The numbers behind your Venn diagram give rise to the precision and detection metrics of your model. As your model’s sensitivity threshold varies, so too does the ratio of correct and incorrect predictions.

For a model to be considered good (i.e. have relatively high precision and accuracy), the majority of your population should fall within the purple and gray areas.

However, model performance will vary depending on where you set its sensitivity threshold. This cutoff point should be established based on your unique business goals (see Understanding threshold logic).

# Accuracy Rate

Accuracy rate indicates how frequently your model is correct overall, but not what kind of mistakes it makes. It does not distinguish between false-positive and false-negative predictions, even though they can have a strong impact on the usefulness of a model:

Accuracy rate = (Detected correctly + Ignored correctly) / Total number of predictions

Unfortunately, the goal of simply maximizing correct predictions is generally misguided. Accuracy is not always a useful measure of evaluation, especially in imbalanced datasets where instances of target behavior are rare (and thus most negative predictions are bound to be correct).

For example, say you have a model designed to detect fraudulent activity (which comprises only 0.1% of all customer activity). Now let’s say the model predicts that all activity is non-fraudulent. This means your model would be correct 99.9% of the time (a 99.9% accuracy rate). But this would still be a poor model, because it would not detect a single instance of fraud, and the accuracy rate tells you nothing about how good the model is at actually predicting fraudulent activity.

If, however, the dataset were more balanced in terms of target behavior (i.e. a higher base rate), such as 45% of customer activity being fraudulent, the metric would be more sensitive to mistakes, and thus more useful.

Generally, it's difficult to use accuracy rate to guide actionable business goals and measure model impact, so it is not a useful metric for most use-cases.
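The fraud example can be reproduced with a short Python sketch. The population size of 100,000 is an assumption chosen so that 0.1% corresponds to a whole number of fraudulent activities; the variable names are hypothetical.

```python
total = 100_000
fraud_count = 100  # 0.1% of all activity is fraudulent

actuals = [1] * fraud_count + [0] * (total - fraud_count)
predictions = [0] * total  # the model predicts "non-fraudulent" for everything

# Accuracy rate = (Detected correctly + Ignored correctly) / Total number of predictions
correct = sum(1 for p, a in zip(predictions, actuals) if p == a)
accuracy = correct / total

# Detection rate = Detected correctly / (Detected correctly + Ignored incorrectly)
detected_correctly = sum(1 for p, a in zip(predictions, actuals) if p == 1 and a == 1)
detection = detected_correctly / fraud_count

print(f"accuracy={accuracy:.1%}  detection={detection:.1%}")
# accuracy=99.9%  detection=0.0%
```

The model scores 99.9% on accuracy while detecting none of the fraud, which is why accuracy alone can be misleading on imbalanced data.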

# AUC (Area Under the Curve)

Area Under the Curve demonstrates the diagnostic ability of your model as its threshold is varied. It allows for the statement: “When the rate of false-positive predictions is X, the rate of true-positive predictions is Y.”

This helps you understand how well a model separates positive and negative instances of target behavior. The AUC score is obtained by calculating the area under an ROC curve, and unlike precision, detection, and accuracy, it is threshold-agnostic.

The area under the curve is a value between 0 and 1. A score of 1.0 represents a perfect predictive score.

Meanwhile, a score of 0.5 is equivalent to random guessing. (This means the model is no better than chance at ranking a positive instance above a negative one.)

In predictive modeling, an AUC score between 0.65 and 0.95 is generally considered acceptable. But the score you should aim for within that range will depend on your unique business needs (see What should your AUC be?).
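One way to build intuition for AUC is its rank-based interpretation: the area under the ROC curve equals the probability that a randomly chosen positive entity receives a higher score than a randomly chosen negative one, with ties counted as half. The Python sketch below computes AUC that way on made-up data. The function name `auc_score` is hypothetical, and this pairwise approach is an illustration, not necessarily how Pecan computes the metric.

```python
def auc_score(scores, actuals):
    """AUC as the share of (positive, negative) pairs ranked correctly; ties count 0.5."""
    positives = [s for s, a in zip(scores, actuals) if a == 1]
    negatives = [s for s, a in zip(scores, actuals) if a == 0]
    if not positives or not negatives:
        raise ValueError("AUC needs at least one positive and one negative example")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


actuals = [1, 1, 0, 0]

perfect = auc_score([0.9, 0.8, 0.2, 0.1], actuals)        # every positive outranks every negative
uninformative = auc_score([0.5, 0.5, 0.5, 0.5], actuals)  # identical scores carry no information
partial = auc_score([0.9, 0.4, 0.6, 0.1], actuals)        # 3 of 4 pairs ranked correctly

print(perfect, uninformative, partial)  # 1.0 0.5 0.75
```

Note that no threshold appears anywhere in the calculation, which is what makes the metric threshold-agnostic.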

In Pecan, AUC is the default performance metric displayed in the dashboard for binary models.

For a deeper dive into this metric, see Understanding Area Under the Curve (AUC).

# LogLoss (Logarithmic Loss)

LogLoss is an advanced metric that’s commonly used for predictive models. It helps differentiate between “good” and “poor” models by indicating how close their predictions are to the actual values (to 0 or 1 in the case of binary classification).

A model with perfect predictive ability would have a LogLoss of 0. And the more predictions diverge from the actual values, the higher the score.

What’s unique about LogLoss is that it penalizes a model more for making bigger mistakes than for making smaller mistakes. This is useful for binary models since it tends to push predictions towards higher and lower values that are more clearly distinguishable (as opposed to a flatter prediction curve where many entities receive a similar value).

This is a lot more helpful when trying to identify whom to perform a business treatment on. It’s also why, unlike accuracy rate, this metric is more robust for evaluating unbalanced datasets.

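This is how LogLoss is calculated for binary classification:

LogLoss = −(1 / N) × Σ [ y × log(p) + (1 − y) × log(1 − p) ]

Here N is the number of predictions, y is the actual value (0 or 1), p is the score the model assigned (between 0 and 1), and the sum runs over all predictions.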

Since LogLoss penalizes a model more for bigger errors than for smaller errors, using it to optimize a model can help you be more certain of individual predictions. Of note, this metric is threshold-agnostic: its calculation is based on the model’s scores and does not depend on the threshold you’ve chosen for your model.
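The following Python sketch shows the standard binary LogLoss calculation, the average of −[y × log(p) + (1 − y) × log(1 − p)] over all predictions, on made-up data. The use of the natural logarithm, the clipping constant `eps` (which avoids taking the log of 0), and the function name `log_loss` are assumptions of this example rather than details confirmed for Pecan.

```python
import math


def log_loss(scores, actuals, eps=1e-15):
    """Average binary log loss; scores are clipped to [eps, 1 - eps] to avoid log(0)."""
    total = 0.0
    for p, y in zip(scores, actuals):
        p = min(max(p, eps), 1 - eps)
        total += -(y * math.log(p) + (1 - y) * math.log(1 - p))
    return total / len(actuals)


actuals = [1, 0]

confident_right = log_loss([0.9, 0.1], actuals)  # scores close to the actual values
unsure = log_loss([0.5, 0.5], actuals)           # scores that do not commit either way
confident_wrong = log_loss([0.1, 0.9], actuals)  # scores far from the actual values

print(f"{confident_right:.3f} {unsure:.3f} {confident_wrong:.3f}")
# 0.105 0.693 2.303
```

The confidently wrong scores are penalized far more heavily than the unsure ones, which reflects how LogLoss punishes bigger mistakes more than smaller ones. The calculation uses the raw scores only, so no threshold is involved.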