Overfitting occurs when your predictions correspond too closely (or even exactly) to your training data, and as a result, your model is unable to predict future observations reliably. A model is said to be overfitting if it’s more accurate in fitting known data than it is in predicting new data.

A simple example

Say you have a customer transaction table that includes the purchasers, the items bought, and the date and time of each purchase. If your model were to use the date and time of purchase during training, your predictions would fit the training set perfectly – but the model wouldn’t be able to generalize to new data, because those past times will never occur again.

In other words: if your model is too specific, good matches will be excluded from the results.
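To make the timestamp example concrete, here is a toy sketch in Python. It is not how Pecan trains models: the "model" is just a lookup table keyed on purchase time, and all data and names (`train`, `new_data`, `predict`) are made up for illustration.

```python
# Toy illustration: a "model" that memorizes purchase timestamps.
# All rows and labels below are invented for the example.
train = [
    ("2023-01-05 10:15", "buys_again"),
    ("2023-01-06 14:30", "churns"),
    ("2023-01-07 09:00", "buys_again"),
    ("2023-01-08 18:45", "churns"),
]
new_data = [
    ("2023-02-01 11:00", "buys_again"),
    ("2023-02-02 16:20", "churns"),
]

# "Training" is nothing more than storing each timestamp with its outcome.
lookup = {timestamp: outcome for timestamp, outcome in train}


def predict(timestamp, default="churns"):
    # Timestamps that were never seen fall back to a fixed guess.
    return lookup.get(timestamp, default)


def accuracy(rows):
    hits = sum(predict(timestamp) == outcome for timestamp, outcome in rows)
    return hits / len(rows)


train_accuracy = accuracy(train)     # every timestamp was memorized
new_accuracy = accuracy(new_data)    # none of these timestamps was seen before
print(train_accuracy, new_accuracy)
```

The lookup table scores perfectly on the rows it memorized, but on new timestamps it can only fall back to its default guess, which is the gap between fitting known data and predicting new data described above.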

What causes overfitting?

In Pecan, overfitting occurs when you have too many columns of data (“Attributes”) given the number of records, or when your training set has an inherent pattern that's unique to that set and not to the entire dataset.

As a result, the model adjusts to specific features of the training data that have no causal relation to the target activity, or trains itself on information that’s not relevant to future datasets. So while the model’s performance on training data may be exceptional, it comes at the expense of performance on unseen data.

How to identify overfitting in Pecan

  1. In the dashboard for your Pecan model, check the metrics of your test set by clicking “Technical details” at the top of the screen.

    If your “Holdout AUC” is above 0.95, this is a strong indicator of overfitting. The ideal range for a predictive model in Pecan is generally 0.65-0.95, although a number above 0.90 may also be cause for suspicion. (Note: “holdout data” is the 10% of your dataset that’s used to test the model once it’s trained and validated.)

    Why is this the case? Your model’s AUC (Area Under the Curve) reflects the diagnostic ability of your model. A value approaching 1.0 means the model achieves a high true positive rate together with a low false positive rate. This may sound ideal, but scores this high on real-world data often indicate that the model has picked up on patterns that won’t hold for new data – in which case it will not generalize well and will be unable to perform the predictions it was intended for.

  2. In Pecan, the model-training process is divided into three stages: training, validation and testing. The Precision Rate and Recall Rate for the training and validation sets should be quite close to those of the test set. If that’s not the case (e.g. 65% for the training set and 10% for the test set), this suggests overfitting. This discrepancy indicates that the model performs better on training data than it will on future data.

    If you suspect overfit, talk to a Pecan expert, who will compare your test metrics (which appear in your dashboard) against your training and validation metrics (which do not appear in your dashboard).
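The two checks above can be sketched in plain Python. This is only an illustration under stated assumptions, not Pecan's implementation: it assumes you already have labels, scores, and metric values available outside the dashboard, and the `max_gap` threshold of 0.10 is an arbitrary value chosen for the example. The 0.95 and 0.90 AUC cut-offs are the ones mentioned in this article.

```python
def auc(labels, scores):
    """AUC as the share of (positive, negative) pairs ranked correctly.

    Ties between a positive and a negative score count as half a win.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative label")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


def precision_recall(labels, predictions):
    """Precision and recall for binary labels and binary predictions."""
    pairs = list(zip(labels, predictions))
    tp = sum(1 for y, p in pairs if y == 1 and p == 1)
    fp = sum(1 for y, p in pairs if y == 0 and p == 1)
    fn = sum(1 for y, p in pairs if y == 1 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


def overfit_warnings(holdout_auc, train_metric, test_metric, max_gap=0.10):
    """Collect warning signs; max_gap is an assumed, adjustable threshold."""
    warnings = []
    if holdout_auc > 0.95:
        warnings.append("holdout AUC above 0.95")
    elif holdout_auc > 0.90:
        warnings.append("holdout AUC above 0.90 (worth a closer look)")
    if train_metric - test_metric > max_gap:
        warnings.append("train metric far above test metric")
    return warnings


# Made-up holdout labels and model scores.
labels = [1, 1, 0, 0, 1, 0]
scores = [0.9, 0.8, 0.7, 0.3, 0.6, 0.2]
holdout_auc = auc(labels, scores)

# The 65% vs. 10% precision example from step 2, with a suspicious AUC.
flags = overfit_warnings(holdout_auc=0.97, train_metric=0.65, test_metric=0.10)
print(holdout_auc, flags)
```

Computing AUC by counting correctly ranked pairs keeps the sketch dependency-free; for large datasets a library routine would be the more practical choice.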

How to resolve overfitting

In Pecan, the solution is to reduce the number of attribute columns fed into your model. This helps you achieve a leaner model that’s more generalizable to future data.

Generally, you will want to remove attribute columns if:

  • They are unlikely to be causally related to your predictions

  • They are representative of your training set, but not your entire dataset and/or future data

  • Their impact on your model (as reflected by “Feature Importance”) seems unusual

One approach is to reduce your input data to 10-15 important attributes, and then consider adding more attributes once overfit is resolved.
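As a rough sketch of that approach, the snippet below keeps the highest-ranked attributes from a table of importance scores and lets you manually exclude columns that match the criteria above. The attribute names, scores, and the `top_attributes` helper are all hypothetical; in practice you would read the values from your model’s “Feature Importance” and apply your own judgment.

```python
def top_attributes(importances, k=15, exclude=()):
    """Return up to k attribute names, highest importance first.

    `exclude` holds attributes you have decided to drop by hand, for example
    ones that cannot be causally related to the prediction.
    """
    candidates = {
        name: score for name, score in importances.items() if name not in exclude
    }
    ranked = sorted(candidates.items(), key=lambda item: item[1], reverse=True)
    return [name for name, _ in ranked[:k]]


# Hypothetical importance scores, for illustration only.
importances = {
    "days_since_last_purchase": 0.31,
    "total_spend_90d": 0.24,
    "purchase_timestamp": 0.22,   # unusually high for a raw timestamp
    "num_sessions_30d": 0.12,
    "country": 0.07,
    "browser_version": 0.04,
}

shortlist = top_attributes(importances, k=3, exclude={"purchase_timestamp"})
print(shortlist)
```

Ranking alone is not enough: an attribute like the raw timestamp can rank highly precisely because it is causing overfit, which is why the manual `exclude` step comes first.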

Pecan also provides you with an easy way to identify attributes (represented as “Features” in your model) that may be causing overfitting. To do so, follow these steps:

  1. Open the relevant model in Pecan, and click Code at the top of the screen.

  2. In the window that pops up, in the right-side panel, you will see an “Analyzer_json” that shows useful information from your model. (You can also download it by clicking “Download analyzer report”.)

  3. Press Ctrl+F and search for the word “overfit”. This will take you to a section of the JSON that lists which features are suspected to be causing overfit in your model.

    The features that may be causing overfit in a Pecan model appear there as a list named “overfit_features”.
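If you have downloaded the analyzer report, the same search can be done with a short Python script. The sketch below makes no assumption about where “overfit_features” sits in the JSON; it simply walks the whole structure looking for that key. The sample report text and the feature names in it are invented for illustration and do not reflect the real report layout.

```python
import json


def find_key(node, key):
    """Yield every value stored under `key` anywhere in nested JSON data."""
    if isinstance(node, dict):
        for name, value in node.items():
            if name == key:
                yield value
            yield from find_key(value, key)
    elif isinstance(node, list):
        for item in node:
            yield from find_key(item, key)


# Invented stand-in for a downloaded report; the real structure may differ.
report_text = """
{
  "model": {
    "diagnostics": {
      "overfit_features": ["purchase_timestamp", "order_id"]
    }
  }
}
"""
report = json.loads(report_text)

# Flatten every "overfit_features" list found into one list of names.
suspects = [
    feature
    for value in find_key(report, "overfit_features")
    for feature in value
]
print(suspects)
```

To run this against a real file, replace `json.loads(report_text)` with `json.load(...)` on the opened report file.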

If you would like help solving overfitting or identifying features in your model (a.k.a. attributes in your data) that may be causing it, be sure to reach out to your Pecan expert.
