What is overfitting?

Overfitting is when a model memorizes training data instead of learning patterns. Resolve it by reducing attributes and adding more data.

Written by Linor Ben-El
Updated over a week ago

Have you ever studied for a test and memorized everything instead of understanding it?

If the test contains the exact questions you've memorized, you'll get an A, but with new questions you've never seen before... let's say you won't be at the top of the class.

That's what happens when a model overfits - it means the model has memorized the answers instead of actually learning how to solve the problem.

Overfitting occurs when your model's predictions correspond too closely (or even exactly) to the entities in the training data set. As a result, the model is unable to predict accurately for new, unseen entities.
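To make the memorizing-versus-learning contrast concrete, here is a minimal sketch in plain Python. Everything in it is made up for illustration: the toy data, the `MemorizingModel` class, and the `threshold_rule` function are hypothetical and are not part of Pecan. The memorizer scores perfectly on the entities it has seen but stumbles on new ones, while a simple learned rule holds up on both.

```python
# Toy data: (monthly_logins, churned). Hypothetical values for illustration only.
train = [(1, True), (2, True), (3, True), (8, False), (9, False), (10, False)]
new_entities = [(0, True), (4, True), (7, False), (12, False)]


class MemorizingModel:
    """Stores every training example and can only answer for inputs it has seen."""

    def __init__(self, examples):
        self.lookup = dict(examples)

    def predict(self, logins):
        # Unseen input: fall back to a fixed guess, like a student facing a new question.
        return self.lookup.get(logins, False)


def threshold_rule(logins):
    """A simple learned pattern: customers with few logins tend to churn."""
    return logins < 5


def accuracy(predict, examples):
    """Fraction of examples for which the prediction matches the label."""
    correct = sum(1 for x, label in examples if predict(x) == label)
    return correct / len(examples)


memorizer = MemorizingModel(train)
memorizer_train_acc = accuracy(memorizer.predict, train)       # 1.0
memorizer_new_acc = accuracy(memorizer.predict, new_entities)  # 0.5
rule_train_acc = accuracy(threshold_rule, train)               # 1.0
rule_new_acc = accuracy(threshold_rule, new_entities)          # 1.0
```

The gap between the memorizer's score on the training data and its score on new entities is the telltale sign of overfitting.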

You can think of a binary model as a line that separates two groups (e.g. churn and non-churn) and a regression model as a line that tries to go through all the right points. When a model overfits, it looks too good to be true. It is too precise:

[Figure: an overfit model. Image source: "Overfitting - MATLAB & Simulink"]

When a model underfits, it oversimplifies the problem, so it performs poorly even on the training data.

What causes overfitting?

  1. Too many attributes:
    Too many columns of data (“Attributes”) may cause the model to try to use all of them to make predictions, even if some are not useful for the task. This can make the model overly complex and cause it to fit the training data too closely.

  2. Not enough entities:
    If there aren't enough entities for the model to learn from, it might overfit by learning those few entities too well and failing to generalize to new data.

Proactive Overfit Prevention in Pecan

At Pecan, we have a unique way of detecting overfitting. We compare performance across the training, validation, and test sets. If the gap exceeds 10%, we raise an alert, which is available in the model's dashboard. This approach helps us ensure that the model's predictions are not just accurate on historical data but will also remain robust for future, unseen data.
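If you'd like to run a similar check on your own numbers, the comparison can be sketched in a few lines of Python. This is not Pecan's implementation: the article does not say which metric is used or whether the 10% gap is relative or absolute, so this sketch assumes a higher-is-better metric and a relative drop from the training score. The function names `overfit_gap` and `is_overfit_suspected` are hypothetical.

```python
def overfit_gap(train_score, other_score):
    """Relative drop in performance from the training set to another set.

    Assumption: scores are higher-is-better (for example, accuracy or AUC).
    """
    if train_score <= 0:
        raise ValueError("train_score must be positive")
    return (train_score - other_score) / train_score


def is_overfit_suspected(scores, threshold=0.10):
    """Flag a model whose validation or test score trails training by more than threshold.

    `scores` maps "train", "validation", and "test" to metric values.
    """
    gaps = [overfit_gap(scores["train"], scores[name]) for name in ("validation", "test")]
    return max(gaps) > threshold


# Hypothetical scores for two models.
suspicious = is_overfit_suspected({"train": 0.95, "validation": 0.80, "test": 0.78})
healthy = is_overfit_suspected({"train": 0.85, "validation": 0.82, "test": 0.81})
```

With these made-up scores, the first model is flagged (its test score is roughly 18% below its training score) and the second is not.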

How to resolve overfitting?

If you suspect overfitting, we recommend taking two steps:

  1. Reduce the number of attribute columns fed into your model. This helps you achieve a leaner model that is more generalizable to future data.
    Generally, you will want to remove attribute columns if:

    1. The attribute is unlikely to be causally related to your predictions.

    2. The attribute causes leakage, meaning it’s representative of your training set but won’t be available for future data.

    3. The attribute’s impact on your model (as reflected by “Feature Importance”) seems unusual.

  2. Add more entities. Providing the model with more entities to learn from can increase the variety of patterns and behaviors it can learn and increase its predictive power for future data.
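The first step above can be sketched as a small review helper in plain Python. This is only an illustration under stated assumptions: the attribute names, the importance values, and the 50% "dominance" cutoff are invented, and treating one attribute that dominates feature importance as "unusual" is just one possible reading of that criterion. The functions `attributes_to_review` and `drop_attributes` are hypothetical and are not part of Pecan.

```python
def attributes_to_review(feature_importance, leaky_attributes=(), dominance_share=0.5):
    """Return (attribute, reason) pairs worth a closer look before retraining."""
    total = sum(feature_importance.values())
    flagged = []
    for name, importance in feature_importance.items():
        if name in leaky_attributes:
            # Known to be unavailable at prediction time for future data.
            flagged.append((name, "known leakage"))
        elif total > 0 and importance / total > dominance_share:
            # One attribute carrying most of the importance is worth questioning.
            flagged.append((name, "dominates feature importance"))
    return flagged


def drop_attributes(rows, names):
    """Return copies of the rows without the given attribute columns."""
    names = set(names)
    return [{k: v for k, v in row.items() if k not in names} for row in rows]


# Hypothetical feature importance for a churn model.
importance = {
    "days_since_cancel_request": 0.62,
    "monthly_logins": 0.25,
    "plan_type": 0.10,
    "refund_issued": 0.03,
}
flagged = attributes_to_review(importance, leaky_attributes=("refund_issued",))

rows = [{"monthly_logins": 3, "plan_type": "basic", "refund_issued": 1}]
cleaned = drop_attributes(rows, [name for name, _ in flagged])
```

A helper like this only produces a shortlist. Whether an attribute is truly leaky or causally unrelated is still a judgment call that depends on how your data is collected.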

Still need help?

In the event that you suspect your model might be overfitting, or if you're seeking guidance on enhancing its performance, please feel free to reach out to the Pecan team. As part of our commitment to your success, we're always prepared to provide in-depth assistance, share further insights, and help you navigate potential challenges.

