When you train a model with Pecan, it learns from historical data (such as customer demographics, transactions, and many other features) in order to make predictions based on new data. However, any data that contributes to a prediction must have been registered before the moment of that prediction.
Data leakage happens when the model is exposed to data that occurred after the prediction. This “future information” allows the model to learn or know something that it shouldn’t, and thus "cheat" by seeing the future before making a prediction. Worst case, the data being used to train the model may contain the very information you are trying to predict (e.g. whether a customer will churn), which would totally invalidate the results of the model being trained.
To put it more succinctly: data leakage occurs when information that doesn’t belong in your training dataset is leaked to the model, thus contaminating it by letting it peek into the future.
As illustrated below, when a model is being trained, data that provides the basis for predictions may only be included if it occurs prior to the moment of the prediction:
An example of data leakage
Say you’re building a model to predict churn, and your company updates a “last_quit_date” column whenever a customer churns. Using this column as a feature will likely cause data leakage, because the model could “check” this column to see whether a customer will churn in the future before issuing its predictions.
How to prevent data leakage
To avoid data leakage, do not use information in your training data if it was recorded after the moment of the prediction. As a rule of thumb, don’t import columns whose data is likely to be entered or overwritten after the marker date (in the Entity table), or is typically updated over time.
In the left-side example, we see that a column is being used incorrectly since its data was updated after the moment of prediction.
In the right-side example, we see that it’s safe to use columns that contain data from prior to the prediction.
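The rule above can be sketched in a few lines of Python. This is only an illustration, not how Pecan applies the filter internally: the `markers` mapping, the `transactions` rows, and the `before_marker` helper are all hypothetical names, and the strict "before" comparison is an assumption made to stay on the conservative side.

```python
from datetime import date

# Hypothetical Entity rows: customer_id -> marker date (moment of prediction).
markers = {
    "c1": date(2023, 3, 1),
    "c2": date(2023, 4, 1),
}

# Hypothetical transaction rows: (customer_id, transaction_date, amount).
transactions = [
    ("c1", date(2023, 2, 10), 40.0),
    ("c1", date(2023, 3, 15), 25.0),  # recorded after c1's marker date
    ("c2", date(2023, 3, 15), 60.0),
]

def before_marker(rows, markers):
    """Keep only rows dated strictly before that customer's marker date."""
    return [
        row for row in rows
        if row[0] in markers and row[1] < markers[row[0]]
    ]

safe_transactions = before_marker(transactions, markers)
```

Note that the same calendar date (March 15) is dropped for one customer and kept for the other, because each customer is compared against their own marker date.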
What if you want to use data that changes over time?
Data typically does contain a variety of fields that change over time, such as age, last purchased product, or customer status. This data is often found in customer tables that have one row per customer; because each row holds only the latest values, these tables are especially likely to cause data leakage.
The important thing is to ensure you’re always using data that was true at the moment of prediction, not data that became true after it. Here are a couple of examples to think about as you consider how to incorporate these types of data into your model-training dataset.
If you want to include customer age, the table containing this data may have been updated more recently than the marker date in your Entity table. So instead of using the current “age” property, you would use a field that calculates the time difference between their date of birth and the date/time of prediction. This way, as a property in your model, customer age will have a consistent reference point and always be true for the moment of prediction.
If you want to use a property like “last purchased product” or “customer status”, you’ll want to pull these values from a dedicated table that includes, in this case, the date of purchase or date of status change. That way, when feeding training data into your model, you’ll be able to easily filter out data that occurs beyond the marker date.
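Both examples can be sketched as follows. This is a minimal illustration under stated assumptions: `age_at`, `status_as_of`, and the `status_history` rows are hypothetical, and treating only changes dated strictly before the marker date as known is an assumption you may need to adapt to your own data.

```python
from datetime import date

def age_at(date_of_birth, marker_date):
    """Whole years between date_of_birth and marker_date."""
    years = marker_date.year - date_of_birth.year
    # Subtract a year if the birthday has not yet occurred in the marker year.
    if (marker_date.month, marker_date.day) < (date_of_birth.month, date_of_birth.day):
        years -= 1
    return years

# Hypothetical status-history rows: (customer_id, status, changed_on).
status_history = [
    ("c1", "trial", date(2023, 1, 5)),
    ("c1", "active", date(2023, 2, 1)),
    ("c1", "cancelled", date(2023, 4, 20)),  # happens after the marker date
]

def status_as_of(history, customer_id, marker_date):
    """Return the latest status recorded strictly before marker_date, or None."""
    eligible = [
        (changed_on, status)
        for cid, status, changed_on in history
        if cid == customer_id and changed_on < marker_date
    ]
    if not eligible:
        return None
    return max(eligible, key=lambda item: item[0])[1]

marker = date(2023, 3, 1)
age = age_at(date(1990, 6, 15), marker)
status = status_as_of(status_history, "c1", marker)
```

Because both values are computed relative to `marker`, they stay the same no matter when the source tables were last updated; the later "cancelled" status never reaches the model.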
How to identify sources of data leakage
If future information is being introduced into your dataset, it’s very likely to influence your predictions. Here’s how to identify a couple of common sources of data leakage:
Carefully review your SQL queries in Pecan Query Language Studio. Look for any ON or filtering conditions that might include data from after the marker date. For more details about customizing your input tables with SQL queries, see Customizing tables with SQL.
Carefully check all of your tables that have been imported as a snapshot. The values in these tables are updated continually and in their entirety, and thus risk exposing your model to data that was registered after the marker date. As a rule of thumb, if you're unsure whether a column might cause leakage, don’t include it. Ideally, you will have a “data dictionary” that defines what each column in your raw data means. Suspicious names may indicate a potential source of data leakage. For example, if you are predicting churn, you would not want to use columns named “has_churned”, “called_to_churn” or “quit_reason” in your model.
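As a rough aid for the column-name review described above, a simple keyword scan can surface candidates for a closer look. This is only a sketch: the `SUSPICIOUS_TERMS` list and the `flag_suspicious_columns` helper are hypothetical, the terms are an assumption for a churn use case, and a name match is a prompt for manual review, not proof of leakage.

```python
# Assumed keywords for a churn model; adjust them to your own prediction target.
SUSPICIOUS_TERMS = ("churn", "quit", "cancel")

def flag_suspicious_columns(columns, terms=SUSPICIOUS_TERMS):
    """Return column names that contain any of the given terms (case-insensitive)."""
    return [
        name for name in columns
        if any(term in name.lower() for term in terms)
    ]

columns = [
    "customer_id",
    "signup_date",
    "has_churned",
    "called_to_churn",
    "quit_reason",
    "last_quit_date",
]
flagged = flag_suspicious_columns(columns)
```

A scan like this will miss leaky columns with innocent-looking names, so it complements a data dictionary rather than replacing one.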