How and why we split training data

A common practice in almost any Data Science pipeline is splitting the available data for a project into separate train/validation/test sets. This split provides an unbiased and quick way to confirm model integrity and assess its performance.

Typically, around 80% of the data is used for training, 10% is used for validation, and the remaining 10% tests the predictive performance of the model.

  • Training set: This includes most of the available dataset. The data is fed directly to the Machine Learning algorithms, constituting the "knowledge" of a model: what it has seen and learned.

  • Validation set: Considerably smaller than the Training set, the Validation set is used to fine-tune the parameters of our Machine Learning models during the Training stages: how quickly models should learn, how much error is allowed, how the learning is structured, and other technical criteria. The model therefore accesses this data for validation purposes, but does not incorporate it into its "knowledge".

  • Test set: Also smaller than the Training set, the Test set is a part of the data the model did not access during training or validation. Therefore, the events in the Test set provide the model with completely new data, thus testing the model's predictive performance.

For example, imagine we trained a model and want to verify that its predictions correspond to the true outcomes. Had we not set aside a Test set with known outcomes before training our model, we would have to wait until the outcomes actually happen before comparing them to the predictions. With a Test set, predictions can be compared with the actual outcomes right away to generate Pecan's model performance metrics, closely replicating future predictive scenarios.
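The comparison described above can be sketched in a few lines. This is a minimal illustration, not Pecan's implementation: the `accuracy` function, the sample labels, and the choice of accuracy as the metric are all assumptions made for the example.

```python
def accuracy(predictions, outcomes):
    """Return the share of predictions that match the known outcomes."""
    if len(predictions) != len(outcomes):
        raise ValueError("predictions and outcomes must have the same length")
    if not outcomes:
        raise ValueError("cannot score an empty Test set")
    matches = sum(p == o for p, o in zip(predictions, outcomes))
    return matches / len(outcomes)


# Hypothetical churn labels for five held-out customers (True = churned).
# The outcomes are already known because the Test set was put aside
# before training, so there is no need to wait for them to happen.
test_outcomes = [True, False, False, True, False]
test_predictions = [True, False, True, True, False]

test_accuracy = accuracy(test_predictions, test_outcomes)
print(f"Test accuracy: {test_accuracy:.0%}")
```

Because the Test set was never shown to the model, a score computed this way approximates how the model would behave on data it has not seen.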

Pecan performs Training/Validation/Test data splits automatically; however, the method differs between time-dependent and time-independent models.

Time-independent models

Time-independent models have their data randomly split: 80% to Training, 10% to Validation, and 10% to Test. A series of checks is automatically performed to ensure that each set contains the proper data from all tables and that the sets do not overlap (in other words, that all sets are disjoint).
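A random 80/10/10 split with a disjointness check might look like the sketch below. It only illustrates the idea; the `random_split` function, the fixed seed, and the integer customer IDs are assumptions for the example and do not describe Pecan's internal code.

```python
import random


def random_split(rows, train=0.8, validation=0.1, seed=42):
    """Shuffle rows and cut them into Training, Validation, and Test sets."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)  # fixed seed keeps the split reproducible
    n_train = round(len(shuffled) * train)
    n_validation = round(len(shuffled) * validation)
    return (
        shuffled[:n_train],
        shuffled[n_train:n_train + n_validation],
        shuffled[n_train + n_validation:],  # whatever is left goes to Test
    )


# Hypothetical dataset: 1,000 customer IDs.
customer_ids = list(range(1000))
train_set, validation_set, test_set = random_split(customer_ids)

# The sets must not overlap, and together they must cover all rows.
sets_are_disjoint = (
    set(train_set).isdisjoint(validation_set)
    and set(train_set).isdisjoint(test_set)
    and set(validation_set).isdisjoint(test_set)
)
covers_all_rows = sorted(train_set + validation_set + test_set) == customer_ids
print(len(train_set), len(validation_set), len(test_set), sets_are_disjoint)
```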

Time-independent data splits are triggered automatically by not specifying a Marker column in the Model Editor.

Time-dependent models

Time-dependent models capture correlations between events and the times at which they happened, so random data splits will very likely cause data leakage.

To overcome this, data splits have to take into account when events happened. At Pecan, we use the date of prediction as the splitting criterion (in Pecan terms, the Marker column).

Pecan orders all data by the prediction date. Then, it assigns the most recent 10% of the data to the Test set. Of the remaining 90%, the most recent portion (10% of the full dataset) goes to the Validation set, and the rest (80% of the full dataset) goes to the Training set. Therefore, the exact dates of the splits depend on how your data is distributed across time.
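The ordering-based split described above can be sketched as follows. This is an illustration under stated assumptions rather than Pecan's actual code: the `time_ordered_split` function, the row layout, and the `marker` field name are hypothetical.

```python
from datetime import date, timedelta


def time_ordered_split(rows, marker, train=0.8, validation=0.1):
    """Sort rows by marker date (oldest first) and cut them chronologically."""
    ordered = sorted(rows, key=marker)
    n_train = round(len(ordered) * train)
    n_validation = round(len(ordered) * validation)
    return (
        ordered[:n_train],                         # oldest 80%
        ordered[n_train:n_train + n_validation],   # next 10%
        ordered[n_train + n_validation:],          # most recent 10%
    )


# Hypothetical data: one row per day, starting on 1 January 2019.
start = date(2019, 1, 1)
rows = [{"customer_id": i, "marker": start + timedelta(days=i)} for i in range(100)]

train_rows, validation_rows, test_rows = time_ordered_split(
    rows, marker=lambda row: row["marker"]
)

# Every Training row precedes every Validation row, which precedes every Test row.
latest_train = max(row["marker"] for row in train_rows)
earliest_validation = min(row["marker"] for row in validation_rows)
latest_validation = max(row["marker"] for row in validation_rows)
earliest_test = min(row["marker"] for row in test_rows)
print(latest_train, earliest_validation, earliest_test)
```

Here the rows are spread evenly over time, so the cut-off dates fall at regular intervals. With unevenly distributed data, the same percentages would translate into different cut-off dates.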

For example, if most of your data falls from Oct-19 onward, to the point that less than 50% of your dataset is left for the Training set, you will receive a notification/email warning that your model might not have had enough data to learn from.

To help illustrate the perils of randomly splitting time-dependent data, let's suppose an online gaming platform is trying to predict churn among its customers. For that purpose, the company collected all sorts of data about its players over the course of 2019 and decided to use it to build a churn prediction model.

Now imagine that we randomly split the first 6 months of data between Training/Validation/Test sets, ending up with the following configuration:

The model completed training successfully. We checked the results and saw VERY good performance, so good that it looks suspicious. We use the model to run live predictions, wait a month, and see that the performance turns out to be disappointing, very different from the numbers we saw before. What happened?

Our model learned from the data in the Training set, so it "knows" all the data about customers in Jan-19, Feb-19, Apr-19, and Jun-19. When we turn to the Test set to confirm our model's performance, we ask it to guess whether customers from Mar-19 will churn in Apr-19. But there is no need to guess. Our model already "knows" what happens in Apr-19 and Jun-19, because these months' data are included in the Training set. So it can simply check whether a customer was indeed active during Apr-19, without needing to make any prediction. This is a classic example of data leakage.
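One simple way to spot this kind of leakage is to check whether the Training set contains any month that comes after the earliest month in the Test set. The sketch below applies that check to the months named in the example; the `leaking_months` function and the chronological alternative are hypothetical and serve only as an illustration.

```python
def leaking_months(train_months, test_months):
    """Return Training months that fall after the earliest Test month."""
    earliest_test = min(test_months)
    return sorted(month for month in train_months if month > earliest_test)


# Months as (year, month) tuples, taken from the random split described above.
random_train = [(2019, 1), (2019, 2), (2019, 4), (2019, 6)]
random_test = [(2019, 3)]

# Apr-19 and Jun-19 are in Training even though they come after the Test month.
leaks = leaking_months(random_train, random_test)
print("Leaking months in random split:", leaks)

# A hypothetical chronological split keeps Training strictly before Test.
chronological_train = [(2019, 1), (2019, 2), (2019, 3), (2019, 4)]
chronological_test = [(2019, 6)]
no_leaks = leaking_months(chronological_train, chronological_test)
print("Leaking months in chronological split:", no_leaks)
```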

Time-dependent data splits are triggered automatically by specifying a Marker column in the Model Editor.
