
Configuring Train/Test Data Splits in Pecan

Guide to time-based data partitioning for accurate machine-learning models

Written by Ori Sagi
Updated over a month ago

After you define the core set and attribute sets for model training in the notebook, the next step is splitting the data. This step decides how the data is divided between training and testing sets so you can evaluate models fairly and avoid bias.


What is a data split?

A data split partitions the core dataset into two subsets:

  • Training set: used to train and validate the predictive model.

  • Test set (holdout): used to evaluate the model’s performance on unseen data.

By default, Pecan applies a time-based 90 / 10 split:

  • The first 90 % of entities (ordered by the sampled-date column, earliest to latest) go to the training set.

  • The most recent 10 % go to the test set.

No action is required unless you want a different split.
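To make the default behavior concrete, here is a minimal sketch of a time-based 90 / 10 split in plain Python. It illustrates the idea only and is not Pecan's implementation: the function name time_based_split, the sampled_date key, and the rounding of the cut point are all assumptions made for this example.

```python
from datetime import date

def time_based_split(entities, train_fraction=0.9, date_key="sampled_date"):
    """Order entities by date and cut once: earliest rows train, latest rows test."""
    if not 0 < train_fraction < 1:
        raise ValueError("train_fraction must be strictly between 0 and 1")
    # Sort from earliest to latest sampled date.
    ordered = sorted(entities, key=lambda row: row[date_key])
    # Assumption: the cut index is rounded to the nearest whole entity.
    cut = round(len(ordered) * train_fraction)
    return ordered[:cut], ordered[cut:]

# Twenty toy entities, one per day, deliberately given in reverse order.
entities = [{"entity_id": i, "sampled_date": date(2025, 1, 1 + i)} for i in range(20)]
entities.reverse()

train, test = time_based_split(entities)
print(len(train), len(test))  # 18 2
```

Because the rows are sorted before the cut, every training entity is older than every test entity, which is the property that distinguishes a time-based split from a random one.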

💡 Curious about the finer points of train vs. validation vs. test splits?

See our detailed guide on the topic.


Customizing the data split

You can change the default split in two ways:

  1. Split by specific date

    Choose a date that marks the start of the test period.

    Example: To evaluate model performance from July 2025 onward, set the split date to 2025-07-01. All entities with sampled dates on or after this date are included in the test set, and all earlier entities go to the training set.

  2. Change the train / test ratio

    Adjust the percentage of data used for training versus testing.

    Examples:

    • Use an 80 / 20 split to allocate more data to testing.

    • Use a 99 / 1 split to maximize training data when you’re confident in model performance.

Both options are available in the configuration panel before you launch training.
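A split by specific date can be sketched the same way. The example below is a hypothetical illustration, not Pecan's code: the function name split_by_date and the sampled_date key are invented here, and it assumes that an entity sampled exactly on the split date belongs to the test set, since the split date marks the start of the test period.

```python
from datetime import date

def split_by_date(entities, split_date, date_key="sampled_date"):
    """Entities sampled before split_date train; the rest are held out for testing."""
    train = [row for row in entities if row[date_key] < split_date]
    # Assumption: the split date itself is the first day of the test period.
    test = [row for row in entities if row[date_key] >= split_date]
    return train, test

entities = [
    {"entity_id": 1, "sampled_date": date(2025, 5, 20)},
    {"entity_id": 2, "sampled_date": date(2025, 6, 30)},
    {"entity_id": 3, "sampled_date": date(2025, 7, 1)},
    {"entity_id": 4, "sampled_date": date(2025, 7, 15)},
]

train, test = split_by_date(entities, date(2025, 7, 1))
print([row["entity_id"] for row in train])  # [1, 2]
print([row["entity_id"] for row in test])   # [3, 4]
```

Note that a date split fixes the boundary rather than the proportions, so the resulting train / test ratio depends on how many entities fall on each side of the chosen date.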


Split validations and rules

Before training starts, Pecan runs automatic checks to protect data quality and model robustness.

  • Label representation

    Both the training and test sets must contain at least two distinct outcome label values, so that the model learns from more than one outcome and is also evaluated on more than one.

  • Minimum set sizes

    • The training set must include at least 50 % of the total data.

    • The test set must include at least 1 % of the total data.

      These thresholds follow ML best practices and can be adjusted by advanced users.

  • Valid split-date range

    If you set a specific split date, Pecan verifies that the date falls within the sampled-date range of the core dataset.
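The three checks above can be expressed as a small validation routine. This is a sketch under stated assumptions rather than Pecan's actual validation logic: the function name validate_split, the label and sampled_date keys, and the wording of the messages are hypothetical, while the 50 % and 1 % thresholds come from the rules listed above.

```python
from datetime import date

def validate_split(train, test, split_date=None,
                   label_key="label", date_key="sampled_date",
                   min_train_share=0.50, min_test_share=0.01):
    """Return a list of problems found; an empty list means the split passes."""
    total = len(train) + len(test)
    if total == 0:
        return ["the dataset is empty"]
    problems = []
    # Label representation: each set needs at least two distinct label values.
    for name, rows in (("training", train), ("test", test)):
        if len({row[label_key] for row in rows}) < 2:
            problems.append(f"{name} set has fewer than two distinct labels")
    # Minimum set sizes, as shares of the total data.
    if len(train) / total < min_train_share:
        problems.append("training set is below the minimum share")
    if len(test) / total < min_test_share:
        problems.append("test set is below the minimum share")
    # Valid split-date range: only checked when a split date was given.
    if split_date is not None:
        dates = [row[date_key] for row in train + test]
        if not min(dates) <= split_date <= max(dates):
            problems.append("split date is outside the sampled-date range")
    return problems

# Ten toy entities, one per day, with alternating labels 0 and 1.
rows = [{"sampled_date": date(2025, 1, 1 + i), "label": i % 2} for i in range(10)]

ok = validate_split(rows[:8], rows[8:], split_date=date(2025, 1, 9))
bad = validate_split(rows[:9], rows[9:], split_date=date(2025, 3, 1))
print(ok)   # []
print(bad)  # two problems: single-label test set, split date out of range
```

Returning a list of problems instead of raising on the first failure lets a caller report every issue with a split at once.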


Summary

The data split is a foundational step in your modeling workflow. Use the default 90 / 10 setup for quick iteration, or customize the split to fit your business context. Ensuring that training and test sets are valid and well distributed leads to stronger insights and more confident decisions.
