Overview
Outliers are extreme data points that differ significantly from most other values in a dataset. They can arise from rare events or anomalies. In regression models, outliers in the training set may harm the training process by skewing the model toward patterns it cannot reliably predict.
To help identify and address outliers, Pecan now provides an Outliers Alert in the model dashboard for production-quality regression models.
How It Works
1. Detecting Outliers
Pecan analyzes the Label (the target variable) of your regression model and identifies label values that fall outside the typical distribution of values (i.e., considerably higher or lower than the majority of values).
2. Healthy Range
For each regression model, Pecan calculates a ‘healthy range’ with minimum and maximum acceptable label values. A Health Check is triggered if any outliers are found in the training set (including both training and validation subsets).
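To make the idea of a healthy range concrete, here is a minimal Python sketch. It assumes Tukey's interquartile-range (IQR) rule with a multiplier of 1.5; this article does not state how Pecan actually computes the range, so the rule, the function names, and the sample labels are all illustrative assumptions.

```python
import statistics


def healthy_range(labels, k=1.5):
    """Return (minimum, maximum) acceptable label values.

    Assumption: Tukey's IQR fences with k=1.5 stand in for the real
    calculation, which is not documented in this article.
    """
    q1, _, q3 = statistics.quantiles(labels, n=4, method="inclusive")
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr


def find_outliers(labels, low, high):
    """Return the label values that fall outside [low, high]."""
    return [value for value in labels if value < low or value > high]


# Hypothetical label column: nine ordinary values and one extreme value.
labels = [10, 12, 11, 13, 12, 11, 10, 14, 12, 250]
low, high = healthy_range(labels)
outliers = find_outliers(labels, low, high)
print(f"healthy range: {low} to {high}")
print(f"outliers: {outliers}")  # outliers: [250]
```

With these sample labels, only the value 250 falls outside the computed range, which is the kind of situation that would raise an alert.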
3. Dashboard Alert
When outliers are detected, an alert will appear in your model’s dashboard. You can review the details and decide whether to adjust your data to mitigate these extreme values.
What to Do If Your Model Has Outliers?
You can duplicate your model and train it again with the remove (clip) outliers setting enabled:
1. Duplicate your Predictive Notebook
Go to the Prepare Data tab at the top of the dashboard, and click Duplicate. Then click Train model, run validations, and click Continue to model training.
2. Change the Training Configuration
Set the Training mode to Production Grade.
3. Enable Outlier Removal
Select the option to remove outliers:
Values above the healthy range will be clipped to the healthy maximum.
Values below the healthy range will be clipped to the healthy minimum.
4. Send Your Model To Train
Click Train model to train your model again with the outlier removal setting enabled.
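The clipping described in step 3 can be sketched in a few lines of Python. This is only an illustration of what clipping to a healthy minimum and maximum means, not Pecan's implementation; the function name `clip_labels` and the range of 0 to 20 are hypothetical.

```python
def clip_labels(labels, healthy_min, healthy_max):
    """Clip each label into the range [healthy_min, healthy_max].

    Values above healthy_max become healthy_max, values below
    healthy_min become healthy_min, and all other values are unchanged.
    """
    return [min(max(value, healthy_min), healthy_max) for value in labels]


# Hypothetical labels with one value below and one above the range.
clipped = clip_labels([-40, 5, 12, 18, 300], healthy_min=0, healthy_max=20)
print(clipped)  # [0, 5, 12, 18, 20]
```

Note that clipping keeps every row in the training set; only the extreme label values are changed.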
Important Notes
Production Quality Models Only
Outlier alerts and removal options are available for production-quality regression models.
Production Grade training takes longer than fast training because Pecan removes the outliers and also engineers more complex features. Learn more here.
Outliers in the Test Set
Outliers in the test set are typically less of a concern, because the main goal is to train your model on “clean” data. Removing outliers may improve model performance by reducing skew in the training data. However, you should evaluate whether the outliers themselves are critical data points that need special attention.
If a large number of outliers appear in the test set, it may indicate a mismatch between training and test data distributions and could raise questions about the fairness or reliability of your evaluation.
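One rough way to check for such a mismatch yourself is to compare the share of labels outside the healthy range in each dataset. The sketch below is a hypothetical helper, not a Pecan feature; the range of 0 to 20 and both label lists are made-up values for illustration.

```python
def outlier_share(labels, healthy_min, healthy_max):
    """Return the fraction of labels outside [healthy_min, healthy_max]."""
    if not labels:
        return 0.0
    outside = sum(1 for v in labels if v < healthy_min or v > healthy_max)
    return outside / len(labels)


# Hypothetical label samples and healthy range.
train_share = outlier_share([3, 7, 12, 18, 45], healthy_min=0, healthy_max=20)
test_share = outlier_share([5, 60, 75, 9, 90], healthy_min=0, healthy_max=20)
print(f"training: {train_share:.0%}, test: {test_share:.0%}")
# training: 20%, test: 60%
```

A much higher share in the test set than in the training set, as in this made-up example, is a hint that the two distributions differ and that the evaluation deserves a closer look.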
Need More Help?
If you have questions about how outliers are detected, how to configure outlier removal, or how to interpret alerts in your dashboard, feel free to reach out to our Support Team. We’re here to help you get the best possible results from your regression models!