Evaluating your model takes more than a look at simple scores or metrics, because many other factors can affect the model's reliability. These factors can go unnoticed if you look only at scores.
That's why Pecan runs dedicated health checks on each model. They look deep into the model to find problems quickly and suggest fixes. This article explains Pecan's health checks, helping you understand and fix any issues with your model.
Think of it like a doctor's checkup where the model is the patient, and Pecan's health checks are the tests. Like a doctor looking for the root cause of a problem, not just the symptoms, these checks look at the model's structure and workings, helping you improve your model effectively.
#1: Data Leakage Check
This health check analyzes the importance of each feature to the model's predictions. If a single feature accounts for over 50% of the importance and the model's performance looks too good to be true, it might indicate data leakage.
Data leakage is a common issue where your model seems to perform exceptionally well but will fail when it comes to new, unseen data. This is often due to a feature that indirectly contains information about the target variable. Identifying such features can prevent data leakage and improve your model's real-world accuracy.
It's important to mention that over 50% importance is not always a problem - sometimes the feature is simply a strong predictor that is highly correlated with the label variable.
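The idea behind this check can be sketched in a few lines of Python. This is only an illustration, not Pecan's implementation: the function name `flag_dominant_features`, the example feature names, and the choice to normalize importances by their sum are all assumptions made for the example.

```python
def flag_dominant_features(importances, threshold=0.5):
    """Return features whose share of total importance exceeds the threshold.

    `importances` maps feature name -> raw importance (hypothetical input).
    The 0.5 default mirrors the 50% rule described above.
    """
    total = sum(importances.values())
    if total <= 0:
        return []
    return sorted(
        name for name, value in importances.items()
        if value / total > threshold
    )


# Hypothetical importances for a churn model.
importances = {"churn_date": 0.72, "plan_type": 0.18, "tenure_days": 0.10}
suspects = flag_dominant_features(importances)
print(suspects)
```

A flagged feature is a candidate for review, not proof of leakage: it may simply be a strong, legitimate predictor.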
How can you solve feature imbalance?
Prevent Data Leakage: Ensure your column doesn't contain information unavailable before the entity's marker. An example of this could be a churn date column. This information wouldn't be known before a user churns, so using it could cause data leakage. If this is the case, consider removing this column.
Apply Date Filters: The issue may not always be with the column but could lie within the attribute query. Applying a proper date filter can prevent the model from using data that occurs after the marker. This will ensure that your model is only utilizing relevant, accurate data. If the attribute is a transactional table (which contains multiple rows per entity), choose this option when adding this table as an attribute, and the date filter will be implemented automatically.
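To make the date-filter idea concrete, here is a minimal sketch of keeping only the rows of a transactional table that occur before the marker. The row layout, the `event_date` key, and the helper `filter_before_marker` are hypothetical, and treating "before" as strictly earlier than the marker is an assumption; this is not how Pecan applies the filter internally.

```python
from datetime import date


def filter_before_marker(rows, marker, date_key="event_date"):
    """Keep only rows dated strictly before the marker (assumed semantics)."""
    return [row for row in rows if row[date_key] < marker]


# Hypothetical transactional rows for a single entity.
transactions = [
    {"entity_id": 1, "event_date": date(2023, 1, 10), "amount": 20.0},
    {"entity_id": 1, "event_date": date(2023, 2, 5), "amount": 35.0},
    {"entity_id": 1, "event_date": date(2023, 3, 1), "amount": 50.0},
]
marker = date(2023, 2, 15)

usable = filter_before_marker(transactions, marker)
print(len(usable))
```

The row dated after the marker is dropped, so the model never sees information from the period it is supposed to predict.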
#2: Performance Stability Check
Performance stability over training and testing datasets is a vital indicator of a model's health. A discrepancy of over 20 in performance metrics (PR AUC for binary classification and WMAPE for regression) between the training and test sets may indicate overfitting.
Overfitting occurs when the model learns the training data too well, capturing noise and outliers, and hence performs poorly on unseen data. Keeping a check on performance stability will help you ensure that your model generalizes well to new data.
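A stability check of this kind could look roughly like the sketch below. It assumes the metric is expressed on a 0-100 scale and that the threshold of 20 is an absolute gap in points; the article does not spell out the units, so both are assumptions, as is the function name `is_unstable`.

```python
def is_unstable(train_score, test_score, max_gap=20.0):
    """Flag a possible overfit when train and test metrics differ by more than max_gap.

    Scores are assumed to be on a 0-100 scale (e.g. PR AUC or WMAPE as percentages).
    """
    return abs(train_score - test_score) > max_gap


# Hypothetical scores for two models.
overfit_suspect = is_unstable(train_score=95.0, test_score=60.0)
stable_model = is_unstable(train_score=80.0, test_score=72.0)
print(overfit_suspect, stable_model)
```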
How can you solve overfitting?
Simplify Your Model: An overcomplicated model with numerous attribute columns may perform exceptionally well on the training data but struggle to generalize to new, unseen data. This is a common cause of overfitting. By reducing the number of attributes, you're simplifying your model, thus making it less likely to overfit. The model becomes less specialized to the training data and more capable of making accurate predictions on new data.
Add Entities to the Training Data: Overfitting often occurs when the model doesn't have enough data to learn from. Adding more samples to the training set allows the model to learn more diverse patterns and reduce overfitting. The broader understanding gained from more varied data can help your model make better predictions when faced with new, unseen data. Remember, the more examples your model has to learn from, the better it can perform.
#3: Target Drift Check
ML models assume that the distribution of the target variable is stationary; that is, it does not change over time. However, in real-world scenarios, this is not always true. For instance, in sales forecasting, the number of sales can increase during holiday seasons, creating a time-related drift in the target variable.
To account for this, we perform a 'target drift' check, which analyzes whether the target variable correlates with time. If so, we suggest a new cut-off point to start the training set, ensuring that your model is trained on the most relevant data, improving its accuracy. Remember, sometimes less is more.
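One simple way to test whether a target correlates with time is to compute the Pearson correlation between the period index and the target's average value in each period. The sketch below does that. The 0.5 threshold, the function names, and the example values are assumptions chosen for illustration; it does not reproduce Pecan's actual drift test or its cut-off suggestion logic.

```python
import math


def pearson(xs, ys):
    """Pearson correlation of two equal-length sequences (0.0 if either is constant)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0
    return cov / math.sqrt(var_x * var_y)


def has_target_drift(target_by_period, threshold=0.5):
    """Flag drift when the target's per-period value correlates strongly with time."""
    periods = list(range(len(target_by_period)))
    return abs(pearson(periods, target_by_period)) > threshold


# Hypothetical monthly churn rates, oldest first.
rising = [0.10, 0.12, 0.15, 0.17, 0.21, 0.24]
oscillating = [5, 3, 5, 3, 5]
print(has_target_drift(rising), has_target_drift(oscillating))
```

A steadily rising series is flagged, while a series that oscillates around a constant level is not.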
#4: Data Volume Check
The size of your dataset is a crucial factor in determining the health of your ML model. More data usually results in a more accurate and robust model, because the model can learn from a more comprehensive set of examples. We recommend a minimum of 1000 entities and 10 features (columns) to ensure a healthy model. If your dataset falls short of this, the model might not generalize well to new data.
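The recommended minimums are easy to verify before training. The sketch below uses the thresholds stated above; the function name `data_volume_issues` and the message wording are hypothetical.

```python
MIN_ENTITIES = 1000  # recommended minimum number of entities
MIN_FEATURES = 10    # recommended minimum number of features (columns)


def data_volume_issues(n_entities, n_features):
    """Return a list of human-readable issues; an empty list means the check passes."""
    issues = []
    if n_entities < MIN_ENTITIES:
        issues.append(f"only {n_entities} entities (recommended: {MIN_ENTITIES}+)")
    if n_features < MIN_FEATURES:
        issues.append(f"only {n_features} features (recommended: {MIN_FEATURES}+)")
    return issues


# Hypothetical dataset sizes.
small_dataset = data_volume_issues(n_entities=500, n_features=4)
healthy_dataset = data_volume_issues(n_entities=1000, n_features=10)
print(small_dataset, healthy_dataset)
```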
In conclusion, a solid grasp and accurate interpretation of these health checks are the keys to constructing and maintaining robust models. This practice directly leads to more precise and dependable predictions. And remember, if you encounter any obstacles or uncertainties along the way, don't hesitate to contact us for further guidance and assistance. We're here to help you make the most out of your modeling journey.