One of the common issues with AI models is a lack of clarity around how they arrive at their conclusions. Pecan has made it a priority to overcome this issue for two main reasons:
Understanding which factors contribute to a model’s predictions – and to what degree – will enable you to refine your dataset and optimize the model itself. For example, you may wish to reduce “noise” in the model by reducing data points that have no meaningful impact, or to enhance prediction accuracy by adding potentially relevant data to the model.
Understanding which factors can be used to predict reality will reveal opportunities to influence that reality by encouraging particular outcomes or behavior. For example, you may decide to send an email to customers who are predicted to churn, and this may even be informed by the fact that your model detects “date_since_last_email” as being a meaningful factor in predicting churn.
In Pecan, these contributing factors are known as “features”, and “feature importance” quantifies the impact of each feature on the model’s predictions. This score is calculated by: 1) calculating each feature’s SHAP values (numbers whose magnitude grows as the feature’s contribution grows), and 2) normalizing the results so that the scores of all features sum to 100%.
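The two-step calculation above can be sketched in a few lines of Python. This is a minimal illustration, assuming the importance score is the normalized mean absolute SHAP value (a standard convention); the SHAP values and feature names below are invented, not Pecan’s internals:

```python
import numpy as np

# Hypothetical SHAP values: one row per prediction, one column per feature.
shap_values = np.array([
    [ 0.8, -0.2,  0.1],
    [-0.5,  0.3,  0.0],
    [ 0.6, -0.1,  0.2],
])
feature_names = ["days_since_last_email", "total_purchases", "country_code"]

# 1) A feature's raw contribution: the mean of its absolute SHAP values
#    across all predictions (the sign only indicates direction of effect).
raw_importance = np.abs(shap_values).mean(axis=0)

# 2) Normalize so that all feature scores sum to 100%.
importance_pct = 100 * raw_importance / raw_importance.sum()

for name, pct in sorted(zip(feature_names, importance_pct), key=lambda p: -p[1]):
    print(f"{name}: {pct:.1f}%")
```

A feature with twice the mean absolute SHAP value of another thus receives twice the importance score, and the scores remain comparable across models because they always total 100%.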
Pecan provides multiple visualizations to help you understand which features drive a model to its predictions, and to what degree. Each feature is listed in the Feature Importance Widget, labeled and sorted in accordance with its feature importance. You’ll also see the 10 highest-contributing features for each prediction in your Pecan Predictions Table.
Of course, due to the complex nature of machine learning, it’s generally impossible to achieve complete clarity on how a model works. Each model is a complex statistical structure that graphs cannot fully explain. And as a model becomes more complex, each feature reveals only a limited aspect of the overall decision process, and has progressively less individual impact on the predictions.
Feature Importance Widget
For every trained Pecan model, a “Feature Importance Widget” will display the most influential features in your model.
Interpreting these features will help you understand the factors driving your predictions, as well as which variables you can act on in order to influence outcomes. You may also discover features whose impact is disproportionate or inappropriate, and which should be excluded from your model.
This is how the widget appears in your model’s dashboard:
Left pane: Top 20 features
The widget lists the 20 features that contribute the most to your predictions, as determined during initial model training. Above each feature’s name, you’ll see the name of the table that contains this data.
A purple bar illustrates the importance of each feature relative to the strongest feature (which always has a full bar).
Hover over a bar to see a popup bubble with that feature’s exact importance score.
Note that features are not independent of one another; some will depend on – or interact with – others. This means that feature importance is a somewhat abstract measure: a higher number indicates a stronger feature, but does not account for the complex interactions between features.
A small number of strong features implies that the model is simple and/or might have data leakage, while a more even distribution indicates a more balanced or complex model.
Features that are highly correlated (absolute correlation above 95%) are automatically detected by Pecan’s feature selection process. When this occurs, one of the features is removed, since it adds no new information to the model and only slows down compute time and harms model explainability.
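The pruning rule described above can be sketched with pandas. This is an illustrative sketch only (the column names and the “keep the first-seen column” tie-break are invented; Pecan’s actual selection logic may differ):

```python
import numpy as np
import pandas as pd

# Toy feature table: "total_spend" is a near-duplicate of "total_purchases".
rng = np.random.default_rng(0)
df = pd.DataFrame({"total_purchases": rng.normal(size=200)})
df["total_spend"] = df["total_purchases"] * 9.9 + rng.normal(scale=0.1, size=200)
df["days_active"] = rng.normal(size=200)  # independent feature

# Flag pairs whose absolute Pearson correlation exceeds 0.95, then drop
# one feature from each pair (keeping the first-seen column).
corr = df.corr().abs()
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
to_drop = [col for col in upper.columns if (upper[col] > 0.95).any()]
pruned = df.drop(columns=to_drop)

print(to_drop)  # ['total_spend']
```

Inspecting only the upper triangle of the correlation matrix ensures each pair is tested once and a column is never compared with itself.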
To download the list of all features and their importance, click Save as CSV.
To learn how features are engineered, named and tagged in Pecan, see Feature engineering and description tags.
Right side: Feature Importance Graph (a.k.a. Partial Dependency Plot)
Clicking a feature in the widget loads a Feature Importance Graph (a.k.a. Partial Dependency Plot), which illustrates the effect of that feature on your model’s predictions.
The graph will be a bar chart if the feature is categorical (e.g. day of week), as illustrated below for the feature of “country_code”:
However, it will be a line graph if the feature is continuous (e.g. purchase amount, number of searches), as illustrated below for the feature of “average_points”:
The horizontal axis represents the range of values for a given feature, and the vertical axis represents the normalized effect on your predictions (a.k.a. the SHAP value).
By clicking and dragging on the graph, you can zoom in on specific sections of it.
And by hovering over an individual data point, you can see the effect of that value on the prediction. A score of 1 would mean that the feature perfectly predicts instances of the target behavior, while a score of -1 would mean the opposite. Note: for continuous features (where there is a line graph), the trend is typically more important than any particular point, peak or trough.
For more detailed information on how to interpret this graph, see Partial Dependency Plots (PDP).
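A minimal sketch of how such a partial-dependency-style curve could be derived from per-prediction SHAP values follows. The feature name and all numbers are invented for illustration; real plots are built from a trained model’s actual SHAP output:

```python
import numpy as np

# Hypothetical data for one continuous feature ("average_points"):
# each prediction has a feature value and that feature's SHAP effect.
feature_values = np.array([10, 10, 20, 20, 30, 30])
shap_effects   = np.array([-0.4, -0.2, 0.0, 0.2, 0.5, 0.3])

# The curve plots the average effect on the prediction at each observed
# feature value; the overall trend matters more than any single point.
xs = np.unique(feature_values)
ys = np.array([shap_effects[feature_values == x].mean() for x in xs])

for x, y in zip(xs, ys):
    print(f"average_points={x}: mean effect {y:+.2f}")
```

Here the averaged effects rise with the feature value, i.e. higher “average_points” pushes predictions upward in this toy example.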
A SHAP value quantifies the contribution of a feature to a prediction. As such, it is the basis of the “Effect on Prediction” plot in the above Feature Importance Graph (a.k.a. PDP).
SHAP values are obtained by evaluating a predictive model over every combination of the Attribute data that’s fed into the model. This means you can expect SHAP values to change whenever you add or modify Attribute data, adjust the model’s parameters, or retrain the model. For additional explanation, see SHAP values.
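The “every combination” idea can be made concrete with a brute-force Shapley calculation on a toy model. This is purely illustrative (the data, the stand-in linear model, and the coalition value function are invented, and real tools such as the `shap` library compute this far more efficiently), but it shows why the values shift whenever the data or the model changes:

```python
from itertools import combinations
from math import factorial

import numpy as np

X = np.array([[1.0, 0.0],
              [3.0, 2.0],
              [5.0, 4.0]])                    # background Attribute data
model = lambda M: 2 * M[:, 0] + 3 * M[:, 1]   # stand-in "trained model"

def shapley_values(x, X, model):
    """Exact Shapley values of each feature for the prediction at x."""
    n = X.shape[1]

    def coalition_value(coal):
        # Average model output with the coalition's features fixed to x
        # and the remaining features left as they appear in the data.
        M = X.copy()
        for j in coal:
            M[:, j] = x[j]
        return model(M).mean()

    phi = np.zeros(n)
    for i in range(n):
        others = [j for j in range(n) if j != i]
        for size in range(n):
            for S in combinations(others, size):
                w = factorial(len(S)) * factorial(n - len(S) - 1) / factorial(n)
                phi[i] += w * (coalition_value(S + (i,)) - coalition_value(S))
    return phi

phi = shapley_values(np.array([5.0, 4.0]), X, model)
print(phi)  # per-feature contributions; they sum to f(x) - E[f(X)]
```

Changing any row of `X`, the model’s coefficients, or the point being explained changes the resulting values, which is exactly why SHAP values must be recomputed after data changes or retraining.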
At the bottom of your model’s dashboard (for both binary and regression models), you’ll see an “Output Preview” table that shows the first 100 predictions of your model – along with the top 10 features contributing to each outcome, according to their SHAP values. This is illustrated below:
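Selecting the top contributors per prediction amounts to ranking that row’s SHAP values by magnitude. A minimal sketch follows (feature names and values are invented, and the toy set has only 4 features, so we take the top 3 rather than 10):

```python
import numpy as np

# Hypothetical per-prediction SHAP values: rows = predictions,
# columns = features.
feature_names = np.array(["days_since_last_email", "total_purchases",
                          "country_code", "average_points"])
shap_values = np.array([
    [ 0.7, -0.1,  0.05, -0.3],
    [-0.2,  0.6, -0.40,  0.1],
])

TOP_K = 3  # the Output Preview table shows the top 10
for i, row in enumerate(shap_values):
    order = np.argsort(-np.abs(row))[:TOP_K]  # strongest contributors first
    top = [(feature_names[j], round(float(row[j]), 2)) for j in order]
    print(f"prediction {i}: {top}")
```

Ranking by absolute value means a feature that strongly pushes a prediction down is listed just as prominently as one that pushes it up.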
For an overview of this table and how to interpret it, see Preview Table.