First, what is a feature?
In machine learning, a feature is a property that provides meaningful information to a model. Features comprise the historical data that allow models to learn from the past and generate predictions. As such, they are crucial to the functioning of any model.
In Pecan, features may be numeric values, string values, or categories, and they are derived from attribute tables (which may contain data such as transaction history, demographics, and a range of other properties). Once your model is sent for training, each column in your attribute table(s) becomes an ML feature.
Each feature will have its own unique impact on a model’s predictions, and in Pecan, you can visualize this in your model’s dashboard (see Understanding feature importance).
What is feature engineering and why is it important?
Pecan automatically generates features in addition to those manually added to your model.
This happens during the data preparation stage (when you're creating an AI-ready dataset) before model training begins. The raw data from your attribute table(s) is summarized and aggregated into new columns, thus creating new features and introducing new “points of view” for the model.
Deciding which features to create from the raw data is called “feature engineering”. This work is typically done by data scientists with the help of data engineers, but Pecan does it automatically. Pecan draws on experience across many business use cases to do this robustly, and it continually refines its best practices based on ongoing projects with Pecan customers.
What are the key benefits of this?
Feature engineering is the art of getting more from your data: manipulating and transforming it to extract more “signal” and improve model accuracy.
It typically relies on both simple and complex data manipulations (e.g. key statistics such as mean and median, data aggregations, etc.).
In some cases, engineered features prove to have the highest impact on predictions. However, they are not always intuitive enough to yield insights or make business sense.
For example, based on a column named “purchase_value”, you could derive additional features such as “min_purchase_value”, “max_purchase_value”, and “median_purchase_value”.
As another example, suppose your Attribute table has columns for Support Call Date and Support Call Length per customer ID. From these, you could derive the number of customer support interactions in the prior 30 days, as well as the average support call length. The features that turn out to matter most can be surprising.
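The two examples above can be sketched in a few lines of Python. This is only an illustration of the kind of aggregation involved, not Pecan's actual implementation: the sample data, the function names, the output feature names for support calls, and the choice to average call length within the same 30-day window are all assumptions made for the sketch.

```python
from datetime import date, timedelta
from statistics import mean, median

# Hypothetical raw attribute data for a single customer (illustrative only).
purchases = [12.0, 40.0, 25.0, 8.0, 25.0]          # purchase_value column
support_calls = [                                   # (call date, call length in minutes)
    (date(2024, 3, 1), 10.0),
    (date(2024, 3, 20), 4.0),
    (date(2024, 1, 5), 30.0),
]
prediction_date = date(2024, 3, 31)                 # assumed date of prediction


def purchase_features(values):
    """Summarize one numeric column into min / max / median features."""
    return {
        "min_purchase_value": min(values),
        "max_purchase_value": max(values),
        "median_purchase_value": median(values),
    }


def support_features(calls, marker, window_days=30):
    """Count calls in the window before the marker and average their length."""
    window_start = marker - timedelta(days=window_days)
    # Keep only calls that fall inside [window_start, marker).
    recent = [length for day, length in calls if window_start <= day < marker]
    return {
        "support_calls_last_30d": len(recent),
        "avg_support_call_length": mean(recent) if recent else None,
    }


features = {
    **purchase_features(purchases),
    **support_features(support_calls, prediction_date),
}
print(features)
```

Each raw column yields several new columns, which is how a small attribute table can grow into many candidate features.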
How are engineered features named?
A Pecan-generated feature will include the word "pecan" in its name, along with several other components:
A prefix of one or two numbers:
The first number points to the table from which the feature originated (tables are numbered according to the order in which they were inserted into the builder).
If a second number is included, it indicates the recency of the attribute relative to the date of prediction. For example, 1 would refer to the most recent order a user placed, 2 would refer to the second most recent order, etc.
The name of the column that was used to generate the feature.
An aggregate function or a date function:
· max, min - Maximum/minimum value aggregations.
· avg - Average value.
· count_distinct - Count of distinct values per entity (order, user, etc.).
· stddev - Standard deviation.
· sum - Sum of all values in the column.
· dayofweek - Day of the week (1 to 7).
· day - Day number (1 to 31).
· hour - Hour of the day (0 to 23).
· marker_diff - Difference between the dates in the column and the day of the prediction (Marker).
· month - Month of the year (1 to 12).
· prev_transaction_dist - Count of days since the last transaction per row/event in the Entity table. In other words, it indicates the frequency of transactions per entity.
· quarter - Year quarter (1 to 4).
· weekofyear - Week of the year (1 to 52).
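To make the date functions listed above more concrete, here is a minimal Python sketch that derives each of them from a single timestamp. It is not Pecan's implementation, and several details are assumptions: the function name date_features is hypothetical, dayofweek uses Monday = 1, weekofyear uses ISO week numbering (which can reach 53 in some years), and marker_diff is counted as positive for dates before the marker.

```python
from datetime import date, datetime


def date_features(ts: datetime, marker: date) -> dict:
    """Derive calendar-based features from one timestamp (illustrative sketch)."""
    return {
        "dayofweek": ts.isoweekday(),              # 1 to 7, Monday = 1 (assumed convention)
        "day": ts.day,                             # 1 to 31
        "hour": ts.hour,                           # 0 to 23
        "month": ts.month,                         # 1 to 12
        "quarter": (ts.month - 1) // 3 + 1,        # 1 to 4
        "weekofyear": ts.isocalendar()[1],         # ISO week number (assumed convention)
        "marker_diff": (marker - ts.date()).days,  # days from the activity to the marker
    }


# Hypothetical activity timestamp and prediction (marker) date.
example = date_features(datetime(2024, 5, 15, 14, 30), date(2024, 6, 1))
print(example)
```

For this sample timestamp (a Wednesday afternoon in mid-May), the sketch produces dayofweek 3, quarter 2, and a marker_diff of 17 days.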
Feature engineering description tags
In your dashboard, you’ll see the features that help the model arrive at its predictions, ranked in the Feature Importance widget.
A color-coded tag will be attached to certain features. These tags indicate that the feature has been engineered, or in other words, synthetically created from your data.
The color of the tag indicates the type of data manipulation that was performed to engineer that feature.
This is what the color of each tag represents…
Blue: date-related manipulation
Green: aggregated data
Orange: Pecan Number-related manipulation
Here are the different tags you may encounter:
n_rows - the number of rows per Entity in the Attribute table (e.g. how many transactions within the last week, how many visits within the last month)
1, 2, 3, etc. - the Pecan Number (pecan_number), which indicates how recent an Attribute row is relative to the date of the prediction. For instance, 1 would correspond to the row containing the latest transaction of a user, 2 would correspond to the row containing the transaction before that, and so on.
max, min - the maximum/minimum value in numeric columns
avg - the mean value in numeric columns
mode - the most common value in categorical columns
median - the median value in numeric columns (often useful for datasets with extreme outliers)
count_distinct - the number of distinct categorical values when there are multiple rows per entity in the attribute table
stddev - the standard deviation of the feature value
sum - the sum of all values in the relevant column in the attribute table
hour - hour of day of the timestamp (0 to 23)
dayofweek - day of week of the timestamp (1 to 7)
day - day of month of the timestamp (1 to 31)
weekofyear - week of the year of the timestamp (1 to 52)
month - month of the year of the timestamp (1 to 12)
quarter - quarter of the year of the timestamp (1 to 4)
marker_diff - the number of days between the date of the activity and the date of the prediction (the “marker” date)
prev_transaction_dist - the number of days since the previous row for that Entity in the Attribute table (i.e. the time elapsed since that Entity’s preceding transaction)
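The row-based tags (n_rows, the Pecan Number, and prev_transaction_dist) can be illustrated with a short Python sketch. This is an assumption-laden illustration rather than Pecan's implementation: the sample rows, the function names, and the choice to consider only rows dated before the marker are all hypothetical.

```python
from datetime import date

# Hypothetical attribute rows for one entity: (transaction date, amount).
rows = [
    (date(2024, 1, 10), 20.0),
    (date(2024, 2, 1), 35.0),
    (date(2024, 2, 15), 15.0),
]
marker = date(2024, 3, 1)  # assumed date of prediction


def recency_ranked(rows, marker):
    """Map a recency rank to each row; rank 1 is the most recent row before the marker."""
    eligible = sorted(
        (r for r in rows if r[0] < marker), key=lambda r: r[0], reverse=True
    )
    return {rank: row for rank, row in enumerate(eligible, start=1)}


def prev_transaction_dist(rows):
    """Days between each row and the row before it (None for the earliest row)."""
    dates = sorted(r[0] for r in rows)
    return [
        None if i == 0 else (dates[i] - dates[i - 1]).days
        for i in range(len(dates))
    ]


ranked = recency_ranked(rows, marker)
n_rows = len(ranked)                 # number of rows for this entity
gaps = prev_transaction_dist(rows)   # gap in days per row, oldest first
print(ranked[1], n_rows, gaps)
```

Here rank 1 points at the February 15 transaction, rank 2 at the February 1 transaction, and the gaps between consecutive transactions are 22 and 14 days.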