Attribute tables enrich your records with relevant data that will be valuable for generating predictions, and with Pecan, adding an attribute takes a couple of seconds!
Attribute tables play a crucial role by providing additional dimensions of data, adding depth to the analysis. Imagine them as more puzzle pieces that enhance the bigger picture of your data story.
What is an Attribute table?
Attribute tables provide the model with all the data it needs to work on to uncover hidden patterns that will allow it to make predictions.
A wide variety of data may factor into the target behavior of a customer, such as their transaction history, gaming results, demographic information, and NPS score. Indeed, a very large number of attributes may be involved in making a prediction. Machine learning is what enables you to detect patterns among such vast quantities of data, and to create features out of these columns.
Below is an example of an Attribute Table, which contains information about customer behavior and would help form the basis of a model’s predictions.
customer id | transaction date | category | country | campaign id | gender | transaction amount |
57380 | 2022-02-28 08:43:07 | Garden | Japan | 4041379140569 | Female | $11.57 |
35395 | 2022-03-14 21:25:00 | Shoes | Jamaica | 5010124916284297 | Male | $68.22 |
78463 | 2022-04-06 06:43:10 | Industrial | Canada | 6378381216678255 | Female | $5.13 |
Through automatic processes of feature engineering and feature selection, Pecan will uncover the features that may contribute to a model, and then determine whether the significance is high enough to generate useful predictions (see Understanding Column Importance).
What tables can be used as attributes?
While adding attribute tables is super important and easy, it's crucial that each table will have:
An ID column allowing you to join the table with the core_set.
Additional columns that hold the data that can be used to make predictions.
Adding an attribute
To create an attribute, you must first import the table into Pecan, either in your connections tab or by using the + Add data button in the Data connections panel in your notebook.
Using Pecan's AI Assistant (recommended)
In the notebook, hover over the empty space below the cell underneath you'd like to add the attribute, and click Generate Attribute.
Click the new cell to select the table you want to use. Pecan will automatically generate a query to use all the columns in the table, joining it with the core set:
Pecan will also take into account whether it is a 1:1 or 1:many table (see below for more information).
Once generated, you can edit the query as usual:
Add attributes manually
In the notebook, hover over the empty space below the cell underneath you'd like to add the attribute, and click + Query. A new empty SQL cell will be created in which you can write your own attribute query.
Let's go over the two SQL templates we have for adding attributes:
A template for 1:1 tables (one line per identifier)
SELECT [column_names]
FROM [table_name]
JOIN core_set
ON [table_name].[identifier_column] = core_set.[identifier_column]
A template for 1:many tables (multiple lines per identifier)
SELECT [column_names]
FROM [table_name]
JOIN core_set
ON [table_name].[identifier_column] = core_set.[identifier_column]
WHERE
[table_name].[date_column] < core_set.[sampled_date_column]
AND
[table_name].[date_column] >= CORE_SET.[sampled_date_column] - INTERVAL 1 YEAR
--You can change this interval to reflect how much past data you have. Up to two years is usually sufficient.
In the 1:many template, you also have the WHERE function with the following conditions:
Temporal Filtering:
[table_name].[date_column] < core_set.[sampled_date_column]
This part of the condition helps to prevent "future" information from being included in the model, which is crucial for preventing data leakage. To read more about data leakage, click here.
Time Window:
[table_name].[date_column] >= CORE_SET.[sampled_date_column] - INTERVAL 1 YEAR
This part of the condition establishes a time window by ensuring that the call date is within one year before the marker date. This helps control the historical scope of the data being considered. For example, you may not want to use data collected during COVID as this was an abnormal time. To learn more about the marker date, click here.
Important things to remember:
You must add a JOIN to the CORE SET
We need to perform aJOIN
on thecore_set
, so all the new data we added will be linked to the current Entity.
Avoid using
SELECT *
and specify the column names you want to use in the attribute. NEVER include the column that was used as the label in the core table.
This will minimize the chances of data leakage
Use only columns that will actually be available for the model once it needs to make predictions. If the model encounters a different schema or missing columns when trying to make predictions, it will encounter an error.
Add more attribute tables
You can add as many attribute tables as you want. Pecan will automatically run feature selection and engineering on all of them when your model is sent to train.
Naming the attribute Cell and marking it as an Attribute Table
After adding an attribute query to your notebook, make sure to give it a unique name and mark it as an attribute table.
Name the cell by writing its name at the bottom of the table:
Mark it as an attribute table, by clicking on the bottom right side of the table, and selecting "Attribute table" from the dropdown: