What is a predictive question?
When creating a model in Pecan, you’ll define your “predictive question”. It communicates the prediction you’re trying to make and will vary depending on the business problem you’re trying to solve.
An example framework for a predictive question is: “Given a particular customer at a particular point in time, what is the likelihood that they'll perform a certain activity in the future?”
You’ll need to plug specific values into that question; not being specific enough will create unnecessary noise in the model and result in poor predictive performance. (This is where a data analyst can add immense value, using their knowledge and expertise to formulate a question that helps generate accurate predictions.)
Your predictive question will have four key components:
Who exhibits the behavior you want to predict? (These are your entities)
When do you want to make a prediction? (Predictions can be made on a recurring or a one-time basis.)
What activity or behavior do you want to predict?
How far into the future do you want to predict?
Here are a few examples of a good predictive question:
“7 days after subscribing, which Tier-1 customers are likely to churn within the next 60 days?”
“14 days after app installation, how much revenue is expected to be generated by each new user within the next 365 days?”
“Predicting on a weekly basis, what's the likelihood of a customer who made a purchase within the last 30 days to upgrade their plan within the next 60 days?”
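To make the four components concrete, here is a minimal Python sketch that stores them as fields and rebuilds the first example question from them. The `PredictiveQuestion` class and its field names are hypothetical, chosen only for illustration; they are not part of Pecan.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PredictiveQuestion:
    """Hypothetical container for the four components (not a Pecan API)."""
    entity: str        # who exhibits the behavior
    trigger: str       # when the prediction is made
    target: str        # what activity or behavior is predicted
    horizon_days: int  # how far into the future to predict

    def as_sentence(self) -> str:
        # Assemble the components into a readable predictive question.
        return (f"{self.trigger}, which {self.entity} are likely to "
                f"{self.target} within the next {self.horizon_days} days?")


churn_question = PredictiveQuestion(
    entity="Tier-1 customers",
    trigger="7 days after subscribing",
    target="churn",
    horizon_days=60,
)
print(churn_question.as_sentence())
```

If any of the four fields is hard to fill in, the question is probably not yet specific enough.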
Once you’ve defined your predictive question, it will serve as the basis for your predictive flow (see below). In Pecan, your question is expressed as a set of SQL queries, which are used to transform your historical dataset into an AI-ready dataset your model can train on.
What is a predictive flow?
In Pecan, a predictive flow is the full prediction process, made up of three parts:
Queries - where you tell your model how to interpret your data using SQL in the Nutbook. This is where the majority of your work will happen. It consists of a set of SQL queries that run against the data you imported into Pecan. These queries allow Pecan to generate AI-ready tables that can be interpreted by Pecan’s AutoML (automated machine learning). The model uses these tables to learn from the data, train itself to make accurate predictions, and make predictions for future datasets.
Model - your trained model's dashboard, which lets you evaluate your model's ability to make precise predictions.
Predict - where you configure and monitor your prediction production cycles, to empower your decision-making.
What is the core set?
When you’re training or using a model in Pecan, the core_set table is the basis for each prediction. Each row is the subject of a prediction (a person, a SKU, etc.) at a particular point in time, which you either learn from when training a model or make a prediction for once the model is deployed.
Let's imagine that we have an app and want to predict who will stop their subscription in the next monthly cycle. That means that for each user we need their ID and the renewal date for their subscription, so that a prediction can be made.
As such, an entity consists of two elements:
An identifier (e.g. customer_ID, user_ID) or activity identifier (e.g. transaction_ID, event_ID)
A date
customer_id | date |
67380 | 2022-02-28 |
67380 | 2022-03-30 |
67380 | 2022-04-30 |
215013 | 2022-02-18 |
215013 | 2022-03-18 |
215013 | 2022-04-18 |
As you can see, a model can make a prediction for an entity at multiple points in time (with multiple dates). In such cases, the same ID can and will exist in multiple rows/entities. This is common in Pecan models designed to predict churn, retention, upsells, and engagement.
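As a rough illustration of how one ID ends up in several rows, the following Python sketch builds (customer_id, date) pairs for three monthly renewal dates per customer. The subscription start dates and the helper names `add_months` and `build_core_set` are assumptions made for this example; in Pecan the core set itself is defined with SQL queries.

```python
import calendar
from datetime import date


def add_months(start: date, months: int) -> date:
    """Shift a date by whole months, clamping to the last day of short months."""
    month_index = start.month - 1 + months
    year = start.year + month_index // 12
    month = month_index % 12 + 1
    last_day = calendar.monthrange(year, month)[1]
    return date(year, month, min(start.day, last_day))


def build_core_set(start_dates: dict[int, date], periods: int) -> list[tuple[int, date]]:
    """Return one (customer_id, date) row per customer per monthly renewal."""
    rows = []
    for customer_id, start in start_dates.items():
        for offset in range(1, periods + 1):
            rows.append((customer_id, add_months(start, offset)))
    return rows


# Assumed subscription start dates, for illustration only.
start_dates = {67380: date(2022, 1, 30), 215013: date(2022, 1, 18)}
core_set = build_core_set(start_dates, periods=3)
for customer_id, prediction_date in core_set:
    print(customer_id, prediction_date.isoformat())
```

Each (ID, date) pair appears exactly once, even though each ID appears three times. The clamping in `add_months` keeps a renewal on the 30th valid in February.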
What is the label column in the core set?
Only when training a model, Pecan adds a third column to the core set called "label". This is the outcome you want to predict. Because the model trains on past data whose outcomes are already known, the label column acts as the answer key in your dataset: it contains the actual outcomes or results that you want the model to learn to predict.
For example, if you’re trying to predict whether a customer will renew a subscription, the label column would indicate whether each customer in your dataset actually renewed or not. If you're trying to determine lifetime value, the label will contain the actual lifetime value of that customer at that specific point in time.
This column is used to train and test your model. When you later use the model to make actual future predictions, this column holds the predicted values instead.
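The renewal example can be sketched in Python as follows: for each core set row, the label is 1 if the customer renewed within a chosen window after the date, and 0 otherwise. The renewal history, the 31-day window, and the `label_rows` helper are all made up for illustration; this is not how Pecan computes labels internally.

```python
from datetime import date, timedelta


def label_rows(core_set, renewals, horizon_days):
    """Attach 1 if the customer renewed within the horizon after the date, else 0."""
    labeled = []
    for customer_id, marker in core_set:
        window_end = marker + timedelta(days=horizon_days)
        # Only renewals strictly after the date and inside the window count.
        renewed = any(marker < day <= window_end
                      for day in renewals.get(customer_id, []))
        labeled.append((customer_id, marker, int(renewed)))
    return labeled


core_set = [(67380, date(2022, 3, 30)), (215013, date(2022, 3, 18))]
renewals = {67380: [date(2022, 4, 30)]}  # hypothetical history; 215013 never renewed
training_rows = label_rows(core_set, renewals, horizon_days=31)
for row in training_rows:
    print(row)
```

Because the label looks at what happened after each date, it can only be filled in for historical rows whose outcome window has already passed.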
Why do we need a date column per entity?
When you’re creating a predictive flow for your model, Pecan needs a date for each entity. This date indicates the moment of prediction; it is the precise date or timestamp at which you'll make a prediction for each entity.
When a date is placed alongside a customer identifier, it creates an instance of each entity at a particular point in time, for whom you can make a prediction. Here's a simplified example of how your training dataset might look:
user_id | date | campaign | state | age | made purchase within 30 days of installation |
13455 | 2022-12-01 | Garden | TX | 45-54 | 1 |
13456 | 2022-12-01 | Shoes | OK | 18-24 | 1 |
13457 | 2022-12-02 | Industrial | PA | 35-44 | 0 |
13458 | 2022-02-03 | Electronics | CA | 45-54 | 1 |
13459 | 2022-04-03 | Toys | UT | 35-44 | 0 |
13460 | 2022-05-04 | Sports | CO | 18-24 | 0 |
How do we arrive at the date for each entity (its “marker date”)? It depends on your predictive question.
Imagine you want to predict the likelihood of new users making a purchase, and you want to make this prediction 14 days after they install your app. In this scenario, your marker date would be “installation + 14 days”. In the above table, each date would be calculated by adding 14 days to each user’s installation date.
Since the marker date defines the time of prediction, it’s also the point at which you sample a user’s activity; it’s the final date from which you may use historical data for model training purposes. So, if the date is defined as “installation + 14 days”, your model will only train on data from events that occurred up until that date (for each individual entity).
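A small Python sketch of this rule, under stated assumptions: the marker date is the installation date plus 14 days, and only events on or before the marker date are kept for training. The installation date, the event names, and the choice to treat the cutoff as inclusive are all illustrative assumptions.

```python
from datetime import date, timedelta

PREDICTION_DELAY = timedelta(days=14)


def marker_date(install_date: date) -> date:
    """Moment of prediction: 14 days after installation."""
    return install_date + PREDICTION_DELAY


def usable_events(events, marker):
    """Keep only events dated on or before the marker date."""
    return [(day, name) for day, name in events if day <= marker]


install = date(2022, 11, 17)  # hypothetical installation date
marker = marker_date(install)
events = [
    (date(2022, 11, 18), "opened_app"),
    (date(2022, 11, 30), "added_to_cart"),
    (date(2022, 12, 5), "purchase"),  # after the marker date, so excluded
]
training_events = usable_events(events, marker)
print(marker.isoformat(), [name for _, name in training_events])
```

Dropping the later purchase event is the point: at prediction time that information would not exist yet, so the model should not see it during training either.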
In some cases, you’ll want to make recurring predictions. Say you want to predict a user’s likelihood to churn on a monthly basis, regardless of their installation date. In your predictive flow for this model, you would define the frequency as “monthly”.
What are the differences between the “fastest” and “production grade” training modes?
When sending a model to train, you can choose between two training modes:
Fastest - this setting is great for iterating on your queries and training set, seeing how different data elements affect accuracy, and whether you are on the right track in defining what you are trying to predict. Training will run faster but won’t produce the best possible accuracy.
Production grade - use this when you are ready to train a production-grade version of your model, or if you want more granular control over the training process. Training will be slower but will produce the best possible accuracy.
We recommend starting with the “fastest” training mode. If the results make sense, you can duplicate your predictive flow and train it again in “production grade” mode.
The main differences between the two training modes are as follows:
| Fastest | Production grade |
Training time | 15-30 minutes. | Several hours. |
Model accuracy | Good. Faster training time at the expense of somewhat reduced accuracy. | Excellent. |
Data size | Limitations on data size: tens of millions of rows. | No limitations. |
| Predetermined. | Adjustable. |