What is a predictive question?

When creating a model in Pecan, you’ll define your “predictive question”. It communicates the prediction you’re trying to make, and will vary depending on the business problem you’re trying to solve.

An example framework for a predictive question is: “Given a particular customer at a particular point in time, what is the likelihood that they'll perform a certain activity in the future?”

You’ll need to plug specific values into that question; not being specific enough will create unnecessary noise in the model and result in poor predictive performance. (This is where a data analyst can add immense value, using their knowledge and expertise to formulate a question that helps generate accurate predictions.)

Your predictive question will have four key components:

  • Who exhibits the behavior you want to predict? (These are your entities)

  • When do you want to make a prediction?

    • Predictions can be made on a recurring or a one-time basis.

  • What activity or behavior do you want to predict?

  • How far into the future do you want to make a prediction for?

Here are a few examples of a good predictive question:

  • “7 days after subscribing, which Tier-1 customers are likely to churn within the next 60 days?”

  • “14 days after app installation, how much revenue is expected to be generated by each new user within the next 365 days?”

  • “Predicting on a weekly basis, what's the likelihood of a customer who made a purchase within the last 30 days to upgrade their plan within the next 60 days?”
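Each of the questions above answers the same four component questions (who, when, what, how far ahead). As a minimal sketch, the components could be captured in a simple structure; the field names here are illustrative, not part of Pecan's API:

```python
from dataclasses import dataclass

# Hypothetical sketch: the four components of a predictive question.
@dataclass
class PredictiveQuestion:
    entities: str      # who exhibits the behavior you want to predict
    marker: str        # when the prediction is made
    target: str        # what activity or behavior is predicted
    horizon_days: int  # how far into the future

# The Tier-1 churn example, expressed in this structure:
question = PredictiveQuestion(
    entities="Tier-1 customers",
    marker="7 days after subscribing",
    target="churn",
    horizon_days=60,
)
print(question.target, question.horizon_days)  # churn 60
```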

Once you’ve defined your predictive question, it will serve as the basis for your blueprint (see below). Pecan will translate your question into a set of SQL queries, which will be used to transform your historical dataset into an AI-ready dataset your model can train on.

What is a blueprint?

In Pecan, a blueprint is where you tell your model how to interpret your data – for both training purposes and for making predictions.

In doing so, it frames your predictive question in a way the model can understand. For example, it might convey the following story: “Here's what a churned customer looks like in the historical data. Based on that, how likely is it that other customers will churn within a certain time period?”

Each blueprint consists of a set of SQL queries that query the data you’ve imported to Pecan. These queries allow Pecan to generate AI-ready tables that can be interpreted by Pecan’s AutoML (automated machine learning). The model will use these tables to learn from the data, train itself to make accurate predictions, and make predictions for future datasets.
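To make this concrete, here is a toy query in the spirit of a blueprint query, run against an in-memory SQLite database. It turns a raw subscriptions table into an entity-style table (a customer identifier plus a marker date). The table and column names are assumptions for illustration only, not Pecan's actual queries:

```python
import sqlite3

# Illustrative raw data: one row per subscription.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE subscriptions (customer_id TEXT, subscribed_at TEXT)")
conn.executemany(
    "INSERT INTO subscriptions VALUES (?, ?)",
    [("c1", "2023-01-01"), ("c2", "2023-01-05")],
)

# Marker date = subscription date + 7 days, echoing the churn example above.
rows = conn.execute(
    "SELECT customer_id, date(subscribed_at, '+7 days') AS marker_date "
    "FROM subscriptions"
).fetchall()
print(rows)  # [('c1', '2023-01-08'), ('c2', '2023-01-12')]
```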

Blueprints may be built from scratch or based on an existing Pecan template. You can also use variables to create queries more easily and adjust the parameters of your model.

To learn more about creating blueprints in Pecan, see Introduction to the Blueprint Editor. Or, for a deeper dive into the specific queries, see Creating your ETA queries with SQL.

What is an entity?

When you’re training or using a model in Pecan, an entity is the subject of each prediction. It is a customer (or other subject) at a particular point in time that you are either learning from when training a model, or making a prediction for once the model is deployed.

As such, an entity is composed of two elements:

  • A customer identifier (e.g. customer_ID, user_ID) or activity identifier (e.g. transaction_ID, event_ID)

  • A marker date, which indicates the point in time at which the prediction is made (see below)

In the below table (a simple example of an Entity Table), each row represents a single entity:


| customer_ID | marker date |
| --- | --- |
| 1001 | 2023-01-15 |
| 1001 | 2023-02-15 |
| 1002 | 2023-01-15 |
As you can see, a model can make a prediction for a customer at multiple points in time (with multiple marker dates). In such cases, a customer can and will exist in multiple rows/entities. This is common in Pecan models designed to predict churn, retention, upsells and engagement.
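As a quick sketch (not Pecan's internals), an entity can be thought of as a (customer identifier, marker date) pair, so one customer sampled at several marker dates yields several distinct entities:

```python
from datetime import date

# Each entity is a (customer ID, marker date) pair; IDs and dates are
# illustrative assumptions.
entities = [
    ("cust_42", date(2023, 3, 1)),
    ("cust_42", date(2023, 4, 1)),  # same customer, later marker date
    ("cust_77", date(2023, 3, 1)),
]

unique_customers = {cid for cid, _ in entities}
print(len(entities), len(unique_customers))  # 3 entities, 2 customers
```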

When training a model, these two columns will also be added to your Target Table so the ground-truth labels (a.k.a. actual outcomes) in the “Labels” column can be matched up with each entity.

What is a marker date?

When you’re creating a blueprint for your model, Pecan will add a “marker date” column to your Entity and Target tables. A marker date indicates the moment of prediction; it is the precise date or timestamp at which you'll make a prediction for each customer.

When a marker date is placed alongside a customer identifier, it creates an instance of each customer at a particular point in time, for whom you can make a prediction (a.k.a. an entity). Here's a simplified example of how your training dataset might look:


| customer_ID | marker date | made purchase within 30 days of installation |
| --- | --- | --- |
| 2001 | 2023-01-15 | Yes |
| 2002 | 2023-01-18 | No |
| 2003 | 2023-02-02 | Yes |
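A column like “made purchase within 30 days of installation” is a ground-truth label. As a hedged sketch of how such a label might be derived (the function name and logic are assumptions, not Pecan's actual queries):

```python
from datetime import date, timedelta
from typing import Optional

# Hypothetical label: did the customer make their first purchase within
# 30 days of installing the app?
def purchase_label(installed: date, first_purchase: Optional[date]) -> bool:
    return (
        first_purchase is not None
        and installed <= first_purchase <= installed + timedelta(days=30)
    )

print(purchase_label(date(2023, 1, 1), date(2023, 1, 20)))  # True
print(purchase_label(date(2023, 1, 1), None))               # False
```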
How do we arrive at each marker date? It depends on your predictive question.

Imagine you want to predict the likelihood that new users will make a purchase, and you want to make this prediction 14 days after they install your app. In this scenario, your marker date would be “installation + 14 days”. In the above table, each marker would be calculated by adding 14 days to each customer’s installation date.

Since a marker defines the time of prediction, it’s also the point at which you sample a customer’s activity – the final date up to which historical data may be used for model-training purposes. So, if the marker is defined as “installation + 14 days”, your model will train only on data from up until the marker date (for each individual entity).
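A minimal sketch of both ideas, assuming illustrative dates: the marker is the installation date plus 14 days, and only activity on or before the marker may feed training:

```python
from datetime import date, timedelta

# Marker date = installation + 14 days (dates are illustrative).
installed = date(2023, 5, 1)
marker = installed + timedelta(days=14)  # 2023-05-15

# Only activity up to the marker date is usable for training this entity.
activity_dates = [date(2023, 5, 3), date(2023, 5, 10), date(2023, 5, 20)]
usable = [d for d in activity_dates if d <= marker]
print(marker, len(usable))  # 2023-05-15 2
```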

In some cases, you’ll want to make recurring predictions. Say you want to predict a user’s likelihood to churn on a monthly basis, regardless of their installation date. In your blueprint for this model, you would define the marker frequency as “monthly”.
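A “monthly” marker frequency simply generates one marker date per month, independent of any installation date. A purely illustrative sketch:

```python
from datetime import date

# One marker on the first of each month for a given year (illustrative).
def monthly_markers(year: int) -> list:
    return [date(year, month, 1) for month in range(1, 13)]

markers = monthly_markers(2023)
print(markers[0], markers[-1], len(markers))  # 2023-01-01 2023-12-01 12
```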

What are the differences between “fastest” and “production grade” Training modes?

When sending a model to train, you can choose between two training modes:

  • Fastest - great for testing how good a model Pecan can create from your data and queries, or for getting some quick data insights, when your dataset is not too big.

  • Production grade - for when you want to create a model you plan to use for making predictions, have lots of data, or want granular control over your model.

We recommend starting with the “fastest” training mode; if the results make sense, you can duplicate your model as a blueprint and train it again in production-grade mode.

The main differences between the two training modes are as follows:


| | Fastest | Production grade |
| --- | --- | --- |
| Training time | About 12 minutes (up to 30 minutes for big datasets). | A couple of hours. |
| Model accuracy | Speed is increased by engineering fewer features and by training a single model. | Accuracy is increased by creating more advanced features (slopes, modes) and by training multiple models and selecting the best one. |
| Create predictions (use model) | This model can evaluate your data and blueprint queries, but can’t create new predictions. | This model can receive new entities and produce predictions for them. |
| Data size | Up to 1 billion cells (e.g., 100 million rows with 10 columns, or 1 million rows with 1,000 columns). | Can be adjusted. |
