Creating a Model
Creating a model | FAQ
Written by Ori Sagi

What is a predictive question?

When creating a model in Pecan, you’ll define your “predictive question”. It communicates the prediction you’re trying to make and will vary depending on the business problem you’re trying to solve.

An example framework for a predictive question is: “Given a particular customer at a particular point in time, what is the likelihood that they'll perform a certain activity in the future?”

You’ll need to plug specific values into that question; not being specific enough will create unnecessary noise in the model and result in poor predictive performance. (This is where a data analyst can add immense value, using their knowledge and expertise to formulate a question that helps generate accurate predictions.)

Your predictive question will have four key components:

  • Who exhibits the behavior you want to predict? (These are your entities)

  • When do you want to make a prediction?

    Predictions can be made on a recurring or a one-time basis.

  • What activity or behavior do you want to predict?

  • How far into the future do you want to predict?
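The four components above can be captured as plain data. Here's a minimal sketch (not Pecan syntax; the field names are purely illustrative) using the churn example that follows:

```python
# A predictive question broken into its four components.
# This is an illustrative data structure, not anything Pecan-specific.
predictive_question = {
    "who": "Tier-1 customers",            # the entities
    "when": "7 days after subscribing",   # the moment of prediction
    "what": "churn",                      # the behavior to predict
    "horizon_days": 60,                   # how far into the future
}

def as_sentence(q):
    """Render the components back into a readable predictive question."""
    return (f'{q["when"].capitalize()}, which {q["who"]} are likely to '
            f'{q["what"]} within the next {q["horizon_days"]} days?')

print(as_sentence(predictive_question))
```

Filling in each component explicitly, as here, is a quick way to check that your question is specific enough before building the flow.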

Here are a few examples of good predictive questions:

  • “7 days after subscribing, which Tier-1 customers are likely to churn within the next 60 days?”

  • “14 days after app installation, how much revenue is expected to be generated by each new user within the next 365 days?”

  • “Predicting on a weekly basis, what's the likelihood of a customer who made a purchase within the last 30 days to upgrade their plan within the next 60 days?”

Once you’ve defined your predictive question, it will serve as the basis for your predictive flow (see below). It is reflected as a set of Pecan queries that translate your question into SQL, which is used to transform your historical dataset into an AI-ready dataset your model can train on.

What is a predictive flow?

In Pecan, a predictive flow is the full prediction process, made up of three parts:

  1. Queries - where you tell your model how to interpret your data using SQL in the Nutbook. This is where the majority of your work will happen. It consists of a set of SQL queries that run against the data you've imported into Pecan. These queries allow Pecan to generate AI-ready tables that can be interpreted by Pecan’s AutoML (automated machine learning). The model will use these tables to learn from the data, train itself to make accurate predictions, and make predictions for future datasets.

  2. Model - your trained model's dashboard, which allows you to evaluate your model's ability to make precise predictions.

  3. Predict - where you configure and monitor your prediction production cycles to empower your decision-making.
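To make the Queries step above concrete, here's a minimal sketch of the kind of SQL involved, run here against an in-memory SQLite database. The table and column names (subscriptions, customer_id, subscribed_at) are hypothetical, not Pecan's actual schema or Nutbook syntax:

```python
import sqlite3

# Hypothetical source data: one row per customer subscription.
con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE subscriptions (customer_id INTEGER, subscribed_at TEXT);
    INSERT INTO subscriptions VALUES (67380, '2022-01-01'), (215013, '2022-01-15');
""")

# A query in this spirit could build an entity table: one row per customer,
# with the prediction date defined as "subscription + 7 days".
rows = con.execute("""
    SELECT customer_id,
           date(subscribed_at, '+7 days') AS prediction_date
    FROM subscriptions
""").fetchall()
print(rows)  # [(67380, '2022-01-08'), (215013, '2022-01-22')]
```

The real Queries step produces full AI-ready tables rather than this two-column result, but the shape of the work, SQL that turns raw imported data into per-entity rows, is the same.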

What is an entity?

When you’re training or using a model in Pecan, an entity is the subject of each prediction. It is a person or an SKU at a particular point in time that you are either learning from when training a model or making a prediction for once the model is deployed.

As such, an entity consists of two elements:

  • A customer identifier (e.g. customer_ID, user_ID) or activity identifier (e.g. transaction_ID, event_ID)

  • A date

customer_id | date
67380       | 2022-02-28
67380       | 2022-03-30
67380       | 2022-04-30
215013      | 2022-02-28
215013      | 2022-03-30
215013      | 2022-04-30

As you can see, a model can make a prediction for a customer at multiple points in time (with multiple dates). In such cases, a customer can and will exist in multiple rows/entities. This is common in Pecan models designed to predict churn, retention, upsells, and engagement.
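An entity table like the one above can be sketched with nothing but the standard library; a minimal example (the customer IDs and dates simply mirror the table):

```python
from datetime import date

# Build one entity (customer_id, date) per customer per recurring
# prediction date, mirroring the table above.
customers = [67380, 215013]
prediction_dates = [date(2022, 2, 28), date(2022, 3, 30), date(2022, 4, 30)]

entities = [(c, d.isoformat()) for c in customers for d in prediction_dates]
for customer_id, d in entities:
    print(customer_id, d)
# Each customer appears once per prediction date: 2 customers x 3 dates = 6 entities.
```

The cross product of customers and dates is what makes the same customer show up in multiple rows, one entity per point in time.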

When training a model, these two columns will also be added to your Core Table so the ground-truth labels (a.k.a. actual outcomes) in the “Labels” column can be matched up with each entity.

Why do we need a date column per entity?

When you’re creating a predictive flow for your model, Pecan needs a date for each entity. This date indicates the moment of prediction; it is the precise date or timestamp at which you'll make a prediction for each entity.

When a date is placed alongside a customer identifier, it creates an instance of each entity at a particular point in time, for whom you can make a prediction. Here's a simplified example of how your training dataset might look:

user_id | date       | campaign    | state | age   | made purchase within 30 days of installation
13455   | 2022-12-01 | Garden      | TX    | 45-54 | 1
13456   | 2022-12-01 | Shoes       | OK    | 18-24 | 1
13457   | 2022-12-02 | Industrial  | PA    | 35-44 | 0
13458   | 2022-02-03 | Electronics | CA    | 45-54 | 1
13459   | 2022-04-03 | Toys        | UT    | 35-44 | 0
13460   | 2022-05-04 | Sports      | CO    | 18-24 | 0

How do we arrive at each marker date? It depends on your predictive question.

Imagine you want to predict the likelihood of new users making a purchase, and you want to make this prediction 14 days after they install your app. In this scenario, your marker date would be “installation + 14 days”. In the above table, each date would be calculated by adding 14 days to each customer’s onboarding date.

Since the date defines the time of prediction, it’s also the point at which you sample a customer’s activity – it’s the final date from which you may use historical data for model training purposes. So, if the date is defined as “onboarding + 14 days”, your model will only train itself on data that occurred up until that date (for each individual entity).
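Both ideas, computing the marker date and using it as a training cutoff, fit in a few lines. A sketch with the standard library (the install date and event dates are made up for illustration):

```python
from datetime import date, timedelta

# Marker date: "installation + 14 days".
install_date = date(2022, 11, 17)
marker_date = install_date + timedelta(days=14)

# Only activity up to (and including) the marker date may be used
# for this entity during training.
events = [date(2022, 11, 20), date(2022, 11, 30), date(2022, 12, 5)]
usable = [e for e in events if e <= marker_date]

print(marker_date)  # 2022-12-01
print(usable)       # the two November events; the December 5 event is excluded
```

Applying this cutoff per entity is what prevents the model from "seeing the future" relative to each prediction date.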

In some cases, you’ll want to make recurring predictions. Say you want to predict a user’s likelihood to churn on a monthly basis, regardless of their installation date. In your predictive flow for this model, you would define the frequency as “monthly”.

What are the differences between “Fastest” and “Production grade” training modes?

When sending a model to train, you can choose between two training modes:

  • Fastest - this setting is great for iterating on your queries and training set, seeing how different data elements affect accuracy, and checking whether you are on the right track in defining what you are trying to predict. Training will run faster but won’t produce the best possible accuracy.

  • Production grade - use this when you are ready to train a production-grade version of your model, or if you want more granular control over the training process. Training will be slower but will produce the best possible accuracy.

We recommend starting with the “Fastest” training mode; if the results make sense, you can duplicate your predictive flow and retrain it in “Production grade” mode.

The main differences between the two training modes are as follows:

               | Fastest                                  | Production grade
Training time  | Between 15-30 minutes (up to 60 minutes for large data sets). | Several hours.
Model accuracy | Good. Faster training time at the expense of somewhat reduced accuracy. | Excellent. Maximum accuracy using our most advanced feature engineering and training algorithms. Takes longer.
Data size      | Limited to tens of millions of rows.     | No limitations.
Configuration  | Predetermined.                           | Adjustable.
