What is a predictive question?

What is an entity?

What is a marker?

What is a predictive question?

When creating a model in Pecan, you’ll need to define your “predictive question”. This is a statement that communicates the prediction you’re trying to make. Naturally, it will vary depending on the business problem you’re trying to solve.

One generalized example of a predictive question is: “Given a particular customer at a particular point in time, what is the likelihood that they'll perform a certain activity in the future?”

A question that isn’t specific enough will create unnecessary noise in the model and result in poor predictive performance. This is where a data analyst can add immense value, using their knowledge and expertise to formulate a question that helps generate accurate predictions.

Your predictive question will have four key components:

  • Who exhibits the behavior you want to predict? (These are your entities)

  • When do you want to make a prediction?

    • Predictions can be made on a recurring or a one-time basis.

  • What activity or behavior do you want to predict?

  • How far into the future do you want to predict?
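To make the four components concrete, here is a minimal sketch that stores them as fields and assembles them into a question. The class name `PredictiveQuestion` and its field names are hypothetical, chosen for illustration only; they are not part of Pecan.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PredictiveQuestion:
    """Hypothetical container for the four components (not a Pecan API)."""
    entity: str        # who exhibits the behavior
    marker: str        # when the prediction is made
    target: str        # what activity or behavior is predicted
    horizon_days: int  # how far into the future the prediction looks

    def as_sentence(self) -> str:
        # Assemble the components into a binary-model style question.
        return (
            f"{self.marker}, how likely is a {self.entity} "
            f"to {self.target} within the next {self.horizon_days} days?"
        )


question = PredictiveQuestion(
    entity="Tier-1 customer",
    marker="7 days after registering",
    target="churn",
    horizon_days=60,
)
print(question.as_sentence())
```

If any of the four fields is hard to fill in, that is usually a sign the question is not yet specific enough.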

Here are a few examples of a good predictive question:

  • Examples for a binary model:

    • “7 days after registering, how likely is a Tier-1 customer to churn within the next 60 days?”

    • “Predicting on a weekly basis, what's the likelihood of a customer who made a purchase within the last 30 days to upgrade their plan within the next 60 days?”

  • Examples for a regression model:

    • “14 days after app installation, how much revenue is expected to be generated by a new user within the next 365 days?”

    • “14 days after subscribing, which customers are predicted to be the highest-value customers over the next 90 days?”

As you can imagine, each predictive question will be impacted differently by whatever historical data is fed into the model.

Once you’ve defined your predictive question, it will serve as the basis for your model in the Blueprint Editor. Pecan will translate your question into a set of SQL queries, which will be used to transform your historical dataset into an AI-ready dataset your model can train itself on.

What is an entity?

When you’re creating and using models in Pecan, an entity is the basis of each prediction.

Quite simply, an entity is a person, at a particular point in time, that you are either learning from (when training a model) or making predictions for (when testing a model or making new predictions).

As such, an entity is made up of two elements: the customer identifier (e.g. customer_id) and the marker date. In the table below, which is a small example of an Entity Table, each row represents an entity.

| customer_id | marker     |
|-------------|------------|
| 67380       | 2022-02-30 |
| 67380       | 2022-03-30 |
| 67380       | 2022-04-30 |
| 215013      | 2022-02-30 |
| 215013      | 2022-03-30 |
| 215013      | 2022-04-30 |

In your model’s blueprint, these columns will be joined and defined as a single property called “entity ID”. The resulting entity_id column will then serve as the join column between the tables that comprise your AI-ready dataset (and are fed into your model).

As you can see above, if a series of predictions will be made for each customer, each customer will have multiple entity IDs.
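The idea of one customer having several entity IDs can be sketched in a few lines. This is an illustration only: the helper `make_entity_id` and the `<customer_id>_<marker>` format are assumptions, and the format Pecan actually produces may differ.

```python
from datetime import date


def make_entity_id(customer_id: int, marker: date) -> str:
    # Assumed format "<customer_id>_<ISO date>"; Pecan's real format may differ.
    return f"{customer_id}_{marker.isoformat()}"


customer_ids = [67380, 215013]
markers = [date(2022, 3, 30), date(2022, 4, 30)]

# One row per (customer, marker) pair: each row is one entity.
entity_table = [
    {"customer_id": c, "marker": m, "entity_id": make_entity_id(c, m)}
    for c in customer_ids
    for m in markers
]

for row in entity_table:
    print(row["entity_id"])
```

Two customers with two marker dates each yield four entities, and every entity_id is unique, which is what makes the column usable as a join key.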

What is a marker?

When training and using predictive models in Pecan, you’ll be adding a “marker” or “marker date” column to your datasets.

This value marks the moment of prediction – it is the precise date (or timestamp) at which you will make a prediction for a certain customer. When placed alongside a customer identifier (e.g. customer_id), it creates an instance of that person at a particular point in time.

The marker dates you add to your datasets will depend on your business needs and predictive question.

Let’s say you want to make an upsell prediction for customers on the 14th day after onboarding. In such a case, the marker date will be defined as “onboarding + 14 days” for any given customer.

Or, let’s say you want to make a monthly churn prediction for active customers. In such a case, you can assign a marker frequency – which might be the second Friday of each month.
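Both kinds of marker can be computed with ordinary date arithmetic. The sketch below is one possible way to do it; the function names `marker_after_onboarding` and `second_friday` are hypothetical and do not correspond to anything in Pecan.

```python
from datetime import date, timedelta


def marker_after_onboarding(onboarding: date, offset_days: int = 14) -> date:
    # Event-based marker: a fixed number of days after a customer event.
    return onboarding + timedelta(days=offset_days)


def second_friday(year: int, month: int) -> date:
    # Recurring marker: the second Friday of the given month.
    first = date(year, month, 1)
    # weekday(): Monday is 0, Friday is 4.
    days_to_first_friday = (4 - first.weekday()) % 7
    return first + timedelta(days=days_to_first_friday + 7)


print(marker_after_onboarding(date(2022, 3, 1)))          # 2022-03-15
print([second_friday(2022, month) for month in (3, 4, 5)])
```

The first function produces one marker per customer, while the second produces one marker per month that is shared by every active customer.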

Here is an example of how a marker column would look within a training dataset (albeit a simple one).

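| customer_id | marker     | category    | state | transaction | made a purchase within 30 days of marker |
|-------------|------------|-------------|-------|-------------|------------------------------------------|
| 67380       | 2022-02-30 | Garden      | TX    | $11.54      | 1                                        |
| 67380       | 2022-03-30 | Shoes       | TX    | $68.00      | 1                                        |
| 67380       | 2022-04-30 | Industrial  | TX    | $55.13      | 0                                        |
| 67380       | 2022-05-30 | Garden      | TX    | $2.99       | 0                                        |
| 215013      | 2022-02-30 | Electronics | CA    | $25.79      | 1                                        |
| 215013      | 2022-03-30 | Electronics | CA    | $143.50     | 0                                        |
| 215013      | 2022-04-30 | Toys        | CA    | $17.00      | 0                                        |
| 215013      | 2022-05-30 | Sports      | CA    | $91.57      | 0                                        |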

Since a marker defines the time of prediction, it’s also the point at which you sample a customer’s activity. It marks the final date at which you may use customer data for model-training purposes.

Let’s say you’re creating a model for churn prediction, and you define the marker for each entity as “installation + 30 days”. In this case, the model will only train itself on customer activity that occurred up until the marker date (and as far back as defined by the model’s parameters).

(If data from beyond the marker date were used to train your model, this would result in data leakage.)
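The cutoff can be illustrated with a small sketch. It assumes a hypothetical list of purchase events and treats both window boundaries as inclusive; the function names, the 14-day lookback, and the sample amounts are made up for the example and are not Pecan settings.

```python
from datetime import date, timedelta


def training_window(events, marker, lookback_days):
    """Keep events from the lookback start up to and including the marker."""
    start = marker - timedelta(days=lookback_days)
    return [e for e in events if start <= e["date"] <= marker]


def purchased_within(events, marker, horizon_days=30):
    """Label: 1 if any purchase falls after the marker, within the horizon."""
    end = marker + timedelta(days=horizon_days)
    return int(any(marker < e["date"] <= end for e in events))


# Hypothetical purchase history for a single customer.
purchases = [
    {"date": date(2022, 3, 1), "amount": 20.00},
    {"date": date(2022, 3, 25), "amount": 35.50},
    {"date": date(2022, 4, 10), "amount": 12.25},
]
marker = date(2022, 3, 30)

features = training_window(purchases, marker, lookback_days=14)
label = purchased_within(purchases, marker)

print([e["date"].isoformat() for e in features])  # ['2022-03-25']
print(label)                                      # 1
```

The purchase on 2022-04-10 only ever contributes to the label. If it were allowed into `features`, the model would be learning from information that is not available at the moment of prediction.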
