Importing an entire table to Pecan (a.k.a. “Full table import”) requires more time and computing resources than importing just a portion of the same table (a.k.a. “Import new rows”). This can make for a dramatic difference when working with extremely large tables, especially on a recurring basis.

Importing an entire table can be avoided through the use of a reference column. A reference column, which typically contains date or timestamp values, makes it possible to recognize new vs. pre-existing rows. This way, whenever you re-import data for a particular table, Pecan can instantly recognize which rows need to be imported – and which can be skipped.

When do you not need a reference column?

You should not add a reference column to tables that are based on dimensional data – that is, where there is a single row per customer, and thus a one-to-one correspondence between the customer and each field.

Such is the case in the sample table below:

Customer ID	Mobile plan	Pricing	Member since
10121	Basic Phone + Internet	$25	12/12/2019
10235	Unlimited Phone + Internet + Roaming	$39	6/9/2017
10346	Data only	$17	4/25/2022

Therefore, any time you want to generate a new prediction, you will want to take a new snapshot of the entire table.

When do you need a reference column?

A reference column is useful when working with transactional tables. These tables contain an ever-growing record of transactions, meaning you will have multiple rows of data for certain customers, and thus a many-to-one relationship between each customer and other fields.

Such is the case in below sample table, where a separate row exists for each customer transaction.

Customer ID	Transaction date	Purchase	Price
10121	5/1/2022 12:05	2-in-1 laptop	699.98
10121	5/1/2022 12:05	Cable accessories	47.55
10121	5/1/2022 12:05	Phone charger	12.95
10235	6/6/2022 19:21	Home stereo system	121.49
10235	6/6/2022 19:21	Warranty plan	49.99
10235	6/7/2022 10:33	Cable accessories	27.24

Since you’ll want to make predictions for an ever-growing data set, this table will need to be re-imported on a regular basis. “Transaction date” would serve as the reference column – enabling Pecan to recognize older records and thus skip re-importing them each time.

Note: you do not need a reference column when importing data from Parquet, Delta or CSV files, since they are automatically able to detect an identifier column. For example, for Parquet files, the first identified date type partition (“date=”) is used to calculate the time frame. (Keep in mind that the date type partition must be defined at the source. There are no limitations on using other partitions in tandem with the date type partition. You can read more about partition data types on Spark's documentation.)

Why use a reference column?

When new transactional data needs to be fed into a model, this often amounts to many thousands or millions of rows – many of which have already been imported. By including a reference column, you will enable Pecan to recognize pre-existing rows and skip importing them. This achieves two things:

Dramatically speeds up the importing process
Helps you avoid hitting query limits from your data-service provider (and potentially incurring additional data expenses)

How to import data from your connection

Data partitioning in Pecan

Data importing configurations and methods

Getting started: What table you need for predictive success

Enhanced Table Previews

What is a reference column, and when do you need one?

When do you not need a reference column?

When do you need a reference column?

Why use a reference column?