All Collections
Connecting Your Data
What is a reference column, and when do you need one?
What is a reference column, and when do you need one?

Pecan optimizes table imports using reference columns to identify new rows, saving you time and resources.

Ori Sagi avatar
Written by Ori Sagi
Updated over a week ago

Importing an entire table to Pecan (a.k.a. “Full table import”) requires more time and computing resources than importing just a portion of the same table (a.k.a. “Import new rows”). This can make for a dramatic difference when working with extremely large tables, especially on a recurring basis.

Importing an entire table can be avoided through the use of a reference column. A reference column, which typically contains date or timestamp values, makes it possible to recognize new vs. pre-existing rows. This way, whenever you re-import data for a particular table, Pecan can instantly recognize which rows need to be imported – and which can be skipped.

When do you not need a reference column?

You should not add a reference column to tables that are based on dimensional data – that is, where there is a single row per customer, and thus a one-to-one correspondence between the customer and each field.

Such is the case in the sample table below:

Customer ID

Mobile plan

Pricing

Member since

10121

Basic Phone + Internet

$25

12/12/2019

10235

Unlimited Phone + Internet + Roaming

$39

6/9/2017

10346

Data only

$17

4/25/2022

Therefore, any time you want to generate a new prediction, you will want to take a new snapshot of the entire table.

When do you need a reference column?

A reference column is useful when working with transactional tables. These tables contain an ever-growing record of transactions, meaning you will have multiple rows of data for certain customers, and thus a many-to-one relationship between each customer and other fields.

Such is the case in below sample table, where a separate row exists for each customer transaction.

Customer ID

Transaction date

Purchase

Price

10121

5/1/2022 12:05

2-in-1 laptop

699.98

10121

5/1/2022 12:05

Cable accessories

47.55

10121

5/1/2022 12:05

Phone charger

12.95

10235

6/6/2022 19:21

Home stereo system

121.49

10235

6/6/2022 19:21

Warranty plan

49.99

10235

6/7/2022 10:33

Cable accessories

27.24

Since you’ll want to make predictions for an ever-growing data set, this table will need to be re-imported on a regular basis. “Transaction date” would serve as the reference column – enabling Pecan to recognize older records and thus skip re-importing them each time.

Note: you do not need a reference column when importing data from Parquet, Delta or CSV files, since they are automatically able to detect an identifier column. For example, for Parquet files, the first identified date type partition (“date=”) is used to calculate the time frame. (Keep in mind that the date type partition must be defined at the source. There are no limitations on using other partitions in tandem with the date type partition. You can read more about partition data types on Spark's documentation.)

Why use a reference column?

When new transactional data needs to be fed into a model, this often amounts to many thousands or millions of rows – many of which have already been imported. By including a reference column, you will enable Pecan to recognize pre-existing rows and skip importing them. This achieves two things:

  • Dramatically speeds up the importing process

  • Helps you avoid hitting query limits from your data-service provider (and potentially incurring additional data expenses)

Did this answer your question?