Importing an entire table to Pecan (a.k.a. “Full table import”) requires more time and computing resources than importing just a portion of the same table (a.k.a. “Import new rows”). This can make for a dramatic difference when working with extremely large tables, especially on a recurring basis.
Importing an entire table can be avoided through the use of a reference column. A reference column, which typically contains date or timestamp values, makes it possible to recognize new vs. pre-existing rows. This way, whenever you re-import data for a particular table, Pecan can instantly recognize which rows need to be imported – and which can be skipped.
When do you not need a reference column?
You should not add a reference column to tables that are based on dimensional data – that is, where there is a single row per customer, and thus a one-to-one correspondence between the customer and each field.
Such is the case in the sample table below:
Customer ID | Mobile plan | Pricing | Member since |
10121 | Basic Phone + Internet | $25 | 12/12/2019 |
10235 | Unlimited Phone + Internet + Roaming | $39 | 6/9/2017 |
10346 | Data only | $17 | 4/25/2022 |
Therefore, any time you want to generate a new prediction, you will want to take a new snapshot of the entire table.
When do you need a reference column?
A reference column is useful when working with transactional tables. These tables contain an ever-growing record of transactions, meaning you will have multiple rows of data for certain customers, and thus a many-to-one relationship between each customer and other fields.
Such is the case in below sample table, where a separate row exists for each customer transaction.
Customer ID | Transaction date | Purchase | Price |
10121 | 5/1/2022 12:05 | 2-in-1 laptop | 699.98 |
10121 | 5/1/2022 12:05 | Cable accessories | 47.55 |
10121 | 5/1/2022 12:05 | Phone charger | 12.95 |
10235 | 6/6/2022 19:21 | Home stereo system | 121.49 |
10235 | 6/6/2022 19:21 | Warranty plan | 49.99 |
10235 | 6/7/2022 10:33 | Cable accessories | 27.24 |
Since you’ll want to make predictions for an ever-growing data set, this table will need to be re-imported on a regular basis. “Transaction date” would serve as the reference column – enabling Pecan to recognize older records and thus skip re-importing them each time.
Note: you do not need a reference column when importing data from Parquet, Delta or CSV files, since they are automatically able to detect an identifier column. For example, for Parquet files, the first identified date type partition (“date=”) is used to calculate the time frame. (Keep in mind that the date type partition must be defined at the source. There are no limitations on using other partitions in tandem with the date type partition. You can read more about partition data types on Spark's documentation.)
Why use a reference column?
When new transactional data needs to be fed into a model, this often amounts to many thousands or millions of rows – many of which have already been imported. By including a reference column, you will enable Pecan to recognize pre-existing rows and skip importing them. This achieves two things:
Dramatically speeds up the importing process
Helps you avoid hitting query limits from your data-service provider (and potentially incurring additional data expenses)