What is data partitioning?

Data partitioning is a strategic approach used in database management, which involves segmenting a large dataset into smaller, more manageable subsets for improved efficiency and accessibility.

Think of it as an organized library; instead of having all the books randomly scattered, they’re sorted into categories and sections, making it easier to locate a particular book you need to use in your research.

Similarly, data partitioning helps categorize and store large volumes of data, enhancing query performance, reducing system load, and facilitating more effective data management. Partitioning the data you import to Pecan can help your queries run faster, and your models generate predictions quicker.

As a rule of thumb, consider using data partitioning in Pecan when you have a table that contains events (transactions) with more than 20M rows.

Why use partitions?

Using partitions may reduce data sync time
Model training time and prediction generation time may be reduced

To maximize these benefits, it is important to select the right partitioning strategy.

Pecan’s partition strategy

Pecan determines a partitioning strategy for each table separately.
When importing Parquet files, Pecan decides how to divide the data based on the way Parquet files are partitioned on the customer side - a process called partition discovery.
Pecan allows you to partition your data stored on Pecan's side regardless of your organization’s partitioning strategy.

Partitioning your data independently

You can partition your data in Pecan regardless of whether your data is partitioned in your organization and what is your partitioning strategy.

Pecan allows you to partition your data horizontally, meaning - divide your table into subsets based on the values of a specific column.

Pecan supports independent partitioning by columns of type Timestamp, String, and Date.

We’ll use the following table as an example. The table holds three columns, each of a type Pecan supports for partitioning.

Note: a real table might hold additional columns with different types. As long as you have a column of one of these data types, you will be able to use them for partitioning.

transaction_time [Timestamp]	username [String]	install_date [Date]
2022-10-06 16:14:26	cdaintrey0	2022-28-05
2022-10-06 5:19:43	gpickett1	2022-28-05
2022-07-08 4:42:33	gpickett1	2022-28-05
2022-07-08 11:40:26	nprobbin3	2022-17-06
2022-07-08 8:29:48	sguice4	2022-17-06
2022-07-08 3:07:18	mgotthard7	2022-19-07
2022-22-07 11:20:13	mgotthard7	2022-19-07
2022-24-10 04:35:51	mgotthard7	2022-19-07
2022-24-10 02:02:54	vsundin8	2022-22-08
2022-22-10 02:32:49	ltreagust9	2022-22-08

Partitioning by timestamp data type

Columns of type timestamp can be used to split your data into partitions (subset), where each partition will hold one day of data.
For example, if you select transaction_time as your partitioning column, the data will be organized as follows:

partition	transaction_time [Timestamp]	username [String]	install_date [Date]
2022-17-06	2022-07-08 3:07:18	mgotthard7	2022-19-07
2022-07-08 4:42:33	gpickett1	2022-28-05
2022-19-07	2022-07-08 8:29:48	sguice4	2022-17-06
2022-07-08 11:40:26	nprobbin3	2022-17-06
2022-10-06 5:19:43	gpickett1	2022-28-05
2022-22-08	2022-10-06 16:14:26	cdaintrey0	2022-28-05
2022-22-07 11:20:13	mgotthard7	2022-19-07
2022-28-05	2022-22-10 02:32:49	ltreagust9	2022-22-08
2022-24-10 02:02:54	vsundin8	2022-22-08
2022-24-10 04:35:51	mgotthard7	2022-19-07

Partitioning by date data type

Columns of type date can be used to split your data into partitions (subset), where each partition will hold one day of data.
For example, if you select install_date as your partitioning column, the data will be organized as follows:

partition	transaction_time [Timestamp]	username [String]	install_date [Date]
2022-17-06	2022-07-08 8:29:48	sguice4	2022-17-06
2022-07-08 11:40:26	nprobbin3	2022-17-06
2022-19-07	2022-07-08 3:07:18	mgotthard7	2022-19-07
2022-22-07 11:20:13	mgotthard7	2022-19-07
2022-24-10 04:35:51	mgotthard7	2022-19-07
2022-22-08	2022-22-10 02:32:49	ltreagust9	2022-22-08
2022-24-10 02:02:54	vsundin8	2022-22-08
2022-28-05	2022-07-08 4:42:33	gpickett1	2022-28-05
2022-10-06 5:19:43	gpickett1	2022-28-05
2022-10-06 16:14:26	cdaintrey0	2022-28-05

Partitioning by string data type

Columns of type string can be used to split your data into partitions (subset). In this case, every unique value in the column would create its own partition.
For example, if you select username as your partitioning column, the data will be organized as follows:

partition	transaction_time [Timestamp]	username [String]	install_date [Date]
cdaintrey0	2022-10-06 16:14:26	cdaintrey0	2022-28-05
gpickett1	2022-07-08 4:42:33	gpickett1	2022-28-05
2022-10-06 5:19:43	gpickett1	2022-28-05
ltreagust9	2022-22-10 02:32:49	ltreagust9	2022-22-08
mgotthard7	2022-07-08 3:07:18	mgotthard7	2022-19-07
2022-22-07 11:20:13	mgotthard7	2022-19-07
2022-24-10 04:35:51	mgotthard7	2022-19-07
nprobbin3	2022-07-08 11:40:26	nprobbin3	2022-17-06
sguice4	2022-07-08 8:29:48	sguice4	2022-17-06
vsundin8	2022-24-10 02:02:54	vsundin8	2022-22-08

Partitioning by 2 data types

In cases like an especially large table, you might want to use more than one level of partitioning.
Pecan allows partitioning by up to two data types.
Note the order of the columns is important. To make sure you maximize the benefits of this feature, Pecan recommends the first column to be the one of the two that will result in a smaller amount of partitions.

Following the above examples, between partition_time (4 partitions) and username (7 partitions), it would be best to select partition_time as the first partition column.

How to define an independent partitioning strategy

You can decide to partition your data once you import it for the first time or apply partitioning to data you've already imported - the process is the same.

First, go to the connections tab, and select the connection with the table you'd like to partition.

For each table you’d like to partition:

Click on the three dots next to the table
Select “settings”
Select the “Advanced” tab
Under “select partitioning columns”, select the columns by which you’d like to partition.

Remember: when using two columns, the order has an impact.

How to define a custom partition strategy

If you’d like to customize the way your data is partitioned, open a support ticket with details on how you wish your data to be organized, and the Pecan team will handle it for you.

How to add attributes to your model

Configuring Train/Test Data Splits in Pecan

Data partitioning in Pecan