All Collections
Connecting Your Data
Data partitioning in Pecan
Data partitioning in Pecan

Efficiently manage large datasets in Pecan with strategic data partitioning, enhancing query performance and prediction speed.

Ori Sagi avatar
Written by Ori Sagi
Updated over a week ago

What is data partitioning?

Data partitioning is a strategic approach used in database management, which involves segmenting a large dataset into smaller, more manageable subsets for improved efficiency and accessibility.

Think of it as an organized library; instead of having all the books randomly scattered, they’re sorted into categories and sections, making it easier to locate a particular book you need to use in your research.

Similarly, data partitioning helps categorize and store large volumes of data, enhancing query performance, reducing system load, and facilitating more effective data management. Partitioning the data you import to Pecan can help your queries run faster, and your models generate predictions quicker.

Why use partitions?

  • Using partitions may reduce data sync time

  • Model training time and prediction generation time may be reduced

To maximize these benefits, it is important to select the right partitioning strategy.

Pecan’s partition strategy

  • Pecan determines a partitioning strategy for each table separately.

  • When importing Parquet files, Pecan decides how to divide the data based on the way Parquet files are partitioned on the customer side - a process called partition discovery.

  • Pecan allows you to partition your data stored on Pecan's side regardless of your organization’s partitioning strategy.

Partitioning your data independently

You can partition your data in Pecan regardless of whether your data is partitioned in your organization and what is your partitioning strategy.

Pecan allows you to partition your data horizontally, meaning - divide your table into subsets based on the values of a specific column.

Pecan supports independent partitioning by columns of type Timestamp, String, and Date.

We’ll use the following table as an example. The table holds three columns, each of a type Pecan supports for partitioning.

Note: a real table might hold additional columns with different types. As long as you have a column of one of these data types, you will be able to use them for partitioning.

transaction_time [Timestamp]

username [String]

install_date [Date]

2022-10-06 16:14:26

cdaintrey0

2022-28-05

2022-10-06 5:19:43

gpickett1

2022-28-05

2022-07-08 4:42:33

gpickett1

2022-28-05

2022-07-08 11:40:26

nprobbin3

2022-17-06

2022-07-08 8:29:48

sguice4

2022-17-06

2022-07-08 3:07:18

mgotthard7

2022-19-07

2022-22-07 11:20:13

mgotthard7

2022-19-07

2022-24-10 04:35:51

mgotthard7

2022-19-07

2022-24-10 02:02:54

vsundin8

2022-22-08

2022-22-10 02:32:49

ltreagust9

2022-22-08

Partitioning by timestamp data type

Columns of type timestamp can be used to split your data into partitions (subset), where each partition will hold one day of data.
For example, if you select transaction_time as your partitioning column, the data will be organized as follows:

partition

transaction_time [Timestamp]

username [String]

install_date [Date]

2022-17-06

2022-07-08 3:07:18

mgotthard7

2022-19-07

2022-07-08 4:42:33

gpickett1

2022-28-05

2022-19-07

2022-07-08 8:29:48

sguice4

2022-17-06

2022-07-08 11:40:26

nprobbin3

2022-17-06

2022-10-06 5:19:43

gpickett1

2022-28-05

2022-22-08

2022-10-06 16:14:26

cdaintrey0

2022-28-05

2022-22-07 11:20:13

mgotthard7

2022-19-07

2022-28-05

2022-22-10 02:32:49

ltreagust9

2022-22-08

2022-24-10 02:02:54

vsundin8

2022-22-08

2022-24-10 04:35:51

mgotthard7

2022-19-07

Partitioning by date data type

Columns of type date can be used to split your data into partitions (subset), where each partition will hold one day of data.
For example, if you select install_date as your partitioning column, the data will be organized as follows:

partition

transaction_time [Timestamp]

username [String]

install_date [Date]

2022-17-06

2022-07-08 8:29:48

sguice4

2022-17-06

2022-07-08 11:40:26

nprobbin3

2022-17-06

2022-19-07

2022-07-08 3:07:18

mgotthard7

2022-19-07

2022-22-07 11:20:13

mgotthard7

2022-19-07

2022-24-10 04:35:51

mgotthard7

2022-19-07

2022-22-08

2022-22-10 02:32:49

ltreagust9

2022-22-08

2022-24-10 02:02:54

vsundin8

2022-22-08

2022-28-05

2022-07-08 4:42:33

gpickett1

2022-28-05

2022-10-06 5:19:43

gpickett1

2022-28-05

2022-10-06 16:14:26

cdaintrey0

2022-28-05

Partitioning by string data type

Columns of type string can be used to split your data into partitions (subset). In this case, every unique value in the column would create its own partition.
For example, if you select username as your partitioning column, the data will be organized as follows:

partition

transaction_time [Timestamp]

username [String]

install_date [Date]

cdaintrey0

2022-10-06 16:14:26

cdaintrey0

2022-28-05

gpickett1

2022-07-08 4:42:33

gpickett1

2022-28-05

2022-10-06 5:19:43

gpickett1

2022-28-05

ltreagust9

2022-22-10 02:32:49

ltreagust9

2022-22-08

mgotthard7

2022-07-08 3:07:18

mgotthard7

2022-19-07

2022-22-07 11:20:13

mgotthard7

2022-19-07

2022-24-10 04:35:51

mgotthard7

2022-19-07

nprobbin3

2022-07-08 11:40:26

nprobbin3

2022-17-06

sguice4

2022-07-08 8:29:48

sguice4

2022-17-06

vsundin8

2022-24-10 02:02:54

vsundin8

2022-22-08

Partitioning by 2 data types

In cases like an especially large table, you might want to use more than one level of partitioning.
Pecan allows partitioning by up to two data types.
Note the order of the columns is important. To make sure you maximize the benefits of this feature, Pecan recommends the first column to be the one of the two that will result in a smaller amount of partitions.

Following the above examples, between partition_time (4 partitions) and username (7 partitions), it would be best to select partition_time as the first partition column.

How to define an independent partitioning strategy

You can decide to partition your data once you import it for the first time or apply partitioning to data you've already imported - the process is the same.

First, go to the connections tab, and select the connection with the table you'd like to partition.

For each table you’d like to partition:

  1. Click on the three dots next to the table

  2. Select “settings”

  3. Select the “Advanced” tab

  4. Under “select partitioning columns”, select the columns by which you’d like to partition.

Remember: when using two columns, the order has an impact.

How to define a custom partition strategy

If you’d like to customize the way your data is partitioned, open a support ticket with details on how you wish your data to be organized, and the Pecan team will handle it for you.

Did this answer your question?