Apache Parquet is an open source, column-oriented data file format designed for efficient data storage and retrieval. It provides data compression and encoding schemes to handle complex data in bulk.

Pecan enables you to connect to Parquet files that are hosted in the Amazon S3 cloud storage service. Because Parquet tables are compressed and organized by column rather than by row, they save storage space, speed up analytics queries, and eliminate manual data-type review.

Below are the steps required to enable Pecan to access your Parquet files. If you need help with performing these steps or obtaining the correct details and credentials, be sure to consult with your internal IT or DevOps team.

Prerequisite steps

  1. Before adding a Parquet connection, split your tables into smaller files of up to 50 MB each. Export each table’s files to a folder named after that table in your S3 root directory.

    Here’s an example of how your folder structure might look…

    Your root directory may be the bucket itself or any subfolder within it. Let’s say your bucket is called “my-bucket” and, within it, you create a root folder that we’ll call “data-for-pecan” for demonstration purposes.

    This would produce the root directory s3://my-bucket/data-for-pecan/.

    Then, each table should be given its own folder. Let’s say, for example, you have four tables: “orders”, “payments”, “customers” and “products”.

    Creating those subfolders would lead to the following four S3 directories:
    - s3://my-bucket/data-for-pecan/orders
    - s3://my-bucket/data-for-pecan/payments
    - s3://my-bucket/data-for-pecan/customers
    - s3://my-bucket/data-for-pecan/products

    In the end, each folder should contain the relevant Parquet files (or sub-folders containing the relevant Parquet files).

    Note: Pecan does not support folder names, column names or filenames that have spaces or special characters in them.

  2. Next, you’ll need to create an IAM user with Read and Write permissions in your AWS account so Pecan can see where your bucket is sitting, read files from it, and write to it. (For security reasons, creating an IAM user is preferable to expanding permissions for the AWS account root user.) Here’s how to do it:

    1. Log in to AWS Identity and Access Management (IAM) and create an IAM user, which will generate a new IAM access key and secret key that you will provide to Pecan when you configure the connection (see the section below). To learn more, see Creating an IAM user in your AWS account.

      Important: this is your only opportunity to view or download your secret access key, so make sure to save or download it to a safe and secure place.

    2. Attach the relevant IAM policy to the user so Pecan has “programmatic access” to make API calls to your AWS bucket. To do so, copy the JSON text below and paste it into your policy console in AWS.

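      Note that you will need to replace the <BUCKET_NAME> placeholder in Lines 11, 22 and 31 with your actual bucket name, and the <READ_ONLY_FOLDER> and <READ_WRITE_FOLDER> placeholders in Lines 11 and 22 with the folders Pecan should read from and write to. To learn more, read about Policies and permissions in IAM.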

      {
          "Version": "2012-10-17",
          "Statement": [
              {
                  "Sid": "PecanKeyReadPermissions",
                  "Effect": "Allow",
                  "Action": [
                      "s3:Get*",
                      "s3:List*"
                  ],
                  "Resource": "arn:aws:s3:::<BUCKET_NAME>/<READ_ONLY_FOLDER>/*"
              },
              {
                  "Sid": "PecanKeyWritePermissions",
                  "Effect": "Allow",
                  "Action": [
                      "s3:Get*",
                      "s3:List*",
                      "s3:Put*",
                      "s3:Delete*"
                  ],
                  "Resource": "arn:aws:s3:::<BUCKET_NAME>/<READ_WRITE_FOLDER>/*"
              },
              {
                  "Sid": "PecanBucketPermissions",
                  "Effect": "Allow",
                  "Action": [
                      "s3:GetBucketLocation",
                      "s3:ListBucket"
                  ],
                  "Resource": "arn:aws:s3:::<BUCKET_NAME>"
              }
          ]
      }
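
As a minimal sketch, the placeholders in the policy above could be filled in with a short Python script rather than by hand. The function name build_pecan_policy and the folder name “pecan-output” are assumptions made for this example only; the statements themselves are copied from the JSON above.

```python
import json


def build_pecan_policy(bucket, read_only_folder, read_write_folder):
    """Return the policy document above with its three placeholders filled in."""
    bucket_arn = f"arn:aws:s3:::{bucket}"
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Sid": "PecanKeyReadPermissions",
                "Effect": "Allow",
                "Action": ["s3:Get*", "s3:List*"],
                "Resource": f"{bucket_arn}/{read_only_folder}/*",
            },
            {
                "Sid": "PecanKeyWritePermissions",
                "Effect": "Allow",
                "Action": ["s3:Get*", "s3:List*", "s3:Put*", "s3:Delete*"],
                "Resource": f"{bucket_arn}/{read_write_folder}/*",
            },
            {
                "Sid": "PecanBucketPermissions",
                "Effect": "Allow",
                "Action": ["s3:GetBucketLocation", "s3:ListBucket"],
                "Resource": bucket_arn,
            },
        ],
    }


# "pecan-output" is a made-up folder name used only for this example.
policy = build_pecan_policy("my-bucket", "data-for-pecan", "pecan-output")
policy_json = json.dumps(policy, indent=4)
print(policy_json)
```

The printed text can then be pasted into the policy console in place of the template.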

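The folder layout from Step 1 of the prerequisites can also be planned in code before anything is uploaded. The sketch below is illustrative only: the “part-00000.parquet” file naming, the helper names, the reading of 50 MB as 50 × 1024 × 1024 bytes, and the exact set of characters treated as safe are all assumptions, not Pecan requirements.

```python
import math
import re

# "Up to 50MB each"; treating a megabyte as 1024 * 1024 bytes is an assumption.
MAX_FILE_BYTES = 50 * 1024 * 1024

# Assumption: "special characters" means anything other than letters, digits,
# underscores, hyphens and dots. Adjust this to Pecan's actual rules.
SAFE_NAME = re.compile(r"[A-Za-z0-9_.-]+")


def is_safe_name(name):
    """True if the name has no spaces or (assumed) special characters."""
    return SAFE_NAME.fullmatch(name) is not None


def files_needed(table_bytes, max_bytes=MAX_FILE_BYTES):
    """Rough number of files needed so that none exceeds max_bytes."""
    return max(1, math.ceil(table_bytes / max_bytes))


def table_file_uris(root, table, n_files):
    """Build one URI per file, all inside a folder named after the table."""
    if not is_safe_name(table):
        raise ValueError(f"unsupported table folder name: {table!r}")
    root = root.rstrip("/")
    # The part-NNNNN.parquet naming is hypothetical.
    return [f"{root}/{table}/part-{i:05d}.parquet" for i in range(n_files)]


# A table of roughly 120 MB would be split into three files.
uris = table_file_uris(
    "s3://my-bucket/data-for-pecan/", "orders", files_needed(120 * 1024 * 1024)
)
for uri in uris:
    print(uri)
```

Because Parquet compression varies by column, the file count is only an estimate; check the sizes of the files you actually export.
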
How to configure an S3 Parquet file connection

  1. Log in to Pecan, select the “Connections” tab, and click Add connection.

  2. Select “Parquet file” and complete the following fields:

    • Connection name – this is how you’ll identify the connection when creating and working with models on the platform. Names should be unique and reflect the data source and what’s stored in it. Valid characters include letters, numbers and underscores. Connection names can’t be changed once created. Example: “parquet_paid_downloads_fall_2021”

    • Connection type – Pecan supports both read and write connections to Amazon S3. Select "Read" if the connection will be used to import Parquet files to Pecan. Select "Write" if it will be used to export Pecan predictions into Parquet files.

    • AWS IAM access key – this is the IAM access key for your S3 bucket, which will have been generated in the first sub-step of prerequisite Step 2 above. (Example: “AKIAIOSFODNN7EXAMPLE”)

    • AWS IAM secret key – this is the IAM secret key for your S3 bucket, which should have been saved or downloaded at the time your IAM access key was created. (Example: “wJalrXUtnFEMI/K7MDENG/bPxRfiCYEXAMPLEKEY”) If you can’t find it, you’ll need to ask your administrator to create a new access key, because AWS does not show an existing secret key again. If you have lost your console password instead, see Managing passwords for IAM users.

    • S3 root directory – this is the root directory path of your files in S3. It should contain all of your table directories. For more details, see prerequisite Step 1 above. (Example: “https://my-bucket.s3.us-west-2.amazonaws.com”)

  3. Now, click Test connection to make sure everything is working correctly. Then click Create connection to complete the setup. (For more information, see Testing and creating a data connection.)
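
Before filling in the form, the field values can be given a quick local check. The following is a rough sketch based only on the rules stated above (letters, numbers and underscores in the connection name; a connection type of Read or Write). The function name is hypothetical, restricting letters to ASCII is an assumption, and the check does not contact AWS or Pecan, so it is no substitute for Test connection.

```python
import re

# Connection names: letters, numbers and underscores (ASCII-only is an assumption).
CONNECTION_NAME = re.compile(r"[A-Za-z0-9_]+")
CONNECTION_TYPES = ("Read", "Write")


def check_connection_fields(name, connection_type, access_key, secret_key, root_directory):
    """Return a list of problems found; an empty list means the basic checks passed."""
    problems = []
    if not CONNECTION_NAME.fullmatch(name):
        problems.append("connection name may only contain letters, numbers and underscores")
    if connection_type not in CONNECTION_TYPES:
        problems.append("connection type must be Read or Write")
    required = (
        ("access key", access_key),
        ("secret key", secret_key),
        ("S3 root directory", root_directory),
    )
    for label, value in required:
        # Stray whitespace from copy and paste is a common cause of failed logins.
        if not value or value != value.strip():
            problems.append(f"{label} is empty or has leading/trailing whitespace")
    return problems


# Values taken from the examples in this article.
problems = check_connection_fields(
    "parquet_paid_downloads_fall_2021",
    "Read",
    "AKIAIOSFODNN7EXAMPLE",
    "wJalrXUtnFEMI/K7MDENG/bPxRfiCYEXAMPLEKEY",
    "https://my-bucket.s3.us-west-2.amazonaws.com",
)
print(problems or "basic checks passed")
```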
