AWS S3 Parquet files
Written by Raziel Einhorn

Apache Parquet is an open-source, column-oriented data file format designed for efficient data storage and retrieval. It provides data compression and encoding schemes to handle complex data in bulk.

Pecan enables you to connect to Parquet files hosted in the Amazon S3 cloud storage service. Because Parquet files are compressed and organized by columns rather than rows, they save storage space, speed up analytics queries, and eliminate manual data-type review.

Below are the steps required to enable Pecan to access your Parquet files. If you need help with performing these steps or obtaining the correct details and credentials, be sure to consult with your internal IT or DevOps team.

Prerequisites

File structure

Before adding a Parquet connection, make sure your bucket is organized in a way that supports importing multiple files:

  • Your root directory may be the bucket itself or any subfolder within it – let’s say your bucket is called “my-bucket”. Within it, you may create a folder to serve as the root directory – for demonstrative purposes, let’s call it “data-for-pecan”.

  • Make sure each table is represented by its own folder; Pecan deduces the table name from the folder name.

  • Make sure your folder names, column names, and file names don’t include spaces or special characters.

All the files under a folder will be merged into a single table in Pecan. There are no requirements for file names beyond the restrictions above.

Here’s an example of what a valid folder structure might look like:

my-bucket
└── data-for-pecan
    ├── orders
    │   ├── orders1.parquet
    │   ├── orders2.parquet
    │   └── orders_old.parquet
    ├── customers
    │   ├── us_customers.parquet
    │   ├── eu_customers.parquet
    │   └── jp_customers.parquet
    └── payments
        ├── 2023-01-01.parquet
        └── payments_old_export.parquet

If you’re using partitions in your bucket, make sure they are organized in sub-folders as described in this article.
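
To illustrate, here’s a minimal Python sketch that writes tables into this layout. It assumes pandas with the pyarrow engine and s3fs installed; the bucket, folder, table, and column names are the hypothetical ones from the example above:

    import pandas as pd

    # Hypothetical sample data: one folder per table, any number of files per folder.
    orders = pd.DataFrame({"order_id": [1, 2], "amount": [9.99, 24.50]})
    payments = pd.DataFrame({"payment_id": [100, 101], "date": ["2023-01-01", "2023-01-02"]})

    # Each file lands inside its table's folder; Pecan merges all files in a folder.
    orders.to_parquet("s3://my-bucket/data-for-pecan/orders/orders1.parquet")

    # Partitioned tables follow the same rule: partitions become sub-folders
    # under the table folder (here, payments/date=2023-01-01/).
    payments.to_parquet("s3://my-bucket/data-for-pecan/payments/", partition_cols=["date"])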

IAM

Using an IAM role is a more secure alternative to authenticating with access keys; the steps below cover both options.

To use it, you’ll need to create an IAM user with read and write permissions in your AWS account, so Pecan can locate your bucket, read files from it, and write to it. Here’s how to do it:

  1. Log in to AWS Identity and Access Management (IAM) and create an IAM user, which will generate a new IAM access key and secret key that you will provide to Pecan when configuring the connection. To learn more, see Creating an IAM user in your AWS account.

    Important: this is your only opportunity to view or download your secret access key, so make sure to save or download it to a safe and secure place.

  2. Attach the relevant IAM policy to the user so Pecan has programmatic access to make API calls to your AWS bucket. To do so, copy and paste the JSON below into the policy editor in the AWS console.

    Note that you will need to replace the “BUCKET_NAME” placeholder in each of the three “Resource” entries with your actual bucket name, and the “READ_ONLY_FOLDER” and “READ_WRITE_FOLDER” placeholders with your folder names. To learn more, read about Policies and permissions in IAM.

    {
      "Version": "2012-10-17",
      "Statement": [
        {
          "Sid": "PecanKeyReadPermissions",
          "Effect": "Allow",
          "Action": [
            "s3:Get*",
            "s3:List*"
          ],
          "Resource": "arn:aws:s3:::<BUCKET_NAME>/<READ_ONLY_FOLDER>/*"
        },
        {
          "Sid": "PecanKeyWritePermissions",
          "Effect": "Allow",
          "Action": [
            "s3:Get*",
            "s3:List*",
            "s3:Put*",
            "s3:Delete*"
          ],
          "Resource": "arn:aws:s3:::<BUCKET_NAME>/<READ_WRITE_FOLDER>/*"
        },
        {
          "Sid": "PecanBucketPermissions",
          "Effect": "Allow",
          "Action": [
            "s3:GetBucketLocation",
            "s3:ListBucket"
          ],
          "Resource": "arn:aws:s3:::<BUCKET_NAME>"
        }
      ]
    }
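
    If you prefer to script this step, here’s a hedged boto3 sketch; the user name, policy name, and local file name are illustrative, not Pecan requirements:

    import boto3

    iam = boto3.client("iam")

    # Load the policy JSON shown above, saved locally with placeholders replaced.
    with open("pecan-policy.json") as f:
        policy_document = f.read()

    # Create the user and attach the policy inline.
    iam.create_user(UserName="pecan-connector")
    iam.put_user_policy(
        UserName="pecan-connector",
        PolicyName="pecan-s3-access",
        PolicyDocument=policy_document,
    )

    # Generate the access key pair to enter in Pecan.
    key = iam.create_access_key(UserName="pecan-connector")
    print(key["AccessKey"]["AccessKeyId"])      # provide this to Pecan
    print(key["AccessKey"]["SecretAccessKey"])  # shown only once; store it securely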

  3. Finally, you’ll need to define Pecan as a trusted entity. To do so, add the following snippet as a trust policy on the IAM role:

    {
      "Version": "2012-10-17",
      "Statement": [
        {
          "Sid": "",
          "Effect": "Allow",
          "Principal": {
            "AWS": "arn:aws:iam::685686065164:root"
          },
          "Action": "sts:AssumeRole"
        }
      ]
    }
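
    This step can be scripted as well. A minimal boto3 sketch, assuming the trust policy above is saved locally and using an illustrative role name:

    import boto3

    iam = boto3.client("iam")

    # Load the trust policy snippet shown above, saved locally.
    with open("pecan-trust-policy.json") as f:
        trust_policy = f.read()

    # Create the role Pecan will assume. When authenticating with a role
    # instead of keys, attach the S3 policy from Step 2 to this role.
    iam.create_role(
        RoleName="pecan-access-role",
        AssumeRolePolicyDocument=trust_policy,
    )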

How to configure an S3 Parquet file connection

  1. Log in to Pecan, select the “Connections” tab, and click Add connection.

  2. Select “Parquet file” and complete the following fields:

    • Connection name – this is how you’ll identify the connection when creating and working with models on the platform. Names should be unique and reflect the data source and what’s stored in it. Valid characters include letters, numbers, and underscores. Connection names can’t be changed once created. Example: “parquet_paid_downloads_fall_2021”

    • Connection type – Whether the connection is to be used to bring data into Pecan or to send predictions out of Pecan.

    • AWS IAM access key – this is the IAM access key for your S3 bucket, generated when you created the IAM user in Step 1 above. (Example: “AKIAIOSFODNN7EXAMPLE”)
      If you use IAM roles to authenticate with AWS, leave this field empty.

    • AWS IAM secret key – this is the IAM secret key for your S3 bucket, which should have been saved or downloaded at the time your IAM access key was created. (Example: “wJalrXUtnFEMI/K7MDENG/bPxRfiCYEXAMPLEKEY”). If you can’t find it, you’ll need to ask your administrator to reset it. To learn how, see Managing access keys for IAM users.
      If you use IAM roles to authenticate with AWS, leave this field empty.

    • AWS IAM role – the role you created for Pecan.
      If you use an access key and secret key to authenticate with AWS, leave this field empty.

    • S3 root directory – this is the root directory path of your files in S3. It should contain all of your table directories. For more details, see the File structure section above. (Example: “https://my-bucket.s3.us-west-2.amazonaws.com”)

    • Partition format – the format of the partitioned tables, in case it differs from the standard described in Apache Spark’s article on partition discovery.

  3. Now, click Test connection to make sure everything is working correctly. Then click Create connection to complete the setup. (For more information, see Testing and creating a data connection.)

Got an error? Here are some common issues

  • When using Pandas or certain ETL tools to save data to Parquet files, dates and timestamps might be saved in a way that is not compatible with Apache Spark (which Pecan uses in its infrastructure).
    To overcome this, make sure your timestamps are saved at millisecond precision or coarser. You can do this either by converting the columns before writing, or by coercing timestamps at write time (see the second sketch below). If you use Pandas, this snippet converts existing nanosecond-precision columns:

    # Downcast nanosecond timestamp columns to millisecond precision so Spark can read them.
    for col in pandas_df.columns:
        if str(pandas_df[col].dtype) == 'datetime64[ns]':
            pandas_df[col] = pandas_df[col].astype('datetime64[ms]')
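
    Alternatively, timestamps can be coerced at write time. A minimal sketch, assuming the pyarrow engine; the output path is a placeholder:

    pandas_df.to_parquet(
        "s3://my-bucket/data-for-pecan/orders/orders1.parquet",  # placeholder path
        engine="pyarrow",
        coerce_timestamps="ms",           # truncate timestamps to milliseconds
        allow_truncated_timestamps=True,  # don't raise on sub-millisecond values
    )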
