Delta Lake tables provide built-in safeguards to prevent data corruption, support schema evolution for changing data structures, and track data changes over time. When stored in AWS S3, they are compressed and organized by columns, which optimizes storage space, speeds up analytics queries, and eliminates the need for manual data type verification.
Pecan enables you to connect to Delta Lake tables hosted in Amazon S3.
Below are the steps required to enable Pecan to access your Delta Lake tables. If you need help with performing these steps or obtaining the correct details and credentials, be sure to consult with your internal IT or DevOps team.
Prerequisites
Generate Delta Lake Table
A Delta Lake table is a directory that contains Parquet data files and a transaction log (the _delta_log folder) that stores the table's metadata.
Here’s a Python code example on how to generate a Delta Lake table:
# Install the library first: pip install deltalake pandas
import pandas as pd
from deltalake import write_deltalake

# Delta Lake output directory (can be local or cloud, e.g., "s3://bucket/path")
delta_output_path = "s3://bucket/path"
# Pandas DataFrame to write (placeholder data; replace with your own)
df = pd.DataFrame({"id": [1, 2, 3]})

# Write to Delta Lake format
write_deltalake(
    table_or_uri=delta_output_path,
    data=df,
    mode="overwrite",
)
Find more ways to create Delta Lake tables in the Delta Lake documentation.
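Before uploading a table, it can help to confirm that a local output directory really has the Delta Lake layout: Parquet data files plus a _delta_log folder holding JSON commit files. The following is a minimal sketch using only the Python standard library. The function names looks_like_delta_table and list_data_files are hypothetical helpers written for this illustration, not part of the deltalake package, and the check only inspects file names rather than validating the table contents.

```python
from pathlib import Path
from typing import List


def looks_like_delta_table(table_dir: str) -> bool:
    """Return True if table_dir has a _delta_log folder with at least one JSON commit file."""
    log_dir = Path(table_dir) / "_delta_log"
    if not log_dir.is_dir():
        return False
    # Delta commits are stored as zero-padded JSON files, e.g. 00000000000000000000.json
    return any(log_dir.glob("*.json"))


def list_data_files(table_dir: str) -> List[str]:
    """List Parquet data files under table_dir, skipping anything inside _delta_log."""
    root = Path(table_dir)
    files = []
    for path in root.rglob("*.parquet"):
        relative = path.relative_to(root)
        # Checkpoint files inside _delta_log are also Parquet but are not table data
        if "_delta_log" in relative.parts:
            continue
        files.append(relative.as_posix())
    return sorted(files)
```

For example, calling looks_like_delta_table("./my_table") on a directory written by write_deltalake should return True, while a plain folder of Parquet files without a _delta_log folder returns False.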
IAM
IAM is a more secure alternative to using keys for authentication.
To use it, you’ll need to create an IAM user with read and write permissions in your AWS account, so Pecan can locate your bucket, read tables from it, and write to it. Here’s how to do it:
Log in to AWS Identity and Access Management (IAM) and create an IAM user, which will generate a new IAM access key and secret key that you will provide to Pecan in Step 3. To learn more, see Creating an IAM user in your AWS account.
Important: this is your only opportunity to view or download your secret access key, so make sure to save or download it to a safe and secure place.
Attach the relevant IAM policy to the user so Pecan has “programmatic access” to make API calls to your AWS bucket. To do so, copy and paste the JSON text below into your policy console in AWS.
Note that you will need to change the “BUCKET_NAME” placeholder in Lines 11, 22, and 31 to your actual bucket name, and the “READ_ONLY_FOLDER” and “READ_WRITE_FOLDER” placeholders in Lines 11 and 22 to your folder paths. To learn more, read about Policies and permissions in IAM.
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "PecanKeyReadPermissions",
      "Effect": "Allow",
      "Action": [
        "s3:Get*",
        "s3:List*"
      ],
      "Resource": "arn:aws:s3:::<BUCKET_NAME>/<READ_ONLY_FOLDER>/*"
    },
    {
      "Sid": "PecanKeyWritePermissions",
      "Effect": "Allow",
      "Action": [
        "s3:Get*",
        "s3:List*",
        "s3:Put*",
        "s3:Delete*"
      ],
      "Resource": "arn:aws:s3:::<BUCKET_NAME>/<READ_WRITE_FOLDER>/*"
    },
    {
      "Sid": "PecanBucketPermissions",
      "Effect": "Allow",
      "Action": [
        "s3:GetBucketLocation",
        "s3:ListBucket"
      ],
      "Resource": "arn:aws:s3:::<BUCKET_NAME>"
    }
  ]
}
Finally, you’ll need to define Pecan as a trusted entity. To do so, add this snippet as a trusted entity under the IAM role:
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "",
      "Effect": "Allow",
      "Principal": {
        "AWS": "arn:aws:iam::685686065164:root"
      },
      "Action": "sts:AssumeRole"
    }
  ]
}
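Filling in the placeholders of the permissions policy by hand is easy to get wrong, so the substitution can also be scripted. The sketch below rebuilds the same three statements shown above and inserts the bucket and folder names. The function name build_pecan_s3_policy and the example values "my-bucket", "input", and "output" are assumptions made for this illustration. The script only produces JSON text to paste into AWS and does not call any AWS API.

```python
import json


def build_pecan_s3_policy(bucket: str, read_only_folder: str, read_write_folder: str) -> str:
    """Return the permissions policy as JSON text with the placeholders filled in."""
    policy = {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Sid": "PecanKeyReadPermissions",
                "Effect": "Allow",
                "Action": ["s3:Get*", "s3:List*"],
                "Resource": f"arn:aws:s3:::{bucket}/{read_only_folder}/*",
            },
            {
                "Sid": "PecanKeyWritePermissions",
                "Effect": "Allow",
                "Action": ["s3:Get*", "s3:List*", "s3:Put*", "s3:Delete*"],
                "Resource": f"arn:aws:s3:::{bucket}/{read_write_folder}/*",
            },
            {
                "Sid": "PecanBucketPermissions",
                "Effect": "Allow",
                "Action": ["s3:GetBucketLocation", "s3:ListBucket"],
                "Resource": f"arn:aws:s3:::{bucket}",
            },
        ],
    }
    return json.dumps(policy, indent=2)


if __name__ == "__main__":
    # Hypothetical bucket and folder names; replace with your own
    print(build_pecan_s3_policy("my-bucket", "input", "output"))
```

Generating the text this way keeps the bucket name consistent across all three Resource entries.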
How to configure an S3 Delta Lake file connection
1. Log in to Pecan, select the “Connections” tab, and click Add connection.
2. Select “Delta Lake file” and complete the following fields:
Connection name | This is how you’ll identify the connection when creating and working with models on the platform. Names should be unique and reflect the data source and what’s stored in it. Valid characters include letters, numbers, and underscores. Connection names can’t be changed once created. Example: “Delta_Lake_paid_downloads_fall_2021” |
Connection type | Whether the connection is to be used to bring data into Pecan or to send predictions out of Pecan. |
AWS IAM access key | The IAM access key for your S3 bucket, which will have been generated in Step 2A above. (Example: “AKIAIOSFODNN7EXAMPLE”). If you use IAM roles to authenticate with AWS, this field should be left empty. |
AWS IAM secret key | The IAM secret key for your S3 bucket, which should have been saved or downloaded at the time your IAM access key was created. (Example: “wJalrXUtnFEMI/K7MDENG/bPxRfiCYEXAMPLEKEY”). If you can’t find it or don’t have access to your password, you’ll need to ask your administrator to reset it. To learn how, see Managing Passwords for IAM users. If you use IAM roles to authenticate with AWS, this field should be left empty. |
AWS IAM role | The role you created for Pecan. |
S3 root directory | The root directory path of your tables in S3. It should contain all of your table directories. For more details, see Step 1 above. (Example: “https://my-bucket.s3.us-west-2.amazonaws.com”) |
Partition format | The format of the partitioned tables, in case it differs from the standard described in Apache Spark’s article on partition discovery. |
3. Now, click Test connection to make sure everything is working correctly. Then click Create connection to complete the setup. (For more information, see Testing and creating a data connection.)
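Because connection names can’t be changed once created, it may be worth checking a name against the stated rule (letters, numbers, and underscores) before submitting the form. The following is a minimal sketch of such a check. The function name is_valid_connection_name is hypothetical, and the pattern assumes that “letters” means ASCII letters, which the text above does not specify. Pecan’s own validation may differ.

```python
import re

# Assumption: only ASCII letters, digits, and underscores are accepted
_VALID_NAME = re.compile(r"[A-Za-z0-9_]+")


def is_valid_connection_name(name: str) -> bool:
    """Return True if name is non-empty and uses only letters, numbers, and underscores."""
    return _VALID_NAME.fullmatch(name) is not None
```

With this check, a name such as "delta_lake_paid_downloads_fall_2021" passes, while names containing spaces or hyphens are rejected.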