CSV files can be used for model training and predictions just like any other data source supported by Pecan. CSV (comma-separated values) format, which allows you to save data in the form of plain text, is commonly used by applications as a medium for transferring data between systems without the need for a dedicated interface.
Pecan allows you to upload local CSV files directly to the platform. This method is useful when:
You’d like to start building a model quickly, without connecting Pecan to your data source.
You’d like to incorporate data that's been created and modified manually (e.g. Microsoft Excel tables).
This method is not suitable when you want to schedule a model to generate predictions automatically from new incoming data. For that purpose, we recommend using Pecan’s built-in connector to import CSV files hosted on Amazon S3.
This article explains the formatting requirements for CSV files that are uploaded to Pecan, how Pecan parses such files, and how to upload CSV files.
CSV formatting requirements
To be interpreted correctly by Pecan, your CSV files must adhere to the following requirements:
Column header | The first row must be a header row, because Pecan always treats it as one. |
Delimiters | Comma (,) |
Character encoding | UTF-8 |
Quote character | Double quote (") |
Escape character | Backslash (\) |
File size | Up to 1 GB per file. |
File name | The file name (excluding the extension) may contain only letters, numbers, and underscores. |
Here’s an example of how a compatible CSV file would look in a text editor, once aligned with formatting requirements:
date, username, purchase_usd, age
2017-09-21, mark877, 19, 24
2017-09-21, aust1n, 34, 45
2017-09-21, posit1ve, 18, 50
2017-09-21, nutty8, 33, 19
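The requirements above can be checked locally before uploading. The following is a minimal sketch in Python and is not part of Pecan: `check_csv` is a hypothetical helper, the 1 GB limit is interpreted as 2**30 bytes, and the file-name rule is read as "ASCII letters, numbers, and underscores only". All of these are assumptions.

```python
import csv
import re
from pathlib import Path

MAX_BYTES = 1024 ** 3  # assumption: "1 GB" read as 2**30 bytes
NAME_PATTERN = re.compile(r"[A-Za-z0-9_]+")  # assumption: ASCII letters only


def check_csv(path):
    """Return a list of problems found; an empty list means none were detected."""
    path = Path(path)
    problems = []

    # File name (without extension) and file size.
    if not NAME_PATTERN.fullmatch(path.stem):
        problems.append("file name should contain only letters, numbers and underscores")
    if path.stat().st_size > MAX_BYTES:
        problems.append("file is larger than 1 GB")

    try:
        # Decoding as UTF-8 raises UnicodeDecodeError on any invalid byte.
        with path.open(encoding="utf-8", newline="") as handle:
            reader = csv.reader(handle, delimiter=",", quotechar='"', escapechar="\\")
            header = next(reader, None)
            if not header or any(not name.strip() for name in header):
                problems.append("first row should be a header naming every column")
            else:
                # Report only the first row whose field count differs from the header.
                for row_no, row in enumerate(reader, start=2):
                    if row and len(row) != len(header):
                        problems.append(
                            f"row {row_no} has {len(row)} fields, expected {len(header)}"
                        )
                        break
    except UnicodeDecodeError:
        problems.append("file is not valid UTF-8")

    return problems
```

Calling `check_csv("purchases_2017.csv")` on a compliant file returns an empty list. The check cannot tell a header row from a data row, so it only verifies that every column in the first row has a non-empty name.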
How to upload local CSV files to Pecan
Log into Pecan and go to the “Connections” screen.
Click New connection and select “Upload file”:
Use the Drag & drop area or click it to select your file.
The platform supports files of up to 1 GB. If a file is larger than that, split it into smaller files and upload them separately. Note that each file must include the header in its first row.
Once the file has been uploaded, the window closes and you are taken to the table view.
In Pecan, all uploaded files are organized under a connection named “my_files”.
After the file is uploaded, Pecan prepares it and makes sure it can be used for querying. This step might take a few minutes, depending on the file size.
Once the processing step is completed, your file should be available for querying in the editor, under “my_files”.
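For files above the 1 GB limit mentioned in the upload steps, the split can be done with any tool. The following is one possible sketch in Python, not a Pecan utility. `split_csv` and `CountingWriter` are hypothetical names, the default part size of about 900 MB is an arbitrary safety margin, and the sketch assumes that quotes are escaped only with a backslash.

```python
import csv
from pathlib import Path

# Parsing options taken from the formatting table; treating backslash as the
# only way to escape quotes (doublequote=False) is an assumption.
CSV_OPTIONS = dict(delimiter=",", quotechar='"', escapechar="\\", doublequote=False)


class CountingWriter:
    """File-like wrapper that tracks how many UTF-8 bytes were written."""

    def __init__(self, handle):
        self.handle = handle
        self.bytes_written = 0

    def write(self, text):
        self.bytes_written += len(text.encode("utf-8"))
        return self.handle.write(text)


def split_csv(source, out_dir, max_bytes=900 * 1024 ** 2):
    """Split `source` into part files of roughly `max_bytes`, each with the header."""
    source, out_dir = Path(source), Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    parts = []
    handle = counter = writer = None
    try:
        with source.open(encoding="utf-8", newline="") as src:
            reader = csv.reader(src, **CSV_OPTIONS)
            header = next(reader, None)
            if header is None:  # empty input: nothing to split
                return parts
            for row in reader:
                # Start a new part once the current one has reached the limit.
                if counter is None or counter.bytes_written >= max_bytes:
                    if handle is not None:
                        handle.close()
                    part_path = out_dir / f"{source.stem}_part{len(parts) + 1}.csv"
                    handle = part_path.open("w", encoding="utf-8", newline="")
                    counter = CountingWriter(handle)
                    writer = csv.writer(counter, **CSV_OPTIONS)
                    writer.writerow(header)  # every part repeats the header row
                    parts.append(part_path)
                writer.writerow(row)
    finally:
        if handle is not None:
            handle.close()
    return parts
```

The size check runs before each row is written, so a part can exceed `max_bytes` by at most one row. Splitting on parsed rows rather than raw lines keeps quoted fields that contain line breaks intact.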
How Pecan parses CSV files
CSV files, unlike Parquet and Delta files, do not store column data types. Therefore, when a CSV file is uploaded, Pecan attempts to infer the most suitable data type for certain columns based on their values.
Pecan follows Apache Spark’s default Date and Timestamp types:
“Date” data type |
|
“Timestamp” data type |
|
Even if your data contains date or timestamp columns in a different format, Pecan may still be able to recognize them correctly. Otherwise, Pecan will identify the column data type as String.
To make sure your data is identified correctly, you can convert the file before uploading it, using standard tools such as Microsoft Excel or Google Sheets (see example instructions here). Alternatively, you can convert the data type directly in the platform by using the “to_date()” function in your blueprint queries.
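Besides spreadsheet tools, the conversion before upload can also be scripted. The sketch below is an illustration only: `normalize_dates` is a hypothetical helper, and it assumes the target pattern is yyyy-MM-dd (the ISO date form that Spark uses by default for CSV dates), that you know the source format of the column, and that every row has the same number of fields as the header.

```python
import csv
from datetime import datetime


def normalize_dates(src_path, dst_path, column, source_format):
    """Copy a CSV file, rewriting `column` from `source_format` to yyyy-MM-dd."""
    with open(src_path, encoding="utf-8", newline="") as src, \
            open(dst_path, "w", encoding="utf-8", newline="") as dst:
        reader = csv.DictReader(src)
        writer = csv.DictWriter(dst, fieldnames=reader.fieldnames)
        writer.writeheader()
        for row in reader:
            value = row[column]
            if value:  # leave empty cells untouched so they stay empty
                parsed = datetime.strptime(value, source_format)
                row[column] = parsed.date().isoformat()
            writer.writerow(row)
```

For example, `normalize_dates("orders.csv", "orders_iso.csv", "date", "%d/%m/%Y")` would rewrite a value such as 21/09/2017 as 2017-09-21. A value that does not match `source_format` raises `ValueError`, which surfaces inconsistent rows before the upload rather than after it.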
Columns that Pecan is not able to recognize will be parsed as a “String” data type.
In addition, the following values will be recognized as Null values:
Null value | empty string |
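The parsing behaviour described above (infer a type from the values, fall back to String, treat empty strings as null) can be illustrated with a small Python sketch. This is not Pecan's or Spark's actual implementation: `infer_column_type` is a hypothetical function, and the two patterns in `CANDIDATE_FORMATS` are assumed placeholders rather than the formats Pecan accepts.

```python
from datetime import datetime

# Assumed example patterns, checked in order; not Pecan's real list.
CANDIDATE_FORMATS = (
    ("date", "%Y-%m-%d"),
    ("timestamp", "%Y-%m-%dT%H:%M:%S"),
)


def matches(value, fmt):
    """Return True if `value` parses completely with the strptime pattern `fmt`."""
    try:
        datetime.strptime(value, fmt)
        return True
    except ValueError:
        return False


def infer_column_type(values):
    """Return 'date', 'timestamp' or 'string' for a list of raw CSV cell strings."""
    # Empty strings are treated as null and do not influence the inferred type.
    present = [value for value in values if value != ""]
    if not present:
        return "string"
    for type_name, fmt in CANDIDATE_FORMATS:
        if all(matches(value, fmt) for value in present):
            return type_name
    # Anything that is not recognized falls back to string.
    return "string"
```

A single non-matching value, such as "n/a" in a date column, is enough to make the whole column fall back to string in this sketch, which is why cleaning such values before upload can help.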
Note: you can always cast your data directly in your queries. Spark provides several functions for this purpose.
You can find a full list of all the functions supported by Spark here.
Modify a column’s data type
If Pecan has inferred a column’s type differently from what you intended, you can set the type manually by following these steps:
Click the “edit data type” button on the right side of the row that represents your uploaded file.
Once clicked, the file row will expand to show all the columns and their current data types.
Select the desired type for each column using the drop-down selection. Once done, click “save changes”. Pecan will begin casting the file. Depending on the file size, the process might take up to several minutes to complete. Note that the file might not be accessible through the editor while the casting is in progress.
Once done, the list of columns will be updated to show the newly cast columns, and you will be able to access them using the editor.