A CSV (comma-separated values) file allows you to save data in the form of plain text. Applications commonly use it as a medium for transferring data between systems without needing a dedicated interface.
Pecan allows you to upload local CSV files directly to the platform.
This method is useful when:
You’d like to build a model quickly without connecting Pecan to your data source.
You’d like to incorporate data that has been created and modified manually (e.g., Microsoft Excel tables).
This method is not applicable when you want to schedule a model to generate automatic predictions based on new incoming data.
This article explains the formatting requirements for CSV files uploaded to Pecan, how Pecan parses such files, and how to upload CSV files.
There are several standards for CSV and many possible implementations. Pecan bases its requirements on the RFC 4180 standard for text/csv and on industry best practices.
CSV formatting requirements
To be interpreted correctly by Pecan, your CSV files must adhere to the following requirements:
Column header names | The first row must be a header row; Pecan always treats it as such. |
Delimiters | Comma (,) |
Character encoding | UTF-8 |
Quote character | Double quote (") |
Escape character | double quotes (") |
File size | Up to 1 GB per file. |
File name | The file name (excluding the extension/suffix) may contain only letters, numbers, and underscores. |
Here’s an example of how a compatible CSV file would look in a text editor once aligned with formatting requirements:
date,username,purchase_usd,age
2017-09-21,mark877,19,24
2017-09-21,aust1n,34,45
2017-09-21,posit1ve,18,50
2017-09-21,nutty8,33,19
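Before uploading, you may want to check a file against the requirements above on your own machine. The following is a minimal Python sketch, not a Pecan tool: the function name check_csv, the exact byte threshold used for "1 GB", and the restriction of "letters" to ASCII are all assumptions made for illustration.

```python
import csv
import os
import re

MAX_BYTES = 1024 ** 3  # assumed byte threshold for the 1 GB limit
NAME_PATTERN = re.compile(r"^[A-Za-z0-9_]+$")  # assumes ASCII letters only


def check_csv(path):
    """Return a list of problems found; an empty list means none were detected."""
    problems = []

    # File name without its extension: letters, numbers, and underscores.
    stem = os.path.splitext(os.path.basename(path))[0]
    if not NAME_PATTERN.match(stem):
        problems.append("file name should contain only letters, numbers, and underscores")

    if os.path.getsize(path) > MAX_BYTES:
        problems.append("file is larger than 1 GB")

    try:
        # Decoding as UTF-8 while reading catches encoding problems.
        with open(path, newline="", encoding="utf-8") as f:
            reader = csv.reader(f, delimiter=",", quotechar='"')
            header = next(reader, None)
            if not header or any(name.strip() == "" for name in header):
                problems.append("first row should be a header naming every column")
            else:
                # Report only the first record whose field count differs from the header.
                for record_no, row in enumerate(reader, start=2):
                    if len(row) != len(header):
                        problems.append(
                            f"record {record_no} has {len(row)} fields, expected {len(header)}"
                        )
                        break
    except UnicodeDecodeError:
        problems.append("file is not valid UTF-8")

    return problems
```

Calling check_csv("purchases_2017.csv") on a file like the example above would return an empty list. Passing this check does not guarantee that Pecan will accept the file; it only covers the requirements listed in the table.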
How to upload local CSV files to Pecan
Log into Pecan and go to the “Connections” screen.
Click “New connection” and select “Upload file”.
Use the drag-and-drop area or click it to select your file.
The platform supports files up to 1GB. If you have files larger than that, you can split them into smaller files and then upload them separately.
Remember, each file must include the headers in its first row.
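If you need to split a large file, one possible approach is sketched below in Python. The function name split_csv, the rows_per_file parameter, and the part_N.csv naming scheme are hypothetical choices for this example; pick a row count that keeps each output file under the size limit. The sketch assumes the source file is non-empty and UTF-8 encoded.

```python
import csv


def split_csv(path, rows_per_file, prefix="part"):
    """Split `path` into files of at most `rows_per_file` data rows.

    Every output file starts with the original header row.
    Returns the list of output file names.
    """
    outputs = []
    with open(path, newline="", encoding="utf-8") as src:
        reader = csv.reader(src)
        header = next(reader)  # assumes the file has at least a header row
        out, writer, count = None, None, 0
        for row in reader:
            # Start a new output file on the first row or when the current one is full.
            if writer is None or count == rows_per_file:
                if out:
                    out.close()
                name = f"{prefix}_{len(outputs) + 1}.csv"
                out = open(name, "w", newline="", encoding="utf-8")
                writer = csv.writer(out)
                writer.writerow(header)
                outputs.append(name)
                count = 0
            writer.writerow(row)
            count += 1
        if out:
            out.close()
    return outputs
```

For example, split_csv("events.csv", 1_000_000, prefix="events") would write events_1.csv, events_2.csv, and so on, each beginning with the header. Using the csv module rather than splitting on raw line breaks keeps quoted fields that contain newlines intact.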
Once the file has been uploaded, the window closes and you are taken to the table view.
In Pecan, all uploaded files are organized under a connection named “my_files”.
After the file is uploaded, Pecan prepares the file and makes sure it can be used for querying. This step might take a few minutes, depending on the file size.
Once the processing step is completed, your file should be available for querying in the editor, under “my_files”.
How Pecan parses CSV files
Unlike Parquet and Delta files, CSV files do not store column data types. Therefore, when a CSV file is uploaded, Pecan attempts to infer the most suitable data type for certain columns based on their values.
Pecan follows Apache Spark’s default Date and Timestamp types:
“Date” data type |
|
“Timestamp” data type |
|
Even if your data contains date or timestamp columns in a different format, Pecan may still be able to recognize them correctly. Otherwise, Pecan will identify the column data type as String.
To ensure your data is identified correctly, you may convert the file before uploading it, using standard tools such as Microsoft Excel or Google Sheets (see example instructions here).
Alternatively, you can convert the data type directly in the editor by using the “to_date()” command in your notebook queries.
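If you prefer to convert the file with a script rather than a spreadsheet tool, the Python sketch below rewrites one date column into the yyyy-MM-dd form. It assumes that yyyy-MM-dd is the pattern Pecan recognizes as a Date, based on it being Spark's default date pattern for CSV; the function name normalize_dates and its parameters are hypothetical, and the input file is assumed to be well formed (every row has as many fields as the header).

```python
import csv
from datetime import datetime


def normalize_dates(src_path, dst_path, column, input_format):
    """Copy a CSV, rewriting `column` from `input_format` to yyyy-MM-dd.

    `input_format` uses Python strptime syntax, e.g. "%d/%m/%Y".
    """
    with open(src_path, newline="", encoding="utf-8") as src, \
         open(dst_path, "w", newline="", encoding="utf-8") as dst:
        reader = csv.DictReader(src)
        writer = csv.DictWriter(dst, fieldnames=reader.fieldnames)
        writer.writeheader()
        for row in reader:
            value = row[column]
            if value:  # leave empty fields empty so they are still read as null
                parsed = datetime.strptime(value, input_format)
                row[column] = parsed.date().isoformat()
            writer.writerow(row)
```

For instance, normalize_dates("raw.csv", "clean.csv", "date", "%d/%m/%Y") would turn 21/09/2017 into 2017-09-21. A value that does not match input_format raises a ValueError, which makes unexpected formats visible before upload instead of silently producing a String column.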
Columns that Pecan cannot recognize will be parsed as a “String” data type.
In addition, the following values will be recognized as Null values:
Null value | empty field |
To clarify, an empty field must not contain any characters at all, not even a pair of single or double quotes:
2017-09-21,,19,24 # VALID
2017-09-21,"",34,45 # INVALID
2017-09-21,'',18,50 # INVALID
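If a file already contains quoted empty fields like the invalid rows above, it can be cleaned before upload. The Python sketch below is one way to do so; the function name blank_out_empty_quotes is hypothetical. It relies on the standard csv module, which reads "" as an empty string and, with its default quoting, writes an empty string as nothing between the commas.

```python
import csv


def blank_out_empty_quotes(src_path, dst_path):
    """Copy a CSV, turning "" and '' fields into truly empty fields."""
    with open(src_path, newline="", encoding="utf-8") as src, \
         open(dst_path, "w", newline="", encoding="utf-8") as dst:
        # Default quoting (QUOTE_MINIMAL) writes an empty string as an empty field.
        # Caveat: a row consisting of one single empty field is still written as "".
        writer = csv.writer(dst)
        for row in csv.reader(src):
            # The reader already turns "" into an empty string; a pair of single
            # quotes is read literally as the two-character string '' and must
            # be replaced explicitly.
            writer.writerow(["" if field == "''" else field for field in row])
```

After running it, both invalid rows above would be written in the valid form, for example 2017-09-21,,34,45.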
Note: You can always cast your data directly in your queries. Spark provides several functions for this purpose, such as the “to_date()” function mentioned above.
You can find a complete list of all the functions Spark supports here.
Modify a column’s data type
This section is only for when you upload a CSV file under your Connections tab. Editing a column's data type is coming soon to the 5-minute model feature.
If Pecan has inferred a column’s type differently from what you intended, you can set the column’s type manually by following these steps:
Click the “edit data type” button on the right side of the row that represents your uploaded file.
Once clicked, the file row will expand to show all the columns and their current data types.
Select the desired type for each column using the drop-down selection.
Once done, click “save changes”. Pecan will begin casting the file. Depending on the file size, the process might take several minutes to complete. Note that the file might not be accessible through the editor while the casting is in progress.
When casting is complete, the list of columns is updated to show the newly cast columns, and you can access them using the editor.
What if my CSV file is bigger than 1GB?
A 1GB CSV file is quite large and likely contains a HUGE amount of information.
Is this just one table exceeding the size limit? If so, it might include very old rows that could confuse the model or aren’t relevant to your predictions. We’d recommend trimming the table to focus on more recent, meaningful data, which would also help reduce the file size.
For datasets of this scale, we strongly suggest connecting your data warehouse to Pecan. This is much more efficient for handling large amounts of data and provides schema information, which helps Pecan process your data more effectively and ensures a smoother experience overall.
If connecting to a data warehouse isn’t an option, you can upload multiple smaller files and use a UNION command in your predictive notebook to combine them back into a single table. However, this isn’t ideal and can be more time-consuming.