Pecan's Data Science: A Peek Behind The Scenes

We developed an intelligent automated pipeline to enable Pecan to construct high-performance machine-learning models, tailored to the specific data and task presented.

How does Pecan execute automated feature engineering?

Pecan automates feature engineering by first analyzing the data in each column. It then tailors its feature engineering techniques to the column's content type:

Continuous Numerical Variables
For these, Pecan automatically creates statistical features such as the average, standard deviation (std), minimum (min), maximum (max), and mode. It also generates more complex features, such as the coefficients of a linear fit to a particular entity's historical values.
Categorical Variables
Pecan identifies the most prevalent historical categories for the given entity. Depending on the data distribution, it encodes them in different ways. The encoding strategies include one-hot encoding, ordinal encoding, and target encoding, among others.
Dates
Pecan recognizes patterns and significant events in date data, extracting features like the day of the week, month, seasonality patterns, and relative distances between dates to capture the temporal information.
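
To make the three column types above more concrete, here is a minimal sketch of per-entity feature extraction using only the Python standard library. It is not Pecan's implementation: the function names, the sample purchase history, and the choice of one-hot encoding for the most frequent category are all assumptions made for illustration.

```python
from collections import Counter
from datetime import date
from statistics import mean, mode, pstdev


def linear_fit(values):
    """Least-squares slope and intercept of values against their index 0..n-1."""
    n = len(values)
    x_mean = (n - 1) / 2
    y_mean = mean(values)
    denom = sum((x - x_mean) ** 2 for x in range(n))
    if denom == 0:  # a single observation has no trend
        return 0.0, y_mean
    slope = sum((x - x_mean) * (y - y_mean) for x, y in enumerate(values)) / denom
    return slope, y_mean - slope * x_mean


def numeric_features(values):
    """Summary statistics plus linear-trend coefficients for one entity."""
    slope, intercept = linear_fit(values)
    return {
        "mean": mean(values),
        "std": pstdev(values),
        "min": min(values),
        "max": max(values),
        "mode": mode(values),
        "trend_slope": slope,
        "trend_intercept": intercept,
    }


def categorical_features(categories, vocabulary):
    """Most frequent historical category, one-hot encoded over a fixed vocabulary."""
    top = Counter(categories).most_common(1)[0][0]
    return {f"top_category={c}": int(c == top) for c in vocabulary}


def date_features(event_dates, reference_date):
    """Calendar parts of the latest event and its distance to a reference date."""
    last = max(event_dates)
    return {
        "last_day_of_week": last.weekday(),  # Monday = 0
        "last_month": last.month,
        "days_since_last": (reference_date - last).days,
    }


# Hypothetical purchase history for a single customer (illustrative data only).
amounts = [10.0, 20.0, 20.0, 30.0]
channels = ["web", "store", "web", "web"]
dates = [date(2024, 1, 5), date(2024, 2, 9), date(2024, 3, 1), date(2024, 3, 15)]

features = {
    **numeric_features(amounts),
    **categorical_features(channels, ["web", "store", "app"]),
    **date_features(dates, date(2024, 4, 1)),
}
```

Each helper returns a flat dictionary, so the features for one entity can be merged into a single row of a training table.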

Denoising autoencoders and other unsupervised methods, such as clustering (e.g. for identifying lookalikes), are used to enhance feature engineering. Feature selection is performed using permutation tests and Shapley values.

How does Pecan determine feature selection and significance?

Pecan employs standard feature selection methods, including variance thresholds and correlation coefficients. Building on these foundational techniques, Pecan also applies advanced methods such as permutation importance and SHAP values.
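
As a rough illustration of how these methods fit together, the sketch below first applies a variance threshold and a correlation filter, then scores the surviving columns with permutation importance. It uses only the standard library. The thresholds, the column names, and `toy_model` (a stand-in for a fitted model) are assumptions for the example, not Pecan's actual settings. SHAP values are omitted because they need a dedicated library.

```python
import random
from statistics import mean, pvariance


def pearson(a, b):
    """Pearson correlation coefficient of two equal-length sequences."""
    ma, mb = mean(a), mean(b)
    cov = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    norm = (sum((x - ma) ** 2 for x in a) * sum((y - mb) ** 2 for y in b)) ** 0.5
    return cov / norm if norm else 0.0


def basic_filter(columns, min_variance=1e-8, max_abs_corr=0.95):
    """Drop near-constant columns, then columns highly correlated with a kept one."""
    kept = []
    for name, values in columns.items():
        if pvariance(values) <= min_variance:
            continue
        if any(abs(pearson(values, columns[k])) > max_abs_corr for k in kept):
            continue
        kept.append(name)
    return kept


def permutation_importance(predict, columns, target, names, n_repeats=20, seed=0):
    """Mean increase in squared error when a single column is shuffled."""
    rng = random.Random(seed)

    def mse(cols):
        rows = zip(*(cols[name] for name in names))
        return mean(
            (predict(dict(zip(names, row))) - y) ** 2 for row, y in zip(rows, target)
        )

    baseline = mse(columns)
    scores = {}
    for name in names:
        increases = []
        for _ in range(n_repeats):
            shuffled = columns[name][:]
            rng.shuffle(shuffled)
            increases.append(mse({**columns, name: shuffled}) - baseline)
        scores[name] = mean(increases)
    return scores


# Illustrative data: one informative column, one redundant copy,
# one constant column, and one column unrelated to the target.
columns = {
    "x1": [float(i) for i in range(20)],
    "x1_double": [2.0 * i for i in range(20)],
    "constant": [1.0] * 20,
    "noise": [(-1.0) ** i for i in range(20)],
}
target = [3.0 * v for v in columns["x1"]]

kept = basic_filter(columns)


def toy_model(row):
    """Hypothetical fitted model that relies only on x1."""
    return 3.0 * row["x1"]


scores = permutation_importance(toy_model, columns, target, kept)
```

Here the filter keeps `x1` and `noise`. Shuffling `x1` raises the error, while shuffling `noise` leaves it unchanged, so `noise` would be a candidate for removal.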

What modeling techniques does Pecan use?

Pecan uses state-of-the-art modeling techniques to rapidly experiment, test, and select the best and most accurate models using Bayesian Optimization methods over the hyperparameter space.

Pecan's best-in-class methods and algorithms include:

Time Series LSTM
ARIMA
Prophet
Tree-based Models (e.g. LightGBM, CatBoost)

How does Pecan choose its modeling algorithms?

Pecan rapidly experiments with and evaluates candidate models across the hyperparameter space to identify the most accurate one. Typically, tree-based models (LightGBM, CatBoost) are chosen, given their proven strong performance on tabular data. Hyperparameter optimization is conducted on a validation set, which is set to 10% of the training data. In addition to hyperparameter optimization, multiple loss functions are evaluated based on the task and data distribution (e.g. log loss, Tweedie, and more).
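
The sketch below illustrates the idea of tuning on a 10% validation split under a chosen loss function. It is deliberately simplified and makes several assumptions: plain random search stands in for Bayesian optimization, a one-parameter ridge regression stands in for a tree-based model, and the Tweedie power of 1.5 is an arbitrary illustrative choice. All function names are hypothetical.

```python
import math
import random


def log_loss(y_true, p_pred, eps=1e-15):
    """Mean binary cross-entropy; probabilities are clipped away from 0 and 1."""
    total = 0.0
    for y, p in zip(y_true, p_pred):
        p = min(max(p, eps), 1 - eps)
        total -= y * math.log(p) + (1 - y) * math.log(1 - p)
    return total / len(y_true)


def tweedie_deviance(y_true, mu_pred, power=1.5):
    """Mean Tweedie deviance for 1 < power < 2 (requires y >= 0 and mu > 0)."""
    a, b = 1 - power, 2 - power
    total = 0.0
    for y, mu in zip(y_true, mu_pred):
        total += 2 * (y ** b / (a * b) - y * mu ** a / a + mu ** b / b)
    return total / len(y_true)


def holdout_split(rows, validation_fraction=0.10, seed=0):
    """Shuffle the rows and hold out a fraction of them for validation."""
    rows = rows[:]
    random.Random(seed).shuffle(rows)
    n_val = max(1, round(len(rows) * validation_fraction))
    return rows[n_val:], rows[:n_val]


def fit_ridge_slope(rows, alpha):
    """Closed-form slope of y ~ w * x with L2 penalty alpha (no intercept)."""
    return sum(x * y for x, y in rows) / (sum(x * x for x, _ in rows) + alpha)


def random_search(train, valid, loss, n_trials=30, seed=0):
    """Try random penalties and keep the one with the lowest validation loss."""
    rng = random.Random(seed)
    best = None
    for _ in range(n_trials):
        alpha = 10 ** rng.uniform(-3, 3)  # log-uniform over [1e-3, 1e3]
        w = fit_ridge_slope(train, alpha)
        score = loss([y for _, y in valid], [w * x for x, _ in valid])
        if best is None or score < best[0]:
            best = (score, alpha, w)
    return best


# Illustrative data: y is exactly twice x.
rows = [(float(x), 2.0 * x) for x in range(1, 101)]
train, valid = holdout_split(rows)
best_score, best_alpha, best_w = random_search(train, valid, tweedie_deviance)
```

Passing a different function as `loss` (for a classification task, `log_loss` on predicted probabilities) changes what the search optimizes without changing the search itself.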

Optimization metric

An important aspect of the modeling procedure is choosing the metric to optimize. Users can change the default optimization metric. For more information, please see this article.
