Pecan's Automated Pipeline: ML Models & Data Security

Learn about Pecan's state-of-the-art pipeline for tailored ML models and automated feature engineering - while keeping your data safe.

Written by Ori Sagi
Updated over a week ago

We developed an automated pipeline that lets Pecan build high-performance machine-learning models tailored to the specific data and task at hand.

How does Pecan execute automated feature engineering?

Pecan analyzes the data in each column and applies feature engineering techniques suited to the column's content type:

  • Continuous Numerical Variables
    For these, Pecan automatically generates statistical features such as the average, standard deviation (std), minimum (min), maximum (max), and mode. It also generates more complex features, such as the coefficients of a linear fit over a particular entity's historical values.

  • Categorical Variables
    Pecan identifies the most frequent historical categories for the given entity. Depending on the data distribution, it encodes them in different ways, including one-hot encoding, ordinal encoding, and target encoding, among others.

  • Dates
    Pecan recognizes patterns and significant events in date data, extracting features such as the day of the week, the month, seasonality patterns, and relative distances between dates to capture temporal information.

How does Pecan determine feature selection and significance?

Pecan employs standard feature selection methods, including variance thresholds and correlation coefficients. Building on these foundational techniques, Pecan also applies more advanced methods such as permutation importance and SHAP values.

How does Pecan choose its modeling algorithms?

Pecan's modeling approach enables fast experimentation and evaluation, so it can identify the most accurate model across the hyperparameter search space. Typically, tree-based models (LightGBM, CatBoost) are chosen, given their strong performance on tabular data. Hyperparameter optimization is conducted on a validation set, which is set to 10% of the training data. In addition to hyperparameter optimization, multiple loss functions are evaluated depending on the task and data distribution (e.g., log loss, Tweedie, and others).
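The selection loop described above can be sketched as follows: hold out 10% of the training data, fit each candidate on the rest, and keep the candidate with the lowest validation log loss. To stay self-contained, this sketch swaps in a depth-1 decision tree (a "stump") on a single feature for LightGBM or CatBoost, and treats the split threshold as the hyperparameter being searched. All names and the toy data are hypothetical, and this is not Pecan's implementation.

```python
import math
import random


def log_loss(y_true, probs, eps=1e-15):
    """Mean negative log-likelihood for binary labels."""
    total = 0.0
    for y, p in zip(y_true, probs):
        p = min(max(p, eps), 1 - eps)  # clip to avoid log(0)
        total -= y * math.log(p) + (1 - y) * math.log(1 - p)
    return total / len(y_true)


def holdout_split(xs, ys, val_fraction=0.10, seed=0):
    """Shuffle indices and reserve a fraction of rows for validation."""
    idx = list(range(len(xs)))
    random.Random(seed).shuffle(idx)
    n_val = max(1, round(len(idx) * val_fraction))

    def pick(ids):
        return [xs[i] for i in ids], [ys[i] for i in ids]

    return pick(idx[n_val:]), pick(idx[:n_val])


def fit_stump(xs, ys, threshold):
    """Depth-1 tree: a smoothed positive rate on each side of the threshold."""
    def rate(labels):
        return (sum(labels) + 1) / (len(labels) + 2)  # Laplace smoothing

    left = rate([y for x, y in zip(xs, ys) if x <= threshold])
    right = rate([y for x, y in zip(xs, ys) if x > threshold])
    return lambda x: left if x <= threshold else right


def select_best(xs, ys, thresholds):
    """Return (validation log loss, threshold) for the best candidate."""
    (x_train, y_train), (x_val, y_val) = holdout_split(xs, ys)
    scored = []
    for t in thresholds:
        model = fit_stump(x_train, y_train, t)
        scored.append((log_loss(y_val, [model(x) for x in x_val]), t))
    return min(scored)


# Toy data: the label flips from 0 to 1 at x = 50.
xs = list(range(100))
ys = [int(x >= 50) for x in xs]
best_loss, best_threshold = select_best(xs, ys, thresholds=[10, 49, 90])
print(best_threshold, round(best_loss, 4))
```

Scoring candidates on held-out rows rather than on the training rows is what keeps the search from simply rewarding the configuration that memorizes the training data best.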

Optimization metric

An important aspect of the modeling procedure is choosing the metric to optimize. Users can change the default optimization metric. For more information, please see this article.

Security and Privacy

Is my data safe?

We enforce strict policies to ensure that data from one customer's model is never shared with another customer, keeping customer datasets fully isolated from one another. This preserves the confidentiality of each client's data: there is no cross-contamination or inadvertent leakage of information, and a customer's data is never used to enrich another customer's AI model.

What about PII (Personally identifiable information)?

To create a great model, Pecan doesn't need any PII whatsoever - and you have complete control over exactly which tables and which columns in them Pecan will have access to.

How does Pecan keep my data secure?

Pecan undergoes an annual audit for ISO 27001 and SOC 2 Type 2 compliance.
Read more here.
