Recommended data volume for machine learning model
Written by Linor Ben-El
Updated over a week ago

Data volume is a critical factor in training a robust and reliable machine learning model. The richness and complexity of your data significantly influence the model's ability to find hidden patterns, learn them, and make accurate predictions.

In Pecan, we advise training a model with at least 1,000 entities and more than 10 attribute columns.
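As a rough illustration, the two thresholds above can be checked before uploading data. This is a minimal sketch, not Pecan's own validation: it assumes the data is a list of dictionaries (one per row), that a hypothetical `entity_id` field identifies each entity, and that every other field counts as an attribute column.

```python
# Thresholds taken from the recommendation above.
MIN_ENTITIES = 1000
MIN_ATTRIBUTE_COLUMNS = 10  # the text says "more than 10", so the check is strict


def check_data_volume(rows, entity_key="entity_id"):
    """Summarize entity and attribute-column counts for a list of row dicts.

    `entity_key` is a hypothetical column name; replace it with whatever
    identifies an entity in your own dataset.
    """
    # Count distinct entities, not rows: one entity may appear in many rows.
    entities = {row[entity_key] for row in rows}

    # Collect every column name seen in any row, excluding the entity key.
    columns = set()
    for row in rows:
        columns.update(row.keys())
    columns.discard(entity_key)

    return {
        "entity_count": len(entities),
        "attribute_column_count": len(columns),
        "enough_entities": len(entities) >= MIN_ENTITIES,
        "enough_columns": len(columns) > MIN_ATTRIBUTE_COLUMNS,
    }


if __name__ == "__main__":
    # Synthetic example: 1,200 entities with 12 attribute columns each.
    sample = [
        {"entity_id": i, **{f"attr_{j}": j for j in range(12)}}
        for i in range(1200)
    ]
    print(check_data_volume(sample))
```

The function counts distinct entities rather than rows because the recommendation is stated in terms of entities.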

Number of Entities

Each entity in your dataset represents a unique opportunity for the model to learn.

With fewer than 1,000 entities, the model may struggle to capture the depth of behaviors and patterns in the data, potentially leading to overfitting or weak predictive power.

In other words, more entities mean more learning opportunities and a model that generalizes better.

Attribute Columns

Similarly, attribute columns provide the model with different "perspectives" on each entity. Each attribute column is a potential predictor the model can use to make predictions. Having fewer than 10 attribute columns may limit the model's "perspective," potentially leading to oversimplified models that don't fully leverage the data's predictive potential.

But remember, not just any data will do. The attribute columns must be valid and reliable:

Avoid using attribute columns with high cardinality

Columns with a very large number of unique values (high cardinality), such as free-form identifiers, can complicate the model's training process. They add complexity without necessarily helping the model learn, so Pecan may drop these columns.
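One way to spot such columns ahead of time is to measure what share of a column's values are unique. The sketch below is an assumption-laden illustration: the 0.9 threshold and the function names are hypothetical, and Pecan's actual cutoff is not stated in this article. The check is mainly meaningful for categorical or text columns, since continuous numeric columns naturally have many distinct values.

```python
def cardinality_ratio(values):
    """Return the share of non-null values that are unique (0.0 to 1.0)."""
    non_null = [v for v in values if v is not None]
    if not non_null:
        return 0.0
    return len(set(non_null)) / len(non_null)


def high_cardinality_columns(rows, columns, threshold=0.9):
    """List the given columns whose uniqueness ratio exceeds `threshold`.

    `threshold` is an illustrative value, not a documented Pecan setting.
    """
    flagged = []
    for col in columns:
        ratio = cardinality_ratio([row.get(col) for row in rows])
        if ratio > threshold:
            flagged.append(col)
    return flagged


if __name__ == "__main__":
    # "email" is unique per row; "plan" takes only three values.
    sample = [
        {"email": f"user{i}@example.com", "plan": ["free", "pro", "team"][i % 3]}
        for i in range(300)
    ]
    print(high_cardinality_columns(sample, ["email", "plan"]))
```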

Avoid using attribute columns with a high null rate

Similarly, columns with a high proportion of null or missing values offer limited informative value to the model and might introduce bias. Therefore, Pecan may also exclude these columns from the model.
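The null rate of a column can be estimated in the same spirit. This is a minimal sketch under stated assumptions: `None`, empty strings, and NaN floats are all treated as missing, and the 0.5 threshold is purely illustrative, since this article does not state the rate at which Pecan excludes a column.

```python
import math


def is_missing(value):
    """Treat None, empty strings, and NaN floats as missing (an assumption)."""
    if value is None or value == "":
        return True
    return isinstance(value, float) and math.isnan(value)


def null_rate(values):
    """Return the fraction of values that are missing (0.0 to 1.0)."""
    if not values:
        return 0.0
    return sum(1 for v in values if is_missing(v)) / len(values)


def high_null_columns(rows, columns, threshold=0.5):
    """List the given columns whose null rate exceeds `threshold`.

    `threshold` is a hypothetical value chosen for illustration only.
    """
    return [
        col
        for col in columns
        if null_rate([row.get(col) for row in rows]) > threshold
    ]


if __name__ == "__main__":
    sample = [
        {"age": 34, "referrer": None},
        {"age": 41, "referrer": ""},
        {"age": None, "referrer": None},
        {"age": 29, "referrer": "newsletter"},
    ]
    print(high_null_columns(sample, ["age", "referrer"]))
```

A column that is absent from a row is counted as missing here, because `row.get(col)` returns `None` in that case.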

It's important to remember that while more data can lead to more accurate models, the quality of that data is equally vital. Review your dataset critically to make sure it is both large enough and reliable.

Should you have any questions or concerns about your dataset's volume or validity, please do not hesitate to reach out to the Pecan team. Our priority is to guide you toward successful model training, providing insights and recommendations tailored to your specific needs.
