How to Identify Bad Datasets Before Building Models?

Greetings, everybody!

In my experience with machine learning models, it is not uncommon for me to start modeling after extensive data cleaning, transformation, and processing, only to discover that the dataset is inadequate. A lot of effort is wasted before I even realize this, which is really annoying.

I get that thorough EDA helps, but I was hoping someone could tell me if there are any less obvious checks I can run to find problems with the dataset before I start building models.

Any advice would be much appreciated.

3 Likes

Can you expand a bit on what you mean by datasets not being good enough?

3 Likes

Imagine a customer has asked you to build a model to forecast a target variable using a certain dataset. We put a lot of time and effort into cleaning and enriching the data, but the customer still thinks the result isn’t good enough for predictions, so we’re back at square one. Before we begin the project, I’d like to know whether the dataset is sufficient for us to go forward.

2 Likes

Establishing a goal is the first step. This should include a target variable and an initial set of features.

In a supervised task, if the target variable is not already available, creating it is not easy and may not even be feasible.

Knowledge of the domain and the business is essential to judge whether the features are valid to use and whether they are likely to be sufficient for the prediction. Track where each feature comes from and avoid leaking the target into the features.
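One cheap way to spot a possibly leaked target is to look for features that track the target almost perfectly. The sketch below is only an illustration under assumptions: the column names (`tenure_months`, `refund_issued`), the toy values, and the 0.95 threshold are all made up, and plain Pearson correlation only catches linear, numeric leakage. A flagged feature still needs a domain check to confirm when it was recorded.

```python
import math

def pearson(xs, ys):
    """Pearson correlation of two equal-length numeric sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0  # a constant column cannot correlate with anything
    return cov / math.sqrt(var_x * var_y)

def flag_leaky_features(columns, target, threshold=0.95):
    """Return features whose |correlation| with the target is suspiciously high."""
    flagged = {}
    for name, values in columns.items():
        r = pearson(values, target)
        if abs(r) >= threshold:
            flagged[name] = r
    return flagged

# Hypothetical toy data: 'refund_issued' is only known after the customer churns.
columns = {
    "tenure_months": [3, 24, 12, 36, 6, 18],
    "refund_issued": [1, 0, 1, 0, 1, 0],
}
churned = [1, 0, 1, 0, 1, 0]

suspicious = flag_leaky_features(columns, churned)
print(suspicious)
```

Here only `refund_issued` is flagged, because it duplicates the target; `tenure_months` is correlated with churn but stays below the threshold.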

You can now move on to the EDA. While doing it, though, take a holistic view and consider more than just the data analysis itself. An EDA that ignores what the data actually means for the project scope, the quality of the data, and how the data was obtained is meaningless.
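The data quality part can be partly automated before any heavy cleaning starts. Here is a minimal sketch, assuming the data fits in memory as a list of dictionaries and that missing values are stored as `None`; the function name `audit_rows` and the sample rows are hypothetical.

```python
from collections import Counter

def audit_rows(rows):
    """Summarise missing values, distinct values and duplicate rows."""
    columns = list(rows[0].keys())
    summary = {}
    for col in columns:
        values = [row[col] for row in rows]
        present = [v for v in values if v is not None]
        summary[col] = {
            "missing_frac": 1 - len(present) / len(values),
            "n_unique": len(set(present)),
        }
    # Rows are compared as tuples of their values in column order.
    counts = Counter(tuple(row[c] for c in columns) for row in rows)
    n_duplicates = sum(count - 1 for count in counts.values())
    return summary, n_duplicates

rows = [
    {"age": 34, "country": "DE", "spend": 10.0},
    {"age": None, "country": "DE", "spend": 12.5},
    {"age": 29, "country": "DE", "spend": 10.0},
    {"age": 34, "country": "DE", "spend": 10.0},  # exact copy of the first row
]

summary, n_duplicates = audit_rows(rows)
# A column with at most one distinct value carries no signal for a model.
constant_columns = [c for c, s in summary.items() if s["n_unique"] <= 1]
print(summary, n_duplicates, constant_columns)
```

High missing fractions, constant columns, and many duplicate rows are cheap early warnings, but they say nothing about whether the values themselves are trustworthy; that still needs the domain view described above.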

You can now move on to developing the project and doing the relevant analysis. When a model performs poorly, the next steps are to iterate: look for more data sources, try feature engineering, and so on.
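To tell early whether a model is performing poorly, it helps to compare a cheap model on the raw features against a trivial baseline. The sketch below is one possible version of that check, not a standard recipe: it assumes a small numeric classification dataset, uses leave-one-out 1-nearest-neighbour as the stand-in "cheap model", and the data points are invented.

```python
from collections import Counter

def majority_baseline_accuracy(y):
    """Accuracy of always predicting the most frequent class."""
    return Counter(y).most_common(1)[0][1] / len(y)

def loo_1nn_accuracy(X, y):
    """Leave-one-out accuracy of a 1-nearest-neighbour classifier."""
    correct = 0
    for i, xi in enumerate(X):
        # Nearest other point by squared Euclidean distance.
        nearest = min(
            (j for j in range(len(X)) if j != i),
            key=lambda j: sum((a - b) ** 2 for a, b in zip(xi, X[j])),
        )
        correct += int(y[nearest] == y[i])
    return correct / len(X)

# Hypothetical toy data: two well-separated groups.
X = [(1.0, 1.2), (0.8, 1.0), (1.1, 0.9), (0.9, 1.1),
     (3.0, 3.2), (3.1, 2.9), (2.9, 3.0), (3.2, 3.1)]
y = [0, 0, 0, 0, 1, 1, 1, 1]

baseline = majority_baseline_accuracy(y)
cheap_model = loo_1nn_accuracy(X, y)
print(baseline, cheap_model)
```

If the cheap model barely beats the baseline on the raw features, that is a hint (not proof) that the features carry little signal for the target, and it is worth raising with the customer before investing in cleaning and enrichment.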

There will be a lot of room for imagination and experimentation during the process.

2 Likes

You have an army of professors and TAs doing the cleaning and sanitizing for you at university, but in the real world, it’s up to you. Data is a reflection of the realities of poor measuring instruments, human errors in measurement, biases, and so on.

1 Like

Technically, it isn’t possible to do this exactly for every ML task. I think some tasks, like classification, are slightly more solved, and some others not so much. For example, feel free to look into influence functions…