Using predictions as data quality checks is an intriguing idea. Data leakage is a legitimate concern, but the approach is worth evaluating carefully.
Background:
In this scenario, a kNN-based model estimates the value of used cars based on historical sales data, where the accuracy heavily relies on the quality of the closest sale records. The data is collected from various sources, leading to variations in data quality despite rigorous cleaning efforts. However, there’s a lack of programmatic means to identify problematic records.
The model’s validation involves computing estimates for each training record with itself excluded from the dataset.
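This leave-one-out validation could be sketched as follows. Everything here is illustrative: a toy Euclidean kNN pricer with hypothetical function names and sample data, not the asker's actual pipeline.

```python
import numpy as np

def loo_knn_estimates(X, y, k=2):
    """Leave-one-out kNN estimates: each record is priced from its k
    nearest neighbours, with the record itself excluded."""
    X, y = np.asarray(X, dtype=float), np.asarray(y, dtype=float)
    n = len(X)
    est = np.empty(n)
    for i in range(n):
        d = np.linalg.norm(X - X[i], axis=1)
        d[i] = np.inf                    # exclude the record itself
        nbrs = np.argsort(d)[:k]         # k closest *other* sales
        est[i] = y[nbrs].mean()
    return est

# toy data: [mileage (x1000 km), age (years)] -> sale price
X = [[10, 1], [12, 1], [50, 5], [52, 5]]
y = [20000, 19000, 8000, 7500]
print(loo_knn_estimates(X, y, k=1))
```

Comparing these estimates against the recorded sale prices gives the per-record validation error the rest of the approach builds on.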
Here’s the proposed approach:
During the initial validation, create a log of neighbors used in computing each estimate.
For a given record X, obtain the list of other sales that used it as a neighbor.
Remove X from the dataset and recompute estimates for the sales in that list.
If the average error decreases by more than a threshold T when record X is excluded, flag X as questionable.
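The steps above can be sketched end to end. This is a minimal illustration under assumed details (a Euclidean kNN averaging model; `knn_estimate`, `audit`, and the toy data are all hypothetical), not the asker's production setup.

```python
import numpy as np

def knn_estimate(X, y, i, exclude, k):
    """kNN estimate for record i, ignoring indices in `exclude`."""
    d = np.linalg.norm(X - X[i], axis=1)
    d[list(exclude) + [i]] = np.inf      # never use the record itself
    nbrs = np.argsort(d)[:k]
    return y[nbrs].mean(), set(nbrs)

def audit(X, y, k=2, T=0.0):
    """Flag records whose removal lowers the average error of the
    records that used them as neighbours by more than T."""
    X, y = np.asarray(X, dtype=float), np.asarray(y, dtype=float)
    n = len(X)
    used_by = {i: [] for i in range(n)}  # record -> sales that used it
    base_err = np.empty(n)
    # step 1: leave-one-out estimates, logging neighbours as we go
    for i in range(n):
        est, nbrs = knn_estimate(X, y, i, set(), k)
        base_err[i] = abs(est - y[i])
        for j in nbrs:
            used_by[j].append(i)
    flagged = []
    # steps 2-4: for each record X, re-estimate its dependants without it
    for x in range(n):
        dependants = used_by[x]
        if not dependants:
            continue
        before = np.mean([base_err[i] for i in dependants])
        after = np.mean([abs(knn_estimate(X, y, i, {x}, k)[0] - y[i])
                         for i in dependants])
        if before - after > T:           # error drops when x is removed
            flagged.append(x)
    return flagged

# toy data: record 3 has a suspiciously low price for its mileage
X = [[10], [11], [12.1], [11.4]]
y = [20000, 19500, 19800, 5000]
print(audit(X, y, k=2, T=1000))          # the bad record gets flagged
```

Note this recomputes estimates per candidate record, so it is O(n) full passes in the worst case; restricting the audit to records with the largest base errors is one way to keep it tractable.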
Regarding terminology, this method aligns with the concept of iterative model refinement based on validation results. However, it may not have a specific name or precedent within the literature. As for the threshold T, its determination typically involves experimentation and may vary depending on the dataset’s characteristics and the desired level of data quality assurance.
Using predictions to identify problems early improves quality control. The method needs to be reliable to keep people's trust. Collaboration and consistent model improvement are essential.
Hello folks, predictive quality in manufacturing uses machine learning and manufacturing data to detect defects in real time and quickly identify the root cause.
Your approach of utilising predictions to assess data quality is sound. Leveraging the model's own performance to identify and remove suspect records should improve accuracy.
Data leakage is a risk, particularly with kNN models, though excluding each record from its own estimate helps mitigate it.
Choosing the appropriate threshold T requires experimentation. A reasonable starting point is the mean error improvement plus one standard deviation; adjust from there to meet your needs.
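That starting point could look something like this. The function name, the multiplier `c`, and the sample improvements are all assumptions for illustration:

```python
import numpy as np

def suggest_threshold(improvements, c=1.0):
    """Hypothetical starting point for T: flag records whose error
    improvement exceeds the mean by c standard deviations."""
    imp = np.asarray(improvements, dtype=float)
    return imp.mean() + c * imp.std()

# average-error improvements observed when each record was removed;
# most are noise around zero, one record clearly stands out
imp = [5, -3, 2, 120, -1, 4]
T = suggest_threshold(imp)
flagged = [i for i, v in enumerate(imp) if v > T]
print(T, flagged)
```

Raising `c` makes the audit more conservative; lowering it flags more records for manual review.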
This iterative procedure, like feature selection or model tuning, has no widely used name, but it is useful for improving both your model and your data. Log each step so you can revise your threshold and measure progress.