By Abhay Kulkarni

Selecting a “good” training dataset

My previous post was about selecting a meaningful training dataset. But how do you know that the dataset, once selected, is good? Let me offer two considerations: quality and appropriateness.

Quality, in my opinion, relates to the correctness of the data, covering both the ground truth (labels) and the input features. For example, if you are building a model to predict product category, check whether the category field is being captured correctly. Maybe the category is set by an individual and, even when set wrongly, the case somehow finds its way to the correct customer agent. If that agent does not follow the practice of resetting the category to the correct one, you end up training your model on wrong ground truth.

The second aspect is appropriateness. Say there are two fields on your lead scoring form: one filled in by business rules when the lead first comes in (the perceived score), and another filled in at the time of closing the lead, based on reality (the actual score). If you are asked to build a lead score predictor, what is the appropriate ground truth: the score the business rules predicted, or the score that was finally assigned? The answer really depends on the business need.

Keeping the "goodness" of training data in mind will take you one step closer to practical AI.

#abhayPracticalAI #ArtificialIntelligence #AI
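The product-category example above can be checked mechanically. The sketch below, which is my illustration rather than anything from the post, estimates how noisy the labels are by comparing the category captured when a case was created against the category the agent ultimately resolved it under. The field names `initial_category` and `resolved_category` are hypothetical.

```python
# Hypothetical label-noise check: if agents correct the category at
# resolution time, the disagreement rate between the initial and the
# resolved category is a rough estimate of label noise in the raw data.

def label_noise_rate(records):
    """Fraction of records whose initially captured label disagrees
    with the final, agent-verified label."""
    if not records:
        return 0.0
    mismatches = sum(
        1 for r in records
        if r["initial_category"] != r["resolved_category"]
    )
    return mismatches / len(records)

# Toy data: 2 of 4 records were re-categorised by the agent.
records = [
    {"initial_category": "billing",  "resolved_category": "billing"},
    {"initial_category": "billing",  "resolved_category": "shipping"},
    {"initial_category": "returns",  "resolved_category": "returns"},
    {"initial_category": "shipping", "resolved_category": "billing"},
]

print(f"Label disagreement rate: {label_noise_rate(records):.0%}")
```

A high disagreement rate suggests you should train on the agent-corrected label rather than the initially captured one, which is exactly the quality question raised above.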
