By Abhay Kulkarni

Selecting a “good” training dataset

My previous post was about selecting a meaningful training dataset. But how do you know that the dataset, once selected, is good? Let me offer two considerations: quality and appropriateness.

Quality, in my opinion, relates to the correctness of the data, covering both the ground truth (labels) and the input features. For example, if you are building a model to predict product category, check whether the category field is being captured correctly. Maybe the category is set by an individual and, even when set wrongly, the case somehow finds its way to the correct customer agent. If that agent does not follow the practice of resetting the category to the correct one, you end up training your model on wrong ground truth.

The second aspect is appropriateness. Say there are two fields on your lead scoring form: one filled in by business rules when the lead first comes in (the perceived score), and another filled in at the time of closing the lead, based on reality (the actual score). If you are asked to build a lead score predictor, what is the appropriate ground truth: the score the business rules predicted, or the score that was finally assigned? The answer really depends on the business need.

Keeping the "goodness" of training data in mind will take you one step closer to practical AI.

#abhayPracticalAI #ArtificialIntelligence #AI
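The product-category example above can be checked mechanically. The sketch below, which is my illustration rather than anything from the post, estimates how noisy the labels are by comparing the category captured when a case was created against the category the agent ultimately resolved it under. The field names `initial_category` and `resolved_category` are hypothetical.

```python
# Hypothetical label-noise check: if agents correct the category at
# resolution time, the disagreement rate between the initial and the
# resolved category is a rough estimate of label noise in the raw data.

def label_noise_rate(records):
    """Fraction of records whose initially captured label disagrees
    with the final, agent-verified label."""
    if not records:
        return 0.0
    mismatches = sum(
        1 for r in records
        if r["initial_category"] != r["resolved_category"]
    )
    return mismatches / len(records)

# Toy data: 2 of 4 records were re-categorised by the agent.
records = [
    {"initial_category": "billing",  "resolved_category": "billing"},
    {"initial_category": "billing",  "resolved_category": "shipping"},
    {"initial_category": "returns",  "resolved_category": "returns"},
    {"initial_category": "shipping", "resolved_category": "billing"},
]

print(f"Label disagreement rate: {label_noise_rate(records):.0%}")
```

A high disagreement rate suggests you should train on the agent-corrected label rather than the initially captured one, which is exactly the quality question raised above.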
