I’m developing a model to determine who might buy a product. Out of my one million consumers, 10% bought the product in the previous 12 months. I’m concerned that if I classify the remaining 90% as non-purchasers(0), the model may mistakenly conclude that they are genuinely bad cases when in reality they may simply be potential customers.
Here, is categorization the best course of action? Which methods work best when dealing with clients who haven’t made a purchase yet? Would an approach such as positive-unlabeled (PU) learning or semi-supervised learning be more suitable? Or are techniques like novelty detection or clustering a better choice?
I’m eager to hear your insights! Please share any comparable experiences you have had with the same issue.
This is a vague question that comes up frequently in business.
It seems your problem isn’t well-defined. You need to clarify what constitutes a “potential customer.” For example, if you have data on Apple customers who had a MacBook 20 years ago and track their purchases over the next year, you might find that 10% bought an iPhone while 90% did not. However, if you extend the observation period to 2024, you might discover that 90% of those original MacBook owners eventually bought an iPhone.
This example shows that the definition of a “potential customer” can change over time. Many potential buyers may become actual buyers given enough time. The challenge is to define a clear timeframe and criteria to differentiate potential buyers from actual buyers. Without this clarity, your predictive model’s results will be inconsistent and context-dependent.
Thank you for pointing out the importance of defining potential customers dynamically over time. Clear criteria and context are essential. This complexity underscores the need for a well-defined strategy in developing predictive models for identifying future buyers. If you were tackling this problem, how would you approach it?
On r/data science or r/statistics, you might receive more helpful answers. Your task is to determine the statistical definition of your issue.
Poisson regression will be promptly suggested by someone there. You forecast the rate of purchases (which, for 90% of customers, is supposedly so little that no purchases are detected) rather than “Will this customer purchase”.
I believe it deftly handles a few points raised by others:
Manage changing window sizes: Will this buyer purchase in the upcoming day, month, or year? The total of poisons is simply another Poisson, therefore no issue.
manages several purchases made by the same client (data that a classifier would have to discard)
manages varying “arrivals” of customers I’m not sure whether you have an issue with this, but what if a user only recently created an account?
I think unsupervised learning or one-class learning might be useful here. For unsupervised learning, you could identify distinguishing features between purchasers and non-purchasers, possibly using clustering or feature importance methods. Alternatively, consider a one-class learning approach, where you train a model to understand the behavior of purchasers and then identify which individuals from the other group exhibit similar behavior.
There might not be a single best solution, so experimenting with classification and other methods discussed here could be necessary. It would be great to hear updates on your progress or any new approaches you discover!
If you had the purchase records for each customer, you might consider using a collaborative filtering approach. This would generate a matrix indicating how likely each customer is to buy the product. By sorting customers based on their likelihood to purchase, you can then target the top N clients for marketing efforts.
You could reframe this problem by predicting the time until the next purchase. In this approach, you treat the most recent periods without a purchase as censored, meaning a purchase might occur in the future but hasn’t been observed yet.