Adding new users and items to an existing model or removing old ones #28
Comments
Hi Sebastian - sorry for the delayed response on my end! I think this idea sounds like a great addition to the library! The first part of this idea, expanding a model to new users, is something I made some rough pseudocode for in the blog post here, but I think it sounds like a great idea to formalize that into the collie/collie/model/cold_start_matrix_factorization.py Lines 53 to 54 in 0d62ae1
For the removing users idea - that seems like it should be pretty easy to incorporate into the model. I think it makes sense to just zero-out those embeddings completely for either the user(s) or item(s) to remove, unless we find a clever and efficient way to do the re-indexing for both the If you're up for it, it'd be fantastic if you could contribute this into Collie. If not, we can keep this issue open and it'll be something I try to work on when I get some free time. Cheers! |
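As a rough illustration of the "zero-out" idea in the comment above, here is a minimal sketch that uses plain Python lists in place of a real embedding table. The function name `zero_out_rows` and the toy data are hypothetical and are not part of Collie's API; with an actual model the same operation would be applied to the embedding weights, and any per-user or per-item bias terms would presumably need the same treatment.

```python
from typing import Iterable, List


def zero_out_rows(
    embeddings: List[List[float]], ids_to_remove: Iterable[int]
) -> List[List[float]]:
    """Return a copy of ``embeddings`` with the given rows replaced by zeros.

    Row positions are left untouched, so no re-indexing of the remaining
    users or items is required.
    """
    remove = set(ids_to_remove)
    return [
        [0.0] * len(row) if idx in remove else list(row)
        for idx, row in enumerate(embeddings)
    ]


# Toy user-embedding table: three users, two latent factors each.
user_embeddings = [[0.1, 0.2], [0.3, 0.4], [0.5, 0.6]]

# "Remove" user 1 while keeping users 0 and 2 at their original indices.
cleaned = zero_out_rows(user_embeddings, ids_to_remove=[1])
```

Because the removed row is all zeros, its dot product with every item embedding is zero, so the removed user no longer carries any learned preference, and all other IDs keep their meaning.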
First of all, sorry for my really late response to this; the GitHub mail notification got lost among my other mails. About the GDPR concern, I think (but I'm not a lawyer) that item embeddings that were affected by the deleted users don't represent the user. Even if a product was affected by only one user, making them very close together, and in a very exceptional case giving them the same representation in the embedding space so that their cosine similarity is exactly 1, once you delete the user embedding and its interactions, there is no way to infer that the product representation was influenced by the deleted user. |
I think that makes a lot of sense, but it's also worth doing some more extensive analysis when I have time. I think having this functionality in the library makes a lot of sense and could be really helpful! |
@nathancooperjones I don't have as much time as I used to but this looks worth tackling. Having looked at the ColdStart model briefly, it appears that the solution for adding a new user or a new item to the BasePipeline is to use cosine similarity to find items/users that are similar in the metadata space and get a rec list based on some combination of those known items/users, is that correct? Looking around statsexchange I found a couple of other possibilities that may not be implementable given Collie's structure/setup, but figured I'd throw them in here as well to get your thoughts, in case they may also be options: |
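The neighbor lookup described in the comment above could be sketched roughly as follows. This is only an illustration under stated assumptions: metadata is assumed to be already encoded as numeric feature vectors, and the names `cosine_similarity`, `most_similar`, and `known_users` are hypothetical, not taken from Collie.

```python
import math
from typing import Dict, List, Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def most_similar(
    new_metadata: Sequence[float],
    known_metadata: Dict[int, Sequence[float]],
    k: int = 2,
) -> List[int]:
    """Return the IDs of the ``k`` known users closest to ``new_metadata``."""
    ranked = sorted(
        known_metadata,
        key=lambda uid: cosine_similarity(new_metadata, known_metadata[uid]),
        reverse=True,
    )
    return ranked[:k]


# Hypothetical metadata vectors for three users already in the model.
known_users = {
    0: [1.0, 0.0, 0.0],
    1: [0.9, 0.1, 0.0],
    2: [0.0, 0.0, 1.0],
}

# A new user whose metadata points in the same direction as user 0.
neighbors = most_similar([1.0, 0.0, 0.0], known_users, k=2)
```

The returned neighbor IDs can then be combined in whatever way the pipeline prefers, for example by merging their recommendation lists.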
@ahuds001 if anyone can take on this work, it is you - thanks for volunteering to look into this!!! 🎉 At my previous company, we used Collie to generate recommendations. The way we generated cold start recommendations was similar to how you mentioned above, but with a slight twist. We trained a normal Collie model with known users and items - when we had a new user/item come in, we looked at its metadata and found the Say we have a new user and, through some heuristics (e.g. for MovieLens, this could be similar location, similar age, similar favorite movie genre, etc.), we determine that the new user is similar to User A, B, and D in our existing model. If User A had a user embedding of If you don't have additional metadata available to determine similar users/items, I think the technique I used in my initial blog post here seems similar to what I think those StackOverflow posts described - for a new user, optimize the model on a single row keeping the item embeddings fixed. I think this is a good way to add new users/items in the model without requiring a full retrain or additional metadata involved. Giant information dump here (sorry) - what do you think about these ideas? |
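The last idea in the comment above (optimizing a single new user row while keeping the item embeddings fixed) might look roughly like the sketch below. It does not reproduce the blog post's pseudocode and is not Collie's API: the squared-error loss, the plain gradient-descent loop, and the name `fit_new_user` are all assumptions made for illustration, and a real model would use its own loss and optimizer.

```python
from typing import Dict, List, Sequence


def fit_new_user(
    item_embeddings: Sequence[Sequence[float]],
    interactions: Dict[int, float],
    lr: float = 0.1,
    epochs: int = 500,
    l2: float = 0.0,
) -> List[float]:
    """Learn one user vector by gradient descent; item vectors stay fixed.

    Minimizes 0.5 * sum((u . v_i - target_i)^2) + 0.5 * l2 * |u|^2 over the
    new user's observed interactions only.
    """
    dim = len(item_embeddings[0])
    user = [0.0] * dim
    for _ in range(epochs):
        # Start from the gradient of the L2 penalty, then add the data terms.
        grad = [l2 * u for u in user]
        for item_id, target in interactions.items():
            item = item_embeddings[item_id]
            error = sum(u * v for u, v in zip(user, item)) - target
            for d in range(dim):
                grad[d] += error * item[d]
        user = [u - lr * g for u, g in zip(user, grad)]
    return user


# Frozen item embeddings from an already-trained model (toy values).
item_embeddings = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]

# The new user liked item 0 (target 1.0) and did not like item 1 (target 0.0).
new_user = fit_new_user(item_embeddings, {0: 1.0, 1: 0.0})

# Score an unseen item with the fitted vector.
score_item_2 = sum(u * v for u, v in zip(new_user, item_embeddings[2]))
```

Only the new row is updated, so existing users and items are unaffected and no full retrain is needed.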
I'm implementing the same approach explained by @nathancooperjones for both cold/warm start users and items, and so far it is one of the best approaches I have found to bootstrap users and embeddings. If you have access to the first searches made by the cold users, you can get the user embeddings that are nearest to the search results with which the new user has interacted, average them, and have a new embedding for this new user, which you can then use for the next round of partial training of your model, without needing to randomly initialize this new "warm" user. If using matrix factorization, averaging either the item embeddings or the user embeddings should lead to fairly good results, since they are in the same vector space and both groups of embeddings should be "near" each other (I haven't done any experiments to prove this, but it seems like a fun test to try). |
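A minimal sketch of the averaging step described above, here using the item-embedding variant: the new user's starting vector is the mean of the embeddings of the items they first interacted with. The helper `average_embeddings` and the toy IDs and values are hypothetical; the same function would work on a list of neighboring user embeddings instead.

```python
from typing import List, Sequence


def average_embeddings(vectors: Sequence[Sequence[float]]) -> List[float]:
    """Element-wise mean of a non-empty collection of equal-length vectors."""
    if not vectors:
        raise ValueError("need at least one embedding to average")
    dim = len(vectors[0])
    return [sum(vec[d] for vec in vectors) / len(vectors) for d in range(dim)]


# Toy item embeddings keyed by item ID.
item_embeddings = {10: [0.5, 1.0], 11: [1.5, 0.0], 12: [4.0, 4.0]}

# Items the new user interacted with in their first searches.
clicked_items = [10, 11]

# Starting embedding for the "warm" user, instead of a random initialization.
warm_user_embedding = average_embeddings(
    [item_embeddings[i] for i in clicked_items]
)
```

The resulting vector can be used as the initial value for the new user's row before the next round of partial training.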
I guess I can't share exact numbers or details on this experiment, but at my previous company, both offline metrics and a live A/B test showed that this method for handling cold start users/items led to significantly improved results! |
So at a high level this all makes sense, I just need to get my hands dirty. It sounds like we would need different solutions depending on the Collie model type. Would we still want this to live in the BasePipeline, or should it live independently within each model type, since the data available would be different? It sounds like the latter would make more sense and there would be 2 versions, one for the Hybrid models and one for the basic MatrixFactorization model. Also, I am not even considering the removal of users/items just yet; that should probably be a different PR? |
I'm not sure how a cold start solution could work without the additional metadata (given in What are you thinking for this? |
Hmm... I'm missing something, likely just lack of familiarity. Let me start tinkering and I'll come back with some questions in a week or two. |