Eweine/add subset option #27
Conversation
@eweine Yes, the log-likelihood involves simple sums, so you should be able to compute the log-likelihood separately for each subset of the samples and then combine the two. So there is a bug in your code somewhere.
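To illustrate the point about simple sums: because the log-likelihood is a sum over the entries of Y, splitting the columns into two groups and adding the two partial log-likelihoods must reproduce the total exactly. A minimal numpy sketch, assuming a Poisson model with rates exp(UVᵀ) as in GLM-PCA (all names here are illustrative, not the package's code):

```python
import math
import numpy as np

rng = np.random.default_rng(1)
n, p, k = 5, 8, 2
U = rng.normal(size=(n, k))
V = rng.normal(size=(p, k))
Lam = np.exp(U @ V.T)          # Poisson rates under the model
Y = rng.poisson(Lam)

def loglik(Y, Lam):
    # Poisson log-likelihood summed over all entries of Y.
    lgamma = np.vectorize(math.lgamma)
    return float(np.sum(Y * np.log(Lam) - Lam - lgamma(Y + 1.0)))

cols  = np.arange(p)
train = cols[:5]               # columns used for fitting
held  = cols[5:]               # columns that were projected
total = loglik(Y, Lam)
split = loglik(Y[:, train], Lam[:, train]) + loglik(Y[:, held], Lam[:, held])
assert np.isclose(total, split)
```

If the two quantities disagree, the discrepancy is in how the pieces are assembled (e.g. the final FF matrix), not in the log-likelihood formula itself.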
@pcarbo I fixed the bug. It was not an issue with the loglik calculation, but rather with how I was constructing the final FF matrix. The only current issue with the PR is that the following hyperparameters are hardcoded:
when projecting the data onto the initial fit. I don't think that these will make a huge difference, but I'm not sure if I should leave them or give the user the option to set them.
Great! It would be nice if these parameters could be adjusted; for example, in some cases you might prefer …
OK @pcarbo, I've added an additional option for the number of projection iterations. What do you think?
@eweine I'm trying to understand exactly what this feature is doing. Would it be correct to say that the U matrix is estimated using all rows of Y and a subset of the columns of Y? Would that be a more concise (and perhaps clearer) way to describe this option?

Also, when you say it is "faster", could you say precisely how? Does it reduce the complexity of each iteration, or does it speed up convergence, or both?

Regarding the specific interface, I'm wondering if it would be more generally useful if you instead implemented an option "project_cols", which is empty by default. For example, consider the case when the user has a training set and a test set; one could use this "project_cols" option to assess the quality of the fit in the test set. Also, this makes me think it would be nice to output at the end the per-column (and perhaps per-row) log-likelihood, or implement this calculation in a separate function.
That's a correct description. I think something like this could improve clarity.
I think that's a good idea. Certainly the complexity of each iteration will be reduced (we're operating on a smaller dataset, and we know from the paper that the complexity of each iteration is linear in the dimension of the input data). I would presume that subsetting would also increase the speed of convergence, because you are optimizing fewer parameters, but it is hard to know for sure.
That's an interesting idea. From my perspective the main point of this option is to increase the speed of computations on a very large dataset. And, as currently implemented, the user doesn't have to worry about which columns to select for subsetting. It's hard for me to imagine a scenario where the user is assessing the fit of just U in this way, but I could be wrong.
That could be useful. One final thought: it occurred to me that some additional input checking might be necessary on the "training set" (i.e., the subset of the columns of the input Y). In particular, if after subsetting one of the rows or columns has all 0 entries, I'm concerned that the optimization could become unstable.
This could be useful for cross-validation. This option could be implemented as follows: if it is a single number, it is interpreted as the proportion of samples kept; if it is more than one number, it gives the exact columns to be kept. (This is reminiscent of the first argument to the "sample" function.)
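To make the proposed interface concrete, here is a small sketch (in Python for illustration; `resolve_cols` is a hypothetical helper, not part of the package) of the sample-like convention: a single number is read as a proportion, while a vector of indices names the exact columns to keep:

```python
import numpy as np

def resolve_cols(project_cols, p, rng=None):
    # Hypothetical helper mimicking the proposed interface:
    #   - a single number  -> proportion of the p columns kept (chosen at random)
    #   - length > 1       -> the exact column indices to keep
    project_cols = np.atleast_1d(project_cols)
    if project_cols.size == 1:
        rng = rng or np.random.default_rng()
        n_keep = int(round(float(project_cols[0]) * p))
        return np.sort(rng.choice(p, size=n_keep, replace=False))
    return project_cols.astype(int)

print(len(resolve_cols(0.5, 10)))    # keep 50% of 10 columns -> 5 indices
print(resolve_cols([2, 4, 7], 10))   # keep exactly columns 2, 4, 7
```

One wrinkle of this convention (as with R's `sample`) is that a length-one vector of indices is ambiguous with a proportion, so the documentation would need to spell out which interpretation wins.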
Good point.
@pcarbo I've added an additional check in the code to prevent issues with the optimization that may arise when an entire row or column is 0 in the subset matrix. I agree that the feature you're suggesting could be useful for cross-validation, but I think it would be more useful if we also included row- or column-specific log-likelihoods. That's a bigger change, so I'd prefer to leave it to another branch / after we push a new version of the package to CRAN. Does that sound okay? Let me know if there are any other changes you'd like me to make.
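For reference, the spirit of that check can be sketched as follows (hypothetical Python; the actual check in the PR is in R). A row or column of the training subset that is entirely zero has no data to pin down the corresponding loadings or factors, which is what can destabilize the optimization:

```python
import numpy as np

def check_subset(Y_sub):
    # Hypothetical input check: reject a training subset in which any
    # row or column of Y is entirely zero, since the corresponding
    # parameters would be unidentifiable.
    if np.any(Y_sub.sum(axis=1) == 0):
        raise ValueError("after subsetting, some rows of Y are all zero")
    if np.any(Y_sub.sum(axis=0) == 0):
        raise ValueError("after subsetting, some columns of Y are all zero")

Y = np.array([[1, 0, 2],
              [0, 0, 3]])
check_subset(Y[:, [0, 2]])   # fine: every row and column has a nonzero entry
try:
    check_subset(Y[:, [1]])  # column 1 of Y is all zero -> error
except ValueError as e:
    print(e)
```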
Sounds good, Eric. |
Hey @pcarbo I'm attempting to build a method here to optimize U and V over only a fraction of columns of Y (cells), and then to project the remaining cells onto U. I thought I did this correctly, but for some reason the log-likelihood of the complete model is not matching what I would expect. You can see this below in a test script that I am committing.
Could you take a look?
Thanks!
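To describe the projection step concretely: with U fixed, the factor for each held-out column is just a Poisson GLM coefficient vector, so the remaining cells can be projected column by column. A hypothetical numpy/Newton sketch of that idea (an illustration of the projection concept, not the package's implementation; all names are invented):

```python
import numpy as np

def project_col(y, U, n_iter=50):
    # With U fixed, estimate the factor v for one held-out column y by
    # maximizing the Poisson log-likelihood
    #   sum_i [ y_i * (U v)_i - exp((U v)_i) ]
    # via Newton's method (Fisher scoring).
    v = np.zeros(U.shape[1])
    for _ in range(n_iter):
        eta  = U @ v
        mu   = np.exp(eta)
        grad = U.T @ (y - mu)               # score
        hess = U.T @ (U * mu[:, None])      # Fisher information
        v = v + np.linalg.solve(hess, grad)
    return v

rng = np.random.default_rng(0)
n, k = 200, 2
U = rng.normal(scale=0.5, size=(n, k))
v_true = np.array([0.8, -0.4])
y = rng.poisson(np.exp(U @ v_true))
v_hat = project_col(y, U)
print(np.round(v_hat, 2))   # should be close to v_true = [0.8, -0.4]
```

If the per-column projections look right but the full-model log-likelihood is still off, the mismatch is likely in how the projected factors are reassembled with the fitted ones, rather than in the projection itself.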