
Eweine/add subset option #27

Merged: 7 commits merged into main on Aug 23, 2024

Conversation

@eweine (Collaborator) commented on Jul 29, 2024

Hey @pcarbo, I'm attempting to build a method here to optimize U and V over only a fraction of the columns of Y (cells), and then to project the remaining cells onto U. I thought I had done this correctly, but for some reason the log-likelihood of the complete model does not match what I would expect. You can see this below in the test script that I am committing.

Could you take a look?

Thanks!

@eweine requested a review from pcarbo on July 29, 2024 at 16:15
@pcarbo (Member) commented on Jul 30, 2024

@eweine Yes, the log-likelihood involves simple sums, so you should be able to compute the log-likelihood separately for each sample and then combine the two; so there is a bug in your code somewhere. Perhaps set const = 0 for now (since this is just a constant and should not change during the optimization) and see if you can isolate the bug. (Or perhaps the bug is in the calculation of this constant?)

Note that in fastTopics I perform the log-likelihood calculations at the row level (a loglik for each row of X, where each row is a sample/cell), so in that case it is trivial to combine the loglik calculations. You could do the same in fastglmpca.
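
A minimal sketch of the decomposition pcarbo describes (illustrative R, not the package code; it assumes the Poisson model used by fastglmpca with rates exp(t(LL) %*% FF), and assumes Y, LL, FF are already in the workspace). Because the log-likelihood is a sum over the entries of Y, the per-column logliks of disjoint column subsets simply add up:

    # Per-column Poisson log-likelihood under rates exp(t(LL) %*% FF).
    per_col_loglik <- function (Y, LL, FF) {
      Lambda <- exp(crossprod(LL, FF))  # n x m matrix of Poisson rates
      colSums(dpois(as.matrix(Y), Lambda, log = TRUE))
    }

    # The loglik over all columns equals the combined logliks of the
    # fitted subset and the projected remainder.
    cols     <- sample(ncol(Y), floor(ncol(Y)/2))
    ll_all   <- sum(per_col_loglik(Y, LL, FF))
    ll_split <- sum(per_col_loglik(Y[, cols, drop = FALSE], LL,
                                   FF[, cols, drop = FALSE])) +
                sum(per_col_loglik(Y[, -cols, drop = FALSE], LL,
                                   FF[, -cols, drop = FALSE]))
    all.equal(ll_all, ll_split)  # should be TRUE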

@eweine (Collaborator, Author) commented on Aug 1, 2024

@pcarbo I fixed the bug. It was not an issue with the loglik calculation, but rather with how I was constructing the final FF matrix. The only remaining issue with the PR is that the following hyperparameters are hard-coded:

      num_iter = 1000,
      line_search = TRUE,
      alpha = .01,
      beta = .25

when projecting the data onto the initial fit. I don't think these will make a huge difference, but I'm not sure whether I should leave them as-is or give the user the option to set them.
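
For readers unfamiliar with these names: alpha and beta look like standard backtracking (Armijo) line-search parameters. Here is a generic sketch of that scheme (an assumption about their roles, not fastglmpca's internals), where alpha sets the sufficient-decrease threshold and beta shrinks the step size:

    # Generic backtracking line search for minimizing f along a descent
    # direction dir, starting from point x with gradient grad at x.
    backtrack <- function (f, x, grad, dir, alpha = 0.01, beta = 0.25) {
      t <- 1
      # Shrink the step until the Armijo sufficient-decrease condition holds.
      while (f(x + t*dir) > f(x) + alpha * t * sum(grad * dir))
        t <- beta * t
      t
    }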

@pcarbo (Member) commented on Aug 1, 2024

Great! It would be nice if these parameters could be adjusted; for example, in some cases you might prefer num_iter to be smaller. That shouldn't be hard to do.

@eweine (Collaborator, Author) commented on Aug 4, 2024

> Great! It would be nice if these parameters could be adjusted; for example, in some cases you might prefer num_iter to be smaller. That shouldn't be hard to do.

OK @pcarbo, I've added an additional option for the number of projection iterations. What do you think?

@pcarbo (Member) commented on Aug 6, 2024

@eweine I'm trying to understand exactly what this feature is doing. Would it be correct to say that the U matrix is estimated using all rows of Y and a subset of the columns of Y? Would that be a more concise (and perhaps clearer) way to describe this option?

Also, when you say it is "faster", could you say precisely how? Does it reduce the complexity of each iteration, or does it speed up convergence, or both?

Regarding the specific interface, I'm wondering if it would be more generally useful if you instead implemented an option "project_cols", which is empty by default. For example, consider the case when the user has a training set and a test set; one could use this "project_cols" option to assess the quality of the fit in the test set.

Also, this makes me think it would be nice to output the per-column (and perhaps per-row) log-likelihood at the end, or implement this calculation in a separate function.

@eweine (Collaborator, Author) commented on Aug 6, 2024

> @eweine I'm trying to understand exactly what this feature is doing. Would it be correct to say that the U matrix is estimated using all rows of Y and a subset of the columns of Y? Would that be a more concise (and perhaps clearer) way to describe this option?

That's a correct description. I think something like this could improve clarity.

> Also, when you say it is "faster", could you say precisely how? Does it reduce the complexity of each iteration, or does it speed up convergence, or both?

I think that's a good idea. Certainly the complexity of each iteration will be reduced: we're operating on a smaller dataset, and we know from the paper that the per-iteration complexity is linear in the dimensions of the input data. I would presume that subsetting also speeds up convergence, since you are optimizing fewer parameters, but it is hard to know for sure.

> Regarding the specific interface, I'm wondering if it would be more generally useful if you instead implemented an option "project_cols", which is empty by default. For example, consider the case when the user has a training set and a test set; one could use this "project_cols" option to assess the quality of the fit in the test set.

That's an interesting idea. From my perspective, the main point of this option is to speed up computation on a very large dataset, and, as currently implemented, the user doesn't have to worry about which columns to select for subsetting. It's hard for me to imagine a scenario where the user is assessing the fit of just U in this way, but I could be wrong.

> Also, this makes me think it would be nice to output the per-column (and perhaps per-row) log-likelihood at the end, or implement this calculation in a separate function.

That could be useful.

One final thought: it occurred to me that some additional input checking might be necessary on the "training set" (i.e., the subset of the columns of the input Y). In particular, if after subsetting one of the rows or columns has all-zero entries, I'm concerned that the optimization could become unstable.
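
One way such a check might look (a sketch only; check_subset is a hypothetical helper, not the code merged in this PR, and Y is assumed to be a matrix of counts):

    # Refuse to fit if the column subset leaves an all-zero row or column,
    # since the corresponding Poisson rates would be driven toward zero.
    check_subset <- function (Y, cols) {
      Ysub <- Y[, cols, drop = FALSE]
      if (any(rowSums(Ysub) == 0) || any(colSums(Ysub) == 0))
        stop("After subsetting, Y has a row or column that is entirely zero; ",
             "try a larger or different subset of columns")
      invisible(Ysub)
    }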

@pcarbo (Member) commented on Aug 6, 2024

> It's hard for me to imagine a scenario where the user is assessing the fit of just U in this way, but I could be wrong.

This could be useful for cross-validation.

This option could be implemented as follows: if it is a single number, it is interpreted as the proportion of samples kept; if it is more than one number, it gives the exact columns to be kept. (This is reminiscent of the first argument to the "sample" function.)
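
As an illustration, the interpretation pcarbo proposes might look like this (resolve_cols, and reading a single number in (0,1) as a proportion, are hypothetical, not the merged interface):

    # Mimics the spirit of the first argument of "sample": a single number
    # in (0,1) is read as the proportion of the m columns to keep; a longer
    # vector gives the exact columns to keep.
    resolve_cols <- function (project_cols, m) {
      if (length(project_cols) == 1 && project_cols > 0 && project_cols < 1)
        sort(sample(m, ceiling(project_cols * m)))
      else
        as.integer(project_cols)
    }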

> It occurred to me that some additional input checking might be necessary on the "training set" (i.e., the subset of the columns of the input Y).

Good point.

@eweine (Collaborator, Author) commented on Aug 22, 2024

@pcarbo I've added an additional check to the code to prevent issues with the optimization that may arise when an entire row or column of the subsetted matrix is 0.

I agree that the feature you're suggesting could be useful for cross-validation, but I think it would be more useful if we also included row- or column-specific log-likelihoods. That's a bigger change, so I'd prefer to leave it to another branch, after we push a new version of the package to CRAN. Does that sound okay?

Let me know if there are any other changes you'd like me to make.

@pcarbo (Member) commented on Aug 23, 2024

Sounds good, Eric.

@pcarbo merged commit 840a016 into main on Aug 23, 2024 (2 of 3 checks passed).