
Eweine/add subset option #27

Merged: 7 commits merged into main on Aug 23, 2024

Conversation

@eweine (Collaborator) commented on Jul 29, 2024

Hey @pcarbo, I'm attempting to build a method here to optimize U and V over only a fraction of the columns of Y (cells), and then to project the remaining cells onto U. I thought I had done this correctly, but for some reason the log-likelihood of the complete model does not match what I would expect. You can see this below in the test script that I am committing.

Could you take a look?

Thanks!

@eweine requested a review from pcarbo on July 29, 2024 at 16:15
@pcarbo (Member) commented on Jul 30, 2024

@eweine Yes, the log-likelihood involves simple sums, so you should be able to compute the log-likelihood separately for each sample and then combine the two; so there is a bug in your code somewhere. Perhaps set const = 0 for now (since this is just a constant and should not change during the optimization) and see if you can isolate the bug. (Or perhaps the bug is in the calculation of this constant?)

Note that in fastTopics I perform the log-likelihood calculations at the row level (a loglik for each row of X, where each row is a sample/cell), so in that case it is trivial to combine the loglik calculations. You could do the same in fastglmpca.
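
A minimal sketch of the decomposition pcarbo describes (illustrative R, not the package code; it assumes the Poisson model used by fastglmpca with rates exp(t(LL) %*% FF), and assumes Y, LL, FF are already in the workspace). Because the log-likelihood is a sum over the entries of Y, the per-column logliks of disjoint column subsets simply add up:

    # Per-column Poisson log-likelihood under rates exp(t(LL) %*% FF).
    per_col_loglik <- function (Y, LL, FF) {
      Lambda <- exp(crossprod(LL, FF))  # n x m matrix of Poisson rates
      colSums(dpois(as.matrix(Y), Lambda, log = TRUE))
    }

    # The loglik over all columns equals the combined logliks of the
    # fitted subset and the projected remainder.
    cols     <- sample(ncol(Y), floor(ncol(Y)/2))
    ll_all   <- sum(per_col_loglik(Y, LL, FF))
    ll_split <- sum(per_col_loglik(Y[, cols, drop = FALSE], LL,
                                   FF[, cols, drop = FALSE])) +
                sum(per_col_loglik(Y[, -cols, drop = FALSE], LL,
                                   FF[, -cols, drop = FALSE]))
    all.equal(ll_all, ll_split)  # should be TRUE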

@eweine (Collaborator, Author) commented on Aug 1, 2024

@pcarbo I fixed the bug. It was not an issue with the loglik calculation, but rather with how I was constructing the final FF matrix. The only remaining issue with the PR is that the following hyperparameters are hard-coded:

      num_iter = 1000,
      line_search = TRUE,
      alpha = .01,
      beta = .25

when projecting the data onto the initial fit. I don't think these will make a huge difference, but I'm not sure whether I should leave them as-is or give the user the option to set them.
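
For readers unfamiliar with these names: alpha and beta look like standard backtracking (Armijo) line-search parameters. Here is a generic sketch of that scheme (an assumption about their roles, not fastglmpca's internals), where alpha sets the sufficient-decrease threshold and beta shrinks the step size:

    # Generic backtracking line search for minimizing f along a descent
    # direction dir, starting from point x with gradient grad at x.
    backtrack <- function (f, x, grad, dir, alpha = 0.01, beta = 0.25) {
      t <- 1
      # Shrink the step until the Armijo sufficient-decrease condition holds.
      while (f(x + t*dir) > f(x) + alpha * t * sum(grad * dir))
        t <- beta * t
      t
    }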

@pcarbo (Member) commented on Aug 1, 2024

Great! It would be nice if these parameters could be adjusted; for example, in some cases you might prefer num_iter to be smaller. That shouldn't be hard to do.

@eweine (Collaborator, Author) commented on Aug 4, 2024

> Great! It would be nice if these parameters could be adjusted; for example, in some cases you might prefer num_iter to be smaller. That shouldn't be hard to do.

OK @pcarbo, I've added an additional option for the number of projection iterations. What do you think?

@pcarbo (Member) commented on Aug 6, 2024

@eweine I'm trying to understand exactly what this feature is doing. Would it be correct to say that the U matrix is estimated using all rows of Y and a subset of the columns of Y? Would that be a more concise (and perhaps clearer) way to describe this option?

Also, when you say it is "faster", could you say precisely how? Does it reduce the complexity of each iteration, or does it speed up convergence, or both?

Regarding the specific interface, I'm wondering if it would be more generally useful if you instead implemented an option "project_cols", which is empty by default. For example, consider the case when the user has a training set and a test set; one could use this "project_cols" option to assess the quality of the fit in the test set.

Also, this makes me think it would be nice to output the per-column (and perhaps per-row) log-likelihood at the end, or implement this calculation in a separate function.

@eweine (Collaborator, Author) commented on Aug 6, 2024

> @eweine I'm trying to understand exactly what this feature is doing. Would it be correct to say that the U matrix is estimated using all rows of Y and a subset of the columns of Y? Would that be a more concise (and perhaps clearer) way to describe this option?

That's a correct description. I think something like this could improve clarity.

> Also, when you say it is "faster", could you say precisely how? Does it reduce the complexity of each iteration, or does it speed up convergence, or both?

I think that's a good idea. Certainly the complexity of each iteration will be reduced: we're operating on a smaller dataset, and we know from the paper that the per-iteration complexity is linear in the dimensions of the input data. I would presume that subsetting also speeds up convergence, since you are optimizing fewer parameters, but it is hard to know for sure.

> Regarding the specific interface, I'm wondering if it would be more generally useful if you instead implemented an option "project_cols", which is empty by default. For example, consider the case when the user has a training set and a test set; one could use this "project_cols" option to assess the quality of the fit in the test set.

That's an interesting idea. From my perspective, the main point of this option is to speed up computation on a very large dataset, and, as currently implemented, the user doesn't have to worry about which columns to select for subsetting. It's hard for me to imagine a scenario where the user is assessing the fit of just U in this way, but I could be wrong.

> Also, this makes me think it would be nice to output the per-column (and perhaps per-row) log-likelihood at the end, or implement this calculation in a separate function.

That could be useful.

One final thought: it occurred to me that some additional input checking might be necessary on the "training set" (i.e., the subset of the columns of the input Y). In particular, if after subsetting one of the rows or columns has all-zero entries, I'm concerned that the optimization could become unstable.
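
One way such a check might look (a sketch only; check_subset is a hypothetical helper, not the code merged in this PR, and Y is assumed to be a matrix of counts):

    # Refuse to fit if the column subset leaves an all-zero row or column,
    # since the corresponding Poisson rates would be driven toward zero.
    check_subset <- function (Y, cols) {
      Ysub <- Y[, cols, drop = FALSE]
      if (any(rowSums(Ysub) == 0) || any(colSums(Ysub) == 0))
        stop("After subsetting, Y has a row or column that is entirely zero; ",
             "try a larger or different subset of columns")
      invisible(Ysub)
    }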

@pcarbo (Member) commented on Aug 6, 2024

> It's hard for me to imagine a scenario where the user is assessing the fit of just U in this way, but I could be wrong.

This could be useful for cross-validation.

This option could be implemented as follows: if it is a single number, it is interpreted as the proportion of samples kept; if it is more than one number, it gives the exact columns to be kept. (This is reminiscent of the first argument to the "sample" function.)
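
As an illustration, the interpretation pcarbo proposes might look like this (resolve_cols, and reading a single number in (0,1) as a proportion, are hypothetical, not the merged interface):

    # Mimics the spirit of the first argument of "sample": a single number
    # in (0,1) is read as the proportion of the m columns to keep; a longer
    # vector gives the exact columns to keep.
    resolve_cols <- function (project_cols, m) {
      if (length(project_cols) == 1 && project_cols > 0 && project_cols < 1)
        sort(sample(m, ceiling(project_cols * m)))
      else
        as.integer(project_cols)
    }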

> It occurred to me that some additional input checking might be necessary on the "training set" (i.e., the subset of the columns of the input Y).

Good point.

@eweine (Collaborator, Author) commented on Aug 22, 2024

@pcarbo I've added an additional check to the code to prevent issues with the optimization that may arise when an entire row or column of the subsetted matrix is 0.

I agree that the feature you're suggesting could be useful for cross-validation, but I think it would be more useful if we also included row- or column-specific log-likelihoods. That's a bigger change, so I'd prefer to leave it to another branch, after we push a new version of the package to CRAN. Does that sound okay?

Let me know if there are any other changes you'd like me to make.

@pcarbo (Member) commented on Aug 23, 2024

Sounds good, Eric.

@pcarbo merged commit 840a016 into main on Aug 23, 2024 (2 of 3 checks passed).