
Computing likelihoods #766

Closed
DanielRobertNicoud opened this issue Dec 12, 2023 · 10 comments · Fixed by #769

Comments

@DanielRobertNicoud

For an application of Bambi that I am working on, I need to do the following.

I have a Bambi model with a parameter $\theta$, a prior $p(\theta)$, and a likelihood $p(y\mid x,\theta)$. After fitting on some training data, I obtain parameter samples $\theta_{1:T}$. Given a new observation $(x^\star,y^\star)$, I need to compute $p(y^\star\mid x^\star, \theta_t)$ for all $t$ (and in fact at multiple candidate values of $y^\star$). This should be as fast as possible, as I need this computation for a rather large number of points.

As far as I can tell, there is no easy way of doing this. What I have resorted to so far is implementing it by hand depending on the model specification, but that lacks flexibility. Is anyone aware of a proper way to accomplish the above using Bambi directly? If not, would it be possible to implement it in the package?

Thanks all in advance!

@zwelitunyiswa

zwelitunyiswa commented Dec 12, 2023 via email

@DanielRobertNicoud
Author

@zwelitunyiswa Unfortunately, as far as I know this doesn't work. I am not looking to sample the posterior predictive at an unseen point $x^\star$. What I want is to compute the likelihood of a given (new) data point under the parameter samples obtained from fitting on a training set.

@DanielRobertNicoud
Author

I think this could be easily (?) implemented using the pymc.compute_log_likelihood function.

@tomicapretto
Collaborator

@DanielRobertNicoud I want to check if I understand your problem. Using a training dataset you get draws from a posterior distribution. Then you have, let's say, a test dataset where you know both the value of the predictors and the response. Using the draws of the posterior you got with the training data, you want to compute the likelihood for the values of the predictors and responses in the test set. Is my understanding correct?

If that's correct, I think it's not extremely simple but it should be possible. I'll wait for your input before trying something.

@DanielRobertNicoud
Author

@tomicapretto Yes, that seems correct. As a simplified example, suppose I am doing a Bayesian linear regression $y = x^T\beta + \epsilon$ where $\epsilon\sim N(0,\sigma^2)$ and $\sigma$ is known. Fitting the model gives me samples $\beta_1,\ldots,\beta_T$. Then, given a new point $(x^\star,y^\star)$, what I want to compute is $\frac{1}{\sigma\sqrt{2\pi}}\exp\left(-\frac{1}{2}\left(\frac{y^\star - (x^\star)^T\beta_t}{\sigma}\right)^2\right)$ for all $t$. Of course, I'd like to do this in such a way that any model and likelihood function is supported.
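For this Gaussian case, the per-draw evaluation can be vectorized directly with NumPy/SciPy. This is only a sketch: the "posterior draws" below are made-up random numbers standing in for real samples from a fitted model.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)
T = 500                                  # number of posterior draws
beta_draws = rng.normal(size=(T, 2))     # stand-in samples for beta_1, ..., beta_T
sigma = 0.1                              # noise scale, known in this example

x_star = np.array([0.5, 1.0])            # new predictor (x1 value, intercept)
y_star = 1.25                            # new response value

mu = beta_draws @ x_star                 # predictive mean under each draw, shape (T,)
# the closed-form Gaussian density from the formula above, for all t at once
lik_manual = np.exp(-0.5 * ((y_star - mu) / sigma) ** 2) / (sigma * np.sqrt(2 * np.pi))
# the same quantity via scipy, as a cross-check
lik_scipy = stats.norm.pdf(y_star, loc=mu, scale=sigma)
```

Evaluating several candidate values of $y^\star$ at once would just mean broadcasting `y_star` to an array against `mu[:, None]`.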

I am trying to implement this with raw pymc, if I manage I can share a minimal working example if that helps!

@DanielRobertNicoud
Author

DanielRobertNicoud commented Dec 13, 2023

@tomicapretto The following works in pure pymc (as you can see, I test against manually computed values).

import pymc
import pandas as pd
import numpy as np
import arviz as az
import scipy.stats as sps

# generate some Gaussian data for training
x_train = pd.DataFrame(np.random.normal(size=(200, 1)), columns=["x1"])
x_train["intercept"] = 1.
y_train = pd.Series(1 + .5 * x_train.x1 + np.random.normal(scale=0.1, size=200), name="y")

# single testing point
x0 = .5
x_test = pd.DataFrame({"x1": [x0], "intercept": [1.]})
y_test = pd.Series(np.zeros(x_test.shape[0]), name="y")

coords = {"coeffs": x_train.columns}

with pymc.Model(coords=coords) as model:
    # data containers
    X = pymc.MutableData("X", x_train)
    y = pymc.MutableData("y", y_train)
    # priors
    b = pymc.Normal("b", mu=0, sigma=1, dims="coeffs")
    sigma = pymc.HalfCauchy("sigma", beta=10)
    # linear model
    mu = pymc.math.dot(X, b)
    # likelihood
    likelihood = pymc.Normal("obs", mu=mu, sigma=sigma, observed=y)
    
    # fit
    idata = pymc.sample(return_inferencedata=True, tune=200, draws=20, chains=1)
    
    # testing
    pymc.set_data({"X": x_test.to_numpy(), "y": y_test.to_numpy()})
    out = pymc.compute_log_likelihood(
        az.InferenceData(posterior=idata.posterior),   # keep only the minimum info needed
        var_names=["obs"]
    )

# compare the manual per-draw computation against pymc's output
print([
    np.log(sps.norm(loc=0., scale=sigma).pdf(intercept + x0 * beta))
    for (beta, intercept), sigma in zip(
        idata.posterior.b.to_numpy().reshape(-1, 2),
        idata.posterior.sigma.to_numpy().flatten()
    )
])
print(out.log_likelihood.obs)

@tomicapretto
Collaborator

Thanks for the detailed example, I can't check it right now but I'll check it soon-ish.

@DanielRobertNicoud
Author

@tomicapretto Hi, did you have time to take a look at this?

@tomicapretto
Collaborator

Hi @DanielRobertNicoud, sorry for the delay! Between the holidays and work, I couldn't find time for Bambi. I've now made a gap and decided to review this issue. It turns out it was not hard to add a compute_log_likelihood method to the Model class. Have a look at #769 and let me know if it works for your use case. You'll need to install Bambi from the branch in that PR.

@tomicapretto
Collaborator

Hi @DanielRobertNicoud did you have time to check it?
