Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

ComponentArray instead of NamedArray for mode estimation results? #2363

Open
ElOceanografo opened this issue Oct 11, 2024 · 6 comments
Open

Comments

@ElOceanografo
Copy link
Contributor

When doing an MLE or MAP optimization, the resulting ModeResult.values is a NamedArray, from NamedArrays.jl. If the model contains any array-valued parameters, this can be a real pain to work with: each individual element is indexed with its own unique symbol, requiring the user to do some annoying data munging to get them all together. For instance, to extract a matrix called x that is supposed to have size 20 x 10:

...
opt = maximum_likelihood(mymodel)
x_opt = reshape(opt.values[first.(split.(string.(names(opt.values)[1]), "[")) .== "x"], 20, 10)
...

Maybe there's a better way to do this that I don't know about, but wouldn't it be simpler to return the results as a ComponentArray, a data structure designed for just this use case? That would let you simply do:

x_opt = opt.values.x

Is there any compelling reason to prefer a NamedArray here?

@mhauru
Copy link
Member

mhauru commented Oct 11, 2024

Thanks for raising this. I agree that the return value of mode estimation could and should be improved. ComponentArrays would be a good candidate. There are some other elements of the return value that we would also like to change, mentioned in #2232.

Putting this on our todo list, though unsure when we'll get to it.

@ElOceanografo
Copy link
Contributor Author

If there's interest in the meantime, I'd be willing to make a PR for this feature. It doesn't look like it requires many code changes and would be a huge increase in usability, at least for me...

@torfjelde
Copy link
Member

The issue with ComponentArrays.jl is that it effectively takes a NamedTuple approach. This has the benefit that it's very fast to do something like x_opt.x (+ you can infer the type it returns, so you can do more interesting things than just returning Vector{Float64} without ruining type inference), but it's really not the ideal thing to do when you have tons of variables. As an example, if I wrote the following model

@model function demo()
    x = Vector(undef, 100)
    for i in eachindex(x)
        x[i] ~ Normal()
    end
end

If you "naively" convert the variables here as they occur in the model, x[1], x[2], etc., you'll end up with a "namedtuple" representation with 100 entries, which is really not great. NamedArrays.jl does however NOT do this, but instead uses a less performant but more "flexible" on the index side since it just uses an array of strings.

In short, they each have respective pros and cons, and we should probably use a "customized" solution for this.

An alternative route @mhauru is to maybe just use VarNamedVector for this.

@ElOceanografo
Copy link
Contributor Author

I'm not sure I quite understand--the whole point would be to get that array x out as an array, not as 100 individual variables named x[1], x[2], etc.

Are you saying that in a model like this,

@model function demo()
    a ~ Normal()
    b ~ Normal()
    ...
    z ~ Normal()
    a1 ~ Normal()
    ...
    z1 ~ Normal()
    ...
    a100 ~ Normal()
    ...
    z100 ~ Normal()
end

where there are many, many individually named variables (2,600 in this case) the performance of a NamedTuple with entries for all of them will suffer?

Ultimately, it doesn't actually matter whether this hypothetical return structure is a ComponentArray or something else, as long as it doesn't require the user to do the kind of annoying and error-prone manual extraction I showed in my first example.

@torfjelde
Copy link
Member

where there are many, many individually named variables (2,600 in this case) the performance of a NamedTuple with entries for all of them will suffer?

Yep.

But a more likely scenario is the model I mentioned above, i.e.

model function demo()
    x = Vector(undef, 100)
    for i in eachindex(x)
        x[i] ~ Normal()
    end
end

The problem is that x[i] is the only thing that is "seen" by Turing.jl. Turing.jl doesn't "care" about the fact that x is a Vector; for example, the following model would result in exactly the same variables x[1], x[2], ... in the chain:

model function demo()
    x = Matrix(undef, 1, 100)
    for i in eachindex(x)
        x[i] ~ Normal()
    end
end

because Matrix also supports linear indexing.

That is,

  1. Turing.jl doesn't know the "original" shape of the variable; Turing.jl only sees the result of the expression on the left-hand side of ~ statement.
  2. Using ComponentArrays.jl wouldn't fix this, as you would still have to figure out the reshaping yourself.

Buuuut we can make this experience quite a bit less painful, which I very much agree with you @ElOceanografo is something we should do. And one way to do this is to use our custom structure VarNamedVector that was recently introduced interally to store the values instead. That should make it less awkward to interact with this thing + cases such as x ~ filldist(Normal(), 2, 2) would result in a structure where you could do opt[@varname(x)] and get back a Matrix:)

@ElOceanografo
Copy link
Contributor Author

Thanks, that example clarifies the issue.

Even getting everything back as a flat vector would be an improvement (and I bet would cover most use cases, vectors are more common in stats models than arrays). If VarNamedVector lets us get arrays back in their original shape, so much the better.

My only other comment is that this interface, whatever it ends up being, should ideally be consistent with the interface for MCMC chains. In fact, a group method for ModeResults is pretty easy:

function Turing.group(opt::Turing.Optimisation.ModeResult, varname)
    basenames = first.(split.(string.(names(opt.values)[1]), "["))
    idx = findall(x -> x == string(varname), basenames)
    return opt.values[idx]
end

I also was using this function in a script recently:

function named2component(v::Turing.NamedArrays.NamedVector)
    v = opt.values
    symbols = names(v)[1]
    strings = first.(split.(string.(symbols), "["))
    uniquestrings = unique(strings)
    nt = NamedTuple([Symbol(s) => vec(v[strings .== s]) for s in uniquestrings])
    return ComponentArray(nt)
end

These will both flatten 2D or higher-dimensional arrays, of course. Still, might be useful to other people reading this...

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants