
Fix nan gradients in analytical likelihood #468

Closed

Conversation

digicosmos86 (Collaborator)

No description provided.

@digicosmos86 linked an issue on Jun 20, 2024 that may be closed by this pull request.
@AlexanderFengler (Collaborator) left a comment:

Looks good; this is mostly about iterating conceptually, not about code quality.

src/hssm/likelihoods/analytical.py (two resolved review threads, collapsed)
LOGP_LB,
tt = negative_rt * epsilon + (1 - negative_rt) * rt

p = pt.maximum(ftt01w(tt, a, z_flipped, err, k_terms), pt.exp(LOGP_LB))
Collaborator:

Quick note: it seems like we are only passing k_terms here, not actually computing it.
I think we had agreed to do that way back, on another iteration of trying to fix issues with this likelihood, and I think it's fine, but in this case we should make the default a bit higher than 7.

Collaborator (Author):

Just playing around here; not actually changing anything.

- (v_flipped**2 * rt / 2.0)
- 2.0 * pt.log(a),
- (v_flipped**2 * tt / 2.0)
- 2.0 * pt.log(pt.maximum(epsilon, a))
Collaborator:

Reflecting on this a bit, I think this maximum business is actually corrupting the gradients, so we should just restrict a > epsilon a priori (via the prior, essentially?).
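A minimal sketch of that idea, i.e. keeping a strictly above a small epsilon through the prior rather than clamping inside the likelihood. This uses plain PyMC with illustrative values and is not HSSM's actual configuration API:

```python
import pymc as pm

A_EPSILON = 1e-3  # illustrative lower bound, not a value from the PR

with pm.Model():
    # Boundary separation `a` stays strictly above epsilon by construction,
    # so the likelihood no longer needs pt.maximum(epsilon, a) and its gradient.
    a = pm.TruncatedNormal("a", mu=1.5, sigma=0.5, lower=A_EPSILON)
```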

Collaborator:

On the other hand, apart from initialization (which (1) our strategies should already avoid and (2) we can generally influence), a should basically never come close to 0, so this should basically never be the culprit...

Collaborator (Author):

But this did help a bit, for some reason...

- (v_flipped**2 * rt / 2.0)
- 2.0 * pt.log(a),
- (v_flipped**2 * tt / 2.0)
- 2.0 * pt.log(pt.maximum(epsilon, a))
)

checked_logp = check_parameters(logp, a >= 0, msg="a >= 0")
Collaborator:

In the spirit of the above, this check could be a > 0, but honestly we shouldn't ever really get there.

Collaborator (Author):

Same as above

@@ -220,7 +199,7 @@ def logp_ddm(
z: float,
t: float,
err: float = 1e-15,
k_terms: int = 20,
k_terms: int = 7,
epsilon: float = 1e-15,
Collaborator:

I don't know what was used for testing or what is used as the actual value for inference, but I guess it is this default?

The epsilon for the rt part should rather be on the order of 1e-3, or even 1e-2.

If we are reusing the same epsilon in multiple places, we should probably separate it out.

Collaborator (Author):

Was playing around. It seems that changing k_terms to 7 did not improve speed or computation

@@ -262,15 +241,17 @@ def logp_ddm(
z_flipped = pt.switch(flip, 1 - z, z) # transform z if x is upper-bound response
rt = rt - t

p = pt.maximum(ftt01w(rt, a, z_flipped, err, k_terms), pt.exp(LOGP_LB))
negative_rt = rt <= epsilon
Collaborator:

OK, reflecting on this a bit, the logic that we want should probably look something like:

  • flag all rts lower than epsilon
  • go through with ftt01w
  • then set all flagged rts to LOGP_LB

This should actually cut the gradient for problematic rts (see the sketch below).
Potentially we put this as a logp_ddm_2 and compare results / gradients.
Alternatively, if any rt breaches epsilon, directly send logp to -infty (this is probably not preferable).
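A minimal PyTensor sketch of the flag-then-overwrite logic proposed above. The threshold and LOGP_LB values are illustrative, and this is not the code that ended up in the PR:

```python
import pytensor.tensor as pt

LOGP_LB = -66.1   # illustrative lower bound on the log-density (the module defines its own)
EPSILON = 1e-3    # illustrative threshold for "too small" rts

def mask_small_rts(rt, logp):
    """Overwrite the log-density of flagged rts after the full computation."""
    too_small = pt.lt(rt, EPSILON)       # 1. flag all rts lower than epsilon
    # 2. `logp` is assumed to be the result of running ftt01w and the rest of
    #    the density on all rts as usual;
    # 3. flagged entries are then replaced, with the intent of cutting their
    #    gradient contribution.
    return pt.where(too_small, LOGP_LB, logp)
```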

Collaborator (Author):

We were doing this. I think the problem is that the gradient is computed anyway, and the over/underflow was still happening.

@AlexanderFengler (Collaborator):

@digicosmos86 is this stale for now?

@digicosmos86 (Collaborator, Author):

> @digicosmos86 is this stale for now?

There doesn't seem to be a solution for really small RTs in the denominator, which can blow up.

@frankmj (Collaborator) commented Jul 9, 2024 via email.

@digicosmos86 (Collaborator, Author):

@frankmj I ran a few more tests, and the RT-hack did do the trick. It might be hard for us to implement this trick in our code, though, mostly because people use arviz functions instead of the convenience functions we provide, which would give us some control over the output. We could note this trick somewhere in our documentation so that users can implement it themselves and have full control.

@frankmj (Collaborator) commented Jul 9, 2024 via email.

@digicosmos86 (Collaborator, Author):

@frankmj That's a great idea! I also noticed that the RT-hack only worked when float64 was used, which points to some other numerical-stability issues we might have. I'll look deeper into this.

@cpaniaguam (Collaborator) left a comment:

A few things here I might end up picking up myself.

src/hssm/likelihoods/analytical.py (outdated review thread, resolved and collapsed)
Comment on lines +36 to +38
_a = 2 * pt.sqrt(2 * np.pi * rt) * err < 1
_b = 2 + pt.sqrt(-2 * rt * pt.log(2 * pt.sqrt(2 * np.pi * rt) * err))
_c = pt.sqrt(rt) + 1
Collaborator:

The fundamental operation is pt.sqrt(rt). It's better to do this first and reuse the result to avoid computing it again.

Collaborator:

For numerical stability, it's better to group the constant factor C = 2 * pt.sqrt(2 * np.pi) * err and compare each member of sqrt_rt = pt.sqrt(rt) against 1/C.
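A sketch combining this and the previous suggestion (compute pt.sqrt(rt) once, group the constant factor). The names sqrt_rt, C, mask, and ks_terms are ours, and this is not the code as merged:

```python
import numpy as np
import pytensor.tensor as pt

def ks_terms(rt, err):
    sqrt_rt = pt.sqrt(rt)                  # computed once and reused below
    C = 2 * np.sqrt(2 * np.pi) * err       # constant factor, grouped outside the graph
    mask = sqrt_rt < 1 / C                 # same condition as 2*sqrt(2*pi*rt)*err < 1
    ks_bound = 2 + pt.sqrt(-2 * rt * pt.log(C * sqrt_rt))
    ks_floor = sqrt_rt + 1
    # elementwise maximum replaces pt.max(pt.stack([...]), axis=0)
    return mask, pt.maximum(ks_bound, ks_floor)
```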

Collaborator (Author):

Sure! Feel free to change this

ks = 2 + pt.sqrt(-2 * rt * pt.log(2 * np.sqrt(2 * np.pi * rt) * err))
ks = pt.max(pt.stack([ks, pt.sqrt(rt) + 1]), axis=0)
ks = pt.switch(2 * pt.sqrt(2 * np.pi * rt) * err < 1, ks, 2)
_a = 2 * pt.sqrt(2 * np.pi * rt) * err < 1
Collaborator:

What would a better name for this boolean array be, maybe mask or sieve?

Collaborator:

Should pt.lt be used here as done elsewhere in this PR?

Collaborator (Author):

It's actually equivalent but I was just playing around

_b = 2 + pt.sqrt(-2 * rt * pt.log(2 * pt.sqrt(2 * np.pi * rt) * err))
_c = pt.sqrt(rt) + 1
_d = pt.max(pt.stack([_b, _c]), axis=0)
ks = _a * _d + (1 - _a) * 2
Collaborator:

Because _a is boolean, I think it's better to treat it as such and use pt.switch.

Suggested change
ks = _a * _d + (1 - _a) * 2
ks = pt.switch(mask, _d, 2) # having renamed `_a` to `mask`, for example

Collaborator (Author):

Please see comment below

_b = 1.0 / (np.pi * pt.sqrt(rt))
_c = pt.sqrt(-2 * pt.log(np.pi * rt * err) / (np.pi**2 * rt))
_d = pt.max(pt.stack([_b, _c]), axis=0)
kl = _a * _b + (1 - _a) * _b
Collaborator:

_c and _d are not used. Should _d be used in the second term instead of _b? Otherwise kl will be _b.

Suggested change
kl = _a * _b + (1 - _a) * _b
kl = pt.switch(mask, _b, _d)

Collaborator (Author):

Please see comment below

src/hssm/likelihoods/analytical.py (three more outdated review threads, resolved and collapsed)
logp = pt.where(
rt <= epsilon,
LOGP_LB,
tt = negative_rt * epsilon + (1 - negative_rt) * rt
Collaborator:

Suggested change
tt = negative_rt * epsilon + (1 - negative_rt) * rt
tt = pt.switch(negative_rt, epsilon, rt)

Collaborator (Author):

This is actually done on purpose; pt.switch can cause some weird errors.
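For context, a small sketch of the two forms being compared here. Both produce the same values, but the arithmetic blend avoids emitting a switch Op, which is what the author later reports causing float64-only switch_sink errors. The variable setup is illustrative:

```python
import pytensor.tensor as pt

rt = pt.vector("rt")
epsilon = 1e-3  # illustrative threshold

negative_rt = pt.le(rt, epsilon)                            # boolean mask, as in the diff above
tt_arith = negative_rt * epsilon + (1 - negative_rt) * rt   # arithmetic blend kept in the PR
tt_switch = pt.switch(negative_rt, epsilon, rt)             # same values, but emits a switch Op
```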

Comment on lines 324 to +331
+ (
(a * z_flipped * sv) ** 2
- 2 * a * v_flipped * z_flipped
- (v_flipped**2) * rt
- (v_flipped**2) * tt
)
/ (2 * (sv**2) * rt + 2)
- 0.5 * pt.log(sv**2 * rt + 1)
- 2 * pt.log(a),
/ (2 * (sv**2) * tt + 2)
- 0.5 * pt.log(sv**2 * tt + 1)
- 2 * pt.log(pt.maximum(epsilon, a)),
Collaborator:

Evaluate this separately, providing a meaningful name.

Collaborator (Author):

We are probably not going to keep this one. I just tried it to see whether keeping the argument of the log positive gets us anywhere. It seems to help a bit, but the culprit is not this one.

@digicosmos86 (Collaborator, Author):

@cpaniaguam Thanks for the suggestions! I committed all of them except those involving pt.switch. I thought pt.switch was a more readable alternative, but it caused some switch_sink errors that show up only when float64 is used. Actually, removing switch Ops allowed me to sample with float64 without errors.

Please feel free to take this further. This PR wasn't final; it was just a placeholder for some of my experiments.

@AlexanderFengler (Collaborator):

@digicosmos86 let's use this PR to switch to float64 overall?

Also, the latest state of affairs with all the changes in this PR is that it's still breaking, right?

@digicosmos86 (Collaborator, Author):

You are correct; it is still broken. This PR is kind of my mess, though. I'd rather start a new one and just switch out all the switch Ops, which should get us over the float64 issue.

@AlexanderFengler (Collaborator):

@digicosmos86 I am good with that approach.

@digicosmos86 (Collaborator, Author):

Since this is still in the works, I am going to convert it to a draft PR

@digicosmos86 marked this pull request as draft on August 21, 2024, 14:30.
@AlexanderFengler (Collaborator):

@digicosmos86 to be closed now that the other PR is up?

Successfully merging this pull request may close these issues.

nan grads when running find_MAP() on analytic, ddm