Self attention for pooling linear classifier #28

Open

aayux wants to merge 3 commits into master
Conversation

@aayux (Contributor) commented Jan 6, 2019

This PR will introduce a `BiAttentionPoolingClassifier` (self-attention for the pooling linear classifier) as in [Attention Is All You Need](https://arxiv.org/abs/1706.03762), following the discussion with @sebastianruder in Teams.

I ran out of memory on my 1060 while testing the attention module, but I was at least able to verify that it is functionally correct. Some changes might be required to ensure that the tensor passed to `self.layers` has the right shape (I'm not quite sure as of now).

I'll move everything to Colab for testing and see if that helps.
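For context, a minimal sketch of the idea: multi-head self-attention (via `torch.nn.MultiheadAttention`) over the encoder outputs, followed by the usual concat pooling and a linear head. The class and argument names below are illustrative, not the exact code in this PR.

```python
import torch
import torch.nn as nn

class BiAttentionPoolingClassifier(nn.Module):
    def __init__(self, embed_dim: int, n_heads: int, n_classes: int):
        super().__init__()
        # Multi-head self-attention as in "Attention Is All You Need"
        self.attn = nn.MultiheadAttention(embed_dim, n_heads)
        # Concat pooling: last state + max pool + mean pool -> 3 * embed_dim
        self.layers = nn.Sequential(
            nn.BatchNorm1d(3 * embed_dim),
            nn.Dropout(0.1),
            nn.Linear(3 * embed_dim, n_classes),
        )

    def forward(self, outputs: torch.Tensor) -> torch.Tensor:
        # outputs: (seq_len, batch, embed_dim) from the RNN encoder
        attn_out, _ = self.attn(outputs, outputs, outputs)  # key = value = query
        last = attn_out[-1]                 # (batch, embed_dim)
        max_pool = attn_out.max(dim=0)[0]   # (batch, embed_dim)
        avg_pool = attn_out.mean(dim=0)     # (batch, embed_dim)
        x = torch.cat([last, max_pool, avg_pool], dim=1)
        return self.layers(x)
```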
@sebastianruder (Collaborator) left a comment

Thanks for this! Looks good to me. Didn't expect that you'd go with multi-head attention right away (was thinking of regular attention), but that should be fine.

@aayux (Contributor, Author) commented Jan 14, 2019

The OOM issue persists even on Colab with 11GB of GPU memory.

RuntimeError: CUDA out of memory. Tried to allocate 8.41 GiB (GPU 0; 11.17 GiB total capacity; 10.26 GiB already allocated; 518.56 MiB free; 80.50 MiB cached)

It appears that I have run into a memory leak.

@tpietruszka (Contributor) commented

I am beginning to implement various attention options on top of ulmfit, so naturally I've looked at this code. I don't really understand how attention is used here.

  1. I thought that attention would be applied along the sequence length, over the different RNN outputs, more or less instead of the mean/max pooling and taking the last output.

  2. As mentioned, I thought about using attention instead of pooling, reducing the dimensionality of the network. Here it is used with key = value = query, so if I understand correctly, it preserves the dimensionality and computes a representation of each item in the context of all the other items? I guess I just don't understand; is there an intuitive explanation of what it does?

  3. I thought about using attention in the way described above. In that case, I think the query tensor should be learnable (or multiple tensors for multiple heads). Since this is a classification scenario, the mechanism I want is attention returning the most relevant RNN outputs for the classification task at hand (instead of taking a mean/max...). Does that make sense? See the sketch after this list.
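A rough sketch of the mechanism I have in mind (names are made up, and this assumes encoder outputs of shape `(batch, seq_len, hidden)`): a single learnable query scores every timestep, and the pooled representation is the attention-weighted sum of the RNN outputs.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class LearnedQueryAttentionPooling(nn.Module):
    def __init__(self, hidden: int):
        super().__init__()
        # One learnable query; with k queries you would get k "heads"
        self.query = nn.Parameter(torch.randn(hidden))

    def forward(self, outputs: torch.Tensor) -> torch.Tensor:
        # scores: (batch, seq_len) -- relevance of each timestep to the task
        scores = outputs @ self.query / outputs.size(-1) ** 0.5
        weights = F.softmax(scores, dim=1)          # (batch, seq_len)
        # Weighted sum over the sequence: (batch, hidden)
        return torch.einsum('bs,bsh->bh', weights, outputs)
```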

@aayux (Contributor, Author) commented Feb 4, 2019

@tpietruszka

One intuitive reason why I think this could be helpful is that the way we had planned on using XNLI was by concatenating the premise and the hypothesis -- so it is possible that we learn some premise-to-hypothesis attention through it. What do you think?

Of course, the dimensionality is preserved but I don't think that's a big problem.

I agree that a more "meaningful" way of applying attention is to attend on the hidden layer outputs from the forward and backward LMs. In fact, applying attention to the concatenation of the pooling outputs was somewhat foolish of me.

What I'll do instead is attend only over a concatenation of the forward and backward LM outputs and also reduce the number of attention heads (which should solve the memory problem). I'll work on it this weekend and update; a rough sketch of the plan is below.
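Roughly like this (illustrative names only; this assumes the forward and backward LM outputs are available as separate tensors of shape `(seq_len, batch, hidden)`):

```python
import torch
import torch.nn as nn

class BiLMAttention(nn.Module):
    def __init__(self, hidden: int, n_heads: int = 2):
        super().__init__()
        # Forward + backward LM outputs concatenated -> 2 * hidden features
        self.attn = nn.MultiheadAttention(2 * hidden, n_heads)

    def forward(self, fwd: torch.Tensor, bwd: torch.Tensor) -> torch.Tensor:
        # fwd, bwd: (seq_len, batch, hidden)
        seq = torch.cat([fwd, bwd], dim=-1)   # (seq_len, batch, 2 * hidden)
        attn_out, _ = self.attn(seq, seq, seq)
        return attn_out.mean(dim=0)           # (batch, 2 * hidden), pooled
```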

Feel free to add to this PR if you have ideas for improving it. If you'd like to try a different experiment with attention, that's great too!

@tpietruszka (Contributor) commented

@Dust0x I think all approaches are worth testing...

Recently I have been experimenting with different variants of attention, applied to the LM outputs before pooling, on the IMDb task. I've pushed two variants to a small (for now messy) repo, ulmfit_experiments; maybe it can be of help somehow.

Some observations:

  1. Whatever I do, I seem to end up with accuracy between 94% and 95%, with both uni- and bidirectional models. It is quite frustrating.
  2. I think attention might help where there are fewer labeled examples, but that needs further testing.
  3. One possible interpretation of the fact that changing the classifier head's architecture does not change results much: the 'bottleneck' is the language model, not the classifier head. But then again, adding bidirectionality should help, and it does not.
  4. In early versions I also had a GPU memory leak. It seems to be solved now, though I'm not sure how. I think it was related to some parameters not being correctly registered on the module (and, I guess, not deallocated when appropriate); see the sketch after this list.
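To illustrate point 4, here is a minimal sketch of the pattern I suspect was the culprit (not the actual code; names are made up): a plain tensor attribute is invisible to `nn.Module`, so it is neither trained nor moved/freed together with the module, whereas an `nn.Parameter` is registered and handled automatically.

```python
import torch
import torch.nn as nn

class Attn(nn.Module):
    def __init__(self, hidden: int):
        super().__init__()
        # Wrong: a plain tensor is not registered, never trained,
        # and easy to leave dangling on the GPU:
        # self.query = torch.randn(hidden, device='cuda')
        # Right: registered as a parameter of the module:
        self.query = nn.Parameter(torch.randn(hidden))

module = Attn(400)
print([name for name, _ in module.named_parameters()])  # ['query']
```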

Please let me know if you have any thoughts on the subject.

@aayux (Contributor, Author) commented Feb 14, 2019

The memory "leak" was my own fault. I changed the way I was using attention and that fixed it.

The self-attention module seems to be working okay on the tests I ran locally; I'll start benchmarking now. @sebastianruder @PiotrCzapla, are there any specific datasets you would like to see results on?

@tpietruszka it's very odd that you get the same accuracy on IMDb across all your experiments. Is it possible that the classification head is hard-coded to `BiPoolingLinearClassifier` somewhere and it's defaulting to that every time? I'm only suggesting this because something similar came up when I was experimenting too, and of course it's possible that you have already checked.

@aayux changed the title from "[WIP] Self attention for pooling linear classifier" to "Self attention for pooling linear classifier" on Mar 18, 2020