I was reading the code and noticed that the initialization of the MultiheadAttention module in the decoder layer (context attention) doesn't seem to include the relative position embedding arguments, which are present in the self-attention part.
Since relative position embeddings typically replace the standard sinusoidal positional encoding, I'm concerned that leaving them out of the encoder-decoder attention could be problematic: the decoder's hidden states would lack the positional information needed to relate words across the source and target sequences.
Could this hurt the accuracy of the encoder-decoder attention, since the decoder's hidden states may struggle to distinguish adjacent words by their positions?
Are there any experiments or results exploring the impact of using relative position embeddings in the context attention?
Is there any observed synergistic effect between using standard fixed positional embeddings (like the sinusoidal one) and relative positional embeddings together?
Any clarification or experimental insights on this would be greatly appreciated!
FYI, the relevant code for the context attention and self-attention:
OpenNMT-py/onmt/decoders/transformer.py, lines 325 to 335 at commit 97111d9
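For readers without the permalink inline, here is a minimal sketch of the initialization pattern I'm referring to. It is not the exact code at commit 97111d9; it assumes the `MultiHeadedAttention` module exported from `onmt.modules`, whose `max_relative_positions` keyword defaults to 0 (relative position embeddings disabled), and the hyperparameter values are placeholders:

```python
# Hedged sketch of the decoder-layer initialization pattern in question:
# only the self-attention receives max_relative_positions; the context
# (encoder-decoder) attention is constructed without it.
from onmt.modules import MultiHeadedAttention

heads, d_model, dropout = 8, 512, 0.1
max_relative_positions = 20  # placeholder value; 0 would disable relative positions

# Decoder self-attention: relative position embeddings are wired in.
self_attn = MultiHeadedAttention(
    heads, d_model, dropout=dropout,
    max_relative_positions=max_relative_positions)

# Context (encoder-decoder) attention: no max_relative_positions argument,
# so it falls back to the default of 0, i.e. no relative position embeddings.
context_attn = MultiHeadedAttention(heads, d_model, dropout=dropout)
```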