Masking layer vs attention_mask parameter in MultiHeadAttention


I use a MultiHeadAttention layer in my transformer model (which is very similar to named entity recognition models). Because my sequences have different lengths, I pad them and pass the attention_mask parameter to MultiHeadAttention to mask the padding. If I put a Masking layer before MultiHeadAttention, would it have the same effect as the attention_mask parameter? Or should I use both attention_mask and a Masking layer?


The TensorFlow guide "Masking and padding with Keras" may be helpful.
The following is an excerpt from that guide.

When using the Functional API or the Sequential API, a mask generated
by an Embedding or Masking layer will be propagated through the
network for any layer that is capable of using them (for example, RNN
layers). Keras will automatically fetch the mask corresponding to an
input and pass it to any layer that knows how to use it.
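To make the excerpt concrete, here is a minimal sketch of the masks that Embedding and Masking layers generate. The token ids, feature values, and layer sizes are made up for illustration; compute_mask is called directly only to inspect the mask that Keras would otherwise propagate for you.

```python
import numpy as np
import tensorflow as tf

# Hypothetical batch: two sequences padded with 0 to length 4.
token_ids = tf.constant([[7, 3, 0, 0],
                         [5, 2, 9, 4]])

# With mask_zero=True, id 0 is treated as padding.
embedding = tf.keras.layers.Embedding(input_dim=10, output_dim=8, mask_zero=True)
embedding_mask = np.asarray(embedding.compute_mask(token_ids))

# Masking marks a timestep as padding when all of its features equal mask_value.
features = tf.constant([[[1.0, 2.0], [3.0, 4.0], [0.0, 0.0]]])
masking = tf.keras.layers.Masking(mask_value=0.0)
masking_mask = np.asarray(masking.compute_mask(features))

print(embedding_mask)  # True marks real tokens, False marks padding
print(masking_mask)
```

Both masks have shape (batch, sequence_length), which is the form a downstream mask-aware layer receives.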

tf.keras.layers.MultiHeadAttention also supports automatic mask propagation as of TensorFlow 2.10.0. From the release notes:

Improved masking support for tf.keras.layers.MultiHeadAttention.

  • Implicit masks for query, key and value inputs will automatically be
    used to compute a correct attention mask for the layer. These padding
    masks will be combined with any attention_mask passed in directly when
    calling the layer. This can be used with tf.keras.layers.Embedding
    with mask_zero=True to automatically infer a correct padding mask.
  • Added a use_causal_mask call-time argument to the layer. Passing
    use_causal_mask=True will compute a causal attention mask, and
    optionally combine it with any attention_mask passed in directly when
    calling the layer.
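The two call-time options can be sketched as follows. This is an illustration under stated assumptions, not a reference implementation: the token ids, tensor shapes, and layer sizes are invented, random values stand in for embedded tokens, and the explicit mask here only hides padded key positions. On TensorFlow 2.10 or later, an implicit mask from Embedding(mask_zero=True) or Masking should be combined with whatever attention_mask is passed, per the notes above.

```python
import numpy as np
import tensorflow as tf

# Hypothetical batch: two sequences padded with 0 to length 4.
token_ids = np.array([[7, 3, 0, 0],
                      [5, 2, 9, 4]])
padding_mask = token_ids != 0  # (batch, seq_len), True = real token

# Stand-in for embedded tokens; random values are enough to inspect masking.
x = tf.random.normal((2, 4, 8))

mha = tf.keras.layers.MultiHeadAttention(num_heads=2, key_dim=4)

# Explicit mask of shape (batch, query_len, key_len): every query position
# may attend only to non-padded key positions.
attention_mask = tf.constant(np.repeat(padding_mask[:, np.newaxis, :], 4, axis=1))
_, scores = mha(x, x, attention_mask=attention_mask,
                return_attention_scores=True)
scores = scores.numpy()  # (batch, heads, query_len, key_len)

# Causal mask computed by the layer itself (requires TF 2.10 or later).
_, causal_scores = mha(x, x, use_causal_mask=True,
                       return_attention_scores=True)
causal_scores = causal_scores.numpy()

print(scores[0, 0].round(3))         # last two columns are padded keys
print(causal_scores[0, 0].round(3))  # entries above the diagonal are masked
```

The attention weights for padded keys in the first sequence come out as zero, and with use_causal_mask=True each position attends only to itself and earlier positions.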

Answered By – satojkovic

This answer was collected from Stack Overflow and is licensed under CC BY-SA 2.5, CC BY-SA 3.0, and CC BY-SA 4.0.
