...
Transformer architecture:
Attention mechanism: ("Attention Is All You Need" paper) - Attention(Q, K, V) = softmax(QK^T / sqrt(d_k)) V
BERT model
XLNet (by Google Brain and CMU) - reported to outperform BERT on many NLP benchmarks
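The attention formula above can be sketched directly in NumPy. This is a minimal illustration of scaled dot-product attention (single head, no masking, no learned projections), not a full Transformer layer; the function name and toy shapes are my own choices for the example.

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V."""
    d_k = Q.shape[-1]
    # similarity of each query to each key, scaled by sqrt(d_k)
    scores = Q @ K.T / np.sqrt(d_k)
    # row-wise softmax turns scores into attention weights summing to 1
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    # output is a weighted average of the value vectors
    return weights @ V

# toy example: 2 queries, 3 keys/values, dimension d_k = 4
Q = np.random.rand(2, 4)
K = np.random.rand(3, 4)
V = np.random.rand(3, 4)
out = scaled_dot_product_attention(Q, K, V)
print(out.shape)  # (2, 4): one output vector per query
```

Each output row is a convex combination of the rows of V, so every entry stays within the range of the corresponding V column.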
...