News

The softmax function used in the Transformer’s attention mechanism tends to distribute attention scores across all tokens, even those that are irrelevant to the task: because softmax outputs are strictly positive and sum to one, every token is assigned a nonzero weight no matter how low its score.
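
To make this concrete, here is a minimal sketch (the attention logits are made-up values, not from any real model) showing that softmax gives every token a strictly positive weight, including the clearly irrelevant ones:

```python
import numpy as np

def softmax(x):
    # Subtract the max for numerical stability before exponentiating
    e = np.exp(x - np.max(x))
    return e / e.sum()

# Hypothetical attention logits: one highly relevant token,
# followed by several irrelevant ones with near-zero scores
scores = np.array([5.0, 0.1, 0.0, -0.2, 0.1])
weights = softmax(scores)

print(weights)        # every entry is > 0, so irrelevant tokens still get attention
print(weights.sum())  # the weights always sum to 1.0
```

Running this, the irrelevant tokens each still receive a small but nonzero share of the attention mass; softmax can push those weights toward zero, but never exactly to zero.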