📚 Auto-publish: Add/update 6 blog posts

Generated on: Tue Sep 23 06:20:36 UTC 2025
Source: md-personal repository
Automated Publisher
2025-09-23 06:20:36 +00:00
parent 7cd5bd6558
commit 2b2203c6f7
6 changed files with 4 additions and 3 deletions


@@ -40,7 +40,7 @@ The dimensions of the weight matrices are as follows:
### 3. Deconstructing Multi-Head Attention (MHA)
The core innovation of the Transformer is Multi-Head Attention. It allows the model to weigh the importance of different tokens in the sequence from multiple perspectives simultaneously.
-![](/images/transformer-s-core-mechanics/.png)
+![S3 File](/images/transformer-s-core-mechanics/c7fe4af2633840cfbc81d7c4e3e42d0c.png)
#### 3.1. The "Why": Beyond a Single Attention Head
A single attention mechanism would force the model to average all types of linguistic relationships into one pattern. MHA avoids this by creating `h` parallel subspaces. Each "head" can specialize, with one head learning syntactic dependencies, another tracking semantic similarity, and so on. This creates a much richer representation.
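
To make the parallel-subspace idea concrete, here is a minimal NumPy sketch of multi-head attention. It is not the post's own code; the names (`W_q`, `W_k`, `W_v`, `W_o`, `d_model`, `h`) and the toy sizes are illustrative assumptions. Each head attends over the sequence in its own `d_model / h`-dimensional subspace, and the heads are then concatenated and mixed back together by the output projection.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)   # subtract max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def multi_head_attention(X, W_q, W_k, W_v, W_o, h):
    """Scaled dot-product attention computed in h parallel subspaces (toy sketch).

    X:             (seq_len, d_model) input token representations
    W_q, W_k, W_v: (d_model, d_model) query/key/value projections
    W_o:           (d_model, d_model) output projection
    h:             number of heads; each head works in a d_model // h subspace
    """
    seq_len, d_model = X.shape
    d_k = d_model // h

    # Project once, then split the projections into h heads.
    Q = (X @ W_q).reshape(seq_len, h, d_k).transpose(1, 0, 2)  # (h, seq, d_k)
    K = (X @ W_k).reshape(seq_len, h, d_k).transpose(1, 0, 2)
    V = (X @ W_v).reshape(seq_len, h, d_k).transpose(1, 0, 2)

    # Each head computes its own attention pattern over the sequence.
    scores = Q @ K.transpose(0, 2, 1) / np.sqrt(d_k)            # (h, seq, seq)
    weights = softmax(scores, axis=-1)
    heads = weights @ V                                          # (h, seq, d_k)

    # Concatenate the heads and mix them back into d_model dimensions.
    concat = heads.transpose(1, 0, 2).reshape(seq_len, d_model)
    return concat @ W_o

# Toy usage with random weights (hypothetical sizes, not the post's configuration).
rng = np.random.default_rng(0)
d_model, seq_len, h = 64, 10, 8
X = rng.normal(size=(seq_len, d_model))
W_q, W_k, W_v, W_o = (rng.normal(size=(d_model, d_model)) * 0.1 for _ in range(4))
out = multi_head_attention(X, W_q, W_k, W_v, W_o, h)
print(out.shape)  # (10, 64)
```

Because the softmax is applied per head, each head produces its own `(seq, seq)` attention pattern, which is what lets one head specialize in, say, syntactic dependencies while another tracks semantic similarity.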