📚 Auto-publish: Add/update 6 blog posts
All checks were successful
Hugo Publish CI / build-and-deploy (push) Successful in 12s
Generated on: Tue Sep 23 06:20:36 UTC 2025
Source: md-personal repository
@@ -40,7 +40,7 @@ The dimensions of the weight matrices are as follows:
### 3. Deconstructing Multi-Head Attention (MHA)
The core innovation of the Transformer is Multi-Head Attention. It allows the model to weigh the importance of different tokens in the sequence from multiple perspectives simultaneously.

#### 3.1. The "Why": Beyond a Single Attention Head
A single attention mechanism would force the model to average all types of linguistic relationships into one pattern. MHA avoids this by creating `h` parallel subspaces. Each "head" can specialize, with one head learning syntactic dependencies, another tracking semantic similarity, and so on. This creates a much richer representation.
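
To make the split into heads concrete, here is a minimal NumPy sketch (illustrative only, not the post's implementation). It assumes `d_model` is divisible by `h`, and the projection matrices `W_q`, `W_k`, `W_v`, `W_o` are placeholders matching the weight dimensions discussed above. Each head attends over its own `d_k = d_model / h` slice of the projections before the results are concatenated and mixed by the output projection.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def multi_head_attention(X, W_q, W_k, W_v, W_o, h):
    """X: (seq_len, d_model); W_q, W_k, W_v, W_o: (d_model, d_model); h: number of heads."""
    seq_len, d_model = X.shape
    d_k = d_model // h  # assumes d_model is divisible by h

    # Project once, then split the last dimension into h parallel heads.
    Q = (X @ W_q).reshape(seq_len, h, d_k).transpose(1, 0, 2)  # (h, seq, d_k)
    K = (X @ W_k).reshape(seq_len, h, d_k).transpose(1, 0, 2)
    V = (X @ W_v).reshape(seq_len, h, d_k).transpose(1, 0, 2)

    # Scaled dot-product attention, computed independently in each head's subspace.
    scores = Q @ K.transpose(0, 2, 1) / np.sqrt(d_k)   # (h, seq, seq)
    weights = softmax(scores, axis=-1)
    heads = weights @ V                                  # (h, seq, d_k)

    # Concatenate the heads back to d_model and mix them with the output projection.
    concat = heads.transpose(1, 0, 2).reshape(seq_len, d_model)
    return concat @ W_o
```

Because each head works in its own low-dimensional subspace, the total cost is comparable to a single full-width attention, yet the `h` heads are free to learn different relationship patterns.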