📚 Auto-publish: Add/update 6 blog posts
All checks were successful
Hugo Publish CI / build-and-deploy (push) Successful in 14s

Generated on: Thu Oct  2 08:42:39 UTC 2025
Source: md-personal repository
Automated Publisher
2025-10-02 08:42:39 +00:00
parent ca873828aa
commit 7ef6ce1987
6 changed files with 6 additions and 5 deletions


@@ -40,7 +40,7 @@ The dimensions of the weight matrices are as follows:
### 3. Deconstructing Multi-Head Attention (MHA)
The core innovation of the Transformer is Multi-Head Attention. It allows the model to weigh the importance of different tokens in the sequence from multiple perspectives simultaneously.
-![S3 File](http://localhost:4998/attachments/image-c64b0f9df1e4981c4ecdb3b60e8bc78c426ffa68.png?client=default&bucket=obsidian)
+![S3 File](/images/transformer-s-core-mechanics/c7fe4af2633840cfbc81d7c4e3e42d0c.png)
#### 3.1. The "Why": Beyond a Single Attention
A single attention mechanism would force the model to average all types of linguistic relationships into one pattern. MHA avoids this by creating `h` parallel subspaces. Each "head" can specialize, with one head learning syntactic dependencies, another tracking semantic similarity, and so on. This creates a much richer representation.
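To make the head-splitting concrete, here is a minimal NumPy sketch (not taken from the post itself). It assumes the projection matrices `W_q`, `W_k`, `W_v`, `W_o` are all square of size `d_model x d_model`, as suggested by the earlier section on weight-matrix dimensions, and that `d_model` is divisible by `h`:

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def multi_head_attention(X, W_q, W_k, W_v, W_o, h):
    """X: (seq_len, d_model); W_q/W_k/W_v/W_o: (d_model, d_model); h: number of heads."""
    seq_len, d_model = X.shape
    d_head = d_model // h  # each head attends in a d_model/h-dimensional subspace

    # Project once, then split each projection into h parallel heads.
    Q = (X @ W_q).reshape(seq_len, h, d_head).transpose(1, 0, 2)  # (h, seq, d_head)
    K = (X @ W_k).reshape(seq_len, h, d_head).transpose(1, 0, 2)
    V = (X @ W_v).reshape(seq_len, h, d_head).transpose(1, 0, 2)

    # Scaled dot-product attention, computed independently per head.
    scores = Q @ K.transpose(0, 2, 1) / np.sqrt(d_head)            # (h, seq, seq)
    weights = softmax(scores, axis=-1)
    heads = weights @ V                                             # (h, seq, d_head)

    # Concatenate the heads and mix them with the output projection.
    concat = heads.transpose(1, 0, 2).reshape(seq_len, d_model)
    return concat @ W_o                                             # (seq, d_model)

# Usage with illustrative sizes (hypothetical, not from the post):
rng = np.random.default_rng(0)
d_model, h, seq_len = 512, 8, 10
X = rng.normal(size=(seq_len, d_model))
W_q, W_k, W_v, W_o = (0.02 * rng.normal(size=(d_model, d_model)) for _ in range(4))
out = multi_head_attention(X, W_q, W_k, W_v, W_o, h)  # shape (10, 512)
```

Because each head only sees a `d_head = d_model / h` slice of the projections, the per-head attention patterns are free to specialize in the way described above, and the final `W_o` projection mixes their outputs back into a single representation.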