-
That's actually a really, really good question. I think you mean one can rewrite X*W_q*W_K^T*X^T as X*S*X^T with S = W_q*W_K^T, correct? I think this would work, but then the transformation becomes the same for keys and queries, and it would not be possible to distinguish them. So basically, while the separate version, X*W_q*W_K^T*X^T, and the merged version, X*S*X^T, come out as the same end result, the training dynamics would be different: in the first case the weight parameters W_q and W_K are updated separately, and in the second case you lose that distinction and lose degrees of freedom. But you are welcome to try this in Chapter 5, for example, and compare the training losses with and without the merging.
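A minimal sketch of that equivalence (toy random matrices with assumed sizes, not code from the book):

import torch

torch.manual_seed(123)
L, d_in, d_out = 6, 8, 4              # toy sequence length and dimensions (assumed)

X = torch.randn(L, d_in)              # token embeddings
W_q = torch.randn(d_in, d_out)        # query projection
W_k = torch.randn(d_in, d_out)        # key projection

# Separate parameterization: scores = (X W_q)(X W_k)^T
scores_separate = (X @ W_q) @ (X @ W_k).T

# Merged parameterization: S = W_q W_k^T, scores = X S X^T
S = W_q @ W_k.T
scores_merged = X @ S @ X.T

# Identical up to floating-point error, so the softmax weights match as well
print(torch.allclose(scores_separate, scores_merged, atol=1e-5))  # True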
-
Thanks for the reply. Yes, this is exactly what I mean. The training dynamics would be different, as there would be no keys K or queries Q any more. The thing is, do we really need them? Optimizing two matrices (W_q and W_K) seems redundant when only their product W_q*W_K^T enters the attention scores.
-
Here are the experiment results using the Chapter 5 example:
class CA_alt1(nn.Module):
    """Causal attention with the query and key projections merged into a single matrix S."""
    def __init__(self, d_in, d_out, context_length, dropout, qkv_bias=False):
        super().__init__()
        self.d_out = d_out
        self.S = nn.Linear(d_in, d_in, bias=qkv_bias)        # replaces W_query and W_key
        self.W_value = nn.Linear(d_in, d_out, bias=qkv_bias)
        self.dropout = nn.Dropout(dropout)
        self.register_buffer('mask', torch.triu(torch.ones(context_length, context_length), diagonal=1))

    def forward(self, x):
        b, num_tokens, d_in = x.shape
        values = self.W_value(x)
        attn_scores = self.S(x) @ x.transpose(1, 2)           # X*S*X^T instead of Q*K^T
        attn_scores.masked_fill_(self.mask.bool()[:num_tokens, :num_tokens], -torch.inf)
        attn_weights = torch.softmax(attn_scores / values.shape[-1]**0.5, dim=-1)
        attn_weights = self.dropout(attn_weights)
        context_vec = attn_weights @ values
        return context_vec
class MHA_alt1(nn.Module):
    """Multi-head wrapper around CA_alt1: one merged matrix S per head."""
    def __init__(self, d_in, d_out, context_length, dropout, num_heads, qkv_bias=False):
        super().__init__()
        assert (d_out % num_heads == 0), "d_out must be divisible by num_heads"
        head_dim = d_out // num_heads
        self.heads = nn.ModuleList(
            [CA_alt1(d_in, head_dim, context_length, dropout, qkv_bias) for _ in range(num_heads)]
        )

    def forward(self, x):
        return torch.cat([head(x) for head in self.heads], dim=-1)

and in TransformerBlock:

class TransformerBlock(nn.Module):
    def __init__(self, cfg):
        super().__init__()
        self.att = MHA_alt1(   # arguments completed with the book's standard cfg keys (assumed)
            d_in=cfg["emb_dim"],
            d_out=cfg["emb_dim"],
            context_length=cfg["context_length"],
            num_heads=cfg["n_heads"],
            dropout=cfg["drop_rate"],
            qkv_bias=cfg["qkv_bias"])
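As a quick sanity check of the modules above, a minimal sketch with a dummy batch (the sizes below are assumed toy values, not the Chapter 5 config):

import torch

torch.manual_seed(123)
batch_size, num_tokens, d_in = 2, 5, 12
d_out, num_heads = 12, 3              # head_dim = 4 per head

x = torch.randn(batch_size, num_tokens, d_in)
mha = MHA_alt1(d_in, d_out, context_length=num_tokens, dropout=0.0, num_heads=num_heads)

out = mha(x)
print(out.shape)  # torch.Size([2, 5, 12]) -- num_heads * head_dim concatenated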
-
This is such an interesting discussion!

Standard attention: the scores are X*W_q*W_K^T*X^T, where W_q and W_K are H x D projection matrices, so the score matrix has rank at most D.

Proposed covariance-style attention: the scores are X*S*X^T, where S is an unconstrained H x H matrix (H being the embedding dimension), so for a sequence of length L the score matrix can have rank up to min(L, H).

So, basically, while the standard attention mechanism explicitly constrains the rank through the projection dimension D, the covariance formulation allows for potentially richer attention patterns up to rank min(L, H). However, it's quite interesting: maybe the low-rank constraint in standard attention is beneficial as an inductive bias during training, even though it limits the theoretical expressiveness. I would be very interested to hear your thoughts!
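A small numerical illustration of that rank argument (random toy matrices with assumed sizes; ranks computed with torch.linalg.matrix_rank):

import torch

torch.manual_seed(123)
L, H, D = 32, 16, 4                   # sequence length, embedding dim, projection dim (assumed)

X = torch.randn(L, H)
W_q, W_k = torch.randn(H, D), torch.randn(H, D)
S = torch.randn(H, H)                 # unconstrained covariance-style matrix

scores_standard = X @ W_q @ W_k.T @ X.T    # rank limited by the projection dimension D
scores_covariance = X @ S @ X.T            # rank limited only by min(L, H)

print(torch.linalg.matrix_rank(scores_standard))    # typically tensor(4)
print(torch.linalg.matrix_rank(scores_covariance))  # typically tensor(16)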
-
When computing the context vector in the attention algorithm, three weight matrices are introduced. It has been discussed in #454 that the value matrix W_V is not necessary. For the remaining two, the query matrix and the key matrix, keeping both of them does not seem necessary either. The context vector can be expressed as

X*W_q*W_K^T*X^T*X*W_V

where * stands for matrix multiplication. Is it possible to merge the part W_q*W_K^T into a single covariance matrix S, so that the context vector becomes X*S*X^T*X*W_V? This merge could potentially reduce nuisance parameters and improve computational performance.
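On the parameter count, a back-of-the-envelope sketch of how the merge changes the number of trainable weights, using assumed GPT-2-small-like numbers (not figures from the book):

# bias terms ignored throughout
d_in = 768                                 # embedding dimension (assumed)

# single-head case with d_out == d_in: the merge roughly halves the count
print(2 * d_in * d_in, d_in * d_in)        # 1179648 589824

# per-head case with head_dim = d_in // 12, as in the MHA_alt1 experiment above
head_dim = d_in // 12                      # 64
print(2 * d_in * head_dim, d_in * d_in)    # 98304 589824 -- here the merge adds parameters per head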