Status: ✨ State-of-the-Art (3M Params)
Validation Loss: 1.2551 (vs Baseline 1.4673, vs Gated SDPA 1.3698)
Parameter Overhead: ~7% (+262k params @ 3M scale vs baseline)
Neon016 introduces a fourth projection vector, Intent ($I$), alongside the standard Q, K, and V.
Unlike standard attention, which only decides where to look (via the $QK^T$ scores), Intent decides whether the retrieved information should be kept at all: the attention output is gated element-wise by $\sigma(I)$.
This simple addition proved to be the single most effective architectural change in our experiments, outperforming:
- Scaling up MLP (neon026, 1.355)
- Gated SDPA (neon010, 1.370)
- Calculated Intent (neon052, 1.345)
- Multi-Latent Attention (neon006, 1.547)
Standard Attention: $$ \text{Output} = \text{Softmax}\left(\frac{Q K^T}{\sqrt{d_k}}\right) V $$
Neon016 (Intent Attention): $$ \text{Gate} = \sigma(I) $$ $$ \text{Output} = \text{Gate} \odot \left( \text{Softmax}\left(\frac{Q K^T}{\sqrt{d_k}}\right) V \right) $$
Where:
- $I$ is a learned projection of the input $x$.
- $\sigma$ is the sigmoid function (output range $[0, 1]$).
- $\odot$ is element-wise multiplication.
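The gated path above can be sketched end-to-end on toy tensors (a minimal, self-contained sketch; the dimensions and fused weight matrix are illustrative, not taken from models/neon016.py):

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
B, T, d = 1, 4, 8  # illustrative batch, sequence, and model dims

x = torch.randn(B, T, d)
W = torch.randn(4 * d, d) / d**0.5       # fused Q/K/V/I projection weights
q, k, v, intent = (x @ W.T).split(d, dim=-1)

attn_out = F.scaled_dot_product_attention(q, k, v, is_causal=True)
gate = torch.sigmoid(intent)             # element-wise gate in [0, 1]
y = gate * attn_out                      # gated output, same shape as attn_out
```

Note that the gate never changes the attention weights themselves; it only filters what the attention output writes back.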
In models/neon016.py, the linear projection is modified to produce 4 chunks instead of 3:
```python
# 1. Four projections (Q, K, V, I) from one fused linear layer
self.c_attn = nn.Linear(d_model, 4 * d_model, bias=False)
q, k, v, intent = self.c_attn(x).split(C, dim=2)

# 2. Standard Q/K processing (RoPE, norm)
# Note: Intent is NOT normalized or rotated. It is raw content.
q, k = self.q_norm(q), self.k_norm(k)
q = apply_rotary_emb(q, freqs_cos, freqs_sin)
k = apply_rotary_emb(k, freqs_cos, freqs_sin)

# 3. Standard attention
attn_out = F.scaled_dot_product_attention(q, k, v, is_causal=True)

# 4. Result gating: apply sigmoid to Intent to get a [0, 1] gate
y = torch.sigmoid(intent) * attn_out
```
For a model with d_model=256 and 4 heads:
- $Q$, $K$, $V$, and $I$ all have shape (B, T, 4, 64).
- The gating happens per-head, per-dimension.
- This means the model can choose to "silence" specific features from a head's output if they are not relevant to the current intent, even if the attention mechanism attended to them.
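The per-head, per-dimension silencing can be checked concretely with the 4-head, 64-dim shapes described above (hypothetical tensors, not repo code):

```python
import torch

torch.manual_seed(0)
B, T, n_head, head_dim = 2, 8, 4, 64     # d_model = 256

intent = torch.randn(B, T, n_head, head_dim)
attn_out = torch.randn(B, T, n_head, head_dim)

# A very negative intent drives the sigmoid gate toward 0 for that slot:
intent[:, :, 0, :] = -20.0               # suppress head 0 entirely
y = torch.sigmoid(intent) * attn_out

# Head 0's contribution is effectively zeroed; other heads pass through scaled.
assert float(y[:, :, 0, :].abs().max()) < 1e-6
```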
Standard attention combines "Search" and "Retrieval".
- Search (Q, K): Find relevant past tokens.
- Retrieval (V): Extract their values.
However, standard attention forces the model to always take the weighted sum of retrieved values: the softmax weights sum to 1, so if the search finds only "weak matches" or "irrelevant strong matches" (distractors), that noise is still added to the residual stream.
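This "forced retrieval" is easy to see numerically: softmax weights always sum to 1, so even uniformly weak scores produce a full-magnitude mixture of values (a minimal illustration, not code from this repo):

```python
import torch

scores = torch.tensor([-5.0, -5.2, -4.8])  # uniformly weak matches
weights = torch.softmax(scores, dim=0)     # still sums to 1 (up to float error)

v = torch.randn(3, 8)
out = weights @ v  # a full-scale (noisy) output despite there being no good match
```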
Intent ($I$) decouples retrieval from acceptance:
- Q/K/V: "Find the best match you can."
- I (Intent): "Regardless of what you found, do I actually want this type of information right now?"
For example, if the model is currently predicting a verb, but an attention head specializes in retrieving adjectives, the Intent vector can simply drive the gate toward 0 and silence that head's output for the current token.
This dynamic gating is much more powerful than a static "head weight" because it changes per token.
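The difference can be sketched directly: a static head weight applies one learned scalar per head to every token, while the Intent gate is recomputed from each token's hidden state (illustrative code; `W_i` is a stand-in for the intent slice of the fused projection):

```python
import torch

torch.manual_seed(0)
T, n_head, head_dim, d_model = 6, 4, 64, 256
x = torch.randn(T, d_model)
attn_out = torch.randn(T, n_head, head_dim)

# Static gating: one scalar per head, identical for all tokens.
static_w = torch.sigmoid(torch.randn(n_head))
y_static = static_w.view(1, n_head, 1) * attn_out

# Intent gating: a per-token, per-head, per-feature gate derived from x.
W_i = torch.randn(n_head * head_dim, d_model) / d_model**0.5
gate = torch.sigmoid(x @ W_i.T).view(T, n_head, head_dim)
y_intent = gate * attn_out

# The intent gate genuinely varies token to token; the static one cannot.
assert not torch.allclose(gate[0], gate[1])
```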
| Model | Mechanism | Val Loss | Notes |
|---|---|---|---|
| neon016 | Learned I (independent projection) | 1.255 | Independent projection allows disjoint feature learning. |
| neon010 | Gated by Q | 1.370 | Q is constrained by positional matching duties. |
| neon052 | Gated by Calculated Intent | 1.345 | Calc intent is correlated with Q/K/V. |
| neon005 | Baseline (no gate) | 1.467 | No filtering capability. |
The "Orthogonality" Hypothesis: The reason neon016 succeeds where calculated intent (neon052) fails is likely because the optimal "Intent" signal is orthogonal to the Q, K, and V signals.
- $Q$ needs to encode position/relation.
- $V$ needs to encode content.
- $I$ needs to encode task state (e.g., "I am looking for a name").
Forcing the gate to reuse Q (neon010) or to be computed from Q/K/V (neon052) entangles these roles; an independent projection lets $I$ occupy its own subspace.
Parameter Cost:
- Adds 1 linear projection ($d_{model} \times d_{model}$) per layer.
- For d_model=256: adds $65{,}536$ params per layer.
- Total for 4 layers: ~262k parameters.
- Overhead: ~7% increase over baseline (2.88M $\to$ 3.15M).
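The parameter arithmetic can be verified directly from the stated d_model=256, 4-layer configuration:

```python
d_model, n_layers = 256, 4

# Baseline fused projection: 3 * d_model * d_model per layer (Q, K, V).
# Neon016 fused projection:  4 * d_model * d_model per layer (Q, K, V, I).
extra_per_layer = d_model * d_model      # 65,536 added parameters per layer
extra_total = n_layers * extra_per_layer # 262,144 (~262k) added parameters

print(extra_per_layer)  # 65536
print(extra_total)      # 262144
```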
Efficiency:
- The computational cost is negligible (one extra matmul and element-wise mult).
- The gain in loss (-0.21 vs baseline) is equivalent to doubling the model size in standard scaling laws.
- This makes neon016 extremely efficient.