Commit 231c868 ("finished tensors"), parent c5a4f8d

11 files changed: 1275 additions & 33 deletions

content/posts/tensors-signals-kernels/index.md: 131 additions & 10 deletions
@@ -1005,6 +1005,10 @@ Bookkeeping heterogeneous tensor contractions is a big practical problem. In part
An appealing model of tensor operations was offered by [Sir Roger Penrose](https://en.wikipedia.org/wiki/Roger_Penrose) in 1971 in the illustrated writeup [Applications of Negative-Dimensional Tensors](https://www.mscs.dal.ca/%7Eselinger/papers/graphical-bib/public/Penrose-applications-of-negative-dimensional-tensors.pdf). There, he gave the first theory of abstract tensor networks, which he called Abstract Tensor Systems (ATS), equipped with a coordinate-free system for representing homogeneous tensors and contractions. This system became known as [Penrose graphical notation](https://en.wikipedia.org/wiki/Penrose_graphical_notation). It is delightful for any abstract treatment of tensors (like our own so far).

{{< hcenter >}}
{{< figure src="roger-penrose.png" width="256" caption="Sir Roger Penrose (born August 8, 1931)" >}}
{{< /hcenter >}}

{{% hint title="3.39. Example" %}}
In Penrose graphical notation, individual tensors are represented as nodes in a graph, sometimes distinguished by geometric shapes for ease of reference. The rank of the tensor being represented is indicated by its number of outgoing edges. The system differentiates a "cartesian" case by the availability of a bijection $\Phi : V \to V^*$. (We have been assuming this -- see 3.36). In the cartesian case, edge direction does not matter. Otherwise, a tensor of type $(a, b)$ has $a$ upward-pointing edges and $b$ downward-pointing edges. Contractions are denoted by connecting corresponding edges.
@@ -1017,10 +1021,6 @@ Above we see a cartesian Penrose diagram, representing a contraction involving t
{{% /hint %}}
Because it ignores the details needed to pin down particular tensor instances, Penrose notation enjoys better ergonomics for working with abstract tensors. But before Penrose (and before the study of abstract tensor spaces itself), these objects first appeared in service of fields like [differential geometry](https://en.wikipedia.org/wiki/Differential_geometry), and were only baptized as "tensors" later by physicists.
{{% hint title="3.40. Example" %}}
@@ -1090,7 +1090,11 @@ provides a tensor $g_c : (v, w) \mapsto \langle \mathcal{J} \mathcal{\Phi}(v), \
{{% /hint %}}
{{< hcenter >}}
{{< figure src="mark-wilson-1e90-1990.png" width="512" caption="Mark Wilson, \'1e90\' (1990)" >}}
{{< /hcenter >}}

In 3.40, the spaces $T_p M$ have no abstract canonical isomorphism to their own dual, as the inner product (which is precisely the tensor $g_p$ as described) depends on the geometry of $\mathcal{M}$. This is an example of the statement of 3.36, showing that a purely abstract treatment of tensor spaces is not always productive. But before people approached tensors abstractly, coordinate-based approaches were the norm.

The dominant model for tensor operations in coordinates is indisputably [Einstein notation](https://en.wikipedia.org/wiki/Einstein_notation). Coincidentally, Penrose introduced ATS in Einstein notation. It is a data-oriented system where indices corresponding to each argument in $V$ and $V^\*$ of a tensor in scalar-valued map form (see $(8)$ and $(10)$) are tracked. Credit for its creation is given to differential geometer [Ricci-Curbastro](https://en.wikipedia.org/wiki/Gregorio_Ricci-Curbastro), but it was made popular by [Einstein](https://en.wikipedia.org/wiki/Albert_Einstein) when he published a novel use of it in physics with his [field equations](https://en.wikipedia.org/wiki/Einstein_field_equations) in 1915.

@@ -1113,10 +1117,10 @@ $$
\text{entry}_m(i, j) = m_{i,j}.
$$

From a computational perspective, this is a unique proxy for $m$ when considering maps in $\mathcal{L}(V, W)$ (via 3.9). For example, the [Frobenius norm](https://en.wikipedia.org/wiki/Matrix_norm#Frobenius_norm) $\\| m \\|\_F^2 = \sum_{(i, j)} |m_{i,j}|^2$ makes use of this notation. While this makes sense for a linear map of form $V \to W$, issues arise for tensors of other forms. Consider $g_c$ from 3.40,

$$
g_{(\theta, \phi)}(v, w) =
\begin{bmatrix}
v_\phi & v_\theta
\end{bmatrix}
@@ -1127,7 +1131,7 @@ r^2 & 0 \\
\begin{bmatrix}
w_\phi \\
w_\theta
\end{bmatrix}
\;\; \text{s.t.} \;\;
M_\mathcal{B}(g_{({\theta}, {\phi})}) =
\begin{bmatrix}
@@ -1136,16 +1140,133 @@ r^2 & 0 \\
\end{bmatrix}.
$$

If given only the matrix $M_\mathcal{B}(g_{({\theta}, {\phi})})$, there would be no way of knowing that it represents the map of (inner-product) form $g_{(\theta, \phi)} : P \times P \to \mathbb{R}$ (and not an operator in $\mathcal{L}(V, V)$), precisely because the canonical interpretation is the latter. This interpretation is only canonical because matrix-vector multiplication is defined with

$$
(Av)_i = \sum_{j} A_{ij} v_j.
$$

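In NumPy terms, this defining sum is a one-line contraction (a quick sketch using `np.einsum`, which reappears in 3.42):

```python
import numpy as np

# matrix-vector multiplication as the explicit contraction (Av)_i = sum_j A_ij v_j
A = np.array([[1.0, 2.0],
              [3.0, 4.0]])
v = np.array([1.0, -1.0])

Av = np.einsum('ij,j->i', A, v)  # same result as A @ v
```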
Hence, one must revisit matrix multiplication to have a basis-induced canonical isomorphism between tensors and matrices (see 3.19). To this end, we will adopt tensor contraction in coordinates as the matrix multiplication for tensors. A rank-$n$ tensor will be represented as a matrix with $n$ axes (so that $n$ indices identify an entry), such that for a tensor $t$ of type $(a, b)$,

$$
{\large t_{\beta_1 \, \cdots \, \beta_b}^{\alpha_1 \, \cdots \, \alpha_a}}
$$

denotes an entry of its matrix representation. Taking the tensor product of $t$ with a new tensor $g$ of type $(c, d)$ is very simple: by the same logic, each entry of the product is located by $a + c$ upper and $b + d$ lower indices. So we simply denote their tensor product

$$
t \otimes g =
{\large t_{\beta_1 \, \cdots \, \beta_b}^{\alpha_1 \, \cdots \, \alpha_a} \, g_{\delta_1 \, \cdots \, \delta_d}^{\gamma_1 \, \cdots \, \gamma_c}}.
$$

To perform contractions as specified in 3.35, one sums products across a pair of indices, one upper and one lower (which may belong to different tensors in a tensor product). For example, one could contract $\alpha_a$ with the index $\delta_1$ in the product $t \otimes g$ to obtain

$$
{\large t_{\beta_1 \, \cdots \, \beta_b}^{\alpha_1 \, \cdots \, \alpha_{a - 1}} \,
g_{\delta_2 \, \cdots \, \delta_d}^{\gamma_1 \, \cdots \, \gamma_c}} =
\sum_k \,
{\large t_{\beta_1 \, \cdots \, \beta_b}^{\alpha_1 \, \cdots \, \alpha_{a - 1} \, k} \,
g_{k \, \delta_2 \, \cdots \, \delta_d}^{\gamma_1 \, \cdots \, \gamma_c}}.
$$

Note that in the homogeneous case, one can always do this for any upper-lower index pair. The heterogeneous case just requires knowing which upper indices can be contracted with which lower indices (which can be done by tracking which index pairs correspond to duals of the same vector space). Finally, tensors of equal type may be summed entrywise,

$$
{\large t_{\beta_1 \, \cdots \, \beta_b}^{\alpha_1 \, \cdots \, \alpha_{a}}} +
{\large q_{\beta_1 \, \cdots \, \beta_b}^{\alpha_1 \, \cdots \, \alpha_{a}}} =
{\large h_{\beta_1 \, \cdots \, \beta_b}^{\alpha_1 \, \cdots \, \alpha_{a}}},
$$

with the understanding that the tensor product is distributive over tensor sums. This completes the system that Ricci-Curbastro outlined, allowing an axiomatic treatment of tensors in coordinates. The convention of dropping the summation symbol in contractions, leaving it implied whenever an index symbol appears twice, was introduced and popularized by Einstein (effectively his only contribution to [Ricci calculus](https://en.wikipedia.org/wiki/Ricci_calculus)):

$$
{\large t_{\beta_1 \, \cdots \, \beta_b}^{\alpha_1 \, \cdots \, \alpha_{a - 1} \, k} \,
g_{k \, \delta_2 \, \cdots \, \delta_d}^{\gamma_1 \, \cdots \, \gamma_c}} =
\sum_k \,
{\large t_{\beta_1 \, \cdots \, \beta_b}^{\alpha_1 \, \cdots \, \alpha_{a - 1} \, k} \,
g_{k \, \delta_2 \, \cdots \, \delta_d}^{\gamma_1 \, \cdots \, \gamma_c}}.
$$

Note that at any point we can find an entry in the matrix representation by replacing all indices with values. This means that Einstein notation is simply element-wise treatment of tensor matrices, with careful consideration of contraction compatibility by use of upper and lower indices. Here, "trades" translate to index raising or lowering, where (depending on context) one may be allowed to

$$
{\large t_{\beta_1 \, \cdots \, \beta_b}^{\alpha_1 \, \cdots \, \alpha_a}} \xrightarrow{\text{raise } \beta_b}
{\large t_{\beta_1 \, \cdots \, \beta_{b - 1}}^{\alpha_1 \, \cdots \, \alpha_a \, \beta_b}} \xrightarrow{\text{lower } \beta_b}
{\large t_{\beta_1 \, \cdots \, \beta_b}^{\alpha_1 \, \cdots \, \alpha_a}}.
$$

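In coordinates, these "trades" are themselves contractions against a metric tensor and its inverse; a minimal numerical sketch (the metric below is hypothetical):

```python
import numpy as np

# lowering an index with a metric g: v_i = g_ij v^j
g = np.array([[2.0, 0.0],
              [0.0, 3.0]])   # hypothetical metric (invertible)
v_up = np.array([1.0, 1.0])  # components with an upper index

v_down = np.einsum('ij,j->i', g, v_up)        # lower the index
g_inv = np.linalg.inv(g)
v_back = np.einsum('ij,j->i', g_inv, v_down)  # raise it again
```

Raising after lowering (or vice versa) is the identity, which is what makes the trade lossless.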
If allowed to raise or lower indices (which is context-dependent as seen in 3.36 and 3.40), a contraction can be done on two lower or upper indices. So summation (and legality of index lowering or raising) remains implied in cases where $V \cong V^\*$ canonically (which Penrose called "cartesian" in ATS). For example,

$$
{\large t_{\beta_1 \, \cdots \, \beta_b \, k}^{\alpha_1 \, \cdots \, \alpha_{a}} \,
g_{\delta_1 \, \cdots \, \delta_d \, k}^{\gamma_1 \, \cdots \, \gamma_c}} =
\sum_k \,
{\large t_{\beta_1 \, \cdots \, \beta_b \, k}^{\alpha_1 \, \cdots \, \alpha_{a}} \,
g_{\delta_1 \, \cdots \, \delta_d \, k}^{\gamma_1 \, \cdots \, \gamma_c}}.
$$

{{% /hint %}}
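To make 3.41 concrete, here is a small numerical sketch (not from the text): for two type-$(1, 1)$ tensors, forming the tensor product and then contracting the matched upper-lower pair recovers ordinary matrix multiplication.

```python
import numpy as np

rng = np.random.default_rng(0)

# t and g are type (1, 1) tensors: one upper and one lower index each
t = rng.normal(size=(3, 3))
g = rng.normal(size=(3, 3))

# tensor product: an entry for every choice of all four indices
tg = np.einsum('ab,cd->abcd', t, g)  # tg[a, b, c, d] = t[a, b] * g[c, d]

# contract t's lower index with g's upper index: sum_k t^a_k g^k_d,
# which is exactly matrix multiplication
contracted = np.einsum('akkd->ad', tg)
```

Contracting a different pair (or none) yields other maps built from the same product; this is the bookkeeping that Einstein notation automates.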
Notice that the "reduction" done by way of summation in a tensor contraction can use any operation with the effect of [folding](https://en.wikipedia.org/wiki/Fold_(higher-order_function)) a collection (returning a single value from multiple values of the same type), although this may break any and all algebraic invariants (depending on context). This could look like

$$
w_{abc}^{ik} = \text{reduce}_j \left( v_{a b c}^{i j k} \right).
$$

Since the user is giving up all algebraic guarantees, they may annihilate any index (upper or lower). This is simply treating the tensor matrix as data. But it shows that Einstein notation maintains its richness beyond pure tensor algebra, and is even generalizable to infinite-dimensional tensors via wise choices of $\text{reduce}$.
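Concretely (a sketch with arbitrary axis sizes), such a generalized reduction is just a fold over one axis of the tensor matrix:

```python
import numpy as np

rng = np.random.default_rng(0)
v = rng.normal(size=(2, 3, 4, 5, 6, 7))  # axes for indices i, j, k, a, b, c

w_sum = v.sum(axis=1)  # the usual contraction-style reduction over j
w_max = v.max(axis=1)  # any other fold also annihilates j, algebra aside
```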
{{% hint title="3.42. Example" %}}
The cornerstone of the [transformer model](https://en.wikipedia.org/wiki/Transformer_(deep_learning_architecture)) is generally regarded to be the [attention mechanism](https://en.wikipedia.org/wiki/Attention_(machine_learning)). This is no more than a tensor-valued function $\mathbf{A} : \mathbb{R}^{d \times n} \to \mathbb{R}^{d_v \times n}$ with tunable parameters, generally defined as

$$
\mathbf{A}(X) = VX \, \text{softmax}\left(\frac{(KX)^\top (QX)}{\sqrt{d_k}}\right).
$$

Here, $X \in \mathbb{R}^{d \times n}$, and the matrices $Q \in \mathbb{R}^{d_k \times d}$ (query projection matrix), $K \in \mathbb{R}^{d_k \times d}$ (key projection matrix), and $V \in \mathbb{R}^{d_v \times d}$ (value projection matrix) are optimized. It is rarely written in Einstein notation as an equation of heterogeneous tensors, though that form helps when $X$ has many channels to mix (indexed by $c$),

$$
\mathbf{A}(X)_{t}^{ic} = \text{softmax} \left(\frac{g_{jm} K_p^j X_s^{pc} Q_q^m X_t^{qc}}{\sqrt{d_k}}\right) V_k^i X_s^{kc}.
$$

I replaced the row-wise dot product application $(KX)^\top (QX)$ with an explicit metric tensor $g_{jm}$, which brings a subtle geometric interpretation to the foreground just by way of notation. Observing this could have motivated us to add another metric tensor $m_{c_1 c_2}$ to capture channel-wise relationships while mixing,

$$
\mathbf{A}(X)_{t}^{ic} = \text{softmax} \left(\frac{g_{jm} m_{c_1 c_2} K_p^j X_s^{pc_1} Q_q^m X_t^{qc_2}}{\sqrt{d_k}}\right) V_k^i X_s^{kc}.
$$

With the [`einsum` NumPy API](https://numpy.org/doc/stable/reference/generated/numpy.einsum.html), we can write the above in Python still using Einstein notation. For simplicity, only a single-channel $X$ will be considered. Below, the strategy for determining the tensor `g` is left unspecified.
{{< highlight python "lineNos=true, lineNoStart=1" >}}
import numpy as np

# einsum treats all indices as lower (or upper)
# i.e. assumes raising/lowering is generally OK

KX = np.einsum('jk,ks->js', K, X) # keys
QX = np.einsum('ml,lt->mt', Q, X) # queries

# attention scores (similarity under g)
sim = np.einsum('jm,js,mt->st', g, KX, QX)

scores = sim / np.sqrt(Q.shape[0])
weights = np.exp(scores - np.max(scores, axis=0, keepdims=True))
weights = weights / np.sum(weights, axis=0, keepdims=True)

VX = np.einsum('ij,jt->it', V, X) # values
result = np.einsum('is,st->it', VX, weights)
{{< /highlight >}}
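As a quick sanity check (not part of the post: the dimensions below are arbitrary, and `g` is taken to be the identity, which recovers ordinary dot-product scores), the snippet runs end-to-end on random data:

```python
import numpy as np

rng = np.random.default_rng(0)
d, n, d_k, d_v = 8, 5, 4, 6          # hypothetical dimensions
X = rng.normal(size=(d, n))          # n tokens as columns
K = rng.normal(size=(d_k, d))
Q = rng.normal(size=(d_k, d))
V = rng.normal(size=(d_v, d))
g = np.eye(d_k)                      # identity metric: plain dot-product attention

KX = np.einsum('jk,ks->js', K, X)
QX = np.einsum('ml,lt->mt', Q, X)
sim = np.einsum('jm,js,mt->st', g, KX, QX)
scores = sim / np.sqrt(Q.shape[0])
weights = np.exp(scores - np.max(scores, axis=0, keepdims=True))
weights = weights / np.sum(weights, axis=0, keepdims=True)
VX = np.einsum('ij,jt->it', V, X)
result = np.einsum('is,st->it', VX, weights)
```

With the identity metric, `sim` coincides with $(KX)^\top (QX)$, and each column of `weights` sums to one.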
{{% /hint %}}
#### Overview
We began this section by organizing the concepts of matrices, linear maps, and vectors. We then took a look at linear forms and dual spaces, being careful with infinite-dimensional cases. Later, we expanded the concept of linearity to multilinear maps, and showed how the tensor product "linearizes" them. That way, we saw how our early organization of matrices, vectors, and linear maps carried over to tensor product spaces in very natural ways. Finally, tensors allowed us to speak about all these objects as unified under a single umbrella woven by tight isomorphisms.

Although this trajectory was predominantly abstract, we were also exposed to practical notation and uses of tensors in coordinates, with examples from the early history of tensors in differential geometry and from modern applications in deep learning. These examples showed that tensors can be found at many points of the spectrum between raw data and abstract algebraic objects, and that their theory is robust to a wide range of uses -- some more disrespectful than others, but never enough to erase their basic qualities.

{{< hcenter >}}
{{< figure src="mark-wilson-e67109-2012.png" width="512" caption="Mark Wilson, \'e67109\' (2012)" >}}
{{< /hcenter >}}
### Signals and Systems