Commit 231c868 ("finished tensors"), parent c5a4f8d

11 files changed: 1275 additions & 33 deletions

content/posts/tensors-signals-kernels/index.md: 131 additions & 10 deletions
@@ -1005,6 +1005,10 @@ Bookkeeping heterogeneous tensor contractions is a big practical problem. In part
An appealing model of tensor operations was offered by [Sir Roger Penrose](https://en.wikipedia.org/wiki/Roger_Penrose) in 1971 in the illustrated writeup [Applications of Negative-Dimensional Tensors](https://www.mscs.dal.ca/%7Eselinger/papers/graphical-bib/public/Penrose-applications-of-negative-dimensional-tensors.pdf). There, he gave the first theory of abstract tensor networks, which he called Abstract Tensor Systems (ATS), equipped with a coordinate-free system for representing homogeneous tensors and contractions. This system became known as [Penrose graphical notation](https://en.wikipedia.org/wiki/Penrose_graphical_notation). It is delightful for any abstract treatment of tensors (like our own so far).

{{< hcenter >}}
{{< figure src="roger-penrose.png" width="256" caption="Sir Roger Penrose (born August 8, 1931)" >}}
{{< /hcenter >}}

{{% hint title="3.39. Example" %}}
In Penrose graphical notation, individual tensors are represented as nodes in a graph, sometimes distinguished by geometric shapes for ease of reference. The rank of the tensor being represented is indicated by its number of outgoing edges. The system differentiates a "cartesian" case by the availability of a bijection $\Phi : V \to V^*$. (We have been assuming this -- see 3.36). In the cartesian case, edge direction does not matter. Otherwise, a tensor of type $(a, b)$ has $a$ upward-pointing edges and $b$ downward-pointing edges. Contractions are denoted by connecting corresponding edges.
@@ -1017,10 +1021,6 @@ Above we see a cartesian Penrose diagram, representing a contraction involving t
{{% /hint %}}
Because it ignores the details needed to pin down particular tensor instances, Penrose notation enjoys better ergonomics for working with abstract tensors. But before Penrose (and before the study of abstract tensor spaces itself), these objects first appeared in service of fields like [differential geometry](https://en.wikipedia.org/wiki/Differential_geometry), and were only baptized as "tensors" later by physicists.
{{% hint title="3.40. Example" %}}
@@ -1090,7 +1090,11 @@ provides a tensor $g_c : (v, w) \mapsto \langle \mathcal{J} \mathcal{\Phi}(v), \
{{% /hint %}}
{{< hcenter >}}
{{< figure src="mark-wilson-1e90-1990.png" width="512" caption="Mark Wilson, \'1e90\' (1990)" >}}
{{< /hcenter >}}

In 3.40, the spaces $T_p M$ have no abstract canonical isomorphism to their own dual, as the inner product (which is precisely the tensor $g_p$ as described) depends on the geometry of $\mathcal{M}$. This is an example of the statement of 3.36, showing that a purely abstract treatment of tensor spaces is not always productive. But before people approached tensors abstractly, coordinate-based approaches were the norm.

The dominant model for tensor operations in coordinates is indisputably [Einstein notation](https://en.wikipedia.org/wiki/Einstein_notation). Coincidentally, Penrose introduced ATS in Einstein notation. It is a data-oriented system where indices corresponding to each argument in $V$ and $V^\*$ of a tensor in scalar-valued map form (see $(8)$ and $(10)$) are tracked. Credit for its creation is given to differential geometer [Ricci-Curbastro](https://en.wikipedia.org/wiki/Gregorio_Ricci-Curbastro), but it was made popular by [Einstein](https://en.wikipedia.org/wiki/Albert_Einstein) when he published a novel use of it in physics with his [field equations](https://en.wikipedia.org/wiki/Einstein_field_equations) in 1915.

@@ -1113,10 +1117,10 @@ $$
\text{entry}_m(i, j) = m_{i,j}.
$$

From a computational perspective, this is a unique proxy for $m$ when considering maps in $\mathcal{L}(V, W)$ (via 3.9). For example, the [Frobenius norm](https://en.wikipedia.org/wiki/Matrix_norm#Frobenius_norm) $\\| m \\|\_F^2 = \sum_{(i, j)} |m_{i,j}|^2$ makes use of this notation. While this makes sense for a linear map of form $V \to W$, issues arise for tensors of other forms. Consider $g_c$ from 3.40,

$$
g_{(\theta, \phi)}(v, w) =
\begin{bmatrix}
v_\phi & v_\theta
\end{bmatrix}
@@ -1127,7 +1131,7 @@ r^2 & 0 \\
\begin{bmatrix}
w_\phi \\
w_\theta
\end{bmatrix}
\;\; \text{s.t.} \;\;
M_\mathcal{B}(g_{({\theta}, {\phi})}) =
\begin{bmatrix}
@@ -1136,16 +1140,133 @@ r^2 & 0 \\
\end{bmatrix}.
$$

If given only the matrix $M_\mathcal{B}(g_{({\theta}, {\phi})})$, there would be no way of knowing that it represents the map of (inner-product) form $g_{(\theta, \phi)} : P \times P \to \mathbb{R}$ (and not an operator in $\mathcal{L}(V, V)$), precisely because the canonical interpretation is the latter. This interpretation is only canonical because matrix-vector multiplication is defined with

$$
(Av)_i = \sum_{j} A_{ij} v_j.
$$

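In NumPy terms, this defining sum is a one-line contraction (a quick sketch using `np.einsum`, which reappears in 3.42):

```python
import numpy as np

# matrix-vector multiplication as the explicit contraction (Av)_i = sum_j A_ij v_j
A = np.array([[1.0, 2.0],
              [3.0, 4.0]])
v = np.array([1.0, -1.0])

Av = np.einsum('ij,j->i', A, v)  # same result as A @ v
```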
Hence, one must revisit matrix multiplication to have a basis-induced canonical isomorphism between tensors and matrices (see 3.19). To this end, we will adopt tensor contraction in coordinates as the matrix multiplication for tensors. A rank-$n$ tensor will be represented as a matrix with $n$ axes (so that $n$ indices identify an entry), such that for a tensor $t$ of type $(a, b)$,

$$
{\large t_{\beta_1 \, \cdots \, \beta_b}^{\alpha_1 \, \cdots \, \alpha_a}}
$$

denotes an entry of its matrix representation. Taking the tensor product of $t$ with a new tensor $g$ of type $(c, d)$ is very simple: by the same logic, each entry of the product is located by $a + c$ upper and $b + d$ lower indices. So we simply denote their tensor product

$$
t \otimes g =
{\large t_{\beta_1 \, \cdots \, \beta_b}^{\alpha_1 \, \cdots \, \alpha_a} \, g_{\delta_1 \, \cdots \, \delta_d}^{\gamma_1 \, \cdots \, \gamma_c}}.
$$

To perform contractions as specified in 3.35, one sums products across a pair of indices, one upper and one lower (which may belong to different tensors in a tensor product). For example, one could contract $\alpha_a$ with the index $\delta_1$ in the product $t \otimes g$ to obtain

$$
{\large t_{\beta_1 \, \cdots \, \beta_b}^{\alpha_1 \, \cdots \, \alpha_{a - 1}} \,
g_{\delta_2 \, \cdots \, \delta_d}^{\gamma_1 \, \cdots \, \gamma_c}} =
\sum_k \,
{\large t_{\beta_1 \, \cdots \, \beta_b}^{\alpha_1 \, \cdots \, \alpha_{a - 1} \, k} \,
g_{k \, \delta_2 \, \cdots \, \delta_d}^{\gamma_1 \, \cdots \, \gamma_c}}.
$$

Note that in the homogeneous case, one can always do this for any upper-lower index pair. The heterogeneous case just requires knowing which upper indices can be contracted with which lower indices (which can be done by tracking which index pairs correspond to duals of the same vector space). Finally, tensors of equal type may be summed entrywise,

$$
{\large t_{\beta_1 \, \cdots \, \beta_b}^{\alpha_1 \, \cdots \, \alpha_{a}}} +
{\large q_{\beta_1 \, \cdots \, \beta_b}^{\alpha_1 \, \cdots \, \alpha_{a}}} =
{\large h_{\beta_1 \, \cdots \, \beta_b}^{\alpha_1 \, \cdots \, \alpha_{a}}},
$$

with the understanding that the tensor product is distributive over tensor sums. This completes the system that Ricci-Curbastro outlined, allowing an axiomatic treatment of tensors in coordinates. The convention of dropping the summation symbol in contractions, leaving it implied whenever an index symbol appears twice, was introduced and popularized by Einstein (effectively his only contribution to [Ricci calculus](https://en.wikipedia.org/wiki/Ricci_calculus)):

$$
{\large t_{\beta_1 \, \cdots \, \beta_b}^{\alpha_1 \, \cdots \, \alpha_{a - 1} \, k} \,
g_{k \, \delta_2 \, \cdots \, \delta_d}^{\gamma_1 \, \cdots \, \gamma_c}} =
\sum_k \,
{\large t_{\beta_1 \, \cdots \, \beta_b}^{\alpha_1 \, \cdots \, \alpha_{a - 1} \, k} \,
g_{k \, \delta_2 \, \cdots \, \delta_d}^{\gamma_1 \, \cdots \, \gamma_c}}.
$$

Note that at any point we can find an entry in the matrix representation by replacing all indices with values. This means that Einstein notation is simply element-wise treatment of tensor matrices, with careful consideration of contraction compatibility by use of upper and lower indices. Here, "trades" translate to index raising or lowering, where (depending on context) one may be allowed to

$$
{\large t_{\beta_1 \, \cdots \, \beta_b}^{\alpha_1 \, \cdots \, \alpha_a}} \xrightarrow{\text{raise } \beta_b}
{\large t_{\beta_1 \, \cdots \, \beta_{b - 1}}^{\alpha_1 \, \cdots \, \alpha_a \, \beta_b}} \xrightarrow{\text{lower } \beta_b}
{\large t_{\beta_1 \, \cdots \, \beta_b}^{\alpha_1 \, \cdots \, \alpha_a}}.
$$

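In coordinates, these "trades" are themselves contractions against a metric tensor and its inverse; a minimal numerical sketch (the metric below is hypothetical):

```python
import numpy as np

# lowering an index with a metric g: v_i = g_ij v^j
g = np.array([[2.0, 0.0],
              [0.0, 3.0]])   # hypothetical metric (invertible)
v_up = np.array([1.0, 1.0])  # components with an upper index

v_down = np.einsum('ij,j->i', g, v_up)        # lower the index
g_inv = np.linalg.inv(g)
v_back = np.einsum('ij,j->i', g_inv, v_down)  # raise it again
```

Raising after lowering (or vice versa) is the identity, which is what makes the trade lossless.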
If allowed to raise or lower indices (which is context-dependent as seen in 3.36 and 3.40), a contraction can be done on two lower or upper indices. So summation (and legality of index lowering or raising) remains implied in cases where $V \cong V^\*$ canonically (which Penrose called "cartesian" in ATS). For example,

$$
{\large t_{\beta_1 \, \cdots \, \beta_b \, k}^{\alpha_1 \, \cdots \, \alpha_{a}} \,
g_{\delta_1 \, \cdots \, \delta_d \, k}^{\gamma_1 \, \cdots \, \gamma_c}} =
\sum_k \,
{\large t_{\beta_1 \, \cdots \, \beta_b \, k}^{\alpha_1 \, \cdots \, \alpha_{a}} \,
g_{\delta_1 \, \cdots \, \delta_d \, k}^{\gamma_1 \, \cdots \, \gamma_c}}.
$$

{{% /hint %}}
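To make 3.41 concrete, here is a small numerical sketch (not from the text): for two type-$(1, 1)$ tensors, forming the tensor product and then contracting the matched upper-lower pair recovers ordinary matrix multiplication.

```python
import numpy as np

rng = np.random.default_rng(0)

# t and g are type (1, 1) tensors: one upper and one lower index each
t = rng.normal(size=(3, 3))
g = rng.normal(size=(3, 3))

# tensor product: an entry for every choice of all four indices
tg = np.einsum('ab,cd->abcd', t, g)  # tg[a, b, c, d] = t[a, b] * g[c, d]

# contract t's lower index with g's upper index: sum_k t^a_k g^k_d,
# which is exactly matrix multiplication
contracted = np.einsum('akkd->ad', tg)
```

Contracting a different pair (or none) yields other maps built from the same product; this is the bookkeeping that Einstein notation automates.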
Notice that the "reduction" done by way of summation in a tensor contraction can use any operation with the effect of [folding](https://en.wikipedia.org/wiki/Fold_(higher-order_function)) a collection (returning a single value from multiple values of the same type), although this may break any and all algebraic invariants (depending on context). This could look like

$$
w_{abc}^{ik} = \text{reduce}_j \left( v_{a b c}^{i j k} \right).
$$

Since the user is giving up all algebraic guarantees, they may annihilate any index (upper or lower). This is simply treating the tensor matrix as data. But it shows that Einstein notation maintains its richness beyond pure tensor algebra, and is even generalizable to infinite-dimensional tensors via wise choices of $\text{reduce}$.
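Concretely (a sketch with arbitrary axis sizes), such a generalized reduction is just a fold over one axis of the tensor matrix:

```python
import numpy as np

rng = np.random.default_rng(0)
v = rng.normal(size=(2, 3, 4, 5, 6, 7))  # axes for indices i, j, k, a, b, c

w_sum = v.sum(axis=1)  # the usual contraction-style reduction over j
w_max = v.max(axis=1)  # any other fold also annihilates j, algebra aside
```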
{{% hint title="3.42. Example" %}}
The cornerstone of the [transformer model](https://en.wikipedia.org/wiki/Transformer_(deep_learning_architecture)) is generally regarded to be the [attention mechanism](https://en.wikipedia.org/wiki/Attention_(machine_learning)). This is no more than a tensor-valued function $\mathbf{A} : \mathbb{R}^{d \times n} \to \mathbb{R}^{d_v \times n}$ with tunable parameters, generally defined as

$$
\mathbf{A}(X) = VX \, \text{softmax}\left(\frac{(KX)^\top (QX)}{\sqrt{d_k}}\right).
$$

Here, $X \in \mathbb{R}^{d \times n}$, and the matrices $Q \in \mathbb{R}^{d_k \times d}$ (query projection matrix), $K \in \mathbb{R}^{d_k \times d}$ (key projection matrix), and $V \in \mathbb{R}^{d_v \times d}$ (value projection matrix) are optimized. It is rarely written in Einstein notation as an equation of heterogeneous tensors, though that form helps when $X$ has many channels to mix (indexed by $c$),

$$
\mathbf{A}(X)_{t}^{ic} = \text{softmax} \left(\frac{g_{jm} K_p^j X_s^{pc} Q_q^m X_t^{qc}}{\sqrt{d_k}}\right) V_k^i X_s^{kc}.
$$

I replaced the row-wise dot product application $(KX)^\top (QX)$ with an explicit metric tensor $g_{jm}$, which brings a subtle geometric interpretation to the foreground just by way of notation. Observing this could have motivated us to add another metric tensor $m_{c_1 c_2}$ to capture channel-wise relationships while mixing,

$$
\mathbf{A}(X)_{t}^{ic} = \text{softmax} \left(\frac{g_{jm} m_{c_1 c_2} K_p^j X_s^{pc_1} Q_q^m X_t^{qc_2}}{\sqrt{d_k}}\right) V_k^i X_s^{kc}.
$$

With the [`einsum` NumPy API](https://numpy.org/doc/stable/reference/generated/numpy.einsum.html), we can write the above in Python still using Einstein notation. For simplicity, only a single-channel $X$ will be considered. Below, the strategy for determining the tensor `g` is left unspecified.
{{< highlight python "lineNos=true, lineNoStart=1" >}}
import numpy as np

# einsum treats all indices as lower (or upper)
# i.e. assumes raising/lowering is generally OK

KX = np.einsum('jk,ks->js', K, X) # keys
QX = np.einsum('ml,lt->mt', Q, X) # queries

# attention scores (similarity under g)
sim = np.einsum('jm,js,mt->st', g, KX, QX)

scores = sim / np.sqrt(Q.shape[0])
weights = np.exp(scores - np.max(scores, axis=0, keepdims=True))
weights = weights / np.sum(weights, axis=0, keepdims=True)

VX = np.einsum('ij,jt->it', V, X) # values
result = np.einsum('is,st->it', VX, weights)
{{< /highlight >}}
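As a quick sanity check (not part of the post: the dimensions below are arbitrary, and `g` is taken to be the identity, which recovers ordinary dot-product scores), the snippet runs end-to-end on random data:

```python
import numpy as np

rng = np.random.default_rng(0)
d, n, d_k, d_v = 8, 5, 4, 6          # hypothetical dimensions
X = rng.normal(size=(d, n))          # n tokens as columns
K = rng.normal(size=(d_k, d))
Q = rng.normal(size=(d_k, d))
V = rng.normal(size=(d_v, d))
g = np.eye(d_k)                      # identity metric: plain dot-product attention

KX = np.einsum('jk,ks->js', K, X)
QX = np.einsum('ml,lt->mt', Q, X)
sim = np.einsum('jm,js,mt->st', g, KX, QX)
scores = sim / np.sqrt(Q.shape[0])
weights = np.exp(scores - np.max(scores, axis=0, keepdims=True))
weights = weights / np.sum(weights, axis=0, keepdims=True)
VX = np.einsum('ij,jt->it', V, X)
result = np.einsum('is,st->it', VX, weights)
```

With the identity metric, `sim` coincides with $(KX)^\top (QX)$, and each column of `weights` sums to one.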
{{% /hint %}}
#### Overview
We began this section by organizing the concepts of matrices, linear maps, and vectors. We then took a look at linear forms and dual spaces, being careful with infinite-dimensional cases. Later, we expanded the concept of linearity to multilinear maps, and showed how the tensor product "linearizes" them. That way, we saw how our early organization of matrices, vectors, and linear maps carried over to tensor product spaces in very natural ways. Finally, tensors allowed us to speak about all these objects as unified under a single umbrella woven by tight isomorphisms.

Although this trajectory was predominantly abstract, we were also exposed to practical notation and uses of tensors in coordinates, with examples from the early history of tensors in differential geometry and from modern applications in deep learning. These examples showed that tensors can be found at many points of the spectrum between raw data and abstract algebraic objects, and that their theory is robust to a wide range of uses -- some more disrespectful than others, but never enough to erase their basic qualities.

{{< hcenter >}}
{{< figure src="mark-wilson-e67109-2012.png" width="512" caption="Mark Wilson, \'e67109\' (2012)" >}}
{{< /hcenter >}}
### Signals and Systems