Add more matrices concepts

jnonino · jnonino · commit 2b217c4098b4 · 2026-03-20T23:17:27.000Z
diff --git a/content/ai/math/algebra/matrices/index.en.md b/content/ai/math/algebra/matrices/index.en.md
@@ -532,3 +532,120 @@ $$
 For an \(m \times n\) matrix: \(\text{rank}(\mathbf{A}) \leq \min(m, n)\).
 
 When \(\text{rank}(\mathbf{A}) = \min(m,n)\), \(\mathbf{A}\) is **full rank**. Otherwise it is **rank-deficient** and maps a nonzero subspace of inputs to zero.
+
+### The Rank-Nullity theorem
+
+This theorem elegantly connects the three fundamental subspaces.
+
+For any matrix \(\mathbf{A} \in \mathbb{R}^{m \times n}\):
+
+$$
+\boxed{\text{rank}(\mathbf{A}) + \text{nullity}(\mathbf{A}) = n}
+$$
+
+where \(\text{nullity}(\mathbf{A}) = \dim(\text{null}(\mathbf{A}))\).
+
+{{< details title="Proof sketch" closed="true" >}}
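+Let \(k = \text{nullity}(\mathbf{A})\) and let \(\{\mathbf{v}_1, \ldots, \mathbf{v}_k\}\) be a basis for \(\text{null}(\mathbf{A})\). Extend it to a basis of \(\mathbb{R}^n\) by adding \(n - k\) vectors \(\{\mathbf{w}_1, \ldots, \mathbf{w}_{n-k}\}\), which can be chosen from the row space of \(\mathbf{A}\) (the orthogonal complement of the null space). The images \(\{\mathbf{A}\mathbf{w}_1, \ldots, \mathbf{A}\mathbf{w}_{n-k}\}\) span \(\text{col}(\mathbf{A})\), because every \(\mathbf{A}\mathbf{v}_i = \mathbf{0}\). They are also linearly independent: if \(\sum_j c_j \mathbf{A}\mathbf{w}_j = \mathbf{0}\), then \(\sum_j c_j \mathbf{w}_j\) lies in \(\text{null}(\mathbf{A})\), which forces every \(c_j = 0\) because the \(\mathbf{v}_i\) and \(\mathbf{w}_j\) together form a basis. So \(\text{rank}(\mathbf{A}) = \dim(\text{col}(\mathbf{A})) = n - k\), which rearranges to \(\text{rank}(\mathbf{A}) + \text{nullity}(\mathbf{A}) = n\).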
+
+{{< callout type="info" >}}
+In plain English: the \(n\) input dimensions split cleanly into two complementary parts. The row space (\(r\) dimensions) contains the directions that reach the output: every nonzero vector in it is mapped to a nonzero output. The null space (\(n - r\) dimensions) contains the directions that are mapped to zero. Every input is a unique sum of one component from each part, which is why the two dimensions must add up to exactly \(n\).
+{{< /callout >}}
+{{< /details >}}
+
+{{< callout >}}
+In ML terms: if you have a weight matrix \(\mathbf{W}\) of shape \(m \times n\) with \(n > m\) (a layer with more inputs than outputs), then \(\text{nullity}(\mathbf{W}) \geq n - m > 0\). There is a whole subspace of input directions that produce zero change in the layer's output. This kind of redundancy is one intuition for why large models can often be pruned and compressed: some directions contribute nothing to the result.
+{{< /callout >}}
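The theorem can be checked numerically. The sketch below is illustrative only: the helper `null_space_basis` and the tolerance `1e-10` are assumptions made for this example, not part of the original text. It counts the null-space directions independently of the rank, so the final sum is a genuine check rather than a restatement.

```python
import numpy as np

def null_space_basis(A, tol=1e-10):
    """Orthonormal basis (as columns) for null(A), taken from the SVD."""
    A = np.asarray(A, dtype=float)
    # full_matrices=True returns all n right singular vectors as rows of Vt.
    _, _, Vt = np.linalg.svd(A, full_matrices=True)
    # Keep the right singular vectors that A maps (numerically) to zero.
    mask = np.linalg.norm(A @ Vt.T, axis=0) < tol
    return Vt[mask].T

A = np.array([[1.0, 2.0, 3.0, 4.0],
              [2.0, 4.0, 6.0, 8.0],   # 2 * row 0, so it adds no rank
              [0.0, 1.0, 0.0, 1.0]])

rank = np.linalg.matrix_rank(A)
N = null_space_basis(A)
nullity = N.shape[1]

print("rank:", rank, "nullity:", nullity, "n:", A.shape[1])
print("A @ N is zero:", np.allclose(A @ N, 0.0))
```

Here \(m = 3\) and \(n = 4\), and the duplicated row makes the matrix rank-deficient, so the null space is larger than the minimum \(n - m\).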
+
+### The inverse
+
+For a **square** matrix \(\mathbf{A} \in \mathbb{R}^{n \times n}\), the **inverse** \(\mathbf{A}^{-1}\) is the unique matrix satisfying:
+
+$$
+\mathbf{A}\mathbf{A}^{-1} = \mathbf{A}^{-1}\mathbf{A} = \mathbf{I}_n
+$$
+
+The **invertible matrix theorem** states that the following conditions are all equivalent: *if any one holds, all hold, and the inverse exists*.
+
+- \(\text{rank}(\mathbf{A}) = n\) (full rank)
+- \(\det(\mathbf{A}) \neq 0\)
+- \(\text{null}(\mathbf{A}) = \{\mathbf{0}\}\) (trivial null space)
+- The columns of \(\mathbf{A}\) are linearly independent
+- The rows of \(\mathbf{A}\) are linearly independent
+- The equation \(\mathbf{A}\mathbf{x} = \mathbf{b}\) has a unique solution for every \(\mathbf{b} \in \mathbb{R}^n\)
+
+{{< callout type="important" >}}
+All six conditions are equivalent ways of saying the same thing: the transformation is fully reversible. If the matrix collapses any dimension, the null space is nontrivial, the determinant is zero, and you cannot undo the transformation. All roads lead to the same conclusion: invertibility is an all-or-nothing property.
+{{< /callout >}}
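A small numerical illustration of the equivalence, under stated assumptions: the helper `invertibility_checks` and the tolerance are hypothetical names chosen for this sketch, and the determinant test is only reasonable here because the matrices are tiny and well scaled.

```python
import numpy as np

def invertibility_checks(A, tol=1e-10):
    """Evaluate several conditions of the invertible matrix theorem for square A."""
    A = np.asarray(A, dtype=float)
    n = A.shape[0]
    smallest_sv = np.linalg.svd(A, compute_uv=False).min()
    return {
        "full_rank": np.linalg.matrix_rank(A) == n,
        "nonzero_det": abs(np.linalg.det(A)) > tol,
        "trivial_null_space": smallest_sv > tol,   # no direction is sent to zero
        "independent_rows": np.linalg.matrix_rank(A.T) == n,
    }

invertible = np.array([[2.0, 1.0],
                       [1.0, 3.0]])
singular = np.array([[1.0, 2.0],
                     [2.0, 4.0]])   # second row is 2 * first row

print(invertibility_checks(invertible))
print(invertibility_checks(singular))
```

For each matrix the checks agree with one another: they all pass or they all fail.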
+
+Properties of the inverse:
+- *Double inverse*: \((\mathbf{A}^{-1})^{-1} = \mathbf{A}\)
+- *Product inverse*: \((\mathbf{AB})^{-1} = \mathbf{B}^{-1}\mathbf{A}^{-1}\) (order reverses, just as with the transpose)
+- *Transpose-inverse commutativity*: \((\mathbf{A}^\top)^{-1} = (\mathbf{A}^{-1})^\top\)
+- *Determinant*: \(\det(\mathbf{A}^{-1}) = \frac{1}{\det(\mathbf{A})}\)
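The four properties can be spot-checked with NumPy. This is a minimal sketch; the matrices `A` and `B` are arbitrary invertible examples chosen for illustration.

```python
import numpy as np

A = np.array([[2.0, 1.0, 0.0],
              [1.0, 3.0, 1.0],
              [0.0, 1.0, 4.0]])
B = np.array([[1.0, 2.0, 0.0],
              [0.0, 1.0, 3.0],
              [4.0, 0.0, 1.0]])

A_inv = np.linalg.inv(A)
B_inv = np.linalg.inv(B)

# Double inverse: inverting twice returns the original matrix.
double_inverse_ok = np.allclose(np.linalg.inv(A_inv), A)
# Product inverse: the order of the factors reverses.
product_inverse_ok = np.allclose(np.linalg.inv(A @ B), B_inv @ A_inv)
# Transpose and inverse can be applied in either order.
transpose_inverse_ok = np.allclose(np.linalg.inv(A.T), A_inv.T)
# The determinant of the inverse is the reciprocal of the determinant.
determinant_ok = np.isclose(np.linalg.det(A_inv), 1.0 / np.linalg.det(A))

print(double_inverse_ok, product_inverse_ok, transpose_inverse_ok, determinant_ok)
```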
+
+#### Computing the inverse
+
+For \(2 \times 2\) matrices, e.g. \(\mathbf{A} = \begin{bmatrix} a & b \\ c & d \end{bmatrix}\) with \(\det(\mathbf{A}) = ad - bc \neq 0\), the inverse has a closed-form formula that can be derived directly from the definition.
+
+We seek \(\mathbf{A}^{-1} = \begin{bmatrix} p & q \\ r & s \end{bmatrix}\) such that \(\mathbf{A}\mathbf{A}^{-1} = \mathbf{I}\):
+
+$$
+\begin{aligned}
+  \begin{bmatrix} a & b \\ c & d \end{bmatrix}\begin{bmatrix} p & q \\ r & s \end{bmatrix} &= \begin{bmatrix} 1 & 0 \\ 0 & 1 \end{bmatrix} \\
+  \begin{bmatrix} ap+br & aq+bs \\ cp+dr & cq+ds \end{bmatrix} &= \begin{bmatrix} 1 & 0 \\ 0 & 1 \end{bmatrix}
+\end{aligned}
+$$
+
+From the first column: \(ap + br = 1\) and \(cp + dr = 0\). Solving:
+$$
+\begin{aligned}
+  p &= \frac{d}{ad-bc} \\
+  r &= \frac{-c}{ad-bc}
+\end{aligned}
+$$
+
+From the second column: \(aq + bs = 0\) and \(cq + ds = 1\). Solving:
+$$
+\begin{aligned}
+  q &= \frac{-b}{ad-bc} \\
+  s &= \frac{a}{ad-bc}
+\end{aligned}
+$$
+
+Therefore:
+$$
+\begin{aligned}
+   \mathbf{A}^{-1} &= \begin{bmatrix} p & q \\ r & s \end{bmatrix} \\
+   \mathbf{A}^{-1} &= \begin{bmatrix} \frac{d}{ad-bc} & \frac{-b}{ad-bc} \\ \frac{-c}{ad-bc} & \frac{a}{ad-bc} \end{bmatrix} \\
+   \mathbf{A}^{-1} &= \frac{1}{ad - bc} \begin{bmatrix} d & -b \\ -c & a \end{bmatrix}
+\end{aligned}
+$$
+
+Since \(\det(\mathbf{A}) = ad - bc\), this can be written as:
+
+$$
+\boxed{\mathbf{A}^{-1} = \frac{1}{\det(\mathbf{A})} \begin{bmatrix} d & -b \\ -c & a \end{bmatrix}}
+$$
+
+{{< callout type="info" >}}
+To invert a \(2 \times 2\) matrix, swap the diagonal entries, negate the off-diagonal entries, and divide everything by the determinant. The determinant appears in the denominator, which is exactly why a zero determinant makes the inverse undefined (division by zero).
+{{< /callout >}}
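The boxed formula translates almost directly into code. The function name `inverse_2x2` is a hypothetical helper for this sketch, and the exact `det == 0` comparison simply mirrors the formula; it is not a robust singularity test for floating-point input.

```python
import numpy as np

def inverse_2x2(A):
    """Invert a 2x2 matrix: swap the diagonal, negate the off-diagonal, divide by det."""
    (a, b), (c, d) = A
    det = a * d - b * c
    if det == 0:
        raise ValueError("matrix is singular: ad - bc = 0")
    return np.array([[d, -b],
                     [-c, a]]) / det

A = np.array([[4.0, 7.0],
              [2.0, 6.0]])        # det = 4*6 - 7*2 = 10
A_inv = inverse_2x2(A)

print(A_inv)
print(np.allclose(A @ A_inv, np.eye(2)))
```

The result can be compared against `np.linalg.inv(A)` to confirm that both agree.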
+
+For \(n > 2\), the inverse can be computed by hand via [**Gauss-Jordan elimination**](https://en.wikipedia.org/wiki/Gaussian_elimination) on the augmented matrix \([\mathbf{A}\,|\,\mathbf{I}]\); numerical libraries typically rely on an LU factorization instead.
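A minimal sketch of Gauss-Jordan elimination on the augmented matrix, written for illustration rather than production use. The function name, the partial-pivoting step, and the tolerance `1e-12` are choices made for this example.

```python
import numpy as np

def gauss_jordan_inverse(A, tol=1e-12):
    """Invert a square matrix by row-reducing [A | I] to [I | A^-1]."""
    A = np.asarray(A, dtype=float)
    n = A.shape[0]
    if A.shape != (n, n):
        raise ValueError("A must be square")
    aug = np.hstack([A, np.eye(n)])
    for col in range(n):
        # Partial pivoting: use the largest entry at or below the diagonal.
        pivot = col + int(np.argmax(np.abs(aug[col:, col])))
        if abs(aug[pivot, col]) < tol:
            raise ValueError("matrix is singular to working precision")
        aug[[col, pivot]] = aug[[pivot, col]]
        # Scale the pivot row so the pivot becomes 1.
        aug[col] /= aug[col, col]
        # Clear the pivot column in every other row.
        for row in range(n):
            if row != col:
                aug[row] -= aug[row, col] * aug[col]
    return aug[:, n:]

A = np.array([[2.0, 1.0, 0.0],
              [1.0, 3.0, 1.0],
              [0.0, 1.0, 4.0]])
A_inv = gauss_jordan_inverse(A)
print(np.allclose(A @ A_inv, np.eye(3)))
```

Once the left half of the augmented matrix has been reduced to the identity, the right half holds the inverse.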
+
+## Machine Learning and AI perspective
+
+Matrix operations are not merely computational primitives; they are the conceptual vocabulary of modern Machine Learning research. For example:
+
+**Low-rank structure in fine-tuning.** [Hu et al. (2021)](https://arxiv.org/abs/2106.09685) introduced *LoRA (Low-Rank Adaptation)*, based on the observation that the weight update matrices \(\Delta\mathbf{W}\) learned during fine-tuning of large language models are empirically low-rank. Rather than storing the full \(\Delta\mathbf{W} \in \mathbb{R}^{d \times d}\), LoRA decomposes it as \(\Delta\mathbf{W} = \mathbf{B}\mathbf{A}\) where \(\mathbf{B} \in \mathbb{R}^{d \times r}\), \(\mathbf{A} \in \mathbb{R}^{r \times d}\), with \(r \ll d\). This works because any matrix of rank at most \(r\) factors into two thin matrices, which is exactly the rank concept we just derived. The paper reports a reduction in trainable parameters of about 10,000x on GPT-3 175B, and the entire insight rests on rank intuition.
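The factorization and its parameter savings can be illustrated with NumPy. The sizes `d = 512` and `r = 8` are hypothetical, and both factors are drawn at random here only so that the rank of the product can be inspected; this is not how LoRA initializes or trains them.

```python
import numpy as np

d, r = 512, 8                      # hypothetical layer width and LoRA rank
rng = np.random.default_rng(0)

B = rng.normal(size=(d, r))        # thin factor, d x r
A = rng.normal(size=(r, d))        # thin factor, r x d
delta_W = B @ A                    # full d x d update, rank at most r

full_params = d * d                # parameters in a dense update
lora_params = d * r + r * d        # parameters in the two thin factors
update_rank = np.linalg.matrix_rank(delta_W)

print("rank of update:", update_rank)
print("parameter ratio:", full_params / lora_params)
```

The ratio \(d^2 / (2dr) = d / (2r)\) grows with the layer width, which is why the savings become so large for very wide layers.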
+
+## Common pitfalls and debugging
+
+1. **Confusing shape conventions across frameworks**. NumPy code commonly stores data as `(batch, features)`. PyTorch's `nn.Linear(in_features, out_features)` stores its weight matrix with shape `(out_features, in_features)` and computes `x @ W.T + b` internally. So \(\mathbf{W}\) has the shape used in the column-vector form \(\mathbf{y} = \mathbf{W}\mathbf{x}\), but it is transposed relative to the row-vector form `x @ W`. If you manually initialize weights, verify with a forward pass on dummy data before trusting gradients.
+
+2. **Inverting nearly singular matrices**. `np.linalg.inv()` returns a result even for near-singular matrices, but the entries can be astronomically large and numerically meaningless. Prefer `np.linalg.solve(A, b)` over `np.linalg.inv(A) @ b` for solving linear systems. Check `np.linalg.cond(A)` before inverting; in double precision, condition numbers above \(10^{12}\) are a red flag.
+
+3. **Testing `det == 0` for singularity**. In floating point, the computed determinant of a truly singular matrix is often not exactly zero; it may be some very small number like `1e-17`. The determinant also scales with the size of the entries, so a tiny value does not by itself indicate singularity. Do not use the determinant as a singularity test in code. Use `np.linalg.matrix_rank(A) < n` or `np.linalg.cond(A) > threshold` instead.
+
+4. **Forgetting that matrix multiplication is non-commutative**. This is a common algebraic error when implementing attention from scratch: in general \(\mathbf{AB} \neq \mathbf{BA}\). Even when both products are well-defined and have the same shape (square matrices), they usually differ. When in doubt, track shapes explicitly and reason about the semantics of each multiplication.
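The four pitfalls above can be reproduced in a few lines of NumPy. This is a sketch under stated assumptions: all sizes and matrices are made up for illustration, and the first part only imitates the `nn.Linear` storage layout with plain arrays rather than calling PyTorch.

```python
import numpy as np

# Pitfall 1: weights stored as (out_features, in_features), as nn.Linear does.
batch, in_features, out_features = 4, 3, 2        # hypothetical sizes
x = np.ones((batch, in_features))
W = np.full((out_features, in_features), 0.5)
b = np.zeros(out_features)
y = x @ W.T + b                    # x @ W would raise a shape error here
print("output shape:", y.shape)

# Pitfall 2: check conditioning first, and solve instead of inverting.
A_bad = np.array([[1.0, 1.0],
                  [1.0, 1.0 + 1e-13]])             # nearly singular
cond_bad = np.linalg.cond(A_bad)
print("condition number:", cond_bad)

A_ok = np.array([[2.0, 1.0],
                 [1.0, 3.0]])
rhs = np.array([3.0, 5.0])
x_sol = np.linalg.solve(A_ok, rhs)                 # preferred over inv(A_ok) @ rhs
print("solution:", x_sol)

# Pitfall 3: an exactly singular matrix (row 2 = 2 * row 1 - row 0).
S = np.array([[1.0, 2.0, 3.0],
              [4.0, 5.0, 6.0],
              [7.0, 8.0, 9.0]])
print("computed det:", np.linalg.det(S))           # may not print exactly 0.0
rank_S = np.linalg.matrix_rank(S)
print("rank:", rank_S, "singular:", rank_S < S.shape[0])

# Pitfall 4: same shapes, different products.
P = np.array([[1, 2],
              [3, 4]])
Q = np.array([[0, 1],
              [1, 0]])
print(P @ Q)                       # swaps the columns of P
print(Q @ P)                       # swaps the rows of P
```

The rank test flags `S` as singular regardless of what the determinant prints, which is the reason to prefer it.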
diff --git a/content/ai/math/algebra/matrices/index.es.md b/content/ai/math/algebra/matrices/index.es.md
@@ -541,3 +541,120 @@ $$
 Para una matriz \(m \times n\): \(\text{rango}(\mathbf{A}) \leq \min(m, n)\).
 
 Cuando \(\text{rango}(\mathbf{A}) = \min(m,n)\), \(\mathbf{A}\) tiene **rango completo**. De lo contrario es **deficiente en rango** y mapea un subespacio no nulo de entradas a cero.
+
+### El teorema Rango-Nulidad
+
+Este teorema conecta elegantemente los tres subespacios fundamentales.
+
+Para cualquier matriz \(\mathbf{A} \in \mathbb{R}^{m \times n}\):
+
+$$
+\boxed{\text{rango}(\mathbf{A}) + \text{nulidad}(\mathbf{A}) = n}
+$$
+
+donde \(\text{nulidad}(\mathbf{A}) = \dim(\text{null}(\mathbf{A}))\).
+
+{{< details title="Esquema de demostración" closed="true" >}}
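+Sea \(k = \text{nulidad}(\mathbf{A})\) y sea \(\{\mathbf{v}_1, \ldots, \mathbf{v}_k\}\) una base de \(\text{null}(\mathbf{A})\). Extiéndela a una base de \(\mathbb{R}^n\) añadiendo \(n - k\) vectores \(\{\mathbf{w}_1, \ldots, \mathbf{w}_{n-k}\}\), que pueden elegirse del espacio fila de \(\mathbf{A}\) (el complemento ortogonal del espacio nulo). Las imágenes \(\{\mathbf{A}\mathbf{w}_1, \ldots, \mathbf{A}\mathbf{w}_{n-k}\}\) generan \(\text{col}(\mathbf{A})\), porque todo \(\mathbf{A}\mathbf{v}_i = \mathbf{0}\). Además son linealmente independientes: si \(\sum_j c_j \mathbf{A}\mathbf{w}_j = \mathbf{0}\), entonces \(\sum_j c_j \mathbf{w}_j\) pertenece a \(\text{null}(\mathbf{A})\), lo que obliga a que todo \(c_j = 0\) porque los \(\mathbf{v}_i\) y los \(\mathbf{w}_j\) forman juntos una base. Por tanto \(\text{rango}(\mathbf{A}) = \dim(\text{col}(\mathbf{A})) = n - k\), es decir, \(\text{rango}(\mathbf{A}) + \text{nulidad}(\mathbf{A}) = n\).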
+
+{{< callout type="info" >}}
+En otras palabras: las \(n\) dimensiones de entrada se dividen limpiamente en dos partes complementarias. El espacio fila (\(r\) dimensiones) contiene las direcciones que llegan a la salida: todo vector no nulo de este espacio se mapea a una salida no nula. El espacio nulo (\(n - r\) dimensiones) contiene las direcciones que se mapean a cero. Toda entrada es la suma única de una componente de cada parte, por eso las dos dimensiones deben sumar exactamente \(n\).
+{{< /callout >}}
+{{< /details >}}
+
+{{< callout >}}
+En términos de ML: si tienes una matriz de pesos \(\mathbf{W}\) de forma \(m \times n\) con \(n > m\) (una capa con más entradas que salidas), entonces \(\text{nulidad}(\mathbf{W}) \geq n - m > 0\). Existe todo un subespacio de direcciones de entrada que produce cambio cero en la salida de la capa. Este tipo de redundancia es una intuición de por qué los modelos grandes a menudo pueden podarse y comprimirse: algunas direcciones no aportan nada al resultado.
+{{< /callout >}}
+
+### La inversa
+
+Para una matriz **cuadrada** \(\mathbf{A} \in \mathbb{R}^{n \times n}\), la **inversa** \(\mathbf{A}^{-1}\) es la única matriz que satisface:
+
+$$
+\mathbf{A}\mathbf{A}^{-1} = \mathbf{A}^{-1}\mathbf{A} = \mathbf{I}_n
+$$
+
+El **teorema de la matriz invertible** establece que las siguientes condiciones son todas equivalentes: *si cualquiera se cumple, todas se cumplen, y la inversa existe*.
+
+- \(\text{rango}(\mathbf{A}) = n\) (rango completo)
+- \(\det(\mathbf{A}) \neq 0\)
+- \(\text{null}(\mathbf{A}) = \{\mathbf{0}\}\) (espacio nulo trivial)
+- Las columnas de \(\mathbf{A}\) son linealmente independientes
+- Las filas de \(\mathbf{A}\) son linealmente independientes
+- La ecuación \(\mathbf{A}\mathbf{x} = \mathbf{b}\) tiene solución única para todo \(\mathbf{b} \in \mathbb{R}^n\)
+
+{{< callout type="important" >}}
+Las seis condiciones son formas equivalentes de decir lo mismo: la transformación es completamente reversible. Si la matriz colapsa alguna dimensión, el espacio nulo es no trivial, el determinante es cero y ya no puedes deshacer la transformación. Todos los caminos llevan a la misma conclusión: la invertibilidad es una propiedad de todo o nada.
+{{< /callout >}}
+
+Propiedades de la inversa:
+- *Doble inversa*: \((\mathbf{A}^{-1})^{-1} = \mathbf{A}\)
+- *Inversa del producto*: \((\mathbf{AB})^{-1} = \mathbf{B}^{-1}\mathbf{A}^{-1}\) (el orden se invierte, igual que con la transpuesta)
+- *Conmutatividad transpuesta-inversa*: \((\mathbf{A}^\top)^{-1} = (\mathbf{A}^{-1})^\top\)
+- *Determinante*: \(\det(\mathbf{A}^{-1}) = \frac{1}{\det(\mathbf{A})}\)
+
+#### Cálculo de la inversa
+
+Para matrices \(2 \times 2\), como \(\mathbf{A} = \begin{bmatrix} a & b \\ c & d \end{bmatrix}\) con \(\det(\mathbf{A}) = ad - bc \neq 0\), existe una fórmula explícita que se deriva de la definición de inversa y el cálculo del determinante.
+
+Buscamos \(\mathbf{A}^{-1} = \begin{bmatrix} p & q \\ r & s \end{bmatrix}\) tal que \(\mathbf{A}\mathbf{A}^{-1} = \mathbf{I}\):
+
+$$
+\begin{aligned}
+  \begin{bmatrix} a & b \\ c & d \end{bmatrix}\begin{bmatrix} p & q \\ r & s \end{bmatrix} &= \begin{bmatrix} 1 & 0 \\ 0 & 1 \end{bmatrix} \\
+  \begin{bmatrix} ap+br & aq+bs \\ cp+dr & cq+ds \end{bmatrix} &= \begin{bmatrix} 1 & 0 \\ 0 & 1 \end{bmatrix}
+\end{aligned}
+$$
+
+De la primera columna tenemos que: \(ap + br = 1\) y \(cp + dr = 0\). Resolviendo, encontramos que:
+$$
+\begin{aligned}
+  p &= \frac{d}{ad-bc} \\
+  r &= \frac{-c}{ad-bc}
+\end{aligned}
+$$
+
+De la segunda columna: \(aq + bs = 0\) y \(cq + ds = 1\). Resolviendo:
+$$
+\begin{aligned}
+  q &= \frac{-b}{ad-bc} \\
+  s &= \frac{a}{ad-bc}
+\end{aligned}
+$$
+
+Reemplazando:
+$$
+\begin{aligned}
+   \mathbf{A}^{-1} &= \begin{bmatrix} p & q \\ r & s \end{bmatrix} \\
+   \mathbf{A}^{-1} &= \begin{bmatrix} \frac{d}{ad-bc} & \frac{-b}{ad-bc} \\ \frac{-c}{ad-bc} & \frac{a}{ad-bc} \end{bmatrix} \\
+   \mathbf{A}^{-1} &= \frac{1}{ad - bc} \begin{bmatrix} d & -b \\ -c & a \end{bmatrix}
+\end{aligned}
+$$
+
+Sabiendo que \(\det(\mathbf{A}) = ad - bc\), entonces:
+
+$$
+\boxed{\mathbf{A}^{-1} = \frac{1}{\det(\mathbf{A})} \begin{bmatrix} d & -b \\ -c & a \end{bmatrix}}
+$$
+
+{{< callout type="info" >}}
+Para invertir una matriz \(2 \times 2\), intercambia las entradas diagonales, niega las entradas fuera de la diagonal, y divide todo por el determinante. El determinante aparece en el denominador, exactamente por eso un determinante nulo hace indefinida la inversa (división por cero).
+{{< /callout >}}
+
+Para \(n > 2\), la inversa puede calcularse a mano mediante [**eliminación de Gauss-Jordan**](https://es.wikipedia.org/wiki/Eliminaci%C3%B3n_de_Gauss-Jordan) sobre la matriz aumentada \([\mathbf{A}\,|\,\mathbf{I}]\); las bibliotecas numéricas suelen apoyarse en una factorización LU.
+
+## Perspectiva de Machine Learning e IA
+
+Las operaciones matriciales no son meros primitivos computacionales: son el vocabulario conceptual de la investigación moderna en Machine Learning. Por ejemplo:
+
+**Estructura de bajo rango en el ajuste fino**. [Hu et al. (2021)](https://arxiv.org/abs/2106.09685) introdujeron *LoRA (Low-Rank Adaptation)*, a partir de la observación de que las matrices de actualización de pesos \(\Delta\mathbf{W}\) aprendidas durante el ajuste fino de grandes modelos de lenguaje son empíricamente de bajo rango. En lugar de almacenar la \(\Delta\mathbf{W} \in \mathbb{R}^{d \times d}\) completa, LoRA la descompone como \(\Delta\mathbf{W} = \mathbf{B}\mathbf{A}\) donde \(\mathbf{B} \in \mathbb{R}^{d \times r}\), \(\mathbf{A} \in \mathbb{R}^{r \times d}\), con \(r \ll d\). Esto funciona porque toda matriz de rango a lo sumo \(r\) se factoriza en dos matrices delgadas, que es exactamente el concepto de rango que acabamos de ver. El artículo reporta una reducción de los parámetros entrenables de unas 10 000 veces en GPT-3 175B.
+
+## Errores comunes y depuración
+
+1. **Confundir las convenciones de forma entre frameworks**. El código NumPy suele almacenar los datos como `(batch, características)`. El `nn.Linear(in_features, out_features)` de PyTorch almacena su matriz de pesos con forma `(out_features, in_features)` y calcula `x @ W.T + b` internamente, por lo que \(\mathbf{W}\) tiene la forma de la convención de vectores columna \(\mathbf{y} = \mathbf{W}\mathbf{x}\), pero está transpuesta respecto a la forma de vectores fila `x @ W`. Si inicializas pesos manualmente, verifica con una pasada hacia adelante sobre datos ficticios antes de confiar en los gradientes.
+
+2. **Invertir matrices casi singulares**. `np.linalg.inv()` devuelve un resultado incluso para matrices casi singulares, pero las entradas pueden ser astronómicamente grandes y numéricamente sin sentido. Prefiere `np.linalg.solve(A, b)` sobre `np.linalg.inv(A) @ b` para resolver sistemas lineales. Revisa `np.linalg.cond(A)` antes de invertir; en doble precisión, los números de condición superiores a \(10^{12}\) son una señal de alerta.
+
+3. **Usar `det == 0` para detectar singularidad**. En punto flotante, el determinante calculado de una matriz verdaderamente singular a menudo no es exactamente cero; puede ser algún número muy pequeño como `1e-17`. No uses el determinante como prueba de singularidad en el código. En su lugar usa `np.linalg.matrix_rank(A) < n` o `np.linalg.cond(A) > umbral`.
+
+4. **Asumir que la multiplicación de matrices es conmutativa**. Es un error algebraico común al implementar atención desde cero: en general \(\mathbf{AB} \neq \mathbf{BA}\). Incluso cuando ambos productos están bien definidos y tienen la misma forma (matrices cuadradas), normalmente difieren. Ante la duda, rastrea las formas explícitamente y razona sobre la semántica de cada multiplicación.