---
layout: blog/post.liquid
title: "Back to basics: Neural network from scratch"
date: 2024-02-03
updated: 2024-02-03
templateEngineOverride: md
---

For class, I had to implement a neural network from scratch, using only basic PyTorch tensor operations. It was a great refresher on the math behind a neural network. Knowing the math, though, doesn't make implementing and visualizing it any easier.

> "You're good in math bro, you should learn machine learning bro"

<img src="https://media1.tenor.com/m/-URYSckgL9sAAAAd/get-out-of-my-head-meme.gif" alt="cat crazy like me fr" width=200 style="">

I write this to remind my future self of the intuition behind the math in the neural network, in case I want a refresher, or if I ever have to write a neural network from scratch again (when pigs start flying). I won't be discussing what each layer does or why we use certain layers, per se. I want to show how we derive the forward and backward passes (_especially_ the backward pass).

## Setup

The model we'll describe is a classification model. It takes in input(s) and outputs a probability distribution describing how likely the input is to belong to each class. It consists of the following layers, in order:

1. Linear
2. ReLU
3. Softmax

For our loss function, we'll use the negative log likelihood (NLL) function.

> **PLEASE NOTE** that this is for learning purposes. In practice, depending on what you want (say, for classification), you may not want a ReLU right before the final softmax.

> For classification, cross entropy loss is equivalent to having a softmax layer and using NLL loss. So, you can simplify softmax and NLL loss into a single cross entropy loss function. However, I prefer to keep them separate to make the math easier to grasp.

Typically, most guides describe the input $x$ to the model as a vector with $D_{in}$ input features. However, we usually train over multiple examples at once (a batch of size $N$). Thus, in this guide, we'll define the input to our model as a matrix with dimensions $N \times D_{in}$. Each row of the input is one training example.

Similarly, our model output will be an $N \times D_{out}$ matrix, where $D_{out}$ is the number of output features (the number of classes we can predict).

With the model's core dimensions defined, let's dive into the specifics of each layer and the forward pass of the neural network.

> Note, I'll define variables that are probably not very common in other tutorials. Idc. These variables made sense to me when I was first learning this.

## Linear layer

The linear layer is defined as $h = Wx + b$ (though we may rearrange the matrix multiplication to match dimensions). We will use $h$ to represent the output of the linear layer.

> I find it more intuitive to do $h = xW + b$, rather than having to do transposing and all that jazz. So I will do that. The actual order of matrix multiplication will depend on how you define your dimensions and whatever.

If we had _one training example_, our matrix multiplication would look something like this:

$$
\begin{bmatrix}
x_0 & ... & x_{D_{in}}
\end{bmatrix}
\begin{bmatrix}
W_{0, 0} & ... & W_{0, D_{out}} \\
... & ... & ... \\
W_{D_{in},0} & ... & W_{D_{in}, D_{out}}
\end{bmatrix}
+
\begin{bmatrix}
b_0 & ... & b_{D_{out}}
\end{bmatrix}
=
\begin{bmatrix}
h_0 & ... & h_{D_{out}}
\end{bmatrix}
$$

But remember, we typically do training in batches and have multiple examples. Thus, our input $x$ has dimensions $N \times D_{in}$. Each row is one training example, and we have $N$ examples. Likewise, the output of the layer, $h$, will be a matrix of dimensions $N \times D_{out}$. Each row is the output for one example, and we have $N$ examples.

$$
\begin{bmatrix}
h_{0, 0} & ... & h_{0, D_{out}} \\
... \\
h_{N, 0} & ... & h_{N, D_{out}}
\end{bmatrix}
$$

Our weights $W$ will have dimensions of $D_{in} \times D_{out}$. We use the same weights for each of our examples.

$$
\begin{bmatrix}
W_{0, 0} & ... & W_{0, D_{out}} \\
... & W_{i,j} & ... \\
W_{D_{in},0} & ... & W_{D_{in}, D_{out}}
\end{bmatrix}
$$

If we do the matrix multiplication $xW$, we get a resultant matrix of dimensions $N \times D_{out}$, matching our $h$. Each row in the output is the result of multiplying the weights with the corresponding input row.

The biases $b$ will have dimensions of $N \times D_{out}$. Keep in mind that we use the same bias values for all training examples. So really, $b$ is just a vector with $D_{out}$ columns (just like we showed for the one training example), expanded along the other dimension to have $N$ rows, so that we can add our biases to each training example.

$$
\begin{bmatrix}
b_0 & ... & b_{D_{out}} \\
... \\
b_0 & ... & b_{D_{out}}
\end{bmatrix}
$$

Putting it together:

$$
\begin{bmatrix}
x_{0,0} & ... & x_{0,D_{in}} \\
... \\
x_{N,0} & ... & x_{N,D_{in}}
\end{bmatrix}
\begin{bmatrix}
W_{0, 0} & ... & W_{0, D_{out}} \\
... & W_{i,j} & ... \\
W_{D_{in},0} & ... & W_{D_{in}, D_{out}}
\end{bmatrix}
+
\begin{bmatrix}
b_0 & ... & b_{D_{out}} \\
... \\
b_0 & ... & b_{D_{out}}
\end{bmatrix}
=
\begin{bmatrix}
h_{0, 0} & ... & h_{0, D_{out}} \\
... \\
h_{N, 0} & ... & h_{N, D_{out}}
\end{bmatrix}
$$

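The linear layer above can be sketched with basic tensor ops. This is a minimal sketch: the helper name `linear_forward` and the sizes below are my own choices for illustration, not from any library.

```python
import torch

def linear_forward(x, W, b):
    # x: (N, D_in), W: (D_in, D_out), b: (D_out,)
    # b broadcasts across the N rows, so the same biases get added to
    # every training example (the "expanded" bias matrix shown above).
    return x @ W + b

x = torch.randn(4, 3)   # N = 4 examples, D_in = 3 features
W = torch.randn(3, 2)   # D_out = 2 classes
b = torch.randn(2)
h = linear_forward(x, W, b)
print(h.shape)  # torch.Size([4, 2])
```

Note that we never materialize the $N \times D_{out}$ bias matrix; broadcasting handles the row-wise expansion for us.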
## ReLU layer

ReLU is pretty easy. It's defined as:

$$
r = \max(0, h)
$$

Here, $h$ is the output from the linear layer, fed into the input of our ReLU layer. And we'll define $r$ as the output of our ReLU layer.

Essentially, we take _every single input_ $h_{i,j}$ from our input matrix and run it through the ReLU function, i.e. clamping every negative value to $0$ so that every output is $\geq 0$.

The dimensions of our input $h$ are $N \times D_{out}$. ReLU doesn't do any matrix transformations, so it outputs a matrix $r$ of dimensions $N \times D_{out}$ as well.

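In tensor ops, that elementwise clamp is a one-liner. A minimal sketch (my own helper, not `torch.nn.ReLU`):

```python
import torch

def relu_forward(h):
    # Elementwise max(0, h): negatives become 0, shape is unchanged.
    return torch.clamp(h, min=0)

h = torch.tensor([[-1.0, 2.0], [3.0, -4.0]])
r = relu_forward(h)
print(r)  # negatives clamped to 0, positives untouched
```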
## Softmax layer

Softmax is the final layer of our neural network. We use softmax to make our prediction. For each training example, it returns a probability distribution over our output features (each possible class). More explanation a lil' later.

The input to our softmax layer is the output $r$ from our ReLU layer, which has dimensions $N \times D_{out}$.

For _one training example_ (as most tutorials show), the softmax is defined as:

$$
S_c = \frac{e^{r_c}}{\sum_k e^{r_k}}
$$

For clarification:

$$
\sum_k e^{r_k} = e^{r_0} + e^{r_1} + ... + e^{r_{D_{out}}}
$$

$$
\begin{bmatrix}
r_0 & ... & r_{D_{out}}
\end{bmatrix}
\Rightarrow
\begin{bmatrix}
\frac{e^{r_0}}{\sum_k e^{r_k}} & ... & \frac{e^{r_{D_{out}}}}{\sum_k e^{r_k}}
\end{bmatrix}
$$

Typically, softmax takes in an input vector and outputs a vector of the same dimensions. This output vector represents a probability distribution - in our classification model, it represents the predicted probabilities for each class. Each column $S_c$ in the output represents the probability that the training example belongs to class $c$. And since this is a probability distribution, all the values in the output add up to 1.

For example, if softmax returned $[0.8, 0.15, 0.05]$, this means our model predicted that there is an 80% chance that the original input $x$ (not $r$) belongs to class index 0, and a 5% chance that the input belongs to class index 2.

However, this is just for _one training example_. To extend this to our model, where we have multiple training examples and $r$ has dimensions $N \times D_{out}$, we can define our softmax as:

$$
S_{i, j} = \frac{e^{r_{i, j}}}{\sum_k e^{r_{i, k}}}
$$

For clarification:

$$
\sum_k e^{r_{i, k}} = e^{r_{i, 0}} + e^{r_{i, 1}} + ... + e^{r_{i, D_{out}}}
$$

$$
\begin{bmatrix}
r_{0, 0} & ... & r_{0, D_{out}} \\
... \\
r_{N, 0} & ... & r_{N, D_{out}}
\end{bmatrix}
\Rightarrow
\begin{bmatrix}
\frac{e^{r_{0, 0}}} {\sum_k e^{r_{0, k}}} & ... & \frac{e^{r_{0, D_{out}}}} {\sum_k e^{r_{0, k}}} \\
\\ ... \\ \\
\frac{e^{r_{N, 0}}} {\sum_k e^{r_{N, k}}} & ... & \frac{e^{r_{N, D_{out}}}} {\sum_k e^{r_{N, k}}}
\end{bmatrix}
$$

Running $r$ through our softmax layer gives us an output matrix $S$ of dimensions $N \times D_{out}$. Each row represents the predicted probability distribution for one example. As such, if you sum up all the values in one row, they'll add up to one.

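The row-wise softmax can be sketched with basic tensor ops too. This is my own helper, not `torch.softmax`; the max-subtraction is a standard numerical-stability trick (softmax is shift-invariant within a row, so it doesn't change the result):

```python
import torch

def softmax_forward(r):
    # Subtract each row's max before exponentiating to avoid overflow.
    e = torch.exp(r - r.max(dim=1, keepdim=True).values)
    # Divide each row by its own sum; each row then sums to 1.
    return e / e.sum(dim=1, keepdim=True)

r = torch.tensor([[1.0, 2.0, 3.0], [1.0, 1.0, 1.0]])
S = softmax_forward(r)
print(S.sum(dim=1))  # each row sums to 1
```

Note the second row of `S` comes out uniform (all $\frac{1}{3}$), since its inputs are all equal.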
## Loss function

Softmax is the final layer in our model. It returns a predicted probability distribution for each of our training examples. Now, we need to measure how well our model did.

To do that, we will compute the negative log likelihood loss (NLL loss) for each training example. Let's look at only one for now.

$$
S = [0.8, 0.15, 0.05]
$$

Typically, our true probability distribution $y$ will look something like this for classification models:

$$
y = [1, 0, 0]
$$

Interpreting all this, the input $x$ actually belongs to class index 0 (it has a 100% chance, as shown in $y$). Our model predicted that the input has an 80% chance of being in index 0. So it's almost there.

How do we measure how well our model did? We'll calculate a loss value, which is a scalar. We'll use the NLL loss equation:

$$
\mathcal{L}(y, S) = -\sum_c y_c log(S_c)
$$

It basically sums up the log likelihood for each output feature $c$ (each possible class), and takes the negative. Why negative? Because the log of a probability is always $\leq 0$, so negating it gives a positive loss value that we can minimize.

> Not gonna explain NLL right now, that's a different story

To expand this for multiple training examples, we can do something similar:

$$
\mathcal{L}(y, S) = - \frac{1}{N} \sum_{i=0}^N \sum_c^{D_{out}} y_{i, c} log(S_{i, c})
$$

Basically, we calculate the average loss over all $N$ training examples.

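The averaged NLL loss is a few tensor ops. A minimal sketch with my own helper (note it differs from `torch.nn.functional.nll_loss`, which expects log-probabilities and integer class labels, not one-hot targets):

```python
import torch

def nll_loss(y, S):
    # y, S: (N, D_out); y holds one-hot targets, S predicted probabilities.
    # Per-example loss is -sum_c y[i,c] * log(S[i,c]); then average over N.
    return -(y * torch.log(S)).sum(dim=1).mean()

S = torch.tensor([[0.8, 0.15, 0.05]])
y = torch.tensor([[1.0, 0.0, 0.0]])
loss = nll_loss(y, S)  # -log(0.8), about 0.2231
print(loss)
```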
## Minimizing loss

So we go forward in our neural network with our initial input $x$, get predicted distributions $S$, and finally calculate a final loss value $\mathcal{L}$.

$$h = xW + b$$
$$r = \max(0, h)$$
$$S = \text{Softmax}(r)$$
$$\mathcal{L} = - \frac{1}{N} \sum_{i=0}^N \sum_c^{D_{out}} y_{i, c} log(S_{i, c})$$

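The four equations above compose into one self-contained forward pass. A sketch using basic tensor ops; all names here are mine, and $y$ is assumed to be a one-hot $N \times D_{out}$ target matrix:

```python
import torch

def forward_and_loss(x, W, b, y):
    h = x @ W + b                                     # linear
    r = torch.clamp(h, min=0)                         # ReLU
    e = torch.exp(r - r.max(dim=1, keepdim=True).values)
    S = e / e.sum(dim=1, keepdim=True)                # row-wise softmax
    loss = -(y * torch.log(S)).sum(dim=1).mean()      # averaged NLL
    return S, loss

x = torch.randn(4, 3)                         # N = 4, D_in = 3
W = torch.randn(3, 2)                         # D_out = 2
b = torch.randn(2)
y = torch.eye(2)[torch.tensor([0, 1, 0, 1])]  # one-hot labels
S, loss = forward_and_loss(x, W, b, y)
print(S.shape, loss.item())
```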
Now it's time to update our parameters (our weights and biases). To do so, we'll calculate the _gradient_ of our loss function with respect to our weights or biases. Recall chain rule:

$$
\frac {\partial \mathcal{L}}{\partial W} =
 \frac {\partial \mathcal{L}}{\partial S}
 \frac {\partial S}{\partial r}
 \frac {\partial r}{\partial h}
 \frac {\partial h}{\partial W}
$$

$$
\frac {\partial \mathcal{L}}{\partial b} =
 \frac {\partial \mathcal{L}}{\partial S}
 \frac {\partial S}{\partial r}
 \frac {\partial r}{\partial h}
 \frac {\partial h}{\partial b}
$$

Once we have the loss gradients, we can update each of our parameters accordingly, scaling the step by a learning rate $\alpha$.

$$W_{new} = W_{old} - \alpha \nabla_W \mathcal{L}$$
$$b_{new} = b_{old} - \alpha \nabla_b \mathcal{L}$$

To do this, we need to calculate each derivative in our chain rule. Let's go through each one.

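The update rule itself is tiny. A minimal sketch of one gradient-descent step, assuming the gradients `grad_W` and `grad_b` have already been produced by the backward pass (the learning rate value is an arbitrary illustrative choice):

```python
import torch

def sgd_step(W, b, grad_W, grad_b, lr=0.1):
    # Step each parameter against its gradient, scaled by the learning rate.
    W = W - lr * grad_W
    b = b - lr * grad_b
    return W, b

W, b = torch.ones(3, 2), torch.zeros(2)
W, b = sgd_step(W, b, grad_W=torch.ones(3, 2), grad_b=torch.ones(2))
print(W[0, 0], b[0])
```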
## Loss derivative

Recall our loss function:

$$\mathcal{L} = - \frac{1}{N} \sum_{i=0}^N \sum_c^{D_{out}} y_{i, c} log(S_{i, c})$$

We need to calculate $\frac {\partial \mathcal{L}}{\partial S}$. But first, let's observe what this derivative looks like.

$S$ is a matrix with dimensions $N \times D_{out}$. Our loss $\mathcal{L}$ is simply a scalar value. Thus, we need to compute the following matrix of partial derivatives, one entry per element of $S$:

$$
\frac {\partial \mathcal{L}}{\partial S}
=
\begin{bmatrix}
\frac {\partial \mathcal{L}}{\partial S_{0, 0}} & ... & \frac {\partial \mathcal{L}}{\partial S_{0, D_{out}}} \\
... \\
\frac {\partial \mathcal{L}}{\partial S_{N, 0}} & ... & \frac {\partial \mathcal{L}}{\partial S_{N, D_{out}}} \\
\end{bmatrix}
$$

TODO

## Softmax derivative

TODO

## ReLU derivative

TODO

## Linear derivative

TODO

## Conclusion

Now we know the basic math behind a neural network!

<img src="https://media1.tenor.com/m/mJ_Og97j5WwAAAAC/chipi-chapa.gif" alt="chipi chipi chapa chapa cat" width=300>