---
layout: blog/post.liquid
title: "Back to basics: Neural network from scratch"
date: 2024-02-03
updated: 2024-02-03
templateEngineOverride: md
---

For class, I had to implement a neural network from scratch, using only basic PyTorch tensor operations. It was a great refresher on the math behind a neural network. Knowing the math, though, doesn't make implementing and visualizing it any easier.

> "You're good in math bro, you should learn machine learning bro"

<img src="https://media1.tenor.com/m/-URYSckgL9sAAAAd/get-out-of-my-head-meme.gif" alt="cat crazy like me fr" width=200 style="">

I write this to remind my future self of the intuition behind the math in the neural network, in case I want a refresher, or if I ever have to write a neural network from scratch again (when pigs start flying). I won't be discussing what each layer does or why we use certain layers, per se. I want to show how we derive the forward and backward passes (_especially_ the backward pass).

## Setup

The model we'll describe is a classification model. It takes in input(s) and outputs a probability distribution describing how likely the input is to belong to each class. It consists of the following layers, in order:

1. Linear
2. ReLU
3. Softmax

For our loss function, we'll use the negative log likelihood (NLL) function.

> **PLEASE NOTE** that this is for learning purposes. In practice, depending on what you want (say, for classification), you may not want a ReLU right before the final softmax.

> For classification, cross entropy loss is equivalent to having a softmax layer and using NLL loss. So, you can simplify softmax and NLL loss into a single cross entropy loss function. However, I prefer to keep them separate to make the math easier to grasp.

Typically, most guides describe the input $x$ to the model as a vector with $D_{in}$ input features. However, we usually train over multiple examples at once (a batch of size $N$). Thus, in this guide, we'll define the input to our model as a matrix with dimensions $N \times D_{in}$. Each row of the input is one training example.

Similarly, our model output will be an $N \times D_{out}$ matrix, where $D_{out}$ is the number of output features (the number of classes we can predict).

With the model's core dimensions defined, let's dive into the specifics of each layer and the forward pass of the neural network.

> Note, I'll define variables that are probably not very common in other tutorials. Idc. These variables made sense to me when I was first learning this.

## Linear layer

The linear layer is defined as $h = Wx + b$ (though we may rearrange the matrix multiplication to match dimensions). We will use $h$ to represent the output of the linear layer.

> I find it more intuitive to do $h = xW + b$, rather than having to do transposing and all that jazz. So I will do that. The actual order of matrix multiplication will depend on how you define your dimensions and whatever.

If we had _one training example_, our matrix multiplication would look something like this:

$$
\begin{bmatrix}
x_0 & ... & x_{D_{in}}
\end{bmatrix}
\begin{bmatrix}
W_{0, 0} & ... & W_{0, D_{out}} \\
... & ... & ... \\
W_{D_{in},0} & ... & W_{D_{in}, D_{out}}
\end{bmatrix}
+
\begin{bmatrix}
b_0 & ... & b_{D_{out}}
\end{bmatrix}
=
\begin{bmatrix}
h_0 & ... & h_{D_{out}}
\end{bmatrix}
$$

But remember, we typically do training in batches and have multiple examples. Thus, our input $x$ has dimensions $N \times D_{in}$. Each row is one training example, and we have $N$ examples. Likewise, the output of the layer, $h$, will be a matrix of dimensions $N \times D_{out}$. Each row is the output for one example, and we have $N$ examples.

$$
\begin{bmatrix}
h_{0, 0} & ... & h_{0, D_{out}} \\
... \\
h_{N, 0} & ... & h_{N, D_{out}}
\end{bmatrix}
$$

Our weights $W$ will have dimensions of $D_{in} \times D_{out}$. We use the same weights for each of our examples.

$$
\begin{bmatrix}
W_{0, 0} & ... & W_{0, D_{out}} \\
... & W_{i,j} & ... \\
W_{D_{in},0} & ... & W_{D_{in}, D_{out}}
\end{bmatrix}
$$

If we do the matrix multiplication $xW$, we get a resultant matrix of dimensions $N \times D_{out}$, matching our $h$. Each row in the output is the result of multiplying the weights with the corresponding input row.

The biases $b$ will have dimensions of $N \times D_{out}$. Keep in mind that we use the same bias values for all training examples. So really, $b$ is just a vector with $D_{out}$ columns (just like we showed for the one training example), expanded along the other dimension to have $N$ rows, so that we can add our biases to each training example.

$$
\begin{bmatrix}
b_0 & ... & b_{D_{out}} \\
... \\
b_0 & ... & b_{D_{out}}
\end{bmatrix}
$$

Putting it together:

$$
\begin{bmatrix}
x_{0,0} & ... & x_{0,D_{in}} \\
... \\
x_{N,0} & ... & x_{N,D_{in}}
\end{bmatrix}
\begin{bmatrix}
W_{0, 0} & ... & W_{0, D_{out}} \\
... & W_{i,j} & ... \\
W_{D_{in},0} & ... & W_{D_{in}, D_{out}}
\end{bmatrix}
+
\begin{bmatrix}
b_0 & ... & b_{D_{out}} \\
... \\
b_0 & ... & b_{D_{out}}
\end{bmatrix}
=
\begin{bmatrix}
h_{0, 0} & ... & h_{0, D_{out}} \\
... \\
h_{N, 0} & ... & h_{N, D_{out}}
\end{bmatrix}
$$

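The linear layer above can be sketched with basic tensor ops. This is a minimal sketch: the helper name `linear_forward` and the sizes below are my own choices for illustration, not from any library.

```python
import torch

def linear_forward(x, W, b):
    # x: (N, D_in), W: (D_in, D_out), b: (D_out,)
    # b broadcasts across the N rows, so the same biases get added to
    # every training example (the "expanded" bias matrix shown above).
    return x @ W + b

x = torch.randn(4, 3)   # N = 4 examples, D_in = 3 features
W = torch.randn(3, 2)   # D_out = 2 classes
b = torch.randn(2)
h = linear_forward(x, W, b)
print(h.shape)  # torch.Size([4, 2])
```

Note that we never materialize the $N \times D_{out}$ bias matrix; broadcasting handles the row-wise expansion for us.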
## ReLU layer

ReLU is pretty easy. It's defined as:

$$
r = \max(0, h)
$$

Here, $h$ is the output from the linear layer, fed into the input of our ReLU layer. And we'll define $r$ as the output of our ReLU layer.

Essentially, we take _every single input_ $h_{i,j}$ from our input matrix and run it through the ReLU function, i.e. clamping every negative value to $0$ so that every output is $\geq 0$.

The dimensions of our input $h$ are $N \times D_{out}$. ReLU doesn't do any matrix transformations, so it outputs a matrix $r$ of dimensions $N \times D_{out}$ as well.

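In tensor ops, that elementwise clamp is a one-liner. A minimal sketch (my own helper, not `torch.nn.ReLU`):

```python
import torch

def relu_forward(h):
    # Elementwise max(0, h): negatives become 0, shape is unchanged.
    return torch.clamp(h, min=0)

h = torch.tensor([[-1.0, 2.0], [3.0, -4.0]])
r = relu_forward(h)
print(r)  # negatives clamped to 0, positives untouched
```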
## Softmax layer

Softmax is the final layer of our neural network. We use softmax to make our prediction. For each training example, it returns a probability distribution over our output features (each possible class). More explanation a lil' later.

The input to our softmax layer is the output $r$ from our ReLU layer, which has dimensions $N \times D_{out}$.

For _one training example_ (as most tutorials show), the softmax is defined as:

$$
S_c = \frac{e^{r_c}}{\sum_k e^{r_k}}
$$

For clarification:

$$
\sum_k e^{r_k} = e^{r_0} + e^{r_1} + ... + e^{r_{D_{out}}}
$$

$$
\begin{bmatrix}
r_0 & ... & r_{D_{out}}
\end{bmatrix}
\Rightarrow
\begin{bmatrix}
\frac{e^{r_0}}{\sum_k e^{r_k}} & ... & \frac{e^{r_{D_{out}}}}{\sum_k e^{r_k}}
\end{bmatrix}
$$

Typically, softmax takes in an input vector and outputs a vector of the same dimensions. This output vector represents a probability distribution - in our classification model, it represents the predicted probabilities for each class. Each column $S_c$ in the output represents the probability that the training example belongs to class $c$. And since this is a probability distribution, all the values in the output add up to 1.

For example, if softmax returned $[0.8, 0.15, 0.05]$, this means our model predicted that there is an 80% chance that the original input $x$ (not $r$) belongs to class index 0, and a 5% chance that the input belongs to class index 2.

However, this is just for _one training example_. To extend this to our model, where we have multiple training examples and $r$ has dimensions $N \times D_{out}$, we can define our softmax as:

$$
S_{i, j} = \frac{e^{r_{i, j}}}{\sum_k e^{r_{i, k}}}
$$

For clarification:

$$
\sum_k e^{r_{i, k}} = e^{r_{i, 0}} + e^{r_{i, 1}} + ... + e^{r_{i, D_{out}}}
$$

$$
\begin{bmatrix}
r_{0, 0} & ... & r_{0, D_{out}} \\
... \\
r_{N, 0} & ... & r_{N, D_{out}}
\end{bmatrix}
\Rightarrow
\begin{bmatrix}
\frac{e^{r_{0, 0}}} {\sum_k e^{r_{0, k}}} & ... & \frac{e^{r_{0, D_{out}}}} {\sum_k e^{r_{0, k}}} \\
\\ ... \\ \\
\frac{e^{r_{N, 0}}} {\sum_k e^{r_{N, k}}} & ... & \frac{e^{r_{N, D_{out}}}} {\sum_k e^{r_{N, k}}}
\end{bmatrix}
$$

Running $r$ through our softmax layer gives us an output matrix $S$ of dimensions $N \times D_{out}$. Each row represents the predicted probability distribution for one example. As such, if you sum up all the values in one row, they'll add up to one.

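The row-wise softmax can be sketched with basic tensor ops too. This is my own helper, not `torch.softmax`; the max-subtraction is a standard numerical-stability trick (softmax is shift-invariant within a row, so it doesn't change the result):

```python
import torch

def softmax_forward(r):
    # Subtract each row's max before exponentiating to avoid overflow.
    e = torch.exp(r - r.max(dim=1, keepdim=True).values)
    # Divide each row by its own sum; each row then sums to 1.
    return e / e.sum(dim=1, keepdim=True)

r = torch.tensor([[1.0, 2.0, 3.0], [1.0, 1.0, 1.0]])
S = softmax_forward(r)
print(S.sum(dim=1))  # each row sums to 1
```

Note the second row of `S` comes out uniform (all $\frac{1}{3}$), since its inputs are all equal.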
## Loss function

Softmax is the final layer in our model. It returns a predicted probability distribution for each of our training examples. Now, we need to measure how well our model did.

To do that, we will compute the negative log likelihood loss (NLL loss) for each training example. Let's look at only one for now.

$$
S = [0.8, 0.15, 0.05]
$$

Typically, our true probability distribution $y$ will look something like this for classification models:

$$
y = [1, 0, 0]
$$

Interpreting all this, the input $x$ actually belongs to class index 0 (it has a 100% chance, as shown in $y$). Our model predicted that the input has an 80% chance of being in index 0. So it's almost there.

How do we measure how well our model did? We'll calculate a loss value, which is a scalar. We'll use the NLL loss equation:

$$
\mathcal{L}(y, S) = -\sum_c y_c log(S_c)
$$

It basically sums up the log likelihood for each output feature $c$ (each possible class), and takes the negative. Why negative? Because the log of a probability is always $\leq 0$, so negating it gives a positive loss value that we can minimize.

> Not gonna explain NLL right now, that's a different story

To expand this for multiple training examples, we can do something similar:

$$
\mathcal{L}(y, S) = - \frac{1}{N} \sum_{i=0}^N \sum_c^{D_{out}} y_{i, c} log(S_{i, c})
$$

Basically, we calculate the average loss over all $N$ training examples.

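The averaged NLL loss is a few tensor ops. A minimal sketch with my own helper (note it differs from `torch.nn.functional.nll_loss`, which expects log-probabilities and integer class labels, not one-hot targets):

```python
import torch

def nll_loss(y, S):
    # y, S: (N, D_out); y holds one-hot targets, S predicted probabilities.
    # Per-example loss is -sum_c y[i,c] * log(S[i,c]); then average over N.
    return -(y * torch.log(S)).sum(dim=1).mean()

S = torch.tensor([[0.8, 0.15, 0.05]])
y = torch.tensor([[1.0, 0.0, 0.0]])
loss = nll_loss(y, S)  # -log(0.8), about 0.2231
print(loss)
```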
## Minimizing loss

So we go forward in our neural network with our initial input $x$, get predicted distributions $S$, and finally calculate a final loss value $\mathcal{L}$.

$$h = xW + b$$
$$r = \max(0, h)$$
$$S = \text{Softmax}(r)$$
$$\mathcal{L} = - \frac{1}{N} \sum_{i=0}^N \sum_c^{D_{out}} y_{i, c} log(S_{i, c})$$

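The four equations above compose into one self-contained forward pass. A sketch using basic tensor ops; all names here are mine, and $y$ is assumed to be a one-hot $N \times D_{out}$ target matrix:

```python
import torch

def forward_and_loss(x, W, b, y):
    h = x @ W + b                                     # linear
    r = torch.clamp(h, min=0)                         # ReLU
    e = torch.exp(r - r.max(dim=1, keepdim=True).values)
    S = e / e.sum(dim=1, keepdim=True)                # row-wise softmax
    loss = -(y * torch.log(S)).sum(dim=1).mean()      # averaged NLL
    return S, loss

x = torch.randn(4, 3)                         # N = 4, D_in = 3
W = torch.randn(3, 2)                         # D_out = 2
b = torch.randn(2)
y = torch.eye(2)[torch.tensor([0, 1, 0, 1])]  # one-hot labels
S, loss = forward_and_loss(x, W, b, y)
print(S.shape, loss.item())
```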
Now it's time to update our parameters (our weights and biases). To do so, we'll calculate the _gradient_ of our loss function with respect to our weights or biases. Recall chain rule:

$$
\frac {\partial \mathcal{L}}{\partial W} =
 \frac {\partial \mathcal{L}}{\partial S}
 \frac {\partial S}{\partial r}
 \frac {\partial r}{\partial h}
 \frac {\partial h}{\partial W}
$$

$$
\frac {\partial \mathcal{L}}{\partial b} =
 \frac {\partial \mathcal{L}}{\partial S}
 \frac {\partial S}{\partial r}
 \frac {\partial r}{\partial h}
 \frac {\partial h}{\partial b}
$$

Once we have the loss gradients, we can update each of our parameters accordingly, scaling the step by a learning rate $\alpha$.

$$W_{new} = W_{old} - \alpha \nabla_W \mathcal{L}$$
$$b_{new} = b_{old} - \alpha \nabla_b \mathcal{L}$$

To do this, we need to calculate each derivative in our chain rule. Let's go through each one.

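The update rule itself is tiny. A minimal sketch of one gradient-descent step, assuming the gradients `grad_W` and `grad_b` have already been produced by the backward pass (the learning rate value is an arbitrary illustrative choice):

```python
import torch

def sgd_step(W, b, grad_W, grad_b, lr=0.1):
    # Step each parameter against its gradient, scaled by the learning rate.
    W = W - lr * grad_W
    b = b - lr * grad_b
    return W, b

W, b = torch.ones(3, 2), torch.zeros(2)
W, b = sgd_step(W, b, grad_W=torch.ones(3, 2), grad_b=torch.ones(2))
print(W[0, 0], b[0])
```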
## Loss derivative

Recall our loss function:

$$\mathcal{L} = - \frac{1}{N} \sum_{i=0}^N \sum_c^{D_{out}} y_{i, c} log(S_{i, c})$$

We need to calculate $\frac {\partial \mathcal{L}}{\partial S}$. But first, let's observe what this derivative looks like.

$S$ is a matrix with dimensions $N \times D_{out}$. Our loss $\mathcal{L}$ is simply a scalar value. Thus, we need to compute the following matrix of partial derivatives, one entry per element of $S$:

$$
\frac {\partial \mathcal{L}}{\partial S}
=
\begin{bmatrix}
\frac {\partial \mathcal{L}}{\partial S_{0, 0}} & ... & \frac {\partial \mathcal{L}}{\partial S_{0, D_{out}}} \\
... \\
\frac {\partial \mathcal{L}}{\partial S_{N, 0}} & ... & \frac {\partial \mathcal{L}}{\partial S_{N, D_{out}}} \\
\end{bmatrix}
$$

TODO

## Softmax derivative

TODO

## ReLU derivative

TODO

## Linear derivative

TODO

## Conclusion

Now we know the basic math behind a neural network!

<img src="https://media1.tenor.com/m/mJ_Og97j5WwAAAAC/chipi-chapa.gif" alt="chipi chipi chapa chapa cat" width=300>