
Commit f6f17d1

jazz
1 parent cd1f2e8 commit f6f17d1

5 files changed

Lines changed: 338 additions & 0 deletions

File tree

eleventy.config.js

Lines changed: 2 additions & 0 deletions
@@ -1,10 +1,12 @@
 // TODO consider using eleventy as the final builder/processor and using parcel only for js bundling
 // rather than having eleventy only for templates and parcel for bundling everything
 const syntaxHighlight = require("@11ty/eleventy-plugin-syntaxhighlight")
+const mdk = require("markdown-it-katex")

 /** @param {import("@11ty/eleventy").UserConfig} eleventyConfig */
 module.exports = function (eleventyConfig) {
 eleventyConfig.addPlugin(syntaxHighlight)
+eleventyConfig.amendLibrary("md", (mdLib) => mdLib.use(mdk))

 // attempts to extracts a mini excerpt preview given a post
 // uses the first paragraph as the excerpt's content

package.json

Lines changed: 1 addition & 0 deletions
@@ -18,6 +18,7 @@
 "@parcel/plugin": "^2.11.0",
 "@shopify/prettier-plugin-liquid": "^1.4.0",
 "@tailwindcss/typography": "^0.5.10",
+"markdown-it-katex": "^2.0.3",
 "parcel": "^2.11.0",
 "postcss": "^8.4.32",
 "prettier": "^3.1.1",

pages/_layouts/blog/post.liquid

Lines changed: 1 addition & 0 deletions
@@ -24,6 +24,7 @@ tags: post

 <link rel="stylesheet" href="/styles/index.css">
 <link rel="stylesheet" href="/styles/syntax-highlight.css">
+<link rel="stylesheet" href="https://cdnjs.cloudflare.com/ajax/libs/KaTeX/0.5.1/katex.min.css">
 <link rel="shortcut icon" type="image/x-icon" href="/assets/images/logo-bg.svg">

 <title>
Lines changed: 314 additions & 0 deletions
@@ -0,0 +1,314 @@
---
layout: blog/post.liquid
title: "Back to basics: Neural network from scratch"
date: 2024-02-03
updated: 2024-02-03
templateEngineOverride: md
---

For class, I had to implement a neural network from scratch, using only basic PyTorch Tensor operations. It was a great refresher on the math behind a neural network. That doesn't make the math any easier to implement or visualize, though.

> "You're good in math bro, you should learn machine learning bro"

<img src="https://media1.tenor.com/m/-URYSckgL9sAAAAd/get-out-of-my-head-meme.gif" alt="cat crazy like me fr" width=200 style="">

I'm writing this to remind my future self of the intuition behind the math in the neural network, in case I want a refresher, or if I ever have to write a neural network from scratch again (when pigs start flying). I won't discuss what each layer does or why we use certain layers, per se. I want to show how we derive the forward and backward passes (_especially_ the backward pass).

## Setup

The model we'll describe is a classification model. It takes in input(s) and outputs a probability distribution describing how likely the input is to belong to each class. It will consist of the following layers, in order:

1. Linear
2. ReLU
3. Softmax

For our loss function, we'll use the negative log likelihood (NLL) function.

> **PLEASE NOTE** that this is for learning purposes. In practice, depending on what you want (say, for classification), you may not want ReLU as the last layer.

> For classification, the cross entropy loss is equivalent to having a softmax layer and using NLL loss. So, you can simplify softmax and NLL loss into the cross entropy loss function. However, I prefer to keep them separate to make the math easier to grasp.

Typically, most guides describe the input $x$ to the model as a vector with $D_{in}$ input features. However, we usually train over multiple examples at once (a batch of size $N$). Thus, in this guide, we'll define the input to our model as a matrix with dimensions $N \times D_{in}$. Each row of the input is one training example.

Similarly, our model output will be an $N \times D_{out}$ matrix, where $D_{out}$ represents the number of output features (the number of classes we can predict).

With the model's core dimensions defined, let's dive into the specifics of each layer and the forward pass of the neural network (there's a small shape sketch after the note below).

> Note, I'll define variables that are probably not very common in other tutorials. Idc. These variables made sense to me when I was first learning it.
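
To make the shapes concrete, here's a minimal PyTorch sketch of the tensors we'll be working with. The sizes ($N = 4$, $D_{in} = 3$, $D_{out} = 2$) are made up purely for illustration.

```python
import torch

# Made-up sizes, purely for illustration
N, D_in, D_out = 4, 3, 2

x = torch.randn(N, D_in)      # input batch: one training example per row
W = torch.randn(D_in, D_out)  # weights, shared across all examples
b = torch.zeros(D_out)        # biases, broadcast across the batch

# One-hot true labels, one row per example (e.g. example 0 belongs to class 0)
y = torch.eye(D_out)[torch.tensor([0, 1, 0, 1])]

print(x.shape, W.shape, b.shape, y.shape)
# torch.Size([4, 3]) torch.Size([3, 2]) torch.Size([2]) torch.Size([4, 2])
```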

## Linear layer

The linear layer is defined as $h = Wx + b$ (though we may rearrange the matrix multiplication to match dimensions). We will use $h$ to represent the output of the linear layer.

> I find it more intuitive to do $h = xW + b$, rather than having to do transposing and all that jazz. So I will do that. The actual order of matrix multiplication will depend on how you define dimensions and whatever.

If we had _one training example_, our matrix multiplication would look something like this:

$$
\begin{bmatrix}
x_0 & ... & x_{D_{in}}
\end{bmatrix}

\begin{bmatrix}
W_{0, 0} & ... & W_{0, D_{out}} \\
... & ... & ... \\
W_{D_{in},0} & ... & W_{D_{in}, D_{out}}
\end{bmatrix}

+

\begin{bmatrix}
b_0 & ... & b_{D_{out}}
\end{bmatrix}

=

\begin{bmatrix}
h_0 & ... & h_{D_{out}}
\end{bmatrix}
$$

But remember, we typically do training in batches and have multiple examples. Thus, our input $x$ has dimensions $N \times D_{in}$. Each row is one training example, and we have $N$ examples. Likewise, the output of the layer, $h$, will be a matrix of dimensions $N \times D_{out}$. Each row is the output for one example, and we have $N$ examples.

$$
\begin{bmatrix}
h_{0, 0} & ... & h_{0, D_{out}} \\
... \\
h_{N, 0} & ... & h_{N, D_{out}}
\end{bmatrix}
$$

Our weights $W$ will have dimensions of $D_{in} \times D_{out}$. We use the same weights for each of our examples.

$$
\begin{bmatrix}
W_{0, 0} & ... & W_{0, D_{out}} \\
... & W_{i,j} & ... \\
W_{D_{in},0} & ... & W_{D_{in}, D_{out}}
\end{bmatrix}
$$

If we do the matrix multiplication $xW$, we get a resultant matrix of dimensions $N \times D_{out}$, matching our $h$. Each row in the output is the result of multiplying the corresponding input row by the weights.

The biases $b$ will have dimensions of $N \times D_{out}$. Keep in mind that we use the same bias values for all training examples. So really, $b$ is just a vector with $D_{out}$ columns (just like we showed for the one training example), expanded along the other dimension to have $N$ rows, so that we can add our biases to each training example.

$$
\begin{bmatrix}
b_0 & ... & b_{D_{out}} \\
... \\
b_0 & ... & b_{D_{out}}
\end{bmatrix}
$$

Putting it together:

$$
\begin{bmatrix}
x_{0,0} & ... & x_{0,D_{in}} \\
... \\
x_{N,0} & ... & x_{N,D_{in}}
\end{bmatrix}
\begin{bmatrix}
W_{0, 0} & ... & W_{0, D_{out}} \\
... & W_{i,j} & ... \\
W_{D_{in},0} & ... & W_{D_{in}, D_{out}}
\end{bmatrix}
+
\begin{bmatrix}
b_0 & ... & b_{D_{out}} \\
... \\
b_0 & ... & b_{D_{out}}
\end{bmatrix}
=
\begin{bmatrix}
h_{0, 0} & ... & h_{0, D_{out}} \\
... \\
h_{N, 0} & ... & h_{N, D_{out}}
\end{bmatrix}
$$
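
In code, the whole batched linear layer is one matrix multiply plus a broadcasted bias. A minimal sketch (the helper name is mine, and it reuses the `x`, `W`, `b` tensors from the setup sketch above):

```python
def linear_forward(x, W, b):
    # x: (N, D_in), W: (D_in, D_out), b: (D_out,)
    # b is broadcast across the N rows, which plays the role of the
    # "expanded" (N, D_out) bias matrix written out above
    return x @ W + b  # h: (N, D_out)

h = linear_forward(x, W, b)
print(h.shape)  # torch.Size([4, 2])
```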

## ReLU layer

ReLU is pretty easy. It's defined as:

$$
r = \max(0, h)
$$

Here, $h$ is the output from the linear layer being fed into the input of our ReLU layer. And we'll define $r$ as the output of our ReLU layer.

Essentially, what we're doing is taking _every single input_ $h_{i,j}$ from our input matrix and running it through our ReLU function, i.e. ensuring every value becomes $\geq 0$.

The dimensions of our input $h$ are $N \times D_{out}$. ReLU doesn't do any matrix transformations, so it also outputs a matrix $r$ of dimensions $N \times D_{out}$.
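
Since ReLU is applied elementwise, a minimal sketch is a one-liner (again, the helper name is just for this post):

```python
def relu_forward(h):
    # Elementwise max(0, h); shape is unchanged: (N, D_out) -> (N, D_out)
    return torch.clamp(h, min=0)

r = relu_forward(h)
```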

## Softmax layer

Softmax is the final layer of our neural network. We use softmax to make our prediction. For each training example, it returns a probability distribution over our output features (each possible class). More explanation a lil' later.

The input to our softmax layer is the output $r$ from our ReLU layer, which has dimensions $N \times D_{out}$.

For _one training example_ (as most tutorials show), the softmax is defined as:

$$
S_c = \frac{e^{r_c}}{\sum_k e^{r_k}}
$$

For clarification:

$$
\sum_k e^{r_k} = e^{r_0} + e^{r_1} + ... + e^{r_{D_{out}}}
$$

$$
\begin{bmatrix}
r_0 & ... & r_{D_{out}}
\end{bmatrix}
\Rightarrow

\begin{bmatrix}
\frac{e^{r_0}}{\sum_k e^{r_k}} & ... & \frac{e^{r_{D_{out}}}}{\sum_k e^{r_k}}
\end{bmatrix}
$$

Typically, softmax takes in an input vector and outputs a vector of the same dimensions. This output vector represents a probability distribution - in our classification model, it represents the predicted probabilities for each class. Each entry $S_c$ in the output represents the probability that the training example belongs to class $c$. And since this is a probability distribution, all the values in the output add up to 1.

For example, if softmax returned $[0.8, 0.15, 0.05]$, this means our model predicted that there is an 80% chance that the original input $x$ (not $r$) belongs to class index 0, and that there's a 5% chance that the input belongs to class index 2.

However, this is just for _one training example_. To extend this to our model, where we have multiple training examples and $r$ has dimensions $N \times D_{out}$, we can define our softmax as:

$$
S_{i, j} = \frac{e^{r_{i, j}}}{\sum_k e^{r_{i, k}}}
$$

For clarification:

$$
\sum_k e^{r_{i, k}} = e^{r_{i, 0}} + e^{r_{i, 1}} + ... + e^{r_{i, D_{out}}}
$$

$$
\begin{bmatrix}
r_{0, 0} & ... & r_{0, D_{out}} \\
... \\
r_{N, 0} & ... & r_{N, D_{out}}
\end{bmatrix}

\Rightarrow

\begin{bmatrix}
\frac{e^{r_{0, 0}}} {\sum_k e^{r_{0, k}}} & ... & \frac{e^{r_{0, D_{out}}}} {\sum_k e^{r_{0, k}}} \\
\\ ... \\ \\
\frac{e^{r_{N, 0}}} {\sum_k e^{r_{N, k}}} & ... & \frac{e^{r_{N, D_{out}}}} {\sum_k e^{r_{N, k}}}
\end{bmatrix}
$$

Running through our softmax layer gives us an output matrix $S$ of dimensions $N \times D_{out}$. Each row represents the predicted probability distribution for one example. As such, if you sum up all the values in one row, they'll add up to one.
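
A minimal sketch of the row-wise softmax. The exponentials are shifted by each row's max first, a standard trick to avoid overflow that doesn't change the result:

```python
def softmax_forward(r):
    # r: (N, D_out); subtract each row's max for numerical stability
    shifted = r - r.max(dim=1, keepdim=True).values
    exp = torch.exp(shifted)
    return exp / exp.sum(dim=1, keepdim=True)  # S: (N, D_out), rows sum to 1

S = softmax_forward(r)
print(S.sum(dim=1))  # each entry is 1.0 (up to floating point error)
```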

## Loss function

Softmax is the final layer in our model. It returns a predicted probability distribution for each of our training examples. Now, we need to measure how well our model did.

To do that, we will compute the negative log likelihood loss (NLL loss) for each training example. Let's look at only one for now.

$$
S = [0.8, 0.15, 0.05]
$$

Typically, our true probability distribution $y$ will look something like this for classification models:

$$
y = [1, 0, 0]
$$

Interpreting all this, the input $x$ actually belongs to class index 0 (it has a 100% chance, as shown in $y$). Our model predicted that the input has an 80% chance of being in index 0. So it's almost there.

How do we measure how well our model did? We'll calculate a loss value, which is a scalar. We'll use the NLL loss equation:

$$
\mathcal{L}(y, S) = -\sum_c y_c \log(S_c)
$$

It basically sums up the log likelihood for each output feature $c$ (each possible class), and takes the negative. Why the negative? Because the log of a probability is $\leq 0$, so flipping the sign turns "maximize the likelihood" into "minimize the loss", which is what we'll do.

> Not gonna explain NLL right now, that's a different story

To expand this to multiple training examples, we can do something similar:

$$
\mathcal{L}(y, S) = - \frac{1}{N} \sum_{i=0}^N \sum_c^{D_{out}} y_{i, c} \log(S_{i, c})
$$

Basically, we calculate the average loss for all $N$ training examples.
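
A minimal sketch of the batched NLL loss, assuming `y` is a one-hot matrix of shape $(N, D_{out})$ like the label tensor from the setup sketch:

```python
def nll_loss(y, S):
    # y, S: (N, D_out); y is one-hot, S holds predicted probabilities
    # Sum over classes for each example, then average over the batch
    per_example = -(y * torch.log(S)).sum(dim=1)  # (N,)
    return per_example.mean()  # scalar

loss = nll_loss(y, S)
print(loss.item())
```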

## Minimizing loss

So we go forward in our neural network with our initial input $x$, get predicted distributions $S$, and finally calculate a single loss value $\mathcal{L}$.

$$h = xW + b$$
$$r = \max(0, h)$$
$$S = \text{Softmax}(r)$$
$$\mathcal{L} = - \frac{1}{N} \sum_{i=0}^N \sum_c^{D_{out}} y_{i, c} \log(S_{i, c})$$

Now it's time to update our parameters (our weights and biases). To do so, we'll calculate the _gradient_ of our loss function with respect to our weights or biases. Recall the chain rule:

$$
\frac {\partial \mathcal{L}}{\partial W} =
\frac {\partial \mathcal{L}}{\partial S}
\frac {\partial S}{\partial r}
\frac {\partial r}{\partial h}
\frac {\partial h}{\partial W}
$$

$$
\frac {\partial \mathcal{L}}{\partial b} =
\frac {\partial \mathcal{L}}{\partial S}
\frac {\partial S}{\partial r}
\frac {\partial r}{\partial h}
\frac {\partial h}{\partial b}
$$

Once we have the loss gradients, we can update each of our parameters accordingly, stepping against the gradient (usually scaled by a small learning rate $\eta$).

$$W_{new} = W_{old} - \eta \nabla_W \mathcal{L}$$
$$b_{new} = b_{old} - \eta \nabla_b \mathcal{L}$$
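
A minimal sketch of this update step, assuming the gradient tensors `grad_W` and `grad_b` (same shapes as `W` and `b`) have already been computed via the chain rule above:

```python
lr = 0.1  # learning rate (eta), chosen arbitrarily here

# Step each parameter against its gradient
W = W - lr * grad_W
b = b - lr * grad_b
```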

To do this, we need to calculate each derivative in our chain rule. Let's go through each one.

## Loss derivative

Recall our loss function:

$$\mathcal{L} = - \frac{1}{N} \sum_{i=0}^N \sum_c^{D_{out}} y_{i, c} \log(S_{i, c})$$

We need to calculate $\frac {\partial \mathcal{L}}{\partial S}$. But first, let's observe what this derivative looks like.

$S$ is an input with dimensions of $N \times D_{out}$. Our loss function is simply a scalar value. Thus, we need to compute the following matrix:

$$
\frac {\partial \mathcal{L}}{\partial S}
=
\begin{bmatrix}
\frac {\partial \mathcal{L}}{\partial S_{0, 0}} & ... & \frac {\partial \mathcal{L}}{\partial S_{0, D_{out}}} \\
... \\
\frac {\partial \mathcal{L}}{\partial S_{N, 0}} & ... & \frac {\partial \mathcal{L}}{\partial S_{N, D_{out}}} \\
\end{bmatrix}
$$

TODO

## Softmax derivative

TODO

## ReLU derivative

TODO

## Linear derivative

TODO

## Conclusion

Now we know neural network basic math 😎🎉🎉🎉

<img src="https://media1.tenor.com/m/mJ_Og97j5WwAAAAC/chipi-chapa.gif" alt="chipi chipi chapa chapa cat" width=300>

pnpm-lock.yaml

Lines changed: 20 additions & 0 deletions
Some generated files are not rendered by default.
