
Commit de5218e

2 parents a5cd393 + 7638cb3

31 files changed

Lines changed: 3773 additions & 6 deletions

articles/backpropagation.md

Lines changed: 132 additions & 0 deletions
@@ -0,0 +1,132 @@
## Prerequisites

Before attempting this problem, you should be comfortable with:

- **Chain Rule** - Computing derivatives of composite functions like $f(g(h(x)))$ by multiplying the individual derivatives, because backpropagation is literally the chain rule applied systematically
- **Single Neuron Forward Pass** - You need the forward pass ($z = x \cdot w + b$, then sigmoid) before you can compute gradients going backward
- **Gradient Descent** - Once you have the gradients, you update weights by stepping opposite to them

---
## Concept

Backpropagation is how neural networks learn. It computes how much each weight and bias contributed to the total loss by applying the chain rule backward through the computation graph.

For a single neuron with sigmoid activation and squared error loss, the chain of computation is:

$$z = x \cdot w + b \quad \rightarrow \quad \hat{y} = \sigma(z) \quad \rightarrow \quad L = \frac{1}{2}(\hat{y} - y)^2$$

To find $\frac{\partial L}{\partial w}$, we chain the derivatives step by step:

$$\frac{\partial L}{\partial w} = \frac{\partial L}{\partial \hat{y}} \cdot \frac{\partial \hat{y}}{\partial z} \cdot \frac{\partial z}{\partial w}$$

Each piece is simple on its own:

- $\frac{\partial L}{\partial \hat{y}} = \hat{y} - y$ (error signal)
- $\frac{\partial \hat{y}}{\partial z} = \hat{y}(1 - \hat{y})$ (sigmoid derivative, which has this elegant form)
- $\frac{\partial z}{\partial w} = x$ (the input itself)
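
The elegant form in the second bullet follows directly from the definition $\sigma(z) = \frac{1}{1 + e^{-z}}$:

$$\frac{d\sigma}{dz} = \frac{e^{-z}}{(1 + e^{-z})^2} = \frac{1}{1 + e^{-z}} \cdot \frac{e^{-z}}{1 + e^{-z}} = \sigma(z)\bigl(1 - \sigma(z)\bigr) = \hat{y}(1 - \hat{y})$$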
The product of the first two, $(\hat{y} - y) \cdot \hat{y}(1 - \hat{y})$, is called the **delta**. It captures "how wrong are we, scaled by how sensitive the activation is at this operating point." The weight gradient is delta times the input. The bias gradient is just delta (since $\frac{\partial z}{\partial b} = 1$).

---

## Solution

### Intuition

We run the forward pass to get $\hat{y}$, compute the delta term (error times sigmoid derivative), then multiply by each input to get the weight gradients. The bias gradient is the delta itself.

### Implementation

::tabs-start

```python
import numpy as np
from numpy.typing import NDArray
from typing import Tuple


class Solution:
    def backward(self, x: NDArray[np.float64], w: NDArray[np.float64], b: float, y_true: float) -> Tuple[NDArray[np.float64], float]:
        # Forward pass: linear combination, then sigmoid
        z = np.dot(x, w) + b
        y_hat = 1.0 / (1.0 + np.exp(-z))

        # Chain rule pieces
        error = y_hat - y_true                 # dL/dy_hat
        sigmoid_deriv = y_hat * (1.0 - y_hat)  # dy_hat/dz
        delta = error * sigmoid_deriv          # dL/dz

        dL_dw = np.round(delta * x, 5)         # dz/dw = x
        dL_db = round(float(delta), 5)         # dz/db = 1

        return (dL_dw, dL_db)
```

::tabs-end
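
As a quick sanity check, the method can be called with the walkthrough inputs below; this driver snippet is illustrative only and not part of the graded solution:

```python
import numpy as np

x = np.array([1.0, 2.0])
w = np.array([0.5, -0.5])
b, y_true = 0.1, 1.0

dL_dw, dL_db = Solution().backward(x, w, b, y_true)
print(dL_dw, dL_db)  # approximately [-0.14384, -0.28768] and -0.14384
```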

### Walkthrough

Given `x = [1.0, 2.0]`, `w = [0.5, -0.5]`, `b = 0.1`, `y_true = 1.0`:

| Step | Computation | Result |
|---|---|---|
| Linear | $z = 1.0(0.5) + 2.0(-0.5) + 0.1$ | $-0.4$ |
| Sigmoid | $\hat{y} = 1/(1 + e^{0.4})$ | $0.40131$ |
| Error | $\hat{y} - y$ | $-0.59869$ |
| Sigmoid deriv | $0.40131 \times 0.59869$ | $0.24026$ |
| Delta | $-0.59869 \times 0.24026$ | $-0.14384$ |
| $dL/dw$ | $[-0.14384 \times 1.0, -0.14384 \times 2.0]$ | $[-0.14384, -0.28768]$ |
| $dL/db$ | $-0.14384$ | $-0.14384$ |
The negative gradients mean: increase $w_0$, increase $w_1$, and increase $b$ to push $\hat{y}$ closer to 1.0.
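
To make that concrete, a single gradient-descent step with these gradients would nudge the parameters as follows (the learning rate of 0.1 is an arbitrary illustrative choice):

```python
import numpy as np

lr = 0.1  # illustrative learning rate
w = np.array([0.5, -0.5])
b = 0.1
dL_dw = np.array([-0.14384, -0.28768])
dL_db = -0.14384

# Step opposite the gradient: negative gradients mean the parameters increase
w_new = w - lr * dL_dw  # approximately [0.51438, -0.47123]
b_new = b - lr * dL_db  # approximately 0.11438
```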

### Time & Space Complexity

- Time: $O(d)$ where $d$ is the number of input features
- Space: $O(d)$ for the weight gradient vector

---

## Common Pitfalls
### Getting the Error Direction Wrong

The error is $\hat{y} - y$, not $y - \hat{y}$. Flipping it negates all gradients, making the model move away from the target.

::tabs-start

```python
# Wrong: inverted error
error = y_true - y_hat

# Correct: prediction minus truth
error = y_hat - y_true
```

::tabs-end

### Forgetting the Sigmoid Derivative

The sigmoid derivative is part of the chain. Without it, you are computing the gradient as if the activation were linear, which gives wrong weight updates.

::tabs-start

```python
# Wrong: missing sigmoid derivative in the chain
delta = error  # only the error, no activation derivative

# Correct: error * sigmoid derivative
sigmoid_deriv = y_hat * (1.0 - y_hat)
delta = error * sigmoid_deriv
```

::tabs-end

---

## In the GPT Project
This becomes `foundations/backprop.py`. In practice, PyTorch's autograd computes all of this automatically when you call `loss.backward()`. But understanding the manual chain-rule computation explains what happens under the hood and why vanishing gradients occur (the sigmoid derivative term can be very small).
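
For intuition, autograd reproduces the hand-derived gradients; the snippet below is a minimal sketch with toy values, not code from the course:

```python
import torch

x = torch.tensor([1.0, 2.0])
w = torch.tensor([0.5, -0.5], requires_grad=True)
b = torch.tensor(0.1, requires_grad=True)
y_true = torch.tensor(1.0)

y_hat = torch.sigmoid(x @ w + b)
loss = 0.5 * (y_hat - y_true) ** 2
loss.backward()  # autograd applies the same chain rule

print(w.grad, b.grad)  # approximately [-0.1438, -0.2877] and -0.1438
```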

---

## Key Takeaways

- Backpropagation decomposes the gradient into a chain of simple derivatives. Each link in the chain corresponds to one operation in the forward pass.
- The "delta" term (error times activation derivative) is the core building block. In deeper networks, deltas propagate backward through layers.
- The sigmoid derivative $\hat{y}(1-\hat{y})$ peaks at 0.25 when $\hat{y} = 0.5$ and approaches 0 at the extremes, which is why deep sigmoid networks suffer from vanishing gradients.

articles/basics-of-pytorch.md

Lines changed: 120 additions & 0 deletions
@@ -0,0 +1,120 @@
## Prerequisites

Before attempting this problem, you should be comfortable with:

- **NumPy Array Operations** - PyTorch tensors mirror NumPy arrays in API and behavior, so if you know NumPy, you already know most of PyTorch's tensor API
- **Reshaping and Dimension Semantics** - Understanding what `dim=0` vs `dim=1` means in a 2D tensor is critical because getting the wrong dimension silently produces wrong results

---
## Concept

PyTorch is the most widely used deep learning framework. Its core data structure is the **tensor**, which works like a NumPy array but adds two superpowers: automatic differentiation (autograd) and GPU acceleration.

**Reshaping** changes a tensor's dimensions without touching its data. A $(2, 4)$ tensor has 8 elements, and you can reshape it to $(4, 2)$, $(1, 8)$, or $(8, 1)$. The total element count must stay the same. This is used constantly: for example, flattening a 2D image into a 1D vector for a linear layer.

**Averaging** along a dimension collapses that dimension. For a $(3, 2)$ tensor, averaging along `dim=0` (rows) produces a $(2,)$ tensor with column means. Averaging along `dim=1` (columns) produces a $(3,)$ tensor with row means. Think of `dim=` as "the dimension that disappears."

**Concatenation** joins tensors along a dimension. Concatenating two $(2, 3)$ tensors along `dim=1` produces a $(2, 6)$ tensor. The tensors must agree on all other dimensions.

**MSE Loss** is available as `torch.nn.functional.mse_loss`, computing $\frac{1}{N}\sum(pred_i - target_i)^2$. Using built-in loss functions is preferred because they handle edge cases and are numerically stable.
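
A few toy lines (illustrative tensors of my own choosing, not part of the problem) make the shape rules concrete:

```python
import torch

t = torch.arange(8.0).reshape(2, 4)               # shape (2, 4)
print(t.reshape(4, 2).shape)                      # torch.Size([4, 2]) - same 8 elements

m = torch.tensor([[1., 2.], [3., 4.], [5., 6.]])  # shape (3, 2)
print(m.mean(dim=0))                              # tensor([3., 4.]) - dim 0 disappears
print(m.mean(dim=1))                              # tensor([1.5000, 3.5000, 5.5000])

a, b = torch.zeros(2, 3), torch.ones(2, 3)
print(torch.cat((a, b), dim=1).shape)             # torch.Size([2, 6])
```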
---

## Solution

### Intuition

Each method exercises a core PyTorch operation. We use `torch.reshape` for reshaping, `torch.mean` with a dimension argument for averaging, `torch.cat` for concatenation, and `torch.nn.functional.mse_loss` for the loss computation.

### Implementation

::tabs-start

```python
import torch
import torch.nn
from torchtyping import TensorType


class Solution:
    def reshape(self, to_reshape: TensorType[float]) -> TensorType[float]:
        # Reshape (M, N) to (M * N // 2, 2) without changing the data
        M, N = to_reshape.shape
        reshaped = torch.reshape(to_reshape, (M * N // 2, 2))
        return torch.round(reshaped, decimals=4)

    def average(self, to_avg: TensorType[float]) -> TensorType[float]:
        # dim=0 collapses the rows, leaving one mean per column
        averaged = torch.mean(to_avg, dim=0)
        return torch.round(averaged, decimals=4)

    def concatenate(self, cat_one: TensorType[float], cat_two: TensorType[float]) -> TensorType[float]:
        # dim=1 places the tensors side by side; row counts must match
        concatenated = torch.cat((cat_one, cat_two), dim=1)
        return torch.round(concatenated, decimals=4)

    def get_loss(self, prediction: TensorType[float], target: TensorType[float]) -> TensorType[float]:
        # Built-in MSE: mean of squared element-wise differences
        loss = torch.nn.functional.mse_loss(prediction, target)
        return torch.round(loss, decimals=4)
```

::tabs-end
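
A quick check against the walkthrough values (illustrative driver code, not part of the solution interface):

```python
import torch

s = Solution()
print(s.reshape(torch.tensor([[1., 2., 3., 4.], [5., 6., 7., 8.]])))   # rows [1,2],[3,4],[5,6],[7,8]
print(s.average(torch.tensor([[1., 2.], [3., 4.], [5., 6.]])))         # tensor([3., 4.])
print(s.concatenate(torch.tensor([[1., 2.], [3., 4.]]),
                    torch.tensor([[5., 6.], [7., 8.]])))               # rows [1,2,5,6],[3,4,7,8]
print(s.get_loss(torch.tensor([1.0, 2.0]), torch.tensor([1.5, 2.5])))  # tensor(0.2500)
```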

### Walkthrough

| Operation | Input | Computation | Output |
|---|---|---|---|
| Reshape $(2,4) \to (4,2)$ | $[[1,2,3,4],[5,6,7,8]]$ | Split into pairs of 2 | $[[1,2],[3,4],[5,6],[7,8]]$ |
| Average `dim=0` | $[[1,2],[3,4],[5,6]]$ | Column means | $[3.0, 4.0]$ |
| Concat `dim=1` | $[[1,2],[3,4]]$ and $[[5,6],[7,8]]$ | Side by side | $[[1,2,5,6],[3,4,7,8]]$ |
| MSE Loss | pred=$[1.0,2.0]$, target=$[1.5,2.5]$ | $(0.25+0.25)/2$ | $0.25$ |

### Time & Space Complexity

- Time: $O(n)$ for each operation where $n$ is the total number of elements
- Space: $O(n)$ for the output tensors
---

## Common Pitfalls

### Wrong Dimension for Averaging

`dim=0` averages across rows (column-wise means), while `dim=1` averages across columns (row-wise means). The two are easy to confuse.

::tabs-start

```python
# Wrong: averages across columns instead of rows
averaged = torch.mean(to_avg, dim=1)

# Correct: averages across rows (column means)
averaged = torch.mean(to_avg, dim=0)
```

::tabs-end

### Mismatched Shapes for Concatenation

Concatenation along `dim=1` requires the same number of rows. Different row counts cause a runtime error.

::tabs-start

```python
# Wrong: different number of rows (2 vs 3)
torch.cat((torch.zeros(2, 3), torch.zeros(3, 3)), dim=1)

# Correct: same number of rows
torch.cat((torch.zeros(2, 3), torch.zeros(2, 3)), dim=1)
```

::tabs-end

---

## In the GPT Project
This becomes `foundations/pytorch_basics.py`. Every subsequent problem uses these operations. Reshaping appears when flattening logits for cross-entropy loss. Averaging is used in layer normalization. Concatenation joins multi-head attention outputs. MSE loss trains regression models.
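
For example, the logit-flattening step mentioned above typically looks something like the sketch below; the shapes and variable names are illustrative, not taken from the course code:

```python
import torch
import torch.nn.functional as F

B, T, vocab_size = 4, 8, 65             # batch, sequence length, vocabulary size (toy values)
logits = torch.randn(B, T, vocab_size)
targets = torch.randint(0, vocab_size, (B, T))

# Cross-entropy expects (N, C) logits and (N,) targets, so reshape first
loss = F.cross_entropy(logits.reshape(B * T, vocab_size), targets.reshape(B * T))
print(loss)
```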

---

## Key Takeaways

- PyTorch tensors mirror NumPy arrays in API but add autograd and GPU support, making them the foundation of modern deep learning.
- `torch.reshape`, `torch.mean`, and `torch.cat` are the three tensor manipulation functions you will use most often.
- Using built-in functions like `mse_loss` is safer than manual computation because they handle numerical edge cases automatically.

articles/build-vocabulary.md

Lines changed: 121 additions & 0 deletions
@@ -0,0 +1,121 @@
## Prerequisites

Before attempting this problem, you should be comfortable with:

- **Python Dictionaries** - Creating bidirectional mappings between keys and values, because the vocabulary is a pair of dictionaries (string-to-int and int-to-string)
- **Character-Level Tokenization** - Treating each character as an individual token, which is the simplest tokenization approach and the one used in this course's GPT model

---
## Concept

Before a language model can process text, it needs a vocabulary: a bidirectional mapping between characters (or tokens) and integers. The model works with integers internally, so we need `encode` to convert text to numbers and `decode` to convert numbers back to text.

The construction process is:

1. **Extract unique characters** from the training text.
2. **Sort them** alphabetically for deterministic ordering.
3. **Build `stoi`** (string-to-integer): assign each character a unique index starting from 0.
4. **Build `itos`** (integer-to-string): the reverse mapping.

This is character-level tokenization. The vocabulary size equals the number of unique characters in the training data, typically 50-100 for English text. Compare this to BPE (50,000+ tokens) or word-level (100,000+ tokens) vocabularies. Character-level vocabularies produce much longer sequences but never encounter out-of-vocabulary tokens (as long as every character in the input appeared in the training text).

The `encode`/`decode` functions must be inverses: `decode(encode(text)) == text`. This round-trip property is essential. If you cannot perfectly reconstruct the original text, the model cannot learn the correct input-output mapping.
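
That property is cheap to verify directly; a minimal standalone check (independent of the graded solution) looks like this:

```python
# Round-trip sanity check on a toy string
text = "hello world"
chars = sorted(set(text))
stoi = {ch: i for i, ch in enumerate(chars)}
itos = {i: ch for ch, i in stoi.items()}

encoded = [stoi[ch] for ch in text]
decoded = ''.join(itos[i] for i in encoded)
assert decoded == text  # lossless round trip
```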
---

## Solution

### Intuition

Extract the unique characters with `set()`, sort them, and build the two dictionaries with `enumerate`. Encoding is a list comprehension of dictionary lookups; decoding joins the looked-up characters.

### Implementation

::tabs-start

```python
from typing import Dict, List, Tuple


class Solution:
    def build_vocab(self, text: str) -> Tuple[Dict[str, int], Dict[int, str]]:
        # Unique characters, sorted for deterministic ID assignment
        chars = sorted(set(text))
        stoi = {ch: i for i, ch in enumerate(chars)}
        itos = {i: ch for ch, i in stoi.items()}  # exact inverse of stoi
        return stoi, itos

    def encode(self, text: str, stoi: Dict[str, int]) -> List[int]:
        return [stoi[ch] for ch in text]

    def decode(self, ids: List[int], itos: Dict[int, str]) -> str:
        return ''.join(itos[i] for i in ids)
```

::tabs-end
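
Running the walkthrough example through the class directly (illustrative driver code):

```python
s = Solution()
stoi, itos = s.build_vocab("hello")
print(stoi)                 # {'e': 0, 'h': 1, 'l': 2, 'o': 3}
ids = s.encode("hello", stoi)
print(ids)                  # [1, 0, 2, 2, 3]
print(s.decode(ids, itos))  # hello
```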

### Walkthrough

For `text = "hello"`:

| Step | Input | Output |
|---|---|---|
| Unique chars | "hello" | `{'h', 'e', 'l', 'o'}` |
| Sort | set | `['e', 'h', 'l', 'o']` |
| stoi | sorted chars | `{'e': 0, 'h': 1, 'l': 2, 'o': 3}` |
| itos | reversed | `{0: 'e', 1: 'h', 2: 'l', 3: 'o'}` |
| Encode "hello" | lookup each char | $[1, 0, 2, 2, 3]$ |
| Decode $[1,0,2,2,3]$ | lookup each int | "hello" |

Round-trip: `decode(encode("hello")) == "hello"`.

### Time & Space Complexity

- Time: $O(N + V \log V)$ to build the vocabulary (scan the text, then sort the $V$ unique characters) and $O(N)$ for encoding/decoding, where $N$ is the text length
- Space: $O(V)$ for the vocabulary dictionaries, where $V$ is the number of unique characters
---

## Common Pitfalls

### Not Sorting the Unique Characters

Python sets have no guaranteed iteration order. Without sorting, the same text may produce different vocabularies on different runs.

::tabs-start

```python
# Wrong: non-deterministic order
chars = list(set(text))

# Correct: sorted for reproducibility
chars = sorted(set(text))
```

::tabs-end

### Building itos Incorrectly

The `itos` mapping must be the exact inverse of `stoi`. Building it independently is fragile: it only stays correct if it uses exactly the same ordered character list, and any mismatch silently breaks the pairing. Deriving it from `stoi` guarantees the inverse relationship.

::tabs-start

```python
# Wrong: built independently from chars; breaks if chars differs from the list used for stoi
itos = {i: ch for i, ch in enumerate(chars)}

# Correct: derive from stoi to guarantee the inverse relationship
itos = {i: ch for ch, i in stoi.items()}
```

::tabs-end

---

## In the GPT Project
This becomes `data/vocab.py`. The GPT model in this course uses character-level tokenization, so this vocabulary is what converts raw training text into the integer sequences that the model processes. Production models like GPT-2 use BPE instead, but the concept is the same: a bidirectional mapping between tokens and integers.
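
In that pipeline, the encoded IDs typically end up in a long-integer tensor ready for batching. A rough sketch, assuming a hypothetical `input.txt` training file and the `Solution` class above:

```python
import torch

with open("input.txt", "r", encoding="utf-8") as f:  # hypothetical training text
    text = f.read()

s = Solution()
stoi, itos = s.build_vocab(text)
data = torch.tensor(s.encode(text, stoi), dtype=torch.long)
print(data.shape, data.dtype)  # one integer ID per character in the corpus
```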

---

## Key Takeaways

- A character-level vocabulary is the simplest tokenization approach, with vocabulary size equal to the number of unique characters in the training data.
- The `stoi`/`itos` pair enables lossless round-trip conversion between text and integer sequences, which is a hard requirement for any tokenizer.
- Sorting the unique characters ensures deterministic ID assignment. Without sorting, the same text could produce different vocabularies across runs.
