
Commit bcac58b

Add solution articles for all 27 ML problems (#5649)

Each article includes:
- Prerequisites linking to prior problems in the course
- Concept section with LaTeX formulas and intuition
- Implementation walkthrough with test case traces
- Time/space complexity analysis
- Key takeaways

Problems covered: gradient-descent, sigmoid-and-relu, softmax, cross-entropy-loss, linear-regression-forward, linear-regression-training, basics-of-pytorch, handwritten-digit-classifier, single-neuron, backpropagation, mlp-from-scratch, layer-normalization, training-loop, word-embeddings, nlp-intro, sentiment-analysis, positional-encoding, self-attention, multi-headed-self-attention, transformer-block, tokenizer-bpe, build-vocabulary, gpt-data-loader, gpt-dataset, code-gpt, train-your-gpt, make-gpt-talk-back

Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

1 parent 7e43580 commit bcac58b

27 files changed: 3357 additions & 0 deletions

articles/backpropagation.md

Lines changed: 123 additions & 0 deletions
## Prerequisites

Before attempting this problem, you should be comfortable with:

- **Chain Rule** - Computing derivatives of composite functions like $f(g(h(x)))$ by multiplying the individual derivatives, because backpropagation is literally the chain rule applied systematically
- **Single Neuron Forward Pass** - You need the forward pass ($z = x \cdot w + b$, then sigmoid) before you can compute gradients going backward
- **Gradient Descent** - Once you have the gradients, you update weights by stepping opposite to them

---

## Concept

Backpropagation is how neural networks learn. It computes how much each weight and bias contributed to the total loss by applying the chain rule backward through the computation graph.

For a single neuron with sigmoid activation and squared error loss, the chain of computation is:

$$z = x \cdot w + b \quad \rightarrow \quad \hat{y} = \sigma(z) \quad \rightarrow \quad L = \frac{1}{2}(\hat{y} - y)^2$$

To find $\frac{\partial L}{\partial w}$, we chain the derivatives step by step:

$$\frac{\partial L}{\partial w} = \frac{\partial L}{\partial \hat{y}} \cdot \frac{\partial \hat{y}}{\partial z} \cdot \frac{\partial z}{\partial w}$$

Each piece is simple on its own:

- $\frac{\partial L}{\partial \hat{y}} = \hat{y} - y$ (error signal)
- $\frac{\partial \hat{y}}{\partial z} = \hat{y}(1 - \hat{y})$ (sigmoid derivative, which has this elegant form)
- $\frac{\partial z}{\partial w} = x$ (the input itself)

The product of the first two, $(\hat{y} - y) \cdot \hat{y}(1 - \hat{y})$, is called the **delta**. It captures "how wrong are we, scaled by how sensitive the activation is at this operating point." The weight gradient is the delta times the input. The bias gradient is just the delta (since $\frac{\partial z}{\partial b} = 1$).
---

## Solution

### Intuition

We run the forward pass to get $\hat{y}$, compute the delta term (error times sigmoid derivative), then multiply by each input to get the weight gradients. The bias gradient is the delta itself.

### Implementation

```python
import numpy as np
from numpy.typing import NDArray
from typing import Tuple


class Solution:
    def backward(self, x: NDArray[np.float64], w: NDArray[np.float64], b: float, y_true: float) -> Tuple[NDArray[np.float64], float]:
        # Forward pass: linear combination, then sigmoid
        z = np.dot(x, w) + b
        y_hat = 1.0 / (1.0 + np.exp(-z))

        # Backward pass: chain the three derivatives
        error = y_hat - y_true                 # dL/dy_hat
        sigmoid_deriv = y_hat * (1.0 - y_hat)  # dy_hat/dz
        delta = error * sigmoid_deriv          # shared "delta" term

        dL_dw = np.round(delta * x, 5)         # dz/dw = x
        dL_db = round(float(delta), 5)         # dz/db = 1

        return (dL_dw, dL_db)
```
### Walkthrough

Given `x = [1.0, 2.0]`, `w = [0.5, -0.5]`, `b = 0.1`, `y_true = 1.0`:

| Step | Computation | Result |
|---|---|---|
| Linear | $z = 1.0(0.5) + 2.0(-0.5) + 0.1$ | $-0.4$ |
| Sigmoid | $\hat{y} = 1/(1 + e^{0.4})$ | $0.40131$ |
| Error | $\hat{y} - y$ | $-0.59869$ |
| Sigmoid deriv | $0.40131 \times 0.59869$ | $0.24026$ |
| Delta | $-0.59869 \times 0.24026$ | $-0.14384$ |
| $dL/dw$ | $[-0.14384 \times 1.0, -0.14384 \times 2.0]$ | $[-0.14384, -0.28768]$ |
| $dL/db$ | $-0.14384$ | $-0.14384$ |

The negative gradients mean: increase $w_0$, increase $w_1$, and increase $b$ to push $\hat{y}$ closer to 1.0.
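A quick way to gain confidence in these numbers is to check the analytic chain-rule gradients against a central finite-difference approximation. This is a standalone sanity-check sketch, not part of the reference solution; the `loss` helper and step size `eps` are illustrative choices.

```python
import numpy as np

def loss(x, w, b, y_true):
    # Forward pass: linear -> sigmoid -> squared error
    z = np.dot(x, w) + b
    y_hat = 1.0 / (1.0 + np.exp(-z))
    return 0.5 * (y_hat - y_true) ** 2

x = np.array([1.0, 2.0])
w = np.array([0.5, -0.5])
b, y_true, eps = 0.1, 1.0, 1e-6

# Analytic gradients via the chain rule (same algebra as the table)
z = np.dot(x, w) + b
y_hat = 1.0 / (1.0 + np.exp(-z))
delta = (y_hat - y_true) * y_hat * (1.0 - y_hat)
dL_dw, dL_db = delta * x, delta

# Central finite differences for each weight should agree closely
for i in range(len(w)):
    w_plus, w_minus = w.copy(), w.copy()
    w_plus[i] += eps
    w_minus[i] -= eps
    numeric = (loss(x, w_plus, b, y_true) - loss(x, w_minus, b, y_true)) / (2 * eps)
    assert abs(numeric - dL_dw[i]) < 1e-6

numeric_b = (loss(x, w, b + eps, y_true) - loss(x, w, b - eps, y_true)) / (2 * eps)
assert abs(numeric_b - dL_db) < 1e-6

print(np.round(dL_dw, 5), round(float(dL_db), 5))  # matches the table values
```

If the assertions pass, the analytic gradient is consistent with the numerical one, which is the standard way to debug a hand-written backward pass.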
### Time & Space Complexity

- Time: $O(d)$ where $d$ is the number of input features
- Space: $O(d)$ for the weight gradient vector

---

## Common Pitfalls

### Getting the Error Direction Wrong

The error is $\hat{y} - y$, not $y - \hat{y}$. Flipping it negates all gradients, making the model move away from the target.

```python
# Wrong: inverted error
error = y_true - y_hat

# Correct: prediction minus truth
error = y_hat - y_true
```

### Forgetting the Sigmoid Derivative

The sigmoid derivative is part of the chain. Without it, you are computing the gradient as if the activation were linear, which gives wrong weight updates.

```python
# Wrong: missing sigmoid derivative in the chain
delta = error  # only the error, no activation derivative

# Correct: error * sigmoid derivative
sigmoid_deriv = y_hat * (1.0 - y_hat)
delta = error * sigmoid_deriv
```

---

## In the GPT Project

This becomes `foundations/backprop.py`. In practice, PyTorch's autograd computes all of this automatically when you call `loss.backward()`. But understanding the manual chain-rule computation explains what happens under the hood and why vanishing gradients occur (the sigmoid derivative term can be very small).
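The correspondence can be seen directly: building the same forward pass in PyTorch and calling `loss.backward()` should reproduce the manual gradients from the walkthrough. A minimal sketch, assuming PyTorch is installed:

```python
import torch

x = torch.tensor([1.0, 2.0])
w = torch.tensor([0.5, -0.5], requires_grad=True)
b = torch.tensor(0.1, requires_grad=True)
y_true = torch.tensor(1.0)

# Same forward pass as the manual solution
z = torch.dot(x, w) + b
y_hat = torch.sigmoid(z)
loss = 0.5 * (y_hat - y_true) ** 2

loss.backward()  # autograd applies the same chain rule backward

print(w.grad, b.grad)  # approximately delta * x and delta from the walkthrough
```

Autograd is doing nothing more mysterious than recording the forward operations and chaining their local derivatives, exactly as this article does by hand.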
---

## Key Takeaways

- Backpropagation decomposes the gradient into a chain of simple derivatives. Each link in the chain corresponds to one operation in the forward pass.
- The "delta" term (error times activation derivative) is the core building block. In deeper networks, deltas propagate backward through layers.
- The sigmoid derivative $\hat{y}(1-\hat{y})$ peaks at 0.25 when $\hat{y} = 0.5$ and approaches 0 at the extremes, which is why deep sigmoid networks suffer from vanishing gradients.

articles/basics-of-pytorch.md

Lines changed: 111 additions & 0 deletions
## Prerequisites

Before attempting this problem, you should be comfortable with:

- **NumPy Array Operations** - PyTorch tensors mirror NumPy arrays in API and behavior, so if you know NumPy, you already know most of PyTorch's tensor API
- **Reshaping and Dimension Semantics** - Understanding what `dim=0` vs `dim=1` means in a 2D tensor is critical, because getting the wrong dimension silently produces wrong results

---

## Concept

PyTorch is the most widely used deep learning framework. Its core data structure is the **tensor**, which works like a NumPy array but adds two superpowers: automatic differentiation (autograd) and GPU acceleration.

**Reshaping** changes a tensor's dimensions without touching its data. A $(2, 4)$ tensor has 8 elements, and you can reshape it to $(4, 2)$, $(1, 8)$, or $(8, 1)$. The total element count must stay the same. This is used constantly: for example, flattening a 2D image into a 1D vector for a linear layer.

**Averaging** along a dimension collapses that dimension. For a $(3, 2)$ tensor, averaging along `dim=0` (rows) produces a $(2,)$ tensor with column means. Averaging along `dim=1` (columns) produces a $(3,)$ tensor with row means. Think of `dim=` as "the dimension that disappears."

**Concatenation** joins tensors along a dimension. Concatenating two $(2, 3)$ tensors along `dim=1` produces a $(2, 6)$ tensor. The tensors must agree on all other dimensions.

**MSE Loss** is available as `torch.nn.functional.mse_loss`, computing $\frac{1}{N}\sum(pred_i - target_i)^2$. Using built-in loss functions is preferred because they handle edge cases and are numerically stable.
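The MSE formula above can be checked against the built-in in a couple of lines. A minimal sketch, assuming PyTorch is installed; the example tensors are illustrative.

```python
import torch
import torch.nn.functional as F

pred = torch.tensor([1.0, 2.0])
target = torch.tensor([1.5, 2.5])

# Manual mean of squared differences: (0.25 + 0.25) / 2
manual = torch.mean((pred - target) ** 2)
builtin = F.mse_loss(pred, target)

assert torch.allclose(manual, builtin)
print(builtin.item())  # 0.25
```

For a simple case like this the two agree exactly; the built-in earns its keep on reductions, broadcasting, and numerical edge cases.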
---

## Solution

### Intuition

Each method exercises a core PyTorch operation. We use `torch.reshape` for reshaping, `torch.mean` with a dimension argument for averaging, `torch.cat` for concatenation, and `torch.nn.functional.mse_loss` for the loss computation.

### Implementation

```python
import torch
import torch.nn.functional as F
from torchtyping import TensorType


class Solution:
    def reshape(self, to_reshape: TensorType[float]) -> TensorType[float]:
        # Reshape (M, N) into (M*N // 2, 2); element count is preserved
        M, N = to_reshape.shape
        reshaped = torch.reshape(to_reshape, (M * N // 2, 2))
        return torch.round(reshaped, decimals=4)

    def average(self, to_avg: TensorType[float]) -> TensorType[float]:
        # dim=0 collapses the row dimension, producing column means
        averaged = torch.mean(to_avg, dim=0)
        return torch.round(averaged, decimals=4)

    def concatenate(self, cat_one: TensorType[float], cat_two: TensorType[float]) -> TensorType[float]:
        # dim=1 joins side by side; row counts must match
        concatenated = torch.cat((cat_one, cat_two), dim=1)
        return torch.round(concatenated, decimals=4)

    def get_loss(self, prediction: TensorType[float], target: TensorType[float]) -> TensorType[float]:
        loss = F.mse_loss(prediction, target)
        return torch.round(loss, decimals=4)
```
### Walkthrough

| Operation | Input | Computation | Output |
|---|---|---|---|
| Reshape $(2,4) \to (4,2)$ | $[[1,2,3,4],[5,6,7,8]]$ | Split into pairs of 2 | $[[1,2],[3,4],[5,6],[7,8]]$ |
| Average `dim=0` | $[[1,2],[3,4],[5,6]]$ | Column means | $[3.0, 4.0]$ |
| Concat `dim=1` | $[[1,2],[3,4]]$ and $[[5,6],[7,8]]$ | Side by side | $[[1,2,5,6],[3,4,7,8]]$ |
| MSE Loss | pred=$[1.0,2.0]$, target=$[1.5,2.5]$ | $(0.25+0.25)/2$ | $0.25$ |
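The first three rows of the table can be reproduced directly. A standalone sketch, assuming PyTorch is installed:

```python
import torch

# Reshape: (2, 4) -> (4, 2), same 8 elements in the same order
t = torch.tensor([[1., 2., 3., 4.], [5., 6., 7., 8.]])
reshaped = torch.reshape(t, (4, 2))

# Average along dim=0: the row dimension disappears, leaving column means
m = torch.tensor([[1., 2.], [3., 4.], [5., 6.]])
col_means = torch.mean(m, dim=0)

# Concatenate along dim=1: tensors joined side by side
a = torch.tensor([[1., 2.], [3., 4.]])
b = torch.tensor([[5., 6.], [7., 8.]])
joined = torch.cat((a, b), dim=1)

print(reshaped.shape)       # torch.Size([4, 2])
print(col_means.tolist())   # [3.0, 4.0]
print(joined.tolist())      # [[1.0, 2.0, 5.0, 6.0], [3.0, 4.0, 7.0, 8.0]]
```

Printing `.shape` after each operation like this is the quickest way to catch a wrong `dim=` argument before it silently corrupts downstream results.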
### Time & Space Complexity

- Time: $O(n)$ for each operation, where $n$ is the total number of elements
- Space: $O(n)$ for the output tensors

---

## Common Pitfalls

### Wrong Dimension for Averaging

`dim=0` averages across rows (producing column-wise means); `dim=1` averages across columns (producing row-wise means). These are easy to confuse.

```python
# Wrong: averages across columns instead of rows
averaged = torch.mean(to_avg, dim=1)

# Correct: averages across rows (column means)
averaged = torch.mean(to_avg, dim=0)
```

### Mismatched Shapes for Concatenation

Concatenation along `dim=1` requires the same number of rows. Different row counts cause a runtime error.

```python
# Wrong: different number of rows (2 vs 3)
torch.cat((torch.zeros(2, 3), torch.zeros(3, 3)), dim=1)

# Correct: same number of rows
torch.cat((torch.zeros(2, 3), torch.zeros(2, 3)), dim=1)
```

---

## In the GPT Project

This becomes `foundations/pytorch_basics.py`. Every subsequent problem uses these operations. Reshaping appears when flattening logits for cross-entropy loss. Averaging is used in layer normalization. Concatenation joins multi-head attention outputs. MSE loss trains regression models.
---

## Key Takeaways

- PyTorch tensors mirror NumPy arrays in API but add autograd and GPU support, making them the foundation of modern deep learning.
- `torch.reshape`, `torch.mean`, and `torch.cat` are the three tensor manipulation functions you will use most often.
- Using built-in functions like `mse_loss` is safer than manual computation because they handle numerical edge cases automatically.

articles/build-vocabulary.md

Lines changed: 112 additions & 0 deletions
## Prerequisites

Before attempting this problem, you should be comfortable with:

- **Python Dictionaries** - Creating bidirectional mappings between keys and values, because the vocabulary is a pair of dictionaries (string-to-int and int-to-string)
- **Character-Level Tokenization** - Treating each character as an individual token, which is the simplest tokenization approach and the one used in this course's GPT model

---

## Concept

Before a language model can process text, it needs a vocabulary: a bidirectional mapping between characters (or tokens) and integers. The model works with integers internally, so we need `encode` to convert text to numbers and `decode` to convert numbers back to text.

The construction process is:

1. **Extract unique characters** from the training text.
2. **Sort them** alphabetically for deterministic ordering.
3. **Build `stoi`** (string-to-integer): assign each character a unique index starting from 0.
4. **Build `itos`** (integer-to-string): the reverse mapping.

This is character-level tokenization. The vocabulary size equals the number of unique characters in the training data, typically 50-100 for English text. Compare this to BPE (50,000+ tokens) or word-level (100,000+ tokens). Character-level vocabularies produce much longer sequences but never encounter out-of-vocabulary tokens (as long as the character appeared in training).
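For a concrete sense of scale: a character-level vocabulary is just the count of unique characters, which stays small even for long texts. A tiny sketch using an illustrative sample string (not from the course dataset):

```python
# Character-level vocabulary size = number of unique characters
sample = "To be, or not to be, that is the question."
vocab_size = len(set(sample))
print(vocab_size)  # 16: thirteen letters plus space, comma, and period
```

The same counting on a full book of English text typically lands in the 50-100 range mentioned above, since the character set barely grows with text length.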
The `encode`/`decode` functions must be inverses: `decode(encode(text)) == text`. This round-trip property is essential. If you cannot perfectly reconstruct the original text, the model cannot learn the correct input-output mapping.
---

## Solution

### Intuition

Extract unique characters with `set()`, sort them, and build two dictionaries with `enumerate`. Encoding is a list comprehension of dictionary lookups. Decoding joins the looked-up characters.

### Implementation

```python
from typing import Dict, List, Tuple


class Solution:
    def build_vocab(self, text: str) -> Tuple[Dict[str, int], Dict[int, str]]:
        # Sorted unique characters give deterministic ID assignment
        chars = sorted(set(text))
        stoi = {ch: i for i, ch in enumerate(chars)}
        # Derive itos from stoi so the two mappings are exact inverses
        itos = {i: ch for ch, i in stoi.items()}
        return stoi, itos

    def encode(self, text: str, stoi: Dict[str, int]) -> List[int]:
        return [stoi[ch] for ch in text]

    def decode(self, ids: List[int], itos: Dict[int, str]) -> str:
        return ''.join(itos[i] for i in ids)
```
### Walkthrough

For `text = "hello"`:

| Step | Input | Output |
|---|---|---|
| Unique chars | "hello" | `{'h', 'e', 'l', 'o'}` |
| Sort | set | `['e', 'h', 'l', 'o']` |
| stoi | sorted chars | `{'e': 0, 'h': 1, 'l': 2, 'o': 3}` |
| itos | reversed | `{0: 'e', 1: 'h', 2: 'l', 3: 'o'}` |
| Encode "hello" | lookup each char | $[1, 0, 2, 2, 3]$ |
| Decode $[1,0,2,2,3]$ | lookup each int | "hello" |

Round-trip: `decode(encode("hello")) = "hello"`.
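The whole walkthrough can be reproduced end to end with the solution's three steps inlined as a standalone sketch:

```python
text = "hello"

# Build the vocabulary: sorted unique characters, then both mappings
chars = sorted(set(text))                     # ['e', 'h', 'l', 'o']
stoi = {ch: i for i, ch in enumerate(chars)}  # {'e': 0, 'h': 1, 'l': 2, 'o': 3}
itos = {i: ch for ch, i in stoi.items()}

# Encode, then decode back
ids = [stoi[ch] for ch in text]
restored = ''.join(itos[i] for i in ids)

print(ids)               # [1, 0, 2, 2, 3]
print(restored == text)  # True
```

The final comparison is the round-trip property in executable form; it should hold for any input text, not just this example.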
### Time & Space Complexity

- Time: $O(N + V \log V)$ for building the vocabulary ($O(N)$ to collect unique characters, $O(V \log V)$ to sort them), and $O(N)$ for encoding/decoding, where $N$ is the text length and $V$ the number of unique characters
- Space: $O(V)$ for the vocabulary dictionaries

---
## Common Pitfalls

### Not Sorting the Unique Characters

Python sets have no guaranteed iteration order. Without sorting, the same text may produce different vocabularies on different runs.

```python
# Wrong: non-deterministic order
chars = list(set(text))

# Correct: sorted for reproducibility
chars = sorted(set(text))
```
### Building itos Incorrectly

The `itos` mapping must be the exact inverse of `stoi`. Here `stoi` is itself built from `enumerate(chars)`, so rebuilding `itos` the same way happens to work, but the two definitions silently drift apart the moment `stoi` changes (for example, when special tokens are added). Deriving `itos` from `stoi` guarantees the inverse relationship.

```python
# Fragile: built independently, breaks if stoi ever changes
itos = {i: ch for i, ch in enumerate(chars)}

# Correct: derive from stoi to guarantee the inverse relationship
itos = {i: ch for ch, i in stoi.items()}
```
---

## In the GPT Project

This becomes `data/vocab.py`. The GPT model in this course uses character-level tokenization, so this vocabulary is what converts raw training text into the integer sequences that the model processes. Production models like GPT-2 use BPE instead, but the concept is the same: a bidirectional mapping between tokens and integers.

---

## Key Takeaways

- A character-level vocabulary is the simplest tokenization approach, with vocabulary size equal to the number of unique characters in the training data.
- The `stoi`/`itos` pair enables lossless round-trip conversion between text and integer sequences, which is a hard requirement for any tokenizer.
- Sorting the unique characters ensures deterministic ID assignment. Without sorting, the same text could produce different vocabularies across runs.
