
Commit de5218e

2 parents a5cd393 + 7638cb3

31 files changed

Lines changed: 3773 additions & 6 deletions

articles/backpropagation.md

Lines changed: 132 additions & 0 deletions
@@ -0,0 +1,132 @@
## Prerequisites

Before attempting this problem, you should be comfortable with:

- **Chain Rule** - Computing derivatives of composite functions like $f(g(h(x)))$ by multiplying the individual derivatives, because backpropagation is literally the chain rule applied systematically
- **Single Neuron Forward Pass** - You need the forward pass ($z = x \cdot w + b$, then sigmoid) before you can compute gradients going backward
- **Gradient Descent** - Once you have the gradients, you update weights by stepping opposite to them

---
## Concept

Backpropagation is how neural networks learn. It computes how much each weight and bias contributed to the total loss by applying the chain rule backward through the computation graph.

For a single neuron with sigmoid activation and squared error loss, the chain of computation is:

$$z = x \cdot w + b \quad \rightarrow \quad \hat{y} = \sigma(z) \quad \rightarrow \quad L = \frac{1}{2}(\hat{y} - y)^2$$

To find $\frac{\partial L}{\partial w}$, we chain the derivatives step by step:

$$\frac{\partial L}{\partial w} = \frac{\partial L}{\partial \hat{y}} \cdot \frac{\partial \hat{y}}{\partial z} \cdot \frac{\partial z}{\partial w}$$

Each piece is simple on its own:

- $\frac{\partial L}{\partial \hat{y}} = \hat{y} - y$ (error signal)
- $\frac{\partial \hat{y}}{\partial z} = \hat{y}(1 - \hat{y})$ (sigmoid derivative, which has this elegant form)
- $\frac{\partial z}{\partial w} = x$ (the input itself)
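
The elegant form in the second bullet follows directly from the definition $\sigma(z) = \frac{1}{1 + e^{-z}}$:

$$\frac{d\sigma}{dz} = \frac{e^{-z}}{(1 + e^{-z})^2} = \frac{1}{1 + e^{-z}} \cdot \frac{e^{-z}}{1 + e^{-z}} = \sigma(z)\bigl(1 - \sigma(z)\bigr) = \hat{y}(1 - \hat{y})$$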
The product of the first two, $(\hat{y} - y) \cdot \hat{y}(1 - \hat{y})$, is called the **delta**. It captures "how wrong are we, scaled by how sensitive the activation is at this operating point." The weight gradient is delta times the input. The bias gradient is just delta (since $\frac{\partial z}{\partial b} = 1$).

---

## Solution

### Intuition

We run the forward pass to get $\hat{y}$, compute the delta term (error times sigmoid derivative), then multiply by each input to get the weight gradients. The bias gradient is the delta itself.

### Implementation

::tabs-start

```python
import numpy as np
from numpy.typing import NDArray
from typing import Tuple


class Solution:
    def backward(self, x: NDArray[np.float64], w: NDArray[np.float64], b: float, y_true: float) -> Tuple[NDArray[np.float64], float]:
        # Forward pass: linear combination, then sigmoid
        z = np.dot(x, w) + b
        y_hat = 1.0 / (1.0 + np.exp(-z))

        # Chain rule pieces
        error = y_hat - y_true                 # dL/dy_hat
        sigmoid_deriv = y_hat * (1.0 - y_hat)  # dy_hat/dz
        delta = error * sigmoid_deriv          # dL/dz

        dL_dw = np.round(delta * x, 5)         # dz/dw = x
        dL_db = round(float(delta), 5)         # dz/db = 1

        return (dL_dw, dL_db)
```

::tabs-end
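
As a quick sanity check, the method can be called with the walkthrough inputs below; this driver snippet is illustrative only and not part of the graded solution:

```python
import numpy as np

x = np.array([1.0, 2.0])
w = np.array([0.5, -0.5])
b, y_true = 0.1, 1.0

dL_dw, dL_db = Solution().backward(x, w, b, y_true)
print(dL_dw, dL_db)  # approximately [-0.14384, -0.28768] and -0.14384
```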

### Walkthrough

Given `x = [1.0, 2.0]`, `w = [0.5, -0.5]`, `b = 0.1`, `y_true = 1.0`:

| Step | Computation | Result |
|---|---|---|
| Linear | $z = 1.0(0.5) + 2.0(-0.5) + 0.1$ | $-0.4$ |
| Sigmoid | $\hat{y} = 1/(1 + e^{0.4})$ | $0.40131$ |
| Error | $\hat{y} - y$ | $-0.59869$ |
| Sigmoid deriv | $0.40131 \times 0.59869$ | $0.24026$ |
| Delta | $-0.59869 \times 0.24026$ | $-0.14384$ |
| $dL/dw$ | $[-0.14384 \times 1.0, -0.14384 \times 2.0]$ | $[-0.14384, -0.28768]$ |
| $dL/db$ | $-0.14384$ | $-0.14384$ |
The negative gradients mean: increase $w_0$, increase $w_1$, and increase $b$ to push $\hat{y}$ closer to 1.0.
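
To make that concrete, a single gradient-descent step with these gradients would nudge the parameters as follows (the learning rate of 0.1 is an arbitrary illustrative choice):

```python
import numpy as np

lr = 0.1  # illustrative learning rate
w = np.array([0.5, -0.5])
b = 0.1
dL_dw = np.array([-0.14384, -0.28768])
dL_db = -0.14384

# Step opposite the gradient: negative gradients mean the parameters increase
w_new = w - lr * dL_dw  # approximately [0.51438, -0.47123]
b_new = b - lr * dL_db  # approximately 0.11438
```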

### Time & Space Complexity

- Time: $O(d)$ where $d$ is the number of input features
- Space: $O(d)$ for the weight gradient vector

---

## Common Pitfalls
### Getting the Error Direction Wrong

The error is $\hat{y} - y$, not $y - \hat{y}$. Flipping it negates all gradients, making the model move away from the target.

::tabs-start

```python
# Wrong: inverted error
error = y_true - y_hat

# Correct: prediction minus truth
error = y_hat - y_true
```

::tabs-end

### Forgetting the Sigmoid Derivative

The sigmoid derivative is part of the chain. Without it, you are computing the gradient as if the activation were linear, which gives wrong weight updates.

::tabs-start

```python
# Wrong: missing sigmoid derivative in the chain
delta = error  # only the error, no activation derivative

# Correct: error * sigmoid derivative
sigmoid_deriv = y_hat * (1.0 - y_hat)
delta = error * sigmoid_deriv
```

::tabs-end

---

## In the GPT Project
This becomes `foundations/backprop.py`. In practice, PyTorch's autograd computes all of this automatically when you call `loss.backward()`. But understanding the manual chain-rule computation explains what happens under the hood and why vanishing gradients occur (the sigmoid derivative term can be very small).
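
For intuition, autograd reproduces the hand-derived gradients; the snippet below is a minimal sketch with toy values, not code from the course:

```python
import torch

x = torch.tensor([1.0, 2.0])
w = torch.tensor([0.5, -0.5], requires_grad=True)
b = torch.tensor(0.1, requires_grad=True)
y_true = torch.tensor(1.0)

y_hat = torch.sigmoid(x @ w + b)
loss = 0.5 * (y_hat - y_true) ** 2
loss.backward()  # autograd applies the same chain rule

print(w.grad, b.grad)  # approximately [-0.1438, -0.2877] and -0.1438
```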

---

## Key Takeaways

- Backpropagation decomposes the gradient into a chain of simple derivatives. Each link in the chain corresponds to one operation in the forward pass.
- The "delta" term (error times activation derivative) is the core building block. In deeper networks, deltas propagate backward through layers.
- The sigmoid derivative $\hat{y}(1-\hat{y})$ peaks at 0.25 when $\hat{y} = 0.5$ and approaches 0 at the extremes, which is why deep sigmoid networks suffer from vanishing gradients.

articles/basics-of-pytorch.md

Lines changed: 120 additions & 0 deletions
@@ -0,0 +1,120 @@
## Prerequisites

Before attempting this problem, you should be comfortable with:

- **NumPy Array Operations** - PyTorch tensors mirror NumPy arrays in API and behavior, so if you know NumPy, you already know most of PyTorch's tensor API
- **Reshaping and Dimension Semantics** - Understanding what `dim=0` vs `dim=1` means in a 2D tensor is critical because getting the wrong dimension silently produces wrong results

---
## Concept

PyTorch is the most widely used deep learning framework. Its core data structure is the **tensor**, which works like a NumPy array but adds two superpowers: automatic differentiation (autograd) and GPU acceleration.

**Reshaping** changes a tensor's dimensions without touching its data. A $(2, 4)$ tensor has 8 elements, and you can reshape it to $(4, 2)$, $(1, 8)$, or $(8, 1)$. The total element count must stay the same. This is used constantly: for example, flattening a 2D image into a 1D vector for a linear layer.

**Averaging** along a dimension collapses that dimension. For a $(3, 2)$ tensor, averaging along `dim=0` (rows) produces a $(2,)$ tensor with column means. Averaging along `dim=1` (columns) produces a $(3,)$ tensor with row means. Think of `dim=` as "the dimension that disappears."

**Concatenation** joins tensors along a dimension. Concatenating two $(2, 3)$ tensors along `dim=1` produces a $(2, 6)$ tensor. The tensors must agree on all other dimensions.

**MSE Loss** is available as `torch.nn.functional.mse_loss`, computing $\frac{1}{N}\sum(pred_i - target_i)^2$. Using built-in loss functions is preferred because they handle edge cases and are numerically stable.
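
A few toy lines (illustrative tensors of my own choosing, not part of the problem) make the shape rules concrete:

```python
import torch

t = torch.arange(8.0).reshape(2, 4)               # shape (2, 4)
print(t.reshape(4, 2).shape)                      # torch.Size([4, 2]) - same 8 elements

m = torch.tensor([[1., 2.], [3., 4.], [5., 6.]])  # shape (3, 2)
print(m.mean(dim=0))                              # tensor([3., 4.]) - dim 0 disappears
print(m.mean(dim=1))                              # tensor([1.5000, 3.5000, 5.5000])

a, b = torch.zeros(2, 3), torch.ones(2, 3)
print(torch.cat((a, b), dim=1).shape)             # torch.Size([2, 6])
```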
---

## Solution

### Intuition

Each method exercises a core PyTorch operation. We use `torch.reshape` for reshaping, `torch.mean` with a dimension argument for averaging, `torch.cat` for concatenation, and `torch.nn.functional.mse_loss` for the loss computation.

### Implementation

::tabs-start

```python
import torch
import torch.nn
from torchtyping import TensorType


class Solution:
    def reshape(self, to_reshape: TensorType[float]) -> TensorType[float]:
        # Reshape (M, N) to (M * N // 2, 2) without changing the data
        M, N = to_reshape.shape
        reshaped = torch.reshape(to_reshape, (M * N // 2, 2))
        return torch.round(reshaped, decimals=4)

    def average(self, to_avg: TensorType[float]) -> TensorType[float]:
        # dim=0 collapses the rows, leaving one mean per column
        averaged = torch.mean(to_avg, dim=0)
        return torch.round(averaged, decimals=4)

    def concatenate(self, cat_one: TensorType[float], cat_two: TensorType[float]) -> TensorType[float]:
        # dim=1 places the tensors side by side; row counts must match
        concatenated = torch.cat((cat_one, cat_two), dim=1)
        return torch.round(concatenated, decimals=4)

    def get_loss(self, prediction: TensorType[float], target: TensorType[float]) -> TensorType[float]:
        # Built-in MSE: mean of squared element-wise differences
        loss = torch.nn.functional.mse_loss(prediction, target)
        return torch.round(loss, decimals=4)
```

::tabs-end
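
A quick check against the walkthrough values (illustrative driver code, not part of the solution interface):

```python
import torch

s = Solution()
print(s.reshape(torch.tensor([[1., 2., 3., 4.], [5., 6., 7., 8.]])))   # rows [1,2],[3,4],[5,6],[7,8]
print(s.average(torch.tensor([[1., 2.], [3., 4.], [5., 6.]])))         # tensor([3., 4.])
print(s.concatenate(torch.tensor([[1., 2.], [3., 4.]]),
                    torch.tensor([[5., 6.], [7., 8.]])))               # rows [1,2,5,6],[3,4,7,8]
print(s.get_loss(torch.tensor([1.0, 2.0]), torch.tensor([1.5, 2.5])))  # tensor(0.2500)
```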

### Walkthrough

| Operation | Input | Computation | Output |
|---|---|---|---|
| Reshape $(2,4) \to (4,2)$ | $[[1,2,3,4],[5,6,7,8]]$ | Split into pairs of 2 | $[[1,2],[3,4],[5,6],[7,8]]$ |
| Average `dim=0` | $[[1,2],[3,4],[5,6]]$ | Column means | $[3.0, 4.0]$ |
| Concat `dim=1` | $[[1,2],[3,4]]$ and $[[5,6],[7,8]]$ | Side by side | $[[1,2,5,6],[3,4,7,8]]$ |
| MSE Loss | pred=$[1.0,2.0]$, target=$[1.5,2.5]$ | $(0.25+0.25)/2$ | $0.25$ |

### Time & Space Complexity

- Time: $O(n)$ for each operation where $n$ is the total number of elements
- Space: $O(n)$ for the output tensors
---

## Common Pitfalls

### Wrong Dimension for Averaging

`dim=0` averages across rows (column-wise means), while `dim=1` averages across columns (row-wise means). The two are easy to confuse.

::tabs-start

```python
# Wrong: averages across columns instead of rows
averaged = torch.mean(to_avg, dim=1)

# Correct: averages across rows (column means)
averaged = torch.mean(to_avg, dim=0)
```

::tabs-end

### Mismatched Shapes for Concatenation

Concatenation along `dim=1` requires the same number of rows. Different row counts cause a runtime error.

::tabs-start

```python
# Wrong: different number of rows (2 vs 3)
torch.cat((torch.zeros(2, 3), torch.zeros(3, 3)), dim=1)

# Correct: same number of rows
torch.cat((torch.zeros(2, 3), torch.zeros(2, 3)), dim=1)
```

::tabs-end

---

## In the GPT Project
This becomes `foundations/pytorch_basics.py`. Every subsequent problem uses these operations. Reshaping appears when flattening logits for cross-entropy loss. Averaging is used in layer normalization. Concatenation joins multi-head attention outputs. MSE loss trains regression models.
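
For example, the logit-flattening step mentioned above typically looks something like the sketch below; the shapes and variable names are illustrative, not taken from the course code:

```python
import torch
import torch.nn.functional as F

B, T, vocab_size = 4, 8, 65             # batch, sequence length, vocabulary size (toy values)
logits = torch.randn(B, T, vocab_size)
targets = torch.randint(0, vocab_size, (B, T))

# Cross-entropy expects (N, C) logits and (N,) targets, so reshape first
loss = F.cross_entropy(logits.reshape(B * T, vocab_size), targets.reshape(B * T))
print(loss)
```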

---

## Key Takeaways

- PyTorch tensors mirror NumPy arrays in API but add autograd and GPU support, making them the foundation of modern deep learning.
- `torch.reshape`, `torch.mean`, and `torch.cat` are the three tensor manipulation functions you will use most often.
- Using built-in functions like `mse_loss` is safer than manual computation because they handle numerical edge cases automatically.

articles/build-vocabulary.md

Lines changed: 121 additions & 0 deletions
@@ -0,0 +1,121 @@
## Prerequisites

Before attempting this problem, you should be comfortable with:

- **Python Dictionaries** - Creating bidirectional mappings between keys and values, because the vocabulary is a pair of dictionaries (string-to-int and int-to-string)
- **Character-Level Tokenization** - Treating each character as an individual token, which is the simplest tokenization approach and the one used in this course's GPT model

---
## Concept

Before a language model can process text, it needs a vocabulary: a bidirectional mapping between characters (or tokens) and integers. The model works with integers internally, so we need `encode` to convert text to numbers and `decode` to convert numbers back to text.

The construction process is:

1. **Extract unique characters** from the training text.
2. **Sort them** alphabetically for deterministic ordering.
3. **Build `stoi`** (string-to-integer): assign each character a unique index starting from 0.
4. **Build `itos`** (integer-to-string): the reverse mapping.

This is character-level tokenization. The vocabulary size equals the number of unique characters in the training data, typically 50-100 for English text. Compare this to BPE (50,000+ tokens) or word-level (100,000+ tokens) vocabularies. Character-level vocabularies produce much longer sequences but never encounter out-of-vocabulary tokens (as long as every character in the input appeared in the training text).

The `encode`/`decode` functions must be inverses: `decode(encode(text)) == text`. This round-trip property is essential. If you cannot perfectly reconstruct the original text, the model cannot learn the correct input-output mapping.
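
That property is cheap to verify directly; a minimal standalone check (independent of the graded solution) looks like this:

```python
# Round-trip sanity check on a toy string
text = "hello world"
chars = sorted(set(text))
stoi = {ch: i for i, ch in enumerate(chars)}
itos = {i: ch for ch, i in stoi.items()}

encoded = [stoi[ch] for ch in text]
decoded = ''.join(itos[i] for i in encoded)
assert decoded == text  # lossless round trip
```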
---

## Solution

### Intuition

Extract the unique characters with `set()`, sort them, and build the two dictionaries with `enumerate`. Encoding is a list comprehension of dictionary lookups; decoding joins the looked-up characters.

### Implementation

::tabs-start

```python
from typing import Dict, List, Tuple


class Solution:
    def build_vocab(self, text: str) -> Tuple[Dict[str, int], Dict[int, str]]:
        # Unique characters, sorted for deterministic ID assignment
        chars = sorted(set(text))
        stoi = {ch: i for i, ch in enumerate(chars)}
        itos = {i: ch for ch, i in stoi.items()}  # exact inverse of stoi
        return stoi, itos

    def encode(self, text: str, stoi: Dict[str, int]) -> List[int]:
        return [stoi[ch] for ch in text]

    def decode(self, ids: List[int], itos: Dict[int, str]) -> str:
        return ''.join(itos[i] for i in ids)
```

::tabs-end
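
Running the walkthrough example through the class directly (illustrative driver code):

```python
s = Solution()
stoi, itos = s.build_vocab("hello")
print(stoi)                 # {'e': 0, 'h': 1, 'l': 2, 'o': 3}
ids = s.encode("hello", stoi)
print(ids)                  # [1, 0, 2, 2, 3]
print(s.decode(ids, itos))  # hello
```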

### Walkthrough

For `text = "hello"`:

| Step | Input | Output |
|---|---|---|
| Unique chars | "hello" | `{'h', 'e', 'l', 'o'}` |
| Sort | set | `['e', 'h', 'l', 'o']` |
| stoi | sorted chars | `{'e': 0, 'h': 1, 'l': 2, 'o': 3}` |
| itos | reversed | `{0: 'e', 1: 'h', 2: 'l', 3: 'o'}` |
| Encode "hello" | lookup each char | $[1, 0, 2, 2, 3]$ |
| Decode $[1,0,2,2,3]$ | lookup each int | "hello" |

Round-trip: `decode(encode("hello")) == "hello"`.

### Time & Space Complexity

- Time: $O(N + V \log V)$ to build the vocabulary (scan the text, then sort the $V$ unique characters) and $O(N)$ for encoding/decoding, where $N$ is the text length
- Space: $O(V)$ for the vocabulary dictionaries, where $V$ is the number of unique characters
---

## Common Pitfalls

### Not Sorting the Unique Characters

Python sets have no guaranteed iteration order. Without sorting, the same text may produce different vocabularies on different runs.

::tabs-start

```python
# Wrong: non-deterministic order
chars = list(set(text))

# Correct: sorted for reproducibility
chars = sorted(set(text))
```

::tabs-end

### Building itos Incorrectly

The `itos` mapping must be the exact inverse of `stoi`. Building it independently is fragile: it only stays correct if it uses exactly the same ordered character list, and any mismatch silently breaks the pairing. Deriving it from `stoi` guarantees the inverse relationship.

::tabs-start

```python
# Wrong: built independently from chars; breaks if chars differs from the list used for stoi
itos = {i: ch for i, ch in enumerate(chars)}

# Correct: derive from stoi to guarantee the inverse relationship
itos = {i: ch for ch, i in stoi.items()}
```

::tabs-end

---

## In the GPT Project
This becomes `data/vocab.py`. The GPT model in this course uses character-level tokenization, so this vocabulary is what converts raw training text into the integer sequences that the model processes. Production models like GPT-2 use BPE instead, but the concept is the same: a bidirectional mapping between tokens and integers.
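
In that pipeline, the encoded IDs typically end up in a long-integer tensor ready for batching. A rough sketch, assuming a hypothetical `input.txt` training file and the `Solution` class above:

```python
import torch

with open("input.txt", "r", encoding="utf-8") as f:  # hypothetical training text
    text = f.read()

s = Solution()
stoi, itos = s.build_vocab(text)
data = torch.tensor(s.encode(text, stoi), dtype=torch.long)
print(data.shape, data.dtype)  # one integer ID per character in the corpus
```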

---

## Key Takeaways

- A character-level vocabulary is the simplest tokenization approach, with vocabulary size equal to the number of unique characters in the training data.
- The `stoi`/`itos` pair enables lossless round-trip conversion between text and integer sequences, which is a hard requirement for any tokenizer.
- Sorting the unique characters ensures deterministic ID assignment. Without sorting, the same text could produce different vocabularies across runs.
