---
title: "Math Analysis — Derivatives"
description: "Derivative, gradient, and chain rule — the backbone of neural network training"
date: "2025-03-12"
slug: "math-derivatives"
tags:
  - Machine Learning
  - Mathematics
---

Second article in the "Essential Mathematics" series — on derivatives, the gradient, and the chain rule. Without these you can't understand how neural networks learn.

## What is a derivative

### Simple explanation

**The speedometer** is essentially a derivative: how fast the distance traveled changes. Accelerating — speed goes up. Braking — it drops. So the derivative answers: "how much does one quantity change when you change another one a little?"

**A hill.** The steepness of a slope is "how far down you go when you take a step forward." Steep slope — large derivative. Gentle — small. Flat road — zero.

**A function graph.** If you have a graph \( y = f(x) \), the derivative at a point is the **slope** of the graph at that point. For a straight line \( y = kx + b \) the slope is the familiar coefficient \( k \). For a curve, each point has its own slope — that's the derivative.

- Graph rising steeply → positive derivative
- Falling → negative derivative
- Flat (at a top or bottom) → derivative equals zero

### Formal definition

The derivative of a function at a point shows the **rate of change**: how fast the function grows (or decreases) for a small shift in the argument.

For a single-variable function \( f(x) \), the derivative \( f'(x) \) is the limit of the ratio of the change in \( f \) to the change in \( x \) as the change goes to zero:

\[
f'(x) = \lim_{\Delta x \to 0} \frac{f(x + \Delta x) - f(x)}{\Delta x}
\]

Here \( \Delta x \) is the increment of the argument (a small step). In the examples below we use \( h \) for the same quantity.

Geometrically, this is the slope of the tangent line at \( x \):
- \( f'(x) > 0 \) — the function is increasing
- \( f'(x) < 0 \) — the function is decreasing
- \( f'(x) = 0 \) — a possible minimum or maximum
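The limit above is easy to probe numerically: compute the difference quotient for a shrinking step \( h \) and watch where the ratio heads. A minimal sketch (plain Python, names are illustrative):

```python
# Approximate f'(x) with the difference quotient (f(x + h) - f(x)) / h.
def difference_quotient(f, x, h):
    return (f(x + h) - f(x)) / h

f = lambda x: x ** 2

for h in [1, 0.1, 0.01, 0.001]:
    print(h, difference_quotient(f, 2, h))
# The printed ratios approach f'(2) = 4 as h shrinks.
```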
### Numerical example: linear function \( g(x) = 3x - 1 \)

For a straight line the slope is the same everywhere. At any point, e.g. \( x = 5 \), where \( g(5) = 3 \cdot 5 - 1 = 14 \):

\[
\frac{g(5 + h) - g(5)}{h} = \frac{(3(5+h) - 1) - 14}{h} = \frac{14 + 3h - 14}{h} = \frac{3h}{h} = 3
\]

For any \( h \neq 0 \) we get **3** — the derivative is constant, same as the coefficient of \( x \) in the line equation.

| \( x \) | \( g(x) \) | \( g(x+0.1) - g(x) \) | \( \frac{g(x+0.1)-g(x)}{0.1} \) |
|---------|------------|------------------------|--------------|
| 5 | 14 | 0.3 | 3 |
| 10 | 29 | 0.3 | 3 |
| -2 | -7 | 0.3 | 3 |

<div style="margin: 1em 0; aspect-ratio: 320/160; min-height: 160px; max-height: 320px; overflow: hidden; border-radius: 8px; background: #1a1a1a;">
  <iframe src="/games/derivative-graph-linear.html" style="width: 100%; height: 100%; border: none; display: block; background: #1a1a1a;" title="Graph of y = 3x − 1" scrolling="no"></iframe>
</div>

### Numerical example: quadratic function \( f(x) = x^2 \)

Approximate the derivative at \( x = 2 \) using "change in \( f \) over change in \( x \)". Take a small step \( h \):

\[
\frac{f(2 + h) - f(2)}{h} = \frac{(2+h)^2 - 4}{h}
\]

**At \( x = 2 \)** (\( f(2) = 4 \)):

| \( h \) | \( f(2+h) \) | \( f(2+h) - f(2) \) | \( \frac{f(2+h)-f(2)}{h} \) |
|--------|--------------|---------------------|-----------------------------|
| 1 | 9 | 5 | 5 |
| 0.1 | 4.41 | 0.41 | 4.1 |
| 0.01 | 4.0401 | 0.0401 | 4.01 |
| 0.001 | 4.004001 | 0.004001 | 4.001 |

The ratio tends to **4** → \( f'(2) = 4 \).

**At \( x = 4 \)** (\( f(4) = 16 \)):

\[
\frac{f(4 + h) - f(4)}{h} = \frac{(4+h)^2 - 16}{h} = \frac{8h + h^2}{h} = 8 + h \to 8
\]

| \( h \) | \( f(4+h) \) | \( f(4+h) - f(4) \) | \( \frac{f(4+h)-f(4)}{h} \) |
|--------|--------------|---------------------|-----------------------------|
| 1 | 25 | 9 | 9 |
| 0.1 | 16.81 | 0.81 | 8.1 |
| 0.01 | 16.0801 | 0.0801 | 8.01 |

The ratio tends to **8** → \( f'(4) = 8 \). By the rule \( (x^2)' = 2x \): at \( x = 2 \) we get 4, at \( x = 4 \) — 8. For a curve, the derivative is different at each point.

<div style="margin: 1em 0; aspect-ratio: 320/240; min-height: 240px; max-height: 480px; overflow: hidden; border-radius: 8px; background: #1a1a1a;">
  <iframe src="/games/derivative-graph-parabola.html" style="width: 100%; height: 100%; border: none; display: block; background: #1a1a1a;" title="Graph of y = x² and tangent" scrolling="no"></iframe>
</div>
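The rule \( (x^2)' = 2x \) can be sanity-checked at a few points with a small fixed step — a quick sketch:

```python
# Check (x^2)' = 2x numerically: the difference quotient with a
# small h should land near 2x at every point we try.
def approx_derivative(f, x, h=1e-6):
    return (f(x + h) - f(x)) / h

f = lambda x: x ** 2

for x in [2, 4, -3]:
    print(x, approx_derivative(f, x))  # close to 2x: 4, 8, -6
```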
**Why it matters in ML:** training is the minimization of a loss function. We need to know which direction to change the weights so that the loss decreases. The derivative points in the direction of increase; we go the opposite way so the loss drops.

## Partial derivative

A neural network has many weights — the loss depends on thousands or millions of variables. We need to know how the loss changes when we change **each** weight separately.

A **partial derivative** \( \frac{\partial f}{\partial x} \) is the derivative with respect to one variable, treating the others as constants.

Example: \( f(x, y) = x^2 + xy \)
- \( \frac{\partial f}{\partial x} = 2x + y \)
- \( \frac{\partial f}{\partial y} = x \)
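"Hold the others constant" translates directly into code: nudge one variable, keep the rest fixed. A sketch checking both partials at an arbitrary point \( (3, 2) \):

```python
# Numerically check the partial derivatives of f(x, y) = x^2 + x*y
# by nudging one variable while holding the other fixed.
def f(x, y):
    return x ** 2 + x * y

h = 1e-6
x, y = 3.0, 2.0

df_dx = (f(x + h, y) - f(x, y)) / h  # expect 2x + y = 8
df_dy = (f(x, y + h) - f(x, y)) / h  # expect x = 3
print(df_dx, df_dy)
```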
## Gradient

The **gradient** \( \nabla f \) is the vector of all partial derivatives:

\[
\nabla f = \left( \frac{\partial f}{\partial x_1}, \frac{\partial f}{\partial x_2}, \ldots, \frac{\partial f}{\partial x_n} \right)
\]

The gradient points in the direction of **steepest increase** of the function. So \( -\nabla f \) is the direction of steepest **decrease**.

**Gradient descent** — iterative weight updates:

\[
w_{new} = w_{old} - \alpha \cdot \frac{\partial L}{\partial w}
\]

where \( \alpha \) is the learning rate (step size) and \( L \) is the loss. We move in the direction opposite to the gradient so the loss decreases.
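The update rule fits in a few lines. A minimal sketch on a toy function — \( f(w_1, w_2) = w_1^2 + w_2^2 \), with gradient \( (2w_1, 2w_2) \) and minimum at the origin; the function and learning rate are illustrative choices, not from the article:

```python
# Gradient descent on f(w1, w2) = w1^2 + w2^2: step against the gradient.
def grad(w):
    return [2 * w[0], 2 * w[1]]

w = [5.0, -3.0]
alpha = 0.1  # learning rate

for _ in range(100):
    g = grad(w)
    w = [w[0] - alpha * g[0], w[1] - alpha * g[1]]

print(w)  # both coordinates close to 0
```

Each step shrinks every coordinate by a factor \( 1 - 2\alpha = 0.8 \), so the iterate converges geometrically to the minimum.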
## Chain rule

A neural network is a chain of layers: each layer's output is fed to the next. The loss depends on the outputs, and those depend on the weights. To get \( \frac{\partial L}{\partial w} \), we need to "propagate" the derivative backward through this chain.

**Chain rule** for composed functions: if \( z = f(y) \) and \( y = g(x) \), then

\[
\frac{dz}{dx} = \frac{dz}{dy} \cdot \frac{dy}{dx}
\]

For many variables it works the same way: the derivative with respect to an earlier layer's weight is the product of derivatives along the path from that weight to the loss.

**Backpropagation** is the algorithm that does this efficiently: one backward pass through the network computes all needed gradients at once.
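The product of derivatives along the chain can be verified numerically. A sketch with two illustrative functions, \( y = g(x) = 3x + 1 \) and \( z = f(y) = y^2 \), so \( \frac{dz}{dx} = 2y \cdot 3 = 6(3x + 1) \):

```python
# Chain rule check for z = f(g(x)): the numeric derivative of the
# composition should match dz/dy * dy/dx = f'(g(x)) * g'(x).
def g(x): return 3 * x + 1
def f(y): return y ** 2

x = 2.0
h = 1e-6
numeric = (f(g(x + h)) - f(g(x))) / h
analytic = 2 * g(x) * 3  # f'(y) = 2y at y = g(2) = 7, times g'(x) = 3
print(numeric, analytic)  # both near 42
```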
## Example: linear regression

Loss — squared error for a single example (the \( \frac{1}{2} \) just makes the derivative cleaner): \( L = \frac{1}{2}(y - \hat{y})^2 \), where \( \hat{y} = wx + b \).

\[
\frac{\partial L}{\partial w} = \frac{\partial L}{\partial \hat{y}} \cdot \frac{\partial \hat{y}}{\partial w} = -(y - \hat{y}) \cdot x
\]

The larger the error \( (y - \hat{y}) \) and the larger \( x \), the more we adjust \( w \). Makes sense: a big error and an "important" input require a larger update.
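Putting the pieces together — the derivative above plus the gradient-descent update gives a complete training loop. A sketch with made-up data and hyperparameters (the data happens to fit \( y = 2x \)):

```python
# One-variable linear regression trained by gradient descent, using
# dL/dw = -(y - y_hat) * x and, analogously, dL/db = -(y - y_hat).
data = [(1.0, 2.0), (2.0, 4.0), (3.0, 6.0)]  # fits y = 2x exactly
w, b = 0.0, 0.0
alpha = 0.05  # learning rate

for _ in range(500):
    for x, y in data:
        y_hat = w * x + b
        err = y - y_hat
        w += alpha * err * x  # w -= alpha * dL/dw
        b += alpha * err      # b -= alpha * dL/db
print(w, b)  # w close to 2, b close to 0
```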
## Summary

- **Derivative** — rate of change, direction of increase
- **Partial derivative** — with respect to one variable, others fixed
- **Gradient** — vector of partial derivatives, direction of steepest increase
- **Chain rule** — how to compute derivatives along a chain of layers
- **Gradient descent** — update weights in the \( -\nabla L \) direction

PyTorch, TensorFlow, etc. compute gradients automatically (autograd) — but understanding what happens under the hood helps when you hit "exploding gradients" or "the model won't learn".