A scalar-valued autograd engine and neural network library built entirely from scratch in Python — no PyTorch, no TensorFlow, just pure math.
Micrograd implements backpropagation (reverse-mode autodiff) over a dynamically built Directed Acyclic Graph (DAG) of scalar values. It supports enough operations to build and train small neural networks — demonstrating the exact same core concepts that power frameworks like PyTorch under the hood.
| Component | Description |
|---|---|
| `Value` | Wraps a scalar with gradient tracking, graph connectivity, and a local `_backward` function |
| `Neuron` | `tanh(w · x + b)` — a single unit with learnable weights and bias |
| `Layer` | A collection of `Neuron` objects operating in parallel |
| `MLP` | A Multi-Layer Perceptron — layers chained sequentially |
Every arithmetic operation (+, *, **, tanh, exp) builds a computational graph. Calling .backward() on the output traverses this graph in reverse topological order, applying the chain rule at each node to compute dL/d(node) for every node.
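As a rough sketch of how one such operation records itself into the graph (a hypothetical, single-operator cut-down of the engine, not the notebook's full `Value` class):

```python
class Value:
    """Minimal sketch: a scalar node with data, grad, and DAG links."""
    def __init__(self, data, _children=()):
        self.data = data
        self.grad = 0.0
        self._prev = set(_children)      # operands that produced this node
        self._backward = lambda: None    # local chain-rule step, set by the op

    def __mul__(self, other):
        out = Value(self.data * other.data, (self, other))
        def _backward():
            # d(out)/d(self) = other.data; scale by upstream grad (chain rule)
            self.grad += other.data * out.grad
            other.grad += self.data * out.grad
        out._backward = _backward
        return out
```

Each operator returns a new node that remembers its inputs and how to route gradients back to them; `.backward()` then just fires these `_backward` closures in reverse topological order.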
A simple expression: `L = (a * b + c) * f`

```python
a = Value(2.0); b = Value(-3.0)
c = Value(10.0); f = Value(-2.0)

e = a * b    # -6.0
d = e + c    # 4.0
L = d * f    # -8.0
```

Gradients computed manually using the chain rule:
| Node | Formula | Gradient |
|---|---|---|
| `L` | dL/dL = 1 | 1.0 |
| `f` | dL/df = d | 4.0 |
| `d` | dL/dd = f | -2.0 |
| `c` | dL/dd · dd/dc = -2.0 × 1.0 | -2.0 |
| `e` | dL/dd · dd/de = -2.0 × 1.0 | -2.0 |
| `b` | dL/de · de/db = -2.0 × 2.0 | -4.0 |
| `a` | dL/de · de/da = -2.0 × (-3.0) | 6.0 |
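These hand-derived gradients can be double-checked numerically with central finite differences (a quick standalone sketch, not part of the notebook):

```python
def L(a, b, c, f):
    """The example expression: L = (a * b + c) * f."""
    return (a * b + c) * f

h = 1e-6
base = (2.0, -3.0, 10.0, -2.0)   # the values of a, b, c, f above
grads = []
for i in range(4):
    plus, minus = list(base), list(base)
    plus[i] += h
    minus[i] -= h
    # central difference approximates dL/d(arg i)
    grads.append((L(*plus) - L(*minus)) / (2 * h))
# grads ≈ [6.0, -4.0, -2.0, 4.0], matching the table for a, b, c, f
```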
Computational graph for Example 1 showing data and grad at each node
Models a single neuron: `o = tanh(x1·w1 + x2·w2 + b)`

```python
x1, x2 = Value(2.0), Value(0.0)    # inputs
w1, w2 = Value(-3.0), Value(1.0)   # weights
b = Value(6.8814)                  # bias

n = x1*w1 + x2*w2 + b   # 0.8814
o = n.tanh()            # 0.7071
```

The tanh derivative: do/dn = 1 - tanh(n)² = 1 - 0.7071² = 0.5
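A quick standalone check of those numbers with Python's `math` module:

```python
import math

n = 0.8814
t = math.tanh(n)      # ≈ 0.7071, the neuron's output
deriv = 1 - t ** 2    # ≈ 0.5, the local tanh derivative do/dn
```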
Computational graph for a single neuron with tanh activation
Same neuron as above, but using the automatic `backward()` method:

```python
o.backward()   # topological sort + reverse traversal — does everything!
```

This replaces all the manual gradient assignments. Results are identical to Example 2.
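Under the hood, `backward()` amounts to a depth-first topological sort followed by a reverse pass. A standalone sketch of that logic (using a stand-in `Node` class rather than the notebook's `Value`):

```python
class Node:
    """Tiny stand-in for Value with just the fields backward() needs."""
    def __init__(self, data, prev=()):
        self.data, self.grad = data, 0.0
        self._prev = prev
        self._backward = lambda: None

def backward(root):
    # 1. Topological sort of the DAG rooted at `root`
    topo, visited = [], set()
    def build(v):
        if v not in visited:
            visited.add(v)
            for child in v._prev:
                build(child)
            topo.append(v)
    build(root)
    # 2. Seed dL/dL = 1 and apply each local chain-rule step in reverse order
    root.grad = 1.0
    for node in reversed(topo):
        node._backward()
```

The reverse order guarantees that every node's upstream gradient is fully accumulated before its own `_backward` fires.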
Proves that tanh can be broken into its raw mathematical operations and backprop still works correctly:
```python
# Instead of o = n.tanh():
e = (2 * n).exp()        # e^(2n)
o = (e - 1) / (e + 1)    # manual tanh formula
o.backward()             # all gradients match!
```
tanh decomposed into exp, subtraction, addition, and division — gradients propagate correctly through all primitives
Key Insight: You can compose any differentiable operations and the autograd engine will figure out all the gradients automatically.
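The identity behind Example 4 can be verified directly in plain Python: (e^(2n) − 1) / (e^(2n) + 1) equals tanh(n), so decomposing the activation changes nothing numerically:

```python
import math

n = 0.8814
e = math.exp(2 * n)
manual = (e - 1) / (e + 1)   # tanh built from exp, sub, add, div
builtin = math.tanh(n)       # the two agree to machine precision
```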
Multi-Layer Perceptron: inputs (blue) → hidden layers → outputs (green)
```python
class Neuron:
    """Single neuron: out = tanh(Σ(wi * xi) + b)"""
    def __init__(self, nin):
        self.w = [Value(random.uniform(-1, 1)) for _ in range(nin)]
        self.b = Value(random.uniform(-1, 1))

class Layer:
    """Collection of neurons operating in parallel"""
    def __init__(self, nin, nout):
        self.neurons = [Neuron(nin) for _ in range(nout)]

class MLP:
    """Multi-Layer Perceptron — layers chained sequentially"""
    def __init__(self, nin, nouts):
        sz = [nin] + nouts
        self.layers = [Layer(sz[i], sz[i+1]) for i in range(len(nouts))]
```

In the notebook, `MLP(3, [4, 4, 1])` creates a network with 41 parameters:
| Layer | Shape | Parameters |
|---|---|---|
| Hidden 1 | 3 → 4 | 3×4 + 4 = 16 |
| Hidden 2 | 4 → 4 | 4×4 + 4 = 20 |
| Output | 4 → 1 | 4×1 + 1 = 5 |
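The count in the table can be reproduced with a one-liner, since each neuron owns `nin` weights plus one bias:

```python
sz = [3, 4, 4, 1]   # layer sizes for MLP(3, [4, 4, 1])
# (inputs + 1 bias) parameters per neuron, times neurons per layer
params = sum((sz[i] + 1) * sz[i + 1] for i in range(len(sz) - 1))
# 16 + 20 + 5 = 41
```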
The training process follows a four-step cycle repeated for K iterations:
┌─────────────┐ ┌──────────────┐ ┌──────────────┐ ┌──────────────┐
│ 1. Forward │ ──▶ │ 2. Zero Grad │ ──▶ │ 3. Backward │ ──▶ │ 4. Update │
│ Pass │ │ │ │ Pass │ │ Weights │
└─────────────┘ └──────────────┘ └──────────────┘ └──────────────┘
▲ │
└──────────────────────────────────────────────────────────────┘
```python
for k in range(20):
    # 1. Forward pass
    ypred = [n(x) for x in xs]
    loss = sum(((yout - ygt) ** 2 for ygt, yout in zip(ys, ypred)), Value(0.0))

    # 2. Zero gradients
    for p in n.parameters():
        p.grad = 0.0

    # 3. Backward pass
    loss.backward()

    # 4. Update weights (gradient descent)
    for p in n.parameters():
        p.data += -0.5 * p.grad   # learning rate = 0.5
```

Loss function: Mean Squared Error → L = Σ(y_pred - y_true)²
Update rule: w = w - lr × dL/dw
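The same update rule drives any gradient descent. A toy standalone example (not from the notebook) minimizing f(w) = (w − 3)²:

```python
w, lr = 0.0, 0.1
for _ in range(100):
    grad = 2 * (w - 3.0)   # df/dw for f(w) = (w - 3)^2
    w += -lr * grad        # identical update rule: w = w - lr * dL/dw
# w converges toward the minimizer, 3.0
```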
| Micrograd | PyTorch Equivalent |
|---|---|
| `Value` | `torch.Tensor` (with `requires_grad=True`) |
| `.backward()` | `loss.backward()` |
| `p.data += -lr * p.grad` | `optimizer.step()` |
| `p.grad = 0.0` | `optimizer.zero_grad()` |
The only real difference: PyTorch operates on tensors (batched multi-dimensional arrays) instead of scalars, for GPU-accelerated efficiency.
micrograd/
├── micrograd.ipynb # Full implementation notebook
├── README.md
└── images/
├── img1.png # Example 1 — expression graph
├── img2.png # Example 2 — single neuron graph
├── img3.png # Example 4 — decomposed tanh graph
└── img4.png # MLP architecture diagram
```bash
git clone https://github.com/sherurox/micrograd.git
cd micrograd
pip install graphviz numpy matplotlib
jupyter notebook micrograd.ipynb
```

Note: You need Graphviz installed on your system for the `draw_dot` visualization to work.
- Andrej Karpathy — "The spelled-out intro to neural networks and backpropagation: building micrograd"
- Karpathy's micrograd repository
Shreyas Khandale
MS Computer Science (AI Track) — Binghamton University
GitHub · Email