A reinforcement learning solution to the multi-armed bandit problem, applied to ad click-through rate optimization.
Given 10 ads and 10,000 simulated user rounds, the UCB algorithm learns which ad maximizes clicks by balancing exploration (trying less-known ads) with exploitation (favoring high-performing ones).
At each round, the algorithm selects the ad with the highest upper confidence bound:
UCB(i) = avg(i) + sqrt( (3/2) * ln(n) / N(i) )
- avg(i) -- average reward of ad i so far
- n -- current round number
- N(i) -- number of times ad i has been selected
Ads with no selections yet are assigned an effectively infinite upper bound, ensuring every ad is tried at least once. Over time the exploration bonus shrinks and the algorithm converges on the best ad.
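The selection rule above can be sketched as follows. This is a minimal illustration, not the repository's exact script: the `rewards` matrix and its click probabilities are synthetic stand-ins for the CSV.

```python
import math
import random

random.seed(0)

# Synthetic stand-in for the dataset: 10,000 rounds x 10 ads,
# where ad index 7 has the highest click probability (assumed values).
probs = [0.05, 0.13, 0.09, 0.11, 0.16, 0.04, 0.20, 0.30, 0.10, 0.05]
N, d = 10_000, 10
rewards = [[1 if random.random() < p else 0 for p in probs] for _ in range(N)]

selections = [0] * d   # N(i): times each ad has been selected
sums = [0] * d         # cumulative reward of each ad
ads_selected = []

for n in range(1, N + 1):
    best_ad, best_ucb = 0, -1.0
    for i in range(d):
        if selections[i] == 0:
            ucb = float("inf")  # untried ads get infinite confidence
        else:
            avg = sums[i] / selections[i]
            ucb = avg + math.sqrt(1.5 * math.log(n) / selections[i])
        if ucb > best_ucb:
            best_ad, best_ucb = i, ucb
    reward = rewards[n - 1][best_ad]  # replay the logged outcome
    selections[best_ad] += 1
    sums[best_ad] += reward
    ads_selected.append(best_ad)

# Index of the most frequently selected ad after all rounds
print(max(range(d), key=lambda i: selections[i]))
```

Note how the first `d` rounds try every ad once (each untried ad has an infinite bound), after which the bonus term decays as `N(i)` grows.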
Ads_CTR_Optimisation.csv -- 10,000 rows x 10 columns. Each row is a user round; each column is an ad. Values are binary (1 = click, 0 = no click). The file must be in the same directory as the scripts.
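The CSV acts as a logged reward table: choosing ad `i` at round `n` yields the value at row `n`, column `i`. A loading sketch is below; a synthetic frame of the same shape stands in for the file so the snippet runs anywhere, while the commented line shows the actual load.

```python
import numpy as np
import pandas as pd

# In the repository: dataset = pd.read_csv("Ads_CTR_Optimisation.csv")
# Synthetic stand-in with the same dimensions (values are not the real data):
rng = np.random.default_rng(42)
dataset = pd.DataFrame(rng.integers(0, 2, size=(10_000, 10)),
                       columns=[f"Ad {i + 1}" for i in range(10)])

N, d = dataset.shape  # rounds and ads derived from the data, not hard-coded
print(N, d)           # 10000 10

# Replaying round n: the reward for selecting ad i is dataset.values[n, i]
reward = int(dataset.values[0, 0])
```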
| | Tool | Purpose |
|---|---|---|
| 🐍 | Python 3 | Primary implementation |
| 📈 | Matplotlib | Histogram visualization |
| 🗂️ | Pandas | CSV data loading |
| 📉 | R | Alternative implementation (base R) |
```
pip install matplotlib pandas
```

Run either implementation:

```
python upper_confidence_bound.py
Rscript upper_confidence_bound.R
```

Both scripts output a histogram showing how often each ad was selected; the best-performing ad dominates the distribution.
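The plotting step can be sketched like this. The `ads_selected` list here is hypothetical (a weighted random stand-in for the history the real scripts accumulate during the UCB loop), and the headless backend is only so the sketch runs without a display.

```python
import random

import matplotlib
matplotlib.use("Agg")  # headless backend so the script runs anywhere
import matplotlib.pyplot as plt

# Hypothetical selection history; the real scripts build this list
# round by round, and the winning ad accumulates most selections.
random.seed(1)
weights = [1, 1, 1, 1, 1, 1, 1, 8, 1, 1]
ads_selected = [random.choices(range(10), weights=weights)[0]
                for _ in range(10_000)]

plt.hist(ads_selected, bins=range(11), align="left", rwidth=0.8)
plt.title("Histogram of ad selections")
plt.xlabel("Ad index")
plt.ylabel("Number of times selected")
plt.savefig("ads_histogram.png")
```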
```
upper_confidence_bound.py    # Python implementation
upper_confidence_bound.R     # R implementation
Ads_CTR_Optimisation.csv     # Simulated CTR dataset
LICENSE                      # MIT License
README.md
```
- The simulation uses a fixed dataset rather than a live reward signal -- this is a batch replay, not true online learning.
- Hard-coding the number of rounds (N) or ads (d) beyond the dataset dimensions would cause index errors; the code derives both from the CSV automatically.
MIT