A collection of Reinforcement Learning projects covering Q-Learning, Policy Gradients, and Actor-Critic methods, built with PyTorch and OpenAI Gym environments.
| # | Project | Algorithm | Notebook | Environment |
|---|---------|-----------|----------|-------------|
| 1 | Q-Learning (Tabular) | Q-Table | 01_q_learning_tabular.ipynb | Discrete State Space |
| 2 | Actor-Critic | A2C | 02_actor_critic_cartpole.ipynb | CartPole-v1 |
| 3 | REINFORCE | Policy Gradient | 03_reinforce_policy_gradient.ipynb | Various Gym Envs |
- PyTorch - Deep RL neural networks
- OpenAI Gym - RL environments
- NumPy - Numerical computing
- Matplotlib - Visualization
- Value-Based Methods - Q-Learning, DQN
- Policy-Based Methods - REINFORCE, Policy Gradients
- Actor-Critic Methods - A2C, A3C variants
- Exploration Strategies - ε-greedy, entropy regularization
- Python 3.8 or higher
- GPU recommended (but not required)
1. Clone the repository
   ```bash
   git clone https://github.com/uzi-gpu/reinforcement-learning.git
   cd reinforcement-learning
   ```
2. Create a virtual environment
   ```bash
   python -m venv venv
   source venv/bin/activate  # On Windows: venv\Scripts\activate
   ```
3. Install dependencies
   ```bash
   pip install -r requirements.txt
   ```
4. Launch Jupyter Notebook
   ```bash
   jupyter notebook
   ```
File: 01_q_learning_tabular.ipynb
Objective: Implement tabular Q-Learning from scratch for discrete state-action spaces
Algorithm: Q-Learning
- Off-policy temporal difference learning
- Updates Q-table using Bellman equation
- ε-greedy exploration strategy
Key Concepts:
- ✅ Q-Table initialization and updates
- ✅ Exploration vs. exploitation trade-off
- ✅ Learning rate (α) and discount factor (γ)
- ✅ Episode-based training
- ✅ Convergence analysis
- ✅ Performance visualization
Implementation Highlights:
Q(s,a) ← Q(s,a) + α[r + γ·max_a' Q(s',a') − Q(s,a)]

Use Cases:
- Small discrete state spaces
- Grid world problems
- Simple game environments
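The update rule above can be sketched in a few lines of NumPy. This is a minimal illustration, not the notebook's exact code; the table shape, hyperparameter values, and function names are assumptions:

```python
import numpy as np

def q_learning_update(Q, s, a, r, s_next, alpha=0.1, gamma=0.99):
    """One tabular Q-Learning step: Q(s,a) <- Q(s,a) + alpha * TD error."""
    td_target = r + gamma * np.max(Q[s_next])  # bootstrap from the best next action
    Q[s, a] += alpha * (td_target - Q[s, a])
    return Q

def epsilon_greedy(Q, s, epsilon, rng):
    """Explore with probability epsilon, otherwise exploit argmax_a Q(s,a)."""
    if rng.random() < epsilon:
        return int(rng.integers(Q.shape[1]))
    return int(np.argmax(Q[s]))

# Tiny demo on a hypothetical 5-state, 2-action problem
rng = np.random.default_rng(0)
Q = np.zeros((5, 2))
Q = q_learning_update(Q, s=0, a=1, r=1.0, s_next=2)
print(Q[0, 1])  # 0.1 after one update from a zero-initialized table
```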
File: 02_actor_critic_cartpole.ipynb
Objective: Solve the CartPole balancing problem using Actor-Critic method
Environment: CartPole-v1
- Goal: Balance pole on cart by moving left/right
- State Space: 4 continuous values (position, velocity, angle, angular velocity)
- Action Space: 2 discrete actions (left, right)
- Reward: +1 for each timestep pole remains upright
Architecture:
Actor Network (Policy):
- Input: State (4 dimensions)
- Hidden: Fully connected layers with ReLU
- Output: Action probabilities (softmax)
Critic Network (Value Function):
- Input: State (4 dimensions)
- Hidden: Fully connected layers with ReLU
- Output: State value V(s)
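The two networks described above can be sketched in PyTorch as follows; the hidden-layer width (128) and class names are illustrative choices, not taken from the notebook:

```python
import torch
import torch.nn as nn

class Actor(nn.Module):
    """Policy network: state -> action probabilities via softmax."""
    def __init__(self, state_dim=4, n_actions=2, hidden=128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, n_actions), nn.Softmax(dim=-1),
        )
    def forward(self, state):
        return self.net(state)

class Critic(nn.Module):
    """Value network: state -> scalar estimate V(s)."""
    def __init__(self, state_dim=4, hidden=128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, 1),
        )
    def forward(self, state):
        return self.net(state)

actor, critic = Actor(), Critic()
state = torch.zeros(1, 4)        # a dummy 4-dimensional CartPole observation
probs, value = actor(state), critic(state)
print(probs.shape, value.shape)  # torch.Size([1, 2]) torch.Size([1, 1])
```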
Training Process:
- ✅ Actor learns optimal policy π(a|s)
- ✅ Critic estimates value function V(s)
- ✅ Advantage function: A(s,a) = R + γV(s') − V(s)
- ✅ Policy gradient update with a baseline
- ✅ Simultaneous actor-critic updates
Key Features:
- ✅ Continuous state space handling
- ✅ On-policy learning
- ✅ Variance reduction through baseline
- ✅ Episode reward tracking
- ✅ Training visualization
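The one-step advantage and the two loss terms listed above can be sketched like this. The specific numbers and the helper name are hypothetical, used only to show the shape of the update:

```python
import torch

def a2c_losses(log_prob, value, reward, next_value, gamma=0.99):
    """One-step advantage A = r + gamma*V(s') - V(s), then actor and critic losses."""
    advantage = reward + gamma * next_value.detach() - value
    actor_loss = -log_prob * advantage.detach()  # policy gradient with baseline
    critic_loss = advantage.pow(2)               # squared TD error for the critic
    return actor_loss, critic_loss

# Hypothetical quantities from a single CartPole transition
log_prob = torch.tensor(-0.7, requires_grad=True)   # log pi(a|s) from the actor
value = torch.tensor(0.5, requires_grad=True)       # V(s) from the critic
next_value = torch.tensor(0.6)                      # V(s') from the critic
actor_loss, critic_loss = a2c_losses(log_prob, value, reward=1.0, next_value=next_value)
(actor_loss + critic_loss).backward()               # joint actor-critic update
```

Detaching the advantage in the actor loss keeps the policy gradient from flowing into the critic, so each network is trained by its own term.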
File: 03_reinforce_policy_gradient.ipynb
Objective: Implement REINFORCE algorithm for policy optimization
Algorithm: REINFORCE (Monte Carlo Policy Gradient)
- Pure policy-based method
- No value function approximation
- Learn policy parameters directly
Mathematical Foundation:
∇J(θ) = E[∑_t ∇log π(a_t|s_t, θ) · G_t]

where G_t is the cumulative discounted return from timestep t.
Implementation:
- ✅ Policy network with softmax output
- ✅ Monte Carlo return estimation
- ✅ Policy gradient calculation
- ✅ Gradient ascent optimization
- ✅ Baseline subtraction (optional)
- ✅ Entropy regularization
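The Monte Carlo return G_t and the REINFORCE objective can be sketched as below; this is an illustrative minimal version, and the function names are not from the notebook:

```python
import torch

def discounted_returns(rewards, gamma=0.99):
    """Compute G_t = r_t + gamma*r_{t+1} + ... by scanning the episode backwards."""
    G, returns = 0.0, []
    for r in reversed(rewards):
        G = r + gamma * G
        returns.append(G)
    return torch.tensor(list(reversed(returns)))

def reinforce_loss(log_probs, returns):
    """Negated objective: minimizing this ascends sum_t log pi(a_t|s_t) * G_t."""
    return -(log_probs * returns).sum()

rewards = [1.0, 1.0, 1.0]               # a hypothetical 3-step episode
returns = discounted_returns(rewards)
print(returns)  # G_0 = 1 + 0.99 + 0.99^2 = 2.9701, then 1.99, then 1.0
```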
Advantages:
- Works well with continuous action spaces
- Can learn stochastic policies
- Effective for high-dimensional problems
Challenges:
- High variance in gradient estimates
- Requires complete episodes
- Sample inefficient
Solutions Implemented:
- Baseline subtraction to reduce variance
- Reward normalization
- Adaptive learning rates
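Return normalization, one of the variance-reduction tricks listed above, amounts to subtracting the mean return (a simple baseline) and rescaling to unit standard deviation. A minimal sketch, with an assumed helper name:

```python
import torch

def normalize_returns(returns, eps=1e-8):
    """Center returns on their mean and scale to unit std to stabilize gradients."""
    return (returns - returns.mean()) / (returns.std() + eps)

returns = torch.tensor([5.0, 3.0, 1.0])  # hypothetical episode returns
norm = normalize_returns(returns)        # roughly [1.0, 0.0, -1.0]
```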
1. Agent-Environment Interaction
   - State observation
   - Action selection
   - Reward signals
   - State transitions
2. Exploration vs. Exploitation
   - ε-greedy strategy
   - Entropy-based exploration
   - Decaying exploration rates
3. Value Functions
   - State-value function V(s)
   - Action-value function Q(s,a)
   - Advantage function A(s,a)
4. Policy Optimization
   - Policy gradients
   - Actor-critic methods
   - On-policy vs. off-policy learning
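A decaying exploration rate, as mentioned under Exploration vs. Exploitation, is often implemented as an exponential decay clipped at a floor. A small sketch with illustrative hyperparameter values:

```python
def decayed_epsilon(episode, eps_start=1.0, eps_min=0.01, decay=0.995):
    """Exponentially shrink the exploration rate per episode, never below eps_min."""
    return max(eps_min, eps_start * decay ** episode)

eps_first = decayed_epsilon(0)     # 1.0: fully exploratory at the start
eps_late = decayed_epsilon(2000)   # clipped at the 0.01 floor late in training
```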
- Temporal Difference Learning - Bootstrapping updates
- Eligibility Traces - Credit assignment
- Function Approximation - Neural network values/policies
- Variance Reduction - Baselines, advantage estimates
- Reward Shaping - Engineering reward signals
- Convergence: Successfully learns optimal policy
- Stability: Stable Q-table after sufficient episodes
- Exploration: ε-greedy ensures thorough state coverage
- Training Episodes: Typically solves in 200-500 episodes
- Max Timesteps: Reaches the 500-timestep cap (CartPole-v1 environment maximum)
- Stability: Reliable convergence with proper hyperparameters
- Model Saved: Trained weights available for inference
- Policy Learning: Successfully optimizes stochastic policies
- Sample Efficiency: Improved with baseline subtraction
- Generalization: Adapts to various Gym environments
Through these projects, I have demonstrated expertise in:
1. RL Foundations
   - Markov Decision Processes (MDPs)
   - Bellman equations
   - Value iteration and policy iteration
2. Deep RL
   - Neural network function approximators
   - Policy gradient methods
   - Actor-critic architectures
3. Practical RL
   - Environment setup and interaction
   - Training loop implementation
   - Hyperparameter tuning
   - Performance evaluation and visualization
4. Advanced Topics
   - On-policy vs. off-policy methods
   - Variance reduction techniques
   - Exploration strategies
   - Continuous vs. discrete action spaces
Uzair Mubasher - BSAI Graduate
This project is licensed under the MIT License - see the LICENSE file for details.
- OpenAI Gym team for excellent RL environments
- PyTorch community for deep learning framework
- RL course instructors and resources
โญ If you found this repository helpful, please consider giving it a star!