
🎮 Reinforcement Learning Projects


A comprehensive collection of Reinforcement Learning projects demonstrating mastery of Q-Learning, Policy Gradients, and Actor-Critic methods using PyTorch and OpenAI Gym environments.




🚀 Projects Overview

| # | Project | Algorithm | Notebook | Environment |
|---|---------|-----------|----------|-------------|
| 1 | Q-Learning (Tabular) | Q-Table | 01_q_learning_tabular.ipynb | Discrete state space |
| 2 | Actor-Critic | A2C | 02_actor_critic_cartpole.ipynb | CartPole-v1 |
| 3 | REINFORCE | Policy Gradient | 03_reinforce_policy_gradient.ipynb | Various Gym envs |

🛠️ Technologies Used

Core Libraries

  • PyTorch - Deep RL neural networks
  • OpenAI Gym - RL environments
  • NumPy - Numerical computing
  • Matplotlib - Visualization

RL Techniques

  • Value-Based Methods - Q-Learning, DQN
  • Policy-Based Methods - REINFORCE, Policy Gradients
  • Actor-Critic Methods - A2C, A3C variants
  • Exploration Strategies - ε-greedy, entropy regularization

📦 Installation

Prerequisites

  • Python 3.8 or higher
  • GPU recommended (but not required)

Setup Instructions

  1. Clone the repository

    git clone https://github.com/uzi-gpu/reinforcement-learning.git
    cd reinforcement-learning
  2. Create a virtual environment

    python -m venv venv
    source venv/bin/activate  # On Windows: venv\Scripts\activate
  3. Install dependencies

    pip install -r requirements.txt
  4. Launch Jupyter Notebook

    jupyter notebook

📊 Project Details

1. 📊 Q-Learning (Tabular Method)

File: 01_q_learning_tabular.ipynb

Objective: Implement tabular Q-Learning from scratch for discrete state-action spaces

Algorithm: Q-Learning

  • Off-policy temporal difference learning
  • Updates Q-table using Bellman equation
  • ε-greedy exploration strategy

Key Concepts:

  • ✅ Q-Table initialization and updates
  • ✅ Exploration vs exploitation trade-off
  • ✅ Learning rate (α) and discount factor (γ)
  • ✅ Episode-based training
  • ✅ Convergence analysis
  • ✅ Performance visualization

Implementation Highlights:

Q(s,a) ← Q(s,a) + α[r + γ·max_a' Q(s',a') - Q(s,a)]

Use Cases:

  • Small discrete state spaces
  • Grid world problems
  • Simple game environments
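The update rule above can be sketched in a few lines of NumPy. The state/action counts and hyperparameters here are illustrative placeholders for a small grid world, not necessarily the values used in the notebook:

```python
import numpy as np

# Illustrative sizes and hyperparameters for a small grid world.
n_states, n_actions = 16, 4
alpha, gamma, epsilon = 0.1, 0.99, 0.1

Q = np.zeros((n_states, n_actions))  # Q-table initialized to zero

def choose_action(state, rng):
    """ε-greedy action selection."""
    if rng.random() < epsilon:
        return int(rng.integers(n_actions))   # explore
    return int(np.argmax(Q[state]))           # exploit

def update(state, action, reward, next_state):
    """One tabular Q-learning (off-policy TD) update."""
    td_target = reward + gamma * np.max(Q[next_state])
    Q[state, action] += alpha * (td_target - Q[state, action])

rng = np.random.default_rng(0)
update(state=0, action=1, reward=1.0, next_state=5)
```

Because the update bootstraps from max over next-state actions rather than the action actually taken, the method is off-policy: it learns the greedy policy while following the ε-greedy one.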

2. 🎯 Actor-Critic for CartPole

File: 02_actor_critic_cartpole.ipynb

Objective: Solve the CartPole balancing problem using Actor-Critic method

Environment: CartPole-v1

  • Goal: Balance pole on cart by moving left/right
  • State Space: 4 continuous values (position, velocity, angle, angular velocity)
  • Action Space: 2 discrete actions (left, right)
  • Reward: +1 for each timestep pole remains upright

Architecture:

Actor Network (Policy):

  • Input: State (4 dimensions)
  • Hidden: Fully connected layers with ReLU
  • Output: Action probabilities (softmax)

Critic Network (Value Function):

  • Input: State (4 dimensions)
  • Hidden: Fully connected layers with ReLU
  • Output: State value V(s)

Training Process:

  • ✅ Actor learns the policy π(a|s)
  • ✅ Critic estimates the value function V(s)
  • ✅ Advantage function: A(s,a) = r + γV(s') - V(s)
  • ✅ Policy gradient with a baseline for variance reduction
  • ✅ Simultaneous actor and critic updates

Key Features:

  • ✅ Continuous state space handling
  • ✅ On-policy learning
  • ✅ Variance reduction through baseline
  • ✅ Episode reward tracking
  • ✅ Training visualization
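A minimal PyTorch sketch of the two networks described above. The hidden-layer width of 128 is an assumption for illustration, not necessarily what the notebook uses:

```python
import torch
import torch.nn as nn

STATE_DIM, N_ACTIONS, HIDDEN = 4, 2, 128  # CartPole-v1 dimensions; hidden size assumed

class Actor(nn.Module):
    """Policy network: state -> action probabilities (softmax output)."""
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(STATE_DIM, HIDDEN), nn.ReLU(),
            nn.Linear(HIDDEN, N_ACTIONS), nn.Softmax(dim=-1),
        )
    def forward(self, state):
        return self.net(state)

class Critic(nn.Module):
    """Value network: state -> scalar estimate of V(s)."""
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(STATE_DIM, HIDDEN), nn.ReLU(),
            nn.Linear(HIDDEN, 1),
        )
    def forward(self, state):
        return self.net(state)

actor, critic = Actor(), Critic()
state = torch.zeros(1, STATE_DIM)      # dummy CartPole observation
probs, value = actor(state), critic(state)
```

During training, the one-step advantage r + γV(s') - V(s) scales the actor's log-probability loss and doubles as the TD error minimized by the critic.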

3. 🚀 REINFORCE Policy Gradient

File: 03_reinforce_policy_gradient.ipynb

Objective: Implement REINFORCE algorithm for policy optimization

Algorithm: REINFORCE (Monte Carlo Policy Gradient)

  • Pure policy-based method
  • No value function approximation
  • Learn policy parameters directly

Mathematical Foundation:

∇J(θ) = E[∑_t ∇_θ log π(a_t|s_t, θ) · G_t]

where G_t is the cumulative discounted return from timestep t.

Implementation:

  • ✅ Policy network with softmax output
  • ✅ Monte Carlo return estimation
  • ✅ Policy gradient calculation
  • ✅ Gradient ascent optimization
  • ✅ Baseline subtraction (optional)
  • ✅ Entropy regularization

Advantages:

  • Works well with continuous action spaces
  • Can learn stochastic policies
  • Effective for high-dimensional problems

Challenges:

  • High variance in gradient estimates
  • Requires complete episodes
  • Sample inefficient

Solutions Implemented:

  • Baseline subtraction to reduce variance
  • Reward normalization
  • Adaptive learning rates
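The pieces above (Monte Carlo returns, reward normalization as a variance-reduction trick, and the policy-gradient loss) can be sketched as follows; the discount factor and dummy episode are illustrative:

```python
import torch

gamma = 0.99  # illustrative discount factor

def discounted_returns(rewards):
    """Monte Carlo returns G_t = sum_k gamma^k * r_{t+k} over one episode."""
    returns, g = [], 0.0
    for r in reversed(rewards):
        g = r + gamma * g
        returns.append(g)
    returns = torch.tensor(list(reversed(returns)))
    # Normalize returns - one of the variance-reduction tricks listed above.
    return (returns - returns.mean()) / (returns.std() + 1e-8)

def reinforce_loss(log_probs, returns):
    """Gradient *ascent* on J(theta) == gradient descent on this loss."""
    return -(log_probs * returns).sum()

# Dummy 3-step episode with +1 reward per step
log_probs = torch.log(torch.tensor([0.5, 0.6, 0.7]))
loss = reinforce_loss(log_probs, discounted_returns([1.0, 1.0, 1.0]))
```

Since G_t requires the rewards of the whole remaining episode, updates can only happen after an episode terminates, which is exactly the "requires complete episodes" limitation noted above.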

📚 Key RL Concepts Demonstrated

Fundamental RL Components

  1. Agent-Environment Interaction

    • State observation
    • Action selection
    • Reward signals
    • State transitions
  2. Exploration vs Exploitation

    • ε-greedy strategy
    • Entropy-based exploration
    • Decaying exploration rates
  3. Value Functions

    • State-value function V(s)
    • Action-value function Q(s,a)
    • Advantage function A(s,a)
  4. Policy Optimization

    • Policy gradients
    • Actor-critic methods
    • On-policy vs off-policy learning
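A decaying ε-greedy schedule combining points 2 and 3 above might look like this; the start, floor, and decay constants are illustrative assumptions, not values taken from the notebooks:

```python
import random

# Illustrative schedule: start fully exploratory, decay toward a floor.
EPS_START, EPS_END, EPS_DECAY = 1.0, 0.05, 0.995

def epsilon_at(episode):
    """Exploration rate after `episode` decay steps."""
    return max(EPS_END, EPS_START * EPS_DECAY ** episode)

def select_action(q_values, episode, rng=random):
    """Random action with probability epsilon, else the greedy one."""
    if rng.random() < epsilon_at(episode):
        return rng.randrange(len(q_values))                         # explore
    return max(range(len(q_values)), key=q_values.__getitem__)      # exploit
```

Early episodes act almost uniformly at random to cover the state space; later episodes mostly exploit the learned Q-values while retaining a small residual exploration rate.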

Advanced Techniques

  • Temporal Difference Learning - Bootstrapping updates
  • Eligibility Traces - Credit assignment
  • Function Approximation - Neural network values/policies
  • Variance Reduction - Baselines, advantage estimates
  • Reward Shaping - Engineering reward signals

๐Ÿ† Results

Q-Learning Performance

  • Convergence: Successfully learns optimal policy
  • Stability: Stable Q-table after sufficient episodes
  • Exploration: ε-greedy ensures thorough state coverage

Actor-Critic on CartPole

  • Training Episodes: Typically solves in 200-500 episodes
  • Max Timesteps: Reaches the 500-timestep cap of CartPole-v1
  • Stability: Reliable convergence with proper hyperparameters
  • Model Saved: Trained weights available for inference

REINFORCE Algorithm

  • Policy Learning: Successfully optimizes stochastic policies
  • Sample Efficiency: Improved with baseline subtraction
  • Generalization: Adapts to various Gym environments

🎓 Learning Outcomes

Through these projects, I have demonstrated expertise in:

  1. RL Foundations

    • Markov Decision Processes (MDPs)
    • Bellman equations
    • Value iteration and policy iteration
  2. Deep RL

    • Neural network function approximators
    • Policy gradient methods
    • Actor-critic architectures
  3. Practical RL

    • Environment setup and interaction
    • Training loop implementation
    • Hyperparameter tuning
    • Performance evaluation and visualization
  4. Advanced Topics

    • On-policy vs off-policy methods
    • Variance reduction techniques
    • Exploration strategies
    • Continuous vs discrete action spaces

📧 Contact

Uzair Mubasher - BSAI Graduate

LinkedIn Email GitHub


📄 License

This project is licensed under the MIT License - see the LICENSE file for details.


๐Ÿ™ Acknowledgments

  • OpenAI Gym team for the excellent RL environments
  • PyTorch community for the deep learning framework
  • RL course instructors and resources

โญ If you found this repository helpful, please consider giving it a star!
