Exploration of Reinforcement Learning (RL) Algorithms
Project Overview
This project focuses on implementing and comparing reinforcement learning (RL) algorithms. Specifically, it explores solving multi-armed bandit problems using an epsilon-greedy algorithm and designing agents for episodic MDP tasks using Semi-gradient SARSA and Q-learning with function approximation.
Part 1: Multi-Armed Bandit Problem (Epsilon-Greedy Algorithm)
The multi-armed bandit problem is a classic RL challenge in which an agent must repeatedly choose among multiple options (arms) to maximize reward while balancing exploration and exploitation. The epsilon-greedy algorithm is employed here to find the optimal action.
Key Components
- Problem Setup:
- A k-armed bandit problem is modeled as an environment built on the OpenAI Gym framework.
- The goal is to maximize rewards by selecting the best arm, balancing exploration (trying different actions) and exploitation (choosing the known best action).
- Epsilon-Greedy Algorithm:
- With probability ε the agent selects a random arm (exploration); otherwise it selects the arm with the highest estimated value (exploitation). A minimal sketch is shown after the table below.
- Hyperparameter tuning: the epsilon parameter controls how much the agent explores and is varied to find the setting that yields the highest average reward.
Table: Exploration vs Exploitation in Epsilon-Greedy
| Parameter | Role | Impact on Performance |
|---|---|---|
| Epsilon (ε) | Exploration Rate | Higher values encourage exploration; lower values lead to exploitation. |
| Reward Function | Decision Metric | Measures the success of selected actions. |
| Action Selection | Optimal Choice | A balance between random and reward-maximizing actions. |
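The sketch below illustrates the epsilon-greedy loop described above. It is a minimal, self-contained example: the arm count, Gaussian reward distributions, and step count are assumptions made for illustration and do not come from the project's repository or its Gym environment.

```python
import numpy as np

def epsilon_greedy_bandit(true_means, epsilon=0.1, steps=1000, seed=0):
    """Epsilon-greedy on a k-armed bandit with (assumed) Gaussian rewards."""
    rng = np.random.default_rng(seed)
    k = len(true_means)
    q_estimates = np.zeros(k)   # estimated value of each arm
    counts = np.zeros(k)        # number of pulls per arm
    total_reward = 0.0

    for _ in range(steps):
        # Explore with probability epsilon, otherwise exploit the current best estimate.
        if rng.random() < epsilon:
            action = int(rng.integers(k))
        else:
            action = int(np.argmax(q_estimates))

        # Sample a reward from the chosen arm (unit-variance Gaussian assumed).
        reward = rng.normal(true_means[action], 1.0)
        total_reward += reward

        # Incremental sample-average update of the action-value estimate.
        counts[action] += 1
        q_estimates[action] += (reward - q_estimates[action]) / counts[action]

    return q_estimates, total_reward / steps

# Example: 5 arms, comparing a more exploratory and a more greedy setting.
arms = [0.2, 0.5, 0.8, 0.1, 0.3]
for eps in (0.01, 0.1):
    q, avg = epsilon_greedy_bandit(arms, epsilon=eps)
    print(f"epsilon={eps}: average reward {avg:.3f}")
```

Varying epsilon in this way is the hyperparameter tuning described above: small values commit to the current estimates almost immediately, while larger values keep sampling the other arms and produce better estimates at the cost of more suboptimal pulls.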
Part 2: Reinforcement Learning with Function Approximation (Semi-Grad SARSA & Q-Learning)
The second part of the project involves designing agents to complete an episodic MDP task. The agent navigates a maze and must learn to find a goal at the center by choosing one of four possible actions: moving left, right, up, or down.
Key Components
- Problem Setup:
- An agent is trained to navigate using an OpenAI Gym environment.
- The agent must learn to reach the goal as quickly as possible using function approximation.
- Algorithms:
- Semi-Gradient SARSA: An on-policy RL method in which the agent updates its approximate action-value function toward the value of the action it actually takes at the next step.
- Q-Learning: An off-policy RL method in which the agent updates toward the maximum estimated value over the next actions, i.e. the greedy target. (A minimal sketch of both update rules follows the comparison table below.)
Comparison of SARSA vs Q-Learning
| Aspect | Semi-Grad SARSA | Q-Learning |
|---|---|---|
| Policy | On-policy | Off-policy |
| Update Method | Updates based on actions taken | Updates based on best future action |
| Function Approximation | Linear | Linear |
| Exploration vs Exploitation | Balanced during action selection | Prioritizes optimal policy estimation |
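The following is a minimal sketch of the two update rules with a linear action-value approximation q(s, a) = w · x(s, a). The feature vectors, step size, and discount factor are placeholder assumptions for illustration; they are not taken from the project's environment or code.

```python
import numpy as np

def linear_q(w, x):
    """Approximate action value q(s, a) = w · x(s, a)."""
    return w @ x

def semi_gradient_sarsa_step(w, x_sa, reward, x_next_sa, done, alpha=0.1, gamma=0.99):
    """One semi-gradient SARSA update: the target uses the action actually taken next (on-policy)."""
    target = reward if done else reward + gamma * linear_q(w, x_next_sa)
    td_error = target - linear_q(w, x_sa)
    return w + alpha * td_error * x_sa  # gradient of w · x(s, a) w.r.t. w is x(s, a)

def q_learning_step(w, x_sa, reward, next_action_features, done, alpha=0.1, gamma=0.99):
    """One semi-gradient Q-learning update: the target uses the max over next actions (off-policy)."""
    if done:
        target = reward
    else:
        # next_action_features: one feature vector per available next action (left, right, up, down).
        target = reward + gamma * max(linear_q(w, x) for x in next_action_features)
    td_error = target - linear_q(w, x_sa)
    return w + alpha * td_error * x_sa

# Tiny usage example with 3 illustrative features per state-action pair.
rng = np.random.default_rng(0)
w = np.zeros(3)
x_sa, x_next = rng.random(3), rng.random(3)
w = semi_gradient_sarsa_step(w, x_sa, reward=1.0, x_next_sa=x_next, done=False)
```

The only difference between the two updates is the bootstrapped term in the target, which is exactly the on-policy vs off-policy distinction summarized in the table above.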
Hyperparameters and Tuning:
- Discount Factor (γ): Controls the trade-off between immediate and future rewards.
- Step Size (α): Governs the learning rate for adjusting the value estimates.
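As a concrete illustration of how these two hyperparameters enter the updates above, here is a single TD update with arbitrary, purely illustrative numbers:

```python
# Single Q-learning-style update with arbitrary illustrative numbers.
alpha, gamma = 0.1, 0.9                      # step size and discount factor
q_sa, reward, best_next = 2.0, 1.0, 3.0      # current estimate, reward, best next-state value
td_target = reward + gamma * best_next       # 1.0 + 0.9 * 3.0 = 3.7
q_sa = q_sa + alpha * (td_target - q_sa)     # 2.0 + 0.1 * 1.7 = 2.17
print(q_sa)
```

A larger γ makes the target lean more heavily on the estimated future value; a larger α moves the estimate toward that target more aggressively at each step.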
Resources
- Code Repository:
GitHub - Epsilon-Greedy Algorithm
GitHub - Semi-Gradient SARSA and Q-Learning Algorithms
- Course: Advanced Machine Learning