https://github.com/aaronjs99/intelligent-agents
Comparative analysis of Markov decision processes & intelligent agents
- Host: GitHub
- URL: https://github.com/aaronjs99/intelligent-agents
- Owner: aaronjs99
- License: mit
- Created: 2020-10-21T13:13:28.000Z (over 5 years ago)
- Default Branch: main
- Last Pushed: 2025-05-16T00:05:23.000Z (about 1 year ago)
- Last Synced: 2026-03-06T00:39:30.912Z (3 months ago)
- Topics: bandit-algorithms, linear-programming, mdp, policy-iteration, reinforcement-learning, value-iteration
- Language: Python
- Size: 1.68 MB
- Stars: 3
- Watchers: 3
- Forks: 1
- Open Issues: 0
Metadata Files:
- Readme: README.md
- License: LICENSE
# Intelligent Agents
A collection of reinforcement learning and intelligent agents projects showcasing implementations of key algorithms and their comparative analysis. These projects were developed as part of the *CS747: Foundations of Intelligent and Learning Agents* course at IIT Bombay.
## Projects Included
### Bandits (`src/bandits`)
Implements and compares the classical multi-armed bandit algorithms **ε-greedy**, **UCB**, **KL-UCB**, and **Thompson Sampling**, along with a custom variant, **Thompson Sampling with a hint**. Each algorithm is evaluated by its cumulative regret across different horizons and random seeds.
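A minimal sketch of three of these selection rules on simulated Bernoulli arms, reporting realized regret (horizon × best mean − total reward). The shared `(counts, sums, t, rng)` signature, the UCB1 bonus `sqrt(2 ln t / n)` and the default `ε = 0.02` are assumptions for illustration, not the repository's code.

```python
import math
import random


def first_unpulled(counts):
    """Index of an arm that has never been pulled, or None if all have been."""
    for arm, n in enumerate(counts):
        if n == 0:
            return arm
    return None


def eps_greedy(counts, sums, t, rng, eps=0.02):
    """Explore uniformly with probability eps, otherwise exploit the best empirical mean."""
    arm = first_unpulled(counts)
    if arm is not None:
        return arm
    if rng.random() < eps:
        return rng.randrange(len(counts))
    return max(range(len(counts)), key=lambda a: sums[a] / counts[a])


def ucb(counts, sums, t, rng):
    """UCB1 index (assumed form): empirical mean plus sqrt(2 ln t / n_a)."""
    arm = first_unpulled(counts)
    if arm is not None:
        return arm
    return max(range(len(counts)),
               key=lambda a: sums[a] / counts[a] + math.sqrt(2 * math.log(t) / counts[a]))


def thompson(counts, sums, t, rng):
    """Draw from each arm's Beta(successes + 1, failures + 1) posterior; pick the largest draw."""
    draws = [rng.betavariate(sums[a] + 1, counts[a] - sums[a] + 1) for a in range(len(counts))]
    return max(range(len(counts)), key=lambda a: draws[a])


def run(policy, true_means, horizon, seed):
    """Simulate Bernoulli arms and return realized regret: horizon * best mean - total reward."""
    rng = random.Random(seed)
    counts, sums = [0] * len(true_means), [0.0] * len(true_means)
    for t in range(1, horizon + 1):
        arm = policy(counts, sums, t, rng)
        reward = 1.0 if rng.random() < true_means[arm] else 0.0
        counts[arm] += 1
        sums[arm] += reward
    return horizon * max(true_means) - sum(sums)


if __name__ == "__main__":
    for name, policy in [("eps-greedy", eps_greedy), ("ucb", ucb), ("thompson", thompson)]:
        print(name, run(policy, [0.3, 0.5, 0.7], horizon=1000, seed=42))
```

Averaging `run` over several seeds and horizons gives regret curves of the kind the project plots.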
**Key Findings:**
- Thompson Sampling generally outperforms others in regret minimization.
- KL-UCB improves over UCB by using a tighter, KL-divergence-based confidence bound computed via binary search.
- ε-greedy performs best at `ε ≈ 0.02`, striking a balance between exploration and exploitation.
- The "hinted" Thompson Sampling variant leverages knowledge of the true means, improving early performance through a custom Beta-distribution-based selector.
Includes regret plots over multiple seeds and horizons, as well as parameter studies.
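The KL-UCB index mentioned above can be computed by bisection, because the Bernoulli KL divergence `KL(p̂, q)` increases in `q` on `[p̂, 1]`. A sketch, assuming the common exploration budget `ln t + c ln ln t` with `c = 3`; the repository may use a different `c` or stopping rule, and the function names are illustrative.

```python
import math


def bernoulli_kl(p, q, eps=1e-12):
    """KL(Bernoulli(p) || Bernoulli(q)), with both arguments clamped away from 0 and 1."""
    p = min(max(p, eps), 1 - eps)
    q = min(max(q, eps), 1 - eps)
    return p * math.log(p / q) + (1 - p) * math.log((1 - p) / (1 - q))


def kl_ucb_index(mean, pulls, t, c=3.0, iters=40):
    """Largest q in [mean, 1] with pulls * KL(mean, q) <= ln t + c * ln ln t.

    KL(mean, q) increases with q on [mean, 1], so bisection finds the boundary.
    """
    log_t = math.log(t)
    # ln ln t is only positive for t > e, so the correction term is dropped before that.
    budget = log_t + (c * math.log(log_t) if log_t > 1 else 0.0)
    target = budget / pulls
    lo, hi = mean, 1.0
    for _ in range(iters):
        mid = (lo + hi) / 2
        if bernoulli_kl(mean, mid) <= target:
            lo = mid  # mid is still inside the confidence region
        else:
            hi = mid
    return lo
```

By Pinsker's inequality, `KL(p, q) ≥ 2(q − p)²`, so with `c = 0` this index never exceeds `p + sqrt(ln t / (2n))`, which is below UCB1's `p + sqrt(2 ln t / n)`.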
### MDP Maze Solver (`src/mdp`)
Solves mazes by modeling them as Markov Decision Processes using:
- **Value Iteration**
- **Linear Programming** (via PuLP)
- **Howard’s Policy Iteration**
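All three solvers target the same Bellman optimality condition. In standard notation (assumed here rather than taken from the repository), with transition probabilities `T(s, a, s')`, rewards `R(s, a, s')` and discount `γ`, value iteration repeats the first update, the LP encodes the condition as linear constraints whose minimal solution is the optimal value function, and Howard's policy iteration alternates exact evaluation with switching every improvable state at once:

```latex
% Value iteration: repeat until V_k stops changing
V_{k+1}(s) = \max_{a} \sum_{s'} T(s,a,s') \left[ R(s,a,s') + \gamma V_k(s') \right]

% Linear program: its optimal solution is V^*
\min_{V} \sum_{s} V(s)
\quad \text{s.t.} \quad
V(s) \ge \sum_{s'} T(s,a,s') \left[ R(s,a,s') + \gamma V(s') \right] \quad \forall s, a

% Howard's policy iteration: evaluate \pi exactly, then improve every state at once
V^{\pi}(s) = \sum_{s'} T(s,\pi(s),s') \left[ R(s,\pi(s),s') + \gamma V^{\pi}(s') \right]
\pi'(s) = \arg\max_{a} \sum_{s'} T(s,a,s') \left[ R(s,a,s') + \gamma V^{\pi}(s') \right]
```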
**Pipeline:**
1. `encoder.py` transforms grid mazes into MDPs.
2. `solver.py` computes the optimal policy.
3. `decoder.py` reconstructs the shortest path using the policy.
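A minimal end-to-end sketch of the encode, solve, decode pipeline, assuming hypothetical cell codes (0 = free, 1 = wall, 2 = start, 3 = end), four moves named `N`/`E`/`S`/`W`, a reward of -1 per move, and a space-separated path string. The function names are illustrative, and the `solve` helper is only a stand-in for `solver.py`; the repository's actual MDP file format and path format may differ.

```python
ACTIONS = {"N": (-1, 0), "E": (0, 1), "S": (1, 0), "W": (0, -1)}


def encode(grid):
    """Turn a maze into a deterministic MDP.

    Hypothetical cell codes: 0 = free, 1 = wall, 2 = start, 3 = end.
    Returns (states, transitions, start, ends), where states maps (row, col) -> id
    and transitions[s][action] = (next_state, reward).
    """
    states = {}
    for r, row in enumerate(grid):
        for c, cell in enumerate(row):
            if cell != 1:
                states[(r, c)] = len(states)
    start = next(s for (r, c), s in states.items() if grid[r][c] == 2)
    ends = {s for (r, c), s in states.items() if grid[r][c] == 3}
    transitions = {}
    for (r, c), s in states.items():
        transitions[s] = {}
        for a, (dr, dc) in ACTIONS.items():
            nxt = states.get((r + dr, c + dc), s)  # walls and edges: stay in place
            transitions[s][a] = (nxt, -1.0)        # each move costs one step
    return states, transitions, start, ends


def solve(transitions, ends):
    """Stand-in for solver.py: undiscounted value iteration, then a greedy policy."""
    V = {s: 0.0 for s in transitions}
    for _ in range(len(transitions)):  # |S| sweeps give exact shortest-path values
        for s in transitions:
            if s not in ends:
                V[s] = max(rew + V[nxt] for nxt, rew in transitions[s].values())
    return {s: max(acts, key=lambda a: acts[a][1] + V[acts[a][0]])
            for s, acts in transitions.items() if s not in ends}


def decode(transitions, policy, start, ends, max_steps=10_000):
    """Follow the policy from the start state and emit the moves taken."""
    moves, s = [], start
    while s not in ends and len(moves) < max_steps:
        a = policy[s]
        moves.append(a)
        s = transitions[s][a][0]
    return " ".join(moves)
```

Because every move costs the same, maximizing return is equivalent to minimizing the number of steps, which is why following the greedy policy traces a shortest path.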
**Insights:**
- LP is consistently the fastest method on large mazes.
- Howard's Policy Iteration performs well on small problems but becomes costly as maze complexity grows.
- Visual comparisons confirm that solved mazes follow intuitive paths with minimal steps.
Runtime benchmarks for each method and visualizations of grid navigation are included.
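To make the comparison concrete, here is a pure-Python sketch of value iteration and Howard's policy iteration on a small tabular MDP. The nested-list layout `mdp[s][a] = [(probability, next_state, reward), ...]`, the tolerances and the function names are assumptions for illustration; the LP method is left out because it requires PuLP.

```python
def q_value(mdp, V, s, a, gamma):
    """Expected return of taking action a in state s and then following V."""
    return sum(p * (r + gamma * V[nxt]) for p, nxt, r in mdp[s][a])


def greedy(mdp, V, gamma):
    """Greedy policy with respect to a value function."""
    return [max(range(len(mdp[s])), key=lambda a: q_value(mdp, V, s, a, gamma))
            for s in range(len(mdp))]


def value_iteration(mdp, gamma, tol=1e-10):
    """In-place value iteration until the largest update falls below tol."""
    V = [0.0] * len(mdp)
    while True:
        delta = 0.0
        for s in range(len(mdp)):
            best = max(q_value(mdp, V, s, a, gamma) for a in range(len(mdp[s])))
            delta = max(delta, abs(best - V[s]))
            V[s] = best
        if delta < tol:
            return V, greedy(mdp, V, gamma)


def evaluate(mdp, policy, gamma):
    """Exact policy evaluation: solve (I - gamma * P_pi) V = R_pi by Gaussian elimination."""
    n = len(mdp)
    A = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    b = [0.0] * n
    for s in range(n):
        for p, nxt, r in mdp[s][policy[s]]:
            A[s][nxt] -= gamma * p
            b[s] += p * r
    for col in range(n):  # forward elimination with partial pivoting
        piv = max(range(col, n), key=lambda row: abs(A[row][col]))
        A[col], A[piv] = A[piv], A[col]
        b[col], b[piv] = b[piv], b[col]
        for row in range(col + 1, n):
            f = A[row][col] / A[col][col]
            for k in range(col, n):
                A[row][k] -= f * A[col][k]
            b[row] -= f * b[col]
    V = [0.0] * n
    for row in reversed(range(n)):  # back substitution
        V[row] = (b[row] - sum(A[row][k] * V[k] for k in range(row + 1, n))) / A[row][row]
    return V


def howard_policy_iteration(mdp, gamma, tol=1e-9):
    """Evaluate the policy exactly, then switch every state that has a strictly better action."""
    policy = [0] * len(mdp)
    while True:
        V = evaluate(mdp, policy, gamma)
        improved = False
        for s in range(len(mdp)):
            best = max(range(len(mdp[s])), key=lambda a: q_value(mdp, V, s, a, gamma))
            if q_value(mdp, V, s, best, gamma) > q_value(mdp, V, s, policy[s], gamma) + tol:
                policy[s] = best
                improved = True
        if not improved:
            return V, policy


if __name__ == "__main__":
    mdp = [
        [[(1.0, 0, 0.0)], [(1.0, 1, 1.0)]],  # state 0: stay (reward 0) or move to 1 (reward 1)
        [[(1.0, 1, 2.0)], [(1.0, 0, 0.0)]],  # state 1: stay (reward 2) or move to 0 (reward 0)
    ]
    print(value_iteration(mdp, 0.9))
    print(howard_policy_iteration(mdp, 0.9))
```

Here each policy-evaluation step solves a dense linear system, whose cost grows roughly cubically with the number of states; that is one plausible contributor to Howard's method becoming costly on larger mazes, though the repository's own benchmarks remain the reference for its actual runtimes.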
### Windy Gridworld (`src/windy_gridworld`)
Adapts the Windy Gridworld problem from Sutton & Barto and tackles it with multiple RL approaches:
- **Sarsa** (normal and King’s moves)
- **Sarsa with stochastic wind**
- **Q-Learning**
- **Expected Sarsa**
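The three agents share the update `Q(s, a) ← Q(s, a) + α [target − Q(s, a)]` and differ only in the bootstrapped target. A minimal sketch, assuming a tabular `Q` that maps each state to a list of action values; the function and method names are hypothetical, and `α`, `γ`, `ε` presumably correspond to the `--learning-rate`, `--discount` and `--epsilon` flags of the `windy` command.

```python
import random


def epsilon_greedy(Q, s, eps, rng):
    """Random action with probability eps, otherwise the greedy action for state s."""
    n = len(Q[s])
    if rng.random() < eps:
        return rng.randrange(n)
    return max(range(n), key=lambda a: Q[s][a])


def td_target(method, Q, r, s_next, a_next, gamma, eps, done):
    """Bootstrapped target for the three one-step TD control methods."""
    if done:
        return r  # terminal transition: nothing to bootstrap from
    q_next = Q[s_next]
    if method == "sarsa":           # on-policy: value of the action actually chosen next
        return r + gamma * q_next[a_next]
    if method == "q_learning":      # off-policy: value of the greedy next action
        return r + gamma * max(q_next)
    if method == "expected_sarsa":  # expectation under the eps-greedy policy
        n = len(q_next)
        best = max(range(n), key=lambda a: q_next[a])
        probs = [eps / n + (1 - eps if a == best else 0.0) for a in range(n)]
        return r + gamma * sum(p * q for p, q in zip(probs, q_next))
    raise ValueError(f"unknown method: {method}")


def update(Q, s, a, target, alpha):
    """Move Q(s, a) a fraction alpha of the way toward the target."""
    Q[s][a] += alpha * (target - Q[s][a])
```

Expected Sarsa averages over the next action instead of sampling it, which removes one source of variance from the target; this is a common explanation for its steadier learning curves.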
**Key Results:**
- Sarsa with King’s moves converges fastest due to shorter episodes.
- Q-Learning and Expected Sarsa outperform standard Sarsa in stability and convergence.
- The stochastic wind variant adds realistic randomness but slows convergence.
- Paths from all agents are visualized for both deterministic and windy environments.
The gridworld is modeled as an episodic MDP with reward shaping, and convergence is plotted step by step.
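A sketch of the environment dynamics, assuming the layout of the textbook's Windy Gridworld example (a 7 × 10 grid, column winds 0 0 0 1 1 1 2 2 1 0 pushing upward, start at row 3, column 0, goal at row 3, column 7) and its constant reward of -1 per step. The repository's reward shaping is not reproduced here, and the constant and function names are illustrative. King's moves add the four diagonals, and the stochastic variant shifts a nonzero wind by -1, 0 or +1 with equal probability.

```python
import random

ROWS, COLS = 7, 10
WIND = [0, 0, 0, 1, 1, 1, 2, 2, 1, 0]  # upward push per column, textbook layout (assumed)
START, GOAL = (3, 0), (3, 7)
STANDARD_MOVES = [(-1, 0), (1, 0), (0, -1), (0, 1)]                   # up, down, left, right
KINGS_MOVES = STANDARD_MOVES + [(-1, -1), (-1, 1), (1, -1), (1, 1)]  # plus diagonals


def step(state, move, stochastic=False, rng=random):
    """Apply a move plus the wind of the current column; returns (next_state, reward, done)."""
    row, col = state
    wind = WIND[col]
    if stochastic and wind:
        wind += rng.choice((-1, 0, 1))  # stochastic-wind variant: only where wind blows
    dr, dc = move
    new_row = min(max(row + dr - wind, 0), ROWS - 1)  # wind pushes toward row 0 (up)
    new_col = min(max(col + dc, 0), COLS - 1)         # moves off the edge are clipped
    nxt = (new_row, new_col)
    return nxt, -1, nxt == GOAL
```

Shorter paths are possible with `KINGS_MOVES` because each step can change both row and column, which matches the observation above that King's moves shorten episodes.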
## 🔧 Running Experiments with `run.py`
Use the `run.py` script to run all experiments. It acts as a unified launcher for bandits, MDP solving, verification, visualization, and Windy Gridworld tasks.
Enable `--verbose` to view subprocess outputs and logs in real time.
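A hypothetical sketch of how such a launcher can be structured with `argparse` subcommands, using only flags that appear in the examples below (the real `run.py` may differ, and its dispatch to subprocesses is omitted). With this layout, a global option such as `--verbose` has to precede the subcommand name, which is consistent with the examples; `argparse` also exposes `--learning-rate` as `args.learning_rate`.

```python
import argparse


def build_parser():
    """Hypothetical parser mirroring a few of the documented run.py subcommands and flags."""
    parser = argparse.ArgumentParser(prog="run.py")
    # Global flag: parsed by the top-level parser, so it goes before the subcommand.
    parser.add_argument("--verbose", action="store_true", help="stream subprocess output")
    sub = parser.add_subparsers(dest="command", required=True)

    bandits = sub.add_parser("bandits", help="multi-armed bandit experiments")
    bandits.add_argument("--instance", required=True)
    bandits.add_argument("--algorithm", required=True)
    bandits.add_argument("--rseed", type=int)
    bandits.add_argument("--epsilon", type=float)
    bandits.add_argument("--horizon", type=int)

    solve = sub.add_parser("solve_mdp", help="solve a maze-based MDP")
    solve.add_argument("--grid", required=True)
    solve.add_argument("--algorithm", choices=["vi", "pi", "lp"], required=True)

    windy = sub.add_parser("windy", help="Windy Gridworld agents")
    windy.add_argument("--episodes", type=int)
    windy.add_argument("--epsilon", type=float)
    windy.add_argument("--discount", type=float)
    windy.add_argument("--learning-rate", type=float)  # stored as args.learning_rate
    return parser


if __name__ == "__main__":
    print(build_parser().parse_args())
```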
### Bandits
```bash
python run.py --verbose bandits \
--instance data/bandits/instances/i-1.txt \
--algorithm thompson-sampling \
--rseed 42 \
--epsilon 0.1 \
--horizon 1000
```
### MDP Maze Solver
```bash
python run.py --verbose solve_mdp \
--grid data/mdp/grids/grid10.txt \
--algorithm pi
```
To create a synthetic MDP file:
```bash
python run.py --verbose generate_mdp \
--num_states 10 \
--num_actions 5 \
--gamma 0.95 \
--mdptype episodic \
--rseed 42 \
--output_file src/mdp/tmp/generated_mdp.txt
```
To verify all default grids (10 through 100):
```bash
python run.py --verbose verify_mdp --algorithm vi
```
To verify specific grids:
```bash
python run.py --verbose verify_mdp \
--algorithm lp \
--grid data/mdp/grids/grid40.txt data/mdp/grids/grid50.txt
```
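One way such a verification could work is to compare the decoded path with a breadth-first-search shortest path. This sketch assumes hypothetical cell codes (0 = free, 1 = wall, 2 = start, 3 = end) and a space-separated `N`/`E`/`S`/`W` path format; it is an illustration, not the repository's verifier.

```python
from collections import deque

MOVES = {"N": (-1, 0), "E": (0, 1), "S": (1, 0), "W": (0, -1)}


def find_start(grid):
    """Coordinates of the start cell (hypothetical code 2)."""
    return next((r, c) for r, row in enumerate(grid) for c, cell in enumerate(row) if cell == 2)


def open_cell(grid, r, c):
    """True if (r, c) lies inside the grid and is not a wall (hypothetical code 1)."""
    return 0 <= r < len(grid) and 0 <= c < len(grid[0]) and grid[r][c] != 1


def shortest_length(grid):
    """Breadth-first search distance from the start to the nearest end cell (code 3)."""
    start = find_start(grid)
    dist = {start: 0}
    queue = deque([start])
    while queue:
        r, c = queue.popleft()
        if grid[r][c] == 3:
            return dist[(r, c)]
        for dr, dc in MOVES.values():
            nxt = (r + dr, c + dc)
            if open_cell(grid, *nxt) and nxt not in dist:
                dist[nxt] = dist[(r, c)] + 1
                queue.append(nxt)
    return None  # no end cell is reachable


def is_optimal(grid, path):
    """Check that a path stays off walls, finishes on an end cell, and has BFS length."""
    r, c = find_start(grid)
    moves = path.split()
    for m in moves:
        dr, dc = MOVES[m]
        r, c = r + dr, c + dc
        if not open_cell(grid, r, c):
            return False
    return grid[r][c] == 3 and len(moves) == shortest_length(grid)
```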
To visualize a grid:
```bash
python run.py --verbose visualize_mdp \
--grid_file data/mdp/grids/grid10.txt \
--output_file plots/mdp/grid10_unsolved.png
```
To visualize a solved grid:
```bash
python run.py --verbose visualize_mdp \
--grid_file data/mdp/grids/grid10.txt \
--path_file data/mdp/paths/path10.txt \
--output_file plots/mdp/grid10_solved.png
```
### Windy Gridworld
```bash
python run.py --verbose windy \
--episodes 200 \
--epsilon 0.15 \
--discount 0.99 \
--learning-rate 0.5
```
## Command Reference (Summary)
| Command | Description |
|------------------|------------------------------------------|
| `bandits` | Run multi-armed bandit experiments |
| `windy` | Run Windy Gridworld RL agents |
| `generate_mdp` | Generate synthetic MDP instance files |
| `solve_mdp` | Solve a maze-based MDP using vi/pi/lp |
| `verify_mdp` | Verify path optimality for maze solvers |
| `visualize_mdp` | Create visual output of MDP grid/paths |
## References
- [`./references/mdp_references.txt`](./references/mdp_references.txt)
- [`./references/bandits_references.txt`](./references/bandits_references.txt)
- [`./references/windy_gridworld_references.txt`](./references/windy_gridworld_references.txt)
## License
This project is licensed under the MIT License. See the [LICENSE](LICENSE) file for details.