# 35.-Star-Wars-Reinforcement-Learning ⭐🔫

https://github.com/user-attachments/assets/26c88294-88b0-424d-bc6f-2ba9be1ee8b5

A Machine Learning enthusiast who is also a Star Wars nerd - hence this project.


## Table of Contents
1. [What is Star Wars Reinforcement Learning?](#what)
2. [Storyline](#storyline)
3. [How to Run?](#run)
4. [Development Process](#development)
    1. [Phase 1 - Explore Planets](#phase1)
    2. [Phase 2 - War](#phase2)
    3. [Phase 3 - Hostile Planets and Hive Mind](#phase3)
    4. [Phase 4 - Multi-drones](#phase4)
    5. [Phase 5 - Bounty Hunter](#phase5)
5. [Results](#results)
6. [Future Directions](#future)
7. [Limitations of the project](#limitations)
8. [Spark any ideas?](#ideas)


## What is Star Wars Reinforcement Learning?

This project finds the best policy for a Reinforcement Learning (RL) agent in a series of different RL Environments, with the ultimate objective of exploring as many planets as possible.

This project is less about the RL algorithms themselves and more about learning how to design good RL Environments (i.e. the design of reward functions, the MDP, etc.). The RL Environments in this project are relatively simple, in a grid-world style.

The RL Environments are all custom-made [Gymnasium](https://gymnasium.farama.org/) RL Environments that I designed as Partially Observable Markov Decision Processes (POMDPs), each with its own state space, observation space and reward function (their action spaces are the same). I used [Stable Baselines 3](https://stable-baselines.readthedocs.io/en/master/index.html)'s Proximal Policy Optimisation (PPO) as the RL algorithm for all the RL Environments.
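To make the setup concrete, here is a minimal sketch of a custom grid-world Gymnasium environment trained with Stable Baselines 3's PPO. This is not the project's actual code: the class name, grid size, observation layout and reward values are all hypothetical.

```python
# Minimal sketch: a toy grid-world Gymnasium environment plus SB3 PPO training.
# Everything here (class name, sizes, rewards) is illustrative, not the real project code.
import numpy as np
import gymnasium as gym
from gymnasium import spaces
from stable_baselines3 import PPO


class ExplorerEnv(gym.Env):
    """Toy grid world: the agent moves around and is rewarded for reaching planets."""

    def __init__(self, grid_size=10, n_planets=5):
        super().__init__()
        self.grid_size = grid_size
        self.n_planets = n_planets
        self.action_space = spaces.Discrete(4)  # up, down, left, right
        # Observation: normalised agent (x, y) plus a "visited planets" mask
        self.observation_space = spaces.Box(low=0.0, high=1.0,
                                            shape=(2 + n_planets,), dtype=np.float32)

    def reset(self, seed=None, options=None):
        super().reset(seed=seed)
        self.agent = self.np_random.integers(0, self.grid_size, size=2)
        self.planets = self.np_random.integers(0, self.grid_size, size=(self.n_planets, 2))
        self.visited = np.zeros(self.n_planets, dtype=np.float32)
        self.steps = 0
        return self._obs(), {}

    def step(self, action):
        moves = np.array([[0, 1], [0, -1], [-1, 0], [1, 0]])
        self.agent = np.clip(self.agent + moves[action], 0, self.grid_size - 1)
        self.steps += 1
        reward = 0.0
        for i, planet in enumerate(self.planets):
            if not self.visited[i] and np.array_equal(self.agent, planet):
                self.visited[i] = 1.0
                reward += 10.0  # reward for discovering a new planet
        terminated = bool(self.visited.all())
        truncated = self.steps >= 200  # the drone's "fuel" limit
        return self._obs(), reward, terminated, truncated, {}

    def _obs(self):
        return np.concatenate([self.agent / self.grid_size, self.visited]).astype(np.float32)


model = PPO("MlpPolicy", ExplorerEnv(), verbose=0)
model.learn(total_timesteps=10_000)  # the actual phases were trained for 500k / 1M iterations
```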

**Technology stack used:**
- Farama Foundation's [Gymnasium](https://gymnasium.farama.org/) (the successor to [OpenAI Gym](https://www.gymlibrary.dev/)), used to create the RL Environments
- [Stable Baselines 3](https://stable-baselines.readthedocs.io/en/master/index.html), used for quick implementation of DRL algorithms
- Python


## Storyline (for all of y'all fellow nerds)

*Generated by ChatGPT*


You're a neutral planet explorer from Coruscant, looking to discover planets to work with in the unexplored Outer Rim by sending out drones (the RL agents). You seek the best way to explore as many planets as possible with a restricted amount of fuel in each drone, to maximise efficiency.

Along the way, your drones encounter various obstacles and conflicts in the Outer Rim, but like any resilient explorer, you are determined to overcome them to achieve your objectives.


## How to Run?
Each scenario is created as a separate RL Environment, which I term a 'phase'. Each phase is stored in its respective folder. Play with the 'main.ipynb' file in each folder to try out the RL Environments!


## Development Process
**I spent weeks watching pixels move around - honestly not good for my sanity.**

Generally, to get the RL agents to learn the best policy in each phase, I played with:
- Reward function tuning (adding new types of rewards and adjusting their magnitudes)
- Vectorising the RL Environment and normalising rewards (see the sketch after this list)
- RL algorithm hyperparameter tuning
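For the vectorisation and reward normalisation point, this is roughly how it is done with Stable Baselines 3. The sketch below reuses the hypothetical `ExplorerEnv` from the earlier sketch; the number of parallel environments and the hyperparameters are illustrative, not the project's actual settings.

```python
# Sketch: vectorise the environment and normalise rewards before PPO training.
from stable_baselines3 import PPO
from stable_baselines3.common.vec_env import DummyVecEnv, VecNormalize

vec_env = DummyVecEnv([lambda: ExplorerEnv() for _ in range(4)])  # 4 parallel copies
# Keep a running mean/variance of observations and rewards and normalise both
vec_env = VecNormalize(vec_env, norm_obs=True, norm_reward=True, clip_reward=10.0)

model = PPO("MlpPolicy", vec_env, learning_rate=3e-4, n_steps=2048, verbose=0)
model.learn(total_timesteps=500_000)
vec_env.save("vecnormalize.pkl")  # save the normalisation statistics for evaluation
```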

For each phase, I trained one PPO DRL model for 500k iterations and one for 1M iterations. Try them out! (Each 500k iterations takes me about 30 minutes to train, since I'm running all this on my local laptop 😢)

In this section I give a rough explanation of my thought process in developing each phase; for more details (e.g. about the MDP), see each phase's 'main.ipynb'.


### Phase 1 - Explore Planets
*The Baseline RL Environment.*

I started simple: randomly generated planets in a fixed grid world, 4 discrete actions for the RL agent, and a reward function that only rewarded the RL agent when it visited a planet.

I quickly realised this was too difficult, as it took an extremely long time to train. Hence, I gave the RL agent 'vision' and 'memory' to help it learn faster.

I also did some additional fine-tuning of the reward function, since I noticed the agent kept getting stuck in corners or staying at a fixed location.
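As an illustration of this kind of reward shaping, here is a hedged sketch of what the Phase 1 reward logic could look like. The helper name, reward magnitudes and the assumption that positions are NumPy arrays are mine, not the project's actual code.

```python
# Hypothetical Phase 1-style reward shaping: reward planet visits, and penalise
# standing still so the agent doesn't park itself in a corner.
def compute_reward(agent_pos, prev_pos, planets, visited):
    reward = 0.0
    for i, planet in enumerate(planets):
        if not visited[i] and (agent_pos == planet).all():
            visited[i] = True
            reward += 10.0   # main objective: discover a new planet
    if (agent_pos == prev_pos).all():
        reward -= 0.5        # discourage getting stuck against walls/corners
    reward -= 0.01           # small per-step cost, since fuel is limited
    return reward
```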


### Phase 2 - War
*There's a war going on now.*

I started simple with a prey-predator situation, with the Separatist ships as the 'predators' and the Republic ships as the 'prey'. They have their own vision: initially a Separatist ship will chase either the RL agent or a Republic ship in its view, while a Republic ship will flee when a Separatist ship is in its view. I later introduced more stochasticity into the Separatist and Republic ships' behaviour (see the 'npc_ships.py' file).

Initially they moved at the same speed as the RL agent, but I quickly realised it took too long for the RL agent to learn since the task was too difficult, hence I halved their speeds.

I started with a simple reward function: if a Separatist ship is adjacent to the RL agent, the RL agent is penalised (takes damage). I later improved it, as I realised the reward signal was too sparse. To make the reward signals denser, I introduced the idea of planets emitting a positive reward halo (across the map, decaying with distance) once they enter the vision of the RL agent, and of visited planets emitting a negative reward halo (also across the map, decaying with distance) to discourage the RL agent from camping.
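To illustrate the halo idea, here is a hedged sketch of a distance-decaying reward halo. The decay shape (1 / (1 + distance)), the magnitudes and the assumption that positions are NumPy arrays are my own illustration, not the project's exact reward function.

```python
import numpy as np

# Hypothetical reward halo: discovered-but-unvisited planets attract the agent
# with a positive halo, while visited planets repel it with a negative one.
def halo_reward(agent_pos, discovered_planets, visited_planets, strength=1.0):
    reward = 0.0
    for planet in discovered_planets:               # seen by the agent, not yet visited
        dist = np.linalg.norm(agent_pos - planet)
        reward += strength / (1.0 + dist)           # positive halo, decays with distance
    for planet in visited_planets:                  # already visited (or known hostile)
        dist = np.linalg.norm(agent_pos - planet)
        reward -= 0.5 * strength / (1.0 + dist)     # negative halo discourages camping
    return reward
```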

I introduced Republic ships in the hope of encouraging emergent behaviour (not reward-driven, since I don't reward the RL agent for being near a Republic ship; I merely allowed it to 'see' them in its vision), where the RL agent learns more complex policies like 'kiting', i.e. it learns to camp near Republic ships so that they take aggro from incoming Separatist ships, maximising its rewards whenever it 'sees' a Republic ship. (Unfortunately I was not able to show this; it likely requires many more iterations to learn.)


### Phase 3 - Hostile Planets and Hive Mind
*Adds even more chaos (hostile planets) to the war. The Hive Mind feature involves implementing Meta-Episodes in the RL Environment.*

I started by randomly turning a portion of the planets into hostile planets, which emit a kill radius that destroys any ship entering it. The RL agent does not know which planets are hostile nor how big the kill radii are.

To help them, I introduced the idea of a 'Hive Mind': the initial RL agents randomly explore until they get killed, and the 'Hive Mind' remembers the death locations, visited planets, and areas of the map seen by previous RL agents. Within sub-episodes, the map is not reset (except for the Separatist and Republic ships, since I wanted to simulate the idea that by the time the planet explorers decide to send a new RL agent/drone, the state of the war could be different and new ships will occupy different parts of the map).

Also, a hostile planet initially produces the same positive reward halo upon discovery by the RL agent. But once an RL agent dies from its kill radius, the 'Hive Mind' knows it is a hostile planet and immediately marks it as a 'visited planet', so it produces a negative reward halo to discourage future RL agents from going near it.
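A rough sketch of how such a 'Hive Mind' memory, persisting across sub-episodes, might be structured. The class name, fields and reset behaviour described in the comments are hypothetical, not the project's actual implementation.

```python
# Hypothetical Hive Mind: shared memory that survives sub-episode resets.
class HiveMind:
    def __init__(self):
        self.death_locations = set()   # where previous drones were destroyed
        self.visited_planets = set()   # includes hostile planets once a drone has died to them
        self.seen_cells = set()        # parts of the map any previous drone has seen

    def record_death(self, position, killer_planet=None):
        self.death_locations.add(tuple(position))
        if killer_planet is not None:
            # Mark the hostile planet as 'visited' so it emits a negative halo from now on
            self.visited_planets.add(tuple(killer_planet))

# In the environment's reset() between sub-episodes: re-randomise the Separatist and
# Republic ships (the state of the war changes), but keep the HiveMind instance as-is.
```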

**The 'camp' at last planet issue**
For Phase 3, there was an issue where the RL agent found a policy of 'camping' at the last planet to farm the rewards from its positive reward halo rather than visiting it. It likely didn't want to visit it because, once visited, the planet would emit a negative reward halo, which penalises it.

Possible solutions:
- We could fix this by not making visited planets penalise the RL agent.
- We could give the RL agent a big reward if it finds all planets. However, I didn't want to do this because it doesn't make sense: the RL agent isn't supposed to know how many planets there are in the map, and the objective is supposedly to try its best to find as many planets as possible.
- One more option: we could still give the RL agent a big reward for finding all planets, provided it has already seen all parts of the map (since if it has seen the entire map, it makes sense that it should know how many planets there are in the area, and thus whether all planets have already been found). A small sketch of this idea follows the list.
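For that last option, here is a tiny hedged sketch of what such a conditional completion bonus could look like (the function name, the use of coordinate sets and the bonus value are illustrative only):

```python
# Hypothetical completion bonus: only pay the big reward once the agent has both
# seen the whole map and visited every planet it found there.
def completion_bonus(seen_cells, all_cells, visited_planets, all_planets, bonus=100.0):
    map_fully_seen = seen_cells >= all_cells              # set containment check
    all_planets_found = visited_planets >= all_planets
    return bonus if (map_fully_seen and all_planets_found) else 0.0
```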

But I will leave this phenomenon open to your interpretation of the RL Environment and to future experiments.


### Phase 4 - Multi-drones
*Implements Multi-agent collaboration in the RL Environment.*

Agents perceive other agents as part of the Environment.

-- Coming soon! --

- Inspired by this OpenAI article: https://openai.com/index/emergent-tool-use/


### Phase 5 - Bounty Hunter
*Implements Adversarial agents competition in the RL Environment.*

-- Coming soon! --

- Inspired by the AI Warehouse Youtube channel: https://www.youtube.com/@aiwarehouse


## Results
### Phase 1 - Explore Planets

https://github.com/user-attachments/assets/e35d074c-5a28-485a-9526-d6802a10aba0

With Stable Baselines 3's PPO DRL algorithm for 1 million iterations.


### Phase 2 - War

https://github.com/user-attachments/assets/24e89f9d-c866-4227-bfee-a8a782edb981

With Stable Baselines 3's PPO DRL algorithm for 1 million iterations.


### Phase 3 - Hostile Planets and Hive Mind

https://github.com/user-attachments/assets/9c652053-fd08-40ae-ba2b-74fca894e1cf

With Stable Baselines 3's PPO DRL algorithm for 1 million iterations.


*Legend*
| Color | Description |
|---------------------|-----------------------------------------------|
| Black | Starting position |
| Grey | RL agent |
| Yellow | RL agent's vision |
| Red-pink | Parts of the map the RL agent has seen |
| Blue | Republic Ship |
| Blue outline | Republic Ship's vision |
| Red | Separatist Ship |
| Red outline | Separatist Ship's vision |
| Light blue | Planet |
| Green | Visited Planet |
| Orange | Hostile Planet |
| Orange outline | Hostile Planet kill radius |
| Pink | RL agent destroyed location |


## Future Directions
I believe there are many possible improvements and expansions to this project to simulate different scenarios.

**Possible application to the real world**
- Most obviously, space travel (or any similar field of work)
- Since the idea behind this project is:

  > With limited resources (fuel, determined by the time of the mission/episode), what is the best/optimal way to clear as many objectives (find as many planets) as possible?

  it investigates a new perspective on how to tackle this age-old problem with Reinforcement Learning.

**Further testing of the existing RL Environments**
For example,
- currently the NPC ships, particularly the Separatist ships, move at half the speed of the RL agent, making it very easy for the RL agent to outmanoeuvre them. I believe that making them move at the same speed as the RL agent would force it to learn very different policies, though it would definitely take longer/more iterations to train (which I do not have the luxury of, since I'm running all this on my local laptop and each training run takes me about 30 minutes)
- test the RL Environments with other RL algorithms. In all my experiments I only used Stable Baselines 3's PPO DRL algorithm, so I didn't have the luxury of trying others, but let me know if you do (though I assume they should work more or less similarly in finding the planets, since that's the ultimate goal in the various scenarios, with slight differences in cumulative rewards)

**Integration of LLMs? LLM expert + RL?**


## Limitations of this project
In most RL papers, benchmarking is done by comparing reward convergence/cumulative reward graphs (numerical results) between RL algorithms across different RL Environments... however, I wasn't able to do this. The way I tell an RL model is improving is solely by visually watching the simulation (I know it's not the best way, and looking at the reward convergence graph is usually the right way, but I didn't have the time to study the graphs and was a bit lazy 😆).

I do have the logs of the learned policies of the RL models... perhaps I will showcase the reward convergence/cumulative reward graphs in the 'Results' section in the future.

The only proof I can show you that these RL Environments work and give the RL agent the right signals to learn good/expected policies to achieve the objective is the simulation rendering.


## Spark any ideas?
Have an idea for a Star Wars-themed RL Environment? I'm open to anyone with new scenario ideas and their respective RL agents' learned policies, or better reward functions/learned policies for the existing scenarios. Just send me a pull request!