https://github.com/elixir-code/menacer
Machine Educable Noughts and Crosses Engine - Revived
https://github.com/elixir-code/menacer
markov-decision-processes menace reinforcement-learning value-iteration-algorithm
Last synced: about 1 month ago
JSON representation
Machine Educable Noughts and Crosses Engine - Revived
- Host: GitHub
- URL: https://github.com/elixir-code/menacer
- Owner: elixir-code
- License: mit
- Created: 2018-12-25T12:48:35.000Z (over 7 years ago)
- Default Branch: master
- Last Pushed: 2019-01-19T17:58:55.000Z (over 7 years ago)
- Last Synced: 2025-07-06T04:37:35.002Z (11 months ago)
- Topics: markov-decision-processes, menace, reinforcement-learning, value-iteration-algorithm
- Language: Python
- Homepage:
- Size: 254 KB
- Stars: 2
- Watchers: 1
- Forks: 0
- Open Issues: 0
-
Metadata Files:
- Readme: docs/README.rst
- License: LICENSE
Awesome Lists containing this project
README
==============================================================
MENACER: Machine Educable Noughts and Crosses Engine - Revived
==============================================================
.. From Layman's Perspective
**MENACER** (``Machine Educable Noughts and Crosses Engine - Revived``) is a computer program that plays the game of Noughts and Crosses (aka. Tic-Tac-Toe). It learns, evolves and gets better at the game with every game it plays.
API Examples
============
----------------
MENACER vs Human
----------------
.. code:: python
from menacer import AgentX, AgentO, playNoughtsCrosses
# Initialize MDP Agent that plays 'X' with random policy
agentx = AgentX()
agento = 'human'
while True:
# Simulate a game between agentx and agento
game = playNoughtsCrosses(agentx, agento)
# Update the MDP and policy of agentx to learn from game
agentx.learnGameplay([game])
------------------
MENACER vs MENACER
------------------
.. code:: python
from menacer import AgentX, AgentO, playNoughtsCrosses
# Initialize MDP Agents that plays 'X' and 'O' with random policies
agentx = AgentX()
agento = AgentO()
# Define the number games to be simulated
n_games = 1000
for i_game in range(n_games):
# Simulate a game between agentx and agento
game = playNoughtsCrosses(agentx, agento)
# Update the MDP and policy of agents to learn from game
agentx.learnGameplay([game])
agento.learnGameplay([game])
--------------------------------------
Dumping and Loading Pre-Trained Agents
--------------------------------------
Pre-trained agents can be *serialized*, and dumped or loaded from binary files using the `pickle `_ library.
How it Works
============
.. From Reinforcement Learning Perspective
MENACER is a simple **reinforcement learning** (RL) agent that uses **Markov Decision Process (MDP)** model to capture the dynamics of the game, and **value iteration** algorithm to determine the probabilistically optimum move to play for every possible configuration of the board.
MENACER employs two seperate **Markov Decision Process (MDP)** models to learn the dynamics of gameplay for agents that play the Noughts ('X') and the Crosses ('O') respectively.
.. contents:: Markov Decision Process (MDP) model
:local:
-------------------
Representing States
-------------------
The various possible configurations of the *Noughts and Crosses* board correspond to the states in the MDP models.
The MDP model of an agent (either the agent that plays *Noughts* or the agent that plays *Crosses*) only involves states where **the agent plays the next move**, along with states that correspond to the **won, lost and drawn** configurations of the board.
.. image:: static-assets/board.png
:align: left
The board configurations can be represented in two forms:
+ **String Representation:** a 9-character string composed of '**x**' (noughts), '**o**' (crosses) and '**.**' (empty space).
**Example:** '.ooxo..x.'
.. _`array representation`:
+ **Array Representation:** a 9-element array composed of '1' (noughts), '0' (empty space) and '-1' (crosses).
**Example:** [0, -1, -1, 1, -1, 0, 0, 1, 0]
The string notation can be conveniently used as keys in the hashing data structures used to store the policy, transition probabilities and rewards for the states in the MDP model.
Standard Form of a State
------------------------
The next move to play for a given board configuration is symmetric (or identical) for *rotated and/or mirrored* configurations of the board. The board configurations can be represented in a rotation and mirror-invariant form called the **standard form**.
The **standard form** of a board configuration is the board configuration which has the lexicographically largest `array representation`_ among the eight possible rotated and/or mirrored configurations of the given board.
.. image:: static-assets/standard-form.png
..
The use of the standard form of the board configurations to represent the states in the MDP models drastically reduce the number of possible states and improve the learning capability of the agent.
--------------------
Representing Actions
--------------------
.. image:: static-assets/board-actions.png
:align: left
.. End of image directive
The various *next moves* (or positions) that an agent can play for a given board configuration correspond to **actions** that can be performed at the corresponding state in the MDP model.
The possible actions that can be performed at a given state in the MDP models are encoded as a *subset of numbers enumerated from 0 to 8*, each corresponding to one of the nine possible positions in the board.
Contributor's Section
=====================
The MENACER community encouages all its members to contribute to the project in however small ways possible.
Some of the important milestones in the future roadmap of MENACER include:
+ **Creation of Website:** Since, large hours of training are necessary for the agents to capture the complete dynamics of the game and evolve to become expert players in the game of Noughts and Crosses, creation of a website where the users can play against MENACER is of primal focus.
Please refer to `issues `_ section for related discussion and more information on possible directions of future work.