https://github.com/jfdev001/id3-decision-tree
Implementation of ID3 algorithm for decision tree classifier from scratch.
- Host: GitHub
- URL: https://github.com/jfdev001/id3-decision-tree
- Owner: jfdev001
- Created: 2021-11-11T06:49:42.000Z (over 3 years ago)
- Default Branch: main
- Last Pushed: 2021-12-09T17:48:34.000Z (over 3 years ago)
- Last Synced: 2025-02-08T13:45:34.506Z (5 months ago)
- Topics: decision-tree-classifier, python
- Language: Python
- Homepage:
- Size: 1.71 MB
- Stars: 0
- Watchers: 1
- Forks: 1
- Open Issues: 0
Metadata Files:
- Readme: README.md
README
# ID3 Decision Tree
An implementation of the [ID3 Algorithm](https://en.wikipedia.org/wiki/ID3_algorithm) for building classification decision trees by maximizing information gain. It is intended for continuous data with any number of features and a single label (which can be multi-class). Binary splitting is used to handle continuous data. A report summarizing the methodology and results is included as well.
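As a rough illustration of the core computation (not code taken from this repository), entropy-based information gain over candidate binary thresholds for a continuous feature might look like the sketch below; the function names and data layout are assumptions.

```
import numpy as np

def entropy(labels):
    """Shannon entropy of a 1D array of class labels."""
    _, counts = np.unique(labels, return_counts=True)
    probs = counts / counts.sum()
    return -np.sum(probs * np.log2(probs))

def best_binary_split(feature_values, labels):
    """Return (threshold, gain) for the binary split of one continuous
    feature that maximizes information gain."""
    feature_values = np.asarray(feature_values, dtype=float)
    labels = np.asarray(labels)
    parent_entropy = entropy(labels)
    order = np.argsort(feature_values)
    values, labels = feature_values[order], labels[order]
    best_gain, best_threshold = 0.0, None
    # Candidate thresholds: midpoints between consecutive distinct values.
    for i in range(1, len(values)):
        if values[i] == values[i - 1]:
            continue
        threshold = (values[i] + values[i - 1]) / 2
        left, right = labels[:i], labels[i:]
        weighted = (len(left) * entropy(left) + len(right) * entropy(right)) / len(labels)
        gain = parent_entropy - weighted
        if gain > best_gain:
            best_gain, best_threshold = gain, threshold
    return best_threshold, best_gain
```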
For a less organized view, see the branch `20211112-turnedin`, which contains the materials turned in as part of MTSU's CSCI 4350 (Intro to AI) open lab assignment.
For a summary of results, see `report/report.pdf`.
# Installation
This repository relies on the Python scientific computing/data analysis libraries NumPy, Pandas, and Matplotlib.
`pip install numpy pandas matplotlib`
# Testing
To train and subsequently test the decision tree, use the command line interface for `id3.py`:
```
python id3.py ./data/iris-data.txt ./data/iris-data.txt
```
The first argument is the training data and the second is the testing data; in this case, training and testing occur on the same data set. For more arguments to any of the scripts, type `python name_of_script.py -h`.
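Conceptually, the two positional arguments drive a load-train-evaluate flow along these lines. This sketch assumes whitespace-delimited data files with the label in the last column, and it uses a trivial majority-class predictor as a stand-in for the actual tree in `id3.py`.

```
import sys
import numpy as np

def load_data(path):
    """Load whitespace-delimited rows; the label is assumed to be the last column."""
    data = np.loadtxt(path)
    return data[:, :-1], data[:, -1]

def majority_class_predict(y_train, n_test):
    """Stand-in predictor (majority class); the repository's tree replaces this."""
    values, counts = np.unique(y_train, return_counts=True)
    return np.full(n_test, values[np.argmax(counts)])

if __name__ == "__main__":
    X_train, y_train = load_data(sys.argv[1])   # first argument: training data
    X_test, y_test = load_data(sys.argv[2])     # second argument: testing data
    predictions = majority_class_predict(y_train, len(y_test))
    print(f"accuracy: {np.mean(predictions == y_test):.3f}")
```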
# Analysis
The results in `report/report.pdf` were obtained from random resamples of test size `n`. For the iris data set,
`n = [1,5,10,25,50,75,100,125,140,145,149]`, while for the cancer data set
`n = [1,5,10,25,50,75,90,100,104]`.
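In Python terms, the resampling scheme amounts to something like the sketch below (the repository itself drives it through `split.bash` and `parallelize.bash`, as shown next); `fit_and_predict` is a hypothetical stand-in for training the tree and predicting on the held-out rows.

```
import numpy as np

rng = np.random.default_rng(0)

def resample_accuracies(X, y, test_sizes, trials, fit_and_predict):
    """For each test size n, hold out n random rows `trials` times, train on
    the remainder, and record the accuracy on the held-out rows."""
    results = {}
    for n in test_sizes:
        accuracies = []
        for _ in range(trials):
            idx = rng.permutation(len(y))
            test_idx, train_idx = idx[:n], idx[n:]
            predictions = fit_and_predict(X[train_idx], y[train_idx], X[test_idx])
            accuracies.append(np.mean(predictions == y[test_idx]))
        results[n] = accuracies
    return results
```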
To reproduce the results in `stats/`, use the following commands for each respective dataset:
```
# iris
for n in 1 5 10 25 50 75 100 125 140 145 149; do for ((x=0; x<100; x++)); do echo "cat iris-data.txt | ./split.bash $n python id3.py --percentage True --precision 3"; done | ./parallelize.bash; done >> stats/iris_out.txt

# cancer
for n in 1 5 10 25 50 75 90 100 104; do for ((x=0; x<100; x++)); do echo "cat cancer-data.txt | ./split.bash $n python id3.py --percentage True --precision 3"; done | ./parallelize.bash; done >> stats/cancer_out.txt
```
Plotting and computation of descriptive statistics (mean and standard error) can be done using `plot.py` and `stats.py`, respectively.
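For reference, the descriptive statistics named above reduce to a few lines of NumPy. This is a sketch, not the contents of `stats.py`, and it assumes the standard error of the mean (sample standard deviation over the square root of the number of trials).

```
import numpy as np

def mean_and_standard_error(accuracies):
    """Mean and standard error of the mean for one test size's accuracies."""
    accs = np.asarray(accuracies, dtype=float)
    mean = accs.mean()
    sem = accs.std(ddof=1) / np.sqrt(len(accs))  # sample std / sqrt(N)
    return mean, sem
```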
# Future Work
The program currently works for continuous data only. It could be adapted for categorical data, or different splitting techniques for continuous data could be implemented.
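For instance, classic ID3 handles a categorical feature with one branch per category; the information gain of such a multiway split could be computed roughly as follows (illustrative only, not part of this repository).

```
import numpy as np

def categorical_information_gain(feature_values, labels):
    """Information gain of a multiway split on a categorical feature,
    i.e. one child node per distinct category (as in classic ID3)."""
    feature_values = np.asarray(feature_values)
    labels = np.asarray(labels)

    def entropy(y):
        _, counts = np.unique(y, return_counts=True)
        p = counts / counts.sum()
        return -np.sum(p * np.log2(p))

    parent = entropy(labels)
    weighted_children = 0.0
    for category in np.unique(feature_values):
        child = labels[feature_values == category]
        weighted_children += len(child) / len(labels) * entropy(child)
    return parent - weighted_children
```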