https://github.com/nepfaff/mldecisiontreecoursework
Imperial College London - Introduction to Machine Learning - Decision Tree Coursework
- Host: GitHub
- URL: https://github.com/nepfaff/mldecisiontreecoursework
- Owner: nepfaff
- Created: 2021-11-17T12:17:37.000Z (about 3 years ago)
- Default Branch: master
- Last Pushed: 2022-01-18T11:16:46.000Z (about 3 years ago)
- Last Synced: 2024-11-19T03:19:57.570Z (2 months ago)
- Language: Python
- Size: 660 KB
- Stars: 0
- Watchers: 1
- Forks: 0
- Open Issues: 0
# MLDecisionTreeCoursework
## Project Setup
Please complete the following setup steps before using any of the functionality provided by this package.
1. Ensure that the directory `mldecisiontreecoursework` is present on an Ubuntu machine.
2. Install `mldecisiontreecoursework` by running `pip3 install ./path/to/mldecisiontreecoursework`, where `path/to/mldecisiontreecoursework` is the path to the `mldecisiontreecoursework` directory.

## Producing Evaluation Results
1. Open a terminal inside the `mldecisiontreecoursework` directory.
2. Run one of the following commands:
- Printing the results to the terminal: `./evaluate_decision_tree_algorithm.py`
- Printing the results to a specified text file: `./evaluate_decision_tree_algorithm.py --path "path/to/file.txt"`, where `path/to/file.txt` is the path to the text file to append the results to.

This script produces evaluation results for the decision tree algorithm, both without and with pruning, for the clean and the noisy data sets. These data sets can be found in `mldecisiontreecoursework/Data/intro2ML-coursework1/wifi_db`.

Note that this script might take a couple of minutes to execute. Moreover, there might be a delay between the output for the clean data set and the output for the noisy data set. Please don't terminate the script before both results have been produced.
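For readers writing a similar script, the `--path` behaviour described above could be sketched with `argparse` as shown below. This is not the code of `evaluate_decision_tree_algorithm.py`: the helper names `parse_args` and `emit` are hypothetical, and only the `--path` option and the append behaviour are taken from the instructions above.

```python
import argparse
import os
import tempfile


def parse_args(argv=None):
    """Parse an optional --path argument (hypothetical helper)."""
    parser = argparse.ArgumentParser(description="Sketch of an evaluation CLI.")
    parser.add_argument("--path", default=None,
                        help="Text file to append the results to.")
    return parser.parse_args(argv)


def emit(text, path=None):
    """Print to the terminal, or append to the file at `path` if one is given."""
    if path is None:
        print(text)
    else:
        # Mode "a" appends, so earlier results in the file are preserved.
        with open(path, "a", encoding="utf-8") as f:
            f.write(text + "\n")


# Demonstration with a temporary file instead of real evaluation results.
demo_file = os.path.join(tempfile.mkdtemp(), "results.txt")
args = parse_args(["--path", demo_file])
emit("first run", args.path)
emit("second run", args.path)

with open(demo_file, encoding="utf-8") as f:
    lines = f.read().splitlines()
```

Because the file is opened in append mode, running the sketch twice against the same path keeps the output of both runs.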
## Running Individual Functions
Instructions for running the most important functionality are given below. Please consult the individual docstrings in the source code for additional information.
### Loading data sets from a text file
The following code snippet shows how to load data sets from a text file. `data_file_path` is the `string` path to the text file containing the data. Each row in the file represents one instance and should contain a number of attributes of type `float` followed by one label of type `int`, with individual columns separated by whitespace. `number_of_attributes` is the `int` number of attributes that the data set contains, which is one less than the number of columns. `x` is a `np.ndarray` of type `float` and shape `(n,k)`, where `n` is the number of instances and `k` is the number of attributes. `y` is a `np.ndarray` of type `int` and shape `(n,)`.
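As an illustration of the expected file layout, the sketch below writes a tiny made-up file and reads it back with NumPy's `np.loadtxt`. It is not the implementation of the package's `load_txt_data`; the sample values and the temporary file are assumptions chosen only to show how the columns map to `x` and `y`.

```python
import os
import tempfile

import numpy as np

# Made-up sample: three instances, each with two float attributes
# followed by one int label, separated by whitespace.
sample = "-64.0 -56.5 1\n-68.0 -57.0 1\n-40.5 -73.0 3\n"

path = os.path.join(tempfile.mkdtemp(), "sample.txt")
with open(path, "w", encoding="utf-8") as f:
    f.write(sample)

number_of_attributes = 2  # one less than the number of columns

data = np.loadtxt(path)                          # splits on whitespace by default
x = data[:, :number_of_attributes]               # float array of shape (n, k)
y = data[:, number_of_attributes].astype(int)    # int array of shape (n,)

print(x.shape, y.shape)
```

With a file in this format, the package's own loader is called as follows: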
```python
from data_loading import load_txt_data

x, y = load_txt_data(data_file_path, number_of_attributes)
```

### Training a decision tree without pruning
The following code snippet shows how to train a decision tree using training instances `x_train` and the corresponding class labels `y_train`. `x_train` is a `np.ndarray` of type `float` and shape `(n,k)`, where `n` is the number of instances and `k` is the number of attributes. `y_train` is a `np.ndarray` of type `int` and shape `(n,)`. `decision_tree` is a `dictionary` representation of the trained decision tree. `depth` is the `int` depth of the tree.
```python
from decision_tree import decision_tree_learning

decision_tree, depth = decision_tree_learning(x_train, y_train)
```

### Training a decision tree with pruning
The following code snippet shows how to prune a trained decision tree using a separate validation set. The tree is first trained on the training instances `x_train` and the corresponding class labels `y_train`. `x_train` is a `np.ndarray` of type `float` and shape `(n,k)`, where `n` is the number of instances and `k` is the number of attributes. `y_train` is a `np.ndarray` of type `int` and shape `(n,)`. `x_validation` and `y_validation` have the same format as `x_train` and `y_train`. `pruned_decision_tree` is a pruned version of `decision_tree`, and `validation_errors` is the `int` number of validation errors produced by the pruned tree.
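To make the `validation_errors` value concrete, the sketch below counts misclassified validation instances. It assumes that "validation errors" means the number of validation instances whose predicted label differs from the true label; the helper `count_validation_errors` is hypothetical and not part of the package.

```python
import numpy as np


def count_validation_errors(y_predicted, y_validation):
    """Return the number of instances whose predicted label is wrong."""
    y_predicted = np.asarray(y_predicted)
    y_validation = np.asarray(y_validation)
    if y_predicted.shape != y_validation.shape:
        raise ValueError("prediction and label arrays must have the same shape")
    # Element-wise comparison gives a boolean array; summing counts the mismatches.
    return int(np.sum(y_predicted != y_validation))


# Made-up labels: only the third prediction is wrong.
errors = count_validation_errors(np.array([1, 2, 2, 4]), np.array([1, 2, 3, 4]))
print(errors)  # 1
```

The package's training and pruning calls are: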
```python
from decision_tree import decision_tree_learning, decision_tree_pruning

decision_tree, depth = decision_tree_learning(x_train, y_train)
pruned_decision_tree, validation_errors = decision_tree_pruning(decision_tree, x_train, y_train, x_validation, y_validation)
```

### Using a trained decision tree to predict labels
The following code snippet shows how to use a trained decision tree (pruned or unpruned) to predict the class labels for instances. `decision_tree` is either a pruned or an unpruned decision tree as produced in the above code snippets. `x` contains the instances that we want to predict labels for. It is a `np.ndarray` of type `float` and shape `(n,k)`, where `n` is the number of instances and `k` is the number of attributes. `y_predict` contains the predicted class labels corresponding to `x`. It is a `np.ndarray` of type `int` and shape `(n,)`.
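As a rough picture of how a dictionary-based tree can produce predictions, the sketch below walks a small hand-written tree. The dictionary keys (`attribute`, `value`, `left`, `right`, `leaf`, `label`), the "less than goes left" rule, and the example values are all assumptions for illustration; the layout used by `decision_tree_learning` may differ, so consult its docstring.

```python
import numpy as np

# Hypothetical layout: internal nodes hold a split, leaves hold a label.
example_tree = {
    "attribute": 0,
    "value": -55.0,
    "left": {"leaf": True, "label": 1},
    "right": {
        "attribute": 1,
        "value": -60.0,
        "left": {"leaf": True, "label": 2},
        "right": {"leaf": True, "label": 3},
    },
}


def predict_one(node, instance):
    """Follow splits from the root until a leaf is reached."""
    while not node.get("leaf", False):
        go_left = instance[node["attribute"]] < node["value"]
        node = node["left"] if go_left else node["right"]
    return node["label"]


def predict(tree, x):
    """Predict one int label per row of x, giving an array of shape (n,)."""
    return np.array([predict_one(tree, row) for row in x], dtype=int)


x = np.array([[-70.0, -50.0], [-40.0, -65.0], [-40.0, -58.0]])
y_predict = predict(example_tree, x)
print(y_predict)  # [1 2 3]
```

The package's own prediction function is called as follows: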
```python
from decision_tree import decision_tree_predict

y_predict = decision_tree_predict(decision_tree, x)
```

### Evaluating the decision tree algorithm without pruning using cross-validation
The following code snippet shows how to evaluate the decision tree algorithm without pruning using cross-validation. `x` is a `np.ndarray` of type `float` and shape `(n,k)`, where `n` is the number of instances and `k` is the number of attributes. `y` is a `np.ndarray` of type `int` and shape `(n,)`. `evaluation` is an instance of the `Evaluation` class, which contains the following evaluation metrics:
- `confusion_matrix` is an averaged confusion matrix of type `float` and shape `(c,c)`, where `c` is the number of classes. The rows represent the actual classes and the columns represent the predicted classes.
- `accuracy` is the averaged accuracy.
- `precisions` are the averaged precisions per class.
- `recalls` are the averaged recalls per class.
- `f1s` are the averaged F1-measures per class.

For the per-class metrics, the lowest index corresponds to the class with the lowest label value.
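To illustrate how these metrics relate to a confusion matrix with actual classes in rows and predicted classes in columns, here is a minimal NumPy sketch for a single set of predictions. It does not reproduce the package's `Evaluation` class or its averaging; the function names and the example labels are hypothetical.

```python
import numpy as np


def confusion_matrix(y_true, y_pred, labels):
    """Rows are actual classes, columns are predicted classes."""
    index = {int(label): i for i, label in enumerate(labels)}
    matrix = np.zeros((len(labels), len(labels)), dtype=float)
    for actual, predicted in zip(y_true, y_pred):
        matrix[index[int(actual)], index[int(predicted)]] += 1
    return matrix


def safe_divide(numerator, denominator):
    """Element-wise division that yields 0 where the denominator is 0."""
    result = np.zeros_like(numerator, dtype=float)
    np.divide(numerator, denominator, out=result, where=denominator > 0)
    return result


def metrics(matrix):
    true_positives = np.diag(matrix)
    accuracy = true_positives.sum() / matrix.sum()
    precisions = safe_divide(true_positives, matrix.sum(axis=0))  # column sums
    recalls = safe_divide(true_positives, matrix.sum(axis=1))     # row sums
    f1s = safe_divide(2 * precisions * recalls, precisions + recalls)
    return accuracy, precisions, recalls, f1s


y_true = np.array([1, 1, 2, 2, 3, 3])
y_pred = np.array([1, 2, 2, 2, 3, 1])
matrix = confusion_matrix(y_true, y_pred, labels=np.unique(y_true))
accuracy, precisions, recalls, f1s = metrics(matrix)
print(matrix)
print(accuracy, precisions, recalls, f1s)
```

Sorting the labels with `np.unique` places the class with the lowest label value at index 0. The package computes and averages these metrics through the call below: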
```python
from evaluation import cross_validation

evaluation = cross_validation(x, y)

# Display individual evaluation metrics:
print(evaluation.confusion_matrix)
print(evaluation.accuracy)
print(evaluation.precisions)
print(evaluation.recalls)
print(evaluation.f1s)
```

### Evaluating the decision tree algorithm with pruning using nested cross-validation
The following code snippet shows how to evaluate the decision tree algorithm with pruning using nested cross-validation. `x` is a `np.ndarray` of type `float` and shape `(n,k)`, where `n` is the number of instances and `k` is the number of attributes. `y` is a `np.ndarray` of type `int` and shape `(n,)`. `evaluation` is an instance of the `Evaluation` class, which contains the following evaluation metrics:
- `confusion_matrix` is an averaged confusion matrix of type `float` and shape `(c,c)`, where `c` is the number of classes. The rows represent the actual classes and the columns represent the predicted classes.
- `accuracy` is the averaged accuracy.
- `precisions` are the averaged precisions per class.
- `recalls` are the averaged recalls per class.
- `f1s` are the averaged F1-measures per class.

For the per-class metrics, the lowest index corresponds to the class with the lowest label value. `average_unpruned_depth` is the average depth of all decision trees (before pruning them) obtained inside the nested cross-validation function. `average_pruned_depth` is the average depth of all pruned decision trees obtained inside the nested cross-validation function.
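The index bookkeeping behind a nested cross-validation can be sketched as follows: each outer fold serves once as the test set, and for each of those every remaining fold serves once as the validation set used for pruning. This is a generic sketch, not the package's `nested_cross_validation`; the function name, the `n_folds` and `seed` parameters, and the fold count in the example are assumptions.

```python
import numpy as np


def nested_cv_splits(n_instances, n_folds, seed=0):
    """Yield (train, validation, test) index arrays for nested cross-validation."""
    if n_folds < 3:
        raise ValueError("n_folds must be at least 3 to leave a training set")
    rng = np.random.default_rng(seed)
    # Shuffle the instance indices once, then cut them into n_folds parts.
    folds = np.array_split(rng.permutation(n_instances), n_folds)
    for test_i in range(n_folds):          # outer loop: held-out test fold
        for val_i in range(n_folds):       # inner loop: validation fold
            if val_i == test_i:
                continue
            train_idx = np.concatenate(
                [folds[i] for i in range(n_folds) if i not in (test_i, val_i)]
            )
            yield train_idx, folds[val_i], folds[test_i]


splits = list(nested_cv_splits(n_instances=20, n_folds=4))
print(len(splits))  # 12, i.e. 4 outer folds times 3 inner folds
```

In such a scheme, a tree would be trained on `train_idx`, pruned against the validation indices, and scored on the test indices. The package's nested cross-validation is invoked as follows: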
```python
from evaluation import nested_cross_validation

evaluation, average_unpruned_depth, average_pruned_depth = nested_cross_validation(x, y)

# Display individual evaluation metrics:
print(evaluation.confusion_matrix)
print(evaluation.accuracy)
print(evaluation.precisions)
print(evaluation.recalls)
print(evaluation.f1s)
print(average_unpruned_depth)
print(average_pruned_depth)
```

## Visualisation
To visualise a decision tree trained on the entire clean data set without pruning, run the script `mldecisiontreecoursework/visualise_decision_tree.py`. For example, from inside the `mldecisiontreecoursework` directory, this script can be called using `./visualise_decision_tree.py`.
This creates a `.png` file named `clean_unprunned.png` inside the `figures/` directory.

## Testing
The directory `mldecisiontreecoursework/test` contains unit tests for the majority of the functionality present in `mldecisiontreecoursework`. The testing framework used is `pytest`/`pytest-3`. All unit tests can be run using the script `mldecisiontreecoursework/test/run_tests.sh`. For example, from inside the `mldecisiontreecoursework` directory, this script can be called using `./test/run_tests.sh`.