# Breast Cancer Detection w/ K-Nearest Neighbours (2019)
This is my approach to implementing **K-Nearest Neighbours from scratch**, applied to the **Breast Cancer Wisconsin (Diagnostic) dataset** from the [UCI Machine Learning Repository](https://archive.ics.uci.edu/ml/datasets/Breast+Cancer+Wisconsin+%28Diagnostic%29).
## How do I use this Repo?
Simple. Just clone the repo into a folder, open a terminal in that directory, and run:
```
python cancer.py
```
By default, the following code will run:
```python
### MAIN ###
dataset = loadData()                             # {"B": [...], "M": [...]} from data.csv
trainingData, testingData = crossValidate(dataset, trainSize=0.65)  # 65%/35% split
k = 7                                            # number of neighbours that get a vote
knn = knn.KNearestNeighbour()                    # note: rebinds `knn` from the module to the model instance
score = 100 * knn.getScore(trainingData, testingData, k)
print("average accuracy: {0:.5f}%".format(score))
```
Alternatively, you can open a Python interpreter in the directory, import the relevant modules, and call the methods yourself.
For example:
```
$ python
Python 3.6.5 (v3.6.5:f59c0932b4, Mar 28 2018, 17:00:18) [MSC v.1900 64 bit (AMD64)] on win32
Type "help", "copyright", "credits" or "license" for more information.
>>> import knn
>>> import cancer
>>> dataset = cancer.loadData()
>>> trainingData, testingData = cancer.crossValidate(dataset, trainSize=0.65)
>>> k = 7
>>> knn = knn.KNearestNeighbour()
>>> score = 100*knn.getScore(trainingData, testingData, k)
>>> print("average accuracy: {0:.5f}%".format(score))
accuracy for M diagnosis: 91.33%
accuracy for B diagnosis: 96.80%
average accuracy: 94.06500%
```
This will:
* load the dataset from `data.csv`
* cross-validate the data, splitting it into a training set and a testing set
* initialise a new KNN model
* test KNN on the testing set, given the training set
* output the average accuracy (across both benign and malignant classes)
###### *Feel free to remove this code and mess around with the methods!*
## cancer.py
#### Loading the Data from .csv
First, I start off by loading the data from the .csv using the **loadData()** function. The returned value will look something like this: `{"B": [b1, b2, ..., bn], "M": [m1, m2, ..., mn]}`, where each `bi` or `mi` is a 30-dimensional list (vector).
###### *Note: it is 30-D instead of 32-D as I have removed the "id" and "diagnosis" columns within loadData()*
For example, loading the data into the format shown above:
```python
dataset = loadData()
```
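For illustration, a minimal loader producing that shape might look like the sketch below. This is not necessarily the repo's exact implementation; it assumes the standard WDBC column order (id, diagnosis, then the 30 features) and a header row:
```python
import csv

def loadData(path="data.csv"):
    """Minimal sketch of a loader producing {"B": [...], "M": [...]}.
    Assumes the WDBC layout: column 0 = id, column 1 = diagnosis ("B"/"M"),
    columns 2-31 = the 30 numeric features, preceded by a header row."""
    dataset = {"B": [], "M": []}
    with open(path, newline="") as f:
        reader = csv.reader(f)
        next(reader)                                   # skip the header row
        for row in reader:
            features = [float(x) for x in row[2:32]]   # drop "id" and "diagnosis"
            dataset[row[1]].append(features)           # group by diagnosis
    return dataset
```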
#### Cross Validating the Data
As I have a finite dataset to work with, I cannot use all of it to "train" KNN; instead, I cross-validate. This essentially means shuffling the data and splitting it into two groups, one for training and one for testing.
This is done because a model trained and tested on the same data would score higher than it would on unseen points (there would always be a training point sitting exactly on each testing point)... You wouldn't give your student a final exam that was identical to a past paper.
**crossValidate(data, trainSize=0.8)** performs this cross-validation and returns trainingData and testingData (in that order), where the arguments are:
* *data*: the full dataset, a matrix of row vectors which are the datapoints (n by m)
* *trainSize*: the fraction of *data* set aside for training (default 0.8)

For example, creating two datasets for training and testing:
```python
trainingData, testingData = crossValidate(dataset, trainSize=0.6)
```
Let's assume that ```dataset``` had 100 vectors; then ```trainingData``` and ```testingData``` would have 60 and 40 vectors, respectively.
###### *Note: trainingData and testingData are disjoint (share no common elements)*
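A plausible implementation, assuming the `{"B": [...], "M": [...]}` dict shape returned by loadData() (the repo's version may differ in details), is a per-class shuffle-and-slice:
```python
import random

def crossValidate(data, trainSize=0.8):
    """Sketch: shuffle each class independently, then cut it at trainSize.
    The returned training and testing dicts are disjoint by construction."""
    trainingData, testingData = {}, {}
    for diagnosis, vectors in data.items():
        shuffled = vectors[:]                  # copy, so the caller's lists survive
        random.shuffle(shuffled)
        cut = int(len(shuffled) * trainSize)
        trainingData[diagnosis] = shuffled[:cut]
        testingData[diagnosis] = shuffled[cut:]
    return trainingData, testingData
```
Splitting each class separately also keeps the benign/malignant ratio roughly the same in both sets, which is a helpful side effect when the classes are imbalanced.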
#### Finding the best hyper-parameters (K and trainSize)
This is a completely inefficient (brute-force) way to find the best hyper-parameters for the model: **bruteForceBestHyperParams(diagnosis)**, where *diagnosis* is "M" or "B". The method sweeps trainSize from 0.4 to 0.9 for the cross-validation split and, for each trainSize, sweeps k from 1 to 15, re-randomising the dataset and running KNN each time to build up a list of accuracies and the hyper-parameters that produced them.
The basic pseudocode is:
```
bestParameters = []
for trainSize from 0.4 to 0.9:
    for k from 1 to 15:
        randomise the dataset
        get the accuracy
        bestParameters += (accuracy, trainSize, k)
sort bestParameters by accuracy
return bestParameters
```
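Made concrete, the sweep might look like the Python sketch below. The step size for trainSize is an assumption (the README only gives the range), and the signature is simplified to take the dataset explicitly and score both classes at once:
```python
def bruteForceBestHyperParams(dataset):
    """Sketch of the brute-force sweep over (trainSize, k) pairs.
    Assumes `import knn` plus the loadData()/crossValidate() shapes above;
    the repo's version takes a `diagnosis` argument instead."""
    model = knn.KNearestNeighbour()
    results = []
    for trainSize in [s / 100 for s in range(40, 91, 5)]:  # 0.40, 0.45, ..., 0.90 (step assumed)
        for k in range(1, 16):                             # k = 1 .. 15
            # each crossValidate() call re-randomises the split
            trainingData, testingData = crossValidate(dataset, trainSize)
            accuracy = model.getScore(trainingData, testingData, k)
            results.append((accuracy, trainSize, k))
    return sorted(results, reverse=True)                   # best accuracy first
```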
From doing this and collating the results, I have concluded that the best results from my KNN are achieved with **trainSize ≈ 0.63** and **k ≈ 7**.
## knn.py
#### Predicting a new sample given data
**knn.predict(data, sample, k=n)** predicts which class (benign/malignant) *sample* belongs to, based on *data*, the labelled points around it. The arguments are:
* *data*: the training data, a matrix of row vectors which are the datapoints (n by m)
* *sample*: a single row vector (1 by m)
* *k*: the number of nearest neighbours the algorithm will consider before classifying *sample*
For example, predicting the class of the first vector in ```testingData["M"]``` (the malignant class):
```python
>>> knn.predict(trainingData, testingData["M"][0], k=7)
'M'
```
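The classification step itself is simple enough to sketch in full. Assuming the dict-of-classes shape used throughout and plain Euclidean distance (the repo may differ on details such as distance weighting), it amounts to:
```python
from collections import Counter

def predict(data, sample, k=5):
    """Sketch of the KNN core: rank every training vector by Euclidean
    distance to `sample`, then majority-vote over the k nearest labels."""
    distances = []
    for label, vectors in data.items():        # data = {"B": [...], "M": [...]}
        for vector in vectors:
            d = sum((a - b) ** 2 for a, b in zip(vector, sample)) ** 0.5
            distances.append((d, label))
    distances.sort()                           # nearest first
    kNearest = [label for _, label in distances[:k]]
    return Counter(kNearest).most_common(1)[0][0]
```
An odd k (like the k = 7 used above) avoids ties between the two classes.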
#### Finding the Accuracy
**getScore(trainingData, testingData, k)** calculates the accuracy of the model, given *trainingData*, *testingData*, and a value for *k*. It is similar to predict, except that *testingData* is a matrix of row vectors rather than a single row vector, and the return value is a fraction between 0 and 1 (the main script multiplies it by 100 to get a percentage).
Ideally, *trainingData* and *testingData* should be cross-validated from the original dataset.
###### *Note: by "accuracy" I mean (number of correct diagnoses)/(total number of cases)*
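A sketch consistent with the transcript earlier (per-class accuracy lines printed as a side effect, a 0-1 fraction returned), though not necessarily the repo's exact code:
```python
def getScore(trainingData, testingData, k):
    """Sketch: classify every test vector against the training data and
    report (correct diagnoses) / (total cases), averaged over the classes.
    Assumes the predict() sketch above and the usual dict-of-classes shape."""
    perClass = {}
    for label, vectors in testingData.items():
        correct = sum(predict(trainingData, v, k=k) == label for v in vectors)
        perClass[label] = correct / len(vectors)
        print("accuracy for {} diagnosis: {:.2f}%".format(label, 100 * perClass[label]))
    return sum(perClass.values()) / len(perClass)   # fraction between 0 and 1
```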