https://github.com/lucacappelletti94/snv_classifier

Project for the bioinformatics course of professor Valentini, Unimi.

Host: GitHub
URL: https://github.com/lucacappelletti94/snv_classifier
Owner: LucaCappelletti94
License: mit
Created: 2018-06-26T05:30:53.000Z (almost 7 years ago)
Default Branch: master
Last Pushed: 2019-02-01T06:40:22.000Z (over 6 years ago)
Last Synced: 2025-02-08T13:14:17.838Z (4 months ago)
Language: Jupyter Notebook
Size: 87.2 MB
Stars: 2
Watchers: 2
Forks: 0
Open Issues: 0
Metadata Files:
- Readme: README.md
- License: LICENSE

# SNV Classifier

Project for the bioinformatics course of professor Valentini, Unimi.

## Documentation

The project documentation is available [here](https://github.com/LucaCappelletti94/snv_classifier/blob/master/documentation/Latex/Documentation/main.pdf). It covers the analysis and visualization of the datasets, the modelling of the network, and the results.

## Doubts over obtained results

Simple networks, such as one with 2 layers of 3 neurons each, reach a high score on the test set extremely quickly, which looks like "overfitting" on the test set and raises doubts about the results. The likely cause (with experimental evidence in the PCA notebook) is that the distribution of the test set does not reflect the distribution of the training set: the test set is far more easily separable.
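
The kind of check described above can be sketched as follows: fit PCA on the training set only, project both sets onto the same axes, and compare how far apart the two classes sit in each. This is a minimal sketch and not the code of the PCA notebook. The helper names (`fit_pca`, `project`, `separability`), the separability score and the synthetic data are all assumptions made for illustration.

```python
import numpy as np


def fit_pca(train, n_components=2):
    """Return the training mean and the top principal axes of `train`."""
    mean = train.mean(axis=0)
    # Rows of vt are the principal axes, ordered by explained variance.
    _, _, vt = np.linalg.svd(train - mean, full_matrices=False)
    return mean, vt[:n_components]


def project(points, mean, components):
    """Project points onto principal axes fitted on the training set."""
    return (points - mean) @ components.T


def separability(projected, labels):
    """Distance between class centroids over the mean within-class spread."""
    pos, neg = projected[labels == 1], projected[labels == 0]
    gap = np.linalg.norm(pos.mean(axis=0) - neg.mean(axis=0))
    spread = (pos.std(axis=0).mean() + neg.std(axis=0).mean()) / 2
    return gap / spread


def make_split(rng, n_per_class, offset):
    """Synthetic stand-in data (NOT the project dataset)."""
    neg = rng.normal(0.0, 1.0, size=(n_per_class, 5))
    pos = rng.normal(offset, 1.0, size=(n_per_class, 5))
    points = np.vstack([neg, pos])
    labels = np.array([0] * n_per_class + [1] * n_per_class)
    return points, labels


rng = np.random.RandomState(42)
# The test classes are deliberately placed further apart than the
# training classes, to mimic a test set that is easier to separate.
train_x, train_y = make_split(rng, 500, offset=1.0)
test_x, test_y = make_split(rng, 100, offset=3.0)

mean, components = fit_pca(train_x)
train_score = separability(project(train_x, mean, components), train_y)
test_score = separability(project(test_x, mean, components), test_y)
print('train separability: %.2f' % train_score)
print('test separability:  %.2f' % test_score)
```

A test score well above the train score on the same axes is the pattern that would support the explanation given above.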

## Batch of neural networks

To verify whether 36/40 is the maximum precision that a common neural network can reach on the given dataset, I trained 136 networks spanning a gradient of architectures, for 100 epochs each.

All the trained models and weights are available [here](https://github.com/LucaCappelletti94/snv_classifier/tree/master/meta_networks).

### Errors and issues with this approach

- Deeper networks need more epochs to converge.

- I **forgot** to reset the random seed for each network, so the networks start from different random weights. I will retrain the networks with the seed reset as soon as I find the time.
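
The seed issue can be illustrated with plain NumPy. This is a minimal sketch, not the training code of this repository: `SEED`, `INPUT_SIZE`, `ARCHITECTURES`, `init_weights` and `build_all` are hypothetical names and values, and the weight initialisation is a stand-in for whatever Keras does internally.

```python
import numpy as np

SEED = 42        # hypothetical value, any fixed integer works
INPUT_SIZE = 10  # hypothetical number of input features

# Hypothetical architectures: hidden-layer widths, smallest to largest.
ARCHITECTURES = [(3,), (3, 3), (5, 3), (5, 5, 3)]


def init_weights(hidden_layers, rng):
    """Draw initial weight matrices for a dense network with one output."""
    sizes = (INPUT_SIZE,) + tuple(hidden_layers) + (1,)
    return [rng.normal(0.0, 0.1, size=(n_in, n_out))
            for n_in, n_out in zip(sizes[:-1], sizes[1:])]


def build_all(reset_seed):
    """Initialise every architecture, optionally reseeding before each one."""
    rng = np.random.RandomState(SEED)
    networks = {}
    for hidden_layers in ARCHITECTURES:
        if reset_seed:
            # Fresh generator: the draw no longer depends on earlier networks.
            rng = np.random.RandomState(SEED)
        networks[hidden_layers] = init_weights(hidden_layers, rng)
    return networks


shared = build_all(reset_seed=False)
reseeded = build_all(reset_seed=True)

# Reference: the (3, 3) network built on its own from the same seed.
alone = init_weights((3, 3), np.random.RandomState(SEED))
same_when_reseeded = all(
    np.array_equal(a, b) for a, b in zip(reseeded[(3, 3)], alone))
same_when_shared = all(
    np.array_equal(a, b) for a, b in zip(shared[(3, 3)], alone))
print(same_when_reseeded, same_when_shared)
```

This prints `True False`: with a reset, a network's starting weights are reproducible regardless of which networks were built before it. In Keras the same idea means seeding NumPy and the backend before constructing each model; the exact calls depend on the backend and version.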

### Results

The approach suggests that 36/40 is the maximum precision these networks can reach on this dataset.

## Jupyter Notebooks

Several Jupyter notebooks with explanations are available:

### Keras neural network

A Jupyter notebook implementing the [project neural network](https://github.com/LucaCappelletti94/snv_classifier/blob/master/documentation/Latex/Documentation/images/network.png) in Keras is available [here](https://github.com/LucaCappelletti94/snv_classifier/blob/master/Bioinformatica%20-%20Keras.ipynb).



#### Network trained model usage example

A Jupyter notebook with a usage example of the trained model is available [here](https://github.com/LucaCappelletti94/snv_classifier/blob/master/Bioinformatica%20-%20Loading%20saved%20model.ipynb) and is reproduced below:

```python
#!/usr/bin/python
# -*- coding: utf-8 -*-
"""Load the saved Keras model and classify an example dataset."""
import numpy as np
from keras.models import load_model


def number_to_class(value):
    """Map class identifier to class name."""
    if value:
        return 'Positive'
    return 'Negative'


EXAMPLE_DATASET = 'Mendelian.normalized.example.test.tsv'

# Load the saved model, then the saved weights.
model = load_model('model.h5')
model.load_weights('weights.h5')

# Tab-separated numeric values, loaded as a NumPy array.
data_points = np.loadtxt(EXAMPLE_DATASET, delimiter='\t')

# The expected class is hard-coded to 1 (Positive) in this example.
# predict_classes is only available on Sequential models in older
# Keras releases.
for prediction in model.predict_classes(data_points):
    print('I believe %s to be %s' % (number_to_class(1),
                                     number_to_class(prediction)))

"""
  I believe Positive to be Positive
  I believe Positive to be Positive
  I believe Positive to be Positive
  I believe Positive to be Positive
  I believe Positive to be Positive
  I believe Positive to be Positive
  I believe Positive to be Negative
  I believe Positive to be Positive
  I believe Positive to be Positive
  I believe Positive to be Negative
"""
```
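
Recent Keras releases no longer provide `predict_classes`, so the example may need a small adaptation there. The sketch below shows one way to derive class identifiers from the output of `model.predict`. The helper `probabilities_to_classes` and its `threshold` parameter are hypothetical names, and the arrays are made-up values standing in for real predictions. It assumes the network ends in either a single sigmoid-style unit or a softmax-style layer, which is not confirmed by this README.

```python
import numpy as np


def probabilities_to_classes(probabilities, threshold=0.5):
    """Turn network outputs into class identifiers.

    A single output column is thresholded (sigmoid-style output);
    several columns are reduced with argmax (softmax-style output).
    """
    probabilities = np.asarray(probabilities)
    if probabilities.ndim == 2 and probabilities.shape[1] > 1:
        return probabilities.argmax(axis=1)
    # Flatten so the result is a 1D array of 0/1 identifiers.
    return (probabilities.reshape(-1) > threshold).astype(int)


# Made-up outputs standing in for model.predict(data_points).
sigmoid_like = np.array([[0.91], [0.12], [0.55]])
softmax_like = np.array([[0.2, 0.8], [0.7, 0.3]])

print(probabilities_to_classes(sigmoid_like))  # [1 0 1]
print(probabilities_to_classes(softmax_like))  # [1 0]
```

With such a helper, the loop would iterate over `probabilities_to_classes(model.predict(data_points))` instead of `model.predict_classes(data_points)`.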

### Scatter plot

A Jupyter notebook generating a [scatter plot](https://github.com/LucaCappelletti94/snv_classifier/blob/master/scatter_plot.png?raw=true) from the dataset is available [here](https://github.com/LucaCappelletti94/snv_classifier/blob/master/Bioinformatica%20-%20Scatter%20plot.ipynb).



### Correlation matrices

A Jupyter notebook generating a [correlation matrix](https://github.com/LucaCappelletti94/snv_classifier/blob/master/correlation_matrix.png?raw=true) from the dataset is available [here](https://github.com/LucaCappelletti94/snv_classifier/blob/master/Bioinformatica%20-%20Correlation.ipynb).
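
For readers who want the idea without opening the notebook, a correlation matrix between features can be computed with NumPy as in the sketch below. It is not taken from the notebook: the data are synthetic, and the variable names and the choice of `np.corrcoef` (Pearson correlation) are assumptions.

```python
import numpy as np

# Synthetic stand-in for the dataset: 200 samples, 4 features, where
# feature 1 is built to follow feature 0 closely.
rng = np.random.RandomState(0)
features = rng.normal(size=(200, 4))
features[:, 1] = features[:, 0] + 0.1 * rng.normal(size=200)

# rowvar=False treats each column as one variable (feature).
correlation = np.corrcoef(features, rowvar=False)
print(np.round(correlation, 2))
```

The result is a symmetric 4x4 matrix with ones on the diagonal; the entry for features 0 and 1 is close to 1 because of how the synthetic data were built.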



### PCA

A Jupyter notebook generating [2D PCA visualizations](https://github.com/LucaCappelletti94/snv_classifier/tree/master/documentation/Latex/Documentation/images/pca) of the dataset is available [here](https://github.com/LucaCappelletti94/snv_classifier/blob/master/Bioinformatica%20-%20PCA.ipynb).



### t-SNE

A Jupyter notebook generating [2D t-SNE visualizations](https://github.com/LucaCappelletti94/snv_classifier/tree/master/documentation/Latex/Documentation/images/tsne) of the dataset is available [here](https://github.com/LucaCappelletti94/snv_classifier/blob/master/Bioinformatica%20-%20TSNE.ipynb).



### Dataset plots

A Jupyter notebook generating [dataset plots](https://github.com/LucaCappelletti94/snv_classifier/tree/master/documentation/Latex/Documentation/images/plot) is available [here](https://github.com/LucaCappelletti94/snv_classifier/blob/master/Bioinformatica%20-%20Metrics%20plots.ipynb).



### Dataset distributions

A Jupyter notebook generating [dataset distributions](https://github.com/LucaCappelletti94/snv_classifier/tree/master/documentation/Latex/Documentation/images/distributions) is available [here](https://github.com/LucaCappelletti94/snv_classifier/blob/master/Bioinformatica%20-%20Metric%20distributions.ipynb).
