{"id":15906161,"url":"https://github.com/lucacappelletti94/snv_classifier","last_synced_at":"2025-04-02T22:43:45.613Z","repository":{"id":98555889,"uuid":"138689041","full_name":"LucaCappelletti94/snv_classifier","owner":"LucaCappelletti94","description":"Project for the bioinformatics course of professor Valentini, Unimi.","archived":false,"fork":false,"pushed_at":"2019-02-01T06:40:22.000Z","size":91458,"stargazers_count":2,"open_issues_count":0,"forks_count":0,"subscribers_count":2,"default_branch":"master","last_synced_at":"2025-02-08T13:14:17.838Z","etag":null,"topics":[],"latest_commit_sha":null,"homepage":null,"language":"Jupyter Notebook","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/LucaCappelletti94.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2018-06-26T05:30:53.000Z","updated_at":"2022-06-07T08:57:33.000Z","dependencies_parsed_at":"2023-03-25T10:00:57.282Z","dependency_job_id":null,"html_url":"https://github.com/LucaCappelletti94/snv_classifier","commit_stats":null,"previous_names":[],"tags_count":0,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/LucaCappelletti94%2Fsnv_classifier","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/LucaCappelletti94%2Fsnv_classifier/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/LucaCappelletti94%2Fsnv_classifier/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/LucaCappelletti94%2Fsnv_classifier/manifests","owner_url":"htt
ps://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/LucaCappelletti94","download_url":"https://codeload.github.com/LucaCappelletti94/snv_classifier/tar.gz/refs/heads/master","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":246905830,"owners_count":20852818,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":[],"created_at":"2024-10-06T13:21:33.941Z","updated_at":"2025-04-02T22:43:45.596Z","avatar_url":"https://github.com/LucaCappelletti94.png","language":"Jupyter Notebook","readme":"\u003cp align=\"center\"\u003e\n  \u003cimg width=\"150\" src=\"https://github.com/LucaCappelletti94/snv_classifier/blob/master/documentation/Latex/Documentation/images/logo.png\"/\u003e\n\u003c/p\u003e\n\n# SNV Classifier\nProject for the bioinformatics course of professor Valentini, Unimi.\n\n\u003cp align=\"center\"\u003e\n  \u003cimg src=\"https://github.com/LucaCappelletti94/snv_classifier/blob/master/results.png\"/\u003e\n\u003c/p\u003e\n\n## Documentation\nThe documentation of the project is available [here](https://github.com/LucaCappelletti94/snv_classifier/blob/master/documentation/Latex/Documentation/main.pdf) and shows an analysis and visualizations of the datasets, modelling of the network and results.\n\n## Doubts over obtained results\nSome doubts have been raised by the extremely quick \"overfitting\" on the test set when using simple networks, such as a 2-layer network with 3 neurons per layer. 
This is probably explained (experimental evidence is in the PCA notebook) by the distribution of the test set, which does not reflect the distribution of the training set and is in fact far more easily separable.\n\n\u003cp align=\"center\"\u003e\n  \u003cimg src=\"https://github.com/LucaCappelletti94/snv_classifier/blob/master/documentation/Latex/Documentation/images/doubts.png\"/\u003e\n\u003c/p\u003e\n\n## Batch of neural networks\nTo verify whether 36/40 is the maximum precision that a common neural network can reach on the given dataset, I trained 136 networks with a gradient of architectures for 100 generations each.\n\nAll the trained models and weights are available [here](https://github.com/LucaCappelletti94/snv_classifier/tree/master/meta_networks).\n\n### Errors and issues with this approach\n- Deeper networks need more epochs to converge.\n- I **forgot** to reset the random seed for each network, so the networks start from different random weights. I will retrain the networks, resetting the seeds, as soon as I get the time.\n\n\u003cp align=\"center\"\u003e\n  \u003cimg alt=\"Meta Training Confusion\" width=\"120\" src=\"https://github.com/LucaCappelletti94/snv_classifier/blob/master/meta%20training/Meta%20Testing%20Confusion.png\"/\u003e\n  \u003cimg alt=\"Meta Testing Confusion\" width=\"120\" src=\"https://github.com/LucaCappelletti94/snv_classifier/blob/master/meta%20training/Meta%20Training%20Confusion.png\"/\u003e\n  \u003cimg alt=\"Meta Training Auroc\" width=\"120\" src=\"https://github.com/LucaCappelletti94/snv_classifier/blob/master/meta%20training/Meta%20Testing%20Auroc.png\"/\u003e\n  \u003cimg alt=\"Meta Testing Auroc\" width=\"120\" src=\"https://github.com/LucaCappelletti94/snv_classifier/blob/master/meta%20training/Meta%20Training%20Auroc.png\"/\u003e\n  \u003cimg alt=\"Meta Training AuPRC\" width=\"120\" src=\"https://github.com/LucaCappelletti94/snv_classifier/blob/master/meta%20training/Meta%20Testing%20Auprc.png\"/\u003e\n  \u003cimg 
alt=\"Meta Testing AuPRC\" width=\"120\" src=\"https://github.com/LucaCappelletti94/snv_classifier/blob/master/meta%20training/Meta%20Training%20Auprc.png\"/\u003e\n\u003c/p\u003e\n\n### Results\nThe approach suggests that 36/40 is the maximal precision.\n\n## Jupyter Notebooks\nVarious jupyter notebooks with explanations are available:\n\n### Keras neural network\nA jupyter notebook implementing the [project neural network](https://github.com/LucaCappelletti94/snv_classifier/blob/master/documentation/Latex/Documentation/images/network.png) in keras is available [here](https://github.com/LucaCappelletti94/snv_classifier/blob/master/Bioinformatica%20-%20Keras.ipynb).\n\n\u003cimg width=\"300\" src=\"https://github.com/LucaCappelletti94/snv_classifier/blob/master/documentation/Latex/Documentation/images/network.png?raw=true\"/\u003e\n\n#### Network trained model usage example\nA jupyter notebook implementing a usage example of the trained model is available [here](https://github.com/LucaCappelletti94/snv_classifier/blob/master/Bioinformatica%20-%20Loading%20saved%20model.ipynb), and the code is reproduced below:\n\n```python\n#!/usr/bin/python\n# -*- coding: utf-8 -*-\nimport numpy as np\nfrom keras.models import load_model\n\ndef number_to_class(value):\n    \"\"\"Map class identifier to class name.\"\"\"\n\n    if value:\n        return 'Positive'\n    return 'Negative'\n\nEXAMPLE_DATASET = 'Mendelian.normalized.example.test.tsv'\n\nmodel = load_model('model.h5')\nmodel.load_weights('weights.h5')\ndata_points = np.loadtxt(EXAMPLE_DATASET, delimiter='\\t')\n\nfor prediction in model.predict_classes(data_points):\n    print('I believe %s to be %s' % (number_to_class(1),\n            number_to_class(prediction)))\n\n\"\"\"\n  I believe Positive to be Positive\n  I believe Positive to be Positive\n  I believe Positive to be Positive\n  I believe Positive to be Positive\n  I believe Positive to be Positive\n  I believe Positive to be Positive\n  I believe Positive to be Negative\n  I 
believe Positive to be Positive\n  I believe Positive to be Positive\n  I believe Positive to be Negative\n\"\"\"\n```\n\n### Scatter plot\nA jupyter notebook generating a [scatter plot](https://github.com/LucaCappelletti94/snv_classifier/blob/master/scatter_plot.png?raw=true) from the dataset is available [here](https://github.com/LucaCappelletti94/snv_classifier/blob/master/Bioinformatica%20-%20Scatter%20plot.ipynb).\n\n\u003cimg width=\"300\" src=\"https://github.com/LucaCappelletti94/snv_classifier/blob/master/scatter_plot.png?raw=true\"/\u003e\n\n\n### Correlation matrices\nA jupyter notebook generating a [correlation matrix](https://github.com/LucaCappelletti94/snv_classifier/blob/master/correlation_matrix.png?raw=true) from the dataset is available [here](https://github.com/LucaCappelletti94/snv_classifier/blob/master/Bioinformatica%20-%20Correlation.ipynb).\n\n\u003cimg width=\"300\" src=\"https://github.com/LucaCappelletti94/snv_classifier/blob/master/correlation_matrix.png?raw=true\"/\u003e\n\n\n### PCA\nA jupyter notebook generating [PCA 2D visualization](https://github.com/LucaCappelletti94/snv_classifier/tree/master/documentation/Latex/Documentation/images/pca) of the dataset is available [here](https://github.com/LucaCappelletti94/snv_classifier/blob/master/Bioinformatica%20-%20PCA.ipynb).\n\n\u003cimg width=\"300\" src=\"https://github.com/LucaCappelletti94/snv_classifier/blob/master/documentation/Latex/Documentation/images/pca/training.png?raw=true\"/\u003e\n\n\n### TSNE\nA jupyter notebook generating [TSNE 2D visualization](https://github.com/LucaCappelletti94/snv_classifier/tree/master/documentation/Latex/Documentation/images/tsne) of the dataset is available [here](https://github.com/LucaCappelletti94/snv_classifier/blob/master/Bioinformatica%20-%20TSNE.ipynb).\n\n\u003cimg width=\"300\" src=\"https://github.com/LucaCappelletti94/snv_classifier/blob/master/documentation/Latex/Documentation/images/tsne/testing.png?raw=true\"/\u003e\n\n\n### 
Dataset plots\nA jupyter notebook generating [dataset plots](https://github.com/LucaCappelletti94/snv_classifier/tree/master/documentation/Latex/Documentation/images/plot) is available [here](https://github.com/LucaCappelletti94/snv_classifier/blob/master/Bioinformatica%20-%20Metrics%20plots.ipynb).\n\n\u003cimg width=\"300\" src=\"https://github.com/LucaCappelletti94/snv_classifier/blob/master/documentation/Latex/Documentation/images/plot/CpGobsExp.png?raw=true\"/\u003e\n\n\n### Dataset distributions\nA jupyter notebook generating [dataset distributions](https://github.com/LucaCappelletti94/snv_classifier/tree/master/documentation/Latex/Documentation/images/distributions) is available [here](https://github.com/LucaCappelletti94/snv_classifier/blob/master/Bioinformatica%20-%20Metric%20distributions.ipynb).\n\n\u003cimg width=\"300\" src=\"https://github.com/LucaCappelletti94/snv_classifier/blob/master/documentation/Latex/Documentation/images/distributions/CpGperCpG.png?raw=true\"/\u003e\n","funding_links":[],"categories":[],"sub_categories":[],"project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Flucacappelletti94%2Fsnv_classifier","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Flucacappelletti94%2Fsnv_classifier","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Flucacappelletti94%2Fsnv_classifier/lists"}