https://github.com/j-i-l/tfb-prediction
Transcription factor binding prediction
bioinformatics machine-learning pandas python scikit-learn
- Host: GitHub
- URL: https://github.com/j-i-l/tfb-prediction
- Owner: j-i-l
- Created: 2022-06-07T15:01:41.000Z (over 2 years ago)
- Default Branch: main
- Last Pushed: 2022-06-15T10:35:41.000Z (over 2 years ago)
- Last Synced: 2024-05-02T02:13:31.241Z (8 months ago)
- Topics: bioinformatics, machine-learning, pandas, python, scikit-learn
- Language: Jupyter Notebook
- Homepage:
- Size: 145 KB
- Stars: 1
- Watchers: 2
- Forks: 0
- Open Issues: 0
Metadata Files:
- Readme: README.md
# Transcription factor binding prediction
## Preparation
### Data location
The notebooks assume the presence of two files:
- `./data/peak_data.txt`
- `./data/shuffled_data.txt`

Further, the folder `./data` should contain the subfolders:
- `./data/interim/`
- `./data/engineered/`

### Installation
The code is compatible with Python >=3.10. Additional dependencies are listed in `requirements.txt`; to install them, run `pip install -r requirements.txt` in the root folder of this project.

## Workflow
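Before working through the notebooks, the layout described in [Data location](#data-location) can be checked with a few lines of Python. This is only a minimal sketch: the helper name `prepare_data_dir` is hypothetical and not part of this repository. It creates the two subfolders if they are missing and reports which of the two input files are absent.

```python
from pathlib import Path

# File and folder names taken from the "Data location" section.
REQUIRED_FILES = ("peak_data.txt", "shuffled_data.txt")
REQUIRED_DIRS = ("interim", "engineered")


def prepare_data_dir(data_dir="./data"):
    """Create missing subfolders and return the names of missing input files.

    Hypothetical helper for illustration; the notebooks do not define it.
    """
    data_dir = Path(data_dir)
    for name in REQUIRED_DIRS:
        # parents=True also creates `data_dir` itself if it does not exist yet.
        (data_dir / name).mkdir(parents=True, exist_ok=True)
    return [name for name in REQUIRED_FILES if not (data_dir / name).is_file()]
```

Calling `prepare_data_dir()` from the project root returns an empty list once both input files are in place.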
sklearn's [pipelines](https://scikit-learn.org/stable/modules/generated/sklearn.pipeline.Pipeline.html) are a great tool to condense an analysis and I would use them in most cases.
However, I find them less than ideal for demonstration purposes, so in this project they only come into play in the hyper-parameter tuning part in [Hyperparam_Tuning.ipynb](Hyperparam_Tuning.ipynb).

The notebooks should be run in the following order:
- [Intro.ipynb](./Intro.ipynb): This is **optional** as it only provides some info about the problem at hand
- [Processing_and_Cleaning.ipynb](Processing_and_Cleaning.ipynb): Creates pandas DataFrames from the raw data (see [Data location](#data-location))
- [Feature_Engineering.ipynb](Feature_Engineering.ipynb): Performs some feature engineering steps to convert the DNA sequences into usable feature vectors
- [Model_Selection.ipynb](Model_Selection.ipynb): Performs a basic screening over some potential classifiers
- [Hyperparam_Tuning.ipynb](Hyperparam_Tuning.ipynb): Performs hyper-parameter tuning, including some feature-engineering parameters, by cross-validating ML pipelines

**Note:** The [Hyperparam_Tuning.ipynb](Hyperparam_Tuning.ipynb) notebook builds the feature engineering into a pipeline, so it can work directly with the cleaned data generated in [Processing_and_Cleaning.ipynb](Processing_and_Cleaning.ipynb).
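To illustrate what tuning a feature-engineering parameter together with a classifier parameter can look like, here is a rough, self-contained sketch. It is not the code from the notebooks. The k-mer counting via `CountVectorizer`, the logistic regression classifier, the planted motif and the randomly generated sequences are all assumptions chosen for illustration; the real data and features may differ.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline

rng = np.random.default_rng(0)


def random_sequence(length, motif=None):
    """Return a random DNA string, optionally with a motif planted in it."""
    seq = "".join(rng.choice(list("ACGT"), size=length))
    if motif is not None:
        start = int(rng.integers(0, length - len(motif) + 1))
        seq = seq[:start] + motif + seq[start + len(motif):]
    return seq


# Synthetic stand-in data (assumption): "bound" sequences carry a planted
# motif, "unbound" sequences are purely random.
sequences = [random_sequence(50, motif="TGACTCA") for _ in range(100)]
sequences += [random_sequence(50) for _ in range(100)]
labels = np.array([1] * 100 + [0] * 100)

# Feature engineering and classifier combined in a single pipeline, so both
# are refit inside every cross-validation fold.
pipeline = Pipeline([
    # Character n-grams of a DNA string are its k-mers.
    ("kmers", CountVectorizer(analyzer="char", lowercase=False)),
    ("clf", LogisticRegression(max_iter=1000)),
])

# The k-mer length (a feature-engineering parameter) is tuned together with
# the regularisation strength of the classifier.
param_grid = {
    "kmers__ngram_range": [(2, 2), (3, 3), (4, 4)],
    "clf__C": [0.1, 1.0, 10.0],
}

search = GridSearchCV(pipeline, param_grid, cv=5, scoring="roc_auc")
search.fit(sequences, labels)
print(search.best_params_)
print(round(search.best_score_, 3))
```

Because the vectorizer is a pipeline step, its parameters are addressed with the `step__parameter` naming scheme and the search can start from raw sequence strings, which mirrors the note above about working directly with the cleaned data.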