https://github.com/j-i-l/tfb-prediction

Transcription factor binding prediction

bioinformatics machine-learning pandas python scikit-learn

Host: GitHub
URL: https://github.com/j-i-l/tfb-prediction
Owner: j-i-l
Created: 2022-06-07T15:01:41.000Z (about 3 years ago)
Default Branch: main
Last Pushed: 2022-06-15T10:35:41.000Z (about 3 years ago)
Last Synced: 2025-02-13T02:45:03.724Z (5 months ago)
Topics: bioinformatics, machine-learning, pandas, python, scikit-learn
Language: Jupyter Notebook
Homepage:
Size: 145 KB
Stars: 2
Watchers: 2
Forks: 0
Open Issues: 0
Metadata Files:
- Readme: README.md

README

# Transcription factor binding prediction

## Preparation

### Data location

The notebooks assume the presence of two files:

- `./data/peak_data.txt`
- `./data/shuffled_data.txt`

Furthermore, the `./data` folder should contain the following subfolders:

- `./data/interim/`
- `./data/engineered/`
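
The layout above can be prepared and checked before running the notebooks. The following is a minimal sketch, not part of the project: the helper name `prepare_data_dir` is hypothetical, and only the file and folder names are taken from this README.

```python
from pathlib import Path

# File and folder names as listed in this README.
REQUIRED_FILES = ("peak_data.txt", "shuffled_data.txt")
REQUIRED_DIRS = ("interim", "engineered")


def prepare_data_dir(root="./data"):
    """Create the expected subfolders and return the names of missing input files.

    Hypothetical helper for illustration; the notebooks do not provide it.
    """
    root = Path(root)
    for name in REQUIRED_DIRS:
        # parents=True also creates `root` itself if it does not exist yet.
        (root / name).mkdir(parents=True, exist_ok=True)
    return [name for name in REQUIRED_FILES if not (root / name).is_file()]


if __name__ == "__main__":
    missing = prepare_data_dir()
    if missing:
        print("Missing input files in ./data:", ", ".join(missing))
```

The raw input files are only reported, never created, since they have to be supplied by the user.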

### Installation

The code is compatible with Python >=3.10. Additional dependencies
are listed in `requirements.txt`; to install them, run the following
in the root folder of this project:

    pip install -r requirements.txt

## Workflow

scikit-learn's [pipelines](https://scikit-learn.org/stable/modules/generated/sklearn.pipeline.Pipeline.html) are a great tool for condensing an analysis, and I would use them in most cases.
However, I find them less than ideal for demonstration purposes, so in this project they only come into play in the hyper-parameter tuning part, [Hyperparam_Tuning.ipynb](Hyperparam_Tuning.ipynb).

The notebooks should be run in the following order:

- [Intro.ipynb](./Intro.ipynb): This is **optional**, as it only provides some background on the problem at hand
- [Processing_and_Cleaning.ipynb](Processing_and_Cleaning.ipynb): Creates pandas DataFrames from the raw data (see [Data location](#data-location))
- [Feature_Engineering.ipynb](Feature_Engineering.ipynb): Performs some feature engineering steps to convert the DNA sequences into usable feature vectors
- [Model_Selection.ipynb](Model_Selection.ipynb): Performs a basic screening over some potential classifiers
- [Hyperparam_Tuning.ipynb](Hyperparam_Tuning.ipynb): Performs hyper-parameter tuning, including some feature-engineering parameters, by cross-validating ML pipelines
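
To give an idea of what converting DNA sequences into feature vectors can look like, here is a minimal sketch of k-mer counting. This README does not state which encoding [Feature_Engineering.ipynb](Feature_Engineering.ipynb) actually uses, so the technique, the function name `kmer_counts` and its parameters are assumptions made for illustration.

```python
from collections import Counter
from itertools import product


def kmer_counts(sequence, k=2, alphabet="ACGT"):
    """Count every length-k word over `alphabet` in `sequence`.

    Returns a list of len(alphabet) ** k counts, ordered like
    itertools.product(alphabet, repeat=k). Words containing characters
    outside the alphabet (e.g. 'N') are ignored.
    """
    sequence = sequence.upper()
    # Slide a window of width k over the sequence and tally each word.
    counts = Counter(sequence[i:i + k] for i in range(len(sequence) - k + 1))
    vocabulary = ("".join(letters) for letters in product(alphabet, repeat=k))
    # Counter returns 0 for words that never occurred.
    return [counts[word] for word in vocabulary]
```

For example, `kmer_counts("AAAC", k=2)` yields a 16-element vector whose first two entries (for `AA` and `AC`) are 2 and 1, with all others 0.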

**Note:** [Hyperparam_Tuning.ipynb](Hyperparam_Tuning.ipynb) incorporates the feature engineering into a pipeline, so it can work directly with the cleaned data generated in [Processing_and_Cleaning.ipynb](Processing_and_Cleaning.ipynb).
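
The idea of tuning a feature-engineering parameter together with a classifier parameter can be sketched as follows. This is not the notebook's code: the toy sequences, the planted motif, the choice of `CountVectorizer` character n-grams as a stand-in for k-mer counts, the `LogisticRegression` classifier and the parameter grid are all assumptions made for illustration.

```python
import random

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline

# Toy data, made up for illustration: positives carry a planted motif.
rng = random.Random(0)


def random_sequence(length=30):
    return "".join(rng.choice("ACGT") for _ in range(length))


def with_motif(seq, motif="TGACTCA"):
    # Overwrite a random window of the sequence with the motif.
    pos = rng.randrange(len(seq) - len(motif) + 1)
    return seq[:pos] + motif + seq[pos + len(motif):]


negatives = [random_sequence() for _ in range(40)]
positives = [with_motif(random_sequence()) for _ in range(40)]
X = positives + negatives
y = [1] * len(positives) + [0] * len(negatives)

pipeline = Pipeline([
    # Character n-grams of a fixed length act as k-mer counts here.
    ("kmers", CountVectorizer(analyzer="char", lowercase=False)),
    ("clf", LogisticRegression(max_iter=1000)),
])

param_grid = {
    "kmers__ngram_range": [(2, 2), (3, 3), (4, 4)],  # feature-engineering parameter
    "clf__C": [0.1, 1.0, 10.0],                      # classifier parameter
}

# Each fold refits the vectorizer on its training split only.
search = GridSearchCV(pipeline, param_grid, cv=5, scoring="accuracy")
search.fit(X, y)
print(search.best_params_, round(search.best_score_, 3))
```

Because the vectorizer is a pipeline step, its k-mer length is cross-validated in the same way as the classifier's regularisation strength, and the search operates directly on the raw sequences.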
