https://github.com/datngu/deepfarm
DeepFARM
https://github.com/datngu/deepfarm
Last synced: about 2 months ago
JSON representation
DeepFARM
- Host: GitHub
- URL: https://github.com/datngu/deepfarm
- Owner: datngu
- Created: 2024-12-26T21:36:54.000Z (5 months ago)
- Default Branch: main
- Last Pushed: 2025-02-15T23:10:21.000Z (3 months ago)
- Last Synced: 2025-02-16T00:18:20.027Z (3 months ago)
- Language: Python
- Size: 23.4 KB
- Stars: 0
- Watchers: 1
- Forks: 0
- Open Issues: 0
-
Metadata Files:
- Readme: README.md
Awesome Lists containing this project
README
# DeepFARM Pipeline
This pipeline is part of the manuscript: "Sequence-based chromatin activity modeling and regulatory impact prediction of genetic variants in farmed animals using deep learning". The computational pipeline leverages the power of [Nextflow](https://www.nextflow.io/) for scalable and reproducible workflows.
## Repository
GitHub: [DeepFARM](https://github.com/datngu/DeepFARM)## Author
- **Dat T Nguyen**
- Contact: ndat<@>utexas.edu---
## Features
- Supports multiple learning rates and hyperparameter tuning.
- Built-in reproducibility and scalability via Nextflow and Singularity.## Quick Start
### Requirements
- **Modules**:
- `Nextflow/21.03`
- `singularity/rpm`
- Environment variables:
- Set `NXF_SINGULARITY_CACHEDIR` for Singularity.
- Provide a valid `TOWER_ACCESS_TOKEN` for Nextflow Tower integration.### Running the Pipeline
Please customize it based on your HPC system and don't forget to look at the file nextflow.config.
```bash
## load modules
module load Nextflow/21.03
module load singularity/rpm## export env variables
export NXF_SINGULARITY_CACHEDIR=/mnt/users/ngda/software/singularity
export TOWER_ACCESS_TOKEN=## run the pipeline
nextflow run main.nf -resume -w work_dir \
--genome /mnt/users/ngda/genomes/cattle/Bos_taurus.ARS-UCD1.2.dna_sm.toplevel.fa \
--chrom 29 \
--val_chrom 21 \
--test_chrom 25 \
--window 200 \
--peaks '/mnt/SCRATCH/ngda/data/Cattle/*.bed' \
-with-tower```
## Parameters and defaut setting
- `params.genome`: Path to the genome file (default: `$baseDir/data/ref/genome.fa`).
- `params.peaks`: Path to peak files (default: `$baseDir/data/peak/*`).
- `params.outdir`: Output directory (default: `results`).
- `params.trace_dir`: Directory for trace files (default: `trace_dir`).
- `params.chrom`: The largest chromosome index used to build dataset (default: `29`, this mean chromosome 1,2,3,...,28,29 are used)
- `params.val_chrom`: Chromosome for validation (default: `21`).
- `params.test_chrom`: Chromosome for testing (default: `25`).
- `params.window`: Window size for genomic analysis (default: `200`).
- `params.learning_rates`: Learning rates for model tuning (default: `[1e-3, 1e-4, 5e-4, 5e-5]`).## Output
- Results will be saved in the specified `--outdir` (default: `results`).
- Trace files for debugging and performance monitoring will be stored in `params.trace_dir` (default: `trace_dir`).## Pre-train models
Model weights of DanQ for cattle, pig, chicken, and salmon are available at:
https://github.com/datngu/data/releases/download/v.0.0.4/cattle_DanQ.h5
https://github.com/datngu/data/releases/download/v.0.0.4/chicken_DanQ.h5
https://github.com/datngu/data/releases/download/v.0.0.4/pig_DanQ.h5
https://github.com/datngu/data/releases/download/v.0.0.4/salmon_DanQ.h5
## Citation
TBA