{"id":13710193,"url":"https://github.com/bioinfomaticsCSU/deepsignal","last_synced_at":"2025-05-06T18:34:29.622Z","repository":{"id":49522726,"uuid":"161388796","full_name":"bioinfomaticsCSU/deepsignal","owner":"bioinfomaticsCSU","description":"Detecting methylation using signal-level features from Nanopore sequencing reads","archived":false,"fork":false,"pushed_at":"2023-06-04T08:27:40.000Z","size":311,"stargazers_count":111,"open_issues_count":13,"forks_count":21,"subscribers_count":3,"default_branch":"master","last_synced_at":"2024-10-15T11:52:33.789Z","etag":null,"topics":["bioinformatics","epigenetics","methylation","nanopore-sequencing","tensorflow"],"latest_commit_sha":null,"homepage":"","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"gpl-3.0","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/bioinfomaticsCSU.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null}},"created_at":"2018-12-11T20:19:10.000Z","updated_at":"2024-10-14T09:34:48.000Z","dependencies_parsed_at":"2023-10-20T18:26:20.903Z","dependency_job_id":null,"html_url":"https://github.com/bioinfomaticsCSU/deepsignal","commit_stats":{"total_commits":187,"total_committers":5,"mean_commits":37.4,"dds":"0.44385026737967914","last_synced_commit":"3f138fac0d151cbd3727e5ffa82510f0bbd2e234"},"previous_names":[],"tags_count":11,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/bioinfomaticsCSU%2Fdeepsignal","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/bioinfomaticsCSU%2Fdeepsignal/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/bioinfomaticsCSU%2Fdeepsignal/releases","manifest
s_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/bioinfomaticsCSU%2Fdeepsignal/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/bioinfomaticsCSU","download_url":"https://codeload.github.com/bioinfomaticsCSU/deepsignal/tar.gz/refs/heads/master","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":224521603,"owners_count":17325264,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["bioinformatics","epigenetics","methylation","nanopore-sequencing","tensorflow"],"created_at":"2024-08-02T23:00:52.919Z","updated_at":"2024-11-13T20:31:24.636Z","avatar_url":"https://github.com/bioinfomaticsCSU.png","language":"Python","funding_links":[],"categories":["Software packages"],"sub_categories":["DNA modification analysis"],"readme":"# News\r\n- 2021.03.15: We developed [deepsignal2](https://github.com/PengNi/deepsignal2). 
Compared to deepsignal, deepsignal2 has a much smaller DNN model and slightly better performance in human 5mCpG detection.\r\n\r\n\r\n# DeepSignal\r\n\r\n[![Python](https://img.shields.io/pypi/pyversions/deepsignal)](https://www.python.org/)\r\n[![PyPI version](https://img.shields.io/pypi/v/deepsignal)](https://pypi.org/project/deepsignal/)\r\n[![GitHub License](https://img.shields.io/github/license/bioinfomaticsCSU/deepsignal)](https://github.com/bioinfomaticsCSU/deepsignal/blob/master/LICENSE)\r\n[![PyPI-Downloads](https://pepy.tech/badge/deepsignal)](https://pepy.tech/project/deepsignal)\r\n[![PyPI-Downloads/m](https://pepy.tech/badge/deepsignal/month)](https://pepy.tech/project/deepsignal/month)\r\n\r\n## A deep-learning method for detecting DNA methylation state from Oxford Nanopore sequencing reads.\r\nDeepSignal constructs a BiLSTM+Inception structure to detect DNA methylation state from Nanopore reads. It is\r\nbuilt with **Tensorflow** and **Python 3**.\r\n\r\n## Contents\r\n- [Installation](#Installation)\r\n- [Trained models](#Trained-models)\r\n- [Example data](#Example-data)\r\n- [Quick start](#Quick-start)\r\n- [Usage](#Usage)\r\n\r\n## Installation\r\ndeepsignal is built on Python3. [tombo](https://github.com/nanoporetech/tombo) is required to re-squiggle the raw signals from nanopore reads before running deepsignal.\r\n   - Prerequisites:\\\r\n       [Python 3.*](https://www.python.org/)\\\r\n       [tensorflow](https://www.tensorflow.org/) (1.8.0\u003c=tensorflow\u003c=1.13.1)\\\r\n       [tombo](https://github.com/nanoporetech/tombo)\r\n   - Dependencies:\\\r\n       [numpy](http://www.numpy.org/)\\\r\n       [h5py](https://github.com/h5py/h5py)\\\r\n       [statsmodels](https://github.com/statsmodels/statsmodels/)\\\r\n       [scikit-learn](https://scikit-learn.org/stable/)\r\n\r\n#### 1. Create an environment\r\nWe highly recommend using a virtual environment for the installation of deepsignal and its dependencies. 
A virtual environment can be created and (de)activated as follows by using [conda](https://conda.io/docs/):\r\n```bash\r\n# create\r\nconda create -n deepsignalenv python=3.7\r\n# activate\r\nconda activate deepsignalenv\r\n# deactivate\r\nconda deactivate\r\n```\r\nThe virtual environment can also be created by using [virtualenv](https://github.com/pypa/virtualenv/).\r\n\r\n#### 2. Install deepsignal\r\n- After creating and activating the environment, download and install deepsignal (**latest version**) from github:\r\n```bash\r\ngit clone https://github.com/bioinfomaticsCSU/deepsignal.git\r\ncd deepsignal\r\npython setup.py install\r\n```\r\n**or** install deepsignal using *pip*:\r\n```bash\r\npip install deepsignal\r\n```\r\n\r\n- [tombo](https://github.com/nanoporetech/tombo) is required to be installed in the same environment:\r\n```bash\r\n# install using conda\r\nconda install -c bioconda ont-tombo\r\n# or install using pip\r\npip install ont-tombo\r\n``` \r\n\r\n- install [tensorflow](https://www.tensorflow.org/)  (version: 1.8.0\u003c=tensorflow\u003c=1.13.1) in the same environment:\r\n\r\n```bash\r\n# install using conda\r\nconda install -c anaconda tensorflow==1.13.1\r\n# or install using pip\r\npip install 'tensorflow==1.13.1'\r\n```\r\nIf a GPU-machine is used, install the gpu version of tensorflow. 
The cpu version is not required:\r\n```bash\r\n# install using conda\r\nconda install -c anaconda tensorflow-gpu==1.13.1\r\n# or install using pip\r\npip install 'tensorflow-gpu==1.13.1'\r\n```\r\n\r\n## Trained models\r\nThe models we trained can be downloaded from [google drive](https://drive.google.com/open?id=1zkK8Q1gyfviWWnXUBMcIwEDw3SocJg7P).\r\n\r\nCurrently we have trained the following models:\r\n   * _model.CpG.R9.4_1D.human_hx1.bn17.sn360.v0.1.7+.tar.gz_: A CpG model trained using HX1 R9.4 1D reads (for **deepsignal\u003e=0.1.7**).\r\n   * ~~_model.CpG.R9.4_1D.human_hx1.bn17.sn360.tar.gz_: A CpG model trained using HX1 R9.4 1D reads (for **deepsignal\u003c=0.1.6**).~~\r\n   * ~~_model.GATC.R9_2D.tem.puc19.bn17.sn360.tar.gz_: A G*A*TC model trained using pUC19 R9 2D template reads (for **deepsignal\u003c=0.1.6**).~~\r\n\r\n## Example data\r\nThe example data can be downloaded from [google drive](https://drive.google.com/open?id=1zkK8Q1gyfviWWnXUBMcIwEDw3SocJg7P).\r\n\r\n   * ~~_fast5s.sample.tar.gz_: The data contain ~4000 yeast R9.4 1D reads each with called events (basecalled by Albacore), along with a genome reference.~~\r\n\r\n## Quick start\r\nTo call modifications, the raw fast5 files should be basecalled ([Guppy or Albacore](https://nanoporetech.com/community)) and then re-squiggled by [tombo](https://github.com/nanoporetech/tombo). Finally, modifications of specified motifs can be called by deepsignal. The following commands call 5mC in CG contexts from the example data:\r\n```bash\r\n# 1. guppy basecall\r\nguppy_basecaller -i fast5s.al -r -s fast5s.al.guppy --config dna_r9.4.1_450bps_hac_prom.cfg\r\ncat fast5s.al.guppy/*.fastq \u003e fast5s.al.guppy.fastq\r\n# 2. 
tombo resquiggle\r\ntombo preprocess annotate_raw_with_fastqs --fast5-basedir fast5s.al --fastq-filenames fast5s.al.guppy.fastq --sequencing-summary-filenames fast5s.al.guppy/sequencing_summary.txt --basecall-group Basecall_1D_000 --basecall-subgroup BaseCalled_template --overwrite --processes 10\r\ntombo resquiggle fast5s.al GCF_000146045.2_R64_genomic.fna --processes 10 --corrected-group RawGenomeCorrected_001 --basecall-group Basecall_1D_000 --overwrite\r\n# 3. deepsignal call_mods\r\ndeepsignal call_mods --input_path fast5s.al/ --model_path model.CpG.R9.4_1D.human_hx1.bn17.sn360.v0.1.7+/bn_17.sn_360.epoch_9.ckpt --result_file fast5s.al.CpG.call_mods.tsv --corrected_group RawGenomeCorrected_001 --nproc 10 --is_gpu no\r\npython /path/to/deepsignal/scripts/call_modification_frequency.py --input_path fast5s.al.CpG.call_mods.tsv --result_file fast5s.al.CpG.call_mods.frequency.tsv\r\n```\r\n\r\n## Usage\r\n#### 1. Basecall and re-squiggle\r\nBefore running deepsignal, the raw reads should be basecalled ([Guppy or Albacore](https://nanoporetech.com/community)) and then processed by the *re-squiggle* module of [tombo](https://github.com/nanoporetech/tombo).\r\n\r\nNote:\r\n- If the fast5 files are in multi-read FAST5 format, please use the _multi_to_single_fast5_ command from the [ont_fast5_api package](https://github.com/nanoporetech/ont_fast5_api) to convert the fast5 files first (see [issue #173](https://github.com/nanoporetech/tombo/issues/173) in [tombo](https://github.com/nanoporetech/tombo)).\r\n```bash\r\nmulti_to_single_fast5 -i $multi_read_fast5_dir -s $single_read_fast5_dir -t 30 --recursive\r\n```\r\n- If the basecall results are saved as fastq, run the [*tombo preprocess annotate_raw_with_fastqs*](https://nanoporetech.github.io/tombo/resquiggle.html) command before *re-squiggle*.\r\n\r\nFor the example data:\r\n```bash\r\n# 1. basecall\r\nguppy_basecaller -i fast5s.al -r -s fast5s.al.guppy --config dna_r9.4.1_450bps_hac_prom.cfg\r\n# 2. 
preprocess fast5 if basecall results are saved in fastq format\r\ncat fast5s.al.guppy/*.fastq \u003e fast5s.al.guppy.fastq\r\ntombo preprocess annotate_raw_with_fastqs --fast5-basedir fast5s.al --fastq-filenames fast5s.al.guppy.fastq --sequencing-summary-filenames fast5s.al.guppy/sequencing_summary.txt --basecall-group Basecall_1D_000 --basecall-subgroup BaseCalled_template --overwrite --processes 10\r\n# 3. resquiggle, cmd: tombo resquiggle $fast5_dir $reference_fa\r\ntombo resquiggle fast5s.al GCF_000146045.2_R64_genomic.fna --processes 10 --corrected-group RawGenomeCorrected_001 --basecall-group Basecall_1D_000 --overwrite\r\n```\r\n\r\n#### 2. extract features\r\nFeatures of targeted sites can be extracted for training or testing.\r\n\r\nFor the example data (deepsignal extracts 17-mer-seq and 360-signal features of each CpG motif in reads by default. Note that the value of *--corrected_group* must be the same as that of *--corrected-group* in tombo.):\r\n```bash\r\ndeepsignal extract --fast5_dir fast5s.al/ --write_path fast5s.al.CpG.signal_features.17bases.rawsignals_360.tsv --corrected_group RawGenomeCorrected_001 --nproc 10\r\n```\r\n\r\nThe extracted_features file is a tab-delimited text file in the following format:\r\n   - **chrom**: the chromosome name\r\n   - **pos**:   0-based position of the targeted base in the chromosome\r\n   - **strand**:    +/-, the aligned strand of the read to the reference\r\n   - **pos_in_strand**: 0-based position of the targeted base in the aligned strand of the chromosome (_legacy column, not necessary for downstream analysis_)\r\n   - **readname**:  the read name\r\n   - **read_strand**:   t/c, template or complement\r\n   - **k_mer**: the sequence around the targeted base\r\n   - **signal_means**:  signal means of each base in the kmer\r\n   - **signal_stds**:   signal stds of each base in the kmer\r\n   - **signal_lens**:   lens of each base in the kmer\r\n   - **cent_signals**:  the central signals of the kmer\r\n   - 
**methy_label**:   0/1, the label of the targeted base, for training\r\n\r\n#### 3. call modifications\r\nThe extracted features can be used to call modifications as follows (If a GPU-machine is used, set *--is_gpu* to \"yes\".):\r\n```bash\r\n# the CpGs are called by using the CpG model of HX1 R9.4 1D\r\ndeepsignal call_mods --input_path fast5s.al.CpG.signal_features.17bases.rawsignals_360.tsv --model_path model.CpG.R9.4_1D.human_hx1.bn17.sn360.v0.1.7+/bn_17.sn_360.epoch_9.ckpt --result_file fast5s.al.CpG.call_mods.tsv --nproc 10 --is_gpu no\r\n```\r\n\r\n**The modifications can also be called from the fast5 files directly**:\r\n```bash\r\n# use CPU\r\ndeepsignal call_mods --input_path fast5s.al/ --model_path model.CpG.R9.4_1D.human_hx1.bn17.sn360.v0.1.7+/bn_17.sn_360.epoch_9.ckpt --result_file fast5s.al.CpG.call_mods.tsv --corrected_group RawGenomeCorrected_001 --nproc 10 --is_gpu no\r\n# or use GPU\r\nCUDA_VISIBLE_DEVICES=0 deepsignal call_mods --input_path fast5s.al/ --model_path model.CpG.R9.4_1D.human_hx1.bn17.sn360.v0.1.7+/bn_17.sn_360.epoch_9.ckpt --result_file fast5s.al.CpG.call_mods.tsv --corrected_group RawGenomeCorrected_001 --nproc 10 --is_gpu yes\r\n```\r\n\r\nThe modification_call file is a tab-delimited text file in the following format:\r\n   - **chrom**: the chromosome name\r\n   - **pos**:   0-based position of the targeted base in the chromosome\r\n   - **strand**:    +/-, the aligned strand of the read to the reference\r\n   - **pos_in_strand**: 0-based position of the targeted base in the aligned strand of the chromosome (_legacy column, not necessary for downstream analysis_)\r\n   - **readname**:  the read name\r\n   - **read_strand**:   t/c, template or complement\r\n   - **prob_0**:    [0, 1], the probability of the targeted base predicted as 0 (unmethylated)\r\n   - **prob_1**:    [0, 1], the probability of the targeted base predicted as 1 (methylated)\r\n   - **called_label**:  0/1, unmethylated/methylated\r\n   - **k_mer**:   the kmer 
around the targeted base\r\n\r\nA modification-frequency file can be generated by the script [scripts/call_modification_frequency.py](https://github.com/bioinfomaticsCSU/deepsignal/blob/master/scripts/call_modification_frequency.py) with the modification_call file:\r\n```bash\r\npython /path/to/deepsignal/scripts/call_modification_frequency.py --input_path fast5s.al.CpG.call_mods.tsv --result_file fast5s.al.CpG.call_mods.frequency.tsv --prob_cf 0\r\n```\r\n\r\nThe modification_frequency file is a tab-delimited text file in the following format:\r\n   - **chrom**: the chromosome name\r\n   - **pos**:   0-based position of the targeted base in the chromosome\r\n   - **strand**:    +/-, the aligned strand of the read to the reference\r\n   - **pos_in_strand**: 0-based position of the targeted base in the aligned strand of the chromosome (_legacy column, not necessary for downstream analysis_)\r\n   - **prob_0_sum**:    sum of the probabilities of the targeted base predicted as 0 (unmethylated)\r\n   - **prob_1_sum**:    sum of the probabilities of the targeted base predicted as 1 (methylated)\r\n   - **count_modified**:    number of reads in which the targeted base counted as modified\r\n   - **count_unmodified**:  number of reads in which the targeted base counted as unmodified\r\n   - **coverage**:  number of reads aligned to the targeted base\r\n   - **modification_frequency**:    modification frequency\r\n   - **k_mer**:   the kmer around the targeted base\r\n\r\n#### 4. 
train new models\r\nA new model can be trained as follows:\r\n```bash\r\n# need two independent datasets for training and validating\r\n# use deepsignal train -h/--help for more details\r\ndeepsignal train --train_file /path/to/train_data/file --valid_file /path/to/valid_data/file --model_dir /dir/to/save/the/new/model\r\n```\r\n\r\nPublication\r\n===========\r\nPeng Ni, Neng Huang, Zhi Zhang, De-Peng Wang, Fan Liang, Yu Miao, Chuan-Le Xiao, Feng Luo, and Jianxin Wang, \"DeepSignal: detecting DNA methylation state from Nanopore sequencing reads using deep-learning.\", Bioinformatics 35, no. 22 (2019): 4586-4595. [doi:10.1093/bioinformatics/btz276](https://doi.org/10.1093/bioinformatics/btz276)\r\n\r\nLicense\r\n=========\r\nCopyright (C) 2018 [Jianxin Wang](mailto:jxwang@mail.csu.edu.cn), [Feng Luo](mailto:luofeng@clemson.edu), [Peng Ni](mailto:nipeng@csu.edu.cn), [Neng Huang](mailto:huangneng@csu.edu.cn)\r\n\r\nThis program is free software: you can redistribute it and/or modify it under the terms of the GNU General Public License as published by the Free Software Foundation, either version 3 of the License, or (at your option) any later version.\r\n\r\nThis program is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License for more details.\r\n\r\nYou should have received a copy of the GNU General Public License along with this program. 
If not, see \u003chttps://www.gnu.org/licenses/\u003e.\r\n\r\n[Jianxin Wang](mailto:jxwang@mail.csu.edu.cn), [Peng Ni](mailto:nipeng@csu.edu.cn), [Neng Huang](mailto:huangneng@csu.edu.cn),\r\nSchool of Information Science and Engineering, Central South University, Changsha 410083, China\r\n\r\n[Feng Luo](mailto:luofeng@clemson.edu), School of Computing, Clemson University, Clemson, SC 29634, USA\r\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2FbioinfomaticsCSU%2Fdeepsignal","html_url":"https://awesome.ecosyste.ms/projects/github.com%2FbioinfomaticsCSU%2Fdeepsignal","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2FbioinfomaticsCSU%2Fdeepsignal/lists"}