# SMURF-seq scripts

SMURF-seq is a protocol for sequencing short reads on a long-read
sequencer by randomly concatenating short fragments. This repo
contains the scripts we used to conduct initial analysis of SMURF-seq
data in the context of copy-number profiling (with sequencing done on
the Oxford MinION instrument), and to benchmark mapping methods for
performance in mapping SMURF-seq reads.

## Copy-number analysis

The copy-number analysis we performed using SMURF-seq reads closely
follows procedures already used in other publications.
We first map the SMURF-seq reads using BWA:
```
bwa mem -x ont2d -k 12 -W 12 \
    -A 4 -B 10 -O 6 -E 3 -T 120 bwa-mem/index/hg19.fa \
    smurf_reads.fa > mapped_smurf_reads.sam
```
The parameters for the Smith-Waterman scoring ('A', 'B', 'O' and 'E')
were determined using the simulation approach outlined below (see also
the manuscript and supplementary information). The 'T' flag gives the
minimum alignment score to output. The 'k' gives the size of k-mers to
use for seeds. The 'W' flag discards a chain if its seed bases are
shorter than this value. The 'k' and 'W' are set liberally to catch
and evaluate as many candidate mappings as possible.

The mapped fragments are given to a script that filters ambiguously
mapped fragments:
```
./filterAlnScoreAndQual.py -i mapped_smurf_reads.sam \
    -o unambig_smurf_frags.sam -s 120 -q 1
```

The input file `mapped_smurf_reads.sam` is just the mapped reads
(e.g. with BWA). The output file `unambig_smurf_frags.sam` contains
mapped fragments with alignment score at least 120 (the `-s` flag) and
mapping quality greater than or equal to 1 (the `-q` flag).

Then the remaining fragments are given to a script that obtains
the counts of reads in bins:
```
./getBinCounts.py -i unambig_smurf_frags.sam -c hg19.chrom.sizes \
    -b bins_5k_hg19.bed -o bin_counts.bed -s bin_stats.txt
```
The input file `unambig_smurf_frags.sam` is the same as described above.
The file `hg19.chrom.sizes` gives the size of each chromosome
in the reference genome. This file for the hg19 reference is supplied
in the `data` directory of this repo, and was obtained from the UCSC
Genome Browser's database. The pre-defined bins file `bins_5k_hg19.bed`
is also in the `data` directory of this repo, and defines the 5000
bins in the genome used for the CNV analysis.
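As a rough illustration of what the bin-counting step computes (a count per bin, plus that count divided by the average reads per bin), here is a minimal pure-Python sketch. The function name, the single-chromosome toy layout, and the bisect-based bin lookup are our own assumptions for illustration, not the actual logic of `getBinCounts.py`:

```python
import bisect

def bin_counts(frag_positions, bin_starts, chrom_end):
    """Count fragment starts per bin, then normalize by the mean count.

    frag_positions: fragment start coordinates on one chromosome (toy case).
    bin_starts: sorted bin start coordinates; each bin runs to the next start
    (the last bin runs to chrom_end).
    """
    counts = [0] * len(bin_starts)
    for pos in frag_positions:
        if 0 <= pos < chrom_end:
            # Index of the rightmost bin whose start is <= pos.
            i = bisect.bisect_right(bin_starts, pos) - 1
            counts[i] += 1
    mean = sum(counts) / len(counts)
    norm = [c / mean if mean > 0 else 0.0 for c in counts]
    return counts, norm

# Toy example: 4 bins of width 100 covering [0, 400), 8 fragments.
counts, norm = bin_counts([5, 10, 150, 151, 152, 250, 300, 399],
                          [0, 100, 200, 300], 400)
# counts == [2, 3, 1, 2]; norm == [1.0, 1.5, 0.5, 1.0]
```

In the real pipeline the bins are the 5000 pre-defined genome-wide bins and the input is a SAM file, but the normalization idea is the same.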
The first output file
`bin_counts.bed` is like bedGraph format: it has the chrom, start and
end given in `bins_5k_hg19.bed`, but it also has two extra columns:
one is the count of reads in that bin, and the final column is those
counts divided by the average reads per bin. This format was chosen
based on what is required by the next script.

In the next step we use an adaptation of a script originally due to
Baslan et al. (Nat. Protocols, 2014). The script is run
as follows:
```
./cnvAnalysis.R bin_counts.bed SampleName bins_5k_hg19_gc.txt bins_5k_hg19_exclude.txt
```
The input file `bin_counts.bed` is the same as described above. The
input file `bins_5k_hg19_gc.txt` is the GC content of each bin. The
input `bins_5k_hg19_exclude.txt` is used to exclude certain parts of
the genome that attract an unusual amount of reads. The format is
simply the line numbers, in the corresponding bed file, of the bins to
exclude from the CNV analysis. The first output is a PDF file
`SampleName.pdf` with the CNV profile. In addition, two tables are
saved: one table `SampleName.data.txt` with the information
(chromosome, genome position, GC content, bin count, segmented value)
for each bin, and the other table `SampleName.short.txt` summarizing
the breakpoints in the CNV profile.

## Simulating SMURF-seq reads for evaluating mappers

The simulation procedure takes mapped long reads and uses them to
generate short fragments with a known mapping location. Since the
mapping location is known, we can assess a mapping strategy by how
well it recovers the known mapping locations. The steps are as follows.

1. Select and map long reads: The data set should be from WGS using
   the sequencing technology of interest (in our case, the Oxford
   MinION). Use a standard long-read mapper, and make sure the output
   is in SAM format.

2. 
Generate candidate short fragments: From the SAM format of the
   long-read mapping locations, generate candidate short fragments
   with known mapping locations. This is done with a script in the
   `scripts` directory as follows:
   ```
   $ ./getFragsFromLongReads.py -s long_reads_mapped.sam -l 100 > candidates.bed
   ```
   The candidates are encoded in 6-col BED format with the name column
   (the 4th) encoding the DNA sequence of the fragment. The parameter
   `-l` indicates the length of the candidate fragments to generate.
   Only uniquely mapping long reads are used for generating candidate
   fragments. The coordinates for each line in the `candidates.bed` file
   are the reference genome mapping coordinates for the short fragment,
   obtained by arithmetic on the long-read reference location accounting
   for indels in the mapping.

3. Filter the candidates to exclude deadzones: Any short fragments that
   would not map uniquely without the rest of the long read are excluded
   using bedtools and a file of deadzones. We obtain the deadzones with
   the program `deadzones`, available from `http://github.com/smithlabcode/utils`,
   which generates deadzones for a given read length. We used 40 bp, which
   is conservative if the candidate fragments are longer than 40 bp. The
   deadzone k-mer should not be larger than the candidate fragment size. The
   program is run as follows:
   ```
   $ ./deadzones -k 40 -o dz_hg19_40.bed hg19.fa
   ```
   This will use lots of memory and might be slow, but it has parameters
   to reduce memory use, and it only needs to be done once.
   To filter the candidate fragments, use `bedtools` as follows:
   ```
   bedtools intersect -v -a candidates.bed -b dz_hg19_40.bed > good_frags.bed
   ```
   The resulting `good_frags.bed` will be used to generate the simulated
   SMURF-seq reads.

4. 
Randomly combine fragments into long reads: This step uses another script
   from the `scripts` directory:
   ```
   ./readsFromFrags.py -n FAB42704 -f 10 -l 100 -r 10000 \
       -b good_frags.bed -o simulated.fa
   ```
   Above, the `-n` parameter gives an identifier for the original
   data set, which goes into the read name so we can map multiple
   simulations together but later take them apart and analyze the
   results separately. The `-f`, `-l` and `-r` give the number of
   fragments, their length and the number of reads to simulate (the
   fragment length could be inferred from the input). In the output,
   the name of each simulated SMURF-seq read in `simulated.fa`
   contains a list of the original mapping locations of each fragment
   in the reference genome, so later we can use these to evaluate mapping
   performance.

5. Map the simulated reads: Select a mapping tool, set the parameters, and
   map the simulated reads in `simulated.fa` from the above step to the
   same reference genome used already. The output should be in SAM format,
   and we will assume it is named `simulated_out.sam`.

6. Evaluate the mapping performance: here we use a script from the `scripts`
   directory:
   ```
   ./evalSmurfSim.py -f simulated.fa -s simulated_out.sam -l 100
   ```
   This script computes precision and recall at the level of recovering
   fragments and at the level of correctly determining the mapping location
   of each nucleotide in the simulated reads. In addition, the portion of
   each simulated read that is mapped in at least one identified fragment
   is reported, along with summary stats on the sizes of fragments identified
   by the mapper.

## Dependencies

* [Python:] All Python scripts here are in Python3 (we used 3.6.8).
The
    following non-standard libraries are used: pysam (we used 0.15.0) and
    numpy (we used 1.15.0).

* [R:] The R script `cnvAnalysis.R` uses the [DNAcopy](https://bioconductor.org/packages/release/bioc/html/DNAcopy.html) library.
    > Seshan VE, Olshen A (2018). DNAcopy: DNA copy number data analysis. R package version 1.56.0.

* [Software tools:] For the simulations/evaluations we require
    `bedtools` (we used v2.26.0). We also require the `deadzones`
    program from `http://github.com/smithlabcode/utils`, but this could
    be substituted with any means of excluding unmappable regions.
    In our CNV analysis, we used `bwa` (0.7.17).

## Contacts and Bug Reports

- Andrew D. Smith andrewds@usc.edu
- Rishvanth K. Prabakar kaliappa@usc.edu

## Copyright and License Information
Copyright (C) 2019 The Authors

Authors: Rishvanth K. Prabakar and Andrew D. Smith

This program is free software: you can redistribute it and/or modify
it under the terms of the GNU General Public License as published by
the Free Software Foundation, either version 3 of the License, or
(at your option) any later version.

This program is distributed in the hope that it will be useful,
but WITHOUT ANY WARRANTY; without even the implied warranty of
MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
GNU General Public License for more details.

You should have received a copy of the GNU General Public License
along with this program. If not, see <http://www.gnu.org/licenses/>.