https://github.com/cmdoret/strangerseq
Generate DNA sequences minimizing microhomology
https://github.com/cmdoret/strangerseq
dna-sequences genomics k-mer
Last synced: 6 days ago
JSON representation
Generate DNA sequences minimizing microhomology
- Host: GitHub
- URL: https://github.com/cmdoret/strangerseq
- Owner: cmdoret
- License: gpl-3.0
- Created: 2019-07-11T12:28:25.000Z (almost 6 years ago)
- Default Branch: master
- Last Pushed: 2021-03-05T13:46:09.000Z (about 4 years ago)
- Last Synced: 2025-02-17T23:47:36.022Z (3 months ago)
- Topics: dna-sequences, genomics, k-mer
- Language: Go
- Size: 48.8 KB
- Stars: 0
- Watchers: 1
- Forks: 0
- Open Issues: 0
-
Metadata Files:
- Readme: README.md
- License: LICENSE
Awesome Lists containing this project
README
### strangerseq
[](https://circleci.com/gh/cmdoret/strangerseq/tree/master)
**cmdoret**strangerseq is a Go program that generates DNA sequences minimizing or maximizing microhomology to a reference genome.
This is achieved using an l-order Markov chain from the K-mer profile of the genome (l = k-1). Sequences are initiated
by picking a random k-mer, with their frequency used as probability weight. Extension is then performed iteratively using
the markov chain. When minimizing microhomology, inverse probabilities from the chain are used to pick rare k-mers instead
of frequent ones.It is possible to add a GC deviation constraint to the sequences to force GC content to be similar to the genome.
#### Usage
```
strangerseq -help
Program to generate sequences with minimal microhomology and optionally, similar GC content to the input genome.
Multiple sequences are generated and sorted by score.
-comp.seq
Enable to return scores in addition to sequences and include randomly generated GC-weighted sequences for comparison. Columns of the output are: 1. sequence type (generated through markov model or randomly picked with GC weight), 2. Score without accounting for GC divergence, 3. Score corrected for GC divergence, 4. Sequence.
-fasta string
Path to genome file in FASTA format. (required)
-fixed.gc float
Fixed target GC content to use as target. The default is to use the input genome's GC content.
-gc.weight float
Weight given to the GC content when scoring sequences. (default 1)
-kmer.size int
Length of K-mers on which to optimize sequences. (default 8)
-n.seq int
Number of sequences to generate. (default 100)
-seq.len int
Length of the sequences to generate. (default 1000)
-similar
Generate similar sequences (frequent k-mers) instead of different ones (rare k-mers).
-version
Shows version number of the binary.
```#### Example
Generate 50 GC-constrained sequences of 1000 base pairs minimizing microhomology:```bash
./strangerseq -fasta genome.fa -gc.weight 1 -n.seq 50
```