https://github.com/mbelmadani/motifgp
Motif discovery for DNA sequences using multiobjective optimization and genetic programming.
- Host: GitHub
- URL: https://github.com/mbelmadani/motifgp
- Owner: mbelmadani
- License: lgpl-3.0
- Created: 2016-08-09T07:10:20.000Z (almost 9 years ago)
- Default Branch: master
- Last Pushed: 2018-07-24T21:20:28.000Z (almost 7 years ago)
- Last Synced: 2025-05-07T09:14:57.392Z (11 days ago)
- Topics: bioinformatics, chip-seq, deap, dna, dna-sequences, genetic-programming, jaspar, motif, motif-discovery, multiobjective-optimization, network-expressions, nsga-ii, pareto-front, python, regular-expressions, sequences, strongly-typed, transcription-factor-binding, transcription-factors
- Language: Python
- Homepage: https://mbelmadani.github.io/motifgp/
- Size: 204 KB
- Stars: 6
- Watchers: 2
- Forks: 0
- Open Issues: 2
Metadata Files:
- Readme: README.txt
- Changelog: CHANGELOG.txt
- License: LICENSE.txt
README
===============
= MotifGP 0.2 =
===============
MotifGP is a de novo motif discovery tool for discriminatory network expression identification in ChIP-seq datasets.

Original author: Manuel Belmadani
[email protected]

The project is documented by the following publications.
Manuel Belmadani and Marcel Turcotte. MotifGP: Using multi-objective evolutionary computing for mining network expressions
in DNA sequences. In IEEE International Conference on Computational Intelligence in Bioinformatics and Computational Biology
(CIBCB 2016), Chiang Mai, Thailand, October 5-7, 2016.
https://doi.org/10.1109/CIBCB.2016.7758133

Manuel Belmadani. MotifGP: DNA motif discovery using multiobjective evolution. Master of computer science,
University of Ottawa, School of Electrical Engineering and Computer Science, 2016.
Available from University of Ottawa Research under: http://www.ruor.uottawa.ca/handle/10393/34213

Acknowledgements:
MotifGP uses source code from these tools:
- hypergeometric.py from the MEME Suite (license and copyright in source file).
- altschulEriksonDinuclShuffle.py from Peter Clote - CLOTE Computational Biology LAB, http://clavius.bc.edu/~clotelab/RNAdinucleotideShuffle/
This software was also built with DEAP: Fortin, F.-A., De Rainville, F.-M., Gardner, M.-A. G.,
Parizeau, M. & Gagné, C. DEAP: Evolutionary Algorithms Made Easy. J. Mach. Learn. Res. 13,
2171–2175 (2012).

=======================================================================================
License: (see LICENSE.txt)
=======================================================================================
Installation: (see INSTALL.txt)
=======================================================================================
Examples: (see EXAMPLES.txt)
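
As a quick illustration, the sketch below assembles one motifgp.py command line from options listed under Usage. It is not taken from EXAMPLES.txt: the helper name build_motifgp_command, the file name peaks.fasta and the interpreter name "python" are assumptions made for this example. The helper derives --mutpb from --cxpb because the help text requires the two to sum to 1.0.

```python
import subprocess


def build_motifgp_command(training, background=None, moo=None, fitness="DF",
                          cxpb=0.7, ngen=None, tag="default",
                          python="python"):
    """Assemble an argv list for motifgp.py (hypothetical helper).

    Only options documented in the usage text are emitted. The interpreter
    name and all file paths are assumptions of this sketch.
    """
    if not 0.0 <= cxpb <= 1.0:
        raise ValueError("cxpb must be within [0.0, 1.0]")
    # The help text requires mutpb == 1.0 - cxpb; round away float noise.
    mutpb = round(1.0 - cxpb, 10)
    cmd = [
        python, "motifgp.py",
        f"--training={training}",
        f"--fitness={fitness}",
        f"--cxpb={cxpb}",
        f"--mutpb={mutpb}",
        f"--tag={tag}",
    ]
    if background is not None:
        cmd.append(f"--background={background}")
    if moo is not None:
        cmd.append(f"--moo={moo}")
    if ngen is not None:
        cmd.append(f"--num-gen={ngen}")
    return cmd


if __name__ == "__main__":
    command = build_motifgp_command("peaks.fasta", moo="NSGAR", ngen=100)
    print(" ".join(command))
    # To actually launch the run from the MotifGP checkout, uncomment:
    # subprocess.run(command, check=True)
```

Leaving background unset relies on the documented behaviour that control sequences are then generated and written to runtime_tmp/.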
=======================================================================================
Usage: motifgp.py [options]

Options:
  -h, --help            show this help message and exit
  -p TRAINING_PATH, --training=TRAINING_PATH
                        Fasta file to use for training (input) sequence data
  -b BACKGROUND_PATH, --background=BACKGROUND_PATH
                        [Optional] Fasta file to use for background (control)
                        sequence data. If not provided, the generated
                        control sequences will be written to runtime_tmp/
  -m MOO, --moo=MOO     Multi-objective optimization [SPEA2, NSGA2, NSGAR,
                        MOEAD]. NSGAR is the NSGA-II_R (NSGA-II Revised)
                        algorithm, an improvement of NSGA2.
  -f FITNESS, --fitness=FITNESS
                        Objective fitness function. Available objectives:
                        D=Discrimination, F=Fisher, I=ScipyFisher,
                        O=OddsRatio, Q=FalseDiscoveryRate, S=Support,
                        R=ScipyOddsRatio. Each single character in the
                        string represents an objective. Objectives are
                        mapped by the configuration file at
                        config/objectives. Default is 'DF' for
                        [Discrimination,Fisher] (2-objectives).
  --cxpb=CXPB           Probability [0.0 to 1.0] for a crossover during
                        variation. Requires --mutpb to be set to (1.0-cxpb).
                        Default is 0.7.
  --mutpb=MUTPB         Probability [0.0 to 1.0] for a mutation during
                        variation. Requires --cxpb to be set to (1.0-mutpb).
                        Default is 0.3.
  --short=SHORT         Stops reading in after SHORT input sequences.
  --popsize=POPSIZE     Size of the population.
  --revcomp             Compile regex with reverse complement
  --random-seed=RANDOM_SEED
                        Random seed value to set for execution
  -n NGEN, --num-gen=NGEN
                        Generation where runtime stops (even in the case of
                        resumed checkpoints)
  --timelimit=TIMELIMIT
                        Time limit on the GP loop execution.
  --matcher=MATCHER     Use a different matcher. Options: 'grep', 'python'.
                        'grep' is faster on large datasets, while 'python' is
                        a pure Python version in case the system doesn't
                        support grep.
  -o OUTPUT_PATH, --output=OUTPUT_PATH
                        Output directory. Default is ./OUT/
  -t TAG, --tag=TAG     A tag for the output subdirectory. Use it to describe
                        the run; results are saved in the tag's subdirectory
                        of the output directory. Default is 'default'.
  -i, --inspector       Don't print any files. Can be useful with python -i
                        (interactive mode).
  --hardmask            Replace tandem repeats (lower-case typed nucleotides)
                        with N
  -g GRAMMAR, --grammar=GRAMMAR
                        Grammar for the STGP [min, iupac, full, ne]. Default
                        is iupac. 'min' only uses nucleotides. 'iupac' is a
                        network expression grammar. 'full' is a network
                        expression grammar with additional regular expression
                        tokens. 'ne' is like iupac, but built with string
                        primitives instead of booleans.
  -e ERASE, --erase=ERASE
                        Input .nef(t) file to delete from the dataset prior to
                        execution. Used for sequential coverage.
  --backpad             Pads background sequences with consecutive nucleotides
                        (i.e. AAAAAAAA, CCCCCCCC, GGGGGGGG, TTTTTTTT) of
                        length 8 every set of 4 sequences.
  --bg-algo=BG_ALGO     Shuffling algorithm for background. Default is
                        'dinuclShuffle', if no background dataset is provided.
                        Currently, dinuclShuffle is the only implemented
                        method.
  --ncpu=NCPU           Number of CPUs to use when mapping evaluation of
                        solutions. Use an integer, or "auto" to automatically
                        determine the maximum number. Default is no
                        parallelism.
  --termination=TERMINATION
                        Use automatic termination algorithm. Use 'auto' to
                        enable the automatic termination algorithm for MOEAs.
  --hamming             [Experimental] Generates statistics on the hamming
                        distance from a template regex and hof candidates.
  --seeded-population   [Experimental] Use population seeds
  -c CHECKPOINT_PATH, --checkpoint=CHECKPOINT_PATH
                        [Temporarily disabled] Load a checkpoint at path.
  -q, --quiet           [Unimplemented] don't print status messages to stdout

See also EXAMPLES.txt for basic examples of MotifGP usage.
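
To make the --hardmask and --revcomp descriptions more concrete, here is a minimal sketch of what such transformations could look like. It is not MotifGP's implementation. It assumes a network expression is written as a sequence of upper-case nucleotides and bracketed nucleotide classes (for example A[CG]TT), and the function names are hypothetical.

```python
import re

# Watson-Crick complement table for upper-case nucleotides.
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def hardmask(sequence):
    """Replace lower-case (soft-masked) letters with 'N'."""
    return re.sub(r"[a-z]", "N", sequence)


def reverse_complement_expression(expression):
    """Reverse-complement an expression made of nucleotides and [classes]."""
    tokens = re.findall(r"\[[ACGT]+\]|[ACGT]", expression)
    if "".join(tokens) != expression:
        raise ValueError("unsupported token in expression: %r" % expression)
    result = []
    for token in reversed(tokens):
        if token.startswith("["):
            # Complement every member of the class; order inside is irrelevant.
            members = sorted(token[1:-1].translate(COMPLEMENT))
            result.append("[" + "".join(members) + "]")
        else:
            result.append(token.translate(COMPLEMENT))
    return "".join(result)


def compile_both_strands(expression):
    """Compile a regex matching the expression on either strand."""
    reverse = reverse_complement_expression(expression)
    return re.compile("(?:%s)|(?:%s)" % (expression, reverse))


if __name__ == "__main__":
    print(hardmask("ACgtAC"))
    print(reverse_complement_expression("A[CG]TT"))
    print(bool(compile_both_strands("A[CG]TT").search("GGAACTGG")))
```

Matching both strands with one alternation keeps a single pass over each sequence; whether MotifGP does the same internally is not stated in this README.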