https://github.com/kinggerm/getorganelledb
GetOrganelle Default Databases
- Host: GitHub
- URL: https://github.com/kinggerm/getorganelledb
- Owner: Kinggerm
- License: gpl-3.0
- Created: 2020-04-09T22:11:17.000Z (over 4 years ago)
- Default Branch: master
- Last Pushed: 2022-07-13T20:06:26.000Z (over 2 years ago)
- Last Synced: 2024-08-12T00:37:55.903Z (3 months ago)
- Language: Python
- Homepage:
- Size: 19.1 MB
- Stars: 4
- Watchers: 4
- Forks: 4
- Open Issues: 0
Metadata Files:
- Readme: README.md
- License: LICENSE
# GetOrganelle Database
All versions of default databases of [GetOrganelle](https://github.com/Kinggerm/GetOrganelle)
### Default directory
By default, the initialized database is located at `~/.GetOrganelle`. This can be changed with the command line parameter `--config-dir` for a single run, or with the shell environment variable `GETORG_PATH` for the entire running environment.
For example, one may change the default database directory to `/home/shared/.GetOrganelle` by adding

> GETORG_PATH=/home/shared/.GetOrganelle
>
> export GETORG_PATH

to `/etc/profile` for **system-wide** usage,
or to `~/.bashrc` for **user-wide** usage in **Ubuntu Desktop**,
or to `~/.bash_profile` for **user-wide** usage in **bash**,
or to `~/.zshrc` for **user-wide** usage in **Zsh**,
and restarting the shell before initialization.
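
The lookup order described above can be sketched as follows. This is only an illustration, not GetOrganelle's own code: the function name `resolve_config_dir` is hypothetical, and the assumption that a per-run `--config-dir` value takes precedence over `GETORG_PATH` is inferred from the text rather than confirmed.

```python
import os
from pathlib import Path
from typing import Mapping, Optional

DEFAULT_DIR = "~/.GetOrganelle"  # default location stated in this README


def resolve_config_dir(cli_config_dir: Optional[str] = None,
                       env: Optional[Mapping[str, str]] = None) -> Path:
    """Pick the database directory (hypothetical helper, assumed precedence)."""
    env = os.environ if env is None else env
    if cli_config_dir:
        # Assumed: a value given via --config-dir wins for this single run.
        chosen = cli_config_dir
    elif env.get("GETORG_PATH"):
        # Otherwise fall back to the environment-wide setting.
        chosen = env["GETORG_PATH"]
    else:
        chosen = DEFAULT_DIR
    return Path(chosen).expanduser()
```

Calling `resolve_config_dir()` with no arguments in a shell where `GETORG_PATH` was exported as shown above would return `/home/shared/.GetOrganelle`.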
### Option 1 Initialization from GitHub
By default, `get_organelle_config.py` automatically accesses this repository to download and build the latest version of the SeedDatabase and LabelDatabase, e.g.

    get_organelle_config.py -a fungus_mt

Because GitHub can be hard to reach from some regions, `get_organelle_config.py` sometimes fails with a connection error (e.g., timeout, sha256_unmatch). Rerunning the above command a few times works in most cases.
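
Rerunning the command by hand can be replaced with a small retry wrapper. The sketch below is not part of GetOrganelle; `run_with_retries` is a hypothetical helper, and it assumes that `get_organelle_config.py` exits with a non-zero status when the download fails.

```python
import subprocess
import time
from typing import Sequence


def run_with_retries(cmd: Sequence[str], attempts: int = 5,
                     wait_seconds: float = 10.0) -> int:
    """Run cmd until it exits with status 0; return the attempt that succeeded."""
    for attempt in range(1, attempts + 1):
        result = subprocess.run(cmd)
        if result.returncode == 0:
            return attempt
        if attempt < attempts:
            # Pause before the next try in case the connection problem is transient.
            time.sleep(wait_seconds)
    raise RuntimeError(f"{cmd[0]} still failing after {attempts} attempts")


# Example call (requires GetOrganelle to be installed and on PATH):
# run_with_retries(["get_organelle_config.py", "-a", "fungus_mt"])
```

If the command keeps failing after all attempts, Option 2 below is the fallback.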
### Option 2 Initialization from local files
If initialization from GitHub still fails after many attempts, download this repository and run `get_organelle_config.py` with the flag `--use-local`. Making your own database is feasible if you keep the same directory structure, but it is not recommended.
Suppose you want to install version `0.0.1` of `embplant_pt` and `embplant_mt`. Any one of the following options will do:
1. Use `curl` to download the released compressed file (ca. 20 MB -> 80 MB):

       curl -L https://github.com/Kinggerm/GetOrganelleDB/releases/download/0.0.1/v0.0.1.tar.gz | tar zx
       get_organelle_config.py -a embplant_pt,embplant_mt --use-local ./0.0.1

2. Use `svn` to download part of the repository (ca. 80 MB). Note that GitHub ended Subversion support in January 2024, so this option may no longer work:

       svn co https://github.com/Kinggerm/GetOrganelleDB/trunk/0.0.1
       get_organelle_config.py -a embplant_pt,embplant_mt --use-local ./0.0.1

3. Use `git clone` to clone the entire repository (ca. 200 MB):

       git clone https://github.com/Kinggerm/GetOrganelleDB
       get_organelle_config.py -a embplant_pt,embplant_mt --use-local ./GetOrganelleDB/0.0.1

## Updates
* **0.0.1.minima** A minimal subset of 0.0.1 for test only (GetOrganelle 1.7.6+ required).
* **0.0.1** fungus_nr added
* **0.0.0** Initial version
## How to contribute
### Welcome to the community!
To generate the previous databases, I downloaded all available **complete** plastomes/mitogenomes or **complete** nr regions from GenBank, semi-manually cleaned them, and masked the simple repeats. However, using all sequences as the default database takes too much space for most users, so I wrote a custom script, `scripts/generate_bowtie2_seed.py`, to balance taxon coverage against size. The basic idea is to randomly pick a set of sequences that can represent/recruit all the candidate sequences given a certain gapping threshold (default: 2000 bp). I then used these sequences as the seed database and extracted the non-tRNA genes as the label database. Of course, each step should be accompanied by careful error checking.
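
The random-picking idea can be illustrated with a toy sketch. This is not the code in `scripts/generate_bowtie2_seed.py`: the function `pick_representatives` and the `recruits` predicate are hypothetical, and how the real script decides that one sequence recruits another under the gapping threshold is not described here, so that test is left to the caller.

```python
import random
from typing import Callable, Iterable, List, Optional, TypeVar

T = TypeVar("T")


def pick_representatives(candidates: Iterable[T],
                         recruits: Callable[[T, T], bool],
                         rng: Optional[random.Random] = None) -> List[T]:
    """Randomly pick items until every candidate is chosen or recruited.

    recruits(rep, other) is a caller-supplied stand-in for "rep can
    represent/recruit other under the chosen threshold".
    """
    rng = rng or random.Random()
    remaining = list(candidates)
    rng.shuffle(remaining)  # random order gives a random representative set
    chosen: List[T] = []
    while remaining:
        rep = remaining.pop()
        chosen.append(rep)
        # Drop every candidate that the new representative already covers.
        remaining = [c for c in remaining if not recruits(rep, c)]
    return chosen
```

Because the order is random, different runs can return different sets of different sizes, which is one reason the guidelines below ask for a manual check of the representatives.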
### Guidelines
1. The sequence header of the seed database should include the GenBank accession numbers.
2. The sequence header of the label database should be in the form of `>name type - sequence_info_without_space`, e.g. `>rbcL gene - Amborella_AJ506156_2`. The sequence info should at least include the GenBank accession number.
3. For the seed database, single- and oligo-nucleotide repeats should be masked using `N`s to alleviate the computational burden from calling irrelevant reads.
4. After using `scripts/generate_bowtie2_seed.py`, manually check and replace the representative sequences.
5. For the label database, remember to unify the gene names (e.g., upper/lower cases, abbreviations) and exclude the short genes such as tRNAs.
6. Test the database with a range of WGS from different taxa.
### Uploading
Follow these steps to add the compiled database to the community:
1. fork the [GetOrganelleDB repo](https://github.com/Kinggerm/GetOrganelleDB).
2. duplicate the subdirectory of the latest version and rename it to a newer one.
> e.g. if the latest is 0.0.1, rename the duplicate as 0.0.2
3. add the compiled seed/label databases inside the new subdirectory.
4. send a pull request.
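
Before sending the pull request, the label-database header convention from the Guidelines can be checked with a short script. This is a sketch under assumptions: `check_label_headers` is a hypothetical helper, and the accession pattern is only a rough heuristic (one or two capital letters, an optional underscore, then 5 to 8 digits), not a full GenBank accession validator.

```python
import re
from typing import Iterable, List

# ">name type - sequence_info_without_space", e.g. ">rbcL gene - Amborella_AJ506156_2"
LABEL_HEADER = re.compile(r"^>(\S+) (\S+) - (\S+)$")
# Rough heuristic for an accession-like token (assumption, not an official rule).
ACCESSION = re.compile(r"[A-Z]{1,2}_?\d{5,8}")


def check_label_headers(fasta_lines: Iterable[str]) -> List[str]:
    """Return a list of human-readable problems found in FASTA header lines."""
    problems = []
    for number, line in enumerate(fasta_lines, start=1):
        line = line.rstrip("\r\n")
        if not line.startswith(">"):
            continue  # sequence lines are not checked here
        match = LABEL_HEADER.match(line)
        if match is None:
            problems.append(f"line {number}: not in 'name type - info' form: {line}")
        elif ACCESSION.search(match.group(3)) is None:
            problems.append(f"line {number}: no accession-like token: {line}")
    return problems
```

An empty list means every header matched the expected form; passing an open file object works because the function only iterates over lines.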