https://github.com/tracywong117/ncbi-get-all-children-organism-under-ancestor
This Python script retrieves all child organisms under an ancestor in the NCBI taxonomy.
bio-data bio-dataset dataset ncbi ncbi-database ncbi-sra ncbi-taxonomy taxonomy-database
- Host: GitHub
- URL: https://github.com/tracywong117/ncbi-get-all-children-organism-under-ancestor
- Owner: tracywong117
- License: mit
- Created: 2024-02-01T09:19:53.000Z (over 1 year ago)
- Default Branch: main
- Last Pushed: 2025-02-19T09:08:44.000Z (3 months ago)
- Last Synced: 2025-02-19T10:23:48.667Z (3 months ago)
- Topics: bio-data, bio-dataset, dataset, ncbi, ncbi-database, ncbi-sra, ncbi-taxonomy, taxonomy-database
- Language: Python
- Homepage: https://huggingface.co/datasets/tracywong117/NCBI-Taxonomy
- Size: 10.7 KB
- Stars: 1
- Watchers: 1
- Forks: 0
- Open Issues: 0
Metadata Files:
- Readme: README.md
- License: LICENSE
# NCBI-get-all-children-organism-under-ancestor
Goal: retrieve all child organisms under an ancestor in the NCBI taxonomy.
### 1a. Download preprocessed data (last update: 1 Feb 2024) [here](https://huggingface.co/datasets/tracywong117/NCBI-Taxonomy)
Download `taxonomy_with_all_children.csv`, which is the CSV you may need to analyze the NCBI taxonomy tree.
### 1b. Or download the latest NCBI taxonomy and preprocess the data yourself
You can also use the Python scripts as follows to download the latest taxonomy from the NCBI FTP site and preprocess the data.
1. Download taxdmp.zip from https://ftp.ncbi.nlm.nih.gov/pub/taxonomy/.
2. Unzip taxdmp.zip and place `nodes.dmp` and `names.dmp` in this folder.
3. Run `nodes_to_csv.py` and `names_to_csv.py` to get `nodes.csv` and `names.csv` respectively.
4. Run `concat_names_to_nodes.py` to get `taxonomy.csv`.
5. Compute the direct children of each organism (node) using `get_direct_children_from_tax.py` to get `taxonomy_with_direct_children.csv`.
6. Compute all children (may take several hours) using `get_all_children_from_tax.py` to get `taxonomy_with_all_children.csv`.
7. Run `query.py --ancestor 8782` to retrieve all child organisms under the ancestor Aves. Replace 8782 with the tax_id of the ancestor you choose.

`taxonomy_with_all_children.csv` is the final CSV you may need to analyze the NCBI taxonomy tree.
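
For readers who want to see what steps 3 to 6 amount to, here is a minimal sketch in Python. It is not the code from the scripts in this repository. The function names (`parse_dmp_line`, `build_children`, `all_descendants`) and the toy rows are assumptions made for illustration. The only facts it relies on are that taxdump rows separate fields with a tab, a pipe, and a tab, and that the first two fields of `nodes.dmp` are the tax_id and the parent tax_id.

```python
from collections import defaultdict, deque


def parse_dmp_line(line):
    """Split one taxdump row into its fields.

    Every field, including the last one, is followed by a tab and a pipe,
    so the final piece after splitting is just the line ending.
    """
    parts = line.split("\t|")
    return [part.strip() for part in parts[:-1]]


def build_children(rows):
    """Map each parent tax_id to the list of its direct children."""
    children = defaultdict(list)
    for fields in rows:
        tax_id, parent_id = fields[0], fields[1]
        if tax_id != parent_id:  # the root node lists itself as its parent
            children[parent_id].append(tax_id)
    return children


def all_descendants(children, ancestor):
    """Collect every node below `ancestor` with a breadth-first walk."""
    found = []
    queue = deque([ancestor])
    while queue:
        node = queue.popleft()
        for child in children.get(node, []):
            found.append(child)
            queue.append(child)
    return found


# Toy rows in nodes.dmp layout (hypothetical ids, not real taxonomy data).
toy_lines = [
    "1\t|\t1\t|\tno rank\t|\n",
    "2\t|\t1\t|\tclass\t|\n",
    "3\t|\t1\t|\tclass\t|\n",
    "4\t|\t2\t|\tspecies\t|\n",
]
toy_children = build_children(parse_dmp_line(line) for line in toy_lines)
print(all_descendants(toy_children, "1"))
```

Walking the tree once per query from a parent-to-children map is cheap. Precomputing the full descendant list for every node, as step 6 does, trades a long one-off run for instant lookups later.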
#### Alternative to Step 6: Using create_library_index.py
Instead of `get_all_children_from_tax.py`, you can use `create_library_index.py` to generate hierarchical library indices for each node in the taxonomy.
A library index is a hierarchical numbering system that encodes the parent-child relationships in the taxonomy tree. It assigns each node an index that reflects its position in the hierarchy.
Example:
```
2 is child of 1: 1.2
3 is child of 1: 1.3
4 is child of 2: 1.2.4
```
Benefits: all descendants of an ancestor can be retrieved by filtering for library indices that start with the ancestor's library index.
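
The idea can be sketched in a few lines of Python. This is an illustration under assumptions, not the contents of `create_library_index.py`: the names `build_library_index` and `descendants_by_prefix` and the toy `parent_of` mapping are hypothetical. One detail worth noting is that the prefix should include a trailing dot, otherwise the index `1.2` would also match an unrelated node such as `1.23`.

```python
def build_library_index(parent_of, root):
    """Give each node a dotted path of ids from the root, e.g. '1.2.4'."""
    index = {root: str(root)}
    for start in parent_of:
        # Walk upwards until we reach a node whose index is already known.
        chain = []
        node = start
        while node not in index:
            chain.append(node)
            node = parent_of[node]
        # Then fill in the indices on the way back down.
        path = index[node]
        for member in reversed(chain):
            path = f"{path}.{member}"
            index[member] = path
    return index


def descendants_by_prefix(index, ancestor):
    """Return nodes whose library index starts with the ancestor's index."""
    prefix = index[ancestor] + "."  # trailing dot: '1.2.' must not match '1.23'
    return [node for node, path in index.items() if path.startswith(prefix)]


# Toy tree (hypothetical ids): 2, 3 and 23 are children of 1; 4 is a child of 2.
toy_parent_of = {2: 1, 3: 1, 4: 2, 23: 1}
toy_index = build_library_index(toy_parent_of, root=1)
print(toy_index[4])
print(descendants_by_prefix(toy_index, 2))
```

Because every index is built in a single pass over the nodes, this avoids materialising a full descendant list for every node.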
Run `python create_library_index.py` to get `taxonomy_with_library_index.csv`.
Run `query.py --ancestor 8782 --method library` to use `taxonomy_with_library_index.csv`.
## 2. query.py:
- Get all children of any organism.
- After getting the scientific_names of all children of an organism (ancestor), you can retrieve all SRA data for organisms sharing that ancestor from [BigQuery](https://cloud.google.com/bigquery) by running the generated SQL in BigQuery.

Note: NCBI hosts SRA data in BigQuery, which is convenient for retrieving large amounts of data.
## Remark: Example of retrieval of SRA data from BigQuery
```SQL
SELECT *
FROM `nih-sra-datastore.sra.metadata`
WHERE organism = "Homo sapiens";
```
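
To query many organisms at once, the single equality test can be widened to an `IN` list built from the scientific names. The sketch below shows one way to assemble such a statement in Python. The exact SQL that `query.py` generates is not shown in this README, so the function name `build_sra_query` and the output layout are assumptions, and the two species names are only sample input.

```python
def build_sra_query(scientific_names):
    """Build a BigQuery statement selecting SRA metadata for the given names."""
    quoted = []
    for name in scientific_names:
        # Escape backslashes and double quotes so each name stays one literal.
        escaped = name.replace("\\", "\\\\").replace('"', '\\"')
        quoted.append(f'"{escaped}"')
    name_list = ", ".join(quoted)
    return (
        "SELECT *\n"
        "FROM `nih-sra-datastore.sra.metadata`\n"
        f"WHERE organism IN ({name_list});"
    )


# Sample input only; in practice the names would come from the taxonomy CSV.
print(build_sra_query(["Gallus gallus", "Anas platyrhynchos"]))
```

For names that come from an untrusted source, BigQuery query parameters would be a safer choice than string concatenation.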