https://github.com/lucacappelletti94/mesh
Python package helping to work with the MESH dataset.
- Host: GitHub
- URL: https://github.com/lucacappelletti94/mesh
- Owner: LucaCappelletti94
- License: mit
- Created: 2024-10-10T09:50:42.000Z (8 months ago)
- Default Branch: main
- Last Pushed: 2024-10-11T09:26:51.000Z (8 months ago)
- Last Synced: 2024-11-27T02:12:03.897Z (6 months ago)
- Language: Python
- Size: 38.1 KB
- Stars: 2
- Watchers: 2
- Forks: 0
- Open Issues: 0
Metadata Files:
- Readme: README.md
- License: LICENSE
# MESH
Python package helping to work with the [MESH dataset](https://www.ncbi.nlm.nih.gov/mesh/). This package is currently primarily focused on [the chemicals and drugs category of the MESH dataset](https://www.ncbi.nlm.nih.gov/mesh/1000068) and integrates the associated [PubChem database](https://pubchem.ncbi.nlm.nih.gov/) SMILES and InChI keys.
## Installation
At this moment, the package is not available on PyPI. To install it, you can clone the repository and install it using `pip`:
```bash
git clone https://github.com/lucacappelletti94/mesh
cd mesh
pip install .
```

## Usage
The package provides two main functionalities: downloading a pre-built MESH dataset and generating a custom MESH dataset. Once you have the dataset, you can use the `Dataset` class to work with it.
### Downloading a pre-built MESH dataset
Although this package lets you build a custom MESH dataset, doing so requires resources, so we also provide pre-built datasets, which [we host on Zenodo](). The structure of each hosted tarball is as follows:
```
mesh_chemistry_2024.tar.gz
├── chemicals.csv
├── descriptors.csv
├── chemicals_to_descriptors.csv
├── mesh_dag.csv
├── metadata.json
```

Where (you can see examples of these files just below):
- `chemicals.csv` contains information about chemicals and drugs.
- `descriptors.csv` contains information about descriptors.
- `chemicals_to_descriptors.csv` contains the relationships between chemicals and descriptors.
- `mesh_dag.csv` contains the Directed Acyclic Graph (DAG) of the MESH dataset.
- `metadata.json` contains metadata about the dataset.

To download a pre-built dataset, you can use the following code:
```python
from mesh import Dataset

dataset = Dataset.load("mesh_chemistry_2024")
```

Find the available rasterized datasets [on Zenodo]().
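
If you prefer to inspect a downloaded tarball without going through the `Dataset` class, the files listed above can be read with the Python standard library alone. The following is only a minimal sketch: `read_csv_member` is a hypothetical helper that is not part of this package, and it assumes that the CSV files are comma-separated, UTF-8 encoded, and start with a header row.

```python
import csv
import io
import os
import tarfile
from typing import Dict, List


def read_csv_member(tarball_path: str, member_name: str) -> List[Dict[str, str]]:
    """Return the rows of one CSV file stored in a .tar.gz archive.

    Hypothetical helper, not part of the `mesh` package. It assumes a
    comma-separated, UTF-8 encoded file with a header row.
    """
    with tarfile.open(tarball_path, mode="r:gz") as archive:
        for member in archive.getmembers():
            # Match on the base name, so archives with or without a
            # top-level directory are both handled.
            if member.isfile() and os.path.basename(member.name) == member_name:
                raw = archive.extractfile(member)
                text = io.TextIOWrapper(raw, encoding="utf-8", newline="")
                # Read everything while the archive is still open.
                return list(csv.DictReader(text))
    raise FileNotFoundError(f"{member_name} not found in {tarball_path}")
```

For example, `read_csv_member("mesh_chemistry_2024.tar.gz", "mesh_dag.csv")` would return one dictionary per row, keyed by the column names of that file, provided the tarball has been downloaded to the working directory.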
Here are some statistics for the rasterized MESH datasets, all created with the same settings described in the next section:
| Version name | Number of nodes | Number of edges | Number of chemicals | Number of descriptors |
|--------------|-----------------|-----------------|----------------------|-----------------------|
| MESH 2024 | 334220 | 367694 | 323679 | 10542 |
| MESH 2023 | 332999 | 365801 | 322591 | 10409 |
| MESH 2022 | 330106 | 364653 | 319739 | 10367 |
| MESH 2021 | 328884 | 363505 | 318391 | 10325 |

### Generating a custom MESH dataset
The package provides a `Dataset` class that allows you to work with the MESH dataset. The dataset is built using the `DatasetSettings` class, which allows you to specify which parts of the dataset you want to include. The `ChemicalsAndDrugsSettings` class allows you to specify which parts of the chemicals and drugs category you want to include.
Particularly helpful is the ability to include SMILES and InChI keys for the chemicals and drugs. This is done by calling the `include_smiles` and `include_inchi_keys` methods of the `ChemicalsAndDrugsSettings` class.
```python
from mesh.settings import DatasetSettings, ChemicalsAndDrugsSettings
from mesh import Dataset


def build_mesh_chemistry_2024() -> Dataset:
    """Build MESH 2024 dataset."""
    # First, we need to define the settings for the dataset.
    cad: ChemicalsAndDrugsSettings = (
        ChemicalsAndDrugsSettings()
        # In this case, we are including all of the submodules of
        # categories of chemicals and drugs.
        .include_all_submodules()
        # We also want to include SMILES, which we obtain from the
        # PUBCHEM database.
        .include_smiles()
        # Analogously, we want to include InChI keys, which we obtain
        # from the PUBCHEM database.
        .include_inchi_keys()
    )
    settings = (
        # We are using the MESH 2024 version.
        DatasetSettings(version=2024)
        # We want to retrieve data only regarding chemicals and drugs.
        .include_chemicals_and_drugs(cad)
        # And we want to print the progress of the dataset retrieval.
        .set_verbose(True)
    )
    # Now, we build the dataset. This will download the necessary files
    # and rasterize the dataset.
    dataset = Dataset.build(settings)
    return dataset


if __name__ == "__main__":
    # We build the MESH 2024 dataset.
    mesh_chemistry_2024: Dataset = build_mesh_chemistry_2024()
    # And we save it to disk.
    mesh_chemistry_2024.save("mesh_chemistry_2024", tarball=False)
```

#### Resulting CSVs
The resulting files will be saved in the `mesh_chemistry_2024` directory. The directory will contain the following files:
##### `chemicals.csv`
| unique_identifier | name | compound_id | substance_id | smiles | inchi | inchikey |
|-------------------|------|-------------|--------------|--------|-------|----------|
| C000002 | bevonium | 31800.0 | 500762995.0 | C[N+]1(CCCCC1COC(=O)C(C2=CC=CC=C2)(C3=CC=CC=C3)O)C | InChI=1S/C22H28NO3/c1-23(2)16-10-9-15-20(23)17-26-21(24)22(25,18-11-5-3-6-12-18)19-13-7-4-8-14-19/h3-8,11-14,20,25H,9-10,15-17H2,1-2H3/q+1 | UHUMRJKDOOEQIG-UHFFFAOYSA-N |
| C000009 | N-acetylglucosaminylasparagine | 123826.0 | 500203198.0 | CC(=O)N[C@@H]1[C@H]([C@@H]([C@H](O[C@H]1NC(=O)C[C@@H](C(=O)O)N)CO)O)O | InChI=1S/C12H21N3O8/c1-4(17)14-8-10(20)9(19)6(3-16)23-11(8)15-7(18)2-5(13)12(21)22/h5-6,8-11,16,19-20H,2-3,13H2,1H3,(H,14,17)(H,15,18)(H,21,22)/t5-,6+,8+,9+,10+,11+/m0/s1 | YTTRPBWEMMPYSW-HRRFRDKFSA-N |
| C000011 | 5-(n-acetaminophenylazo)-8-oxyquinoline | 114081.0 | 484035752.0 | CC(=O)NC1=CC=C(C=C1)N=NC2=C3C=CC=NC3=C(C=C2)O | InChI=1S/C17H14N4O2/c1-11(22)19-12-4-6-13(7-5-12)20-21-15-8-9-16(23)17-14(15)3-2-10-18-17/h2-10,23H,1H3,(H,19,22) | DKRPSSOODLBKPQ-UHFFFAOYSA-N |
| C000015 | N-acetyl-L-arginine | 67427.0 | 500710457.0 | CC(=O)N[C@@H](CCCN=C(N)N)C(=O)O | InChI=1S/C8H16N4O3/c1-5(13)12-6(7(14)15)3-2-4-11-8(9)10/h6H,2-4H2,1H3,(H,12,13)(H,14,15)(H4,9,10,11)/t6-/m0/s1 | SNEIUMQYRCDYCH-LURJTMIESA-N |
| C000020 | N-acetylneuraminoyllactose | | 489852514.0 | | | |
| C000021 | acetylnovadral | | | | | |

##### `descriptors.csv`
| unique_identifier | name | compound_id | substance_id | smiles | inchikey |
|-------------------|-----------------------------------|----------------|----------------|---------------------------------------------------------------------------------------------------------------|----------------------------------|
| D000001 | Calcimycin | 139593372.0 | 500766157.0 | C[C@@H]1CCC2([C@H](C[C@@H]([C@@H](O2)C(C)C(=O)C3=CC=CN3)C)C)O[C@@H]1CC4=NC5=C(O4)C=CC(=C5C(=O)O)NC | HIYAVKIYRIFSCZ-LGHBZWQHSA-N |
| D000002 | Temefos | 5392.0 | 500974612.0 | COP(=S)(OC)OC1=CC=C(C=C1)SC2=CC=C(C=C2)OP(=S)(OC)OC | WWJZWCUNLNYYAU-UHFFFAOYSA-N |
| D000017 | ABO Blood-Group System | | | | |
| D000019 | Abortifacient Agents | | | | |
| D000020 | Abortifacient Agents, Nonsteroidal | | | | |
| D000021 | Abortifacient Agents, Steroidal | | | | |
| D000036 | Abrin | | 486451862.0 | | |
| D000040 | Abscisic Acid | 5702609.0 | 500195639.0 | CC1=CC(=O)CC([C@]1(/C=C/C(=C/C(=O)O)/C)O)(C)C | JLIDBLDQVAYHNE-IBPUIESWSA-N |

##### `chemicals_to_descriptors.csv`
| chemical | descriptor |
|----------|------------|
| C000002 | D001561 |
| C000006 | D061389 |
| C000009 | D000117 |
| C000011 | D015125 |
| C000015 | D001120 |
| C000020 | D007785 |

##### `mesh_dag.csv`
| parent | child |
|---------|------------|
| D000001 | D000095662 |
| D000001 | D001583 |
| D000002 | D063086 |
| D000017 | D001789 |
| D000019 | D012102 |
| D000020 | D000019 |
| D000021 | D000019 |

##### `metadata.json`
```json
{
    "version": {
        "version": 2024,
        "descriptors": "https://nlmpubs.nlm.nih.gov/projects/mesh/2024/asciimesh/20240101/d2024.bin",
        "chemicals": "https://nlmpubs.nlm.nih.gov/projects/mesh/2024/asciimesh/20240101/c2024.bin"
    },
    "roots": [
        {
            "root": "Chemicals and Drugs",
            "included_codes": [
                "D01",
                "D02",
                "D03",
                "D04",
                "D05",
                "D06",
                "D08",
                "D09",
                "D10",
                "D12",
                "D13",
                "D20",
                "D23",
                "D25",
                "D26",
                "D27"
            ],
            "include_smiles": true
        }
    ],
    "downloads_directory": "downloads"
}
```

### To NetworkX
Since the MESH dataset is a Directed Acyclic Graph (DAG), you can convert it to a NetworkX graph. This is done by calling the `to_networkx` method of the `Dataset` class.
```python
import networkx as nx

# We convert the MESH dataset to a NetworkX graph.
graph: nx.DiGraph = mesh_chemistry_2024.to_networkx()

# Now, we can use the NetworkX graph as we would any other NetworkX graph.
# `nx.info` was removed in NetworkX 3.0; printing the graph gives the same summary.
print(graph)
```

In this case, the output will be:
```
DiGraph with 334220 nodes and 367694 edges
```

## License
This project is licensed under the MIT License - see the [LICENSE](LICENSE) file for details.