Ecosyste.ms: Awesome

An open API service indexing awesome lists of open source software.

https://github.com/longemen3000/ChemicalIdentifiers.jl

chemical identifiers (CAS, PubChemID, SMILES,InChI, InChI keys, names) from text search
https://github.com/longemen3000/ChemicalIdentifiers.jl

cas chemical-identifiers chemistry chemoinformatics identifiers inchi inchi-key pubchem smiles

Last synced: 3 months ago
JSON representation

chemical identifiers (CAS, PubChemID, SMILES,InChI, InChI keys, names) from text search

Host: GitHub
URL: https://github.com/longemen3000/ChemicalIdentifiers.jl
Owner: longemen3000
License: mit
Created: 2021-05-06T09:28:45.000Z (about 3 years ago)
Default Branch: master
Last Pushed: 2023-09-06T02:43:23.000Z (10 months ago)
Last Synced: 2024-03-28T20:12:13.601Z (4 months ago)
Topics: cas, chemical-identifiers, chemistry, chemoinformatics, identifiers, inchi, inchi-key, pubchem, smiles
Language: Julia
Homepage:
Size: 44.9 KB
Stars: 16
Watchers: 3
Forks: 1
Open Issues: 0
Metadata Files:
- Readme: README.md
- License: LICENSE

Lists

awesome-stars - longemen3000/ChemicalIdentifiers.jl - chemical identifiers (CAS, PubChemID, SMILES,InChI, InChI keys, names) from text search (Julia)

README

        # ChemicalIdentifiers.jl

[![Build Status](https://github.com/longemen3000/ChemicalIdentifiers.jl/workflows/CI/badge.svg)](https://github.com/longemen3000/ChemicalIdentifiers.jl/actions)

A chemical identifiers search package, using the databases present at CalebBell/chemicals.

## Instalation:

```

using Pkg

Pkg.add("ChemicalIdentifiers.jl")

```

The databases are downloaded, parsed, processed and stored as Apache Arrow files at the first package usage, so the first usage may take some time.

## Usage

This package exports `search_chemical`, that, given a search string, performs a search on a database of over 70000 compounds, returning a Named Tuple with the identifiers of the substance in question. 

```

julia>using ChemicalIdentifiers

julia> res = search_chemical("water")

(pubchemid = 962, CAS = (7732, 18, 5), formula = "H2O", MW = 18.01528, smiles = "O", InChI = "H2O/h1H2", InChI_key = "XLYOFNOQVPJJNP-UHFFFAOYSA-N", iupac_name = "oxidane", common_name = "water")

#worst case scenario, not found on the present databases

julia> @btime search_chemical("dimethylpyruvic acid22",nothing)

  273.700 μs (264 allocations: 15.05 KiB)   

missing

#common compound found in the short database

julia> @btime search_chemical("methane",nothing)

  7.075 μs (57 allocations: 2.97 KiB)       

(pubchemid = 297, CAS = (74, 82, 8), formula = "CH4", MW = 16.04246, smiles = "C", InChI = "CH4/h1H4", InChI_key = "VNWKTOKETHGBQD-UHFFFAOYSA-N", iupac_name = "methane", common_name = "methane")

```

A query is usually a string and its type is detected automatically when possible. the supported query types are:

- **PubChemID**: by using any ``<:Integer`(or a string containing an Integer)

```

julia> search_chemical(8003)

(pubchemid = 8003, CAS = (109, 66, 0), formula = "C5H12", MW = 72.14878, smiles = 

"CCCCC", InChI = "C5H12/c1-3-5-4-2/h3-5H2,1-2H3", InChI_key = "OFBQJSOFQDEBGM-UHFFFAOYSA-N", iupac_name = "pentane", common_name = "pentane")

```

- **CAS registry number**:by using a Tuple of integers or a string with the digits separated by `-` :

```

julia> search_chemical((67,56,1))        

(pubchemid = 887, CAS = (67, 56, 1), formula = "CH4O", MW = 32.04186, smiles = "CO", InChI = "CH4O/c1-2/h2H,1H3", InChI_key = "OKKJLVBELUTLKV-UHFFFAOYSA-N", iupac_name = "methanol", common_name = "methanol")

 search_chemical((67,56,1),nothing) == search_chemical("67-56-1",nothing) #true  

```

- **SMILES**: by using a string starting with `SMILES=` :

```

julia> search_chemical("SMILES=N")       

(pubchemid = 222, CAS = (7664, 41, 7), formula = "H3N", MW = 17.03052, smiles = "N", InChI = "H3N/h1H3", InChI_key = "QGZKDVFQNNGYKY-UHFFFAOYSA-N", iupac_name = "azane", common_name = "ammonia")

```

- **InChI** : by using a string starting with `InChI=1/` or `InChI=1S/` :

```

julia> search_chemical("InChI=1/C2H4/c1-2/h1-2H2")     

(pubchemid = 6325, CAS = (74, 85, 1), formula = "C2H4", MW = 28.05316, smiles = "C=C", InChI = "C2H4/c1-2/h1-2H2", InChI_key = "VGGSQFUCUMXWEO-UHFFFAOYSA-N", iupac_name = "ethene", common_name = "ethene") 

```

- **InChI key** : by using a string with the pattern `XXXXXXXXXXXXXX-YYYYYYYYFV-P`:

```

julia> search_chemical("IMROMDMJAWUWLK-UHFFFAOYSA-N")

(pubchemid = 11199, CAS = (9002, 89, 5), 

formula = "C2H4O", MW = 44.05256, smiles 

= "C=CO", InChI = "C2H4O/c1-2-3/h2-3H,1H2", InChI_key = "IMROMDMJAWUWLK-UHFFFAOYSA-N", iupac_name = "ethenol", common_name 

= "ethenol")

```

Searches by CAS and PubChemID are a little bit faster thanks to being encoded as native numeric types, other properties are stored as strings.

The package stores each query in `ChemicalIdentifiers.SEARCH_CACHE` as a `Dict{String,Any}`, so subsequent queries on the same (or similar) strings, dont pay the cost of searching in the database.

If you don't want to store the query, you could use `search_chemical(query,nothing)`, or, if you want your own cache to be used, pass your own cache via `search_chemical(query,mycache)`. 

## Custom Databases

If you want to add your own databases, you could use the (unexported) data utilities to do so. lets say we also want to add the inorganic database located at https://github.com/CalebBell/chemicals/blob/master/chemicals/Identifiers/Inorganic%20db.tsv. we could do:

```

using ChemicalIdentifiers

inorganic_url = "https://github.com/CalebBell/chemicals/blob/master/chemicals/Identifiers/Inorganic%20db.tsv"

ChemicalIdentifiers.load_data!(:inorganic,url = inorganic_url)

ChemicalIdentifiers.load_db!(:inorganic)

```

or if you already have a local database:

```

using ChemicalIdentifiers

filepath = "path/to/my/db.tsv"

ChemicalIdentifiers.load_data!(:custom,file = filepath)

ChemicalIdentifiers.load_db!(:custom)

```

`ChemicalIdentifiers.load_data!` will generate a named tuple of file paths (stored in `ChemicalIdentifiers.DATA_INFO`), and `ChemicalIdentifiers.load_db!` will use that data to generate the corresponding Apache Arrow files and store those in a [scratch](https://github.com/JuliaPackaging/Scratch.jl) space (`ChemicalIdentifiers.download_cache`). This download cache can be cleaned (in case a download goes wrong) with `ChemicalIdentifiers.clear_download_cache!()`

The raw databases are then stored in `ChemicalIdentifiers.DATA_DB`. if the data was already processed, then the arrow files are read directly, saving significant loading time.

In case of adding user databases, those are searched first, so there is a possibility of collision.