{"id":23438504,"url":"https://github.com/lszeremeta/molstruct","last_synced_at":"2025-04-13T06:26:12.962Z","repository":{"id":57442849,"uuid":"199512410","full_name":"lszeremeta/molstruct","owner":"lszeremeta","description":"Convert chemical molecule data CSV files to structured data formats","archived":false,"fork":false,"pushed_at":"2024-06-24T04:17:10.000Z","size":583,"stargazers_count":4,"open_issues_count":2,"forks_count":1,"subscribers_count":2,"default_branch":"master","last_synced_at":"2025-03-26T23:02:17.095Z","etag":null,"topics":["bioschemas","cheminformatics","chemoinformatics","csv","docker-image","json-ld","microdata","molecularentity","molecule","molecule-data","molecules","rdfa","schema","schema-org","schemaorg"],"latest_commit_sha":null,"homepage":"https://pypi.org/project/molstruct/","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/lszeremeta.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null}},"created_at":"2019-07-29T19:10:38.000Z","updated_at":"2023-06-08T15:22:15.000Z","dependencies_parsed_at":"2024-02-21T01:43:56.936Z","dependency_job_id":"8c842e51-ee2c-41d6-8b1b-cde17c9fc236","html_url":"https://github.com/lszeremeta/molstruct","commit_stats":{"total_commits":179,"total_committers":3,"mean_commits":"59.666666666666664","dds":0.1005586592178771,"last_synced_commit":"98ac29c4a566502b0e11fce821b33f078de98df8"},"previous_names":[],"tags_count":6,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/lszeremeta%2Fmolstruct","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/r
epositories/lszeremeta%2Fmolstruct/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/lszeremeta%2Fmolstruct/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/lszeremeta%2Fmolstruct/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/lszeremeta","download_url":"https://codeload.github.com/lszeremeta/molstruct/tar.gz/refs/heads/master","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":248672202,"owners_count":21143237,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["bioschemas","cheminformatics","chemoinformatics","csv","docker-image","json-ld","microdata","molecularentity","molecule","molecule-data","molecules","rdfa","schema","schema-org","schemaorg"],"created_at":"2024-12-23T14:49:49.213Z","updated_at":"2025-04-13T06:26:12.906Z","avatar_url":"https://github.com/lszeremeta.png","language":"Python","readme":"# \u003cimg src=\"https://raw.githubusercontent.com/lszeremeta/molstruct/master/logo/molstruct.png\" alt=\"Molstruct logo\" width=\"300\"\u003e\n\n[![Codacy Badge](https://app.codacy.com/project/badge/Grade/3602c4be20d14be1b750db5a1875263a)](https://www.codacy.com/gh/lszeremeta/molstruct/dashboard?utm_source=github.com\u0026amp;utm_medium=referral\u0026amp;utm_content=lszeremeta/molstruct\u0026amp;utm_campaign=Badge_Grade) [![PyPI](https://img.shields.io/pypi/v/molstruct)](https://pypi.org/project/molstruct/) [![Docker Image Size (latest by 
date)](https://img.shields.io/docker/image-size/lszeremeta/molstruct?label=Docker%20image%20size)](https://hub.docker.com/r/lszeremeta/molstruct)\n\nMolstruct is a lightweight Python CLI tool that converts chemical molecule data [Comma Separated Values (CSV)](https://en.wikipedia.org/wiki/Comma-separated_values) files to structured data formats - [JSON-LD](https://json-ld.org/), [RDFa](http://rdfa.info/), and [Microdata](https://schema.org/docs/gs.html). Molstruct has a lot of customization options that you can, but don't have to use. Python 3.2+ is supported and no dependencies are required. Sounds good so far? What would you say to a really tiny [Molstruct Docker container](https://hub.docker.com/r/lszeremeta/molstruct)? Just try Molstruct!\n\n## What is structured data\n\n[Structured data](https://developers.google.com/search/docs/guides/intro-structured-data) is additional data placed on websites. It is not visible to ordinary internet users but can be easily processed by machines. There are 3 formats that we can use to save structured data - [JSON-LD](https://json-ld.org/), [RDFa](http://rdfa.info/), and [Microdata](https://www.w3.org/TR/microdata/). Molstruct supports them all and uses the [MolecularEntity profile](https://bioschemas.org/profiles/MolecularEntity/0.5-RELEASE/).\n\n## Where to find a CSV file with molecule data\n\nThere are many possibilities. The easiest way is to download a CSV file from one of the chemical databases, e.g. [DrugBank](https://www.drugbank.ca/releases/latest#open-data). You can also create the CSV file yourself.\n\n## Quick start\n\nUse Molstruct in 3 easy steps. In this example, we will use the [DrugBank open dataset](https://www.drugbank.ca/releases/latest#open-data). You need Python 3.2+ and pip installed.\n\n1. 
Open a terminal and install Molstruct\n\nYou can install Molstruct from [PyPI](https://pypi.org/project/molstruct/):\n\n```shell\npip install molstruct\n```\n\nMolstruct is also available as a [Docker image](#docker-image). In most cases, installing Molstruct from PyPI or using Docker should be sufficient and convenient, but you may want to [run Molstruct from sources or build a Docker image yourself](https://github.com/lszeremeta/molstruct/wiki/Run-from-sources-and-manual-Docker-build).\n\n2. Download the [DrugBank open dataset](https://go.drugbank.com/releases/latest/downloads/all-drugbank-vocabulary) in CSV format and unzip the downloaded archive.\n3. Molstruct has a [predefined preset](#predefined-presets) for this dataset. You just need to select the output format and enter the path to the CSV file. Assuming the `drugbank vocabulary.csv` file is in the current directory and the output format you're interested in is RDFa, the command will be as follows:\n\n```shell\nmolstruct -p drugbank-open -f rdfa \"drugbank vocabulary.csv\" \u003e drugbank_cc0_rdfa.html\n```\n\nThat's all. Now you have the RDFa file ready in the current directory. You can try other output formats and options as described below. You can also use Molstruct to convert other data in CSV format.\n\n## Docker image\n\nIf you have [Docker](https://docs.docker.com/engine/install/) installed, you can use a tiny Molstruct image from [Docker Hub](https://hub.docker.com/r/lszeremeta/molstruct).\n\nBecause the tool runs inside a container, you have to [mount](https://docs.docker.com/storage/bind-mounts/#start-a-container-with-a-bind-mount) the local directory with your input file. The default working directory of the image is `/app`. You need to mount your local directory inside it (e.g. 
`/app/input`):\n\n```shell\ndocker run --rm --name molstruct-app --mount type=bind,source=/home/user/input,target=/app/input,readonly lszeremeta/molstruct:latest\n```\n\nIn this case, the local directory `/home/user/input` has been mounted under `/app/input`.\n\nYou can also simply mount the current working directory using the `$(pwd)` command substitution:\n\n```shell\ndocker run --rm --name molstruct-app --mount type=bind,source=\"$(pwd)\",target=/app/input,readonly lszeremeta/molstruct:latest\n```\n\n## Usage\n\n```\nusage: molstruct [-h] [--version] -f {jsonldhtml,jsonld,rdfa,microdata} [-i IDENTIFIER]\n                 [-n NAME] [-ink INCHIKEY] [-in INCHI] [-sm SMILES] [-u URL]\n                 [-iu IUPACNAME] [-mf MOLECULARFORMULA] [-w MOLECULARWEIGHT]\n                 [-mw MONOISOTOPICMOLECULARWEIGHT] [-d DESCRIPTION]\n                 [-dd DISAMBIGUATINGDESCRIPTION] [-img IMAGE] [-an ALTERNATENAME]\n                 [-sa SAMEAS] [-p {drugbank-open} | -c] [-s {iri,uuid,bnode}] [-b BASE]\n                 [-vd VALUE_DELIMITER] [-l LIMIT]\n                 file\n```\n\nSupported [MolecularEntity](https://bioschemas.org/profiles/MolecularEntity/0.5-RELEASE/) properties that correspond to default CSV column names: `identifier`, `name`, `inChIKey`, `inChI`, `smiles`, `url`, `iupacName`, `molecularFormula`, `molecularWeight`, `monoisotopicMolecularWeight`, `description`, `disambiguatingDescription`, `image`, `alternateName` and `sameAs`. You can rename the columns if needed (see [Column name change arguments](#column-name-change-arguments) below). 
You can also use a [preset](#predefined-presets) with the appropriate settings for your dataset.\n\n### Informative arguments\n\n* `-h`, `--help` show help message and exit\n* `--version` show program version and exit\n\n### Required arguments\n\n* `-f {jsonldhtml,jsonld,rdfa,microdata}`, `--format {jsonldhtml,jsonld,rdfa,microdata}` output format\n* `file` CSV file path with molecule data to convert\n\nRemember to use the appropriate file path when using the Docker image. Suppose you mounted your local directory `/home/user/input` under `/app/input` and the path to the CSV file you want to use in Molstruct is `/home/user/input/file.csv`. In this case, enter the path `/app/input/file.csv` or `input/file.csv` as the `file` argument value.\n\n### Column name change arguments\n\nArguments for changing the default column names:\n\n* `-i IDENTIFIER`, `--identifier IDENTIFIER` identifier column name ('identifier' by default), Text\n* `-n NAME`, `--name NAME` name column name ('name' by default), Text\n* `-ink INCHIKEY`, `--inChIKey INCHIKEY` inChIKey column name ('inChIKey' by default), Text\n* `-in INCHI`, `--inChI INCHI` inChI column name ('inChI' by default), Text\n* `-sm SMILES`, `--smiles SMILES` smiles column name ('smiles' by default), Text\n* `-u URL`, `--url URL` url column name ('url' by default), URL\n* `-iu IUPACNAME`, `--iupacName IUPACNAME` iupacName column name ('iupacName' by default), Text\n* `-mf MOLECULARFORMULA`, `--molecularFormula MOLECULARFORMULA` molecularFormula column name ('molecularFormula' by default), Text\n* `-w MOLECULARWEIGHT`, `--molecularWeight MOLECULARWEIGHT` molecularWeight column name ('molecularWeight' by default), Mass e.g. 0.01 mg\n* `-mw MONOISOTOPICMOLECULARWEIGHT`, `--monoisotopicMolecularWeight MONOISOTOPICMOLECULARWEIGHT` monoisotopicMolecularWeight column name ('monoisotopicMolecularWeight' by default), Mass e.g. 
0.01 mg\n* `-d DESCRIPTION`, `--description DESCRIPTION` description column name ('description' by default), Text\n* `-dd DISAMBIGUATINGDESCRIPTION`, `--disambiguatingDescription DISAMBIGUATINGDESCRIPTION` disambiguatingDescription column name ('disambiguatingDescription' by default), Text\n* `-img IMAGE`, `--image IMAGE` image column name ('image' by default), URL\n* `-an ALTERNATENAME`, `--alternateName ALTERNATENAME` alternateName column name ('alternateName' by default), Text\n* `-sa SAMEAS`, `--sameAs SAMEAS` sameAs column name ('sameAs' by default), URL\n\n### Additional settings arguments\n\n* `-p {drugbank-open}`, `--preset {drugbank-open}` apply presets for individual CSV sources to avoid setting individual options manually (['drugbank-open'](#drugbank-open))\n* `-c`, `--columns` use only columns with renamed names; not available when using a preset\n* `-s {iri,uuid,bnode}`, `--subject {iri,uuid,bnode}` molecule subject type ('iri' by default)\n* `-b BASE`, `--base BASE` molecule subject base for 'iri' subject type ('http://example.com/molecule#entity' by default)\n* `-vd VALUE_DELIMITER`, `--value-delimiter VALUE_DELIMITER` value delimiter (' | ' by default)\n* `-l LIMIT`, `--limit LIMIT` maximum number of results (unlimited by default)\n\nAvailable options may vary depending on the version. To display all available options with their descriptions use ``molstruct -h``.\n\n## Predefined presets\n\nTo make your work easier, Molstruct has built-in preset support. Thanks to this, you do not have to set everything manually, you just select the appropriate preset and it's ready. The presets are flexible. If you want to change, e.g. the column names selected for a preset, you can do so. At the moment you can use the [DrugBank open](https://www.drugbank.ca/releases/latest#open-data) preset. There are plans to add more in the future. 
Any [suggestions](https://github.com/lszeremeta/molstruct/issues/new?template=new-preset-suggestion.md) are welcome!\n\n### `drugbank-open`\n\nSettings for the [open DrugBank dataset](https://www.drugbank.ca/releases/latest#open-data) in CSV format:\n\n* `--value-delimiter` is set to ' | '\n* `--identifier` is set to 'CAS'\n* `--name` is set to 'Common name'\n* `--inChIKey` is set to 'Standard InChI Key'\n* `--alternateName` is set to 'Synonyms'\n\n## Additional examples\n\n```shell\nmolstruct -f jsonldhtml data.csv\n```\n\nReturns simple HTML with added JSON-LD. Assumes that the column names in the CSV file are the default ones.\n\n```shell\nmolstruct -f microdata -mf \"formula\" data.csv\n```\n\nReturns simple HTML with added Microdata. Assumes that the column names in the CSV file are the default ones, except that the `molecularFormula` column is named `formula`.\n\n```shell\nmolstruct -f microdata --columns --id \"CAS\" --name \"Common name\" --inChIKey \"Standard InChI Key\" --limit 50 \"drugbank vocabulary.csv\"\n```\n\nReturns simple HTML with added Microdata. Only the columns named on the command line are taken into account when generating the output, and the result is limited to 50 molecules.\n\n```shell\nmolstruct -f microdata --columns --id \"CAS\" --name \"Common name\" --inChIKey \"Standard InChI Key\" --limit 50 \"drugbank vocabulary.csv\" \u003e output.html\n```\n\nDoes the same as the example above but saves the results to `output.html`.\n\n```shell\ndocker run --rm --name molstruct-app --mount type=bind,source=/home/user/input,target=/app/input,readonly lszeremeta/molstruct:latest -f microdata --columns --id \"CAS\" --name \"Common name\" --inChIKey \"Standard InChI Key\" --limit 50 \"input/drugbank vocabulary.csv\" \u003e output.html\n```\n\nDoes the same as the example above, but runs Molstruct from the pre-built Docker image. 
The result is simple HTML with added [Microdata](https://www.w3.org/TR/microdata/), redirected to `output.html`.\n\n## Contribution\n\nWould you like to improve this project? Great! We are waiting for your help and suggestions. If you are new to open source contributions, read [How to Contribute to Open Source](https://opensource.guide/how-to-contribute/).\n\n## License\n\nDistributed under [MIT License](https://github.com/lszeremeta/molstruct/blob/master/LICENSE).\n\n## See also\n\nThese projects can also be useful:\n\n* [SDFEater](https://github.com/lszeremeta/SDFEater) - Always hungry SDF chemical file format parser with many output formats\n* [MEgen](https://github.com/lszeremeta/MEgen) - Convenient online form to generate structured data about molecules\n","funding_links":[],"categories":[],"sub_categories":[],"project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Flszeremeta%2Fmolstruct","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Flszeremeta%2Fmolstruct","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Flszeremeta%2Fmolstruct/lists"}