https://github.com/beebus/osra-iterate

Bash Script to iterate through .TIF Images in a folder and run the OSRA program to attempt to convert the TIF images into ChemDraw files (.CDXML).

Host: GitHub
URL: https://github.com/beebus/osra-iterate
Owner: beebus
License: MIT
Created: 2017-09-29T19:45:45.000Z (almost 8 years ago)
Default Branch: main
Last Pushed: 2021-03-20T19:01:06.000Z (over 4 years ago)
Last Synced: 2025-02-06T11:55:48.912Z (5 months ago)
Topics: bash, bash-script, bash-scripting, chemical-structures, cheminformatics, chemistry, image-processing, image-recognition, jpg, linux, molecular-structures, molecule, molecules, ocr, optical-recognition, organic-chemistry, osra, pdf, reactions, tif-images
Language: Shell
Homepage:
Size: 3.91 KB
Stars: 0
Watchers: 2
Forks: 0
Open Issues: 0
Metadata Files:
- Readme: README.md
- License: LICENSE

README

# osra-iterate
Bash Script to Iterate through TIF Images in Folder and Run OSRA

This script demonstrates command-line usage of the open-source OSRA software.

Run the osra_iterate.sh bash script from a Linux terminal, with the folder that contains osra_iterate.sh as the current working directory, using the following command (or similar). The example passes an input folder and an output folder:

./osra_iterate.sh ~/Share/input/ ~/Share/output/
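
The loop such a script performs can be sketched as follows. This is a minimal illustration under stated assumptions, not the repository's actual osra_iterate.sh: the function name `osra_iterate`, the `OSRA_CMD` and `OSRA_FORMAT` variables, and the choice of `sdf` as output format are hypothetical, and the sketch assumes the recognizer accepts `-f <format> -w <outfile> <image>`. Check `osra --help` on your installation before relying on those flags. Producing .CDXML files may need a different format value or a separate conversion step, which is not shown here.

```shell
#!/usr/bin/env bash
# Hypothetical sketch, NOT the repository's osra_iterate.sh.
# Runs a recognizer command on every TIF/TIFF image in an input folder and
# writes one result file per image into an output folder.

# Command to run. Assumed to accept: -f <format> -w <outfile> <image>
OSRA_CMD="${OSRA_CMD:-osra}"
# Output format, also used as the file extension (an assumption; adjust to
# whatever your OSRA build supports).
OSRA_FORMAT="${OSRA_FORMAT:-sdf}"

osra_iterate() {
    # Strip one trailing slash so paths like ~/Share/input/ join cleanly.
    local in_dir="${1%/}" out_dir="${2%/}"
    local img base failed=0

    if [ ! -d "$in_dir" ]; then
        echo "input folder not found: $in_dir" >&2
        return 2
    fi
    mkdir -p "$out_dir" || return 2

    for img in "$in_dir"/*; do
        # Keep only regular files ending in .tif or .tiff, in any letter case.
        case "$img" in
            *.[tT][iI][fF] | *.[tT][iI][fF][fF]) ;;
            *) continue ;;
        esac
        [ -f "$img" ] || continue

        # Output name: image file name without its extension.
        base="$(basename "$img")"
        base="${base%.*}"

        if "$OSRA_CMD" -f "$OSRA_FORMAT" -w "$out_dir/$base.$OSRA_FORMAT" "$img"; then
            echo "converted: $img"
        else
            echo "failed: $img" >&2
            failed=$((failed + 1))
        fi
    done

    # Succeed only if every image was processed without error.
    [ "$failed" -eq 0 ]
}

# Run only when called with exactly two arguments: <input folder> <output folder>.
if [ "$#" -eq 2 ]; then
    osra_iterate "$1" "$2"
fi
```

Keeping the recognizer command in a variable makes it possible to try the loop with a stand-in command before pointing it at a real OSRA installation. Failures are counted rather than aborting the run, so one unreadable image does not stop the rest of the folder from being processed.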

## What is OSRA?

OSRA (Optical Structure Recognition Application) is a utility designed to convert graphical representations of chemical structures and reactions, as they appear in journal articles, patent documents, textbooks, trade magazines, and similar sources, into SMILES or MOL files, which are computer-readable molecular structure formats. OSRA can read a document in any of the more than 90 graphical formats parseable by GraphicsMagick (https://sourceforge.net/p/osra/wiki/Dependencies#GraphicsMagick), including GIF, JPEG, PNG, TIFF, PDF, and PS, and generate the SMILES or MOL representation of the molecular structure images found in that document, or RSMI/RXN for reactions.

Note that any software designed for optical recognition is unlikely to be perfect, and the output produced might, and probably will, contain errors, so curation by a human knowledgeable in chemical structures is highly recommended.

OSRA can process the following types of images:
* Computer-generated 2D structures, such as those found on the PubChem website (http://pubchem.ncbi.nlm.nih.gov/), in black-and-white or color.
* Black-and-white PDF and PostScript files, including multi-page ones.
* Scanned black-and-white images. A resolution of 300 dpi is recommended, though 150 dpi can also produce fair results. Please make sure the scanned image is of reasonable quality; an input that is too noisy will only generate garbage output.
* Reactions and polymers.

You can download a free version (https://sourceforge.net/p/osra/wiki/Download/) of the source code, or support OSRA development by purchasing binary installation executables for Windows (https://store.payproglobal.com/checkout?products[1][id]=38760) and Linux (https://store.payproglobal.com/checkout?products[1][id]=38761).
