https://github.com/beebus/osra-iterate

Bash Script to iterate through .TIF Images in a folder and run the OSRA program to attempt to convert the TIF images into ChemDraw files (.CDXML).

# osra-iterate
Bash Script to Iterate through TIF Images in Folder and Run OSRA

This script demonstrates command-line usage of the open source OSRA software.

To run osra_iterate.sh, open a Linux terminal, change to the folder that contains the script, and run the following command (or a similar one with your own folder paths):

./osra_iterate.sh ~/Share/input/ ~/Share/output/
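
The README does not reproduce the script itself, so here is a minimal sketch of the kind of loop such a script might contain. It is not the actual osra_iterate.sh. The function name `convert_folder` and the `OSRA_CMD` override variable are hypothetical, and the sketch assumes that `osra` is on the PATH and prints SMILES to standard output when called with only a file name. The CDXML conversion step mentioned in the project description is not shown in the README and is left out here.

```shell
#!/usr/bin/env bash
# Hypothetical sketch; NOT the actual osra_iterate.sh from the repository.
# OSRA_CMD is an assumed override hook; it defaults to the osra binary on PATH.
OSRA_CMD="${OSRA_CMD:-osra}"

# convert_folder INPUT_DIR OUTPUT_DIR
# Runs OSRA on every .tif/.TIF file in INPUT_DIR and writes one .smi file
# per image into OUTPUT_DIR. Returns non-zero if any image failed.
convert_folder() {
    local input_dir="${1%/}" output_dir="${2%/}"
    local image base status=0

    if [ ! -d "$input_dir" ]; then
        echo "input folder not found: $input_dir" >&2
        return 1
    fi
    mkdir -p "$output_dir" || return 1

    # Match both lower- and upper-case extensions.
    for image in "$input_dir"/*.tif "$input_dir"/*.TIF; do
        # An unmatched glob stays literal, so skip names that do not exist.
        [ -e "$image" ] || continue
        base="$(basename "$image")"
        base="${base%.*}"
        # Assumption: with no format option, osra prints SMILES to stdout.
        if ! "$OSRA_CMD" "$image" > "$output_dir/$base.smi"; then
            echo "OSRA failed on $image" >&2
            status=1
        fi
    done
    return "$status"
}
```

Under those assumptions, `convert_folder ~/Share/input/ ~/Share/output/` would mirror the invocation above. A failure on one image is reported and the loop continues, so one unreadable scan does not stop the whole batch.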

## What is OSRA?

OSRA (Optical Structure Recognition Application) is a utility that converts graphical representations of chemical structures and reactions, as they appear in journal articles, patent documents, textbooks, trade magazines, and similar sources, into SMILES or MOL files, which are machine-readable molecular structure formats. OSRA can read a document in any of the more than 90 graphics formats parseable by GraphicsMagick (https://sourceforge.net/p/osra/wiki/Dependencies#GraphicsMagick), including GIF, JPEG, PNG, TIFF, PDF, and PS. It generates a SMILES or MOL representation of each molecular structure image found in that document, or RSMI/RXN for reactions.

Note that any software designed for optical recognition is unlikely to be perfect, and the output produced might, and probably will, contain errors, so curation by a human knowledgeable in chemical structures is highly recommended.

OSRA can process the following types of images:
* Computer-generated 2D structures, such as found on the PubChem website (http://pubchem.ncbi.nlm.nih.gov/), black-and-white and color.
* Black-and-white PDF and PostScript files, including multi-page ones.
* Scanned black-and-white images. A resolution of 300 dpi is recommended, though 150 dpi can also produce fair results. Make sure the scanned image is of reasonable quality; input that is too noisy will only produce garbage output.
* Reactions and polymers.

You can download the source code for free (https://sourceforge.net/p/osra/wiki/Download/) or support OSRA development by purchasing binary installers for Windows (https://store.payproglobal.com/checkout?products[1][id]=38760) and Linux (https://store.payproglobal.com/checkout?products[1][id]=38761).