Ecosyste.ms: Awesome
An open API service indexing awesome lists of open source software.
https://github.com/rdmpage/biostor-jats
BioStor articles marked up using Journal Archiving Tag Set
https://github.com/rdmpage/biostor-jats
Last synced: 6 days ago
JSON representation
BioStor articles marked up using Journal Archiving Tag Set
- Host: GitHub
- URL: https://github.com/rdmpage/biostor-jats
- Owner: rdmpage
- Created: 2013-12-03T16:18:04.000Z (about 11 years ago)
- Default Branch: master
- Last Pushed: 2013-12-06T14:54:34.000Z (about 11 years ago)
- Last Synced: 2023-03-13T03:37:36.089Z (almost 2 years ago)
- Language: PHP
- Size: 44 MB
- Stars: 1
- Watchers: 5
- Forks: 3
- Open Issues: 1
-
Metadata Files:
- Readme: README.md
Awesome Lists containing this project
README
biostor-jats
============Experiments in marking up BioStor articles up using Journal Archiving Tag Set (JATS; formerly NLM DTD).
Idea is to create JATS marked-up XML for BioStor articles. Initially simply article-level metadata and links to page scans, we then add content based on analysis of the page scans and associated DjVu and ABBYY XML.
BioStor provides starting point in form of archive with article metadata in JATS XML format, images (B&W and original), and DjVu and ABBYY OCR XML for article pages.
The scripts here then extract text to create hOCR files, which can then be used to generate a PDF with searchable text, and extract figures. Here is a [live example](https://dl.dropboxusercontent.com/u/639486/65706/65706.html).
## Scripts
PHP scripts in the tools directory are used to extract and add content to the base provided by BioStor.djvu2html converts DjVu XML to HTML including hOCR tags, see [The hOCR Embedded OCR Workflow and Output Format](https://docs.google.com/document/d/1QQnIQtvdAC_8n92-LhwPcjtAUFwBlzE8EWnKAxlgVf0).
php tools/djvu2html.php examples/65706
abby_pictures extracts picture and table blocks from ABBYY OCR, extracts corresponding part of image and puts these in the "figures" folder. It analyses the colours in the images to decide whether to use the B&W or original image for the figure, then adds links to the JATS XML.
php tools/abbyy_pictures.php examples/65706
hocr2pdf extracts text from the HTML files, combines it with the page scans and uses FPF to generate a PDF with searchable text.
php tools/hocr2pdf.php examples/65706
## Stylesheet
XSLT style sheets are used to display the article.