https://github.com/ndlrf-rnd/awesome-digitization

A list of resources related to heritage digitization programs and technology.

# Awesome digitization by NDL RF

A curated list of reading materials, code, data, and other related resources from the [Russian Digital Library (НЭБ)](https://rusneb.ru).

## Home materials

* [Software suite and workflow for digitizing publications. February 2020, Russian State Library (ЛИР, ОИЗ)](docs/ndl-rf-digitization-concept-2020-02.pdf) - Russian Digital Library digitization concept slides (in Russian).

* [Conceptual high-level schema of human operators feedback loop and math models lifecycles for generic digitization pipelines](doc/ndl-rf-dpub-reconstruction-pipeline-2020-09-02.pdf) (English titles)

## Pipelines

* __dhSegment__ - A generic deep-learning approach for document segmentation

  - [![DOI](https://zenodo.org/badge/DOI/10.1109/ICFHR-2018.2018.00011.svg) dhSegment: A generic deep-learning approach for document segmentation](https://doi.org/10.1109/ICFHR-2018.2018.00011) - pipeline description.

  - [GitHub source](https://github.com/dhlab-epfl/dhSegment)

  - [NDL RF in-house fork](https://github.com/ndlrf-rnd/dhSegment)

* Bavarian State Library

  * [Thomas Wolf, Digitization Center of the Bavarian State Library. 2014. Digitization at the Bavarian State Library.](https://www.digitale-sammlungen.de/content/dokumente/2014-02-19_MDZ_Wolf.pdf) (in Russian)

  * [Marcus Bitzl and Ralf Eichinger, DB/MDZ/IWA, 04.04.2019. Software-Development at the MDZ](https://www.digitale-sammlungen.de/content/dokumente/MDZ-Software-Entwicklung_und_IIIF.pdf)

* [OCR-D system description](https://ocr-d.github.io/en/)
* [Transcription Guidelines for Ground Truth](https://ocr-d.de/gt//trans_documentation/index.html)
* [Recommended third party Ground-Truth datasets](https://github.com/cneud/ocr-gt/blob/master/ocr-gt.yml)

* [PriMa Toolchain](https://www.primaresearch.org/tools)

* [Europeana.eu](https://www.europeana.eu)
  - [Issue 13: OCR. EuropeanaTech Insight is a multimedia publication about R&D developments by the EuropeanaTech Community. Gregory Markus. Posted on Wednesday July 31, 2019, Europeana PRO - Tech](https://pro.europeana.eu/page/issue-13-ocr) - About PIVAJ Europeana newspapers article extraction pipeline.
  - [![DOI](https://zenodo.org/badge/DOI/10.2759/524581.svg) Europeana strategy 2020-2025 - EMPOWERING DIGITAL CHANGE.](https://doi.org/10.2759/524581) (ISBN 978-92-76-17398-4)

## Segmentation and scene reading

* [Sumit Saha. December 15 2018. A Comprehensive Guide to Convolutional Neural Networks — the ELI5 way](https://towardsdatascience.com/a-comprehensive-guide-to-convolutional-neural-networks-the-eli5-way-3bd2b1164a53)

* ICDAR 20xx contest - document analysis and recognition:

  * [ICDAR 2019](http://icdar2019.org/), [ICDAR 2019 Proceedings](https://dblp.org/db/conf/icdar/icdar2019.html) (ISBN: 978-1-7281-3014-9)

  * [Full list of ICDAR proceedings](https://dblp1.uni-trier.de/db/conf/icdar/)

* [Xiang Bai, Kyoto, November 15. Deep Neural Networks for Scene Text Reading](http://u-pat.org/ICDAR2017/keynotes/ICDAR2017_Keynote_Prof_Bai.pdf)

* [Page Stream Segmentation with Convolutional Neural Nets Combining Textual and Visual Features](https://arxiv.org/abs/1710.03006)

* [![DOI](https://zenodo.org/badge/DOI/10.1007/s10579-019-09476-2.svg) Wiedemann, Gregor & Heyer, Gerhard. (2019). Multi-modal page stream segmentation with convolutional neural networks. Language Resources and Evaluation.](https://doi.org/10.1007/s10579-019-09476-2)

* [Graph-based table segmentation](https://arxiv.org/pdf/1905.13391.pdf)

* Combining Visual and Textual Features for Semantic Segmentation of Historical Newspapers

* [![DOI](https://zenodo.org/badge/DOI/10.1016/j.engappai.2017.08.002.svg) Nikos Vasilopoulos, Ergina Kavallieratou. Complex layout analysis based on contour classification and morphological operations.](https://doi.org/10.1016/j.engappai.2017.08.002)

* [![DOI](https://zenodo.org/badge/DOI/10.5445/IR/1000089239.svg) Chandna, Swati. Published on 01/09/2019. Automatic Layout Analysis and Visual Exploration of Multidimensional Datasets with Applications in the Digital Humanities.](https://doi.org/10.5445/IR/1000089239)

## Editors

* [Structify](http://dbis-halvar.uibk.ac.at/dokuwiki/doku.php?id=main:structify) (METS)

* [Aletheia](https://www.primaresearch.org/tools/Aletheia) (Page.XML)

## OCR

* [Visual guide to how RNNs work](https://distill.pub/2019/memorization-in-rnns/)

* [How Tesseract works](https://machinelearningmedium.com/2019/01/15/breaking-down-tesseract-ocr/)

* Our Search for the Best OCR Tool, and What We Found

  * [A side-by-side comparison of seven OCR tools using multiple kinds of documents, from Factful](https://source.opennews.org/articles/so-many-ocr-options/)

  * [Source code](https://github.com/factful/ocr_testing)

## Notable digitization projects

*

*

## Datasets

* [Home-grown](https://drive.google.com/drive/folders/1zq_OPCVLRJk9RdhDuE86e26cAvcsMJt0?usp=sharing)

* PriMa lab:

  * [Layout analysis](https://www.primaresearch.org/dataset/)

  * [Europeana newspapers](https://www.primaresearch.org/datasets/ENP)

  * [Impact](https://www.primaresearch.org/datasets/IMPACT_Digitisation)

  * Others: [https://www.primaresearch.org/datasets](https://www.primaresearch.org/datasets)

* PubXNet:

  * PubLayNet:

    * [Official page](https://developer.ibm.com/exchanges/data/all/publaynet/)

    * [Mask R-CNN and Faster R-CNN pre-trained models](https://github.com/ibm-aur-nlp/PubLayNet/tree/master/pre-trained-models)

  * PubTabNet:

    * [Image-based table recognition: data, model, and evaluation](https://arxiv.org/abs/1911.10683)

## Image similarity

* [Image hashing](https://www.digitale-sammlungen.de/content/dokumente/2015-06-24-Wolf_Image_Recognition.pdf)

* [Semantic embeddings](https://www.researchgate.net/figure/Illustration-of-the-proposed-constraints-for-learning-visual-semantic-embeddings_fig2_317356830)

## Text content NER/Topic/Subject/Context

* [Context-Aware Representations for Knowledge Base Relation Extraction (UKP Lab)](https://github.com/UKPLab/emnlp2017-relation-extraction)

* [Europeana NER datasets](https://lab.kb.nl/dataset/europeana-newspapers-ner)

* spaCy:

  * [Project site](https://spacy.io/)

  * [Spacy-ru](https://github.com/buriy/spacy-ru)

* [Tomita parser](https://github.com/yandex/tomita-parser/blob/master/docs/ru/tutorial/README.md)

* [FB Duckling](https://github.com/facebook/duckling)

* [Yandex Mystem (awesome stemmer)](https://yandex.ru/dev/mystem/)

* [OCLC: Library Linked Data in the Cloud](https://www.oclc.org/research/publications/books/library-linked-data-in-the-cloud/chapter1.html)

* [Temporal corpus dynamics](http://ceur-ws.org/Vol-2461/paper_4.pdf)

* [RANLP 2019 Natural Language Processing in a Deep Learning World](http://lml.bas.bg/ranlp2019/proceedings-ranlp-2019.pdf)

## Related areas overview materials

* [IBM - Scaling AI](https://www.research.ibm.com/artificial-intelligence/publications/2018/download/pdf/scalingAI.pdf)

* [From the library catalogue to the Semantic Web: the conversion process from MARC 21 to BIBFRAME 2.0](https://figshare.com/articles/From_the_library_catalogue_to_the_Semantic_Web_-_The_conversion_process_from_MARC_21_to_BIBFRAME_2_0/6731129/1)

* [Lattice calculations on GPU](https://indico.cern.ch/event/764552/contributions/3428328/attachments/1865778/3067959/lattice_2019.pdf)

## Related tech at the circuits level

* [What is backpropagation really doing? | Deep learning, chapter 3 by Grant Sanderson](https://www.youtube.com/watch?v=Ilg3gGewQ5U)

* About CNN features [![DOI](https://zenodo.org/badge/DOI/10.23915/distill.00007.svg) Olah, et al., "Feature Visualization", Distill, 2017.](https://doi.org/10.23915/distill.00007)