Ecosyste.ms: Awesome

An open API service indexing awesome lists of open source software.

https://github.com/ndlrf-rnd/awesome-digitization

A list of lists of resources related to heritage digitization programs and tech

List: awesome-digitization


Host: GitHub
URL: https://github.com/ndlrf-rnd/awesome-digitization
Owner: ndlrf-rnd
License: mit
Created: 2020-12-12T23:30:06.000Z (about 4 years ago)
Default Branch: main
Last Pushed: 2020-12-13T01:44:14.000Z (about 4 years ago)
Last Synced: 2024-12-03T00:02:37.375Z (3 months ago)
Size: 2.22 MB
Stars: 2
Watchers: 2
Forks: 1
Open Issues: 0
Metadata Files:
- Readme: README.md
- License: LICENSE

Awesome Lists containing this project

ultimate-awesome - awesome-digitization - A list of lists of resources related to heritage digitization programs and tech. (Other Lists / Julia Lists)

README

# Awesome digitization by NDL RF

A curated list of reading materials, code, data, and other materials related to the [Russian Digital Library (НЭБ)](https://rusneb.ru).

## Home materials

* [Suite of software solutions and workflow for digitizing publications. February 2020, Russian State Library (ЛИР, ОИЗ)](docs/ndl-rf-digitization-concept-2020-02.pdf) - Russian Digital Library digitization concept slides (in Russian).

* [Conceptual high-level schema of the human-operator feedback loop and mathematical model lifecycles for generic digitization pipelines](doc/ndl-rf-dpub-reconstruction-pipeline-2020-09-02.pdf) (English titles)

## Pipelines

* __dhSegment__ - A generic deep-learning approach for document segmentation

  - [![DOI](https://zenodo.org/badge/DOI/10.1109/ICFHR-2018.2018.00011.svg) dhSegment: A generic deep-learning approach for document segmentation](https://doi.org/10.1109/ICFHR-2018.2018.00011) - pipeline description.

  

  - [GitHub src](https://github.com/dhlab-epfl/dhSegment)

  - [NDL RF in-house fork](https://github.com/ndlrf-rnd/dhSegment)

* Bavarian State Library

  * [Thomas Wolf, Digitization Center of the Bavarian State Library. 2014. Digitization at the Bavarian State Library.](https://www.digitale-sammlungen.de/content/dokumente/2014-02-19_MDZ_Wolf.pdf) (in Russian)

  * [Marcus Bitzl and Ralf Eichinger, DB/MDZ/IWA, 04.04.2019. Software-Development at the MDZ](https://www.digitale-sammlungen.de/content/dokumente/MDZ-Software-Entwicklung_und_IIIF.pdf)

  * [OCR-D system description](https://ocr-d.github.io/en/)

  * [Transcription Guidelines for Ground Truth](https://ocr-d.de/gt//trans_documentation/index.html)

  * [Recommended third party Ground-Truth datasets](https://github.com/cneud/ocr-gt/blob/master/ocr-gt.yml)

* [PRImA Toolchain](https://www.primaresearch.org/tools)

* [Europeana.eu](https://www.europeana.eu)

    - [Issue 13: OCR. EuropeanaTech Insight is a multimedia publication about R&D developments by the EuropeanaTech Community. Gregory Markus. Posted on Wednesday July 31, 2019, Europeana PRO - Tech](https://pro.europeana.eu/page/issue-13-ocr) - About the PIVAJ Europeana newspapers article extraction pipeline.

    - [![DOI](https://zenodo.org/badge/DOI/10.2759/524581.svg) Europeana strategy 2020-2025 - EMPOWERING DIGITAL CHANGE.](https://doi.org/10.2759/524581) (ISBN 978-92-76-17398-4)

## Segmentation and scene reading

* [Sumit Saha. December 15 2018. A Comprehensive Guide to Convolutional Neural Networks — the ELI5 way](https://towardsdatascience.com/a-comprehensive-guide-to-convolutional-neural-networks-the-eli5-way-3bd2b1164a53)

* ICDAR 20xx conferences (document analysis and recognition):

  * [ICDAR 2019](http://icdar2019.org/) [ICDAR 2019 Proceedings](https://dblp.org/db/conf/icdar/icdar2019.html) (ISBN: 978-1-7281-3014-9)

  * [Full list of papers](https://dblp1.uni-trier.de/db/conf/icdar/)

* [Xiang Bai, Kyoto, November 15. Deep Neural Networks for Scene Text Reading](http://u-pat.org/ICDAR2017/keynotes/ICDAR2017_Keynote_Prof_Bai.pdf)

* [Page Stream Segmentation with Convolutional Neural Nets Combining Textual and Visual Features](https://arxiv.org/abs/1710.03006)

* [![DOI](https://zenodo.org/badge/DOI/10.1007/s10579-019-09476-2.svg) Wiedemann, Gregor & Heyer, Gerhard. (2019). Multi-modal page stream segmentation with convolutional neural networks. Language Resources and Evaluation.](https://doi.org/10.1007/s10579-019-09476-2)

* [Graph-based table segmentation](https://arxiv.org/pdf/1905.13391.pdf)

* Combining Visual and Textual Features for Semantic Segmentation of Historical Newspapers

* [![DOI](https://zenodo.org/badge/DOI/10.1016/j.engappai.2017.08.002.svg) Nikos Vasilopoulos, Ergina Kavallieratou. Complex layout analysis based on contour classification and morphological operations.](https://doi.org/10.1016/j.engappai.2017.08.002)

* [![DOI](https://zenodo.org/badge/DOI/10.5445/IR/1000089239.svg) Chandna, Swati. Published on 01/09/2019. Automatic Layout Analysis and Visual Exploration of Multidimensional Datasets with Applications in the Digital Humanities.](https://doi.org/10.5445/IR/1000089239)

## Editors

* [Structify](http://dbis-halvar.uibk.ac.at/dokuwiki/doku.php?id=main:structify) (METS)

* [Aletheia](https://www.primaresearch.org/tools/Aletheia) (Page.XML)

## OCR

* [Visual guide to how RNNs work](https://distill.pub/2019/memorization-in-rnns/)

* [How Tesseract works](https://machinelearningmedium.com/2019/01/15/breaking-down-tesseract-ocr/)

* [Our Search for the Best OCR Tool, and What We Found](https://source.opennews.org/articles/so-many-ocr-options/) - A side-by-side comparison of seven OCR tools using multiple kinds of documents, from Factful.

  * [Source code](https://github.com/factful/ocr_testing)

## Notable digitization projects


## Datasets

* [Home-grown](https://drive.google.com/drive/folders/1zq_OPCVLRJk9RdhDuE86e26cAvcsMJt0?usp=sharing)

* PRImA lab:

  * [Layout analysis](https://www.primaresearch.org/dataset/)

  * [Europeana newspapers](https://www.primaresearch.org/datasets/ENP)

  * [Impact](https://www.primaresearch.org/datasets/IMPACT_Digitisation)

  * [Others](https://www.primaresearch.org/datasets)

* PubXNet:

  * PubLayNet:

    * [Official page](https://developer.ibm.com/exchanges/data/all/publaynet/)

    * [Mask/Faster R-CNN pre-trained models](https://github.com/ibm-aur-nlp/PubLayNet/tree/master/pre-trained-models)

  * PubTabNet:

    * [Image-based table recognition: data, model, and evaluation](https://arxiv.org/abs/1911.10683)

## Image similarity

* [Image hashing](https://www.digitale-sammlungen.de/content/dokumente/2015-06-24-Wolf_Image_Recognition.pdf)

* [Semantic embeddings](https://www.researchgate.net/figure/Illustration-of-the-proposed-constraints-for-learning-visual-semantic-embeddings_fig2_317356830)

## Text content NER/Topic/Subject/Context

* [Context-Aware Representations for Knowledge Base Relation Extraction (UKP Lab)](https://github.com/UKPLab/emnlp2017-relation-extraction)

* [Europeana NER datasets](https://lab.kb.nl/dataset/europeana-newspapers-ner)

* spaCy:

  * [Project site](https://spacy.io/)

  * [Spacy-ru](https://github.com/buriy/spacy-ru)

* [Tomita parser](https://github.com/yandex/tomita-parser/blob/master/docs/ru/tutorial/README.md)

* [FB Duckling](https://github.com/facebook/duckling)

* [Yandex Mystem (awesome stemmer)](https://yandex.ru/dev/mystem/)

* [OCLC: Library Linked Data in the Cloud](https://www.oclc.org/research/publications/books/library-linked-data-in-the-cloud/chapter1.html)

* [Temporal corpus dynamics](http://ceur-ws.org/Vol-2461/paper_4.pdf)

* [RANLP 2019 Natural Language Processing in a Deep Learning World](http://lml.bas.bg/ranlp2019/proceedings-ranlp-2019.pdf)

## Overview materials on related areas

* [IBM - Scaling AI](https://www.research.ibm.com/artificial-intelligence/publications/2018/download/pdf/scalingAI.pdf)

* [From the library catalogue to the Semantic Web: the conversion process from MARC 21 to BIBFRAME 2.0](https://figshare.com/articles/From_the_library_catalogue_to_the_Semantic_Web_-_The_conversion_process_from_MARC_21_to_BIBFRAME_2_0/6731129/1)

* [Lattice calculations on GPU](https://indico.cern.ch/event/764552/contributions/3428328/attachments/1865778/3067959/lattice_2019.pdf)

## Related tech at the circuit level

* [What is backpropagation really doing? | Deep learning, chapter 3 by Grant Sanderson](https://www.youtube.com/watch?v=Ilg3gGewQ5U)

* About CNN features [![DOI](https://zenodo.org/badge/DOI/10.23915/distill.00007.svg) Olah, et al., "Feature Visualization", Distill, 2017.](https://doi.org/10.23915/distill.00007)