Ecosyste.ms: Awesome
An open API service indexing awesome lists of open source software.
https://github.com/ndlrf-rnd/awesome-digitization
A list of lists of resources related to heritage digitization programs and technology
List: awesome-digitization
- Host: GitHub
- URL: https://github.com/ndlrf-rnd/awesome-digitization
- Owner: ndlrf-rnd
- License: MIT
- Created: 2020-12-12T23:30:06.000Z (about 4 years ago)
- Default Branch: main
- Last Pushed: 2020-12-13T01:44:14.000Z (about 4 years ago)
- Last Synced: 2024-12-03T00:02:37.375Z (19 days ago)
- Size: 2.22 MB
- Stars: 2
- Watchers: 2
- Forks: 1
- Open Issues: 0
Metadata Files:
- Readme: README.md
- License: LICENSE
Awesome Lists containing this project
- ultimate-awesome - awesome-digitization - A list of lists of resources related to heritage digitization programs and technology. (Other Lists / PowerShell Lists)
README
# Awesome digitization by NDL RF
A curated list of reading materials, code, data, and other resources related to the [Russian Digital Library (НЭБ)](https://rusneb.ru).
## Home materials
* [Software solution suite and workflow for digitizing publications. February 2020, Russian State Library (ЛИР, ОИЗ)](docs/ndl-rf-digitization-concept-2020-02.pdf) - Russian Digital Library digitization concept slides (in Russian).
* [Conceptual high-level schema of the human-operator feedback loop and mathematical model lifecycles for generic digitization pipelines](doc/ndl-rf-dpub-reconstruction-pipeline-2020-09-02.pdf) (English titles)
## Pipelines
* __dhSegment__ - A generic deep-learning approach for document segmentation
- [![DOI](https://zenodo.org/badge/DOI/10.1109/ICFHR-2018.2018.00011.svg) dhSegment: A generic deep-learning approach for document segmentation](https://doi.org/10.1109/ICFHR-2018.2018.00011) - pipeline description.
- [GitHub source](https://github.com/dhlab-epfl/dhSegment)
- [NDL RF in-house fork](https://github.com/ndlrf-rnd/dhSegment)
* Bavarian State Library
* [Thomas Wolf, Digitization Center of the Bavarian State Library. 2014. Digitization at the Bavarian State Library.](https://www.digitale-sammlungen.de/content/dokumente/2014-02-19_MDZ_Wolf.pdf) (in Russian)
* [Marcus Bitzl and Ralf Eichinger, DB/MDZ/IWA, 04.04.2019. Software-Development at the MDZ](https://www.digitale-sammlungen.de/content/dokumente/MDZ-Software-Entwicklung_und_IIIF.pdf)
* [OCR-D system description](https://ocr-d.github.io/en/)
* [Transcription Guidelines for Ground Truth](https://ocr-d.de/gt//trans_documentation/index.html)
* [Recommended third-party Ground-Truth datasets](https://github.com/cneud/ocr-gt/blob/master/ocr-gt.yml)
* [PriMa Toolchain](https://www.primaresearch.org/tools)
* [Europeana.eu](https://europeana.ru)
- [Issue 13: OCR. EuropeanaTech Insight is a multimedia publication about R&D developments by the EuropeanaTech Community. Gregory Markus. Posted on Wednesday July 31, 2019, Europeana PRO - Tech](https://pro.europeana.eu/page/issue-13-ocr) - Covers the PIVAJ article extraction pipeline for Europeana newspapers.
- [![DOI](https://zenodo.org/badge/DOI/10.2759/524581.svg) Europeana strategy 2020-2025 - EMPOWERING DIGITAL CHANGE.](https://doi.org/10.2759/524581) (ISBN 978-92-76-17398-4)
## Segmentation and scene reading
* [Sumit Saha. December 15 2018. A Comprehensive Guide to Convolutional Neural Networks — the ELI5 way](https://towardsdatascience.com/a-comprehensive-guide-to-convolutional-neural-networks-the-eli5-way-3bd2b1164a53)
* ICDAR 20xx contest - document analysis and recognition:
* [ICDAR 2019](http://icdar2019.org/) [ICDAR 2019 Proceedings](https://dblp.org/db/conf/icdar/icdar2019.html) (ISBN: 978-1-7281-3014-9)
* [Full whitepapers list](https://dblp1.uni-trier.de/db/conf/icdar/)
* [Xiang Bai, Kyoto, November 15. Deep Neural Networks for Scene Text Reading](http://u-pat.org/ICDAR2017/keynotes/ICDAR2017_Keynote_Prof_Bai.pdf)
* [Page Stream Segmentation with Convolutional Neural Nets Combining Textual and Visual Features](https://arxiv.org/abs/1710.03006)
* [![DOI](https://zenodo.org/badge/DOI/10.1007/s10579-019-09476-2.svg) Wiedemann, Gregor & Heyer, Gerhard. (2019). Multi-modal page stream segmentation with convolutional neural networks. Language Resources and Evaluation.](https://doi.org/10.1007/s10579-019-09476-2)
* [Graph-based table segmentation](https://arxiv.org/pdf/1905.13391.pdf)
* Combining Visual and Textual Features for Semantic Segmentation of Historical Newspapers
* [![DOI](https://zenodo.org/badge/DOI/10.1016/j.engappai.2017.08.002.svg) Nikos Vasilopoulos, Ergina Kavallieratou. Complex layout analysis based on contour classification and morphological operations.](https://doi.org/10.1016/j.engappai.2017.08.002)
* [![DOI](https://zenodo.org/badge/DOI/10.5445/IR/1000089239.svg) Chandna, Swati. Published on 01/09/2019. Automatic Layout Analysis and Visual Exploration of Multidimensional Datasets with Applications in the Digital Humanities.](https://doi.org/10.5445/IR/1000089239)
## Editors
* [Structify](http://dbis-halvar.uibk.ac.at/dokuwiki/doku.php?id=main:structify) (METS)
* [Aletheia](https://www.primaresearch.org/tools/Aletheia) (Page.XML)
## OCR
* [Visual guide to how RNNs work](https://distill.pub/2019/memorization-in-rnns/)
* [How Tesseract works](https://machinelearningmedium.com/2019/01/15/breaking-down-tesseract-ocr/)
* Our Search for the Best OCR Tool, and What We Found
* [A side-by-side comparison of seven OCR tools using multiple kinds of documents, from Factful](https://source.opennews.org/articles/so-many-ocr-options/)
* [Source code](https://github.com/factful/ocr_testing)
## Notable digitization projects
*
*
## Datasets
* [Home-grown](https://drive.google.com/drive/folders/1zq_OPCVLRJk9RdhDuE86e26cAvcsMJt0?usp=sharing)
* PriMa lab:
* [Layout analysis](https://www.primaresearch.org/dataset/)
* [Europeana newspapers](https://www.primaresearch.org/datasets/ENP)
* [Impact](https://www.primaresearch.org/datasets/IMPACT_Digitisation)
* Others: [https://www.primaresearch.org/datasets](https://www.primaresearch.org/datasets)
* PubXNet:
* PubLayNet:
* [Official page](https://developer.ibm.com/exchanges/data/all/publaynet/)
* [Mask R-CNN / Faster R-CNN pre-trained models](https://github.com/ibm-aur-nlp/PubLayNet/tree/master/pre-trained-models)
* PubTabNet:
* [Image-based table recognition: data, model, and evaluation](https://arxiv.org/abs/1911.10683)
## Image similarity
* [Image hashing](https://www.digitale-sammlungen.de/content/dokumente/2015-06-24-Wolf_Image_Recognition.pdf)
* [Semantic embeddings](https://www.researchgate.net/figure/Illustration-of-the-proposed-constraints-for-learning-visual-semantic-embeddings_fig2_317356830)
## Text content NER/Topic/Subject/Context
* [Context-Aware Representations for Knowledge Base Relation Extraction (UKP Lab)](https://github.com/UKPLab/emnlp2017-relation-extraction)
* [Europeana NER datasets](https://lab.kb.nl/dataset/europeana-newspapers-ner)
* spaCy:
* [Project site](https://spacy.io/)
* [Spacy-ru](https://github.com/buriy/spacy-ru)
* [Tomita parser](https://github.com/yandex/tomita-parser/blob/master/docs/ru/tutorial/README.md)
* [FB Duckling](https://github.com/facebook/duckling)
* [Yandex Mystem (awesome stemmer)](https://yandex.ru/dev/mystem/)
* [OCLC: Library Linked Data in the Cloud](https://www.oclc.org/research/publications/books/library-linked-data-in-the-cloud/chapter1.html)
* [Temporal corpus dynamics](http://ceur-ws.org/Vol-2461/paper_4.pdf)
* [RANLP 2019 Natural Language Processing in a Deep Learning World](http://lml.bas.bg/ranlp2019/proceedings-ranlp-2019.pdf)
## Related areas overview materials
* [IBM - Scaling AI](https://www.research.ibm.com/artificial-intelligence/publications/2018/download/pdf/scalingAI.pdf)
* [From the library catalogue to the Semantic Web: the conversion process from MARC 21 to BIBFRAME 2.0](https://figshare.com/articles/From_the_library_catalogue_to_the_Semantic_Web_-_The_conversion_process_from_MARC_21_to_BIBFRAME_2_0/6731129/1)
* [Lattice calculations on GPU](https://indico.cern.ch/event/764552/contributions/3428328/attachments/1865778/3067959/lattice_2019.pdf)
## Related tech at the circuit level
* [What is backpropagation really doing? | Deep learning, chapter 3 by Grant Sanderson](https://www.youtube.com/watch?v=Ilg3gGewQ5U)
* About CNN features [![DOI](https://zenodo.org/badge/DOI/10.23915/distill.00007.svg) Olah, et al., "Feature Visualization", Distill, 2017.](https://doi.org/10.23915/distill.00007)