https://github.com/bertsky/dta-lexdb-applications

formatting and integrating the Deutsches Textarchiv dictionary into various applications

Host: GitHub
URL: https://github.com/bertsky/dta-lexdb-applications
Owner: bertsky
Created: 2024-01-29T00:08:36.000Z (10 months ago)
Default Branch: master
Last Pushed: 2024-03-01T23:27:32.000Z (9 months ago)
Last Synced: 2024-03-02T00:31:23.357Z (9 months ago)
Topics: dta
Language: Makefile
Homepage:
Size: 33.2 KB
Stars: 2
Watchers: 1
Forks: 0
Open Issues: 1
Metadata Files:
- Readme: README.md

README

# dta-lexdb-applications

[![CD](https://github.com/bertsky/dta-lexdb-applications/actions/workflows/makefile.yml/badge.svg)](https://github.com/bertsky/dta-lexdb-applications/actions/workflows/makefile.yml)

> formatting and integrating the Deutsches Textarchiv dictionary into various applications

[Deutsches Textarchiv](https://www.deutsches-textarchiv.de) (DTA) is a large collection of curated and manually corrected
reference corpora in New High German from the 17th to 20th century.

[LexDB](https://www.dwds.de/r/lexdb) is a collection of lexical databases (i.e. dictionaries) distilled from the DTA
by the BBAW. They include full forms, lemmatization, normalized orthography and part of speech.

This repository provides scripts to extract and re-format dictionaries for re-use in other applications.
The results will be available as GitHub release assets.

## Tesseract OCR models with added language model

[Tesseract](https://tesseract-ocr.github.io/) models (both the originally provided ones, trained on
synthetic data, and the community-generated ones, fine-tuned on annotated scan data or trained from scratch)
can be extended with a simple language model by providing dictionaries/grammars for punctuation, numbers and words.
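
As a rough illustration of that mechanism, the sketch below assembles the Tesseract training-tool invocations that would swap a new word list into an existing LSTM model. It only builds the command lines and does not run them. The file names (`example.traineddata`, `dta.wordlist`, `build`) and the function name are hypothetical, and the sketch assumes an LSTM model whose unpacked components are named `lstm-unicharset` and `lstm-word-dawg`. It is not necessarily what this repository's Makefile does.

```python
from __future__ import annotations

import shlex
from pathlib import Path


def language_model_commands(model: Path, wordlist: Path, workdir: Path) -> list[list[str]]:
    """Build (but do not run) the commands for replacing a model's word dictionary.

    Assumption: `model` is an LSTM traineddata file, so its components unpack
    to `<prefix>lstm-unicharset`, `<prefix>lstm-word-dawg`, and so on.
    """
    lang = model.stem  # "example.traineddata" -> "example"
    prefix = str(workdir / lang) + "."
    return [
        # 1. Unpack all components of the model into workdir.
        ["combine_tessdata", "-u", str(model), prefix],
        # 2. Compile the plain-text word list into a DAWG, using the
        #    model's own character set.
        ["wordlist2dawg", str(wordlist),
         prefix + "lstm-word-dawg", prefix + "lstm-unicharset"],
        # 3. Overwrite the word DAWG component inside the model file.
        ["combine_tessdata", "-o", str(model), prefix + "lstm-word-dawg"],
    ]


if __name__ == "__main__":
    # Hypothetical file names, for illustration only.
    for cmd in language_model_commands(
        Path("example.traineddata"), Path("dta.wordlist"), Path("build")
    ):
        print(shlex.join(cmd))
```

The same pattern would apply to the punctuation and number components, with `lstm-punc-dawg` and `lstm-number-dawg` in place of `lstm-word-dawg`.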

We will pick publicly available models for German Antiqua and Fraktur prints, as well as handwriting,
and republish them with the DTA dictionary as their language model.

For currently selected models, see https://github.com/bertsky/dta-lexdb-applications/blob/83e5d5c3404da3b14886fe5eeed044ee1f630bdd/Makefile#L13-L34

## Hunspell

[Hunspell](http://hunspell.github.io/) is a widely used dictionary-based, morphology-aware spell checker.

We will produce a DTA dictionary for it.
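
For orientation, a Hunspell dictionary consists of a `.dic` file (an approximate word count on the first line, then one entry per line) and an `.aff` file with settings and affix rules. The sketch below writes the most minimal such pair from a list of full forms. The function name and the sample words are illustrative assumptions, not taken from LexDB, and the sketch writes no affix rules at all, so it does not reflect the rules selected in this repository.

```python
from __future__ import annotations

from pathlib import Path
from typing import Iterable


def write_hunspell_dictionary(words: Iterable[str], basename: Path) -> tuple[Path, Path]:
    """Write `<basename>.dic` and `<basename>.aff` from plain full forms.

    Simplifications (assumptions of this sketch): entries carry no affix
    flags, and words containing "/" are skipped, because "/" separates a
    word from its flags in .dic files.
    """
    entries = sorted({w.strip() for w in words if w.strip() and "/" not in w})

    dic_path = Path(str(basename) + ".dic")
    aff_path = Path(str(basename) + ".aff")

    # First line: number of entries; then one word per line.
    dic_path.write_text(
        f"{len(entries)}\n" + "".join(entry + "\n" for entry in entries),
        encoding="utf-8",
    )
    # Minimal affix file: only declares the character encoding.
    aff_path.write_text("SET UTF-8\n", encoding="utf-8")
    return dic_path, aff_path


if __name__ == "__main__":
    # Illustrative historical spellings, not real LexDB output.
    write_hunspell_dictionary(["seyn", "Thür", "seyn"], Path("dta_example"))
```

A full-form list like this is simple but large. Affix rules in the `.aff` file are what let Hunspell cover inflected forms with far fewer entries.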

For currently selected rules, see https://github.com/bertsky/dta-lexdb-applications/blob/83e5d5c3404da3b14886fe5eeed044ee1f630bdd/Makefile#L60-L63

## ...

Others to come. Please raise an issue if you have ideas!