https://github.com/michmech/lemmatization-lists
Machine-readable lists of lemma-token pairs in 23 languages.
- Host: GitHub
- URL: https://github.com/michmech/lemmatization-lists
- Owner: michmech
- License: odbl-1.0
- Created: 2018-05-11T15:57:01.000Z (about 7 years ago)
- Default Branch: master
- Last Pushed: 2022-01-29T10:34:31.000Z (over 3 years ago)
- Last Synced: 2023-11-07T18:00:13.067Z (over 1 year ago)
- Topics: lemmatization, nlp
- Size: 21.5 MB
- Stars: 259
- Watchers: 17
- Forks: 93
- Open Issues: 6
Metadata Files:
- Readme: README.md
README
# Lemmatization Lists
These are large-coverage, machine-readable lemma/token pairs in several languages, which I have collected (legally) from various sources, mostly as part of my work on the Global Glossary project. I use them for query expansion during full-text searches: if a user searches for the lemma walk, the query is expanded to also search for the tokens walking, walked, etc.
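The query expansion described above can be sketched in a few lines of Python. This is only an illustration, not the code used in the Global Glossary project: the function names `build_index` and `expand_query` and the sample pairs are hypothetical, and the pairs are assumed to be already loaded into memory.

```python
from collections import defaultdict


def build_index(pairs):
    """Map each lemma to the set of tokens listed for it."""
    index = defaultdict(set)
    for lemma, token in pairs:
        index[lemma].add(token)
    return index


def expand_query(term, index):
    """Return the search term followed by its known tokens, sorted."""
    tokens = index.get(term, set())
    return [term] + sorted(tokens - {term})


# Hypothetical sample data; real pairs would come from one of the list files.
pairs = [("walk", "walking"), ("walk", "walked"), ("walk", "walks")]
index = build_index(pairs)
print(expand_query("walk", index))  # ['walk', 'walked', 'walking', 'walks']
```

A term with no entry in the index is returned unchanged, so the search still runs for words the list does not cover.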
These are plain text files (zipped). Each line contains one lemma/token pair separated by a tab character in this sequence: lemma, tab, token. The files are encoded in UTF-8 with Windows-style line breaks.
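As a minimal sketch of reading this format in Python, the generator below splits each line at the first tab and strips the Windows-style line ending. The name `read_pairs` and the file name in the comment are hypothetical, and skipping lines without a tab is an assumption about how to treat malformed input, not something the lists require.

```python
import io


def read_pairs(stream):
    """Yield (lemma, token) tuples from a stream of 'lemma<TAB>token' lines."""
    for line in stream:
        line = line.rstrip("\r\n")  # drop the CRLF (or LF) line ending
        if not line:
            continue
        lemma, sep, token = line.partition("\t")
        if not sep:
            continue  # assumption: ignore lines that contain no tab
        yield lemma, token


# With an unzipped file this might look like (hypothetical file name):
#   with open("lemmatization-en.txt", encoding="utf-8", newline="") as f:
#       pairs = list(read_pairs(f))
sample = io.StringIO("walk\twalking\r\nwalk\twalked\r\n", newline="")
print(list(read_pairs(sample)))  # [('walk', 'walking'), ('walk', 'walked')]
```

Opening the file with `encoding="utf-8-sig"` instead would also tolerate a byte-order mark, should a file begin with one.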
- Asturian (ast) (108,792 pairs)
- Bulgarian (bg) (30,323 pairs)
- Catalan (ca) (591,534 pairs)
- Czech (cs) (36,400 pairs)
- English (en) (41,760 pairs)
- Estonian (et) (80,536 pairs)
- French (fr) (224,002 pairs)
- Galician (gl) (392,856 pairs)
- German (de) (358,473 pairs)
- Hungarian (hu) (39,898 pairs)
- Irish (ga) (415,502 pairs)
- Manx Gaelic (gv) (67,177 pairs)
- Italian (it) (341,074 pairs)
- Persian/Farsi (fa) (6,273 pairs)
- Polish (pl) (3,296,232 pairs)
- Portuguese (pt) (850,264 pairs)
- Romanian (ro) (314,810 pairs)
- Russian (ru) (537,810 pairs)
- Scottish Gaelic (gd) (51,624 pairs)
- Slovak (sk) (858,414 pairs)
- Slovene (sl) (99,063 pairs)
- Spanish (es) (497,560 pairs)
- Swedish (sv) (675,137 pairs)
- Ukrainian (uk) (193,703 pairs)
- Welsh (cy) (359,224 pairs)

Licence
- Available under the [Open Database License](http://opendatacommons.org/licenses/odbl/summary/)
Sources
- [Various Hunspell dictionaries](http://extensions.services.openoffice.org/en/dictionaries) from the OpenOffice.org website
- [Deutsches Morphologie-Lexikon](http://www.danielnaber.de/morphologie/) by Daniel Naber
- [Lexique](http://www.lexique.org/) by Boris New and Christophe Pallier
- [e_lemma.txt](http://www.lexically.net/downloads/BNC_wordlists/e_lemma.txt) by Yasumasa Someya
- [Multext East](http://nl.ijs.si/ME/) (only those morphological lexicons that are under a free licence are used)
- Morphological dictionaries from [FreeLing](http://nlp.lsi.upc.edu/freeling/index.php)
- [SALDO](http://spraakbanken.gu.se/eng/saldo) morphological lexicon
- [Irish National Morphology Database](http://www.teanglann.ie/en/gram/_download)
- Various lists by [Kevin Scannell](https://cadhan.com/)
- [OpenRussian.org](https://en.openrussian.org/)