https://github.com/macbre/faroese-corpus
Some Faroese language statistics taken from fo.wikipedia.org content dump
corpus-linguistics faroe faroese faroese-language linguistic-analysis linguistics python3-script wikipedia-corpus wikipedia-dump
- Host: GitHub
- URL: https://github.com/macbre/faroese-corpus
- Owner: macbre
- License: mit
- Created: 2018-10-26T18:20:50.000Z (over 7 years ago)
- Default Branch: master
- Last Pushed: 2022-12-08T06:37:11.000Z (over 3 years ago)
- Last Synced: 2025-02-22T19:12:54.283Z (over 1 year ago)
- Topics: corpus-linguistics, faroe, faroese, faroese-language, linguistic-analysis, linguistics, python3-script, wikipedia-corpus, wikipedia-dump
- Language: Python
- Homepage:
- Size: 8.79 KB
- Stars: 0
- Watchers: 2
- Forks: 0
- Open Issues: 1
Metadata Files:
- Readme: README.md
- License: LICENSE
README
# faroese-corpus
Faroese corpus taken from Wikipedia dumps.
This repository will contain a corpus of the Faroese language taken from [the content dump](https://dumps.wikimedia.org/fowikisource/latest/) of [Faroese Wikipedia](https://fo.wikipedia.org).
## `pipenv`
This project uses `pipenv`. [How to install `pipenv`](https://pipenv.readthedocs.io/en/latest/install/#pragmatic-installation-of-pipenv).
## Dependencies
To read 7zip archives (used by Wikia's XML dumps), install [`libarchive`](http://libarchive.org/) alongside the Python dependencies:
```
pipenv install
sudo apt install libarchive-dev
```
## Links
* [FTS - Färöisk textsamling](https://spraakbanken.gu.se/korp/?mode=faroe)
* [Current XML dump](https://dumps.wikimedia.org/fowikisource/latest/fowikisource-latest-pages-meta-current.xml.bz2) (~14 MB)
* [MediaWiki XML dump format](https://www.mediawiki.org/wiki/Help:Export#Export_format)
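
For orientation, here is a minimal sketch of streaming page titles and text out of a bz2-compressed dump in the MediaWiki XML export format linked above. It uses only the Python standard library. The function name `iter_page_texts` is hypothetical and is not part of this repository; the sketch also assumes one revision per page, as in a `pages-meta-current` dump.

```python
import bz2
import xml.etree.ElementTree as ET


def iter_page_texts(source):
    """Yield (title, text) for each <page> in a bz2-compressed MediaWiki XML dump.

    `source` may be a file path or a binary file object.
    Hypothetical helper for illustration; not the repository's own code.
    """
    with bz2.open(source, "rb") as stream:
        title, text = None, ""
        for _event, elem in ET.iterparse(stream, events=("end",)):
            # Tags carry an XML namespace that differs between export versions,
            # so compare only the local part of the tag name.
            tag = elem.tag.rsplit("}", 1)[-1]
            if tag == "title":
                title = elem.text
            elif tag == "text":
                text = elem.text or ""
            elif tag == "page":
                yield title, text
                title, text = None, ""
                # Free the memory held by the finished page element.
                elem.clear()
```

Parsing incrementally with `iterparse` keeps memory use flat, which matters more for larger wikis than for a dump of roughly 14 MB.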
## Scripts
Run `pipenv shell` before running the scripts below.
### `words_from_dump.py`
Shows the longest words taken from the dump:
```
1 llanfairpwllgwyngyllgogerychwyrndrobwllllantysiliogogogoch - 58
2 samvinnufelagiðsamvinnufelagnum - 31
3 krabbameinsgranskingarstovnurin - 31
4 southernplayalisticadillacmuzik - 31
5 barnabókavirðislønavinnararnar - 30
6 norðurlandameistarakappingini - 29
7 sjónvarpsundirhaldssendingini - 29
8 bókmentakritikaraheiðurslønir - 29
9 einstaklingaítróttargreinunum - 29
10 vegsúkklukappingarmeistaranum - 29
```
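
A ranking like the one above could be produced along these lines. This is only a sketch, not the actual contents of `words_from_dump.py`: the helper `longest_words`, the letters-only tokenisation, the lowercasing, and the alphabetical tie-break are all assumptions made for illustration.

```python
import re
import unicodedata

# Runs of letters only (\w without digits and underscore), so Faroese
# letters such as ð, ø and á stay inside a word.
WORD_RE = re.compile(r"[^\W\d_]+")


def longest_words(text, limit=10):
    """Return up to `limit` distinct lowercased words, longest first.

    Hypothetical helper; ties are broken alphabetically, which is an
    assumption and may differ from the script's real ordering.
    """
    # Compose accents so that "á" counts as one character, not two.
    text = unicodedata.normalize("NFC", text)
    words = {match.group(0).lower() for match in WORD_RE.finditer(text)}
    return sorted(words, key=lambda w: (-len(w), w))[:limit]


if __name__ == "__main__":
    sample = "Bókmentakritikaraheiðurslønir og barnabókavirðislønavinnararnar eru long orð."
    for rank, word in enumerate(longest_words(sample, limit=3), start=1):
        print(f"{rank} {word} - {len(word)}")
```

The first entry in the listing above is a Welsh place name and the fourth an English album title, so a real corpus would likely need a language filter on top of a length ranking like this.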