https://github.com/operavaria/huncor2vec
Automation tools for the Hungarian Webcorpus 2.0
digital-humanities hungarian-language linguistics vectorization word2vec
Last synced: 3 months ago
- Host: GitHub
- URL: https://github.com/operavaria/huncor2vec
- Owner: OperaVaria
- License: gpl-3.0
- Created: 2024-06-25T15:40:33.000Z (11 months ago)
- Default Branch: main
- Last Pushed: 2024-07-17T13:04:36.000Z (10 months ago)
- Last Synced: 2024-12-28T00:42:49.264Z (5 months ago)
- Topics: digital-humanities, hungarian-language, linguistics, vectorization, word2vec
- Language: Python
- Homepage: https://github.com/OperaVaria/huncor2vec
- Size: 82 KB
- Stars: 0
- Watchers: 2
- Forks: 0
- Open Issues: 0
Metadata Files:
- Readme: docs/README.md
- License: COPYING.md
README
# HunCor2Vec
This lightweight Python app provides automation tools to retrieve
material from the [Hungarian Webcorpus 2.0](https://hlt.bme.hu/en/resources/webcorpus2), train a Word2Vec
model on those texts, and evaluate the results. The app features an easy-to-use command line menu structure,
implemented with the [pick](https://github.com/aisk/pick) package. Training and querying tasks use the [gensim](https://github.com/piskvorky/gensim) library's Word2Vec module.
Available tools:
- Webcorpus 2.0 Scraper and Webcorpus 2.0 Downloader: retrieve all file links and automate the entire corpus file download process.
- Word2Vec Trainer: easily train a Word2Vec model with any plain-text or CoNLL-U formatted, multi-file corpus. Saving and resuming is supported.
- Word2Vec Query: evaluate the trained model with the most common methods.

Tested on: Windows 11 and Lubuntu 22.04 LTS with Python version 3.10.11.
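The Word2Vec Trainer listed above reads a multi-file corpus in either plain-text or CoNLL-U format. The following is a minimal sketch of how such a corpus could be streamed sentence by sentence, not the app's actual implementation. The names `parse_conllu` and `CorpusSentences` are hypothetical, and the code assumes the standard CoNLL-U layout (tab-separated columns, token ID in column 0, word form in column 1, lemma in column 2, `#` comment lines, blank lines between sentences). The Webcorpus 2.0 files may use a different column order, so check the `column` argument against the actual data.

```python
from pathlib import Path
from typing import Iterable, Iterator, List


def parse_conllu(lines: Iterable[str], column: int = 1) -> Iterator[List[str]]:
    """Yield one list of lowercased tokens per sentence.

    Assumes standard CoNLL-U: column 1 is the word form, column 2 the lemma.
    """
    sentence: List[str] = []
    for raw in lines:
        line = raw.rstrip("\n")
        if not line.strip():
            # A blank line closes the current sentence.
            if sentence:
                yield sentence
                sentence = []
            continue
        if line.startswith("#"):
            continue  # comment or metadata line
        fields = line.split("\t")
        token_id = fields[0]
        # Skip multiword token ranges ("1-2") and empty nodes ("1.1").
        if "-" in token_id or "." in token_id:
            continue
        if column < len(fields):
            sentence.append(fields[column].lower())
    if sentence:
        yield sentence  # file did not end with a blank line


class CorpusSentences:
    """Restartable iterable over all sentences in a directory of corpus files."""

    def __init__(self, corpus_dir: str, conllu: bool = True, pattern: str = "*"):
        self.corpus_dir = Path(corpus_dir)
        self.conllu = conllu
        self.pattern = pattern  # hypothetical glob filter, e.g. "*.tsv"

    def __iter__(self) -> Iterator[List[str]]:
        for path in sorted(self.corpus_dir.glob(self.pattern)):
            if not path.is_file():
                continue
            with path.open(encoding="utf-8") as handle:
                if self.conllu:
                    yield from parse_conllu(handle)
                else:
                    # Plain text: treat each non-empty line as one sentence.
                    for line in handle:
                        tokens = line.lower().split()
                        if tokens:
                            yield tokens
```

A class is used rather than a plain generator because gensim's `Word2Vec` iterates over its `sentences` argument more than once (one pass to build the vocabulary, then one per training epoch), and a generator would be exhausted after the first pass. An instance such as `CorpusSentences("corpus/")` could be passed as that argument; the directory name here is only an example.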
---
**[Contact](mailto:[email protected])**
[License: GPL-3.0](https://www.gnu.org/licenses/gpl-3.0)