https://github.com/operavaria/huncor2vec

Automation tools for the Hungarian Webcorpus 2.0

digital-humanities hungarian-language linguistics vectorization word2vec


# HunCor2Vec

This lightweight Python app provides automation tools to
retrieve material from the [Hungarian Webcorpus 2.0](https://hlt.bme.hu/en/resources/webcorpus2), train a Word2Vec
model on those texts, and evaluate the results.

The app features an easy-to-use command-line menu structure,
implemented with the [pick](https://github.com/aisk/pick) package.

Training and querying tasks utilize the [gensim](https://github.com/piskvorky/gensim) library's Word2Vec module.
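As a rough illustration of what training, resuming, and querying with gensim can look like, here is a minimal sketch. It is not taken from the app's source: the function names `train_or_resume` and `nearest_words`, the model path, and the hyperparameter values are all assumptions chosen for the example. Only the gensim 4.x calls themselves (`Word2Vec`, `save`, `load`, `build_vocab`, `train`, `wv.most_similar`) are real library API.

```python
from pathlib import Path


def train_or_resume(sentences, model_path, **params):
    """Train a new Word2Vec model, or continue training a saved one.

    `sentences` must be a restartable iterable of token lists (a list works).
    `params` are passed to gensim's Word2Vec constructor for a new model.
    """
    # gensim is a third-party dependency; it is imported here so the
    # sketch only needs it when training actually runs.
    from gensim.models import Word2Vec

    model_path = Path(model_path)
    if model_path.exists():
        # Resume: load the saved model, extend its vocabulary with the
        # new material, then train further on it.
        model = Word2Vec.load(str(model_path))
        model.build_vocab(sentences, update=True)
        model.train(
            sentences,
            total_examples=model.corpus_count,
            epochs=model.epochs,
        )
    else:
        # Fresh start: the constructor builds the vocabulary and trains.
        model = Word2Vec(sentences=sentences, **params)

    model.save(str(model_path))
    return model


def nearest_words(model, word, topn=5):
    """Return (word, cosine similarity) pairs, or [] for unknown words."""
    if word not in model.wv.key_to_index:
        return []
    return model.wv.most_similar(word, topn=topn)
```

A call such as `train_or_resume(sentences, "huncor.model", vector_size=100, window=5, min_count=5, workers=4)` would create the model on the first run and continue training it on later runs; the file name and parameter values here are placeholders, not the app's defaults.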

Available tools:

- Webcorpus 2.0 Scraper and Webcorpus 2.0 Downloader: retrieve all file links and automate the entire corpus file download process.
- Word2Vec Trainer: easily train a Word2Vec model on any multi-file corpus in plain-text or CoNLL-U format. Saving and resuming are supported.
- Word2Vec Query: evaluate the trained model with the most common methods.
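To make the "plain-text or CoNLL-U, multi-file corpus" input concrete, the sketch below shows one way such a corpus could be turned into the token lists that Word2Vec training expects. It is an illustration only, not the app's implementation: the class name `CorpusSentences` and its parameters are hypothetical, and it assumes the standard CoNLL-U layout (tab-separated columns with FORM in column 1 and LEMMA in column 2, `#` comment lines, blank lines between sentences). The actual Webcorpus 2.0 files may use a different column order.

```python
from pathlib import Path


class CorpusSentences:
    """Restartable iterable over tokenised sentences in a directory of files."""

    def __init__(self, directory, pattern="*.txt", conllu=False, column=1):
        self.directory = Path(directory)
        self.pattern = pattern      # glob pattern selecting the corpus files
        self.conllu = conllu        # True: parse CoNLL-U, False: plain text
        self.column = column        # CoNLL-U column to keep (1 = FORM, 2 = LEMMA)

    def __iter__(self):
        # Files are read one at a time, so the whole corpus never has to
        # fit in memory; sorting keeps the order reproducible.
        for path in sorted(self.directory.glob(self.pattern)):
            with path.open(encoding="utf-8") as handle:
                if self.conllu:
                    yield from self._read_conllu(handle)
                else:
                    yield from self._read_plain(handle)

    @staticmethod
    def _read_plain(handle):
        # Plain text: one sentence per line, tokens separated by whitespace.
        for line in handle:
            tokens = line.split()
            if tokens:
                yield tokens

    def _read_conllu(self, handle):
        sentence = []
        for line in handle:
            line = line.rstrip("\n")
            if not line.strip():
                # A blank line closes the current sentence.
                if sentence:
                    yield sentence
                    sentence = []
            elif line.startswith("#"):
                continue  # sentence-level comment or metadata
            else:
                fields = line.split("\t")
                # Keep ordinary tokens only: IDs like "1-2" (multiword
                # ranges) and "1.1" (empty nodes) are skipped.
                if fields[0].isdigit() and len(fields) > self.column:
                    sentence.append(fields[self.column])
        if sentence:
            yield sentence
```

Because the class defines `__iter__` rather than being a one-shot generator, it can be iterated repeatedly, which matters for training procedures that pass over the corpus more than once.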

Tested on: Windows 11 and Lubuntu 22.04 LTS with Python version 3.10.11.

---

**[Contact](mailto:[email protected])**

[![License: GPL v3](https://img.shields.io/badge/License-GPLv3-blue.svg)](https://www.gnu.org/licenses/gpl-3.0)