https://github.com/mathisve/latintextdataset
Latin text dataset for machine learning and procedural text generation
Last synced: about 2 months ago
- Host: GitHub
- URL: https://github.com/mathisve/latintextdataset
- Owner: mathisve
- License: mit
- Created: 2019-03-29T11:30:54.000Z (almost 6 years ago)
- Default Branch: master
- Last Pushed: 2024-06-03T04:06:55.000Z (7 months ago)
- Last Synced: 2024-06-03T05:26:50.865Z (7 months ago)
- Language: Python
- Homepage:
- Size: 33 MB
- Stars: 14
- Watchers: 2
- Forks: 3
- Open Issues: 0
Metadata Files:
- Readme: README.md
- License: LICENSE
README
# Latin Text Dataset
A dataset of more than 28.7 million characters of Latin text for machine learning, language generation and analysis.
## About
This is a small snippet of what the dataset looks like:
```
Cum venisset accitus praedicto die, advocato omni quod aderat commilitio, tribunali ad altiorem
suggestum erecto, quod aquilae circumdederunt et signa, Augustus insistens eumque manu retinens
dextera, haec sermone placido peroravit: Adsistimus apud illos, optimi rei publicae defensores,
causae communi uno paene omnium spiritu vindicandae, quam acturus tamquam apud aequos iudices.
```
As you can see, it is all authentic Latin written in Roman times by historical figures such as [Caesar](https://en.wikipedia.org/wiki/Julius_Caesar), [Augustus](https://en.wikipedia.org/wiki/Augustus) and many more. There are still some kinks I have not been able to resolve, such as the occasional title or capitalised Roman numeral, but the dataset is large enough that these should be too diluted for LSTMs (or GRUs) to pick up on.
All data and text originates from [thelatinlibrary.com](https://www.thelatinlibrary.com/cred.html), which is, to my knowledge, in the public domain.
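If the leftover titles and capitalised Roman numerals do matter for your use case, a rough filter is easy to write. The sketch below is not part of this repository: `is_noise_line` and `drop_noise` are hypothetical helpers, and the heuristic assumes that titles and bare numerals sit on their own lines and are written entirely in capitals. That assumption has not been checked against the whole corpus.

```python
def is_noise_line(line: str) -> bool:
    """Return True if every letter on the line is uppercase.

    This catches all-caps titles ("LIBER PRIMUS") as well as bare Roman
    numerals ("XIV"), since a Roman numeral is itself all capitals.
    Blank lines and lines without letters are not treated as noise.
    """
    letters = [ch for ch in line if ch.isalpha()]
    return bool(letters) and all(ch.isupper() for ch in letters)


def drop_noise(text: str) -> str:
    """Remove lines flagged by is_noise_line and keep everything else."""
    kept = [line for line in text.splitlines() if not is_noise_line(line)]
    return "\n".join(kept)


sample = "LIBER PRIMUS\nXIV\nCum venisset accitus praedicto die."
cleaned = drop_noise(sample)
```

Numerals or capitalised words that appear in the middle of a sentence are left untouched, so this only removes the most obvious cases.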
## Getting Started
You can either use the pre-scraped and pre-processed file `latincorpus.txt`, or run and modify `main.py` to configure the scrape to your liking. Scraping all the text data takes about 3-5 minutes on a computer with a moderately fast CPU and an Ethernet connection.
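As a starting point for the pre-scraped file, here is a minimal sketch of preparing `latincorpus.txt` for a character-level model such as an LSTM. None of these helpers exist in this repository: `load_corpus`, `build_vocab`, `encode` and `make_windows` are hypothetical names, and reading the file as UTF-8 is an assumption.

```python
from pathlib import Path


def load_corpus(path="latincorpus.txt"):
    """Read the whole corpus into one string (UTF-8 is an assumption)."""
    return Path(path).read_text(encoding="utf-8")


def build_vocab(text):
    """Map each distinct character to an integer id, in sorted order."""
    return {ch: i for i, ch in enumerate(sorted(set(text)))}


def encode(text, vocab):
    """Turn a string into a list of character ids."""
    return [vocab[ch] for ch in text]


def make_windows(ids, seq_len):
    """Yield (input_ids, target_id) pairs for next-character prediction."""
    for start in range(len(ids) - seq_len):
        yield ids[start:start + seq_len], ids[start + seq_len]


if __name__ == "__main__":
    corpus_path = Path("latincorpus.txt")
    if corpus_path.exists():
        text = load_corpus(corpus_path)
        vocab = build_vocab(text)
        ids = encode(text, vocab)
        print(f"{len(text)} characters, {len(vocab)} distinct symbols")
```

Each window pairs `seq_len` character ids with the id of the character that follows, which is the usual input/target shape for next-character training. How you batch these depends on the framework you use.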
### Prerequisites
The following libraries are required to run `main.py`. To install them automatically, see Installing below.
```
selenium==3.141.0
beautifulsoup4==4.7.1
tqdm==4.31.1
```
### Installing
To install the Python libraries listed above, run this command:
```
pip3 install -r requirements.txt
```