https://github.com/mathisve/latintextdataset

Latin text dataset for machine learning and procedural text generation

Host: GitHub
URL: https://github.com/mathisve/latintextdataset
Owner: mathisve
License: MIT
Created: 2019-03-29T11:30:54.000Z (over 6 years ago)
Default Branch: master
Last Pushed: 2024-06-03T04:06:55.000Z (about 1 year ago)
Last Synced: 2025-04-18T06:41:19.763Z (3 months ago)
Language: Python
Homepage:
Size: 33 MB
Stars: 16
Watchers: 2
Forks: 3
Open Issues: 1
Metadata Files:
- Readme: README.md
- License: LICENSE

# Latin Text Dataset

A dataset of more than 28.7 million characters of Latin text for machine learning, language generation, and analysis.

## About
This is a small snippet of what the dataset looks like:
```
Cum venisset accitus praedicto die, advocato omni quod aderat commilitio, tribunali ad altiorem
suggestum erecto, quod aquilae circumdederunt et signa, Augustus insistens eumque manu retinens
dextera, haec sermone placido peroravit: Adsistimus apud illos, optimi rei publicae defensores,
causae communi uno paene omnium spiritu vindicandae, quam acturus tamquam apud aequos iudices.
```
As you can see, it's all authentic Latin written in Roman times by historical figures such as [Caesar](https://en.wikipedia.org/wiki/Julius_Caesar), [Augustus](https://en.wikipedia.org/wiki/Augustus), and many more.

There are still a few kinks I have not been able to resolve, such as the occasional title or capitalised Roman numeral. Because the dataset is so large, these should make little difference: they are diluted enough that LSTMs (or GRUs) should not pick up on them.
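
If you would rather strip the stray numerals yourself, here is a minimal sketch. It assumes the numerals sit on lines of their own, which I have not verified for every text; `is_roman_numeral` and `drop_numeral_lines` are hypothetical helpers and are not part of `main.py`.

```python
import re

# Strict upper-case Roman numeral pattern (values 1 to 3999).
ROMAN_NUMERAL = re.compile(
    r"M{0,3}(CM|CD|D?C{0,3})(XC|XL|L?X{0,3})(IX|IV|V?I{0,3})"
)


def is_roman_numeral(token: str) -> bool:
    """Return True if token is a non-empty, well-formed upper-case Roman numeral."""
    # The pattern also matches the empty string, so reject that explicitly.
    return bool(token) and ROMAN_NUMERAL.fullmatch(token) is not None


def drop_numeral_lines(text: str) -> str:
    """Drop lines that hold only a Roman numeral, optionally followed by periods."""
    kept = []
    for line in text.splitlines():
        candidate = line.strip().rstrip(".")
        if is_roman_numeral(candidate):
            continue  # assumed to be a chapter or section number
        kept.append(line)
    return "\n".join(kept)
```

Note that this also drops a line consisting only of an upper-case word that happens to be a valid numeral (for example `VI`), so check the result before training on it.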

All data and text originate from [thelatinlibrary.com](https://www.thelatinlibrary.com/cred.html), which is, to my knowledge, in the public domain.

## Getting Started

You can either use the pre-scraped and pre-processed file `latincorpus.txt`, or run and modify `main.py` and configure it to your liking. Scraping all the text data takes about 3-5 minutes on a computer with a moderately fast CPU and an Ethernet connection.
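
As a rough sketch of how the pre-processed file could feed a character-level model, the snippet below loads the corpus and maps each character to an integer id. It assumes `latincorpus.txt` is in the working directory and is UTF-8 encoded, which is an assumption on my part; `load_corpus`, `build_char_vocab`, `encode` and `decode` are illustrative names and do not come from `main.py`.

```python
from pathlib import Path


def load_corpus(path: str = "latincorpus.txt", encoding: str = "utf-8") -> str:
    """Read the whole corpus into memory as one string."""
    # errors="replace" keeps loading even if the encoding assumption is wrong.
    return Path(path).read_text(encoding=encoding, errors="replace")


def build_char_vocab(text: str) -> dict[str, int]:
    """Map every distinct character to an integer id, in sorted order."""
    return {ch: i for i, ch in enumerate(sorted(set(text)))}


def encode(text: str, vocab: dict[str, int]) -> list[int]:
    """Turn text into a list of character ids."""
    return [vocab[ch] for ch in text]


def decode(ids: list[int], vocab: dict[str, int]) -> str:
    """Turn a list of character ids back into text."""
    id_to_char = {i: ch for ch, i in vocab.items()}
    return "".join(id_to_char[i] for i in ids)
```

Calling `build_char_vocab(load_corpus())` and then `encode` on the same text gives an integer sequence that can be cut into fixed-length windows for an LSTM or GRU.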

### Prerequisites

The following libraries are required to run `main.py`. To install them automatically, see the Installing section below.

```
selenium==3.141.0
beautifulsoup4==4.7.1
tqdm==4.31.1
```

### Installing

To install the Python libraries listed above, run this command:
```
pip3 install -r requirements.txt
```
