https://github.com/laion-ai/interesting-text-datasets


Host: GitHub
URL: https://github.com/laion-ai/interesting-text-datasets
Owner: LAION-AI
Created: 2022-11-15T09:33:24.000Z
Default Branch: main
Last Pushed: 2022-12-28T20:40:50.000Z
Last Synced: 2025-05-07T18:13:48.078Z
Size: 22.5 KB
Stars: 43
Watchers: 6
Forks: 3
Open Issues: 1
Metadata Files:
- Readme: readme.md

README

Books & Documents:

https://huggingface.co/datasets/the_pile_books3

Description:
This dataset is Shawn Presser's work and is part of EleutherAI's The Pile dataset.

This dataset contains all of bibliotik in plain .txt form, i.e. 197,000 books processed in exactly the same way as was done for bookcorpusopen (a.k.a. books1).

On s3: Not yet.

Converted to training format: not yet
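
The README does not say what the "training format" is. As a rough sketch only, assuming it means JSON Lines records with a "text" field, a plain .txt book could be split into fixed-size word windows as shown here. The function names, the "source" and "chunk" fields, and the 512-word default are all hypothetical and not taken from the repository.

```python
import json
from typing import Iterator


def chunk_book(text: str, max_words: int = 512) -> Iterator[str]:
    """Yield consecutive chunks of at most max_words whitespace-separated words."""
    words = text.split()
    for start in range(0, len(words), max_words):
        yield " ".join(words[start:start + max_words])


def book_to_jsonl(text: str, source: str, max_words: int = 512) -> str:
    """Serialize one book as JSON Lines, one record per chunk.

    The record layout (source, chunk, text) is an assumption made for
    illustration; the README does not define a target schema.
    """
    records = (
        json.dumps({"source": source, "chunk": i, "text": chunk}, ensure_ascii=False)
        for i, chunk in enumerate(chunk_book(text, max_words))
    )
    return "\n".join(records)
```

Splitting on whitespace discards the original line breaks, which may or may not be acceptable depending on the downstream tokenizer.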

https://the-eye.eu/libraries.html

Description:
libgen & zlib

On s3: yes
Converted to training format: not yet

https://archive.org/details/fanfictiondotnet_repack

https://archive.org/details/Fanfictiondotnet1011dump

Stories with fanfiction.net IDs above 11M still need to be scraped.

Description:
dump of fanfiction.net
Many short stories, books, ...

On s3: Yes

Converted to training format: not yet

https://the-eye.eu/public/Random/torrents/archiveorg_DjVuTXT_Part1.torrent

Description:
16M ebooks from the Internet Archive (IA)

On s3: Not Yet

Converted to training format: not yet

https://the-eye.eu/public/Books/

Description:
5M+ ebooks from different domains

On s3: Not Yet

Converted to training format: not yet

All ebook torrents from piratebay: https://pirate-bays.net/search?q=ebooks

Description:
many different ebook torrents

On s3: Not Yet

Converted to training format: not yet

https://the-eye.eu/public/Site-Dumps/campdivision.com/camp/Text%20Files/Miscellaneous%20Texts/

https://the-eye.eu/public/Site-Dumps/campdivision.com/camp/Text%20Files/PDF/

Description:
many TV captions / subtitles - need to be checked

On s3: Yes.

Converted to training format: not yet

https://huggingface.co/datasets/bookcorpusopen

https://huggingface.co/datasets/demelin/moral_stories

Subs:
https://the-eye.eu/public/Random/archive.org_dumps/archive.org_tvarchive_CaptionProject_December1st2022.tar.zst

Description:
many TV captions / subtitles - need to be checked

On s3: Yes.

Converted to training format: not yet

Largescale Webtext:

https://huggingface.co/datasets/oscar

https://huggingface.co/datasets/mc4

https://huggingface.co/datasets/the_pile

https://huggingface.co/datasets/spanish_billion_words

https://huggingface.co/datasets/arabic_billion_words

https://huggingface.co/datasets/olm/wikipedia

https://huggingface.co/datasets/cc100

https://files.pushshift.io/reddit/comments/
https://arxiv.org/abs/2001.08435

Description:
Reddit comments dumps

On s3: Not yet

Converted to training format: not yet
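
A hedged sketch of how the Reddit comment dumps might be converted follows. It assumes the files have already been decompressed into newline-delimited JSON and that each comment object has "body" and "subreddit" fields; the output record layout and the function name are hypothetical, since the README does not specify a target format.

```python
import json
from typing import Iterable, Iterator

# Placeholder bodies left behind when a comment was deleted or removed.
SKIP_BODIES = {"[deleted]", "[removed]"}


def comments_to_records(lines: Iterable[str]) -> Iterator[dict]:
    """Turn newline-delimited JSON comments into simple text records.

    Malformed lines, non-object values, and comments without usable
    text are skipped rather than raising.
    """
    for line in lines:
        line = line.strip()
        if not line:
            continue
        try:
            comment = json.loads(line)
        except json.JSONDecodeError:
            continue
        if not isinstance(comment, dict):
            continue
        body = comment.get("body")
        if not isinstance(body, str):
            continue
        if not body.strip() or body.strip() in SKIP_BODIES:
            continue
        # Assumed output schema: the comment text plus its subreddit, if present.
        yield {"text": body, "subreddit": comment.get("subreddit")}
```

Because the function works on any iterable of lines, it can be fed from an open file handle or from a streaming decompressor without loading a whole dump into memory.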

https://the-eye.eu/public/social/twitter/

Code:
https://huggingface.co/datasets/bigcode/the-stack-dedup

https://huggingface.co/datasets/code_search_net

https://huggingface.co/datasets/codeparrot/github-code

Law:
https://openreview.net/forum?id=3HCT3xfNm9r
https://huggingface.co/datasets/pile-of-law/pile-of-law

Scientific papers:

Translation:
https://huggingface.co/datasets/opus100
