https://github.com/laion-ai/interesting-text-datasets



Books & Documents:

https://huggingface.co/datasets/the_pile_books3

Description:
This dataset is Shawn Presser's work and is part of EleutherAI's The Pile dataset.

This dataset contains all of Bibliotik in plain .txt form, i.e. 197,000 books processed in exactly the same way as bookcorpusopen (a.k.a. books1).

On s3: Not yet.

Converted to training format: not yet
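
The list does not say what "training format" means. As a minimal sketch, assuming the target is JSON Lines with one record per book, a folder of plain .txt files could be converted as below. The function name `txt_dir_to_jsonl` and the `id`/`text` field names are hypothetical and are not taken from this project.

```python
import json
from pathlib import Path


def txt_dir_to_jsonl(src_dir, out_path, encoding="utf-8"):
    """Write one JSON object per non-empty .txt file under src_dir.

    Assumed record layout (not defined by the list itself):
    {"id": <path relative to src_dir>, "text": <file contents>}
    Returns the number of records written.
    """
    src = Path(src_dir)
    count = 0
    with open(out_path, "w", encoding="utf-8") as out:
        # Sort so the record order is the same on every run.
        for path in sorted(src.rglob("*.txt")):
            # errors="replace" keeps going on badly encoded files.
            text = path.read_text(encoding=encoding, errors="replace").strip()
            if not text:
                continue  # skip empty or whitespace-only files
            record = {"id": path.relative_to(src).as_posix(), "text": text}
            out.write(json.dumps(record, ensure_ascii=False) + "\n")
            count += 1
    return count
```

One record per line keeps the output streamable, so a large dump never has to be held in memory at once.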

https://the-eye.eu/libraries.html

Description:
libgen & zlib

On s3: yes
Converted to training format: not yet

https://archive.org/details/fanfictiondotnet_repack

https://archive.org/details/Fanfictiondotnet1011dump

fanfiction.net stories with IDs above 11M should still get scraped

Description:
dump of fanfiction.net
Many short stories, books, ...

On s3: Yes

Converted to training format: not yet

https://the-eye.eu/public/Random/torrents/archiveorg_DjVuTXT_Part1.torrent

Description:
16M ebooks from the Internet Archive (IA)

On s3: Not Yet

Converted to training format: not yet

https://the-eye.eu/public/Books/
Description:
5+M ebooks from different domains

On s3: Not Yet

Converted to training format: not yet

All ebook torrents from The Pirate Bay: https://pirate-bays.net/search?q=ebooks
Description:
many different ebook torrents

On s3: Not Yet

Converted to training format: not yet

https://the-eye.eu/public/Site-Dumps/campdivision.com/camp/Text%20Files/Miscellaneous%20Texts/

https://the-eye.eu/public/Site-Dumps/campdivision.com/camp/Text%20Files/PDF/

Description:
many TV captions / subtitles - need to be checked

On s3: Yes.

Converted to training format: not yet

https://huggingface.co/datasets/bookcorpusopen

https://huggingface.co/datasets/demelin/moral_stories

Subs:
https://the-eye.eu/public/Random/archive.org_dumps/archive.org_tvarchive_CaptionProject_December1st2022.tar.zst

Description:
many TV captions / subtitles - need to be checked

On s3: Yes.

Converted to training format: not yet

Large-scale Webtext:

https://huggingface.co/datasets/oscar

https://huggingface.co/datasets/mc4

https://huggingface.co/datasets/the_pile

https://huggingface.co/datasets/spanish_billion_words

https://huggingface.co/datasets/arabic_billion_words

https://huggingface.co/datasets/olm/wikipedia

https://huggingface.co/datasets/cc100

https://files.pushshift.io/reddit/comments/
https://arxiv.org/abs/2001.08435

Description:
Reddit comments dumps

On s3: Not yet

Converted to training format: not yet
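
A rough sketch of extracting text from the Reddit comment dumps, assuming the files decompress to newline-delimited JSON with the comment text in a `body` field. Decompression is left to the caller, since the archive format varies by file. The function name `iter_comment_bodies` and the `min_chars` parameter are hypothetical.

```python
import json


def iter_comment_bodies(lines, min_chars=1):
    """Yield comment text from an iterable of NDJSON lines.

    Assumes one JSON object per line with the text in "body".
    Malformed lines and deleted or removed comments are skipped.
    """
    for line in lines:
        line = line.strip()
        if not line:
            continue
        try:
            comment = json.loads(line)
        except json.JSONDecodeError:
            continue  # skip lines that are not valid JSON
        if not isinstance(comment, dict):
            continue  # skip JSON values that are not objects
        body = comment.get("body") or ""
        if body in ("[deleted]", "[removed]") or len(body) < min_chars:
            continue
        yield body
```

Because the function takes any iterable of strings, it can be fed from an open text file or from a streaming decompressor without loading a whole dump into memory.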

https://the-eye.eu/public/social/twitter/

Code:
https://huggingface.co/datasets/bigcode/the-stack-dedup

https://huggingface.co/datasets/code_search_net

https://huggingface.co/datasets/codeparrot/github-code

Law:
https://openreview.net/forum?id=3HCT3xfNm9r
https://huggingface.co/datasets/pile-of-law/pile-of-law

Scientific papers:

Translation:
https://huggingface.co/datasets/opus100
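
Many entries above are Hugging Face dataset pages. As a small sketch, the dataset ID that the `datasets` library expects can be derived from such a URL. The helper name `hf_dataset_id` is hypothetical, and the commented-out loading call is an assumption about typical usage: it needs network access, and some datasets also need a config name.

```python
from urllib.parse import urlparse


def hf_dataset_id(url):
    """Return the dataset ID from a huggingface.co/datasets/... URL, or None."""
    parsed = urlparse(url)
    parts = [p for p in parsed.path.split("/") if p]
    if parsed.netloc != "huggingface.co" or len(parts) < 2 or parts[0] != "datasets":
        return None
    # IDs are either "name" or "namespace/name".
    return "/".join(parts[1:3])


# Typical use with the `datasets` library (not run here):
# from datasets import load_dataset
# ds = load_dataset(hf_dataset_id(url), streaming=True)
```

Streaming avoids downloading a full corpus before inspecting a few records, which matters for collections of this size.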