https://github.com/laion-ai/interesting-text-datasets
- Host: GitHub
- URL: https://github.com/laion-ai/interesting-text-datasets
- Owner: LAION-AI
- Created: 2022-11-15T09:33:24.000Z (over 2 years ago)
- Default Branch: main
- Last Pushed: 2022-12-28T20:40:50.000Z (over 2 years ago)
- Last Synced: 2025-05-07T18:13:40.247Z (about 1 month ago)
- Size: 22.5 KB
- Stars: 43
- Watchers: 6
- Forks: 3
- Open Issues: 1
Metadata Files:
- Readme: readme.md
README
Books & Documents:
https://huggingface.co/datasets/the_pile_books3
Description:
This dataset is Shawn Presser's work and is part of EleutherAI's The Pile dataset. It contains all of Bibliotik in plain .txt form, that is, 197,000 books processed in exactly the same way as bookcorpusopen (a.k.a. books1).
On s3: Not yet.
Converted to training format: not yet
https://the-eye.eu/libraries.html
Description:
libgen & zlib
On s3: yes
Converted to training format: not yet
https://archive.org/details/fanfictiondotnet_repack
https://archive.org/details/Fanfictiondotnet1011dump
fanfiction.net ID 11M+ should get scraped
Description:
dump of fanfiction.net
Many short stories, books, ...
On s3: Yes
Converted to training format: not yet
https://the-eye.eu/public/Random/torrents/archiveorg_DjVuTXT_Part1.torrent
Description:
16M ebooks from IA
On s3: Not Yet
Converted to training format: not yet
https://the-eye.eu/public/Books/
Description:
5M+ ebooks from different domains
On s3: Not Yet
Converted to training format: not yet
all ebook torrents from piratebay: https://pirate-bays.net/search?q=ebooks
Description:
many different ebook torrents
On s3: Not Yet
Converted to training format: not yet
https://the-eye.eu/public/Site-Dumps/campdivision.com/camp/Text%20Files/Miscellaneous%20Texts/
https://the-eye.eu/public/Site-Dumps/campdivision.com/camp/Text%20Files/PDF/
Description:
many TV captions / subtitles - need to be checked
On s3: Yes.
Converted to training format: not yet
https://huggingface.co/datasets/bookcorpusopen
https://huggingface.co/datasets/demelin/moral_stories
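Most of the entries above are still marked "Converted to training format: not yet". This list does not say what the training format is, so the following is only a minimal sketch under the assumption that it means JSON Lines with one document per line and a `text` field. The function name `txt_dir_to_jsonl`, the `source` and `path` fields, and any paths passed in are hypothetical.

```python
import json
from pathlib import Path


def txt_dir_to_jsonl(txt_dir, out_path, source):
    """Write every .txt file under txt_dir as one JSON object per line.

    Assumed (hypothetical) record layout: {"source", "path", "text"}.
    Returns the number of documents written.
    """
    txt_dir = Path(txt_dir)
    count = 0
    with open(out_path, "w", encoding="utf-8") as out:
        # sorted() keeps the output order stable across runs
        for path in sorted(txt_dir.rglob("*.txt")):
            # errors="replace" avoids aborting on badly encoded files
            text = path.read_text(encoding="utf-8", errors="replace").strip()
            if not text:
                continue  # skip empty files
            record = {
                "source": source,
                "path": path.relative_to(txt_dir).as_posix(),
                "text": text,
            }
            out.write(json.dumps(record, ensure_ascii=False) + "\n")
            count += 1
    return count
```

Writing one record per book keeps the file streamable line by line, which matters for dumps of this size; whether a different layout (for example pre-tokenized shards) is wanted is left open here.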
Subs:
https://the-eye.eu/public/Random/archive.org_dumps/archive.org_tvarchive_CaptionProject_December1st2022.tar.zst
Description:
many TV captions / subtitles - need to be checked
On s3: Yes.
Converted to training format: not yet
Large-scale web text:
https://huggingface.co/datasets/oscar
https://huggingface.co/datasets/mc4
https://huggingface.co/datasets/the_pile
https://huggingface.co/datasets/spanish_billion_words
https://huggingface.co/datasets/arabic_billion_words
https://huggingface.co/datasets/olm/wikipedia
https://huggingface.co/datasets/cc100
https://files.pushshift.io/reddit/comments/
https://arxiv.org/abs/2001.08435
Description:
Reddit comments dumps
On s3: Not yet
Converted to training format: not yet
https://the-eye.eu/public/social/twitter/
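For the Reddit comments dumps listed above, a first step toward a training format could be extracting the comment text. The sketch below assumes the dumps have already been decompressed into newline-delimited JSON with one comment object per line and a `body` field, which is how the Pushshift comment files are commonly distributed. The function name `iter_comment_texts` and the `min_chars` parameter are hypothetical, and the filtering rules are only one possible choice.

```python
import json


def iter_comment_texts(lines, min_chars=1):
    """Yield comment bodies from an iterable of NDJSON lines.

    Assumes each line is a JSON object with a "body" string field.
    """
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        try:
            comment = json.loads(line)
        except json.JSONDecodeError:
            continue  # skip truncated or corrupt lines
        if not isinstance(comment, dict):
            continue
        body = comment.get("body", "")
        if not isinstance(body, str):
            continue
        # placeholders left behind for deleted or moderator-removed comments
        if body in ("[deleted]", "[removed]"):
            continue
        if len(body) >= min_chars:
            yield body
```

Because the function only needs an iterable of lines, it can be fed any text-mode file handle, for example one returned by `bz2.open(path, "rt", encoding="utf-8")` or `lzma.open(path, "rt", encoding="utf-8")` when an archive uses one of those compression formats.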
Code:
https://huggingface.co/datasets/bigcode/the-stack-dedup
https://huggingface.co/datasets/code_search_net
https://huggingface.co/datasets/codeparrot/github-code
Law:
https://openreview.net/forum?id=3HCT3xfNm9r
https://huggingface.co/datasets/pile-of-law/pile-of-law
Scientific papers:
Translation:
https://huggingface.co/datasets/opus100