https://github.com/yet-another-account/openwebtext
An open clone of the GPT-2 WebText dataset by OpenAI. Still WIP.
- Host: GitHub
- URL: https://github.com/yet-another-account/openwebtext
- Owner: yet-another-account
- Created: 2019-02-18T01:06:24.000Z (over 5 years ago)
- Default Branch: master
- Last Pushed: 2024-03-26T15:33:13.000Z (8 months ago)
- Last Synced: 2024-05-27T12:02:12.883Z (6 months ago)
- Language: Python
- Homepage:
- Size: 90.8 KB
- Stars: 377
- Watchers: 10
- Forks: 60
- Open Issues: 12
Metadata Files:
- Readme: README.md
README
# OpenWebText
This project is a clone of the GPT-2 WebText dataset as outlined in the [OpenAI paper](https://d4mucfpksywv.cloudfront.net/better-language-models/language_models_are_unsupervised_multitask_learners.pdf). It is still very much a work in progress.
Huge thanks to [jcpeterson](https://github.com/jcpeterson/openwebtext) for letting me use his download code. His version of OpenWebText is super well written, so please check it out!
## Dependencies
Pipenv and Python 3.
To install the Python dependencies:
```
pipenv install
```
[Newspaper](https://github.com/codelucas/newspaper#get-it-now) dependencies:
On Ubuntu:
```
sudo apt-get install libxml2-dev libxslt-dev
```
On OS X:
```
brew install libxml2 libxslt
```
## Usage
1. Get a list of URLs from Reddit:
```
pipenv run python get_urls.py
```
2. Download data from URLs:
```
pipenv run python download.py
```
The resulting files are written to `data/`, one per URL, named `{domain}-{sha256 hash of url}.txt`.
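As a rough illustration of that naming scheme, here is a minimal sketch of how such a file name could be built and parsed. It is not taken from `download.py`: the helper names `output_filename` and `split_filename` are hypothetical, and it assumes the domain is the URL's network location and the hash is the hex SHA-256 digest of the UTF-8 encoded URL. The project may compute either part differently.

```python
import hashlib
from typing import Tuple
from urllib.parse import urlparse


def output_filename(url: str) -> str:
    """Build a `{domain}-{sha256 hash of url}.txt` style name (hypothetical helper).

    Assumptions, not confirmed by the project: the domain is the URL's
    network location, and the hash is the hex SHA-256 of the UTF-8 URL.
    """
    domain = urlparse(url).netloc
    digest = hashlib.sha256(url.encode("utf-8")).hexdigest()
    return f"{domain}-{digest}.txt"


def split_filename(name: str) -> Tuple[str, str]:
    """Recover (domain, hash) from a name produced by output_filename."""
    stem = name[: -len(".txt")]
    # A hex digest never contains "-", so split on the last hyphen only;
    # domains such as "my-site.org" may contain hyphens themselves.
    domain, digest = stem.rsplit("-", 1)
    return domain, digest
```

Under these assumptions, splitting on the last hyphen gives a simple way to group downloaded files by domain, for example when checking how many pages came from each site.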
Enjoy!