Ecosyste.ms: Awesome
An open API service indexing awesome lists of open source software.
https://github.com/EleutherAI/stackexchange-dataset
Python tools for processing the stackexchange data dumps into a text dataset for Language Models
Last synced: 3 months ago
- Host: GitHub
- URL: https://github.com/EleutherAI/stackexchange-dataset
- Owner: EleutherAI
- License: mit
- Created: 2020-09-08T23:33:00.000Z (about 4 years ago)
- Default Branch: master
- Last Pushed: 2023-12-06T00:30:43.000Z (11 months ago)
- Last Synced: 2024-07-18T22:20:24.851Z (4 months ago)
- Language: Python
- Size: 49.8 KB
- Stars: 71
- Watchers: 3
- Forks: 14
- Open Issues: 4
Metadata Files:
- Readme: README.md
- License: LICENSE
README
# stackexchange_dataset
A Python tool for downloading & processing the [stackexchange data dumps](https://archive.org/details/stackexchange) into a text dataset for Language Models.

Download the whole processed dataset [here](https://eaidata.bmk.sh/data/stackexchange_dataset.tar).
# Setup
```
git clone https://github.com/EleutherAI/stackexchange-dataset
cd stackexchange-dataset
pip install -r requirements.txt
```
# Usage
To download *every* stackexchange dump & parse to text, simply run:
```
python3 main.py --names all
```

To download only a single stackexchange, pass its name to `--names`. E.g.:
```
python3 main.py --names security.stackexchange
```

To download several stackexchanges, pass their names separated by commas. E.g.:
```
python3 main.py --names ru.stackoverflow,money.stackexchange
```

The name should be the URL of the stackexchange site, minus `http(s)://` and `.com`. You can view all available stackexchange dumps [here](https://archive.org/download/stackexchange).
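
The naming rule above can be sketched in a few lines of Python. This is only an illustration and not part of the tool: `site_name_from_url` and `names_argument` are hypothetical helpers, and the sketch assumes every site URL ends in `.com`, as the rule states.

```python
from urllib.parse import urlparse


def site_name_from_url(url: str) -> str:
    """Hypothetical helper: strip the scheme and the trailing '.com' from a site URL."""
    # urlparse only fills in netloc when a scheme is present, so add one if missing.
    if "://" not in url:
        url = "https://" + url
    host = urlparse(url).netloc
    # Drop the trailing ".com", as the naming rule describes.
    if host.endswith(".com"):
        host = host[: -len(".com")]
    return host


def names_argument(urls) -> str:
    """Hypothetical helper: join site names with commas for the --names option."""
    return ",".join(site_name_from_url(u) for u in urls)


if __name__ == "__main__":
    # Builds the value used in the multi-site example above.
    print(names_argument([
        "https://ru.stackoverflow.com",
        "https://money.stackexchange.com/",
    ]))
```

The printed string can be passed directly as the value of `--names`.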
## All Usage Options:
```
usage: main.py [-h] [--names NAMES]

CLI for stackexchange_dataset - A tool for downloading & processing
stackexchange dumps in xml form to a raw question-answer pair text dataset for
Language Models

optional arguments:
  -h, --help     show this help message and exit
  --names NAMES  names of stackexchanges to download, extract & parse,
                 separated by commas. If "all", will download, extract & parse
                 *every* stackoverflow site
```

# TODO:
- [ ] should we add metadata to the text (i.e. name of stackexchange & tags)?
- [ ] add flags to change min_score / max_responses args.
- [ ] add flags to turn off downloading / extraction
- [ ] add flags to select number of workers for multiprocessing
- [ ] output as [lm dataformat](https://github.com/leogao2/lm_dataformat)