{"id":13994078,"url":"https://github.com/EleutherAI/stackexchange-dataset","last_synced_at":"2025-07-22T18:33:11.662Z","repository":{"id":105114345,"uuid":"293951390","full_name":"EleutherAI/stackexchange-dataset","owner":"EleutherAI","description":"Python tools for processing the stackexchange data dumps into a text dataset for Language Models","archived":false,"fork":false,"pushed_at":"2023-12-06T00:30:43.000Z","size":51,"stargazers_count":81,"open_issues_count":5,"forks_count":18,"subscribers_count":2,"default_branch":"master","last_synced_at":"2025-04-24T18:51:19.212Z","etag":null,"topics":[],"latest_commit_sha":null,"homepage":null,"language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/EleutherAI.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2020-09-08T23:33:00.000Z","updated_at":"2025-02-02T02:24:52.000Z","dependencies_parsed_at":null,"dependency_job_id":"238dbd92-82ed-4321-a850-812f0fdc3ebe","html_url":"https://github.com/EleutherAI/stackexchange-dataset","commit_stats":null,"previous_names":[],"tags_count":0,"template":false,"template_full_name":null,"purl":"pkg:github/EleutherAI/stackexchange-dataset","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/EleutherAI%2Fstackexchange-dataset","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/EleutherAI%2Fstackexchange-dataset/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/EleutherAI%2Fstackexchange-dataset/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/reposi
tories/EleutherAI%2Fstackexchange-dataset/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/EleutherAI","download_url":"https://codeload.github.com/EleutherAI/stackexchange-dataset/tar.gz/refs/heads/master","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/EleutherAI%2Fstackexchange-dataset/sbom","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":266552544,"owners_count":23947177,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","status":"online","status_checked_at":"2025-07-22T02:00:09.085Z","response_time":66,"last_error":null,"robots_txt_status":null,"robots_txt_updated_at":null,"robots_txt_url":"https://github.com/robots.txt","online":true,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":[],"created_at":"2024-08-09T14:02:41.764Z","updated_at":"2025-07-22T18:33:06.805Z","avatar_url":"https://github.com/EleutherAI.png","language":"Python","funding_links":[],"categories":["Python"],"sub_categories":[],"readme":"# stackexchange_dataset\nA python tool for downloading \u0026 processing the [stackexchange data dumps](https://archive.org/details/stackexchange) into a text dataset for Language Models.\n\nDownload the whole processed dataset [here](https://eaidata.bmk.sh/data/stackexchange_dataset.tar)\n\n# Setup\n```\ngit clone https://github.com/EleutherAI/stackexchange_dataset/\ncd stackexchange_dataset\npip install -r requirements.txt\n```\n# Usage\n\nTo download *every* stackexchange dump \u0026 parse to text, simply run\n\n```\npython3 main.py --names all\n```\n\nTo download only a single stackexchange, you can add the 
name as an optional argument. For example:\n\n```\npython3 main.py --names security.stackexchange\n```\n\nTo download a list of multiple stackexchanges, you can add the names separated by commas. For example:\n\n```\npython3 main.py --names ru.stackoverflow,money.stackexchange\n```\n\nThe name should be the URL of the stackexchange site, minus `http(s)://` and `.com`. You can view all available stackexchange dumps [here](https://archive.org/download/stackexchange).\n\n## All Usage Options:\n\n```\nusage: main.py [-h] [--names NAMES]\n\nCLI for stackexchange_dataset - A tool for downloading \u0026 processing\nstackexchange dumps in xml form to a raw question-answer pair text dataset for\nLanguage Models\n\noptional arguments:\n  -h, --help     show this help message and exit\n  --names NAMES  names of stackexchanges to download, extract \u0026 parse,\n                 separated by commas. If \"all\", will download, extract \u0026 parse\n                 *every* stackoverflow site\n```\n\n# TODO:\n\n- [ ] should we add metadata to the text (i.e., name of stackexchange \u0026 tags)?\n- [ ] add flags to change min_score / max_responses args.\n- [ ] add flags to turn off downloading / extraction\n- [ ] add flags to select number of workers for multiprocessing\n- [ ] output as [lm dataformat](https://github.com/leogao2/lm_dataformat)\n\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2FEleutherAI%2Fstackexchange-dataset","html_url":"https://awesome.ecosyste.ms/projects/github.com%2FEleutherAI%2Fstackexchange-dataset","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2FEleutherAI%2Fstackexchange-dataset/lists"}