{"id":17823227,"url":"https://github.com/purarue/pushshift_comment_export","last_synced_at":"2025-03-18T16:30:21.915Z","repository":{"id":94610547,"uuid":"292736093","full_name":"purarue/pushshift_comment_export","owner":"purarue","description":"Exports all accessible reddit comments for an account using pushshift","archived":false,"fork":false,"pushed_at":"2024-10-25T17:42:52.000Z","size":44,"stargazers_count":11,"open_issues_count":0,"forks_count":1,"subscribers_count":2,"default_branch":"master","last_synced_at":"2025-03-16T16:15:46.017Z","etag":null,"topics":["data-export","pushshift","reddit"],"latest_commit_sha":null,"homepage":"https://pypi.org/project/pushshift-comment-export/","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/purarue.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2020-09-04T03:08:33.000Z","updated_at":"2024-10-25T17:42:56.000Z","dependencies_parsed_at":"2024-10-27T18:21:54.946Z","dependency_job_id":null,"html_url":"https://github.com/purarue/pushshift_comment_export","commit_stats":{"total_commits":31,"total_committers":1,"mean_commits":31.0,"dds":0.0,"last_synced_commit":"a4b4837c2a2c4b5f8022a12e2fe471c529aae583"},"previous_names":["seanbreckenridge/pushshift_comment_export"],"tags_count":0,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/purarue%2Fpushshift_comment_export","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/purarue%2Fpushshift_comment_export/tags","releases_url":"https://repos.eco
syste.ms/api/v1/hosts/GitHub/repositories/purarue%2Fpushshift_comment_export/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/purarue%2Fpushshift_comment_export/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/purarue","download_url":"https://codeload.github.com/purarue/pushshift_comment_export/tar.gz/refs/heads/master","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":243893897,"owners_count":20364919,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["data-export","pushshift","reddit"],"created_at":"2024-10-27T17:57:04.877Z","updated_at":"2025-03-18T16:30:21.907Z","avatar_url":"https://github.com/purarue.png","language":"Python","funding_links":[],"categories":[],"sub_categories":[],"readme":"Exports all accessible reddit comments for an account using [pushshift](https://pushshift.io/).\n\n[![PyPi version](https://img.shields.io/pypi/v/pushshift_comment_export.svg)](https://pypi.python.org/pypi/pushshift_comment_export) [![Python 3.6|3.7|3.8](https://img.shields.io/pypi/pyversions/pushshift_comment_export.svg)](https://pypi.python.org/pypi/pushshift_comment_export) [![PRs Welcome](https://img.shields.io/badge/PRs-welcome-brightgreen.svg?style=flat-square)](http://makeapullrequest.com)\n\n# REDDIT API CHANGES\n\nSince the API restrictions on reddit, the terms of use for pushshift have changed, see \u003chttps://pushshift.io/signup\u003e\n\nI no longer use this (I only really used it once to get all my historical data), but in order to use this one would need to supply a 
pushshift API token. Since I don't meet the terms of use ('user certifies that they are a registered user of Reddit and a Reddit moderator (a “Mod\") and may only access Reddit Services and Data through Pushshift Services for the express limited purposes of community moderation, enforcing Reddit community guidelines, and ensuring community member safety'), I don't see a reason to update this for myself, but if someone else wants to make a PR, feel free to do so.\n\nYou should be able to get your historical data through the [data request](https://www.reddit.com/settings/data-request), if all you want is to back up your data.\n\n### Install\n\nRequires `python3.6+`\n\nTo install with pip, run:\n\n    pip install pushshift_comment_export\n\nIt is accessible as the script `pushshift_comment_export`, or by using `python3 -m pushshift_comment_export`.\n\n---\n\nReddit (supposedly) only indexes the last 1000 items per query, so there are lots of comments that I don't have access to using the official reddit API (I run [`rexport`](https://github.com/karlicoss/rexport/) periodically to pick up any new data).\n\nThis downloads all the comments that pushshift has, which is typically more than the 1000 query limit. 
This is only really meant to be used once per account, to retrieve old data that I can't reach through the official API.\n\nFor more context see the comments [here](https://github.com/karlicoss/rexport/#api-limitations).\n\nReddit has recently added a [data request](https://www.reddit.com/settings/data-request) which may let you get comments going further back, but pushshift's JSON response contains a bit more info than what the GDPR request does.\n\nComplies with the rate limit [described here](https://github.com/dmarx/psaw#features).\n\n```\n$ pushshift_comment_export \u003creddit_username\u003e --to-file ./data.json\n.....\n[D 200903 19:51:49 __init__:43] Have 4700, now searching for comments before 2015-10-07 23:32:03...\n[D 200903 19:51:49 __init__:17] Requesting https://api.pushshift.io/reddit/comment/search?author=username\u0026limit=100\u0026sort_type=created_utc\u0026sort=desc\u0026before=1444260723...\n[D 200903 19:51:52 __init__:43] Have 4800, now searching for comments before 2015-09-22 13:55:00...\n[D 200903 19:51:52 __init__:17] Requesting https://api.pushshift.io/reddit/comment/search?author=username\u0026limit=100\u0026sort_type=created_utc\u0026sort=desc\u0026before=1442930100...\n[D 200903 19:51:57 __init__:43] Have 4860, now searching for comments before 2014-08-28 07:10:14...\n[D 200903 19:51:57 __init__:17] Requesting https://api.pushshift.io/reddit/comment/search?author=username\u0026limit=100\u0026sort_type=created_utc\u0026sort=desc\u0026before=1409209814...\n[I 200903 19:52:01 __init__:64] Done! 
writing 4860 comments to file ./data.json\n```\n\npushshift doesn't require authentication; if you want to preview what this looks like, just go to \u003chttps://api.pushshift.io/reddit/comment/search?author=\u003e\n\n#### Usage in HPI\n\nThis has been merged into [karlicoss/HPI](https://github.com/karlicoss/HPI), which combines the periodic results of `rexport` (to pick up new comments) with any older comments exported using this tool. The module looks like [this](https://github.com/karlicoss/HPI/tree/master/my/reddit); my config looks like:\n\n```python\nclass reddit:\n    class rexport:\n        export_path: Paths = \"~/data/rexport/*.json\"\n    class pushshift:\n        export_path: Paths = \"~/data/pushshift/*.json\"\n```\n\nThen importing from `my.reddit.all` combines the data from both of them:\n\n```\n\u003e\u003e\u003e from my.reddit.rexport import comments as rcomments\n\u003e\u003e\u003e from my.reddit.pushshift import comments as pcomments\n\u003e\u003e\u003e from my.reddit.all import comments\n\u003e\u003e\u003e from more_itertools import ilen\n\u003e\u003e\u003e ilen(rcomments())\n1020\n\u003e\u003e\u003e ilen(pcomments())\n4891\n\u003e\u003e\u003e ilen(comments())\n4914\n```\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fpurarue%2Fpushshift_comment_export","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fpurarue%2Fpushshift_comment_export","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fpurarue%2Fpushshift_comment_export/lists"}