{"id":17086796,"url":"https://github.com/belval/reddit-json-dump-parser","last_synced_at":"2025-06-30T22:05:51.738Z","repository":{"id":66067527,"uuid":"130600818","full_name":"Belval/reddit-json-dump-parser","owner":"Belval","description":"A parser for the reddit data dump","archived":false,"fork":false,"pushed_at":"2018-09-02T15:51:17.000Z","size":6,"stargazers_count":1,"open_issues_count":0,"forks_count":0,"subscribers_count":2,"default_branch":"master","last_synced_at":"2025-03-23T14:42:20.289Z","etag":null,"topics":["dataset","reddit"],"latest_commit_sha":null,"homepage":null,"language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/Belval.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2018-04-22T19:32:36.000Z","updated_at":"2018-09-02T20:18:46.000Z","dependencies_parsed_at":null,"dependency_job_id":"b1b3c3ca-fc05-4b80-8c86-053c15378fe1","html_url":"https://github.com/Belval/reddit-json-dump-parser","commit_stats":null,"previous_names":[],"tags_count":0,"template":false,"template_full_name":null,"purl":"pkg:github/Belval/reddit-json-dump-parser","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/Belval%2Freddit-json-dump-parser","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/Belval%2Freddit-json-dump-parser/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/Belval%2Freddit-json-dump-parser/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/Belval%2Freddit-json-dump-parser/manifests","owner_url":"https:/
/repos.ecosyste.ms/api/v1/hosts/GitHub/owners/Belval","download_url":"https://codeload.github.com/Belval/reddit-json-dump-parser/tar.gz/refs/heads/master","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/Belval%2Freddit-json-dump-parser/sbom","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":262857264,"owners_count":23375490,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["dataset","reddit"],"created_at":"2024-10-14T13:29:33.338Z","updated_at":"2025-06-30T22:05:51.670Z","avatar_url":"https://github.com/Belval.png","language":"Python","funding_links":[],"categories":[],"sub_categories":[],"readme":"# reddit-json-dump-parser\n\nA parser for the reddit data dump that can be found here: [reddit](https://www.reddit.com/r/datasets/comments/3bxlg7/i_have_every_publicly_available_reddit_comment/)\n\n## How does the loader work?\n\n1. You edit the config to include the path where the uncompressed dump files can be found.\n2. You run `python3 run.py`\n3. You wait for it to complete (this took a few hours)\n4. The sqlite3 database file is now ready to be queried!\n\n## How does the sanitizer work?\n\n1. Create a preprocessing task for each batch of 10000 comments.\n2. Preprocess the string using the following steps (tunable via config.json)\n    1. Replace names with the tag \\\u003cname\u003e\n    2. Replace numbers with the tag \\\u003cnumber\u003e\n    3. Remove punctuation\n    4. Replace words that are not part of the provided dictionary with the tag \\\u003cunk\u003e\n3. 
Save the resulting text as sanitized_body in the sqlite3 db\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fbelval%2Freddit-json-dump-parser","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fbelval%2Freddit-json-dump-parser","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fbelval%2Freddit-json-dump-parser/lists"}