{"id":17099017,"url":"https://github.com/dinhanhx/rct","last_synced_at":"2025-03-23T18:12:43.396Z","repository":{"id":112393560,"uuid":"607972359","full_name":"dinhanhx/rct","owner":"dinhanhx","description":"r/cosplay title crawler","archived":false,"fork":false,"pushed_at":"2023-03-02T13:29:35.000Z","size":14,"stargazers_count":2,"open_issues_count":0,"forks_count":0,"subscribers_count":1,"default_branch":"main","last_synced_at":"2025-01-28T23:50:02.319Z","etag":null,"topics":["computer-vision","cosplay","dataset","image-captioning","nlp","python","reddit"],"latest_commit_sha":null,"homepage":"","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/dinhanhx.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2023-03-01T03:29:42.000Z","updated_at":"2023-03-01T11:19:01.000Z","dependencies_parsed_at":"2023-05-14T08:00:35.795Z","dependency_job_id":null,"html_url":"https://github.com/dinhanhx/rct","commit_stats":null,"previous_names":[],"tags_count":0,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/dinhanhx%2Frct","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/dinhanhx%2Frct/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/dinhanhx%2Frct/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/dinhanhx%2Frct/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/dinhanhx","download_url":"https://codeload.github.com/dinhanhx/rct/tar.gz/refs/heads/main
","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":245144972,"owners_count":20568056,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["computer-vision","cosplay","dataset","image-captioning","nlp","python","reddit"],"created_at":"2024-10-14T15:08:48.118Z","updated_at":"2025-03-23T18:12:43.378Z","avatar_url":"https://github.com/dinhanhx.png","language":"Python","funding_links":[],"categories":[],"sub_categories":[],"readme":"# r/cosplay title crawler\n\n[Available on Kaggle](https://www.kaggle.com/datasets/inhanhv/rcosplay-hot-top-images-with-titles)\n\nPlease take the time to read this entire README before using the dataset. Yes, I'm serious!\n\n# Setup\n\n```\npip install -e .\n```\n\nGo to [this PRAW doc page](https://praw.readthedocs.io/en/stable/getting_started/quick_start.html#prerequisites) and follow the instructions to get your client ID, client secret, and user agent.\n\nThen store them in `confidential/reddit.json` like this (don't actually write \"spooky\"):\n```json\n{\n    \"id\": \"spooky\",\n    \"secret\": \"spooky\",\n    \"user-agent\": \"windows-10:spooky:v0.0.1 (by u/spooky)\"\n}\n```\n\n# Run\n## Download all posts in top and hot \n(but [the number in each category is limited by Reddit](https://stackoverflow.com/a/54046328/13358358))\n- Output file: `data/cosplay.jsonl`\n- 2161 posts (on 01/03/2023)\n```\npython rct/crawl.py\n```\n\n## Clean text \n(in each post's title) enclosed by square brackets such as `[self]`, `[found]`, ... 
\n- Input file: `data/cosplay.jsonl`\n- Output file: `data/clean_cosplay.jsonl`\n```\npython rct/clean.py\n```\n\n## Download images \n- Input file: `data/clean_cosplay.jsonl`\n- Output files: `data/map_cosplay.jsonl`, `data/bad_response.jsonl`\n- 2160 downloaded images, 1 bad/deleted/deprecated image (on 02/03/2023)\n```\npython rct/download.py\n``` \n\n⚠ The values of the `image_id` and `image_path` attributes are NOT contiguous: the IDs of images that failed to download are skipped. For example,\n\nin `data/bad_response.jsonl`\n```python\n{\"image_id\": \"001912\", \"image_path\": \"data/image/001912.jpg\"}\n```\nand in `data/map_cosplay.jsonl`\n```python\n# omit other json objects \n{\"image_id\": \"001911\", \"image_path\": \"data/image/001911.jpg\"}\n{\"image_id\": \"001913\", \"image_path\": \"data/image/001913.jpg\"}\n# omit other json objects\n```\n\n⚠ The values of the `image_path` attribute have the form `data/image/*.jpg`. They refer to the folder `data`, which contains all `.jsonl` files and the `image` folder. The `data` folder is produced by the Python scripts.\n\n⚠ The values of the `image_path` attribute do NOT MATCH *the name of the folder containing all `.jsonl` files and the `image` folder on __Kaggle__*. When you load the data from the Kaggle dataset, replace the leading `data` in `data/image/000000.jpg` with the Kaggle path (see [this notebook](https://www.kaggle.com/code/inhanhv/rct-demo)). It becomes `/kaggle/input/rcosplay-hot-top-images-with-titles/image/000000.jpg`","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fdinhanhx%2Frct","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fdinhanhx%2Frct","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fdinhanhx%2Frct/lists"}