https://github.com/dinhanhx/rct
r/cosplay title crawler
- Host: GitHub
- URL: https://github.com/dinhanhx/rct
- Owner: dinhanhx
- License: mit
- Created: 2023-03-01T03:29:42.000Z (over 2 years ago)
- Default Branch: main
- Last Pushed: 2023-03-02T13:29:35.000Z (over 2 years ago)
- Last Synced: 2025-01-28T23:50:02.319Z (8 months ago)
- Topics: computer-vision, cosplay, dataset, image-captioning, nlp, python, reddit
- Language: Python
- Homepage:
- Size: 13.7 KB
- Stars: 2
- Watchers: 1
- Forks: 0
- Open Issues: 0
Metadata Files:
- Readme: README.md
- License: LICENSE
# r/cosplay title crawler
[Available on Kaggle](https://www.kaggle.com/datasets/inhanhv/rcosplay-hot-top-images-with-titles)
Please take the time to read this entire README before using the dataset. Yes, I'm serious!
# Setup
```
pip install -e .
```
Go to [this PRAW doc page](https://praw.readthedocs.io/en/stable/getting_started/quick_start.html#prerequisites) and follow the instructions to get your client ID, client secret, and user agent.
Then store them in `confidential/reddit.json` like this (don't actually write "spooky"):
```json
{
"id": "spooky",
"secret": "spooky",
"user-agent": "windows-10:spooky:v0.0.1 (by u/spooky)"
}
```
# Run
## Download all posts in top and hot
(but [the number in each category is limited by Reddit](https://stackoverflow.com/a/54046328/13358358))
- Output file: `data/cosplay.jsonl`
- 2161 posts (on 01/03/2023)
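As a rough illustration of what this step involves (a sketch, not the actual contents of `rct/crawl.py`), the credentials file from the Setup section can be mapped onto PRAW's `client_id`, `client_secret`, and `user_agent` arguments, and the `top` and `hot` listings iterated with `limit=None`. The function names, the de-duplication by post ID, and the record fields (`id`, `title`, `url`) are assumptions made for this example.

```python
import json
from pathlib import Path


def load_credentials(path="confidential/reddit.json"):
    """Read the id / secret / user-agent keys shown in the Setup section."""
    with open(path, encoding="utf-8") as f:
        creds = json.load(f)
    # Rename the keys to the keyword arguments that praw.Reddit() accepts.
    return {
        "client_id": creds["id"],
        "client_secret": creds["secret"],
        "user_agent": creds["user-agent"],
    }


def crawl(out_path="data/cosplay.jsonl"):
    """Write one JSON object per post found in r/cosplay's top and hot."""
    # Imported here so load_credentials() works without PRAW installed.
    import praw

    reddit = praw.Reddit(**load_credentials())
    subreddit = reddit.subreddit("cosplay")
    seen = set()  # assumption: a post listed in both top and hot is kept once
    Path(out_path).parent.mkdir(parents=True, exist_ok=True)
    with open(out_path, "w", encoding="utf-8") as out:
        # limit=None asks for as many posts as Reddit will return per listing.
        for listing in (subreddit.top(limit=None), subreddit.hot(limit=None)):
            for post in listing:
                if post.id in seen:
                    continue
                seen.add(post.id)
                # Hypothetical record layout; the real script may store more.
                record = {"id": post.id, "title": post.title, "url": post.url}
                out.write(json.dumps(record) + "\n")
```

Calling `crawl()` needs network access and valid credentials, so expect the post count to differ from the figure above.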
```
python rct/crawl.py
```
## Clean text
(in each post's title) enclosed by square brackets, such as `[self]`, `[found]`, ...
- Input file: `data/cosplay.jsonl`
- Output file: `data/clean_cosplay.jsonl`
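The kind of cleaning described here can be sketched with a regular expression. This assumes the goal is to remove the bracketed tags entirely; the pattern and the `clean_title` helper are illustrative and may not match what `rct/clean.py` does.

```python
import re

# Matches one bracketed tag plus any whitespace around it, e.g. "[self] ".
BRACKETED = re.compile(r"\s*\[[^\]]*\]\s*")


def clean_title(title):
    """Drop bracketed tags and collapse the leftover whitespace."""
    without_tags = BRACKETED.sub(" ", title)
    return " ".join(without_tags.split())
```

For example, `clean_title("[self] Tifa Lockhart from Final Fantasy VII")` returns `"Tifa Lockhart from Final Fantasy VII"`.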
```
python rct/clean.py
```
## Download images
- Input file: `data/clean_cosplay.jsonl`
- Output files: `data/map_cosplay.jsonl`, `data/bad_response.jsonl`
- 2160 downloaded images, 1 bad/deleted/deprecated image (on 02/03/2023)
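One plausible way for a download step to end up with these two outputs is sketched below. It assumes each post gets a six-digit ID from its position in the input, so a failed download keeps its ID and lands in the "bad" list. The `download_all` function, its `fetch` parameter, and the `url` field are hypothetical, not taken from `rct/download.py`.

```python
from pathlib import Path


def download_all(records, fetch, root="."):
    """Save one image per record; return (mapped, bad) lists of records.

    `fetch` is any callable that takes a URL and returns the image bytes,
    raising an exception when the image cannot be retrieved.
    """
    mapped, bad = [], []
    for index, record in enumerate(records):
        # Assumption: the ID is the post's position, so a failed download
        # leaves a gap in the IDs of the successfully mapped images.
        image_id = f"{index:06d}"
        image_path = f"data/image/{image_id}.jpg"
        entry = dict(record, image_id=image_id, image_path=image_path)
        target = Path(root) / image_path
        target.parent.mkdir(parents=True, exist_ok=True)
        try:
            target.write_bytes(fetch(record["url"]))
        except Exception:
            bad.append(entry)
        else:
            mapped.append(entry)
    return mapped, bad
```

A simple `fetch` could be `lambda url: urllib.request.urlopen(url, timeout=30).read()` from the standard library. Passing it in as a parameter keeps the ID logic testable without network access.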
```
python rct/download.py
```
⚠ The values of the `image_id` and `image_path` attributes are NOT contiguous. For example,
in `data/bad_response.jsonl`
```python
{"image_id": "001912", "image_path": "data/image/001912.jpg"}
```
and in `data/map_cosplay.jsonl`
```python
# omit other json objects
{"image_id": "001911", "image_path": "data/image/001911.jpg"}
{"image_id": "001913", "image_path": "data/image/001913.jpg"}
# omit other json objects
```
⚠ The values of the `image_path` attribute have the form `data/image/*.jpg`. They start with the `data` folder, which contains all `.jsonl` files and the `image` folder, so they resolve from the directory in which the scripts were run. The `data` folder is produced by the Python scripts.
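Given the two caveats so far, a loader is safer looking records up by `image_id` than by line position, and resolving `image_path` against an explicit root directory. The helpers below are a minimal sketch of that idea. Their names are made up for this example, and they rely only on the `image_id` and `image_path` fields shown above.

```python
import json
from pathlib import Path


def read_jsonl(path):
    """Return one dict per non-empty line of a JSON Lines file."""
    with open(path, encoding="utf-8") as f:
        return [json.loads(line) for line in f if line.strip()]


def index_by_image_id(records):
    """Key records by their `image_id` string instead of by list position."""
    return {record["image_id"]: record for record in records}


def missing_ids(records):
    """List the six-digit IDs absent between the smallest and largest ID."""
    present = {int(record["image_id"]) for record in records}
    return [
        f"{i:06d}"
        for i in range(min(present), max(present) + 1)
        if i not in present
    ]


def local_image_path(record, root="."):
    """Resolve `image_path` against the directory that contains `data`."""
    return Path(root) / record["image_path"]
```

With the two `map_cosplay.jsonl` records from the example above, `missing_ids` returns `["001912"]`, the ID that appears in `data/bad_response.jsonl`.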
⚠ The values of the `image_path` attribute do NOT match *the name of the folder containing all `.jsonl` files and the `image` folder on __Kaggle__*. When you load the data from the Kaggle Dataset, replace the leading `data` in a path such as `data/image/000000.jpg` with the Kaggle path (see [this notebook](https://www.kaggle.com/code/inhanhv/rct-demo)), so that it becomes `/kaggle/input/rcosplay-hot-top-images-with-titles/image/000000.jpg`.
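The replacement described above can be written as a small helper. This is a sketch and not necessarily how the linked notebook does it. `to_kaggle_path` is a name chosen for this example, and the default root is the Kaggle path quoted above.

```python
from pathlib import PurePosixPath

KAGGLE_ROOT = "/kaggle/input/rcosplay-hot-top-images-with-titles"


def to_kaggle_path(image_path, kaggle_root=KAGGLE_ROOT):
    """Swap the leading `data` component for the Kaggle dataset folder."""
    parts = PurePosixPath(image_path).parts
    if not parts or parts[0] != "data":
        raise ValueError(f"expected a path starting with 'data/': {image_path!r}")
    return str(PurePosixPath(kaggle_root, *parts[1:]))
```

Checking the first path component, rather than calling `str.replace("data", ...)`, leaves any later occurrence of `data` in the path untouched.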