{"id":13468379,"url":"https://github.com/alex000kim/nsfw_data_scraper","last_synced_at":"2025-05-14T00:05:53.688Z","repository":{"id":37612293,"uuid":"165164444","full_name":"alex000kim/nsfw_data_scraper","owner":"alex000kim","description":"Collection of scripts to aggregate image data for the purposes of training an NSFW Image Classifier","archived":false,"fork":false,"pushed_at":"2024-01-21T23:49:42.000Z","size":8472,"stargazers_count":12416,"open_issues_count":9,"forks_count":2876,"subscribers_count":423,"default_branch":"main","last_synced_at":"2025-05-12T17:03:57.042Z","etag":null,"topics":["content-moderation","deep-learning","machine-learning","nsfw","nsfw-classifier","pornography"],"latest_commit_sha":null,"homepage":"","language":"Shell","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/alex000kim.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2019-01-11T02:21:40.000Z","updated_at":"2025-05-12T13:23:34.000Z","dependencies_parsed_at":"2024-08-07T14:14:27.848Z","dependency_job_id":"d02da7d3-0d0e-41e5-aa93-0142a52250a8","html_url":"https://github.com/alex000kim/nsfw_data_scraper","commit_stats":null,"previous_names":["alexkimxyz/nsfw_data_scraper"],"tags_count":0,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/alex000kim%2Fnsfw_data_scraper","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/alex000kim%2Fnsfw_data_scraper/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/alex000kim%2Fnsfw_data_scraper/rele
ases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/alex000kim%2Fnsfw_data_scraper/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/alex000kim","download_url":"https://codeload.github.com/alex000kim/nsfw_data_scraper/tar.gz/refs/heads/main","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":254043424,"owners_count":22004950,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["content-moderation","deep-learning","machine-learning","nsfw","nsfw-classifier","pornography"],"created_at":"2024-07-31T15:01:09.816Z","updated_at":"2025-05-14T00:05:53.671Z","avatar_url":"https://github.com/alex000kim.png","language":"Shell","funding_links":[],"categories":["Shell","Image Generation \u0026 Editing","其他_机器视觉"],"sub_categories":["网络服务_其他"],"readme":"# NSFW Data Scraper\n\n## Note: use with caution - the dataset is noisy\n\n## Description\n\nThis is a set of scripts that allows for an automatic collection of _tens of thousands_ of images for the following (loosely defined) categories to be later used for training an image classifier:\n- `porn` - pornography images\n- `hentai` - hentai images, but also includes pornographic drawings\n- `sexy` - sexually explicit images, but not pornography. 
Think nude photos, Playboy, bikini, etc.\n- `neutral` - safe for work neutral images of everyday things and people\n- `drawings` - safe for work drawings (including anime)\n\nHere is what each script (located under the `scripts` directory) does:\n- `1_get_urls_.sh` - iterates through text files under `scripts/source_urls`, downloading URLs of images for each of the 5 categories above. The `ripme` application performs all the heavy lifting. The source URLs are mostly links to various subreddits, but could be any website that Ripme supports.\n*Note*: I already ran this script for you, and its outputs are located in the `raw_data` directory. No need to rerun unless you edit files under `scripts/source_urls`.\n- `2_download_from_urls_.sh` - downloads the actual images for the URLs found in text files in the `raw_data` directory.\n- `3_optional_download_drawings_.sh` - (optional) script that downloads SFW anime images from the [Danbooru2018](https://www.gwern.net/Danbooru2018) database.\n- `4_optional_download_neutral_.sh` - (optional) script that downloads SFW neutral images from the [Caltech256](http://www.vision.caltech.edu/Image_Datasets/Caltech256/) dataset.\n- `5_create_train_.sh` - creates the `data/train` directory and copies all `*.jpg` and `*.jpeg` files into it from `raw_data`. Also removes corrupted images.\n- `6_create_test_.sh` - creates the `data/test` directory and moves `N=2000` random files for each class from `data/train` to `data/test` (change this number inside the script if you need a different train/test split). Alternatively, you can run it multiple times; each run moves another `N` images for each class from `data/train` to `data/test`.\n\n## Prerequisites\n\n- Docker\n\n## How to collect data\n\n```bash\n$ docker build . 
-t docker_nsfw_data_scraper\nSending build context to Docker daemon  426.3MB\nStep 1/3 : FROM ubuntu:18.04\n ---\u003e 775349758637\nStep 2/3 : RUN apt update  \u0026\u0026 apt upgrade -y  \u0026\u0026 apt install wget rsync imagemagick default-jre -y\n ---\u003e Using cache\n ---\u003e b2129908e7e2\nStep 3/3 : ENTRYPOINT [\"/bin/bash\"]\n ---\u003e Using cache\n ---\u003e d32c5ae5235b\nSuccessfully built d32c5ae5235b\nSuccessfully tagged docker_nsfw_data_scraper:latest\n$ # Next command might run for several hours. It is recommended to leave it overnight\n$ docker run -v $(pwd):/root/nsfw_data_scraper docker_nsfw_data_scraper scripts/runall.sh\nGetting images for class: neutral\n...\n...\n$ ls data\ntest  train\n$ ls data/train/\ndrawings  hentai  neutral  porn  sexy\n$ ls data/test/\ndrawings  hentai  neutral  porn  sexy\n```\n\n## How to train a CNN model\n- Install [fastai](https://github.com/fastai/fastai): `conda install -c pytorch -c fastai fastai`\n- Run `train_model.ipynb` top to bottom\n\n## Results\n\nI was able to train a CNN classifier to 91% accuracy with the following confusion matrix:\n\n![alt text](confusion_matrix.png)\n\nAs expected,  `drawings` and `hentai` are confused with each other more frequently than with other classes.\n\nSame with `porn` and `sexy` categories.\n\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Falex000kim%2Fnsfw_data_scraper","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Falex000kim%2Fnsfw_data_scraper","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Falex000kim%2Fnsfw_data_scraper/lists"}