{"id":37142167,"url":"https://github.com/rlleshi/phar","last_synced_at":"2026-01-14T16:41:27.090Z","repository":{"id":40586120,"uuid":"456001091","full_name":"rlleshi/phar","owner":"rlleshi","description":"deep learning sex position classifier","archived":false,"fork":false,"pushed_at":"2024-03-31T02:01:14.000Z","size":1808,"stargazers_count":270,"open_issues_count":5,"forks_count":28,"subscribers_count":9,"default_branch":"master","last_synced_at":"2025-06-01T16:20:05.901Z","etag":null,"topics":["action-recognition","deep-learning","human-action-recognition","porn-filter","pornhub","pytorch","sex","sex-classifier","video-classification","video-understanding"],"latest_commit_sha":null,"homepage":"","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"apache-2.0","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/rlleshi.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null}},"created_at":"2022-02-05T22:30:30.000Z","updated_at":"2025-05-24T00:22:20.000Z","dependencies_parsed_at":"2022-08-09T23:50:12.393Z","dependency_job_id":"466f9586-e4fc-44d1-a1dd-d541d3f70f95","html_url":"https://github.com/rlleshi/phar","commit_stats":null,"previous_names":[],"tags_count":1,"template":false,"template_full_name":null,"purl":"pkg:github/rlleshi/phar","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/rlleshi%2Fphar","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/rlleshi%2Fphar/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/rlleshi%2Fphar/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/rlleshi%2Fphar/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/G
itHub/owners/rlleshi","download_url":"https://codeload.github.com/rlleshi/phar/tar.gz/refs/heads/master","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/rlleshi%2Fphar/sbom","scorecard":null,"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":286080680,"owners_count":28426157,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2026-01-14T16:38:47.836Z","status":"ssl_error","status_checked_at":"2026-01-14T16:34:59.695Z","response_time":107,"last_error":"SSL_connect returned=1 errno=0 peeraddr=140.82.121.5:443 state=error: unexpected eof while reading","robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":false,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["action-recognition","deep-learning","human-action-recognition","porn-filter","pornhub","pytorch","sex","sex-classifier","video-classification","video-understanding"],"created_at":"2026-01-14T16:41:26.424Z","updated_at":"2026-01-14T16:41:27.082Z","avatar_url":"https://github.com/rlleshi.png","language":"Python","funding_links":[],"categories":[],"sub_categories":[],"readme":"# P-HAR: Porn Human Action Recognition\n\n## Update\n\n:star: In the meantime, I've trained models that surpass **94%** accuracy on [20 action categories](https://github.com/rlleshi/phar/blob/master/resources/annotations/current_annotations.txt). They are readily available via an easy-to-use API. [Get in touch](mailto:phar.ai@protonmail.com) for more details!\n\nHow this AI can benefit you:\n\n1. 
🏷️ **Automated Tagging**: Can easily extend to more categories as required.\n2. ⏱️ **Automated Timestamp Generation**: Allows users to swiftly navigate to any section of the video, offering a user-friendly experience akin to YouTube's.\n3. 🔍 **Improved Recommendation System**: Enhances content suggestions by analyzing the occurrences and timings within the video, providing more relevant and tailored recommendations.\n4. 🚫 **Content Filtering**: Facilitates the filtering of specific content, such as non-sexual content, or certain actions and positions, allowing for a more personalized user experience.\n5. 🎞️ **Shorts**: Enables the extraction of specific actions from videos to create concise and engaging clips, a feature particularly popular among Gen Z users.\n\nIf you're interested in some of the technical details of the first version, read on!\n\n## Introduction\n\nThis is just a fun side project to see how state-of-the-art (SOTA) Human Action Recognition (HAR) models fare in the pornographic domain. HAR is a relatively new, active field of research in deep learning whose goal is to identify human actions from various input streams (e.g. video or sensor).\n\nThe pornography domain is interesting from a technical perspective because of its inherent difficulties. Lighting variations, occlusions, and a tremendous variety of camera angles and filming techniques (POV, dedicated camera person) make position (action) recognition hard. Two identical positions (actions) can be captured from such different camera perspectives that the model is entirely confused in its predictions.\n\nThis repository uses three different input streams in order to get the best possible results: RGB frames, human skeleton, and audio. 
Correspondingly, three different models are trained on these input streams and their results are merged through late fusion.\n\nThe best accuracy reached by this multimodal model is currently **75.64%**, which is promising considering the small [training set](#dataset). This result will be improved in the future.\n\nThe models work on spatio-temporal data, meaning that they process video clips rather than single images ([miles-deep](https://github.com/ryanjay0/miles-deep), for example, uses single images). This is an inherently superior way of performing action recognition.\n\nCurrently, 17 actions are supported. You can find the complete list [here](resources/annotations/annotations.txt). More data would be needed to further improve the models (help is welcome). Read on for more information!\n\n## Supported Features\n\nFirst download the human detector [here](http://download.openmmlab.com/mmdetection/v2.0/faster_rcnn/faster_rcnn_r50_fpn_2x_coco/faster_rcnn_r50_fpn_2x_coco_bbox_mAP-0.384_20200504_210434-a5d8aa15.pth), pose model [here](https://download.openmmlab.com/mmpose/top_down/hrnet/hrnet_w32_coco_256x192-c78dce93_20200708.pth), and HAR models [here](https://github.com/rlleshi/phar/releases/tag/v1.0.0). 
Then move them inside the `checkpoints/har` folder.\n\nOr just use a docker container from the [image](#docker).\n\n### Video Demo\n\nInput a video and get a demo with the top predictions every 7 seconds by default.\n\n`python src/demo/multimodial_demo.py video.mp4 demo.mp4`\n\nAlternatively, the results can be dumped to a JSON file by specifying a JSON output file (e.g. `demo.json`).\n\nIf you only want to use the RGB \u0026 Skeleton models, then you can disable the audio model like so:\n\n`python src/demo/multimodial_demo.py video.mp4 demo.json --audio-checkpoint '' --coefficients 0.5 1.0 --verbose`\n\nCheck out the [detailed usage](#multimodial-demo).\n\n### Timestamp Generator\n\nUse the flag `--timestamps`:\n\n`python src/demo/multimodial_demo.py video.mp4 demo.json --timestamps`\n\n### Tag Generator\n\nGiven the predictions generated by the multimodial demo (in JSON), we can grab the top 3 tags (by default) like so:\n\n`python src/top_tags.py demo.json`\n\nCheck out the [detailed usage](#late-fusion).\n\n### Content Filtering\n\nTODO: depends on whether people need it.\n\n### Deployment\n\nDepends on whether people find this project useful. Currently one has to install the relevant libraries to use these models. See the installation section below.\n\n## Motivation \u0026 Usages\n\nThe idea behind this project is to try and apply the latest deep learning techniques (i.e. [human action recognition](https://scholar.google.com/scholar?hl=en\u0026as_sdt=0%2C5\u0026q=human+action+recognition\u0026btnG=)) in the pornographic domain.\n\nOnce we have detailed information about the kind of actions/positions that are happening in a video, a number of use cases can apply:\n\n1. Improving the recommender system\n2. Automatic tag generator\n3. Automatic timestamp generator (when does an action start and finish)\n4. Cutting content out (for example non-sexual content)\n\n## Installation\n\n### Docker\n\nBuild the docker image: `docker build -f docker/Dockerfile . 
-t rlleshi/phar`\n\n### Manual Installation\n\nThis project is based on [MMAction2](https://github.com/open-mmlab/mmaction2).\n\nThe following installation instructions are for Ubuntu (hence should also work for Windows WSL). Check the links for details if you are interested in other operating systems.\n\n0. Clone this repo and its submodules: `git clone --recurse-submodules git@github.com:rlleshi/phar.git` and then create an environment with Python 3.8+.\n1. Install torch (of course, it is recommended that you have CUDA \u0026 CUDNN installed).\n2. Install the correct version of `mmcv` based on your CUDA \u0026 Torch, e.g. `pip install mmcv-full==1.3.18 -f https://download.openmmlab.com/mmcv/dist/cu113/torch1.10.0/index.html`\n3. Install mmaction2: `cd mmaction2/ \u0026\u0026 pip install cython --no-cache-dir \u0026\u0026 pip install --no-cache-dir -e .`\n4. Install MMPose, [link](https://mmpose.readthedocs.io/en/latest/install.html).\n5. Install MMDetection, [link](https://mmdetection.readthedocs.io/en/latest/get_started.html#installation).\n6. Install extra dependencies: `pip install -r requirements/extra.txt`.\n\n## Models\n\nThe SOTA results are achieved by late-fusing three models based on three input streams. This results in significant improvements compared to only using an RGB-based model. Since more than one action might happen at the same time (and moreover, currently, some of the actions/positions are conceptually overlapping), it is best to consider the top 2 accuracy as a performance measurement. Hence, currently the multimodal model has a `~75%` accuracy. However, since the dataset is quite small and in total only ~50 experiments have been performed, there is a lot of room for improvement.\n\n### Multi-Modial (Rgb + Skeleton + Audio)\n\nThe best performing models (performance \u0026 runtime wise) are `timesformer` for the RGB stream, `poseC3D` for the skeleton stream, and `resnet101` for the audio stream. 
The results of these models are fused together through late fusion. The models do not have the same importance in the late fusion scoring scheme. Currently the fine-tuned weights are: `0.5; 0.6; 1.0` for the RGB, skeleton \u0026 audio model respectively.\n\nAnother approach would be to train a model with two of the input streams at a time (i.e. rgb+skeleton \u0026 rgb+audio) and then perhaps combine their results. But this wouldn't work due to the nature of the data. When it comes to the audio input streams, it can only be exploited for certain actions (e.g. `deepthroat` due to the gag reflex or `anal` due to a higher pitch), while for others it's not possible to derive any insight from their audio (e.g. missionary, doggy and cowgirl do not have any special characteristics to set them apart from an audio perspective).\n\nLikewise, the skeleton-based model can only be used in those instances where the pose estimation is accurate above a certain confidence threshold (for these experiments the threshold used was 0.4). For example, for actions such as `scoop-up` or `the-snake` it's hard to get an accurate pose estimation in most camera angles due to the proximity of the human bodies in the frame (the poses get fuzzy and mixed up). This then influences the accuracy of the HAR model negatively. However, for actions such as doggy, cowgirl or missionary, the pose estimation is generally good enough to train a HAR model.\n\nHowever, if we have a bigger dataset, then we will probably have enough instances of clean samples for the difficult actions such as to train all (17) of them with a skeleton-based model. Skeleton based models are according to the current SOTA literature superior to the rgb-based ones. 
Ideally of course, the pose estimation models should also be fine tuned in the sex domain in order to get a better overall pose estimation.\n\n#### Metrics\n\nAccuracy | Weights\n--- | ---\nTop 1 Accuracy: 0.6362 \u003cbr\u003e Top 2 Accuracy: 0.7524 \u003cbr\u003e Top 3 Accuracy: 0.8155 \u003cbr\u003e Top 4 Accuracy: 0.8521 \u003cbr\u003e Top 5 Accuracy: 0.8771 | Rgb: 0.5 \u003cbr\u003e Skeleton: 0.6 \u003cbr\u003e Audio: 1.0\n\n### RGB model - [TimeSformer](https://arxiv.org/abs/2102.05095)\n\nThe best results for a 3D RGB model are achieved by the attention-based TimeSformer architecture. This model is also very fast in inference (~0.53s / 7s clips).\n\n#### Metrics\n\nAccuracy | Training Speed | Complexity\n--- | --- | ---\ntop1_acc 0.5669 \u003cbr\u003e top2_acc 0.6834 \u003cbr\u003e top3_acc 0.7632 \u003cbr\u003e top4_acc 0.8096 \u003cbr\u003e top5_acc 0.8411 | Avg iter time: 0.3472 s/iter | Flops: 100.96 GFLOPs \u003cbr\u003e Params: 121.27 M\n\n#### Loss\n\n![alt text](resources/metrics/timesformer_loss.jpg)\n\n#### Classes\n\nAll 17 annotations. See [annotations](resources/annotations/annotations.txt).\n\n### Skeleton model - [PoseC3D](https://arxiv.org/abs/2104.13586)\n\nThe best results for a skeleton-based model are achieved by the CNN-based PoseC3D architecture. This model is also fast in inference (~3.3s / 7s clips).\n\n#### Metrics\n\nAccuracy | Training Speed | Complexity\n--- | --- | ---\ntop1_acc 0.8130 \u003cbr\u003e top2_acc 0.9191 \u003cbr\u003e top3_acc 0.9748 | Avg iter time: 0.8616 s/iter| Flops: 17.83 GFLOPs \u003cbr\u003e Params: 2.0 M\n\nCheck the [confusion matrix](resources/metrics/skeleton_cm.png) for a detailed overview of the performance.\n\n#### Loss\n\n![alt text](resources/metrics/posec3d_loss.jpg)\n\n#### Classes\n\n6 annotations. 
See [annotations](resources/annotations/annotations_pose.txt).\n\n### Audio Model - Simple ResNet based on [Audiovisual SlowFast](https://arxiv.org/abs/2001.08740)\n\nA simple ResNet 101 (with some small tweaks) was used. This model definitely needs to be swapped with a better architecture. It is very fast in inference (0.05s / 7s audio clips).\n\n#### Metrics\n\nAccuracy | Training Speed\n--- | ---\ntop1_acc 0.6867 \u003cbr\u003e top2_acc 0.9038 \u003cbr\u003e top3_acc 0.9663 | Avg iter time: 0.2747 s/iter\n\nCheck the [confusion matrix](resources/metrics/audio_cm.png) for a detailed overview of the performance.\n\n#### Loss\n\n![alt text](resources/metrics/audio_loss.jpg)\n\n#### Classes\n\n4 annotations. See [annotations](resources/annotations/annotations_audio.txt).\n\n## Dataset\n\nFirst things first, [here](https://www.womenshealthmag.com/sex-and-love/a19943165/sex-positions-guide/) is a list of definitions of the sex positions used in this project in case there is any confusion. `fondling`, in addition to the meaning of the word, was also thought of as a general placeholder, e.g. when it is unclear what action there is. In reality, however, its ability to be a general placeholder is limited because I only got 48 minutes of data for this action.\n\nThe gathered dataset is very inclusive and consists of a variety of recordings such as POV, professionally filmed, amateur, with or without a dedicated camera person, etc. It also includes all kinds of environments, people, and camera angles. The problem is probably much easier to solve if only professional recordings with a dedicated camera person are used and hence this was avoided.\n\nIn general, a train/val split of 0.8/0.2 was used for all the datasets. The length of the clips in training \u0026 validation sets currently is 7 seconds (the main motivation was to include the more ephemeral actions such as `cumshot` or `kissing`). In total there were around 600 videos amounting to **2674** minutes of footage. 
Check out the [annotation distribution](resources/annotation_distribution(min).json) in time (minutes) for each of the 17 classes for more information. The dataset was not perfectly annotated but the number of wrong annotations should be small and hence the drop in performance should be minimal.\n\nIn general, it can be said that this is a small dataset. Normally ~44 hours of footage would be enough for 17 actions. However, each position has a tremendous variety when it comes to camera perspectives, which makes the recognition task hard if there aren't enough samples. This would also mean that we should ideally have the same amount of footage for each different perspective. However, labeling the dataset was already very time-consuming and I didn't keep track of this point.\n\nA HAR model trained on 3D poses might be able to solve this camera-perspective problem. However, because 3D pose estimation is less accurate than 2D pose estimation, and I already noticed problems with the accuracy of 2D pose estimation (see [here](#2d-pose)), this has not been tried (yet). Ideally, however, if the dataset is big enough then the camera-perspective problem should be naturally solved.\n\nThe dataset is also slightly imbalanced, which makes the RGB models slightly biased towards the positions (actions) that have more data.\n\nIf you'd like to help with doubling the current size of the dataset, please do open an issue.\n\n### RGB\n\nIn total there are ~17.6K training clips and ~4.9K val clips. [This](resources/ann_dist_clip.jpg) plot shows the number of clips for each class. RGB is considered the core input modality given that the audio modality is only applied to four classes and that the skeleton modality is rather fickle because of the accuracy of 2D pose estimation. Various data augmentation techniques were applied such as rescaling, cropping, flipping, color inversion, gaussian blur, elastic transformation, affine transformation, etc. 
This further improves the accuracy of the model.\n\n### (2D) Pose\n\nDue to the variety of positions and camera angles, which make the pose estimation difficult as human bodies overlap and are too close, it's only feasible to apply HAR on skeleton data on a few of the actions. The clips generated for the RGB dataset were filtered based on two criteria:\n\n1. The confidence of the pose information. Minimal confidence of 0.4 was chosen.\n\n2. The number of frames in a clip that have a confidence higher than the minimal confidence score. Here a 0.4 rate was also used. In other words, if we have a 7s clip of 210 frames and only 70 frames have pose information with confidence higher than 0.4, then we exclude this clip from the pose dataset because only 33% of the frames have a confidence higher than 0.4 and our minimum threshold is 40%.\n\nAs a result, the pose dataset is significantly smaller than the original RGB dataset. Whereas there are about 4.9K testing clips for the RGB dataset, the pose dataset has only 815 clips. Therefore a bigger dataset is a must here so that we are able to train the skeleton model on all 17 actions.\n\n### Audio\n\nAs a preliminary pre-processing step audios that are not loud enough were first pruned from the dataset. 
The best results were achieved by pruning the quietest 20% of the audio clips.\n\nIn total there are about 5.9K training clips \u0026 1.5K validation clips.\n\n## Script Docs\n\n### Multimodial Demo\n\n```shell\npython src/demo/multimodial_demo.py ${VIDEO_FILE} ${OUTPUT_FILE} \\\n    [--det-config ${HUMAN_DETECTION_CONFIG_FILE}] \\\n    [--det-checkpoint ${HUMAN_DETECTION_CHECKPOINT}] \\\n    [--pose-config ${HUMAN_POSE_ESTIMATION_CONFIG_FILE}] \\\n    [--pose-checkpoint ${HUMAN_POSE_ESTIMATION_CHECKPOINT}] \\\n    [--skeleton-config ${SKELETON_BASED_ACTION_RECOGNITION_CONFIG_FILE}] \\\n    [--skeleton-checkpoint ${SKELETON_BASED_ACTION_RECOGNITION_CHECKPOINT}] \\\n    [--rgb-config ${RGB_BASED_ACTION_RECOGNITION_CONFIG_FILE}] \\\n    [--rgb-checkpoint ${RGB_BASED_ACTION_RECOGNITION_CHECKPOINT}] \\\n    [--audio-config ${AUDIO_BASED_ACTION_RECOGNITION_CONFIG_FILE}] \\\n    [--audio-checkpoint ${AUDIO_BASED_ACTION_RECOGNITION_CHECKPOINT}] \\\n    [--det-score-thr ${HUMAN_DETECTION_SCORE_THRE}] \\\n    [--label-maps ${LIST_OF_ACTION_ANNOTATION_FILES}] \\\n    [--num-processes ${NUM_PROC_USED_FOR_SUBCLIP_EXTRACTION}] \\\n    [--subclip-len ${PREDICTION_WINDOW}] \\\n    [--device ${DEVICE}] \\\n    [--coefficients ${COEFFICIENT_WEIGHTS}] \\\n    [--pose-score-thr ${POSE_ESTIMATION_SCORE_THRESHOLD}] \\\n    [--correct-rate ${RATE_OF_CORRECT_FRAMES_FOR_SKELETON_RECOGNITION}] \\\n    [--loudness-weights ${LOUDNESS_THRESHOLD_FOR_AUDIOS}] \\\n    [--topk ${TOP_K_ACCURACY}] \\\n    [--timestamps] \\\n    [--verbose]\n```\n\n### Late Fusion\n\n```shell\npython src/top_tags.py ${JSON_FILE} \\\n    [--topk ${TOP_K_ACCURACY}] \\\n    [--label-map ${ANNOTATION_FILE}]\n```\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Frlleshi%2Fphar","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Frlleshi%2Fphar","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Frlleshi%2Fphar/lists"}