{"id":13585504,"url":"https://github.com/nielstenboom/recurring-content-detector","last_synced_at":"2025-04-07T10:31:08.968Z","repository":{"id":37644966,"uuid":"189591307","full_name":"nielstenboom/recurring-content-detector","owner":"nielstenboom","description":"Unsupervised detection of opening / closing credits, recaps, and previews in video files 🎥🍿🎬","archived":false,"fork":false,"pushed_at":"2023-02-25T13:37:58.000Z","size":32307,"stargazers_count":91,"open_issues_count":3,"forks_count":13,"subscribers_count":3,"default_branch":"master","last_synced_at":"2024-11-06T03:42:56.738Z","etag":null,"topics":["closing-credits","credits","detect-recaps","detector","feature-vectors","ffmpeg","intro","recaps","season","skip","tv-shows","video","video-files"],"latest_commit_sha":null,"homepage":"","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/nielstenboom.png","metadata":{"files":{"readme":"README.MD","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null}},"created_at":"2019-05-31T12:36:30.000Z","updated_at":"2024-08-30T18:56:33.000Z","dependencies_parsed_at":"2024-01-22T00:02:42.826Z","dependency_job_id":null,"html_url":"https://github.com/nielstenboom/recurring-content-detector","commit_stats":null,"previous_names":[],"tags_count":5,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/nielstenboom%2Frecurring-content-detector","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/nielstenboom%2Frecurring-content-detector/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/nielstenboom%2Frecurring-content-detector/
releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/nielstenboom%2Frecurring-content-detector/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/nielstenboom","download_url":"https://codeload.github.com/nielstenboom/recurring-content-detector/tar.gz/refs/heads/master","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":247636105,"owners_count":20970860,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["closing-credits","credits","detect-recaps","detector","feature-vectors","ffmpeg","intro","recaps","season","skip","tv-shows","video","video-files"],"created_at":"2024-08-01T15:04:58.958Z","updated_at":"2025-04-07T10:31:06.901Z","avatar_url":"https://github.com/nielstenboom.png","language":"Python","funding_links":[],"categories":["HarmonyOS","Python"],"sub_categories":["Windows Manager"],"readme":"# Recurring content detector (credits, recaps and previews detection)\n\nThis repository contains the code that was used to conduct experiments for a [master's thesis](https://github.com/nielstenboom/masterthesis/raw/master/main.pdf). The goal was to detect recaps, opening credits, closing credits and previews from video files in an unsupervised manner. 
This can be used to automate the labeling for the skip functionality of a VOD streaming service, for example.\n\n## Quickstart with Docker\n\nTo try this out with Docker on your own data, copy the video files of one season to a folder in the current directory named `videos`, then create a script `detect.py` with the contents:\n\n```python\nimport recurring_content_detector as rcd\n\nresults = rcd.detect(\"videos\")\n\nprint(results)\n```\nNow run the detector with the following Docker command:\n```\ndocker run -it -v \"$(pwd)\":/opt/recurring-content-detector nielstenboom/recurring-content-detector:latest python detect.py\n```\nIt first downsizes the videos using ffmpeg, then converts them to feature vectors, and finally applies the algorithm that detects the recurring content. The `results` variable contains the intervals of all the recurring parts.\n\n## Local Installation\n\nTo install the package, run:\n\n```bash\npip install mkl\npip install git+https://github.com/nielstenboom/recurring-content-detector.git@v1.1.1\n```\n\nIt is also possible to build a Docker container that does all the steps for you with the [Dockerfile](Dockerfile) in the directory.\n\nMake sure [ffmpeg](https://ffmpeg.org/) is in the PATH variable.\n\nRun `pip uninstall recurring-content-detector` to uninstall the package.\n\n## Recurring content detection\n\nWith this code it is possible to detect recaps, opening credits, closing credits and previews in the video files of a TV show, unsupervised, to a certain extent. The following image gives a schematic overview of how it works: \n\n\u003cp align=\"center\"\u003e\n\u003cimg src=\"images/thesisdiagram.png?raw=true\"\u003e\n\u003c/p\u003e\n\nYou can run the detector in a Python program in the following way:\n\n```python\nimport recurring_content_detector as rcd\nrcd.detect(\"/directory/with/season/videofiles\")\n```\nThis will run the detection by building the color histogram feature vectors. 
Make sure the video files you use sort alphabetically in the same order as they play in the season! So episode_1 -\u003e episode_2 -\u003e episode_3 -\u003e etc. You'll get weird results otherwise.\n\n\nThe feature vector function can also be changed:\n```python\n# options for the function are [\"CH\", \"CTM\"]\nrcd.detect(\"/directory/with/season/videofiles\", feature_vector_function=\"CTM\")\n```\n\nThe `detect` function has many more parameters that can be tweaked; the defaults are the values I got the best results with in my experiments.\n\n```python\ndef detect(video_dir, feature_vector_function=\"CH\", annotations=None, artifacts_dir=None, \n            framejump=3, percentile=10, resize_width=320, video_start_threshold_percentile=20, \n            video_end_threshold_seconds=15, min_detection_size_seconds=15):\n\"\"\"\nThe main function to call to detect recurring content. Resizes videos, converts to feature vectors\nand returns the locations of recurring content within the videos.\n\narguments\n---------\nvideo_dir : str\n    Variable that should have the folder location of one season of video files.\n\nannotations : str\n    Location of the annotations.csv file, if annotations is given then it will evaluate \n    the detections with the annotations.\n    \nfeature_vector_function : str\n    Which type of feature vectors to use, options: [\"CH\", \"CTM\", \"CNN\"], default is color histograms (CH) \n    because of balance between speed and accuracy. This default is defined in init.py.\n\nartifacts_dir : str\n    Directory location where the artifacts should be saved. Default location is the location defined \n    with the video_dir parameter.\n\nframejump : int\n    The frame interval to use when sampling frames for the detection, a higher number means that \n    fewer frames will be taken into consideration and will improve the processing time. 
\n    But will probably cost accuracy.\n\npercentile : int\n    Which percentile of the best matches will be taken into consideration as recurring content. \n    A high percentile means a higher recall and lower precision. \n    A low percentile means a lower recall and higher precision.\n\nresize_width: int\n    Width to which the videos will be resized. A lower number means higher processing speed but \n    less accuracy and vice versa.\n\nvideo_start_threshold_percentile: int\n    Percentage of the start of the video in which the detections will be marked as detections. \n    As recaps and opening credits only occur at the first parts of video files, this parameter can alter \n    that threshold. So putting 20 in here means that if we find recurring content in the first 20% of \n    frames of the video, it will be marked as a detection. If it's detected later than 20%, then the \n    detection will be ignored.\n\nvideo_end_threshold_seconds: int\n    Number of seconds threshold in which the final detection at the end of the video should end for it \n    to count. Putting 15 here means that a detection at the end of a video will only be marked as a \n    detection if the detection ends in the last 15 seconds of the video.\n\nmin_detection_size_seconds: int\n    Minimum number of seconds a detection should last before it counts as a detection. 
As credits \u0026 \n    recaps \u0026 previews generally never consist of just a few seconds, it's wise to pick a number \n    higher than 10.\n\nreturns\n-------\ndictionary\n    dictionary with a list of detection timestamps in seconds for every video file name\n    \n    {\"episode1.mp4\" : [(start1, end1), (start2, end2)], \n     \"episode2.mp4\" :  [(start1, end1), (start2, end2)],\n     ...\n     \"episode10.mp4\" : [(start1, end1), (start2, end2)]\n    }\n    \"\"\"\n```\n\n## Annotations\n\nIf you want to quantitatively test how well this works on your own data, fill in the [annotations](annotations_example.csv) file and supply it via the `annotations` parameter.\n```python\nrcd.detect(\"directory/with/season/videofiles\", annotations=\"path/to/annotations.csv\")\n```\n\nWhen the program is done, it prints output in the same format as the example below:\n\n ```\nDetections for: episode1.mp4\n0:01:21.600000           -               0:02:20.880000\n0:02:49.320000           -               0:03:15.480000\n\nDetections for: episode2.mp4\n0:00:00                  -               0:01:16.920000\n0:01:38.040000           -               0:02:37.440000\n\nDetections for: episode3.mp4\n0:00:00                  -               0:01:27.840000\n0:01:51.120000           -               0:02:50.760000\n0:42:26.400000           -               0:42:54\n\nTotal precision = 0.862\nTotal recall = 0.853\n ```\n\n## Tests\n\nThere are a few tests in the test directory. They can also be run in the Docker container:\n```\ndocker run -it -v \"$(pwd)\":/opt/recurring-content-detector nielstenboom/recurring-content-detector:latest python -m pytest\n```\n \n## Credits\n- https://github.com/facebookresearch/faiss for the efficient matching of the feature vectors \n\n## Final words\nIf you use and like my project or want to discuss something related, I would  ❤️  to hear about it! 
You can send me an email at nielstenboom@gmail.com.\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fnielstenboom%2Frecurring-content-detector","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fnielstenboom%2Frecurring-content-detector","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fnielstenboom%2Frecurring-content-detector/lists"}