https://github.com/natliblux/warc-safe
A tool for detecting viruses and NSFW material in WARC files
- Host: GitHub
- URL: https://github.com/natliblux/warc-safe
- Owner: natliblux
- Created: 2024-05-03T06:24:50.000Z (about 2 months ago)
- Default Branch: main
- Last Pushed: 2024-05-03T08:00:19.000Z (about 2 months ago)
- Last Synced: 2024-05-04T12:20:20.496Z (about 2 months ago)
- Topics: antivirus, nsfw-classifier, warc, warc-safe, webarchiving
- Language: Python
- Homepage:
- Size: 487 KB
- Stars: 2
- Watchers: 4
- Forks: 0
- Open Issues: 0
Metadata Files:
- Readme: README.md
Lists
- awesome-web-archiving - warc-safe - Automatic detection of viruses and NSFW content in WARC files. (Tools & Software / Utilities)
README
# Introduction
This is a Python program that scans WARC (web archive) files for viruses and NSFW (not-safe-for-work) content:
- It detects violence/nudity using an AI model,
- It detects viruses using the Linux `clamd` antivirus daemon.
You can run it either in test mode (to check an individual WARC file) or in server mode (for easy integration into existing workflows). Server mode requires that the server can access the WARC files via the file system.

The program accepts both compressed and uncompressed WARC files.
# Installation
Please use Python 3.9+. You can install the requirements as usual:
pip install -r requirements.txt
If you want to use the antivirus feature, you will need to install the `clamd` antivirus daemon. On Ubuntu, you can do so like this:

apt-get install clamav clamav-daemon -y

The first setup of `clamd` requires you to stop, update and start the service:

systemctl stop clamav-freshclam
freshclam
systemctl start clamav-freshclam

# Usage
The tool can be used in two ways:
- test mode: scan a single WARC file on the command line
- server mode: use the REST API to scan WARC files programmatically
Note that on its first run, the application automatically downloads the classifier model to the current user's home folder. This might take a few seconds (or minutes) depending on your connection. You can check the progress in stdout.

## Test mode
You can start the application in test mode from the command line as follows:

python app.py --test-av
python app.py --test-nsfw

The first example above runs the antivirus scan and the second the NSFW classifier.

![test mode](pic.png)
## Server mode
You can start the application as a server like so:
python app.py --server
The application in server mode exposes the following endpoints:
- `test_nsfw`: tests only for NSFW material,
- `test_antivirus`: tests only for viruses,
- `test_all`: tests for both of the above.

All these endpoints are POST and take a single argument, `file_path`, which is the absolute path to the WARC that you want to analyze (it can be compressed or uncompressed).
Here is an example request with `curl`:
curl -X POST -H "Content-Type: application/json" -d '{"file_path": "/my/path/my.warc.gz"}' localhost:8123/test_all
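The same request can be sketched in Python using only the standard library. This is a minimal illustration, not part of warc-safe: the helper names `build_scan_request` and `scan_warc` are hypothetical, and the host, port and JSON body are taken from the `curl` example above. The timeout value is an arbitrary assumption.

```python
import json
import urllib.request

# Assumption: the server listens on localhost:8123, as in the curl example.
BASE_URL = "http://localhost:8123"
ENDPOINTS = ("test_nsfw", "test_antivirus", "test_all")


def build_scan_request(file_path, endpoint="test_all", base_url=BASE_URL):
    """Build (but do not send) a POST request for one of the scan endpoints."""
    if endpoint not in ENDPOINTS:
        raise ValueError(f"unknown endpoint: {endpoint}")
    # The only argument is the absolute path to the WARC, sent as JSON.
    body = json.dumps({"file_path": file_path}).encode("utf-8")
    return urllib.request.Request(
        f"{base_url}/{endpoint}",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def scan_warc(file_path, endpoint="test_all", base_url=BASE_URL, timeout=600):
    """Send the request and return the decoded JSON response as a dict."""
    request = build_scan_request(file_path, endpoint, base_url)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.load(response)
```

With the server running, `scan_warc("/my/path/my.warc.gz")` would issue the same call as the `curl` command. Building the request separately from sending it makes the payload easy to inspect without a live server.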
## Return values
All endpoints return JSON. The root element is `results`, which maps each WARC record to its filter results. Each entry is keyed by the record's `WARC-Record-ID`. Here is an example:
````
{
  "results": {
    "": {
      "av_details": null,
      "av_res": "OK",
      "filename": "picture.jpg",
      "mime": "image/jpeg",
      "nsfw_res": "SFW",
      "nsfw_score": 0.35693745957662754
    },
    ...
  }
}
````

The fields available for each record are the following:
- File name: `filename`,
- Mime type: `mime`,
- Antivirus: `av_details` and `av_res`,
- NSFW: `nsfw_res` and `nsfw_score`,
- Errors: `err`.

## NSFW scoring
The `nsfw_score` is a floating-point value between 0 (not NSFW at all) and 1 (certainly NSFW). The `nsfw_res` field, by contrast, holds the classifier's verdict as a label: either `NSFW` or `SFW`.
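A response can be post-processed to pick out the records that need attention. The sketch below is an illustration, not part of warc-safe: `flag_records` and its `score_threshold` parameter are hypothetical. It also assumes that any `av_res` value other than `OK` indicates an antivirus finding, which the example response does not confirm.

```python
def flag_records(response, score_threshold=None):
    """Return {record_id: [reasons]} for records that need attention.

    score_threshold is an optional, caller-chosen cut-off applied to
    nsfw_score in addition to the nsfw_res label.
    """
    flagged = {}
    for record_id, record in response.get("results", {}).items():
        reasons = []
        # Assumption: any av_res other than "OK" signals an antivirus finding.
        if record.get("av_res") not in (None, "OK"):
            reasons.append("antivirus")
        # The label is the classifier's own verdict.
        if record.get("nsfw_res") == "NSFW":
            reasons.append("nsfw_label")
        # Optionally apply a stricter (or looser) cut-off on the raw score.
        score = record.get("nsfw_score")
        if score_threshold is not None and score is not None:
            if score >= score_threshold:
                reasons.append("nsfw_score")
        # A non-empty err field means the record could not be fully checked.
        if record.get("err"):
            reasons.append("error")
        if reasons:
            flagged[record_id] = reasons
    return flagged
```

Applied to the example response above, `flag_records(response)` returns an empty dict, since the record is `OK` and `SFW`. Passing `score_threshold=0.3` would flag the same record by score alone, which can be useful when a more conservative cut-off than the built-in label is wanted.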
## Updating your antivirus database
From time to time it might make sense to update your `clamav` signature database. You can do so by running:

freshclam

You might also want to restart the service with:

systemctl restart clamav-freshclam