Ecosyste.ms: Awesome
An open API service indexing awesome lists of open source software.
https://github.com/blakejakopovic/nostr-spam-detection
An experiment in building a machine learning model to label Nostr spam content for filtering and relay rejection.
machine-learning nostr proof-of-concept spam-detection spam-filtering
Last synced: 24 days ago
- Host: GitHub
- URL: https://github.com/blakejakopovic/nostr-spam-detection
- Owner: blakejakopovic
- License: mit
- Created: 2023-02-10T13:55:35.000Z (almost 2 years ago)
- Default Branch: master
- Last Pushed: 2023-02-25T21:49:01.000Z (almost 2 years ago)
- Last Synced: 2024-02-17T05:34:34.811Z (10 months ago)
- Topics: machine-learning, nostr, proof-of-concept, spam-detection, spam-filtering
- Language: Jupyter Notebook
- Homepage:
- Size: 9.34 MB
- Stars: 20
- Watchers: 3
- Forks: 1
- Open Issues: 0
Metadata Files:
- Readme: README.md
- License: LICENSE
Awesome Lists containing this project
- awesome-nostr - nostr-spam-detection - An experiment in building a machine learning model to label Nostr spam content for filtering and relay rejection. (Tools / Client reviews and/or comparisons)
README
# Nostr Spam Detection
An experiment in building a machine learning model to label Nostr spam content for filtering and relay rejection.
## Dataset
The latest dataset is `labelled_nostr_events_20230225000.csv` with 92,000 rows. It contains labelled Nostr event content marked as either `spam` (bad, 13k rows) or `ham` (good, 79k rows). By volume, the data is biased toward Asian-language spam and English-language ham (non-spam); however, it has performed well in testing against recent Nostr events. The dataset was aggregated on a best-effort basis and largely labelled manually. It includes reviewed events flagged by Nostr event kind=1984, which indicates spam and other undesirable content. I've left the datasets raw so that others can normalise them themselves.

One final note: the dataset is likely biased toward the past month of Nostr spam and is therefore less well rounded. You could train your models on another base spam detection set, or add more training data as more labelled Nostr spam becomes available.
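Before training, it can help to confirm the class balance described above. The following is a minimal sketch rather than code from this repository: the column names `content` and `label` are assumptions (check the CSV header), and a tiny in-memory sample stands in for the real file.

```python
import csv
import io
from collections import Counter


def count_labels(handle, label_column="label"):
    """Count rows per label in a CSV file-like object with a header row."""
    reader = csv.DictReader(handle)
    return Counter(row[label_column].strip().lower() for row in reader)


# Tiny in-memory stand-in for the real CSV; the column names are assumed.
sample = io.StringIO(
    "content,label\n"
    "hello nostr,ham\n"
    "buy followers now,spam\n"
    "good morning,ham\n"
)
counts = count_labels(sample)
total = sum(counts.values())
for label, n in counts.most_common():
    print(f"{label}: {n} ({n / total:.1%})")
```

To run it against the real data, pass `open("labelled_nostr_events_20230225000.csv", newline="", encoding="utf-8")` instead of `sample`. With roughly 79k ham rows out of 92k, a model that always predicts `ham` would already score about 86% accuracy, so that is a useful baseline when reading accuracy figures.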
## Results
Using a Naive Bayes classifier with the dataset, we are able to achieve 98.26%+ accuracy. The model is around 22 MB and takes less than a minute to train. To train this model, take a look at `Nostr MultinominalNB Spam Detection.ipynb`.
With more complex language modelling and deep learning using FastAI, we were able to get upwards of 98% accuracy for spam detection. However, the model is substantially larger at around 160 MB and takes significantly longer to train (it can take hours on a laptop CPU).
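To illustrate the idea behind the Naive Bayes approach, here is a small pure-Python sketch of a multinomial Naive Bayes text classifier with add-one smoothing. It is not the notebook's code: the class name `TinyMultinomialNB`, the whitespace tokenizer, and the four training sentences are all made up for illustration, and a real run would use the full dataset and most likely a library implementation.

```python
import math
from collections import Counter, defaultdict


def tokenize(text):
    """Lowercase whitespace tokenizer; real data would need more care."""
    return text.lower().split()


class TinyMultinomialNB:
    """Multinomial Naive Bayes with add-one (Laplace) smoothing."""

    def fit(self, texts, labels):
        self.class_counts = Counter(labels)
        self.word_counts = defaultdict(Counter)
        for text, label in zip(texts, labels):
            self.word_counts[label].update(tokenize(text))
        self.vocab = {w for counts in self.word_counts.values() for w in counts}
        return self

    def log_scores(self, text):
        """Return the unnormalised log posterior for each class."""
        total_docs = sum(self.class_counts.values())
        scores = {}
        for label, doc_count in self.class_counts.items():
            counts = self.word_counts[label]
            denom = sum(counts.values()) + len(self.vocab)
            score = math.log(doc_count / total_docs)  # class prior
            for word in tokenize(text):
                if word in self.vocab:  # ignore words never seen in training
                    score += math.log((counts[word] + 1) / denom)
            scores[label] = score
        return scores

    def predict(self, text):
        scores = self.log_scores(text)
        return max(scores, key=scores.get)


# Made-up training examples, only to show the mechanics.
texts = [
    "free sats click this link now",
    "win free followers click here",
    "good morning nostr friends",
    "enjoying coffee and reading notes this morning",
]
labels = ["spam", "spam", "ham", "ham"]
model = TinyMultinomialNB().fit(texts, labels)
print(model.predict("click for free sats"))
print(model.predict("good morning friends"))
```

The model only stores word counts per class, which is why this kind of classifier trains quickly compared with a deep learning model.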
## Dependencies
* Python
* Jupyter
* FastAI
## Usage
First you will need to build your model. Training can take over an hour on a CPU instead of a GPU. Once you have the dependencies installed, you can run the following to start training your model.
```
git clone https://github.com/blakejakopovic/nostr-spam-detection
cd nostr-spam-detection
jupyter notebook
# The notebooks should run through cleanly, with each code segment being run once.
```
I've included a minimal Python Flask REST API endpoint that loads your model and can be called to get a spam score for an event (or just the event content).
```
# Start the prediction example app
# (use app:app for the FastAI model or app2:app for the Bayes model)
gunicorn --workers 5 --bind localhost:5000 app:app

# To get a label and score for content
curl -X POST 'http://127.0.0.1:5000/test' --header 'Content-Type: application/json' --data-raw '{"content" : "Hello, is this spam or ham?"}'
# {"label": "ham", "score": "0.9913"}

# To get a spam_score directly for content
curl -X POST 'http://127.0.0.1:5000/spam_score' --header 'Content-Type: application/json' --data-raw '{"content" : "Hello, how spammy is this?"}'
# {"spam_score": "0.9913"}
```
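The same `/spam_score` call can be made from Python using only the standard library. This is a rough sketch based on the curl examples above, with several assumptions: the server is running locally on port 5000, a higher `spam_score` means more spam-like content, and `should_reject` with its `0.9` threshold is a hypothetical relay policy, not part of this repository.

```python
import json
import urllib.request

BASE_URL = "http://127.0.0.1:5000"  # matches the gunicorn --bind address above


def build_request(content, endpoint="/spam_score"):
    """Build the same JSON POST that the curl examples send."""
    body = json.dumps({"content": content}).encode("utf-8")
    return urllib.request.Request(
        BASE_URL + endpoint,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def parse_spam_score(raw_body):
    """Extract the score as a float; the API returns it as a string."""
    return float(json.loads(raw_body)["spam_score"])


def should_reject(score, threshold=0.9):
    """Hypothetical relay policy: reject when the score reaches the threshold."""
    return score >= threshold


def fetch_spam_score(content):
    """Send the request to a running server and return the parsed score."""
    with urllib.request.urlopen(build_request(content), timeout=5) as response:
        return parse_spam_score(response.read())
```

With the server running, `should_reject(fetch_spam_score("Hello, how spammy is this?"))` would give a simple accept or reject decision. The threshold is worth tuning against your own traffic, since rejecting at the relay is harder to undo than filtering in a client.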