https://github.com/paradite/reddit-post-classifier

A simple classifier for Reddit posts.

Host: GitHub
URL: https://github.com/paradite/reddit-post-classifier
Owner: paradite
Created: 2025-04-03T08:58:46.000Z (11 months ago)
Default Branch: main
Last Pushed: 2025-06-23T09:40:37.000Z (8 months ago)
Last Synced: 2025-06-23T10:37:52.479Z (8 months ago)
Language: Python
Homepage: https://tracker.16x.engineer/
Size: 1000 KB
Stars: 1
Watchers: 1
Forks: 0
Open Issues: 0
Metadata Files:
- Readme: README.md

# Reddit Post Classifier

This project is a simple classifier for Reddit posts. It uses pre-trained models to classify posts as relevant or irrelevant.

Created by [16x Tracker](https://tracker.16x.engineer/)

![screenshot](screenshots/screenshot.png)

## System Requirements

- Minimum 2GB RAM (4GB recommended)

- Docker and Docker Compose installed

- About 1GB disk space for the models and dependencies

## Sample Results

### Apr 5 run

Pre-processing

```
Total entries processed: 9353
Unique entries: 4549
Duplicate entries: 1191
F5Bot filtered entries: 8
Team ID filtered entries (not team 1): 10359
Status breakdown:
RELEVANT/REPLIED: 241
IGNORED: 4246
NEW: 3
CONTENT_REMOVED: 59
```

Model Results

```
              precision    recall  f1-score   support

           0       0.98      0.95      0.96      1437
           1       0.25      0.50      0.33        50

    accuracy                           0.93      1487
   macro avg       0.62      0.72      0.65      1487
weighted avg       0.96      0.93      0.94      1487
```

### Apr 18 run

Pre-processing

```
Total entries processed: 8473
Unique entries: 4258
Duplicate entries: 1174
F5Bot filtered entries: 8
Team ID filtered entries (not team 1): 0
Status breakdown:
RELEVANT/REPLIED: 274
IGNORED: 3828
NEW: 86
CONTENT_REMOVED: 70
```

Model Results

```
              precision    recall  f1-score   support

           0       0.97      0.94      0.95       766
           1       0.40      0.53      0.46        55

    accuracy                           0.92       821
   macro avg       0.68      0.74      0.71       821
weighted avg       0.93      0.92      0.92       821
```

### Apr 19 run

Pre-processing

```
Total entries processed: 8288
Unique entries: 4076
Duplicate entries: 1171
F5Bot filtered entries: 0
Team ID filtered entries (not team 1): 0
Timestamp filtered entries (older than 90 days): 193
Status breakdown:
RELEVANT/REPLIED: 123
IGNORED: 3828
NEW: 86
CONTENT_REMOVED: 39
```

Model Results

distilbert-base-uncased

```
              precision    recall  f1-score   support

           0       0.97      0.98      0.97       766
           1       0.05      0.04      0.05        25

    accuracy                           0.95       791
   macro avg       0.51      0.51      0.51       791
weighted avg       0.94      0.95      0.94       791
```

roberta-base

```
              precision    recall  f1-score   support

           0       0.97      0.98      0.98       766
           1       0.22      0.16      0.19        25

    accuracy                           0.96       791
   macro avg       0.60      0.57      0.58       791
weighted avg       0.95      0.96      0.95       791
```
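The high accuracy in these reports mostly reflects class imbalance, which a quick calculation makes visible. The sketch below assumes class 0 means irrelevant and class 1 means relevant (the reports do not label them), and compares the Apr 19 numbers against a hypothetical baseline that marks every post as irrelevant.

```python
# Support counts copied from the Apr 19 classification reports.
# Assumption: class 0 = irrelevant, class 1 = relevant.
support_class0 = 766
support_class1 = 25
total = support_class0 + support_class1

# Hypothetical baseline: predict class 0 for every post.
baseline_accuracy = support_class0 / total

# The macro average weights both classes equally, so it exposes the weak class.
# These are the rounded roberta-base F1 values from the report.
f1_class0, f1_class1 = 0.98, 0.19
macro_f1 = (f1_class0 + f1_class1) / 2

print(f"always-class-0 accuracy: {baseline_accuracy:.4f}")
print(f"roberta-base macro F1 (from rounded values): {macro_f1:.3f}")
```

Under that assumption the baseline already reaches roughly 0.968 accuracy, so precision, recall and F1 for class 1 are the more informative figures in these runs.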

### Regressor Model

The regressor model is a simple linear regression model built on the pre-trained roberta-base model. It predicts a relevance score for each post.

```
Regressor Test Results Summary:
Total samples tested: 60
Overall accuracy: 80.00%
Irrelevant samples accuracy: 73.33% (22/30)
Relevant samples accuracy: 86.67% (26/30)

Classification Metrics:
Precision: 0.7647
Recall: 0.8667
F1 Score: 0.8125

Regression Metrics:
Mean Squared Error (MSE): 0.4112
R-squared (R²): -0.6447

Confusion Matrix:
True Positives: 26
False Positives: 8
True Negatives: 22
False Negatives: 4
```
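The classification metrics in this summary follow directly from the confusion matrix. The sketch below recomputes them with the standard definitions. The function name `classification_metrics` is illustrative and is not taken from the repository.

```python
def classification_metrics(tp: int, fp: int, tn: int, fn: int) -> dict:
    """Compute standard binary classification metrics from confusion-matrix counts."""
    precision = tp / (tp + fp)   # share of predicted positives that are correct
    recall = tp / (tp + fn)      # share of actual positives that were found
    f1 = 2 * precision * recall / (precision + recall)
    accuracy = (tp + tn) / (tp + fp + tn + fn)
    return {
        "precision": precision,
        "recall": recall,
        "f1": f1,
        "accuracy": accuracy,
    }


# Counts copied from the regressor confusion matrix.
metrics = classification_metrics(tp=26, fp=8, tn=22, fn=4)
for name, value in metrics.items():
    print(f"{name}: {value:.4f}")
```

The results match the reported values: precision 0.7647, recall 0.8667, F1 0.8125 and accuracy 0.8000.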

### URL Regressor Model

The URL regressor model is a simple linear regression model built on the pre-trained roberta-base model. It predicts a relevance score for each post. The post URL is added as a prefix to the post content. The data used is from April 2025.

Model weights: `best_url_regressor_run1_epoch_5.pt`

Optimal threshold: 0.1500
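A threshold like this is typically used to turn the continuous relevance score into a binary decision. The sketch below is an assumption about how that might look, not the repository's code: the helper `is_relevant` is hypothetical, and whether the comparison is inclusive is not stated in the source.

```python
OPTIMAL_THRESHOLD = 0.15  # the reported optimal threshold


def is_relevant(score: float, threshold: float = OPTIMAL_THRESHOLD) -> bool:
    """Map a regression score to a binary label.

    Assumption: scores at or above the threshold count as relevant.
    """
    return score >= threshold


# Made-up scores, for illustration only.
for score in (0.05, 0.15, 0.60):
    print(score, is_relevant(score))
```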

URL prefix logic sample:

```py
output_path = 'sample_url_prefix.txt'
url = 'https://www.google.com'
content = 'This is a test post'

with open(output_path, 'w', encoding='utf-8') as f:
    # Write the URL first, followed by a blank line, when one is present
    if url:
        f.write(f"{url}\n\n")
    # The post content always follows
    f.write(content)
```

Results:

```
================================================================================
REGRESSOR MODEL TEST RESULTS - 2025-05-10 16:37:48
================================================================================
Optimal threshold: 0.1500
================================================================================
TESTING 30 RANDOM IRRELEVANT SAMPLES
================================================================================
================================================================================
SUMMARY
================================================================================
Total samples tested: 60
Overall accuracy: 81.67%
Irrelevant samples accuracy: 73.33% (22/30)
Relevant samples accuracy: 90.00% (27/30)

Classification Metrics:
Precision: 0.7714
Recall: 0.9000
F1 Score: 0.8308

Regression Metrics:
Mean Squared Error (MSE): 0.2690
R-squared (R²): -0.0762

Confusion Matrix:
True Positives: 27
False Positives: 8
True Negatives: 22
False Negatives: 3
```
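A negative R² can look surprising next to accuracy above 80%. R² compares the model's squared error with the error of always predicting the mean label, so it can drop below zero while thresholded predictions are still mostly correct. The sketch below assumes the targets are 0 for irrelevant and 1 for relevant, with 30 samples of each. That assumption is not stated in the source, but it is consistent with the reported numbers.

```python
def r_squared_from_mse(mse: float, labels: list) -> float:
    """R² = 1 - MSE / Var(labels), using the population variance of the labels."""
    mean = sum(labels) / len(labels)
    variance = sum((y - mean) ** 2 for y in labels) / len(labels)
    return 1 - mse / variance


# Assumption: 30 irrelevant samples labelled 0 and 30 relevant samples labelled 1.
labels = [0] * 30 + [1] * 30

r2_url_regressor = r_squared_from_mse(0.2690, labels)  # MSE from the URL regressor
r2_regressor = r_squared_from_mse(0.4112, labels)      # MSE from the plain regressor

print(f"URL regressor R²: {r2_url_regressor:.4f}")
print(f"Regressor R²: {r2_regressor:.4f}")
```

With balanced 0/1 labels the variance is 0.25, so any MSE above 0.25 gives a negative R². The computed values agree with the reported -0.0762 and -0.6447 to within the rounding of the MSE.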

## Running the API Server

### Using Docker Compose (Recommended)

The easiest way to run the service is using Docker Compose. The service will run in a container named `reddit-classifier-api`:

```bash
# Pull the latest changes, rebuild the image, and start the service
git pull && docker compose up --build -d

# View logs
docker compose logs -f

# Stop the service
docker compose down

# View container logs directly (using the container name)
docker logs -f reddit-classifier-api
```

### Using Docker

Build the Docker image:

```bash
docker build -t reddit-post-classifier .
```

Run the container:

```bash
docker run -p 9092:9092 reddit-post-classifier
```
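To confirm that the container is accepting connections on the published port, a plain TCP check is enough and avoids guessing at endpoint paths (those are described in API_DOC.md). The sketch below uses only the Python standard library. The helper name `is_port_open` and the `localhost` default are illustrative assumptions.

```python
import socket


def is_port_open(host: str = "localhost", port: int = 9092, timeout: float = 2.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers connection refused, timeouts, and name resolution failures.
        return False
```

Calling `is_port_open()` after `docker run` should return `True` once the server has finished starting. It only shows that something is listening on the port, not that the models have loaded.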

### Without Docker

Run the API server directly:

```bash
python api-server.py
```

## API Documentation

See [API_DOC.md](API_DOC.md) for more details.
