https://github.com/weaviate-tutorials/demo-classification-toxic-comment


Host: GitHub
URL: https://github.com/weaviate-tutorials/demo-classification-toxic-comment
Owner: weaviate-tutorials
Created: 2023-09-26T11:56:08.000Z (almost 3 years ago)
Default Branch: main
Last Pushed: 2023-11-21T09:14:40.000Z (over 2 years ago)
Last Synced: 2025-02-03T18:11:25.024Z (over 1 year ago)
Topics: hacktoberfest
Language: Python
Homepage:
Size: 15.2 MB
Stars: 2
Watchers: 1
Forks: 1
Open Issues: 0
Metadata Files:
- Readme: README.md


# Toxic Comment Classifier

This project originates from the [weaviate-examples repository](https://github.com/weaviate/weaviate-examples/tree/main/weaviate-toxic-comment-classifier).

## Overview
![Demo](demo.gif)

This project uses semantic search to classify comments.
First, the Toxic Comment Classification dataset is indexed in Weaviate.
The dataset has two columns: a comment and a binary label indicating whether the comment is toxic.
When a user enters a comment, the app runs a semantic search
and displays the label of the indexed comment that is most similar to the entered one.
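
The lookup described above can be sketched in plain Python. This is only an illustration of the idea, not the code in `app.py`: the vectors are made-up 3-dimensional stand-ins for real embeddings, the label strings are hypothetical, and cosine similarity is assumed as the similarity measure.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # treat a zero vector as dissimilar to everything
    return dot / (norm_a * norm_b)


def classify_by_nearest(query_vector, indexed):
    """Return (comment, label) of the indexed entry most similar to the query.

    `indexed` is a list of (vector, comment, label) tuples.
    """
    best = max(indexed, key=lambda item: cosine_similarity(query_vector, item[0]))
    return best[1], best[2]


# Toy data: vectors and labels are invented for illustration only.
INDEXED = [
    ([0.9, 0.1, 0.0], "You are a complete fool", "toxic"),
    ([0.0, 0.2, 0.9], "Thanks for the helpful edit", "non-toxic"),
]

comment, label = classify_by_nearest([0.8, 0.2, 0.1], INDEXED)
print(comment, "->", label)
```

In the demo itself, Weaviate stores the vectors and performs the similarity search, so none of this arithmetic has to be written by hand.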

### Contextual classification
This demo relies on a technique called contextual classification. It makes predictions about
cross-references based on context, without relying on pre-existing training data.
When you need to assess the similarity between a source item and a potential target item,
contextual classification is a good choice, particularly when your data has
a strong semantic connection (for instance, 'The Iconic Statue of Liberty' and 'The Vibrant New York City').

## Technology stack
- Python
- Weaviate
- Streamlit

### Used Weaviate modules/models

[text2vec-contextionary (Contextionary) vectorizer](https://weaviate.io/developers/weaviate/modules/retriever-vectorizer-modules/text2vec-contextionary)
This vectorizer module is used to build 300-dimensional vectors using a Weighted Mean of Word Embeddings (WMOWE) technique.
Contextionary is Weaviate's own language vectorizer that is trained using [fastText](https://fasttext.cc/)
on Wiki and CommonCrawl data.
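
To make the weighted-mean idea concrete, here is a minimal sketch. It assumes a tiny hypothetical vocabulary with 2-dimensional vectors and hand-picked weights; the real module produces 300-dimensional vectors, and how Contextionary actually chooses its weights is not covered here.

```python
def weighted_mean_embedding(words, embeddings, weights):
    """Combine word vectors into one vector using a weighted mean.

    `embeddings` maps word -> vector, `weights` maps word -> float.
    Words missing from `embeddings` are skipped; words missing from
    `weights` default to a weight of 1.0.
    """
    dim = len(next(iter(embeddings.values())))
    total = [0.0] * dim
    weight_sum = 0.0
    for word in words:
        if word not in embeddings:
            continue  # out-of-vocabulary word contributes nothing
        w = weights.get(word, 1.0)
        for i, value in enumerate(embeddings[word]):
            total[i] += w * value
        weight_sum += w
    if weight_sum == 0:
        return total  # no known words: return the zero vector
    return [t / weight_sum for t in total]


# Hypothetical toy vocabulary and weights, for illustration only.
EMBEDDINGS = {"the": [1.0, 0.0], "statue": [0.0, 1.0]}
WEIGHTS = {"the": 1.0, "statue": 3.0}

print(weighted_mean_embedding(["the", "statue"], EMBEDDINGS, WEIGHTS))
```

Giving a rarer word such as "statue" a larger weight than a common word such as "the" pulls the resulting vector toward the more informative word.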

To use it, you need to enable it in the [Docker Compose file](docker-compose.yml).
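
One way to confirm that the module is enabled once Weaviate is running is to inspect the instance's `/v1/meta` endpoint, which lists loaded modules. The sketch below assumes Weaviate is reachable at `http://localhost:8080` and that the response contains a `modules` object keyed by module name; the helper names are hypothetical and not part of this repository.

```python
import json
from urllib.request import urlopen


def fetch_meta(base_url="http://localhost:8080"):
    """Fetch the metadata document of a running Weaviate instance."""
    with urlopen(f"{base_url}/v1/meta", timeout=5) as response:
        return json.load(response)


def module_enabled(meta, name="text2vec-contextionary"):
    """Return True if `name` appears among the modules listed in `meta`."""
    return name in (meta.get("modules") or {})


# Usage (requires a running instance):
#   print(module_enabled(fetch_meta()))
```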

## Prerequisites
1. Python 3 interpreter installed
1. Ability to execute `docker compose`
   (the most straightforward way to do this on Windows/Mac is to install
   [Docker Desktop](https://www.docker.com/products/docker-desktop/))

## Setup instructions

### Start up
1. Clone the repository
1. Create a virtual environment and activate it
```shell
python3 -m venv venv
source venv/bin/activate
```
1. Install all required dependencies
```shell
pip install -r requirements.txt
```
1. Run a containerized instance of Weaviate. It also includes the vectorizer module that computes the embeddings.

> **Note**: Make sure nothing else is occupying port 8080.
> If something is, either stop that process or change the port that Weaviate uses.
```shell
docker compose up
```
1. Index the dataset in Weaviate
```shell
python add_data.py
```
1. Run the Streamlit demo
```shell
streamlit run app.py
```
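
Because indexing only works once Weaviate has finished starting, it can help to wait for the instance before running `add_data.py`. The following is a minimal sketch, assuming the default address `http://localhost:8080` and Weaviate's `/v1/.well-known/ready` readiness endpoint; the function names are hypothetical and not part of this repository.

```python
import time
from urllib.request import urlopen


def is_ready(base_url="http://localhost:8080"):
    """Return True if the readiness endpoint answers with HTTP 200."""
    try:
        with urlopen(f"{base_url}/v1/.well-known/ready", timeout=2) as response:
            return response.status == 200
    except OSError:
        # Covers connection refused, timeouts, and HTTP error responses.
        return False


def wait_until_ready(probe=is_ready, attempts=30, delay=1.0):
    """Call `probe` up to `attempts` times, sleeping `delay` seconds between tries."""
    for _ in range(attempts):
        if probe():
            return True
        time.sleep(delay)
    return False


# Usage (requires `docker compose up` to be running):
#   if wait_until_ready():
#       print("Weaviate is ready; run add_data.py next")
```

If you changed the port Weaviate listens on, pass the matching `base_url` to `is_ready`.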

### Run integration test
```shell
./test.sh
```

### Shut down
1. Both the Streamlit app and Docker Compose can be stopped with `Ctrl+C` in the corresponding terminal window.
2. To remove the created Docker containers and volumes, run:
```shell
docker compose down -v
```

## Usage instructions

1. Enter a comment
2. Press the `Classify` button to see whether the comment is classified as toxic.

## Dataset license

The dataset used for this example is available on Kaggle:
https://www.kaggle.com/datasets/akashsuper2000/toxic-comment-classification
