https://github.com/tushar2704/twitter-toxicity-classification
This project aims to solve the problem of identifying the toxicity of tweets using Natural Language Processing (NLP) techniques. Given the dataset `twitter_parsed_tweets`, I will implement a step-by-step NLP approach to classify tweets into binary values based on their toxicity.
https://github.com/tushar2704/twitter-toxicity-classification
artificial-intelligence data-science nlp nlp-machine-learning text-classification
Last synced: 2 months ago
JSON representation
This project aims to solve the problem of identifying the toxicity of tweets using Natural Language Processing (NLP) techniques. Given the dataset `twitter_parsed_tweets`, I will implement a step-by-step NLP approach to classify tweets into binary values based on their toxicity.
- Host: GitHub
- URL: https://github.com/tushar2704/twitter-toxicity-classification
- Owner: tushar2704
- License: mit
- Created: 2023-07-05T17:17:12.000Z (almost 2 years ago)
- Default Branch: main
- Last Pushed: 2023-07-11T22:18:45.000Z (almost 2 years ago)
- Last Synced: 2024-12-27T08:16:36.491Z (4 months ago)
- Topics: artificial-intelligence, data-science, nlp, nlp-machine-learning, text-classification
- Language: Jupyter Notebook
- Homepage: https://tushar-aggarwal.com
- Size: 1.91 MB
- Stars: 1
- Watchers: 1
- Forks: 0
- Open Issues: 0
-
Metadata Files:
- Readme: README.md
- License: LICENSE
Awesome Lists containing this project
README
# Twitter Toxicity Classification
This project aims to solve the problem of identifying the toxicity of tweets using Natural Language Processing (NLP) techniques. Given the dataset `twitter_parsed_tweets`, I will implement a step-by-step NLP approach to classify tweets into binary values based on their toxicity.
## Problem Statement
In this project, we will tackle the task of toxicity classification for tweets. The main objective is to develop a model that can accurately identify toxic tweets and assign them a binary value indicating their toxicity level. The classification will enable us to flag potentially harmful content on social media platforms and take appropriate actions.
## Dataset
The dataset `twitter_parsed_tweets` contains a collection of tweets along with their corresponding annotations. It consists of the following columns:
- `index`: Index of the tweet
- `id`: Unique identifier for each tweet
- `Text`: The text content of the tweet
- `Annotation`: Annotation indicating the toxicity level of the tweet
- `oh_label`: Binary label indicating the toxicity (1 for toxic, 0 for non-toxic)## Approach
The NLP approach to toxicity classification involves several key steps:
1. **Data Cleaning**: We will perform data cleaning techniques to handle missing values, remove duplicates, and preprocess the text data for further analysis.
2. **Text Preprocessing**: In this step, we will apply techniques such as tokenization, removing stopwords, and normalizing the text to prepare it for feature extraction.
3. **Feature Extraction**: We will extract relevant features from the preprocessed text data. This may involve methods such as creating a bag-of-words representation, using TF-IDF, or utilizing word embeddings.
4. **Model Training and Evaluation**: We will train a machine learning model, such as Naive Bayes, Logistic Regression, or a deep learning model like LSTM or BERT, on the extracted features. The model will be evaluated using appropriate evaluation metrics and techniques like cross-validation.
5. **Model Deployment**: Once a satisfactory model is obtained, it can be deployed to classify new, unseen tweets and provide toxicity predictions.
## Installation
To run this project locally, please follow these steps:
1. Clone this repository:
```
git clone https://github.com/tushar2704/twitter-toxicity-classification.git
```2. Install the required dependencies:
```
pip install -r requirements.txt
```3. Run the main script to perform data cleaning, preprocessing, feature extraction, and model training:
```
python main.py
```4. Once the model is trained, you can use it to classify new tweets by calling the relevant functions or scripts.
## Contributing
Contributions to this project are welcome. If you have any suggestions or would like to contribute to improving the toxicity classification model, feel free to open an issue or submit a pull request.
## License
This project is licensed under the [MIT License](LICENSE). You are free to use, modify, and distribute the code for academic, commercial, or personal purposes.
## Acknowledgments
We would like to express our gratitude to the creators of the `twitter_parsed_tweets` dataset for providing the data for this project. Their efforts enable us to tackle the important task of identifying toxic tweets and promote a safer online environment.
## Contact
For any inquiries or further information about this project, please contact [Tushar Aggarwal](mailto:[email protected]).
Let's work together to combat toxicity on Twitter and create a more positive and inclusive social media environment!