https://github.com/soumyajit4419/ai_for_social_good
Using natural language processing to analyze the sentiments of people and detect suicidal ideation on online social content.
lstm natural-language-processing random-forest tfidf-vectorizer web-scraping
Last synced: 2 months ago
- Host: GitHub
- URL: https://github.com/soumyajit4419/ai_for_social_good
- Owner: soumyajit4419
- License: mit
- Created: 2020-05-18T04:25:05.000Z (over 4 years ago)
- Default Branch: master
- Last Pushed: 2021-03-12T06:23:07.000Z (almost 4 years ago)
- Last Synced: 2024-10-03T12:44:43.649Z (3 months ago)
- Topics: lstm, natural-language-processing, random-forest, tfidf-vectorizer, web-scraping
- Language: Jupyter Notebook
- Homepage:
- Size: 69.8 MB
- Stars: 35
- Watchers: 4
- Forks: 12
- Open Issues: 0
Metadata Files:
- Readme: README.md
- License: LICENSE
# AI For Social Good
## Suicidal Ideation Detection In Online Social Content
---
## Getting Started
The rise of social media and online communities has created safe and anonymous spaces where individuals share thoughts about their mental health and express their feelings and suffering. To help prevent suicide, suicide-related posts and users' suicidal ideation need to be detected with natural language processing methods. I focused on the online community Reddit and the social networking website Twitter, and classified users' posts as showing potential suicide risk or no suicide risk using text feature processing, machine learning, and deep learning methods.
## Datasets
Two datasets were collected, one from Reddit and one from Twitter. The Reddit dataset contains 2958 suicidal ideation samples and 5381 non-suicidal texts. The Twitter dataset contains a total of 3000 tweets with suicidal ideation.
Reddit data was scraped from subreddits such as 'suicide watch', 'depression', and 'anxiety'. Twitter data was collected by querying keywords such as 'end my life' and 'die'.

**The Twitter word cloud (left) and Reddit word cloud (right) are shown as follows:**
## Feature Processing and Training
- Cleaned the text, removed corpus-specific stopwords, and plotted word clouds to visualize the most frequent words in each corpus.
- Vectorized the text using both Bag of Words and the TF-IDF vectorizer.
- Used grid search with cross-validation to find the best parameters for a Random Forest classifier, achieving an accuracy of 96%.
- Trained a multilayer bidirectional LSTM with GloVe embeddings, reaching an accuracy of 97%.

## Results
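The Random Forest + TF-IDF approach and its metrics (accuracy, precision, recall, F1) can be sketched with scikit-learn roughly as follows. This is a minimal illustration and not the repository's actual code: the toy texts, the parameter grid, the two-fold cross-validation, and the weighted averaging of the metrics are all assumptions made for the example.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import accuracy_score, precision_recall_fscore_support
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline

# Toy placeholder corpus (hypothetical; NOT the repository's data).
# Label 1 = text flagged as at risk, 0 = no risk.
texts = [
    "i feel hopeless and want to give up",
    "nothing matters anymore and i feel alone",
    "i cannot go on feeling this hopeless",
    "i feel empty and alone every day",
    "had a great day at the park with friends",
    "looking forward to the weekend trip",
    "the new recipe turned out great",
    "enjoyed a long walk with friends today",
]
labels = [1, 1, 1, 1, 0, 0, 0, 0]

# TF-IDF features feed directly into the Random Forest classifier.
pipeline = Pipeline([
    ("tfidf", TfidfVectorizer(stop_words="english")),
    ("rf", RandomForestClassifier(random_state=0)),
])

# Hypothetical search space; the README does not list the grid that was used.
param_grid = {
    "tfidf__ngram_range": [(1, 1), (1, 2)],
    "rf__n_estimators": [10, 50],
    "rf__max_depth": [None, 5],
}

# Two folds only because the toy corpus is tiny.
search = GridSearchCV(pipeline, param_grid, cv=2, scoring="accuracy")
search.fit(texts, labels)

# Scored on the training texts purely to show the metric calls;
# a real evaluation needs a held-out test split.
predictions = search.predict(texts)
accuracy = accuracy_score(labels, predictions)
precision, recall, f1, _ = precision_recall_fscore_support(
    labels, predictions, average="weighted", zero_division=0
)

print(search.best_params_)
print(f"acc={accuracy:.2f} pre={precision:.2f} rec={recall:.2f} f1={f1:.2f}")
```

Wrapping the vectorizer and the classifier in one `Pipeline` means the TF-IDF vocabulary is refit inside each cross-validation fold, so the grid search does not leak vocabulary from the validation texts.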
Results of the different methods applied:

| Model        | Accuracy | Precision | Recall | F1   |
| ------------ | -------- | --------- | ------ | ---- |
| RF + TF-IDF  | 0.96     | 0.96      | 0.96   | 0.96 |
| LSTM + GloVe | 0.97     | 0.97      | 0.97   | 0.97 |

## Usage
- `Dataset`: All the collected and cleaned datasets
- `Data_Collection`: Code for scraping data from Reddit and Twitter
- `Src`: Source code for text preprocessing and building the ML models
- `Pretrained_Models`: The pretrained models and tokenizers
- `Flask`: Code for the server and model deployment

### To run the server:
- `cd Flask`
- `python app.py`

## License
Distributed under the MIT License. See `LICENSE` for more information.