https://github.com/anjmittu/clpsych2021-shared-task-baseline
The baseline code for the CLPsych 2021 Shared Task.
baseline-model nlp nlp-machine-learning suicide suicide-risk-model tweets
- Host: GitHub
- URL: https://github.com/anjmittu/clpsych2021-shared-task-baseline
- Owner: anjmittu
- License: mit
- Created: 2021-01-25T21:20:35.000Z (over 5 years ago)
- Default Branch: master
- Last Pushed: 2021-05-05T15:11:51.000Z (about 5 years ago)
- Last Synced: 2025-03-27T10:38:32.118Z (about 1 year ago)
- Topics: baseline-model, nlp, nlp-machine-learning, suicide, suicide-risk-model, tweets
- Language: Python
- Homepage: https://github.com/seanmacavaney/clpsych2021-shared-task
- Size: 19.5 KB
- Stars: 2
- Watchers: 2
- Forks: 3
- Open Issues: 0
Metadata Files:
- Readme: README.md
- License: LICENSE
# Suicide Risk Model
This project contains the code for a suicide risk model based on a person's tweets.
When running on the NORC Enclave, see [README_ENCLAVE.md](README_ENCLAVE.md).
## Overview
### Directory Structure
```
- risk_model
    - baseline_model.py :- Runs the baseline model and creates a results output
    - evaluation.py :- Evaluates the results from the model
    - print_data_statistics.py :- Prints some stats about the data
    - tokenize_data.py :- Performs the preprocessing on the data
```
## How to run
### Set up
Python 3.6+ is required. Install the required libraries:
```
pip install -r requirements.txt
python -m nltk.downloader words
```
### Running
All of the scripts are run with Python.
#### tokenize_data.py
This script preprocesses the data. The following steps are performed:
- URLs are removed from the tweets and the tweets are converted to lowercase
- The tweets are tokenized using twikenizer
- User mentions and emojis are removed from the tweets
- Hashtags are split into separate words using three methods. The first one that works is used:
    - Split by camel case
    - Split on underscores
    - Smallest split into real words
- Stop words are removed
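The hashtag step above can be sketched as follows. This is a minimal illustration and not the code in `tokenize_data.py`: the function names and the tiny `VOCAB` set are hypothetical, and "smallest split into real words" is read here as the segmentation that uses the fewest dictionary words, which is an assumption. The real script presumably consults the NLTK `words` corpus installed during set up.

```python
import re

# Hypothetical stand-in vocabulary, used only to keep this sketch self-contained.
VOCAB = {"mental", "health", "men", "tal", "stay", "strong"}


def split_camel_case(tag):
    """Split 'MentalHealth' into ['Mental', 'Health']; None if it does not apply."""
    parts = re.findall(r"[A-Z]+(?![a-z])|[A-Z]?[a-z]+|\d+", tag)
    # Only accept the split if it covers the whole tag and yields several parts.
    if len(parts) > 1 and "".join(parts) == tag:
        return parts
    return None


def split_underscores(tag):
    """Split 'stay_strong' into ['stay', 'strong']; None if it does not apply."""
    parts = [p for p in tag.split("_") if p]
    return parts if len(parts) > 1 else None


def split_into_words(tag, vocab=VOCAB):
    """Segment the tag into the fewest vocabulary words (assumed meaning)."""
    tag = tag.lower()
    # best[i] holds the shortest list of vocabulary words covering tag[:i].
    best = [None] * (len(tag) + 1)
    best[0] = []
    for end in range(1, len(tag) + 1):
        for start in range(end):
            piece = tag[start:end]
            if best[start] is not None and piece in vocab:
                candidate = best[start] + [piece]
                if best[end] is None or len(candidate) < len(best[end]):
                    best[end] = candidate
    return best[len(tag)]


def split_hashtag(hashtag):
    """Try the three methods in order and keep the first that succeeds."""
    tag = hashtag.lstrip("#")
    for method in (split_camel_case, split_underscores, split_into_words):
        words = method(tag)
        if words:
            return [w.lower() for w in words]
    return [tag.lower()]  # nothing worked: keep the tag as a single token
```

For example, `split_hashtag("#MentalHealth")` is handled by the camel-case rule, `split_hashtag("#stay_strong")` by the underscore rule, and `split_hashtag("#mentalhealth")` falls through to the dictionary-based split.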
Below is an example calling this script when the data resides in a folder `practice_data`:
```
python risk_model/tokenize_data.py --input practice_data --output practice_data
```
#### baseline_model.py
This script builds the baseline model from the data and writes a results file. The
results file is a TSV file with the following form:
```
[USER_ID] \t [LABEL] \t [SCORE]
```
Here `USER_ID` is the ID field from the source file, `LABEL` is either `1` for suicide
or `0` for control, and `SCORE` is a real-valued score output by your system,
where larger numbers indicate the `SUICIDE` class and lower numbers indicate
`CONTROL`.
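As a rough illustration of this format, the sketch below writes and reads such rows with Python's standard `csv` module. The helper names, the example user IDs, the four-decimal score formatting, and the absence of a header row are all assumptions made for the example, not details taken from `baseline_model.py`.

```python
import csv
import io


def write_results(rows, handle):
    """Write (user_id, label, score) rows as tab-separated lines."""
    writer = csv.writer(handle, delimiter="\t", lineterminator="\n")
    for user_id, label, score in rows:
        # LABEL is 1 (suicide) or 0 (control); SCORE is any real number,
        # with larger values pointing toward the suicide class.
        writer.writerow([user_id, int(label), f"{float(score):.4f}"])


def read_results(handle):
    """Read the rows back as (user_id, label, score) tuples."""
    reader = csv.reader(handle, delimiter="\t")
    return [(uid, int(label), float(score)) for uid, label, score in reader]


# Made-up example rows, written to an in-memory buffer instead of a file.
buffer = io.StringIO()
write_results([("user_1", 1, 0.91), ("user_2", 0, 0.127)], buffer)
rows = read_results(io.StringIO(buffer.getvalue()))
```

Writing to a real file works the same way with `open(path, "w", newline="")` in place of the buffer.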
The baseline model is a bag-of-words model. It uses count vectors with unigrams and bigrams and
logistic regression for classification.
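To make "count vectors with unigrams and bigrams" concrete, here is a small standard-library sketch of the feature extraction only. The function names and example tokens are hypothetical, and the classifier is not shown. In practice this is commonly done with scikit-learn's `CountVectorizer(ngram_range=(1, 2))` feeding a `LogisticRegression`, though whether the baseline script is written that way is an assumption.

```python
from collections import Counter


def ngram_counts(tokens, max_n=2):
    """Count every n-gram of length 1..max_n in a token list."""
    counts = Counter()
    for n in range(1, max_n + 1):
        for i in range(len(tokens) - n + 1):
            counts[" ".join(tokens[i:i + n])] += 1
    return counts


def to_count_vector(counts, vocabulary):
    """Turn n-gram counts into a fixed-order vector over a vocabulary."""
    return [counts.get(term, 0) for term in vocabulary]


# Made-up tokens standing in for one user's preprocessed tweets.
tokens = ["feel", "tired", "feel", "alone"]
counts = ngram_counts(tokens)
vocabulary = sorted(counts)  # fixed feature order shared by all users
vector = to_count_vector(counts, vocabulary)
```

Each user then becomes one such vector, and the classifier learns a weight per unigram or bigram.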
Below is an example calling this script when the data resides in a folder `practice_data`:
```
python risk_model/baseline_model.py --input practice_data --output practice_data
```
#### evaluation.py
This script reads the results created by `baseline_model.py` and scores them against the
ground truth. The script outputs five comma-separated values.
Below is an example calling this script when the data resides in a folder `practice_data`:
```
python risk_model/evaluation.py --results practice_data/results.tsv --truth practice_data/test_truths.jsonl
```