Ecosyste.ms: Awesome
An open API service indexing awesome lists of open source software.
https://github.com/v0idzdev/mbti
Myers-Briggs type indicator predictor using a machine learning model.
- Host: GitHub
- URL: https://github.com/v0idzdev/mbti
- Owner: v0idzdev
- License: apache-2.0
- Created: 2022-02-19T21:10:22.000Z (almost 3 years ago)
- Default Branch: main
- Last Pushed: 2022-02-22T17:30:17.000Z (almost 3 years ago)
- Last Synced: 2024-11-15T03:41:45.656Z (about 1 month ago)
- Language: Python
- Homepage:
- Size: 23.5 MB
- Stars: 3
- Watchers: 1
- Forks: 0
- Open Issues: 0
Metadata Files:
- Readme: README.md
- License: LICENSE
# MBTI Predictor
Predicting someone's MBTI type based on their online posts, using AI.

### Approach
I originally attempted a simple LSTM built with nltk and Keras. However, this approach led to the model badly overfitting. I then opted for a bidirectional LSTM and preprocessed the data differently, with techniques such as lemmatization. This led to significantly slower learning: the model improved less and less per epoch. However, the validation accuracy did increase each epoch, as opposed to decreasing or remaining static.
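A minimal sketch of the bidirectional LSTM setup described above, in Keras. The vocabulary size, embedding width, sequence encoding, and the 16 MBTI classes are assumptions for illustration, not values taken from this repository; the posts are assumed to be already lemmatized and encoded as integer sequences.

```python
# Sketch of a small bidirectional LSTM text classifier (assumed hyperparameters).
from tensorflow.keras import layers, models

def build_model(vocab_size=20000, n_classes=16):
    model = models.Sequential([
        layers.Embedding(vocab_size, 128),       # integer tokens -> dense vectors
        layers.Bidirectional(layers.LSTM(64)),   # reads each post in both directions
        layers.Dropout(0.5),                     # regularization against overfitting
        layers.Dense(n_classes, activation="softmax"),  # one unit per MBTI type
    ])
    model.compile(optimizer="adam",
                  loss="sparse_categorical_crossentropy",
                  metrics=["accuracy"])
    return model
```

Wrapping the LSTM in `Bidirectional` roughly doubles the recurrent parameters, which is consistent with the slower per-epoch improvement noted above.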
As a last resort, I chose to use the BERT transformer with the AutoTokenizer from `transformers`. In theory, it should have led to significantly better results; however, the large number of parameters meant it couldn't run on my GTX 1050.
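The BERT setup described above might look roughly like the sketch below, using Hugging Face `transformers`. The checkpoint name (`bert-base-uncased`) and the 16-label head are assumptions for illustration, not taken from this repository; loading the checkpoint downloads pretrained weights.

```python
# Hedged sketch of fine-tuning BERT for MBTI classification (assumed checkpoint).
from transformers import AutoTokenizer, TFAutoModelForSequenceClassification

def load_bert(checkpoint="bert-base-uncased", n_classes=16):
    tokenizer = AutoTokenizer.from_pretrained(checkpoint)
    model = TFAutoModelForSequenceClassification.from_pretrained(
        checkpoint, num_labels=n_classes)
    return tokenizer, model

# Usage (requires a network connection and enough GPU memory):
# tokenizer, model = load_bert()
# enc = tokenizer(["sample post"], truncation=True, padding=True,
#                 return_tensors="tf")
# logits = model(**enc).logits  # shape: (batch, n_classes)
```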
Feel free to download the source code if your hardware can accommodate training the BERT model.
### Findings
Using both a standard LSTM and its bidirectional variant, I could not find a significant correlation between a post and the MBTI type of its poster. The dataset was also unbalanced, with the number of posts per MBTI type differing greatly. I initially decided not to rectify this because the training data was already small, at around 9,000 entries. I may revisit this project if I find a more balanced and extensive dataset.
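The class imbalance mentioned above is easy to quantify. A minimal sketch, assuming the dataset is loaded as `(post, mbti_type)` pairs; the sample rows here are invented for illustration:

```python
# Measure how unevenly posts are distributed across MBTI types.
from collections import Counter

def type_distribution(rows):
    """Return each MBTI type's share of the dataset, largest first."""
    counts = Counter(label for _, label in rows)
    total = sum(counts.values())
    return {label: n / total for label, n in counts.most_common()}

# Invented sample rows; the real dataset has ~9,000 entries.
sample = [("post a", "INFP"), ("post b", "INFP"), ("post c", "ENTJ")]
print(type_distribution(sample))  # INFP makes up two thirds of this toy sample
```

A ratio far from 1/16 per type signals the imbalance; options such as class weighting or resampling would shrink the already small dataset further, which is why I left it as-is.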
Feel free to download the source code and try for yourself.