Ecosyste.ms: Awesome
An open API service indexing awesome lists of open source software.
https://github.com/v0idzdev/mbti
Myers-Briggs type indicator predictor using a machine learning model.
- Host: GitHub
- URL: https://github.com/v0idzdev/mbti
- Owner: v0idzdev
- License: apache-2.0
- Created: 2022-02-19T21:10:22.000Z (almost 3 years ago)
- Default Branch: main
- Last Pushed: 2022-02-22T17:30:17.000Z (almost 3 years ago)
- Last Synced: 2024-11-15T03:41:45.656Z (about 1 month ago)
- Language: Python
- Homepage:
- Size: 23.5 MB
- Stars: 3
- Watchers: 1
- Forks: 0
- Open Issues: 0
Metadata Files:
- Readme: README.md
- License: LICENSE
# MBTI Predictor
Predicting someone's MBTI type based on their online posts, using AI.

### Approach
I originally attempted a simple LSTM built with nltk and Keras. However, this approach led to the model badly overfitting. I then opted for a bidirectional LSTM and preprocessed the data differently, with techniques such as lemmatization. This led to significantly slower learning: the model improved less and less per epoch. However, the validation accuracy did increase each epoch, as opposed to decreasing or remaining static.
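A minimal sketch of the bidirectional LSTM setup described above, in Keras. The vocabulary size, embedding width, sequence encoding, and the 16 MBTI classes are assumptions for illustration, not values taken from this repository; the posts are assumed to be already lemmatized and encoded as integer sequences.

```python
# Sketch of a small bidirectional LSTM text classifier (assumed hyperparameters).
from tensorflow.keras import layers, models

def build_model(vocab_size=20000, n_classes=16):
    model = models.Sequential([
        layers.Embedding(vocab_size, 128),       # integer tokens -> dense vectors
        layers.Bidirectional(layers.LSTM(64)),   # reads each post in both directions
        layers.Dropout(0.5),                     # regularization against overfitting
        layers.Dense(n_classes, activation="softmax"),  # one unit per MBTI type
    ])
    model.compile(optimizer="adam",
                  loss="sparse_categorical_crossentropy",
                  metrics=["accuracy"])
    return model
```

Wrapping the LSTM in `Bidirectional` roughly doubles the recurrent parameters, which is consistent with the slower per-epoch improvement noted above.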
As a last resort, I chose to use the BERT transformer with the AutoTokenizer from `transformers`. In theory, it should have led to significantly better results; however, the large number of parameters meant it couldn't run on my GTX 1050.
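The BERT setup described above might look roughly like the sketch below, using Hugging Face `transformers`. The checkpoint name (`bert-base-uncased`) and the 16-label head are assumptions for illustration, not taken from this repository; loading the checkpoint downloads pretrained weights.

```python
# Hedged sketch of fine-tuning BERT for MBTI classification (assumed checkpoint).
from transformers import AutoTokenizer, TFAutoModelForSequenceClassification

def load_bert(checkpoint="bert-base-uncased", n_classes=16):
    tokenizer = AutoTokenizer.from_pretrained(checkpoint)
    model = TFAutoModelForSequenceClassification.from_pretrained(
        checkpoint, num_labels=n_classes)
    return tokenizer, model

# Usage (requires a network connection and enough GPU memory):
# tokenizer, model = load_bert()
# enc = tokenizer(["sample post"], truncation=True, padding=True,
#                 return_tensors="tf")
# logits = model(**enc).logits  # shape: (batch, n_classes)
```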
Feel free to download the source code if your hardware can accommodate training the BERT model.
### Findings
Using both a standard LSTM and its bidirectional variant, I could not find a significant correlation between a post and the MBTI type of its poster. The dataset was also unbalanced, with the number of posts per MBTI type differing greatly. I initially decided not to rectify this because the training data was already small, at around 9,000 entries. I may revisit this project if I find a more balanced and extensive dataset.
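The class imbalance mentioned above is easy to quantify. A minimal sketch, assuming the dataset is loaded as `(post, mbti_type)` pairs; the sample rows here are invented for illustration:

```python
# Measure how unevenly posts are distributed across MBTI types.
from collections import Counter

def type_distribution(rows):
    """Return each MBTI type's share of the dataset, largest first."""
    counts = Counter(label for _, label in rows)
    total = sum(counts.values())
    return {label: n / total for label, n in counts.most_common()}

# Invented sample rows; the real dataset has ~9,000 entries.
sample = [("post a", "INFP"), ("post b", "INFP"), ("post c", "ENTJ")]
print(type_distribution(sample))  # INFP makes up two thirds of this toy sample
```

A ratio far from 1/16 per type signals the imbalance; options such as class weighting or resampling would shrink the already small dataset further, which is why I left it as-is.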
Feel free to download the source code and try for yourself.