https://github.com/soum-io/gpa_predictor
categorical-features deep-learning embeddings gpa-data machine-learning prediction semester uiuc
- Host: GitHub
- URL: https://github.com/soum-io/gpa_predictor
- Owner: soum-io
- Created: 2018-05-08T06:27:54.000Z (over 7 years ago)
- Default Branch: master
- Last Pushed: 2018-07-19T02:16:26.000Z (about 7 years ago)
- Last Synced: 2025-06-05T04:13:59.347Z (4 months ago)
- Topics: categorical-features, deep-learning, embeddings, gpa-data, machine-learning, prediction, semester, uiuc
- Language: Python
- Size: 2.02 MB
- Stars: 1
- Watchers: 0
- Forks: 0
- Open Issues: 0
Metadata Files:
- Readme: README.md
README
# UIUC Course GPA Predictor
### Created By: _Michael Shea_
## Goal
The goal of this project is to predict the average GPAs of current and future courses at UIUC using previous GPA data and machine learning.
## Desired Outcome
I hope that students will be able to use these predictions to help decide which classes to take in an upcoming semester, much as they already do with [previous GPA data visualizations](http://waf.cs.illinois.edu/discovery/gpa_of_every_course_at_illinois/). This will require the predicted data to be visualized as well, which will be worked on once the model and predictions are completed.
## How to Reproduce My Results
1. Fork the project. Make sure all Python libraries that are imported in the code are installed on your local machine. Node.js is needed to run the JavaScript files.
2. Open and run _data_cleanup.py_. It should create a file called _filteredComplete.csv_. This file contains all the training and testing data from previous semesters for which data is available.
3. Open and run _GetMajorData.js_, which is located in the _future courses_ folder. This should create the file _MajorData.json_, which holds the JSON data of a response from [the Course Explorer API](https://courses.illinois.edu/cisdocs/explorer) containing info on all the majors for which courses will be offered. This data is needed for the next step.
4. Open and run _GetCoursesByMajor.js_, which is located in the _future courses_ folder. This will take a few minutes to run. Sometimes the server will time out and you will get an error. Rerun the script until it completes successfully. For each major found in _MajorData.json_, it saves all of the course info for that major for the semester specified in the code. This data is stored in JSON format in the folder _MajorsData_.
5. Open and run _remove_bad_majors.py_, which is located in the _future courses_ folder. Some of the majors will have no data at all, which would cause an error in the next step, so this script deletes their files.
6. Open and run _FutureCourses.js_, which is located in the _future courses_ folder. This will create a file called _course_teacher.csv_. This file contains all of the course data for the semester specified in the code, in the same format as the data created in step 2.
7. Open and run _NextYearData.py_. This will create a file called _course_teacher_full.csv_. This is the same as _course_teacher.csv_, except that the teachers' names are in the correct format. Unknown teachers will have '_-1_' as their value instead.
8. Now it is time to create a model using _classifier.py_. This uses the [fastai](https://github.com/fastai/fastai) library to train a deep neural network with three hidden layers of sizes 1000, 1000, and 500 and dropout rates of .001, .01, and .02. The tanh activation function is used at each layer except the last one, which uses softmax. These settings produced the best results in my experiments. I used a technique called _categorical embeddings_ on the input features. Traditionally, one-hot encodings (OHE) are used for categorical features such as the ones used here. But OHE cannot capture relationships between categories, because every one-hot vector is all zeros except for a single entry. With embeddings, inspired by word embeddings in NLP, the relationships between the categories are learned by the model. In effect, the first layer of the neural network is an embedding matrix. Running this script creates two files:
   - _viz_prediction.csv_, which contains three columns. The first column holds the GPAs from the validation set, which is defined in the code to be the last semester on record. The second column holds the predicted GPAs, and the third column is the difference between the first two.
   - _{Semester}{Year}Predictions.csv_, which is in the same format as _filteredComplete.csv_. Here, the data is for the future semester specified in the code, and the GPA column contains the predictions.
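The cleanup in step 5 can be pictured with a short sketch. This is not the code in _remove_bad_majors.py_. It assumes that a "bad" major file is one that is empty, is not valid JSON, or parses to an empty value, and the function names `is_empty_major_file` and `remove_bad_majors` are hypothetical.

```python
import json
from pathlib import Path


def is_empty_major_file(path):
    """Return True if the file holds no usable data.

    Assumed criterion (not taken from the repository): the file is blank,
    is not valid JSON, or parses to an empty object, empty list, or null.
    """
    text = path.read_text(encoding="utf-8").strip()
    if not text:
        return True
    try:
        data = json.loads(text)
    except json.JSONDecodeError:
        return True
    return data is None or data == {} or data == []


def remove_bad_majors(folder):
    """Delete every *.json file in `folder` that has no data.

    Returns the sorted names of the deleted files.
    """
    removed = []
    for path in sorted(Path(folder).glob("*.json")):
        if is_empty_major_file(path):
            path.unlink()
            removed.append(path.name)
    return removed
```

Calling `remove_bad_majors("MajorsData")` would then leave only the files that the next script can parse, under the assumptions above.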
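To make the embedding idea in step 8 concrete, here is a minimal standard-library sketch. It is not the fastai code in _classifier.py_. The instructor names, the embedding width of 3, and the random initial weights are all illustrative assumptions. It shows that looking up a row of the embedding matrix gives the same result as multiplying a one-hot vector by that matrix, which is why the embedding can be viewed as the first layer of the network.

```python
import random


def build_vocab(values):
    """Map each distinct category to an integer index, in order of first appearance."""
    vocab = {}
    for value in values:
        vocab.setdefault(value, len(vocab))
    return vocab


def one_hot(index, n_categories):
    """Return a vector of zeros with a single 1.0 at `index`."""
    return [1.0 if i == index else 0.0 for i in range(n_categories)]


def vec_times_matrix(vec, matrix):
    """Multiply a row vector by a matrix stored as a list of rows."""
    n_cols = len(matrix[0])
    return [sum(vec[r] * matrix[r][c] for r in range(len(matrix)))
            for c in range(n_cols)]


# Hypothetical categorical column (instructor names are made up).
instructors = ["Smith, J", "Lee, K", "Smith, J", "Patel, R"]
vocab = build_vocab(instructors)

# One row of weights per category. A real model would update these rows
# during training; here they are just small random numbers.
rng = random.Random(0)
emb_dim = 3  # arbitrary width chosen for illustration
embedding = [[rng.uniform(-0.1, 0.1) for _ in range(emb_dim)] for _ in vocab]

# Embedding lookup: select the row for the category.
lookup = embedding[vocab["Lee, K"]]

# Equivalent dense computation: one-hot vector times the embedding matrix.
via_one_hot = vec_times_matrix(one_hot(vocab["Lee, K"], len(vocab)), embedding)
```

Because the rows are trainable, two instructors whose courses behave similarly can end up with nearby vectors, which a fixed one-hot encoding cannot express.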
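The validation split and the three-column layout of _viz_prediction.csv_ described in step 8 can be sketched as follows. The column names (`Year`, `Term`, `GPA`), the term ordering, the sample rows, and the sign of the difference (actual minus predicted) are assumptions, not details taken from the repository. A training-set mean stands in for the neural network's predictions.

```python
import csv
import io

# Assumed ordering of terms within a calendar year.
TERM_ORDER = {"Spring": 0, "Summer": 1, "Fall": 2}


def semester_key(row):
    """Sort key that orders rows chronologically by (year, term)."""
    return (int(row["Year"]), TERM_ORDER[row["Term"]])


def split_last_semester(rows):
    """Hold out the most recent semester as the validation set."""
    last = max(semester_key(r) for r in rows)
    train = [r for r in rows if semester_key(r) != last]
    valid = [r for r in rows if semester_key(r) == last]
    return train, valid


def prediction_table(actual, predicted):
    """Build rows of (actual, predicted, actual - predicted)."""
    return [(a, p, a - p) for a, p in zip(actual, predicted)]


# Hypothetical records; the real data comes from filteredComplete.csv.
rows = [
    {"Year": "2017", "Term": "Fall", "Course": "CS 101", "GPA": "3.0"},
    {"Year": "2018", "Term": "Spring", "Course": "CS 101", "GPA": "3.5"},
    {"Year": "2017", "Term": "Spring", "Course": "MATH 200", "GPA": "2.5"},
    {"Year": "2018", "Term": "Spring", "Course": "MATH 200", "GPA": "3.0"},
]

train, valid = split_last_semester(rows)

# Placeholder predictor: the mean GPA of the training rows.
train_mean = sum(float(r["GPA"]) for r in train) / len(train)
actual = [float(r["GPA"]) for r in valid]
predicted = [train_mean for _ in valid]
table = prediction_table(actual, predicted)

# Write the table as CSV into memory instead of onto disk.
buffer = io.StringIO()
writer = csv.writer(buffer)
writer.writerow(["actual", "predicted", "difference"])
writer.writerows(table)
csv_text = buffer.getvalue()
```

Holding out the latest semester, rather than a random sample, mirrors how the model is meant to be used: it is trained on the past and evaluated on a semester it has not seen.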