Ecosyste.ms: Awesome
An open API service indexing awesome lists of open source software.
https://github.com/nileshsah/robomax
A machine-learning based open-domain QA chatbot from scratch 🤖
- Host: GitHub
- URL: https://github.com/nileshsah/robomax
- Owner: nileshsah
- License: mit
- Created: 2018-09-22T23:01:28.000Z (about 6 years ago)
- Default Branch: master
- Last Pushed: 2018-10-31T21:43:34.000Z (about 6 years ago)
- Last Synced: 2024-10-11T15:41:46.643Z (about 1 month ago)
- Topics: chatbot, ipynb, machine-learning, natural-language-processing, notebook
- Language: Jupyter Notebook
- Homepage:
- Size: 87.3 MB
- Stars: 10
- Watchers: 3
- Forks: 3
- Open Issues: 0
Metadata Files:
- Readme: README.md
- License: LICENSE
Awesome Lists containing this project
README
# *Robo*Max 🤖
Meet RoboMax! Your personal assistant for all your queries about the US election ¯\\_(ツ)_/¯
The Project
---
This repository contains self-contained Jupyter notebooks used to train our assistant, RoboMax, to answer open-domain questions. We use tweets as our knowledge base and attempt to reflect back the `opinion of the world` on your question of interest. At the moment, RoboMax is tuned to answer questions about the 2016 US election, using tweets generously made available at https://www.kaggle.com/kinguistics/election-day-tweets/#election_day_tweets.csv

Getting Started
---
The notebook [robomax-training-notebook.ipynb](robomax-training-notebook.ipynb) serves as the starting point for this project and covers the major data-exploration and feature-engineering tasks. The notebook [robomax-election-tweets-bot.ipynb](robomax-election-tweets-bot.ipynb) tunes RoboMax to answer questions based on the election tweets.
Dataset
---
Since no Twitter-based question-answer dataset was available, we resorted to a modified version of the standard [SQuAD](https://rajpurkar.github.io/SQuAD-explorer/) reading-comprehension dataset. Instead of predicting the factual answer span, we trained our model to identify the sentence containing the required answer.

Training
---
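As a hedged sketch of this setup: each (question, sentences, answer) record can be turned into sentence-level binary labels with simple word-overlap features. The function names and the feature set below are illustrative, not the repo's actual code; the resulting `X, y` could then feed a Random Forest classifier evaluated with AUC, as described below.

```python
# Illustrative sketch: derive sentence-level training examples from
# SQuAD-style records. A positive label means the sentence contains the answer.

def overlap_features(question, sentence):
    """Toy features: shared-word count, overlap ratio, sentence length."""
    q = set(question.lower().split())
    s = set(sentence.lower().split())
    shared = q & s
    return [len(shared), len(shared) / (len(s) or 1), len(s)]

def make_examples(records):
    """records: iterable of (question, [sentences], answer_text) -> (X, y)."""
    X, y = [], []
    for question, sentences, answer in records:
        for sent in sentences:
            X.append(overlap_features(question, sent))
            y.append(1 if answer in sent else 0)
    return X, y
```

These `X, y` pairs could be fed to, e.g., scikit-learn's `RandomForestClassifier` and scored with `roc_auc_score`, mirroring the baseline described in this section.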
We built our model on fairly simple features with a baseline Random Forest classifier, which leaves plenty of room for improvement. [AUC](https://en.wikipedia.org/wiki/Receiver_operating_characteristic) served as the metric to optimize because of the usual class-imbalance issue: we aimed to improve recall on the sentences containing the correct answer over raw precision.

Prediction
---
We use a combination of indexing, prediction, and summarization to formulate an answer to a given question. [Whoosh](https://whoosh.readthedocs.io/en/latest/intro.html) serves as our indexing library. Our pre-trained model scores the indexer's results by how closely each tweet matches the question, and the best results are then condensed with an Edmundson summarizer to produce the final answer.
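The index-score-summarize pipeline can be sketched as follows. Note the substitutions: the repo uses Whoosh for indexing and an Edmundson summarizer for condensing results, whereas this minimal stand-in uses a hand-rolled term-frequency index and simply keeps the top-scoring tweets; all class and function names here are hypothetical.

```python
# Minimal stand-in for the index -> score -> summarize pipeline.
from collections import Counter, defaultdict

class TinyIndex:
    """Toy inverted index over tweets (Whoosh plays this role in the repo)."""

    def __init__(self):
        self.postings = defaultdict(set)  # term -> set of tweet ids
        self.tweets = []

    def add(self, text):
        tid = len(self.tweets)
        self.tweets.append(text)
        for term in text.lower().split():
            self.postings[term].add(tid)

    def search(self, question, k=3):
        # Score each candidate tweet by how many query terms it shares.
        scores = Counter()
        for term in question.lower().split():
            for tid in self.postings.get(term, ()):
                scores[tid] += 1
        return [self.tweets[tid] for tid, _ in scores.most_common(k)]

def answer(index, question, n_sentences=2):
    # In the repo, the trained model re-scores the indexer's hits and an
    # Edmundson summarizer condenses them; here we keep the top hits as-is.
    return " ".join(index.search(question, k=n_sentences))
```

The design point this illustrates is the separation of concerns: the indexer narrows millions of tweets down to a small candidate set cheaply, the model only has to rank that small set, and the summarizer turns the ranked tweets into a readable answer.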