Ecosyste.ms: Awesome
An open API service indexing awesome lists of open source software.
https://github.com/pszemraj/ai-msgbot
Training & Implementation of chatbots leveraging GPT-like architecture with the aitextgen package to enable dynamic conversations.
https://github.com/pszemraj/ai-msgbot
ai aitextgen chat-application chatbot deep-learning deepspeed deployment gpt-2 gpt-j gpt-j-6b gradio huggingface huggingface-transformers natural-language-processing nlp nlp-parsing telegram telegram-bot text-generation transformers
Last synced: about 1 month ago
JSON representation
Training & Implementation of chatbots leveraging GPT-like architecture with the aitextgen package to enable dynamic conversations.
- Host: GitHub
- URL: https://github.com/pszemraj/ai-msgbot
- Owner: pszemraj
- Created: 2021-10-15T01:31:37.000Z (about 3 years ago)
- Default Branch: main
- Last Pushed: 2022-09-06T07:22:08.000Z (about 2 years ago)
- Last Synced: 2023-03-10T06:26:39.282Z (over 1 year ago)
- Topics: ai, aitextgen, chat-application, chatbot, deep-learning, deepspeed, deployment, gpt-2, gpt-j, gpt-j-6b, gradio, huggingface, huggingface-transformers, natural-language-processing, nlp, nlp-parsing, telegram, telegram-bot, text-generation, transformers
- Language: Jupyter Notebook
- Homepage:
- Size: 66.8 MB
- Stars: 33
- Watchers: 3
- Forks: 9
- Open Issues: 0
-
Metadata Files:
- Readme: README.md
- License: LICENSE
Awesome Lists containing this project
README
# AI Chatbots based on GPT Architecture
> It sure seems like there are a lot of text-generation chatbots out there, but it's hard to find a python package or model that is easy to tune around a simple text file of message data. This repo is a simple attempt to help solve that problem.
`ai-msgbot` covers the practical use case of building a chatbot that sounds like you (or some dataset/persona you choose) by training a text-generation model to generate conversation in a consistent structure. This structure is then leveraged to deploy a chatbot that is a "free-form" model that _consistently_ replies like a human.
There are three primary components to this project:
1. parsing a dataset of conversation-like data
2. training a text-generation model. This repo is designed around using the Google Colab environment for model training.
3. Deploy the model to a chatbot interface for users to interact with, either locally or on a cloud service.It relies on the [`aitextgen`](https://github.com/minimaxir/aitextgen) and [`python-telegram-bot`](https://github.com/python-telegram-bot/python-telegram-bot) libraries. Examples of how to train larger models with DeepSpeed are in the `notebooks/colab-huggingface-API` directory.
```sh
python ai_single_response.py -p "greetings sir! what is up?"... generating...
finished!
('hello, i am interested in the history of vegetarianism. i do not like the '
'idea of eating meat out of respect for sentient life')
```Some of the trained models can be interacted with through the HuggingFace spaces and model inference APIs on the [ETHZ Analytics Organization](https://huggingface.co/ethzanalytics) page on huggingface.co.
* * *
**Table of Contents**
- [Quick outline of repo](#quick-outline-of-repo)
- [Example](#example)
- [Quickstart](#quickstart)
- [Repo Overview and Usage](#repo-overview-and-usage)
- [New to Colab?](#new-to-colab)
- [Parsing Data](#parsing-data)
- [Training a text generation model](#training-a-text-generation-model)
- [Training: Details](#training-details)
- [Interaction with a Trained model](#interaction-with-a-trained-model)
- [Improving Response Quality: Spelling & Grammar Correction](#improving-response-quality-spelling--grammar-correction)
- [WIP: Tasks & Ideas](#wip-tasks--ideas)
- [Extras, Asides, and Examples](#extras-asides-and-examples)
- [examples of command-line interaction with "general" conversation bot](#examples-of-command-line-interaction-with-general-conversation-bot)
- [Other resources](#other-resources)
- [Citations](#citations)* * *
## Quick outline of repo
- training and message EDA notebooks in `notebooks/`
- python scripts for parsing message data into a standard format for training GPT are in `parsing-messages/`
- example data (from the _Daily Dialogues_ dataset) is in `conversation-data/`
- Usage of default models is available via the `dowload_models.py` script### Example
This response is from a [bot on Telegram](https://t.me/GPTPeter_bot), finetuned on the author's messages
The model card can be found [here](https://huggingface.co/pszemraj/opt-peter-2.7B).
## Quickstart
> **NOTE: to build all the requirements, you _may_ need Microsoft C++ Build Tools, found [here](https://visualstudio.microsoft.com/visual-cpp-build-tools/)**
1. clone the repo
2. cd into the repo directory: `cd ai-msgbot/`
3. Install the requirements: `pip install -r requirements.txt`
- if using conda: `conda env create --file environment.yml`
- _NOTE:_ if there are any errors with the conda install, it may ask for an environment name which is `msgbot`
4. download the models: `python download_models.py` _(if you have a GPT-2 model, save the model to the working directory, and you can skip this step)_
5. run the bot: `python ai_single_response.py -p "hey, what's up?"` or enter a "chatroom" with `python conv_w_ai.py -p "hey, what's up?"`
- _Note:_ for either of the above, the `-h` parameter can be passed to see the options (or look in the script file)Put together in a shell block:
```sh
git clone https://github.com/pszemraj/ai-msgbot.git
cd ai-msgbot/
pip install -r requirements.txt
python download_models.py
python ai_single_response.py -p "hey, what's up?"
```* * *
## Repo Overview and Usage
### A general process-flow
### Parsing Data
- the first step in understanding what is going on here is to understand what is happening ultimately is teaching GPT-2 to recognize a "script" of messages and respond as such.
- this is done with the `aitextgen` library, and it's recommended to read through [some of their docs](https://docs.aitextgen.io/tutorials/colab/) and take a look at the _Training your model_ section before returning here.
- essentially, to generate a novel chatbot from just text (without going through too much trouble as required in other libraries.. can you easily abstract your friend's WhatsApp messages into a "persona"?)**An example of what a "script" is:**
speaker a:
hi, becky, what's up?speaker b:
not much, except that my mother-in-law is driving me up the wall.speaker a:
what's the problem?speaker b:
she loves to nit-pick and criticizes everything that i do. i can never do anything right when she's around...._Continued_...
more to come, but check out `parsing-messages/parse_whatsapp_output.py` for a script that will parse messages exported with the standard [whatsapp chat export feature](https://faq.whatsapp.com/196737011380816/?locale=en_US#:~:text=You%20can%20use%20the%20export,with%20media%20or%20without%20media.). consolidate all the WhatsApp message export folders into a root directory, and pass the root directory to this
TODO: more words
### Training a text generation model
The next step is to leverage the text-generative model to reply to messages. This is done by "behind the scenes" parsing/presenting the query with either a real or artificial speaker name and having the response be from `target_name` and, in the case of GPT-Peter, it is me.
Depending on computing resources and so forth, it is possible to keep track of the conversation in a helper script/loop and then feed in the prior conversation and _then_ the prompt, so the model can use the context as part of the generation sequence, with of course the [attention mechanism](https://arxiv.org/abs/1706.03762) ultimately focusing on the last text past to it (the actual prompt)
Then, deploying this pipeline to an endpoint where a user can send in a message, and the model will respond with a response. This repo has several options; see the `deploy-as-bot/` directory, which has an associated README.md file.
#### Training: Details
- an example dataset (_Daily Dialogues_) parsed into the script format can be found locally in the `conversation-data` directory.
- When learning, it is probably best to use a conversational dataset such as _Daily Dialogues_ as the last dataset to finetune the GPT2 model. Still, before that, the model can "learn" various pieces of information from something like a natural questions-focused dataset.
- many more datasets are available online at [PapersWithCode](https://paperswithcode.com/datasets?task=dialogue-generation&mod=texts) and [GoogleResearch](https://research.google/tools/datasets/). Seems that _Google Research_ also has a tool for searching for datasets online.
- Note that training is done in google colab itself. try opening `notebooks/colab-notebooks/GPT_general_conv_textgen_training_GPU.ipynb` in Google Colab (see the HTML button at the top of that notebook or click [this link to a shared git gist](https://colab.research.google.com/gist/pszemraj/06a95c7801b7b95e387eafdeac6594e7/gpt2-general-conv-textgen-training-gpu.ipynb))
- Essentially, a script needs to be parsed and loaded into the notebook as a standard .txt file with formatting as outlined above. Then, the text-generation model will load and train using _aitextgen's_ wrapper around the PyTorch lightning trainer. Essentially, the text is fed into the model, and it self-evaluates for a "test" as to whether a text message chain (somewhere later in the doc) was correctly predicted or not.TODO: more words
### New to Colab?
`aitextgen` is largely designed around leveraging Colab's free-GPU capabilities to train models. Training a text generation model and most transformer models, _is resource intensive_. If new to the Google Colab environment, check out the below to understand more of what it is and how it works.
- [Google's FAQ](https://research.google.com/colaboratory/faq.html)
- [Medium Article on Colab + Large Datasets](https://satyajitghana.medium.com/working-with-huge-datasets-800k-files-in-google-colab-and-google-drive-bcb175c79477)
- [Google's Demo Notebook on I/O](https://colab.research.google.com/notebooks/io.ipynb)
- [A better Colab Experience](https://towardsdatascience.com/10-tips-for-a-better-google-colab-experience-33f8fe721b82)### Interaction with a Trained model
- Command line scripts:
- `python ai_single_response.py -p "hey, what's up?"`
- `python conv_w_ai.py -p "hey, what's up?"`
- You can pass the argument `--model ` to change the model.
- Example: `python conv_w_ai.py -p "hey, what's up?" --model "GPT2_trivNatQAdailydia_774M_175Ksteps"`
- Some demos are available on the ETHZ Analytics Group's huggingface.co page (_no code required!_):
- [basic chatbot](https://huggingface.co/spaces/ethzanalytics/dialogue-demo)
- [GPT-2 XL Conversational Chatbot](https://huggingface.co/spaces/ethzanalytics/dialogue-demo)
- Gradio - locally hosted runtime with public URL.
- See: `deploy-as-bot/gradio_chatbot.py`
- The UI and interface will look similar to the demos above, but run locally & are more customizable.
- Telegram bot - Runs locally, and anyone can message the model from the Telegram messenger app(s).
- See: `deploy-as-bot/telegram_bot.py`
- An example chatbot by one of the authors is usually online and can be found [here](https://t.me/GPTPeter_bot)### Improving Response Quality: Spelling & Grammar Correction
One of this project's primary goals is to train a chatbot/QA bot that can respond to the user "unaided" where it does not need hardcoding to handle questions/edge cases. That said, sometimes the model will generate a bunch of strings together. Applying "general" spell correction helps make the model responses as understandable as possible without interfering with the response/semantics.
- Implemented methods:
- **symspell** (via the pysymspell library) _NOTE: while this is fast and works, it sometimes corrects out common text abbreviations to random other short words that are hard to understand, i.e., **tues** and **idk** and so forth_
- **gramformer** (via transformers `pipeline()`object). a pretrained NN that corrects grammar and (to be tested) hopefully does not have the issue described above. Links: [model page](https://huggingface.co/prithivida/grammar_error_correcter_v1), [the models github](https://github.com/PrithivirajDamodaran/Gramformer/)
- **Grammar Synthesis** (WIP) - Some promising results come from training a text2text generation model that, through "pseudo-diffusion," is trained to denoise **heavily** corrupted text while learning to _not_ change the semantics of the text. A checkpoint and more details can be found [here](https://huggingface.co/pszemraj/grammar-synthesis-base) and a notebook [here](https://colab.research.google.com/gist/pszemraj/91abb08aa99a14d9fdc59e851e8aed66/demo-for-grammar-synthesis-base.ipynb).* * *
## WIP: Tasks & Ideas
- [x] finish out `conv_w_ai.py` that is capable of being fed a whole conversation (or at least, the last several messages) to prime response and "remember" things.
- [ ] better text generation- add-in option of generating multiple responses to user prompts, automatically applying sentence scoring to them, and returning the one with the highest mean sentence score.
- constrained textgen
- [x] explore constrained textgen
- [ ] add constrained textgen to repo
[x] assess generalization of hyperparameters for "text-message-esque" bots
- [ ] add write-up with hyperparameter optimization results/learnings- [ ] switch repo API from `aitextgen` to `transformers pipeline` object
- [ ] Explore model size about "human-ness."## Extras, Asides, and Examples
### examples of command-line interaction with "general" conversation bot
The following responses were received for general conversational questions with the `GPT2_trivNatQAdailydia_774M_175Ksteps` model. This is an example of what is capable (and much more!!) in terms of learning to interact with another person, especially in a different language:
python ai_single_response.py --time --model "GPT2_trivNatQAdailydia_774M_175Ksteps" --prompt "where is the grocery store?"
... generating...
finished!
"it's on the highway over there."
took 38.9 seconds to generate.Python ai_single_response.py --time --model "GPT2_trivNatQAdailydia_774M_175Ksteps" --prompt "what should I bring to the party?"
... generating...
finished!
'you need to just go to the station to pick up a bottle.'
took 45.9 seconds to generate.C:\Users\peter\PycharmProjects\gpt2_chatbot>python ai_single_response.py --time --model "GPT2_trivNatQAdailydia_774M_175Ksteps" --prompt "do you like your job?"
... generating...
finished!
'no, not very much.'
took 50.1 seconds to generate.### Other resources
These are probably worth checking out if you find you like NLP/transformer-style language modeling:
1. [The Huggingface Transformer and NLP course](https://huggingface.co/course/chapter1/2?fw=pt)
2. [Practical Deep Learning for Coders](https://course.fast.ai/) from fast.ai* * *
## Citations
TODO: add citations for datasets and main packages used.