https://github.com/twoeightnine/xvii_admin_bot
ml-bot to answer the most common questions to users
- Host: GitHub
- URL: https://github.com/twoeightnine/xvii_admin_bot
- Owner: TwoEightNine
- Created: 2020-07-23T09:00:42.000Z (over 5 years ago)
- Default Branch: master
- Last Pushed: 2022-12-08T11:14:41.000Z (over 3 years ago)
- Last Synced: 2025-02-13T18:53:10.874Z (about 1 year ago)
- Topics: bot, machine-learning, nlp, python
- Language: Python
- Homepage:
- Size: 109 KB
- Stars: 2
- Watchers: 3
- Forks: 0
- Open Issues: 12
Metadata Files:
- Readme: readme.md
## russian answer bot for vk groups (and more)
this is an ML solution for building a bot for groups in the vk social network
(you can easily write a delegate for another social network).
the bot's behavior is based on real users' messages sent earlier.
currently the bot supports only the russian language
this bot was created to help me answer the most frequent questions in a vk group's messages (see [xvii messenger for vk](https://github.com/TwoEightNine/XVII))
read on for more information about how it works
**this project is at the baseline stage**
### installation
#### step 0. cloning and setup
clone this repository using git
```bash
git clone https://github.com/TwoEightNine/xvii_admin_bot.git
cd xvii_admin_bot
```
then create and activate a virtual environment and install the dependencies
```bash
sudo apt install python3-venv
python3.6 -m venv admin_bot_env
source admin_bot_env/bin/activate
pip install -r requirements.txt
```
in the root directory, create a file `secret.py` with sensitive information. the file should look like this:
```python
access_token = 'your token here'
no_fetch_users = [13371337228]
```
`access_token` is a token for accessing group messages (to obtain the token visit [this page](https://vk.com/dev/authcode_flow_group), and do not forget to use `scope=messages,offline`).
`no_fetch_users` is a list of user ids. if you want to ignore messages from a user, add their id here
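a minimal sketch of how an ignore list like `no_fetch_users` could be applied. the message layout (`user_id`, `text`) and the helper `filter_messages` are assumptions made for illustration, not the repository's actual code:

```python
# value as it would appear in secret.py
no_fetch_users = [13371337228]


def filter_messages(messages, ignored_ids):
    """Drop messages whose sender is in the ignore list."""
    ignored = set(ignored_ids)
    return [m for m in messages if m["user_id"] not in ignored]


# hypothetical message layout, for illustration only
messages = [
    {"user_id": 13371337228, "text": "test message from the admin"},
    {"user_id": 42, "text": "how do i enable the dark theme?"},
]

kept = filter_messages(messages, no_fetch_users)
print(kept)
```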
#### step 1. fetching messages
to fetch messages run
```bash
python3 fetcher.py --count COUNT --social SOCIAL [-h]
```
where `COUNT` is the number of dialogs to fetch messages from and
`SOCIAL` is the social network to use. run the script with `-h`
to see which social networks are supported
the script will save the messages to `data/messages.csv`
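the command line above could be declared with `argparse` roughly as follows. this is a sketch, not the actual `fetcher.py`: the list `SUPPORTED_SOCIALS` and its single value `vk` are assumptions, so check `-h` on the real script for the supported names:

```python
import argparse

# hypothetical list; run the real script with -h to see the supported values
SUPPORTED_SOCIALS = ["vk"]


def build_parser():
    parser = argparse.ArgumentParser(
        description="fetch messages into data/messages.csv")
    parser.add_argument("--count", type=int, required=True,
                        help="how many dialogs to fetch messages from")
    parser.add_argument("--social", choices=SUPPORTED_SOCIALS, required=True,
                        help="which social network to use")
    return parser


# parse an explicit argument list instead of sys.argv to keep the sketch testable
args = build_parser().parse_args(["--count", "200", "--social", "vk"])
print(args.count, args.social)
```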
#### step 2. find clusters
this step performs semi-automatic labelling.
fetched messages are lemmatized and cleaned, then converted to tf-idf vectors,
which are grouped using spectral clustering. to perform clustering run:
```bash
python3 clusterizer.py [--search] [--clusters_count CL_COUNT] --random_state RND_ST [-h]
```
where `--search` is an optional flag that runs a search for a better number of clusters,
`--clusters_count` is required for the final clustering and
defines the preferred number of clusters,
`--random_state` is an integer seed for reproducibility
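to make the tf-idf conversion mentioned at the start of this step concrete, here is a plain-python sketch of the basic formula (term frequency times log inverse document frequency). it is an illustration only: the function `tf_idf` and the toy english tokens are assumptions, and library implementations typically add smoothing and normalization on top of this:

```python
import math
from collections import Counter


def tf_idf(documents):
    """Return one {term: weight} dict per tokenized document."""
    n_docs = len(documents)
    # document frequency: in how many documents each term occurs
    df = Counter(term for doc in documents for term in set(doc))
    vectors = []
    for doc in documents:
        counts = Counter(doc)
        vectors.append({
            term: (count / len(doc)) * math.log(n_docs / df[term])
            for term, count in counts.items()
        })
    return vectors


# already lemmatized and cleaned tokens (toy stand-ins for real messages)
docs = [
    ["dark", "theme", "enable"],
    ["theme", "color", "change"],
    ["crash", "app", "start"],
]

vectors = tf_idf(docs)
print(vectors[0])
```

words that occur in fewer messages (like `dark`) get a higher weight than words shared between messages (like `theme`), which is what lets clustering separate topics.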
you may want to run a search first (with the `--search` flag) to calculate clustering metrics
for different numbers of clusters. in this case the script will print the metrics
once the search has given you a 'good' number of clusters for your task,
run the script again with `--clusters_count YOUR_VALUE` and
it will create `data/model_explanation.txt`
with information about the most frequent words in every cluster.
if you think the clustering result is not good enough,
you can rerun clustering with a different number of clusters or a different random state
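a per-cluster summary of frequent words, like the one described for `data/model_explanation.txt`, could be computed along these lines. the function name, the input layout and the toy tokens are assumptions for illustration; the real file format may differ:

```python
from collections import Counter, defaultdict


def top_words_per_cluster(token_lists, labels, top_n=3):
    """Count words per cluster label and keep the most frequent ones."""
    counters = defaultdict(Counter)
    for tokens, label in zip(token_lists, labels):
        counters[label].update(tokens)
    return {
        label: [word for word, _ in counter.most_common(top_n)]
        for label, counter in sorted(counters.items())
    }


# toy lemmatized messages and the cluster index assigned to each of them
token_lists = [
    ["theme", "dark"],
    ["theme", "color"],
    ["crash", "start"],
    ["crash", "app", "crash"],
]
labels = [0, 0, 1, 1]

summary = top_words_per_cluster(token_lists, labels, top_n=1)
print(summary)
```

reading such a summary is what lets you decide which cluster indexes belong to which class.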
using this data, create `classes.json` in the following format:
```json
{
    "your_class_1": {
        "clusters": [3, 7, 11],
        "response": "your_response_for_class_1"
    },
    "your_class_2": {
        "clusters": [2],
        "response": "__UNREAD"
    },
    "your_class_3": {
        "clusters": [16],
        "response": "your_response_for_class_3"
    }
}
```
this file maps clusters onto the classes you need.
the model is trained on these classes.
`clusters` is a list of indexes of the clusters that match your class
`response` is the answer sent to the user. this field may contain the special markers
`__UNREAD` and `__READ`. in these cases no response is sent:
with `__READ` the conversation is marked as read (no answer needed), with `__UNREAD` it is left unread
(human attention needed)
all clusters not mentioned implicitly belong to the class `undefined`
with response `__UNREAD`
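the rules above can be sketched in a few lines of python. the class names, the action labels (`send`, `mark_read`, `leave_unread`) and both helper functions are assumptions for illustration, not the repository's code:

```python
import json

# example content in the classes.json format; class names are made up
CLASSES_JSON = """
{
    "greeting": {"clusters": [3, 7, 11], "response": "hello! how can i help?"},
    "bug_report": {"clusters": [2], "response": "__UNREAD"},
    "thanks": {"clusters": [16], "response": "__READ"}
}
"""


def build_cluster_map(classes):
    """Invert classes.json: cluster index -> class name."""
    cluster_to_class = {}
    for class_name, spec in classes.items():
        for cluster in spec["clusters"]:
            cluster_to_class[cluster] = class_name
    return cluster_to_class


def resolve(cluster, classes, cluster_to_class):
    """Return (class name, action, text to send or None) for a cluster."""
    # clusters not listed anywhere fall into `undefined` with `__UNREAD`
    class_name = cluster_to_class.get(cluster, "undefined")
    response = classes.get(class_name, {}).get("response", "__UNREAD")
    if response == "__UNREAD":
        return class_name, "leave_unread", None
    if response == "__READ":
        return class_name, "mark_read", None
    return class_name, "send", response


classes = json.loads(CLASSES_JSON)
cluster_to_class = build_cluster_map(classes)

print(resolve(7, classes, cluster_to_class))
print(resolve(5, classes, cluster_to_class))
```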
#### step 3. find and train a model
once you have created `classes.json`, you can search for and train a model that
makes predictions.
to run the search execute
```bash
python3 modeller.py --search [--cv CV] [--sort_by METRIC]
```
where `--search` is a flag indicating that you want to
perform a search (using sklearn's `GridSearchCV`),
`CV` is the number of folds to use in cross validation,
`METRIC` is the alias of the metric to sort by
you can use the default search params (estimators and their parameters)
or define your own in `hyperparams.py` (variable `search_estimators`)
after the search you will see the 5 best results (according to `--sort_by`)
and can explore all configurations in `data/search_results.csv`. the best model
should be set in `hyperparams.py` as `final_estimator`
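picking the best configurations by a chosen metric amounts to a sort. a small sketch, assuming each result row is a dict and that a higher metric value is better; the column names (`config`, `f1`, `accuracy`) and the numbers are made up for illustration and do not come from the real `data/search_results.csv`:

```python
def best_configs(results, sort_by, top_n=5):
    """Return the top_n rows ordered by the chosen metric, best first."""
    return sorted(results, key=lambda row: row[sort_by], reverse=True)[:top_n]


# made-up rows, for illustration only
results = [
    {"config": "config_a", "f1": 0.81, "accuracy": 0.84},
    {"config": "config_b", "f1": 0.83, "accuracy": 0.82},
    {"config": "config_c", "f1": 0.78, "accuracy": 0.80},
]

best = best_configs(results, sort_by="f1", top_n=2)
print([row["config"] for row in best])
```

sorting by a different metric can change the winner, so it is worth checking more than one before choosing `final_estimator`.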
to train a model run
```bash
python3 modeller.py [--cv CV] [--pca_n_components N_COM]
```
where `N_COM` is the value passed to `PCA()` as `n_components`;
if it is not set, PCA is not used
the files `data/model_pipeline.pkl` and `data/model_classes.pkl` will be created
**optionally**, you can interactively check the model using
```bash
python3 predictor.py
```
enter a message in russian and see which class the model assigns it to
#### step 4. run and chill
the bot is ready to start. to launch it run
```bash
python3 bot.py --social SOCIAL
```
in stdout you will see status messages, incoming messages and
predicted answers
###### twoeightnine, 2020