# Panza: A personal email assistant, trained and running on-device
## What is Panza?
Panza is an automated email assistant customized to your writing style and past email history. Its main features are as follows:
* Panza produces a fine-tuned LLM that matches your writing style, pairing it with a Retrieval-Augmented Generation (RAG) component which helps it produce relevant emails.
* Panza **can be trained and run entirely locally**. Currently, it requires a single GPU with
16-24 GiB of memory, but we also plan to release a CPU-only version. **At no point in training or execution is your data shared with the entities that trained the original LLMs, with LLM distribution services such as Huggingface, or with us.**
* Training and execution are also quick: for a dataset on the order of 1000 emails, training Panza takes well under an hour, and generating a new email takes at most a few seconds.
## Prerequisites
- Your emails, exported to `mbox` format (see tutorial below).
- A computer, preferably with an NVIDIA GPU with at least 24 GiB of memory (alternatively, check out [running in Google Colab](#cloud-try-out-panza-in-google-colab)).
- A Hugging Face [account](https://huggingface.co/login) to download the models (free of charge).
- [Optional] A Weights & Biases [account](https://wandb.ai/login) to log metrics during training (free of charge).
- Basic Python and Unix knowledge, such as building environments and running Python scripts.
- *No prior LLM experience is needed.*

## How it works
### :film_projector: Step 1: Data playback
For most email clients, it is possible to download a user's past emails in a machine-friendly .mbox format. For example, Gmail allows you to do this via [Google Takeout](https://takeout.google.com), whereas Thunderbird supports it via various plugins.
One key part of Panza is a dataset-generation technique we call **data playback**: Given some of your past emails in .mbox format, we automatically create a training set for Panza by using a pretrained LLM to summarize the emails in instruction form; each email becomes a `(synthetic instruction, real email)` pair.
Given a dataset consisting of all such pairs, we use them to "play back" your sent emails: the LLM receives only the instruction and has to generate the "ground truth" email as a training target. We find this approach very effective at teaching the LLM the user's writing style.
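As a rough illustration, here is a minimal Python sketch of the playback idea. The `summarize_into_instruction` helper and the field names are hypothetical stand-ins for illustration only; in Panza, the summarization is done by a pretrained LLM.

``` python
import json

def summarize_into_instruction(email_body: str) -> str:
    # Hypothetical stand-in for the pretrained-LLM summarizer:
    # here we naively turn the first line into an instruction.
    first_line = email_body.strip().splitlines()[0]
    return f"Write an email that says: {first_line}"

def build_playback_pairs(emails, out_path="train.jsonl"):
    # Each sent email becomes a (synthetic instruction, real email) pair;
    # at training time the model sees only the instruction and must
    # reproduce the real email.
    with open(out_path, "w") as f:
        for email_body in emails:
            pair = {
                "summary": summarize_into_instruction(email_body),  # synthetic instruction
                "email": email_body,                                # training target
            }
            f.write(json.dumps(pair) + "\n")

build_playback_pairs(["Hi Bob,\nThe Zoom link for tomorrow is attached.\nBest, Jane"])
```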
### :weight_lifting: Step 2: Local Fine-Tuning via Robust Adaptation (RoSA)
We then use parameter-efficient finetuning to train the LLM on this dataset, locally. We found that we get the best results with the [RoSA method](https://arxiv.org/pdf/2401.04679.pdf), which combines low-rank (LoRA) and sparse finetuning. If parameter efficiency is not a concern (that is, you have a more powerful GPU), regular full-rank/full-parameter finetuning can be used instead. We find that a moderate amount of further training strikes the right balance: the model matches the writer's style without memorizing irrelevant details from past emails.
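For intuition, the sketch below shows plain low-rank (LoRA) adaptation using the Hugging Face `peft` library. This is a simplified stand-in, not the RoSA implementation Panza ships with (RoSA additionally trains a sparse component on top of the low-rank adapters):

``` python
# Simplified parameter-efficient fine-tuning setup with LoRA only.
from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model

model = AutoModelForCausalLM.from_pretrained("meta-llama/Meta-Llama-3-8B-Instruct")
lora_config = LoraConfig(
    r=16,                                 # rank of the low-rank adapters
    lora_alpha=32,
    target_modules=["q_proj", "v_proj"],  # attach adapters to attention projections
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()  # only a small fraction of weights are trained
```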
### :owl: Step 3: Serving via RAG
Once we have a custom user model, Panza can be run locally together with a Retrieval-Augmented Generation (RAG) module. Specifically, this functionality stores past emails in a database and provides a few relevant emails as context for each new query. This allows Panza to better insert specific details, such as a writer's contact information or frequently used Zoom links.
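The retrieval step can be sketched in a few lines. The snippet below uses `sentence-transformers` and `faiss` as assumed stand-ins for Panza's actual embedding and database components:

``` python
# Minimal RAG retrieval sketch: embed past emails, index them, and
# fetch the most similar ones as context for a new instruction.
import faiss
from sentence_transformers import SentenceTransformer

past_emails = [
    "Hi Bob, the Zoom link for our weekly sync is https://zoom.us/j/...",
    "Dear Alice, please find my contact details below...",
]
encoder = SentenceTransformer("all-MiniLM-L6-v2")  # assumed embedding model

embeddings = encoder.encode(past_emails, normalize_embeddings=True)
index = faiss.IndexFlatIP(embeddings.shape[1])  # inner product = cosine on normalized vectors
index.add(embeddings)

query = "Send Bob the meeting link for tomorrow"
q_emb = encoder.encode([query], normalize_embeddings=True)
_, ids = index.search(q_emb, 1)  # top-1 most similar past email
context = "\n---\n".join(past_emails[i] for i in ids[0])  # prepended to the prompt
```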
The overall structure of Panza thus combines three components: data playback, local fine-tuning via RoSA, and RAG-based serving.
## Installation
### Conda
1. Make sure you have a version of [conda](https://docs.anaconda.com/free/miniconda/miniconda-install/) installed.
2. Run `source prepare_env.sh`. This script will create a conda environment named `panza` and install the required packages.

### Docker
As an alternative to the conda option above, you can run the following commands to pull a docker image with all the dependencies installed.
``` bash
docker pull istdaslab/panzamail
```

Alternatively, you can build the image yourself:
``` bash
docker build . -f Dockerfile -t istdaslab/panzamail
```

Then run it with:
``` bash
docker run -it --gpus all istdaslab/panzamail /bin/bash
```

Inside the container, you can activate the `panza` environment with:
``` bash
micromamba activate panza
```

## :rocket: Getting started
To quickly get started with building your own personalized email assistant, follow the steps below:
### Step 0: Download your sent emails
Below are detailed instructions for downloading your sent emails from Gmail via Google Takeout.
1. Go to [https://takeout.google.com/](https://takeout.google.com/).
2. Click `Deselect all`.
3. Find the `Mail` section (search for the phrase `Messages and attachments in your Gmail account in MBOX format`).
4. Select it.
5. Click on `All Mail data included` and deselect everything except `Sent`.
6. Scroll to the bottom of the page and click `Next step`.
7. Click on `Create export`.
8. Wait for the download link to arrive in your inbox.
9. Download `Sent.mbox` and place it in the `data/` directory.

For Outlook accounts, we suggest exporting a subset of your email to MBOX format via a Thunderbird plugin, such as [this add-on](https://addons.thunderbird.net/en-us/thunderbird/addon/importexporttools-ng/).
At the end of this step you should have the downloaded emails placed inside `data/Sent.mbox`.
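Optionally, you can sanity-check the export with Python's standard `mailbox` module before proceeding (this is not part of the Panza pipeline):

``` python
# Count and preview the exported messages.
import mailbox

mbox = mailbox.mbox("data/Sent.mbox")
print(f"{len(mbox)} messages exported")
for msg in list(mbox)[:3]:
    print(msg["Date"], "|", msg["Subject"])
```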
### Step 1: Environment configuration
Panza is configured through a set of environment variables defined in `scripts/config.sh` and shared across all the scripts.
The LLM prompt is controlled by a set of `prompt_preambles` that give the model more insight into its role, the user, and how to reuse existing emails for *Retrieval-Augmented Generation (RAG)*. See more details in the [prompting section](prompt_preambles/README.md).
:warning: Before continuing, make sure you complete the following setup:
- Modify the environment variable `PANZA_EMAIL_ADDRESS` inside `scripts/config.sh` with your own email address.
- Modify `prompt_preambles/user_preamble.txt` with your own information. If you choose, this can even be empty.
- Log in to Hugging Face to be able to download pretrained models: `huggingface-cli login`.
- [Optional] Log in to Weights & Biases to log metrics during training: `wandb login`. Then, set `PANZA_WANDB_DISABLED=False` in `scripts/config.sh`.

You are now ready to move to `scripts`.
``` bash
cd scripts
```

### Step 2: Extract emails
1. Run `./extract_emails.sh`. This extracts your emails in text format to `data/<username>_clean.jsonl`, which you can manually inspect.
2. If you wish to eliminate any emails from the training set (e.g. those containing certain personal information), you can simply remove the corresponding rows (an optional helper snippet follows below).
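If you prefer to filter programmatically, a sketch like the one below drops rows containing sensitive keywords. The `"email"` field name and the `<username>` placeholder are assumptions based on the extraction step above:

``` python
# Drop extracted emails that contain blocklisted terms.
import json

BLOCKLIST = ["password", "account number"]  # your own sensitive terms
path = "data/<username>_clean.jsonl"        # output of ./extract_emails.sh

with open(path) as f:
    rows = [json.loads(line) for line in f]

kept = [r for r in rows if not any(w in r.get("email", "").lower() for w in BLOCKLIST)]

with open(path, "w") as f:
    for r in kept:
        f.write(json.dumps(r) + "\n")

print(f"Kept {len(kept)} of {len(rows)} emails")
```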
### Step 3: Prepare dataset
1. Simply run `./prepare_dataset.sh`.
This script takes care of all the prerequisites before training. Specifically, it:

- Creates synthetic prompts for your emails, as described in the [data playback](#film_projector-step-1-data-playback) section. The results are stored in `data/<username>_clean_summarized.jsonl`, where you can inspect the `"summary"` field (a quick spot-check snippet follows this list).
- Splits data into training and test subsets. See `data/train.jsonl` and `data/test.jsonl`.
- Creates a vector database from the embeddings of the training emails, which will later be used for *Retrieval-Augmented Generation (RAG)*. See `data/<username>.pkl` and `data/<username>.faiss`.
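To spot-check the synthetic instructions produced by data playback, you can print a few pairs. The field names and the placeholder path are assumptions based on the descriptions above:

``` python
# Print the first few (synthetic instruction, real email) pairs.
import json

with open("data/<username>_clean_summarized.jsonl") as f:
    for line in list(f)[:3]:
        row = json.loads(line)
        print("INSTRUCTION:", row["summary"])
        print("EMAIL:", row["email"][:100], "...")
        print("-" * 60)
```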
### Step 4: Train an LLM on your emails
We currently support `LLaMA3-8B-Instruct` and `Mistral-Instruct-v0.2` LLMs as base models; the former is the default, but we obtained good results with either model.
1. [Recommended] For parameter-efficient fine-tuning, run `./train_rosa.sh`.
If a larger GPU is available and full-parameter fine-tuning is possible, run `./train_fft.sh`.
2. We have prepopulated the training scripts with the parameter values that worked best for us. We recommend you try those first, but you can also experiment with different hyper-parameters by passing extra arguments to the training script, such as `LR`, `LORA_LR`, and `NUM_EPOCHS`. All the trained models are saved in the `checkpoints` directory.
Examples:
``` bash
./train_rosa.sh                                   # Will use the default parameters.
./train_rosa.sh LR=1e-6 LORA_LR=1e-6 NUM_EPOCHS=7 # Will override LR, LORA_LR, and NUM_EPOCHS.
```

### Step 5: Launch Panza!
1. Run `./run_panza_gui.sh MODEL=<path-to-your-trained-model>` to serve the trained model in a friendly GUI.
Alternatively, if you prefer using the CLI to interact with Panza, run `./run_panza_cli.sh` instead.

You can experiment with the following arguments:
- If `MODEL` is not specified, it will use a pretrained `Meta-Llama-3-8B-Instruct` model by default, although Panza also works with `Mistral-7B-Instruct-v2`. Try it out to compare the style difference!
- To disable RAG, run with `PANZA_DISABLE_RAG_INFERENCE=1`.

Example:
``` bash
./run_panza_gui.sh \
MODEL=/local/path/to/this/repo/checkpoints/models/panza-rosa_1e-6-seed42_7908 \
PANZA_DISABLE_RAG_INFERENCE=0 # this is the default behaviour, so you can omit it
```

:email: **Have fun with your new email writing assistant!** :email:
## :cloud: Try out Panza in Google Colab
- You can run Panza in a Google Colab instance [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/IST-DASLab/PanzaMail/blob/main/notebooks/panza_colab.ipynb).
## :microscope: Advanced usage
- [Data Preparation Guide](./scripts/README.md#data-guide)
- [Hyper-Parameter Tuning Guide](./scripts/README.md#hyper-parameter-tuning-guide)
- [Prompt Preambles Tutorial](prompt_preambles/README.md)

## Authors
Panza was conceived by Nir Shavit and Dan Alistarh and built by the [Distributed Algorithms and Systems group](https://ist.ac.at/en/research/alistarh-group/) at IST Austria. The contributors are (in alphabetical order):
Dan Alistarh, Eugenia Iofinova, Eldar Kurtic, Ilya Markov, Armand Nicolicioiu, Mahdi Nikdan, Andrei Panferov, and Nir Shavit.
Contact: [email protected]
We thank our collaborators Michael Goin and Tony Wang at NeuralMagic and MIT for their helpful testing and feedback.