https://github.com/snap-research/locomo
- Host: GitHub
- URL: https://github.com/snap-research/locomo
- Owner: snap-research
- License: other
- Created: 2024-02-23T18:28:28.000Z (over 2 years ago)
- Default Branch: main
- Last Pushed: 2024-08-13T03:21:32.000Z (almost 2 years ago)
- Last Synced: 2025-03-27T14:21:56.645Z (about 1 year ago)
- Language: Python
- Size: 7.32 MB
- Stars: 33
- Watchers: 9
- Forks: 5
- Open Issues: 5
Metadata Files:
- Readme: README.MD
- License: LICENSE.txt
Awesome Lists containing this project
- awesome-agent-evolution - LoCoMo - Long-context memory benchmark for agent memory systems. (Benchmarks and Evaluation / Embodied AI and Robotics)
README
# Data and Code for the **ACL 2024** Paper "**Evaluating Very Long-Term Conversational Memory of LLM Agents**"
**Authors**: [Adyasha Maharana](https://adymaharana.github.io/), [Dong-Ho Lee](https://www.danny-lee.info/), [Sergey Tulyakov](https://stulyakov.com/), [Mohit Bansal](https://www.cs.unc.edu/~mbansal/), [Francesco Barbieri](https://fvancesco.github.io/) and [Yuwei Fang](https://yuwfan.github.io/)
**Paper**: [pdf](https://github.com/snap-research/locomo/tree/main/static/paper/locomo.pdf)
## Data
We release LoCoMo, a high-quality evaluation benchmark consisting of *very* long-term conversational data. The benchmark consists of ten conversations. Each conversation is annotated for the **question-answering** and **event-summarization** tasks. Additionally, the dialogs in each conversation can be used for the **multimodal-dialog-generation** task. See statistics of the dataset in the Table below.

The dataset can be found in the `./data/locomo10.json` file in this repository. Each sample represents a single conversation and its corresponding annotations:
* `sample_id`: identifier for the sample
* `conversation`:
  * List of sessions (`session_<num>`) and their timestamps (`session_<num>_date_time`). The numbers `<num>` represent the chronological order of the sessions.
  * It also includes the names of the two speakers, i.e., `speaker_a` and `speaker_b`.
  * A *turn* within each *session* contains the name of the `speaker`, the dialog id `dia_id`, and the content of the dialog `text`.
  * If the turn contains images, it also includes a link to the image `img_url`, the caption generated for the image by the [BLIP](https://huggingface.co/Salesforce/blip-image-captioning-large) model `blip_caption`, and the search query used by the third-party module [icrawler](https://icrawler.readthedocs.io/en/latest/) to retrieve the image.
* `observation` (generated): Observations for each of the sessions in `conversation` (`session_<num>_observation`). See below for the code to regenerate observations. These observations are used as one of the databases for evaluating retrieval-augmented generation (RAG) models in our paper.
* `session_summary` (generated): Session-level summaries for each session in `conversation` (`session_<num>_summary`). See below for the code to regenerate session-level summaries. These summaries are also used as one of the databases for evaluating RAG models in our paper.
* `event_summary` (annotated): List of significant events for each speaker within each session in `conversation` (`events_session_<num>`). These are the ground truth annotations for the event summarization task in the LoCoMo dataset.
* `qa` (annotated): Question-answer annotations for the question answering task in the LoCoMo dataset. Each sample contains `question`, `answer`, `category` label and a list of dialog ids that contain the answer i.e., `evidence`, when available.
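
As a rough illustration of the layout described above, the following sketch walks the sessions of one sample in chronological order and resolves the `evidence` ids of a QA pair back to dialog turns. It assumes that the JSON file holds a list of samples and that session keys look like `session_1`, `session_2`, and so on. The helper names (`iter_sessions`, `index_turns`, `load_samples`) and the tiny in-memory sample are made up for this example and are not part of the repository.

```python
import json
import re

# Matches session keys such as "session_3" but not "session_3_date_time".
SESSION_KEY = re.compile(r"^session_(\d+)$")


def iter_sessions(conversation):
    """Yield (number, date_time, turns) for each session in chronological order."""
    numbers = sorted(
        int(m.group(1)) for key in conversation if (m := SESSION_KEY.match(key))
    )
    for num in numbers:
        date_time = conversation.get(f"session_{num}_date_time")
        yield num, date_time, conversation[f"session_{num}"]


def index_turns(conversation):
    """Map each dialog id (`dia_id`) to its turn so evidence ids can be resolved."""
    return {
        turn["dia_id"]: turn
        for _, _, turns in iter_sessions(conversation)
        for turn in turns
    }


def load_samples(path="./data/locomo10.json"):
    """Load the released samples; assumes the file is a JSON list of samples."""
    with open(path, encoding="utf-8") as f:
        return json.load(f)


# Hand-written toy sample that mimics the documented fields (not real data).
sample = {
    "sample_id": "toy-1",
    "conversation": {
        "speaker_a": "Angela",
        "speaker_b": "Maria",
        "session_2_date_time": "placeholder timestamp 2",
        "session_2": [
            {"speaker": "Maria", "dia_id": "D2:1", "text": "How did the opening go?"},
        ],
        "session_1_date_time": "placeholder timestamp 1",
        "session_1": [
            {"speaker": "Angela", "dia_id": "D1:1", "text": "I am opening a gallery."},
        ],
    },
    "qa": [
        {
            "question": "What is Angela opening?",
            "answer": "a gallery",
            "category": 1,
            "evidence": ["D1:1"],
        },
    ],
}

turns_by_id = index_turns(sample["conversation"])
for qa in sample["qa"]:
    # `evidence` is only present when available, so fall back to an empty list.
    evidence_texts = [
        turns_by_id[d]["text"] for d in qa.get("evidence", []) if d in turns_by_id
    ]
    print(qa["question"], "->", evidence_texts)
```

To run this against the released data, replace the toy sample with an element of `load_samples()`.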
**Note 1**: This release is a subset of the conversations released previously with our first arXiv version in March 2024. The initial release contained 50 conversations. We sampled a subset of the data to retain the longest conversations with high-quality annotations and to allow cost-effective evaluation of closed-source LLMs.
**Note 2**: We do not release the images. However, the web URLs, captions and search queries for the images are included in the dataset.
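
Since the images themselves are not shipped, a user who needs them would have to collect the references from the turns first. The sketch below gathers the `img_url` and `blip_caption` fields from every turn that has them. It is a hypothetical helper (`image_references`), and it accepts `img_url` either as a single string or as a list of strings, because the exact type is an assumption here.

```python
def image_references(conversation):
    """Collect (dia_id, url, caption) for every turn that carries an image."""
    refs = []
    for value in conversation.values():
        # Sessions are lists of turns; speaker names and timestamps are strings.
        if not isinstance(value, list):
            continue
        for turn in value:
            if "img_url" not in turn:
                continue
            urls = turn["img_url"]
            # Accept either a single URL string or a list of URLs.
            if isinstance(urls, str):
                urls = [urls]
            for url in urls:
                refs.append((turn["dia_id"], url, turn.get("blip_caption")))
    return refs


# Toy conversation with placeholder values (not real data).
toy_conversation = {
    "speaker_a": "Angela",
    "speaker_b": "Maria",
    "session_1_date_time": "placeholder timestamp",
    "session_1": [
        {"speaker": "Angela", "dia_id": "D1:1", "text": "Look at this painting!",
         "img_url": ["https://example.com/painting.jpg"],
         "blip_caption": "a painting of a lake"},
        {"speaker": "Maria", "dia_id": "D1:2", "text": "It looks lovely."},
    ],
}

for dia_id, url, caption in image_references(toy_conversation):
    print(dia_id, url, caption)
```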
## Code
Configuration variables such as API keys and output directories are set in `scripts/env.sh`, which is run at the beginning of all other scripts.
### Generate *very* long-term conversations between two LLM-agents with pre-assigned personalities using our LLM-based generative framework
The code to generate conversations is available in `scripts/generate_conversations.sh` and can be run as follows:
```
bash scripts/generate_conversations.sh
```
This code can be run under two settings:
1. Generate conversations between agents assigned with custom personas. To enable this setting, point `--out-dir` to a directory containing the files `agent_a.json` and `agent_b.json`. These files should contain the `name` and `persona_summary` of the speaker represented by the agent. See an example at `data/multimodal_dialog/example`.
```
{
  "name": "Angela",
  "persona_summary": "Angela is a 31 year old woman who works as the manager of a gift shop in Chapel Hill. She curates interesting pieces from local artists and has maintained a beautiful gallery in the form of the gift shop. She also makes her own art sometimes, in the form of oil paintings."
}
```
2. Create personalities using prompts from the MSC dataset. To enable this setting, point `--out-dir` to an empty directory. This will make the script sample a pair of personalities from `data/msc_personas_all.json`.
See `scripts/generate_conversations.py` for details on the various parameters that can be tweaked for generating the conversations. For example, `--num-days` can be changed to specify the temporal span of the conversations.
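
For the custom-persona setting, the two persona files can also be prepared programmatically. The sketch below writes `agent_a.json` and `agent_b.json` with the `name` and `persona_summary` fields into a directory that `--out-dir` could then point to. The function `write_persona_files` and the second persona are invented for illustration, and the files may need further fields that are not described here.

```python
import json
import tempfile
from pathlib import Path

REQUIRED_FIELDS = {"name", "persona_summary"}


def write_persona_files(out_dir, agent_a, agent_b):
    """Write agent_a.json and agent_b.json into out_dir and return its path."""
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    for filename, persona in (("agent_a.json", agent_a), ("agent_b.json", agent_b)):
        missing = REQUIRED_FIELDS - persona.keys()
        if missing:
            raise ValueError(f"{filename} is missing fields: {sorted(missing)}")
        (out / filename).write_text(json.dumps(persona, indent=2), encoding="utf-8")
    return out


# A temporary directory stands in for the real output directory.
persona_dir = write_persona_files(
    tempfile.mkdtemp(),
    {"name": "Angela",
     "persona_summary": "Angela manages a gift shop in Chapel Hill and paints in oils."},
    # Made-up second persona, for illustration only.
    {"name": "Maria",
     "persona_summary": "Maria is a hypothetical friend who enjoys hiking."},
)
print(sorted(p.name for p in persona_dir.iterdir()))
```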
### Evaluate open-source and closed-source LLMs on the LoCoMo Question Answering Task with the (truncated) conversation as context
* Evaluate OpenAI models
```
bash scripts/evaluate_gpts.sh
```
* Evaluate Anthropic models
```
bash scripts/evaluate_claude.sh
```
* Evaluate Gemini models
```
bash scripts/evaluate_gemini.sh
```
* Evaluate models available on Huggingface
```
bash scripts/evaluate_hf_llm.sh
```
### Generate observations and session summaries from LoCoMo conversations using `gpt-3.5-turbo` for evaluating RAG-based models
We provide the observations and summaries with our release of the LoCoMo dataset. Follow these instructions to regenerate them, or to generate them for a different set of conversations.
* Generate observations from all sessions:
```
bash scripts/generate_observations.sh
```
* Generate summary of each session:
```
bash scripts/generate_session_summaries.sh
```
**Note 3**: Session summaries are different from the event summaries of the event summarization task. The former summarize only a single session, whereas event summaries are specific to each speaker and contain causal and temporal connections across sessions.
### Evaluate retrieval-augmented `gpt-3.5-turbo` on the LoCoMo question-answering task using (a) dialogs, (b) observations and (c) session summaries as databases.
* Evaluate `gpt-3.5-turbo` using retrieval-based augmentation
```
bash scripts/evaluate_rag_gpts.sh
```
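
To make the idea of a retrieval database concrete, here is a toy sketch that ranks database entries (dialog turns, observations, or session summaries) against a question and keeps the top `k` as context for the model. The bag-of-words cosine scoring is a stand-in chosen for brevity and is not the retriever used by `scripts/evaluate_rag_gpts.sh`. The function `retrieve_top_k` and the sample entries are hypothetical.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def cosine(a, b):
    """Cosine similarity between two bag-of-words Counters."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


def retrieve_top_k(question, database, k=2):
    """Return up to k entries of `database` most similar to `question`."""
    q = Counter(tokenize(question))
    scored = [(cosine(q, Counter(tokenize(entry))), entry) for entry in database]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    # Drop entries that share no tokens with the question.
    return [entry for score, entry in scored[:k] if score > 0]


# Made-up entries standing in for observations or session summaries.
database = [
    "Angela opened a gallery in Chapel Hill.",
    "Maria adopted a dog last spring.",
    "Angela sold two oil paintings at the gallery.",
]
context = retrieve_top_k("What did Angela sell at the gallery?", database, k=2)
print(context)
```

Swapping the `database` list between dialog turns, observations, and session summaries is what distinguishes settings (a), (b), and (c) in this sketch.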
### Evaluate models on the event summarization task
Coming soon!
### Train and evaluate `MiniGPT-5` models on the multimodal dialog generation task
Coming soon!
### Reference
Please cite our paper if you use LoCoMo in your work:
```bibtex
@article{maharana2024evaluating,
title={Evaluating very long-term conversational memory of llm agents},
author={Maharana, Adyasha and Lee, Dong-Ho and Tulyakov, Sergey and Bansal, Mohit and Barbieri, Francesco and Fang, Yuwei},
journal={arXiv preprint arXiv:2402.17753},
year={2024}
}
```