{"id":25215627,"url":"https://github.com/deezer/llms_quotation_attribution","last_synced_at":"2025-07-09T20:36:26.896Z","repository":{"id":273914068,"uuid":"921111697","full_name":"deezer/llms_quotation_attribution","owner":"deezer","description":"Code and data used in the paper \"Evaluating LLMs for Quotation Attribution in Literary Texts: A Case Study of LLaMa3\" (NAACL 2025), co-authored by Gaspard Michel, Elena Epure, Romain Hennequin and Christophe Cerisara. ","archived":false,"fork":false,"pushed_at":"2025-01-23T17:43:32.000Z","size":74960,"stargazers_count":0,"open_issues_count":0,"forks_count":1,"subscribers_count":3,"default_branch":"main","last_synced_at":"2025-04-05T08:43:26.383Z","etag":null,"topics":[],"latest_commit_sha":null,"homepage":"","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":null,"status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/deezer.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":null,"code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2025-01-23T11:05:00.000Z","updated_at":"2025-01-23T17:43:36.000Z","dependencies_parsed_at":"2025-01-23T18:46:45.744Z","dependency_job_id":null,"html_url":"https://github.com/deezer/llms_quotation_attribution","commit_stats":null,"previous_names":["deezer/llms_quotation_attribution"],"tags_count":0,"template":false,"template_full_name":null,"purl":"pkg:github/deezer/llms_quotation_attribution","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/deezer%2Fllms_quotation_attribution","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/deezer%2Fllms_quotation_attribution/tags","releases_url":"https://repos.ecosyste.ms/api/v1/host
s/GitHub/repositories/deezer%2Fllms_quotation_attribution/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/deezer%2Fllms_quotation_attribution/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/deezer","download_url":"https://codeload.github.com/deezer/llms_quotation_attribution/tar.gz/refs/heads/main","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/deezer%2Fllms_quotation_attribution/sbom","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":264502992,"owners_count":23618674,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":[],"created_at":"2025-02-10T18:15:11.692Z","updated_at":"2025-07-09T20:36:21.884Z","avatar_url":"https://github.com/deezer.png","language":"Python","funding_links":[],"categories":[],"sub_categories":[],"readme":"This is the official repository for the paper [*Evaluating LLMs for Quotation Attribution in Literary Texts: A Case Study of LLaMa3*](https://arxiv.org/abs/2406.11380), G. Michel, E. V. Epure, R. Hennequin and C. Cerisara (NAACL 2025).\n\nIn this paper, we present experiments where we prompt Llama-3 8b and 70b in a zero-shot setting to attribute quotes in literary texts. 
We then evaluate the impact of memorized information (book memorization \u0026 data contamination) on the downstream performance.\n\nThis repository contains scripts to reproduce our main experiments.\n\n# Prompts\n\nYou can find the prompts used in our experiments in the folder `prompts/`.\n\n# Installation\n\nRun the following commands to create an environment and install all the required packages:\n\n```bash\npython3 -m venv llms_for_qa\n. ./llms_for_qa/bin/activate\npip3 install -U pip\npip3 install -r requirements.txt\n```\n\n# Data \n\nWe provide all data used for our experiments in the `/data` folder.\nIn particular, the annotations and full text of the recently published novel, Dark Corners, that we annotated manually are available in the folder `unseen_source/`. We followed the PDNC guidelines, but did not annotate mentions of entities within quotes.\n\nThe folder `pdnc_source` contains data taken from [this repository](https://github.com/Priya22/speaker-attribution-acl2023/tree/main/data/pdnc_source).\nWe aligned the annotations from the [official PDNC GitHub repository](https://github.com/Priya22/project-dialogism-novel-corpus/tree/master) to match the annotations in `pdnc_source` for the 6 novels of the second release of PDNC and Dark Corners in `test_pdnc_source`.\n\nThe file `novel_metadata.json` contains the book title and author for each novel in our dataset. \n\nIn each novel folder, we provide the raw text, quotation annotations, character list, and the passages used for testing name-cloze (in `$novel/$novel.name_cloze.txt`).\n\nWe also ran BookNLP on each novel, giving two additional files: `novel.entities`, containing the results of NER and coreference resolution, and `novel.tokens`, containing each token processed by BookNLP. 
\n\nFor our Corrupted-Speaker-Guessing experiments, we manually annotated name replacements that can be found in the file `id2replacements.pkl`.\n\nThe files *seen.all.strided.1024.pdnc.4096.right0.json*, *test.all.strided.1024.pdnc.4096.right0.json* and *unseen.all.strided.1024.pdnc.4096.right0.json* contain the processed chunks for PDNC_1, PDNC_2 and Dark Corners (the Unseen Novel) respectively.\n\n# Scripts\n\nWe provide bash scripts to reproduce each of our experiments: \n\n`attribution.sh` for Section 4.\n\n`memorization.sh` for Section 5 - Book Memorization.\n\n`contamination.sh` for Section 5 - Annotation Contamination.\n\nBefore running an experiment, [make sure you have access to Llama-3](https://huggingface.co/meta-llama/Meta-Llama-3-8B), and insert your Hugging Face token by modifying the value of `HGFACE_TOKEN` in the file `token_hgface.py`.\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fdeezer%2Fllms_quotation_attribution","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fdeezer%2Fllms_quotation_attribution","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fdeezer%2Fllms_quotation_attribution/lists"}