# Learning Universal Authorship Representations

This is the official repository for the EMNLP 2021 paper ["Learning Universal Authorship Representations"](https://aclanthology.org/2021.emnlp-main.70/). The paper studies whether authorship representations learned in one domain transfer to another. To do so, we conduct the first large-scale study of cross-domain transfer for authorship verification, considering zero-shot transfers involving three disparate domains: Amazon reviews, fanfiction short stories, and Reddit comments.

## HuggingFace
LUAR model variations are now available on HuggingFace! They can be found [here](https://huggingface.co/collections/rrivera1849/luar-65133328387d403b2e6f33a2).

## Installation
Run the following commands to create an environment and install all the required packages:
```bash
python3 -m venv vluar
. ./vluar/bin/activate
pip3 install -U pip
pip3 install -r requirements.txt
```

## Downloading the Data and Pre-trained Weights

Once you've set up the environment, execute the following commands to download the SBERT pre-trained weights and to download and preprocess the data:

### Pre-trained Weights

Follow the instructions [here](https://git-lfs.github.com) to install git lfs, then run:

```bash
./scripts/download_sbert_weights.sh
```

### Reddit

Reddit has changed its [Data API terms](https://www.redditinc.com/policies/data-api-terms) to disallow the use of user data to train machine-learning models unless the original poster explicitly grants permission. As such, we only provide the comment identifiers of the posts used to train our models:

|               Dataset Name              |                                     Download Link                                     |
|:---------------------------------------:|:-------------------------------------------------------------------------------------:|
| [IUR](https://arxiv.org/abs/1910.04979) | https://cs.jhu.edu/~noa/data/reddit.tar.gz                                            |
| [MUD](https://arxiv.org/abs/2105.07263) | https://drive.google.com/file/d/16YgK62cpe0NC7zBvSF_JxosOozG-wxou/view?usp=drive_link |

### Amazon

The Amazon data must be requested from [here](https://nijianmo.github.io/amazon/index.html#files) (the "raw review data" (34 GB) dataset). Once the data has been downloaded, place the files under `./data/raw_amazon` and run the following command to pre-process the data:

```bash
./scripts/preprocess_amazon_data.sh
```

### Fanfiction

The fanfiction data must be requested from [here](https://zenodo.org/record/3724096#.YT942y1h1pQ). Once the data has been downloaded, place the `data.jsonl` and `truth.jsonl` files from the large dataset under `./data/pan_paragraph`.
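As a quick sanity check after placing the files, the two JSONL files can be joined on their shared `id` field. The sketch below is illustrative only and is not part of this repository's scripts; the field names `pair`, `same`, and `authors` reflect the PAN 2020 release and are assumptions here:

```python
import json

def load_jsonl(path):
    """Read a JSON Lines file: one JSON object per non-empty line."""
    with open(path, encoding="utf-8") as fh:
        return [json.loads(line) for line in fh if line.strip()]

def join_pairs(data_path, truth_path):
    """Join documents with their ground-truth labels on the shared 'id' field."""
    truth_by_id = {rec["id"]: rec for rec in load_jsonl(truth_path)}
    joined = []
    for doc in load_jsonl(data_path):
        label = truth_by_id.get(doc["id"])
        if label is not None:
            joined.append({"id": doc["id"], "pair": doc["pair"], "same": label["same"]})
    return joined
```

Records present in `data.jsonl` but missing from `truth.jsonl` are simply skipped, which makes the join safe to run on partial downloads.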
Then, run the following command to pre-process the data:

```bash
./scripts/preprocess_fanfiction_data.sh
```

## Path Configuration
The application paths can be changed by modifying the variables in `file_config.ini`:
- **output_path**: Where the experiment results and model checkpoints will be saved. (Default: `./output`)
- **data_path**: Where the datasets should be stored. (Default: `./data`)
- **transformer_path**: Where the pretrained SBERT weights should be stored. (Default: `./pretrained_weights`)

We strongly encourage you to set your own paths.

## Reproducing Results

The commands for reproducing each table of results in the paper are found under `./scripts/reproduce/table_N.sh`.

## Training

The commands to train the SBERT model are shown below. There are two types of training: single-domain and multi-domain. In short, single-domain models are trained on one dataset, while multi-domain models are trained on two datasets.

The dataset names available for training are:
* iur_dataset - The Reddit dataset from [here](https://aclanthology.org/D19-1178/).
* raw_all - The Reddit Million User Dataset (MUD).
* raw_amazon - The Amazon Reviews dataset.
* pan_paragraph - The PAN Short Stories dataset.

## Training Single-Domain Models

#### Reddit Comments
```bash
python main.py --dataset_name raw_all --do_learn --validate --gpus 4 --experiment_id reddit_model
```
#### Amazon Reviews
```bash
python main.py --dataset_name raw_amazon --do_learn --validate --experiment_id amazon_model
```
#### Fanfiction Stories
```bash
python main.py --dataset_name pan_paragraph --do_learn --validate --experiment_id fanfic_model
```

## Training Multi-Domain Models

#### Reddit Comments + Amazon Reviews
```bash
python main.py --dataset_name raw_all+raw_amazon --do_learn --validate --gpus 4 --experiment_id reddit_amazon_model
```
#### Amazon Reviews + Fanfiction Stories
```bash
python main.py --dataset_name raw_amazon+pan_paragraph --do_learn --validate --gpus 4 --experiment_id amazon_stories_model
```
#### Reddit Comments + Fanfiction Stories
```bash
python main.py --dataset_name raw_all+pan_paragraph --do_learn --validate --gpus 4 --experiment_id reddit_stories_model
```

## Evaluating
The commands to evaluate on each dataset are shown below. Replace `<experiment_id>` with the experiment identifier that was used during training. For example, if you followed the single-domain training commands shown above, valid experiment identifiers would be: reddit_model, amazon_model, and fanfic_model.

### Reddit Comments
```bash
python main.py --dataset_name raw_all --evaluate --experiment_id <experiment_id> --load_checkpoint
```

### Amazon Reviews
```bash
python main.py --dataset_name raw_amazon --evaluate --experiment_id <experiment_id> --load_checkpoint
```

### Fanfiction Stories
```bash
python main.py --dataset_name pan_paragraph --evaluate --experiment_id <experiment_id> --load_checkpoint
```

## Contributing

To contribute to LUAR, just send us a [pull request](https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/proposing-changes-to-your-work-with-pull-requests/about-pull-requests).
When sending a request, make `main` the destination branch on the LUAR repository.

## Citation

If you use our code base in your work, please consider citing:

```
@inproceedings{uar-emnlp2021,
  author    = {Rafael A. Rivera Soto and Olivia Miano and Juanita Ordonez and Barry Chen and Aleem Khan and Marcus Bishop and Nicholas Andrews},
  title     = {Learning Universal Authorship Representations},
  booktitle = {EMNLP},
  year      = {2021},
}
```

## Contact

For questions about our paper or code, please contact [Rafael A. Rivera Soto](mailto:riverasoto1@llnl.gov).

## Acknowledgements

Here's a list of the people who have contributed to this work:
- [Olivia Miano](https://github.com/omiano)
- [Juanita Ordonez](https://github.com/hot-cheeto)
- Barry Chen
- [Aleem Khan](https://aleemkhan62.github.io/)
- [Nicholas Andrews](https://www.cs.jhu.edu/~noa/)
- Marcus Bishop

## License

LUAR is distributed under the terms of the Apache License (Version 2.0).

All new contributions must be made under the Apache-2.0 license.

See LICENSE and NOTICE for details.

SPDX-License-Identifier: Apache-2.0

LLNL-CODE-844702
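The evaluation described above ultimately compares authors by the similarity of their learned embeddings. As a self-contained illustration of that scoring step only (pure Python; the vectors below are toy stand-ins, not LUAR model output):

```python
import math

def cosine_similarity(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def rank_candidates(query, candidates):
    """Rank candidate author embeddings by descending similarity to a query embedding."""
    scored = [(name, cosine_similarity(query, emb)) for name, emb in candidates.items()]
    return sorted(scored, key=lambda item: item[1], reverse=True)
```

In the verification setting, a threshold on this score decides whether two document sets share an author; in the ranking setting, the sorted list above is scored with retrieval metrics.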