The official implementation of the EMNLP 2022 paper "How Large Language Models are Transforming Machine-Paraphrased Plagiarism".
https://github.com/jpwahle/emnlp22-transforming

# How Large Language Models Are Transforming Machine Paraphrase Generation

[![arXiv](https://img.shields.io/badge/arXiv-2210.03568-b31b1b.svg)](https://arxiv.org/abs/2210.03568)
[![HuggingFace Dataset](https://img.shields.io/badge/🤗-Datasets-ffce1c.svg)](https://huggingface.co/datasets/jpwahle/autoregressive-paraphrase-dataset)

## Quick Start

### Install

```bash
poetry install
```

### Run

To generate paraphrases using T5, run the following command:

> Note: T5 benefits from more few-shot examples because it actually performs some gradient steps. However, to keep it comparable to GPT-3, we don't recommend exceeding 50 examples.

```bash
poetry run python -m paraphrase.generate --model_name t5 --num_prompts 4 --num_examples 32
```
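To illustrate the few-shot budget mentioned in the note, here is a minimal sketch of drawing a reproducible subset of (original, paraphrase) pairs. The function name, the pair format, and the choice to treat the recommended 50-example ceiling as a hard limit are assumptions made for illustration; this is not how the repository selects its examples.

```python
import random
from typing import List, Sequence, Tuple

# Ceiling taken from the note above; enforcing it strictly is a choice of this sketch.
MAX_EXAMPLES = 50


def select_few_shot_examples(
    pairs: Sequence[Tuple[str, str]],
    num_examples: int,
    seed: int = 0,
) -> List[Tuple[str, str]]:
    """Return `num_examples` (original, paraphrase) pairs, sampled without replacement."""
    if num_examples > MAX_EXAMPLES:
        raise ValueError(f"num_examples should not exceed {MAX_EXAMPLES}")
    if num_examples > len(pairs):
        raise ValueError("not enough pairs to sample from")
    # A dedicated Random instance keeps the selection reproducible for a given seed.
    rng = random.Random(seed)
    return rng.sample(list(pairs), num_examples)
```

Using the same seed for every model keeps the few-shot sets identical, which supports the comparability the note asks for.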

To generate paraphrases using GPT-3, run the following command:

> Warning: Using GPT-3 requires a paid OpenAI account and can quickly run up a bill if you don't have credits.
> Reducing the number of prompts and/or the number of examples can help reduce costs.

```bash
OPENAI_API_KEY={YOUR_KEY} poetry run python -m paraphrase.generate --model_name gpt3 --num_prompts 4 --num_examples 32
```
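If you prefer to launch generation from a script, the following sketch wraps the command in Python and refuses to start a GPT-3 run when `OPENAI_API_KEY` is missing. The helper names are hypothetical. The flags mirror the ones used in this README, and the `-m` module invocation follows the `--help` command; nothing here is part of the repository itself.

```python
import os
import subprocess
from typing import List, Mapping, Optional


def build_generate_command(model_name: str, num_prompts: int, num_examples: int) -> List[str]:
    """Assemble the CLI call as an argument list (flags as shown in this README)."""
    return [
        "poetry", "run", "python", "-m", "paraphrase.generate",
        "--model_name", model_name,
        "--num_prompts", str(num_prompts),
        "--num_examples", str(num_examples),
    ]


def run_generation(
    model_name: str,
    num_prompts: int,
    num_examples: int,
    env: Optional[Mapping[str, str]] = None,
) -> int:
    """Run the generation command and fail early if GPT-3 is requested without a key."""
    run_env = dict(os.environ if env is None else env)
    if model_name == "gpt3" and not run_env.get("OPENAI_API_KEY"):
        raise RuntimeError("OPENAI_API_KEY must be set before using GPT-3")
    command = build_generate_command(model_name, num_prompts, num_examples)
    # check=True raises CalledProcessError when the command exits with a non-zero status.
    return subprocess.run(command, env=run_env, check=True).returncode
```

Starting with small values for `num_prompts` and `num_examples` is a cheap way to confirm the setup before committing to a larger, billable run.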

For help, run the following command:

```bash
poetry run python -m paraphrase.generate --help
```

## Dataset

The dataset generated for our study is available on [🤗 Hugging Face Datasets](https://huggingface.co/datasets/jpwahle/autoregressive-paraphrase-dataset).
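As a minimal sketch, the dataset can presumably be loaded with the Hugging Face `datasets` library (`pip install datasets`). The helper name below is hypothetical, and no assumption is made about split or column names. If the dataset defines several configurations, pass one via `name`.

```python
from typing import Optional

# Dataset identifier taken from the link above.
DATASET_ID = "jpwahle/autoregressive-paraphrase-dataset"


def load_paraphrase_dataset(name: Optional[str] = None):
    """Download (or reuse the cached copy of) the dataset from the Hugging Face Hub."""
    # Imported lazily so this module can be imported without `datasets` installed.
    from datasets import load_dataset

    return load_dataset(DATASET_ID, name)
```

Calling `print(load_paraphrase_dataset())` lists the available splits and their columns, which is a good first step before writing any processing code.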

## Detection

For the detection code, please refer to this [repository](https://github.com/jpwahle/iconf22-paraphrase) and [paper](https://link.springer.com/chapter/10.1007/978-3-030-96957-8_34).

For all models except GPT-3 and T5, we used the versions trained on MPC. For PlagScan, we embedded the text in the same way as in the paper above.

## Citation
```bib
@inproceedings{wahle-etal-2022-large,
title = "How Large Language Models are Transforming Machine-Paraphrase Plagiarism",
author = "Wahle, Jan Philip and
Ruas, Terry and
Kirstein, Frederic and
Gipp, Bela",
booktitle = "Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing",
month = dec,
year = "2022",
address = "Abu Dhabi, United Arab Emirates",
publisher = "Association for Computational Linguistics",
url = "https://aclanthology.org/2022.emnlp-main.62",
pages = "952--963",
abstract = "The recent success of large language models for text generation poses a severe threat to academic integrity, as plagiarists can generate realistic paraphrases indistinguishable from original work. However, the role of large autoregressive models in generating machine-paraphrased plagiarism and their detection is still incipient in the literature. This work explores T5 and GPT3 for machine-paraphrase generation on scientific articles from arXiv, student theses, and Wikipedia. We evaluate the detection performance of six automated solutions and one commercial plagiarism detection software and perform a human study with 105 participants regarding their detection performance and the quality of generated examples. Our results suggest that large language models can rewrite text humans have difficulty identifying as machine-paraphrased (53{\%} mean acc.). Human experts rate the quality of paraphrases generated by GPT-3 as high as original texts (clarity 4.0/5, fluency 4.2/5, coherence 3.8/5). The best-performing detection model (GPT-3) achieves 66{\%} F1-score in detecting paraphrases. We make our code, data, and findings publicly available to facilitate the development of detection solutions.",
}
```
## License
This repository is licensed under the Apache License 2.0 - see the [LICENSE](LICENSE) file for details.
Use the code for any of your research projects, but be nice and give credit where credit is due.
Any illegal use for plagiarism or other purposes is prohibited.