Ecosyste.ms: Awesome
An open API service indexing awesome lists of open source software.
https://github.com/YuchuanTian/AIGC_text_detector
The official codes of our work on AIGC detection: "Multiscale Positive-Unlabeled Detection of AI-Generated Texts" (ICLR'24 Spotlight)
Last synced: about 1 month ago
- Host: GitHub
- URL: https://github.com/YuchuanTian/AIGC_text_detector
- Owner: YuchuanTian
- License: apache-2.0
- Created: 2023-06-02T03:49:26.000Z (over 1 year ago)
- Default Branch: main
- Last Pushed: 2023-12-28T16:32:08.000Z (12 months ago)
- Last Synced: 2024-08-03T09:06:46.408Z (5 months ago)
- Language: Python
- Homepage: https://arxiv.org/abs/2305.18149
- Size: 247 KB
- Stars: 88
- Watchers: 1
- Forks: 13
- Open Issues: 2
Metadata Files:
- Readme: README.md
- License: LICENSE
Awesome Lists containing this project
- StarryDivineSky - YuchuanTian/AIGC_text_detector - "Multiscale Positive-Unlabeled Detection of AI-Generated Texts" (ICLR'24 Spotlight) (A01_Text Generation_Text Dialogue / Large Language Dialogue Models and Data)
README
# Multiscale Positive-Unlabeled Detection of AI-Generated Texts
*Yuchuan Tian, Hanting Chen, Xutao Wang, Zheyuan Bai, Qinghua Zhang, Ruifeng Li, Chao Xu, Yunhe Wang*
The official codes of our paper "Multiscale Positive-Unlabeled Detection of AI-Generated Texts".
[[arXiv]](https://arxiv.org/abs/2305.18149)
Paper Link: https://arxiv.org/pdf/2305.18149.pdf
*BibTeX*-formatted citation:
```
@misc{tian2023multiscale,
title={Multiscale Positive-Unlabeled Detection of AI-Generated Texts},
author={Yuchuan Tian and Hanting Chen and Xutao Wang and Zheyuan Bai and Qinghua Zhang and Ruifeng Li and Chao Xu and Yunhe Wang},
year={2023},
eprint={2305.18149},
archivePrefix={arXiv},
primaryClass={cs.CL}
}
```

## Detector Models
We have open-sourced the detector models from the paper as follows.
Links for Detectors: [Google Drive](https://drive.google.com/drive/folders/1Q_78qoaAuO8HOtt-SawimiXhli6y0Uii?usp=drive_link) [Baidu Disk (PIN:1234)](https://pan.baidu.com/s/11hOpOxImAh1ZfDy9F5jC1Q)
We have also uploaded the detector models to HuggingFace, where easy-to-use **DEMOs** and online **APIs** are provided (a minimal usage sketch follows the table below).
| Variants | HC3-Full-En | HC3-Sent-En |
| ------------------------------------------------------------ | ------------------ | ------------------ |
| seed0 | 98.68 | 82.84 |
| seed1 [HuggingFace: en v1](https://huggingface.co/yuchuantian/AIGC_detector_env1) | 98.56 | 87.06 |
| seed2 | 97.97 | 86.02 |
| **Avg.** | **98.40$\pm$0.31** | **85.31$\pm$1.80** |
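As an illustration of the HuggingFace **APIs** mentioned above, below is a minimal sketch that loads the en v1 checkpoint (```yuchuantian/AIGC_detector_env1```) with the standard ```transformers``` text-classification pipeline. It assumes the checkpoint is a regular sequence-classification model; the label names in the output are whatever the checkpoint's config defines.

```python
# Minimal sketch: query a HuggingFace-hosted detector checkpoint.
# Assumes the checkpoint is a standard sequence-classification model;
# a locally downloaded checkpoint directory can be passed instead of the hub name.
from transformers import pipeline

detector = pipeline("text-classification", model="yuchuantian/AIGC_detector_env1")

text = "Paste the passage you want to check here."
# Truncate long passages to the model's maximum input length.
print(detector(text, truncation=True, max_length=512))
```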
## Stronger Detectors

We have also open-sourced detector models with strengthened training strategies. Specifically, we develop a strong Chinese detector ```AIGC_detector_zhv2```, which demonstrates similar performance to SOTA closed-source Chinese detectors on various texts, including news articles, poetry, essays, etc. The **DEMOs** and **APIs** are available on HuggingFace.
| Detector | Google Drive | Baidu Disk | HuggingFace Link |
| ------------------------- | ------------------------------------------------------------ | ------------------------------------------------------------ | ------------------------------------------------------------ |
| English, version 2 (env2) | [Google Drive](https://drive.google.com/drive/folders/11ROLjxopgx44YT9RS8RmchdxR6Yi-CJk?usp=drive_link) | [Baidu Disk (PIN:1234)](https://pan.baidu.com/s/11CQaWzsT7a-IgceOBpmb7g) | [en v2](https://huggingface.co/yuchuantian/AIGC_detector_env2) |
| Chinese, version 2 (zhv2) | [Google Drive](https://drive.google.com/drive/folders/1-a7n-T9Z1_EIWbvip2eC0ssx5rih8pQI?usp=drive_link) | [Baidu Disk (PIN:1234)](https://pan.baidu.com/s/1VPGYtswC1GJXESWzne4RPA) | [zh v2](https://huggingface.co/yuchuantian/AIGC_detector_zhv2) |
## About the Dataset

Here we provide the official link for the HC3 dataset: [Dataset Link](https://github.com/Hello-SimpleAI/chatgpt-comparison-detection/blob/main/HC3/README.md). We also provide identical copies of the dataset on [Google Drive](https://drive.google.com/drive/folders/10GsKfCWe_BHLdGnfhSV-_k0-PH31_oWn?usp=drive_link) and [Baidu Disk (PIN:1234)](https://pan.baidu.com/s/1OUJbPYbC2ZUAt50MFDdHrQ) for ease of use. We acknowledge the marvelous work by the HC3 authors.
#### Data Preprocessing
In Appendix B of [our paper](https://arxiv.org/pdf/2305.18149.pdf), we proposed the removal of redundant spaces in human texts of the HC3-English dataset. We have provided a helper function ```en_cleaning``` in ```corpus_cleaning_kit.py``` that takes a sentence string as input and returns a preprocessed sentence without redundant spaces.
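For reference, a minimal usage sketch of the helper is shown below; it only assumes the repository root is on the Python path and follows the signature described above (one sentence string in, one cleaned string out).

```python
# Minimal usage sketch of the cleaning helper shipped with this repository.
# Run from the repository root so the module can be imported directly.
from corpus_cleaning_kit import en_cleaning

raw = "This  answer  contains   redundant   spaces ."
clean = en_cleaning(raw)  # returns the sentence with redundant spaces removed
print(clean)
```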
Here we provide a cleaned version of HC3-English, in which all answers are cleaned (*i.e.*, redundant spaces are removed). However, please use the original version of HC3 for all experiments in our paper, as the cleaning procedure is already embedded in the training & validation scripts.
**CLEANED** HC3-English Link: [Google Drive](https://drive.google.com/drive/folders/11m9w7blNjUR2VE5N5AU7aOmj9YZhOyLy?usp=drive_link) [Baidu Disk (PIN:1234)](https://pan.baidu.com/s/1kKSiyj1Nv2me6mODZd0Y4A)
## Preparation
- Install the required packages:
```shell
pip install -r requirements.txt
```

- Download the datasets to the directory ```./data```
- Download the NLTK package ```punkt``` (this can be done via the ```nltk``` API: ```nltk.download('punkt')```)
- Download the pretrained models (this step is done automatically by ```transformers```; see the sketch after this list)
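The two download steps above can also be scripted ahead of time. The snippet below is a minimal sketch (not part of the repository) that caches the ```punkt``` tokenizer data and the ```roberta-base``` / ```bert-base-cased``` backbones used by the training commands in the next section.

```python
# Minimal sketch: pre-download NLTK tokenizer data and pretrained backbones
# into the local caches so later training runs need no network access.
import nltk
from transformers import AutoModelForSequenceClassification, AutoTokenizer

nltk.download("punkt")  # the tokenizer data referenced above

for name in ("roberta-base", "bert-base-cased"):
    AutoTokenizer.from_pretrained(name)
    AutoModelForSequenceClassification.from_pretrained(name)  # classification head is freshly initialized here
```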
Before running, the directory should contain the following files:
```
├── data
│ ├── unfilter_full
│ │ ├── en_test.csv
│ │ └── en_train.csv
│ └── unfilter_sent
│ ├── en_test.csv
│ └── en_train.csv
├── README.md
├── corpus_cleaning_kit.py
├── dataset.py
├── multiscale_kit.py
├── option.py
├── pu_loss_mod.py
├── prior_kit.py
├── requirements.txt
├── train.py
└── utils.py
```

## Training
The script for training is ```train.py```.
#### RoBERTa on HC3-English
Commands for seed=0,1,2:
```shell
CUDA_VISIBLE_DEVICES=0 python train.py --batch-size 32 --max-sequence-length 512 --train-data-file unfilter_full/en_train.csv --val-data-file unfilter_full/en_test.csv --model-name roberta-base --local-data data --lamb 0.4 --prior 0.2 --pu_type dual_softmax_dyn_dtrun --len_thres 55 --aug_min_length 1 --max-epochs 1 --weight-decay 0 --mode original_single --aug_mode sentence_deletion-0.25 --clean 1 --val_file1 unfilter_sent/en_test.csv --quick_val 1 --learning-rate 5e-05 --seed 0
CUDA_VISIBLE_DEVICES=0 python train.py --batch-size 32 --max-sequence-length 512 --train-data-file unfilter_full/en_train.csv --val-data-file unfilter_full/en_test.csv --model-name roberta-base --local-data data --lamb 0.4 --prior 0.2 --pu_type dual_softmax_dyn_dtrun --len_thres 55 --aug_min_length 1 --max-epochs 1 --weight-decay 0 --mode original_single --aug_mode sentence_deletion-0.25 --clean 1 --val_file1 unfilter_sent/en_test.csv --quick_val 1 --learning-rate 5e-05 --seed 1
CUDA_VISIBLE_DEVICES=0 python train.py --batch-size 32 --max-sequence-length 512 --train-data-file unfilter_full/en_train.csv --val-data-file unfilter_full/en_test.csv --model-name roberta-base --local-data data --lamb 0.4 --prior 0.2 --pu_type dual_softmax_dyn_dtrun --len_thres 55 --aug_min_length 1 --max-epochs 1 --weight-decay 0 --mode original_single --aug_mode sentence_deletion-0.25 --clean 1 --val_file1 unfilter_sent/en_test.csv --quick_val 1 --learning-rate 5e-05 --seed 2
```
#### BERT on HC3-English
Commands for seed=0,1,2:
```shell
CUDA_VISIBLE_DEVICES=0 python train.py --batch-size 32 --max-sequence-length 512 --train-data-file unfilter_full/en_train.csv --val-data-file unfilter_full/en_test.csv --model-name bert-base-cased --local-data data --lamb 0.5 --prior 0.3 --pu_type dual_softmax_dyn_dtrun --len_thres 60 --aug_min_length 1 --max-epochs 1 --weight-decay 0 --mode original_single --aug_mode sentence_deletion-0.25 --clean 1 --val_file1 unfilter_sent/en_test.csv --quick_val 1 --learning-rate 5e-05 --seed 0
CUDA_VISIBLE_DEVICES=0 python train.py --batch-size 32 --max-sequence-length 512 --train-data-file unfilter_full/en_train.csv --val-data-file unfilter_full/en_test.csv --model-name bert-base-cased --local-data data --lamb 0.5 --prior 0.3 --pu_type dual_softmax_dyn_dtrun --len_thres 60 --aug_min_length 1 --max-epochs 1 --weight-decay 0 --mode original_single --aug_mode sentence_deletion-0.25 --clean 1 --val_file1 unfilter_sent/en_test.csv --quick_val 1 --learning-rate 5e-05 --seed 1
CUDA_VISIBLE_DEVICES=0 python train.py --batch-size 32 --max-sequence-length 512 --train-data-file unfilter_full/en_train.csv --val-data-file unfilter_full/en_test.csv --model-name bert-base-cased --local-data data --lamb 0.5 --prior 0.3 --pu_type dual_softmax_dyn_dtrun --len_thres 60 --aug_min_length 1 --max-epochs 1 --weight-decay 0 --mode original_single --aug_mode sentence_deletion-0.25 --clean 1 --val_file1 unfilter_sent/en_test.csv --quick_val 1 --learning-rate 5e-05 --seed 2
```
## Acknowledgement
Our code refers to the following GitHub repo:
https://github.com/openai/gpt-2-output-dataset
We sincerely thank the authors for open-sourcing their code.