https://github.com/hello-simpleai/chatgpt-comparison-detection

Human ChatGPT Comparison Corpus (HC3), Detectors, and more! 🔥

ai chatbot chatgpt dataset deep-learning gpt-3 gpt2 gpt3 machine-learning ml nlp openai python text-classification

Host: GitHub
URL: https://github.com/hello-simpleai/chatgpt-comparison-detection
Owner: Hello-SimpleAI
Created: 2023-01-07T07:21:24.000Z (over 2 years ago)
Default Branch: main
Last Pushed: 2023-12-01T16:03:51.000Z (over 1 year ago)
Last Synced: 2024-10-27T21:53:04.019Z (8 months ago)
Topics: ai, chatbot, chatgpt, dataset, deep-learning, gpt-3, gpt2, gpt3, machine-learning, ml, nlp, openai, python, text-classification
Language: Python
Homepage: https://arxiv.org/abs/2301.07597
Size: 53.7 KB
Stars: 1,254
Watchers: 26
Forks: 120
Open Issues: 27
Metadata Files:
- Readme: README.md

Awesome Lists containing this project

awesome-ChatGPT-repositories - chatgpt-comparison-detection - Human ChatGPT Comparison Corpus (HC3), Detectors, and more! 🔥 (Chatbots)

README

# ChatGPT-Comparison-Detection Project 🔬

![](https://img.shields.io/badge/Languages-%20English%2C%20Chinese-brightgreen) 

![](https://img.shields.io/badge/ChatGPT-Corpus%2C%20Detector-blue)

Official repository of the paper ["How Close is ChatGPT to Human Experts? Comparison Corpus, Evaluation, and Detection"](https://arxiv.org/abs/2301.07597). Please star, watch, and fork our repo to follow active updates!

See also: [📢 Feedback Space for Detectors](https://github.com/Hello-SimpleAI/chatgpt-comparison-detection/discussions/2). Please feel free to leave your feedback there!



---

### Human ChatGPT Comparison Corpus (HC3)

We propose the first **Human vs. ChatGPT** comparison corpus, named **HC3**.



The first version of the HC3 datasets is now available on 🤗 Hugging Face Datasets:

- [HC3-English](https://huggingface.co/datasets/Hello-SimpleAI/HC3)

- [HC3-Chinese](https://huggingface.co/datasets/Hello-SimpleAI/HC3-Chinese)

For the Chinese community, the HC3 datasets are also available on ModelScope:

- [HC3-English](https://www.modelscope.cn/datasets/simpleai/HC3)

- [HC3-Chinese](https://www.modelscope.cn/datasets/simpleai/HC3-Chinese)

> For the train/test splits and the filtered versions used in the paper, refer to the Google Drive links in [HC3/README.md](HC3/README.md).
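
As a minimal sketch of how the corpus might be prepared for detector training, the snippet below flattens HC3-style records into labeled rows. The field names `question`, `human_answers`, and `chatgpt_answers` are assumptions about the record layout, not something stated in this README, and `flatten_records` is a hypothetical helper, so check both against the dataset card before relying on them.

```python
from typing import Dict, Iterable, List, Tuple

# Label convention chosen for this sketch only: 0 = human, 1 = ChatGPT.
HUMAN, CHATGPT = 0, 1


def flatten_records(records: Iterable[Dict]) -> List[Tuple[str, str, int]]:
    """Turn each record into (question, answer, label) rows.

    Assumes every record has a `question` string plus `human_answers`
    and `chatgpt_answers` lists (assumed field names, not confirmed here).
    """
    rows = []
    for record in records:
        question = record["question"]
        for answer in record.get("human_answers", []):
            rows.append((question, answer, HUMAN))
        for answer in record.get("chatgpt_answers", []):
            rows.append((question, answer, CHATGPT))
    return rows


# Tiny hand-written record standing in for real data.
sample = [
    {
        "question": "Why is the sky blue?",
        "human_answers": ["Air scatters blue light more than red light."],
        "chatgpt_answers": ["The sky appears blue because of Rayleigh scattering."],
    }
]

rows = flatten_records(sample)
for question, answer, label in rows:
    print(label, answer)
```

Real records downloaded from the links above could be passed to `flatten_records` in place of `sample`, provided their fields match the assumed names.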

### Dataset Copyright

If a source dataset used in this corpus has a specific license that is stricter than CC-BY-SA, our derived data follows that license.

Otherwise, it follows the CC-BY-SA license.

| English Split | Source | Source License | Note |
|---------------|--------|----------------|------|
| reddit_eli5 | [ELI5](https://github.com/facebookresearch/ELI5) | BSD License | |
| open_qa | [WikiQA](https://www.microsoft.com/en-us/download/details.aspx?id=52419) | [PWC Custom](https://paperswithcode.com/datasets/license) | |
| wiki_csai | Wikipedia | CC-BY-SA | [Wiki FAQ](https://en.wikipedia.org/wiki/Wikipedia:FAQ/Copyright) |
| medicine | [Medical Dialog](https://github.com/UCSD-AI4H/Medical-Dialogue-System) | Unknown | [Asking](https://github.com/UCSD-AI4H/Medical-Dialogue-System/issues/10) |
| finance | [FiQA](https://paperswithcode.com/dataset/fiqa-1) | Unknown | Asking by 📧 |

| Chinese Split | Source | Source License | Note |
|---------------|--------|----------------|------|
| open_qa | [WebTextQA & BaikeQA](https://github.com/brightmart/nlp_chinese_corpus) | MIT license | |
| baike | Baidu Baike | None | |
| nlpcc_dbqa | [NLPCC-DBQA](https://github.com/msra-nlc/ChineseDBQA) | Unknown | [Asking](https://github.com/UCSD-AI4H/Medical-Dialogue-System/issues/10) |
| medicine | [Chinese Medical Dialogue](https://tianchi.aliyun.com/dataset/90163) | CC-BY-NC 4.0 | |
| finance | [FinanceZhidao](https://www.heywhale.com/mw/dataset/5e9588f8e7ec38002d0331b1/content) | CC-BY 4.0 | |
| psychology | [On Baidu AI Studio](https://aistudio.baidu.com/aistudio/datasetdetail/38489) | CC0 | |
| law | [LegalQA](https://github.com/siatnlp/LegalQA) | Unknown | [Asking](https://github.com/siatnlp/LegalQA/issues/2) |

---

### ChatGPT detectors

![image](https://user-images.githubusercontent.com/37113676/211677236-d7c028f5-b9a5-4d88-baee-8b86dc942ff7.png)

(Hosted on 🤗 Hugging Face Spaces)

We provide three kinds of detectors, all bilingual (English and Chinese):

- [QA version](https://huggingface.co/spaces/Hello-SimpleAI/chatgpt-detector-qa): detects whether an **answer** to a given **question** was generated by ChatGPT, using classifiers based on pretrained language models (PLMs);

- [Single-text version](https://huggingface.co/spaces/Hello-SimpleAI/chatgpt-detector-single): detects whether a **single piece of text** was generated by ChatGPT, using PLM-based classifiers;

- [Linguistic version](https://huggingface.co/spaces/Hello-SimpleAI/chatgpt-detector-ling): detects whether a **single piece of text** was generated by ChatGPT, using a model built on linguistic features.

For the Chinese community, all three detector versions are also available on ModelScope:

- [QA version](https://www.modelscope.cn/studios/simpleai/chatgpt-detector-qa)

- [Single-text version](https://www.modelscope.cn/studios/simpleai/chatgpt-detector-single)

- [Linguistic version](https://www.modelscope.cn/studios/simpleai/chatgpt-detector-ling)

The model weights are all available at 🤗 Hugging Face Models:

| Model Checkpoints | Comment |
|-------------------|---------|
| [chatgpt-detector-roberta](https://huggingface.co/Hello-SimpleAI/chatgpt-detector-roberta) | To detect a single piece of text |
| [chatgpt-qa-detector-roberta](https://huggingface.co/Hello-SimpleAI/chatgpt-qa-detector-roberta) | To detect a question-answer pair |
| [chatgpt-detector-roberta-chinese](https://huggingface.co/Hello-SimpleAI/chatgpt-detector-roberta-chinese) | To detect a single piece of text (Chinese version) |
| [chatgpt-qa-detector-roberta-chinese](https://huggingface.co/Hello-SimpleAI/chatgpt-qa-detector-roberta-chinese) | To detect a question-answer pair (Chinese version) |

The English models are based on [roberta-base](https://huggingface.co/roberta-base).

The Chinese models are based on [hfl/chinese-roberta-wwm-ext](https://huggingface.co/hfl/chinese-roberta-wwm-ext).
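
The checkpoints listed above can presumably be loaded with the `transformers` text-classification pipeline; the following is a hedged sketch rather than the authors' own inference code. The helper names (`build_detector`, `qa_input`, `is_chatgpt`, `demo`) are hypothetical, and two details are assumptions to verify against the model cards: that the machine-generated class is labeled `"ChatGPT"`, and that the QA checkpoint expects the question as the first segment and the answer as the second.

```python
from typing import Dict

# Checkpoint names taken from the table of model weights.
SINGLE_MODEL = "Hello-SimpleAI/chatgpt-detector-roberta"
QA_MODEL = "Hello-SimpleAI/chatgpt-qa-detector-roberta"


def build_detector(model_id: str = SINGLE_MODEL):
    """Create a text-classification pipeline; weights download on first use."""
    # Imported lazily so the helpers below work without transformers installed.
    from transformers import pipeline
    return pipeline("text-classification", model=model_id)


def qa_input(question: str, answer: str) -> Dict[str, str]:
    """Package a question-answer pair for the QA checkpoint.

    Assumption: the question is the first segment and the answer the second.
    """
    return {"text": question, "text_pair": answer}


def is_chatgpt(prediction: Dict[str, object], chatgpt_label: str = "ChatGPT") -> bool:
    """Interpret one prediction dict such as {"label": ..., "score": ...}.

    Assumption: the checkpoint names its machine-generated class "ChatGPT".
    """
    return prediction["label"] == chatgpt_label


def demo() -> None:
    """Not called automatically: running it downloads the model weights."""
    detector = build_detector(SINGLE_MODEL)
    # truncation=True keeps long inputs within the model's maximum length.
    prediction = detector("Text to check goes here.", truncation=True)[0]
    print(prediction, is_chatgpt(prediction))
```

Calling `demo()` requires `transformers` plus a backend such as PyTorch. For the QA checkpoint, one would build the pipeline with `QA_MODEL` and pass it the dictionary returned by `qa_input`.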

---

### Important Dates

| Events | Dates |
|--------|-------|
| Project Launch | 2022-12-09 ✅ |
| Comparison Data Collection | 2022-12-11 to Now 🏎️ |
| Release ChatGPT Detector (Demo) | 2023-01-11 ✅ |
| Models Release | 2023-01-18 ✅ |
| Comparison Corpus Release | 2023-01-18 ✅ |
| Research Paper | 2023-01-19 ✅ |
| ... | ... |

---

### Citation

Check out the paper: [arXiv:2301.07597](https://arxiv.org/abs/2301.07597)

```
@article{guo-etal-2023-hc3,
    title = "How Close is ChatGPT to Human Experts? Comparison Corpus, Evaluation, and Detection",
    author = "Guo, Biyang  and
      Zhang, Xin  and
      Wang, Ziyuan  and
      Jiang, Minqi  and
      Nie, Jinran  and
      Ding, Yuxuan  and
      Yue, Jianwei  and
      Wu, Yupeng",
    journal = "arXiv preprint arXiv:2301.07597",
    year = "2023",
}
```

---

### Our Story...

On December 9, 2022, the tenth day since the launch of [ChatGPT](https://openai.com/blog/chatgpt/), we started this project with two goals:

1. To create **open-source models** for efficiently detecting ChatGPT-generated content;

2. To collect a valuable **human-ChatGPT comparison Q&A corpus** in English and Chinese, to facilitate related research.

You are welcome to follow our project! We have released a preview of our ChatGPT detectors, and the **models and dataset will be open-sourced** in about a week. We look forward to feedback from the community to help improve the models and to contribute to **open** academic research together :)

### About Us

We are a group of insignificant researchers (in the shadow of ChatGPT) hoping to do some significant work for the community. The team for this project consists of PhD students and engineers from 6 universities/companies.

|   |   |   |   |
|:-:|:-:|:-:|:-:|
| [Biyang Guo](https://github.com/beyondguo) | [Minqi Jiang](https://github.com/Minqi824) | [Ziyuan Wang](https://github.com/SUFEHeisenberg) | [Xin Zhang](https://github.com/izhx) |
| [Jinran Nie](https://github.com/NJRBarry) | [Yuxuan Ding](https://github.com/yxding95) | [Jianwei Yue](https://github.com/TurquoiseA) | [Yupeng Wu](https://github.com/realRoc) |
