{"id":22066373,"url":"https://github.com/jaketae/lm-identifier","last_synced_at":"2025-05-13T01:54:48.970Z","repository":{"id":100151039,"uuid":"585189162","full_name":"jaketae/lm-identifier","owner":"jaketae","description":"A toolkit for identifying pretrained language models from potentially AI-generated text","archived":false,"fork":false,"pushed_at":"2023-01-05T14:23:26.000Z","size":18,"stargazers_count":9,"open_issues_count":0,"forks_count":0,"subscribers_count":2,"default_branch":"main","last_synced_at":"2025-05-13T01:54:43.907Z","etag":null,"topics":["chatgpt","gpt","identification","natural-language-processing","pytorch","transformers"],"latest_commit_sha":null,"homepage":"","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/jaketae.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2023-01-04T14:44:27.000Z","updated_at":"2023-04-20T22:19:28.000Z","dependencies_parsed_at":"2023-04-07T20:46:48.427Z","dependency_job_id":null,"html_url":"https://github.com/jaketae/lm-identifier","commit_stats":null,"previous_names":[],"tags_count":0,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/jaketae%2Flm-identifier","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/jaketae%2Flm-identifier/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/jaketae%2Flm-identifier/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/jaketae%2Flm-identifier/manifests","owner_url":"https://re
pos.ecosyste.ms/api/v1/hosts/GitHub/owners/jaketae","download_url":"https://codeload.github.com/jaketae/lm-identifier/tar.gz/refs/heads/main","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":253856639,"owners_count":21974577,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["chatgpt","gpt","identification","natural-language-processing","pytorch","transformers"],"created_at":"2024-11-30T19:27:54.966Z","updated_at":"2025-05-13T01:54:48.961Z","avatar_url":"https://github.com/jaketae.png","language":"Python","funding_links":[],"categories":[],"sub_categories":[],"readme":"# LM Identifier\n\n[![PyPI](https://img.shields.io/pypi/v/lm-identifier.svg)](https://pypi.org/project/lm-identifier/)\n[![Code style: black](https://img.shields.io/badge/code%20style-black-000000.svg)](https://github.com/psf/black)\n[![License: MIT](https://img.shields.io/badge/License-MIT-yellow.svg)](https://opensource.org/licenses/MIT)\n\nWith a surge of generative pretrained language models, it is becoming increasingly important to distinguish between human and AI-generated text. 
Inspired by [GPTZero](https://etedward-gptzero-main-zqgfwb.streamlit.app), an app that seeks to detect AI-generated text, LM Identifier takes this question a step further by providing a growing suite of tools to help identify *which (publicly available) language model* might have been used to generate a given chunk of text.\n\n## Installation\n\nLM Identifier is available on PyPI.\n\n```\n$ pip install lm-identifier\n```\n\nTo develop locally, first install pre-commit:\n\n```\n$ pip install --upgrade pip wheel\n$ pip install pre-commit\n$ pre-commit install\n```\n\nThen install the package in editable mode.\n\n```\n$ pip install -e .\n```\n\n## Usage\n\n**Disclaimer:** This package is under heavy development. The API and associated functionality may undergo substantial changes.\n\n### 1. Perplexity-based Ranking\n\nPerplexity is a common metric used in natural language generation to measure the performance of an LM. Roughly speaking, perplexity is the exponential of the average per-token cross entropy. The bottom line is that lower perplexity indicates a higher probability that the text was generated by that model.\n\nLM Identifier provides a perplexity-based ranking function as shown below.\n\n```python\nfrom lm_identifier.perplexity import rank_by_perplexity\n\ncandidate_models = [\n    \"gpt2\",\n    \"distilgpt2\",\n    \"facebook/opt-350m\",\n    \"lvwerra/gpt2-imdb\",\n]\n\ntext = (\n    \"My name is Thomas and my main character\"\n    \"is a young man who is a member of the military.\"\n)\n\nmodel2perplexity = rank_by_perplexity(text, candidate_models)\n```\n\n`model2perplexity` is a dictionary that maps each language model to its perplexity score, sorted in ascending order.\n\n```python\n{\n    'lvwerra/gpt2-imdb': 13.910672187805176,\n    'gpt2': 16.332365036010742,\n    'facebook/opt-350m': 18.126564025878906,\n    'distilgpt2': 28.430707931518555,\n}\n```\n\nThis toy example was indeed generated with `'lvwerra/gpt2-imdb'`, which is a standard `'gpt2'` model fine-tuned on the IMDB dataset. 
LM Identifier can thus be used to distinguish not only between disparate models, but also between an upstream model and its fine-tuned variant.\n\n### 2. Position-based Ranking\n\nWhile various autoregressive decoding and sampling methods exist, they typically involve applying a softmax over the logits to obtain the posterior distribution $p(x_t | x_1, x_2, \\dots, x_{t - 1})$. Given some text, we can analyze this distribution to see how closely the model's predictions align with the input.\n\nConcretely, if the token $x_t$ is ranked highly in the posterior sorted by probability mass, it is likely to have been produced by the model.\n\n```python\n\u003e\u003e\u003e from lm_identifier.position import rank_by_position\n\u003e\u003e\u003e model2position = rank_by_position(text, candidate_models)\n\u003e\u003e\u003e model2position\n{\n    'lvwerra/gpt2-imdb': 38.94736842105263,\n    'gpt2': 50.78947368421053,\n    'facebook/opt-350m': 100.6842105263158,\n    'distilgpt2': 318.89473684210526\n}\n```\n\nOn average, the tokens that appear in the input text ranked around 39th in the predictions of `'lvwerra/gpt2-imdb'`. While this ranking score may not seem low, recall that GPT-2 has a vocabulary of 50257 tokens. Given the cardinality of the PMF domain, this is a low score. 
Note also that this result agrees with the one obtained through ranking by perplexity.\n\n## Acknowledgement\n\nThis project borrows heavily from the code in Hugging Face's article on [perplexity measurement](https://huggingface.co/docs/transformers/perplexity), as well as the [GLTR code base](https://github.com/HendrikStrobelt/detecting-fake-text).\n\nThis project was heavily inspired by [GPTZero](https://etedward-gptzero-main-zqgfwb.streamlit.app), a project by [Edward Tian](https://twitter.com/edward_the6/status/1610067688449007618).\n\n## License\n\nReleased under the [MIT License](LICENSE).\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fjaketae%2Flm-identifier","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fjaketae%2Flm-identifier","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fjaketae%2Flm-identifier/lists"}