{"id":13754111,"url":"https://github.com/chtmp223/topicGPT","last_synced_at":"2025-05-09T22:30:55.958Z","repository":{"id":205200788,"uuid":"713590337","full_name":"chtmp223/topicGPT","owner":"chtmp223","description":"TopicGPT: A Prompt-Based Framework for Topic Modeling (NAACL'24)","archived":false,"fork":false,"pushed_at":"2025-03-15T13:55:46.000Z","size":848,"stargazers_count":282,"open_issues_count":6,"forks_count":46,"subscribers_count":6,"default_branch":"main","last_synced_at":"2025-04-17T16:16:00.481Z","etag":null,"topics":["llm","nlp","openai","python","topic-modeling","vllm"],"latest_commit_sha":null,"homepage":"https://chtmp223.github.io/topicGPT","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":null,"status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/chtmp223.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":null,"code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2023-11-02T20:43:16.000Z","updated_at":"2025-04-17T15:52:16.000Z","dependencies_parsed_at":"2025-03-15T14:28:19.531Z","dependency_job_id":"2c33efca-e4bf-4ca5-858e-bb77f10d1819","html_url":"https://github.com/chtmp223/topicGPT","commit_stats":null,"previous_names":["chtmp223/topicgpt"],"tags_count":5,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/chtmp223%2FtopicGPT","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/chtmp223%2FtopicGPT/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/chtmp223%2FtopicGPT/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/chtmp223%2FtopicGPT/manifests","owner_url":"https:/
/repos.ecosyste.ms/api/v1/hosts/GitHub/owners/chtmp223","download_url":"https://codeload.github.com/chtmp223/topicGPT/tar.gz/refs/heads/main","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":253335367,"owners_count":21892659,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["llm","nlp","openai","python","topic-modeling","vllm"],"created_at":"2024-08-03T09:01:40.652Z","updated_at":"2025-05-09T22:30:55.944Z","avatar_url":"https://github.com/chtmp223.png","language":"Python","funding_links":[],"categories":["A01_文本生成_文本对话"],"sub_categories":["大语言对话模型及数据"],"readme":"# TopicGPT\n[![arXiv](https://img.shields.io/badge/arxiv-link-red)](https://arxiv.org/abs/2311.01449) [![Website](https://img.shields.io/badge/website-link-purple)](https://chtmp223.github.io/topicGPT) \n\nThis repository contains scripts and prompts for our paper [\"TopicGPT: Topic Modeling by Prompting Large Language Models\"](https://arxiv.org/abs/2311.01449) (NAACL'24). Our `topicgpt_python` package consists of five main functions: \n- `generate_topic_lvl1` generates high-level and generalizable topics. \n- `generate_topic_lvl2` generates low-level topics specific to each high-level topic.\n- `refine_topics` refines the generated topics by merging similar topics and removing irrelevant topics.\n- `assign_topics` assigns the generated topics to the input text, along with a quote that supports the assignment.\n- `correct_topics` corrects the generated topics by reprompting the model so that the final topic assignment is grounded in the topic list. 
\n\n![TopicGPT Pipeline Overview](assets/img/pipeline.png)\n\n## 📣 Updates\n- [11/09/24] Python package `topicgpt_python` is released! You can install it via `pip install topicgpt_python`. We support OpenAI API, Vertex AI, Azure API, Gemini API, and vLLM (requires GPUs for inference). See [PyPI](https://pypi.org/project/topicgpt-python/).\n- [11/18/23] Second-level topic generation code and refinement code are uploaded.\n- [11/11/23] Basic pipeline is uploaded. Refinement and second-level topic generation code are coming soon.\n\n## 📦 Using TopicGPT\n### Getting Started\n1. Make a new Python 3.9+ environment using virtualenv or conda.\n2. Install the required packages:\n    ```shell\n    pip install topicgpt_python\n    ```\n3. Set the API key(s) for the deployment you plan to use:\n    ```shell\n    # Run in shell\n    # Needed only for the OpenAI API deployment\n    export OPENAI_API_KEY={your_openai_api_key}\n\n    # Needed only for the Vertex AI deployment\n    export VERTEX_PROJECT={your_vertex_project}   # e.g. my-project\n    export VERTEX_LOCATION={your_vertex_location} # e.g. us-central1\n\n    # Needed only for the Gemini API deployment\n    export GEMINI_API_KEY={your_gemini_api_key}\n\n    # Needed only for the Azure API deployment\n    export AZURE_OPENAI_API_KEY={your_azure_api_key}\n    export AZURE_OPENAI_ENDPOINT={your_azure_endpoint}\n    ```\n- Refer to https://openai.com/pricing/ for OpenAI API pricing or to https://cloud.google.com/vertex-ai/pricing for Vertex AI pricing.\n\n### Data\n- Prepare your `.jsonl` data file with one JSON object per line, in the following format:\n    ```json\n    {\n        \"id\": \"IDs (optional)\",\n        \"text\": \"Documents\",\n        \"label\": \"Ground-truth labels (optional)\"\n    }\n    ```\n- Put your data file in `data/input`. A sample data file, `data/input/sample.jsonl`, is included for debugging the code.\n- Raw dataset used in the paper (Bills and Wiki): [[link]](https://drive.google.com/drive/folders/1rCTR5ZQQ7bZQoewFA8eqV6glP6zhY31e?usp=sharing). 
\n\n### Pipeline\nCheck out `demo.ipynb` for a complete pipeline and more detailed instructions. We advise running on a subset of the data with cheaper (or open-source) models first before scaling up to the entire dataset.\n\n0. (Optional) Define I/O paths in `config.yml` and load them using:\n    ```python\n    import yaml\n\n    with open(\"config.yml\", \"r\") as f:\n        config = yaml.safe_load(f)\n    ```\n1. Load the package:\n    ```python\n    from topicgpt_python import *\n    ```\n2. Generate high-level topics:\n    ```python\n    generate_topic_lvl1(api, model, data, prompt_file, seed_file, out_file, topic_file, verbose)\n    ```\n3. Generate low-level topics (optional):\n    ```python\n    generate_topic_lvl2(api, model, seed_file, data, prompt_file, out_file, topic_file, verbose)\n    ```\n4. Refine the generated topics by merging near duplicates and removing topics with low frequency (optional):\n    ```python\n    refine_topics(api, model, prompt_file, generation_file, topic_file, out_file, updated_file, verbose, remove, mapping_file)\n    ```\n5. Assign and correct the topics. When using paid APIs, a weaker model is usually used for this step to save cost:\n    ```python\n    assign_topics(\n        api, model, data, prompt_file, out_file, topic_file, verbose\n    )\n    ```\n\n    ```python\n    correct_topics(\n        api, model, data_path, prompt_path, topic_path, output_path, verbose\n    )\n    ```\n\n6. Check out the `data/output` folder for sample outputs.\n7. We also offer metric calculation functions in `topicgpt_python.metrics` to evaluate the alignment between the generated topics and the ground-truth labels (Adjusted Rand Index, Harmonic Purity, and Normalized Mutual Information).\n\n\n## 📜 Citation\n```\n@misc{pham2023topicgpt,\n      title={TopicGPT: A Prompt-based Topic Modeling Framework}, \n      author={Chau Minh Pham and Alexander Hoyle and Simeng Sun and Mohit Iyyer},\n      year={2023},\n      eprint={2311.01449},\n      archivePrefix={arXiv},\n      primaryClass={cs.CL}\n}\n```","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fchtmp223%2FtopicGPT","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fchtmp223%2FtopicGPT","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fchtmp223%2FtopicGPT/lists"}