{"id":26697512,"url":"https://github.com/stefanheng/proggen","last_synced_at":"2025-04-13T04:26:24.964Z","repository":{"id":228682861,"uuid":"771860098","full_name":"StefanHeng/ProgGen","owner":"StefanHeng","description":"Code for paper \"ProgGen: Generating Named Entity Recognition Datasets Step-by-step with Self-Reflexive Large Language Models\"","archived":false,"fork":false,"pushed_at":"2024-03-29T22:23:31.000Z","size":65314,"stargazers_count":17,"open_issues_count":0,"forks_count":2,"subscribers_count":1,"default_branch":"master","last_synced_at":"2025-03-26T21:19:30.986Z","etag":null,"topics":["data-generation","efficient-nlp","few-shot-learning","large-language-models","low-resource-nlp","named-entity-recognition","natural-language-processing","training-data-generation"],"latest_commit_sha":null,"homepage":"","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/StefanHeng.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null}},"created_at":"2024-03-14T04:44:45.000Z","updated_at":"2025-02-25T13:59:04.000Z","dependencies_parsed_at":"2024-03-27T23:46:10.322Z","dependency_job_id":null,"html_url":"https://github.com/StefanHeng/ProgGen","commit_stats":null,"previous_names":["stefanheng/proggen"],"tags_count":0,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/StefanHeng%2FProgGen","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/StefanHeng%2FProgGen/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/StefanHeng%2FProgGen/releases","manifests_url":"http
s://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/StefanHeng%2FProgGen/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/StefanHeng","download_url":"https://codeload.github.com/StefanHeng/ProgGen/tar.gz/refs/heads/master","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":248662675,"owners_count":21141611,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["data-generation","efficient-nlp","few-shot-learning","large-language-models","low-resource-nlp","named-entity-recognition","natural-language-processing","training-data-generation"],"created_at":"2025-03-26T21:19:52.548Z","updated_at":"2025-04-13T04:26:24.944Z","avatar_url":"https://github.com/StefanHeng.png","language":"Python","funding_links":[],"categories":[],"sub_categories":[],"readme":"# ProgGen\nThis repo contains the code and datasets for paper \"[ProgGen: Generating Named Entity Recognition Datasets Step-by-step with Self-Reflexive Large Language Models](https://arxiv.org/abs/2403.11103)\". \n\n\n\n![ProgGen workflow](figures/24-02-24_flowchart-proggen_cropped.png)\n\n\n\nWe study 4 datasets: CoNLL-2003, WikiGold, MIT-Movie and MIT-Restaurant. \n\nSee sections (1) [`data`](https://github.com/StefanHeng/ProgGen/tree/master#data) (`reproduce` folder) for LLM prompts and responses and processed datasets and (2) [`commands`](https://github.com/StefanHeng/ProgGen/tree/master#commands) (`scripts` folder) for reproducing the results in the main experiments. 
\n\n\n\n\n\n## Data\n\nThe `reproduce` folder contains prompts, LLM responses and processed datasets as reported in our main experiments. It is organized as follows: \n\n-   `diversify-x` (Diversify X) \n    -   `gen-attr-dim` and `gen-attr-val` contain prompts and responses for attribute dimension and attribute value generation, respectively. \n    -   `config` contains processed attribute dimensions and values. \n-   `diversify-y` (Diversify Y)\n    -   `gen-entity-vanilla` and `gen-entity-latent` contain prompts and responses for named entity pool generation, for the vanilla and latent variants, respectively. \n    -   `config` contains processed named entities. \n-   `sample` for NER sample generation \n    -   `gen-sample` contains prompts and responses. \n    -   `dataset` contains processed NER datasets. \n-   `correction` for LLM self-correction \n    -   `gen-correction` contains prompts and responses. \n    -   `config` contains entity-class-specific annotation instructions and demos for each dataset and each diversity approach. \n    -   `instruction\u0026demo-pool` contains the annotation instruction pool and demo pool for each entity class, shared across all diversity approaches, for illustration purposes. \n    -   `annotation-error` contains representative entity annotation errors from NER sample generation for each dataset. \n    -   `dataset` contains processed datasets with entity annotations overridden by processed corrections. \n\n\n\nNote \n\n-   LLM prompts and responses are available in 2 formats: \n    1.   A readable format, via `prompts.log` and `completion-*.txt` files, and\n    2.   OpenAI API format, via `requests.jsonl` and `requests_results.jsonl` files. \n-   All folders have date prefixes indicating the date of the experiments. \n-   In each processed dataset (`sample/dataset`) folder, each entity annotation triple (sentence, span, entity type) is available in `logprobs-triple.json` files. 
\n-   Top-uncertain triples selected for LLM Self-Correction are available from the correction generation log files (`correction/gen-correction/**/completion.log`). \n\n\n\n\n\n## Commands\n\nWe detail scripts for running experiments and reproducing our results with example commands. \n\nNote \n\n1.   Each script contains all relevant arguments (see `help` in each script and `utils.py`). \n2.   Each script/command is expected to be run from the repository root directory. \n3.   Terminal logging messages (and log file writes) from each script will show where the relevant (dataset) files are saved. \n4.   All OpenAI API responses and processed datasets will be written to the `generated_data` folder. \n\n\n\nBefore you run a script, make sure Python sees the `src` package folder: \n\n```bash\nexport PYTHONPATH=$PYTHONPATH:$(pwd) \n```\n\nFor all LLM generation steps, set your OpenAI API key via \n\n```bash\nexport OPENAI_API_KEY='\u003cyour-api-key\u003e'\n```\n\n\n\n\n\n### Environment Setup\n\nPython version `3.8`\n\n1\u003e Create the conda environment \n\n```bash\nconda create -n prog-gen python=3.8 pip\n```\n\n2\u003e Activate the environment and install packages\n\n```bash\nconda activate prog-gen\npip install -r requirements.txt\n```\n\n\n\n### Steps\n\n#### Step 1: Write Original Dataset Samples\n\nIncludes writing (1) few-shot demo samples and (2) the entire test set for each of the datasets studied. Intended for downstream model training. \n\nSee `write_original_dataset.py` for details. \n\n\n\nExample 1: Write few-shot demo samples for CoNLL-2003:\n\n```bash\npython scripts/write_original_dataset.py demo \\\n\t--dataset_name 'conll2003-no-misc' \\\n\t--n_demo 1 \\\n\t--include_negative_sample 1\n```\n\nExample 2: Write the entire test set for MIT-Movie:\n\n```bash\npython scripts/write_original_dataset.py test --dataset_name 'mit-movie'\n```\n\n\n\nNote that this step is optional: each subsequent step automatically writes the respective files if they are not found. 
\n\n\n\n\n\n#### Step 2: Generate Diversify Requirement Configurations\n\nNote that additional manual inspection and filtering for low-quality values may be needed. \n\n\n\n**1: Diversify X** \n\nNote we omit the step for attribute dimension generation as we queried the GPT-4 web App. See the paper for the prompt templates and `reproduce` for the actual prompts used. \n\n\n\nSee `generate_diversify_x_config.py` for details on generating attribute values. \n\nExample on WikiGold: \n\n```bash\npython scripts/generate_diversity_config.py \\\n\t--dataset_name 'wiki-gold-no-misc' \\\n\t--diversity_variant 'diversify-x' \\\n\t--prompt_seed 42 \\\n\t--chat_model_name 'gpt-3.5-turbo-1106' \\\n\t--chat_timeout 30 \\\n\t--n_call 3\n```\n\n\n\n**2: Diversify Y**\n\nIncludes the *vanilla* and *latent* variants. See `generate_diversify_y_config.py`\n\n\n\nExample 1: The *vanilla* variant on MIT-Restaurant: \n\n```bash\npython scripts/generate_diversity_config.py \\\n\t--dataset_name 'mit-restaurant' \\\n\t--diversity_variant 'diversify-y-vanilla' \\\n\t--prompt_seed 42 \\\n\t--chat_model_name 'gpt-3.5-turbo-1106' \\\n\t--chat_timeout 30 \\\n\t--n_call 10\n```\n\n\n\nExample 2: The *latent* variant on  CoNLL-2003: \n\n```bash\npython scripts/generate_diversity_config.py \\\n\t--dataset_name 'conll2003-no-misc' \\\n\t--diversity_variant 'diversify-y-latent' \\\n\t--diversify_y_latent_attribute 'reproduce/diversify-x/config/conll2003_no_misc.json' \\\n\t--prompt_seed 42 \\\n\t--chat_model_name 'gpt-3.5-turbo-1106' \\\n\t--chat_max_tokens 256 \\\n\t--chat_timeout 30 \\\n\t--n_call 5\n```\n\n\n\nNote the internal name of the dataset-independent attribute dimension for each dataset is given by \n\n```python\nDATASET_NAME2TOPIC_DIM = {\n    'conll2003-no-misc': 'news-category',\n    'wiki-gold-no-misc': 'topic',\n    'mit-movie': 'query-category',\n    'mit-restaurant': 'meal-category'\n}\n```\n\n\n\n\n\n#### Step 3: Generate NER Samples\n\nIncludes Simple Prompt and all 4 diversity 
variants studied. \n\nSee `generate_ner_sample.py` for details. \n\n\n\nExample 1: Simple Prompt on MIT-Movie: \n\n```bash\npython scripts/generate_ner_sample.py \\\n\t--dataset_name 'mit-movie' \\\n\t--diversity_variant 'simple-prompt' \\\n\t--prompt_seed 42 \\\n\t--n_list 50 \\\n\t--n_call 36 \\\n\t--chat_model_name 'gpt-3.5-turbo-1106' \\\n\t--chat_max_tokens 2560 \\\n\t--chat_logprobs 'True' \\\n\t--chat_timeout 60\n```\n\nNote that (1) a large `n_list` (e.g. 50) may sometimes yield fewer generated samples than requested, as discussed in the paper, and (2) generated WikiGold samples are much longer, so a relatively higher `chat_max_tokens` is advised. \n\n\n\nExample 2: Diversify X on WikiGold: \n\n```bash\npython scripts/generate_ner_sample.py \\\n\t--dataset_name 'wiki-gold-no-misc' \\\n\t--diversity_variant 'diversify-x' \\\n\t--diversify_x_config 'reproduce/diversify-x/config/wiki_gold_no_misc.json' \\\n\t--prompt_seed 42 \\\n\t--n_list 3 \\\n\t--n_call 600 \\\n\t--chat_model_name 'gpt-3.5-turbo-1106' \\\n\t--chat_max_tokens 256 \\\n\t--chat_logprobs 'True' \\\n\t--chat_timeout 20\n```\n\n\n\nExample 3: Diversify Y (vanilla) on MIT-Restaurant: \n\n```bash\npython scripts/generate_ner_sample.py \\\n\t--dataset_name 'mit-restaurant' \\\n\t--diversity_variant 'diversify-y-vanilla' \\\n\t--diversify_y_config 'reproduce/diversify-y/config/vanilla/mit_restaurant.json' \\\n\t--prompt_seed 42 \\\n\t--n_list 3 \\\n\t--n_call 600 \\\n\t--chat_model_name 'gpt-3.5-turbo-1106' \\\n\t--chat_max_tokens 256 \\\n\t--chat_logprobs 'True' \\\n\t--chat_timeout 20\n```\n\n\n\nExample 4: Diversify Y (latent) on MIT-Movie: \n\n```bash\npython scripts/generate_ner_sample.py \\\n\t--dataset_name 'mit-movie' \\\n\t--diversity_variant 'diversify-y-latent' \\\n\t--diversify_y_config 'reproduce/diversify-y/config/latent/mit_movie.json' \\\n\t--diversify_y_n_exp_entity 4.5 \\\n\t--prompt_seed 42 \\\n\t--n_list 3 \\\n\t--n_call 600 \\\n\t--chat_model_name 'gpt-3.5-turbo-1106' \\\n\t--chat_max_tokens 256 
\\\n\t--chat_logprobs 'True' \\\n\t--chat_timeout 20\n```\n\n\n\nExample 5: Diversify X+Y on CoNLL-2003: \n\n```bash\npython scripts/generate_ner_sample.py \\\n\t--dataset_name 'conll2003-no-misc' \\\n\t--diversity_variant 'diversify-x+y' \\\n\t--diversify_x_config 'reproduce/diversify-x/config/conll2003_no_misc.json' \\\n\t--diversify_y_config 'reproduce/diversify-y/config/latent/conll2003_no_misc.json' \\\n\t--prompt_seed 42 \\\n\t--n_list 3 \\\n\t--n_call 600 \\\n\t--chat_model_name 'gpt-3.5-turbo-1106' \\\n\t--chat_max_tokens 256 \\\n\t--chat_logprobs 'True' \\\n\t--chat_timeout 20\n```\n\nDiversity arguments, including `diversify_x_config`, `diversify_x_sample_prob`, `diversify_y_config` and `diversify_y_n_exp_entity`, are optional and default to the setups reported in the paper (via loading from processed datasets in the `generated_data` folder). \n\n\n\n\n\n#### Step 4: Generate LLM Self-Corrections\n\nGenerates LLM Self-Corrections for entity annotations, given a generated (and processed) NER dataset. \n\nSee `generate_correction.py` for details. \n\n\n\nExample: Self-Correction for a processed dataset (`diversify-y-vanilla`) on MIT-Movie: \n\n```bash\npython scripts/generate_correction.py \\\n\t--dataset_name 'mit-movie' \\\n\t--generated_dataset_dir_name 'reproduce/sample/dataset/mit_movie/24-02-06_Diversify-Y-vanilla' \\\n\t--correction_config 'reproduce/correction/config/mit_movie/Diverse-Y-vanilla.json' \\\n\t--output_postfix 'diversify-y-vanilla' \\\n\t--prompt_seed 42 \\\n\t--n_correct 3 \\\n\t--logprob_thresh=-2e-2 \\\n\t--top_n 0.2 \\\n\t--chat_model_name 'gpt-3.5-turbo-1106' \\\n\t--chat_max_tokens 256 \\\n\t--chat_temperature 0 \\\n\t--chat_timeout 30\n```\n\n\n\n#### Step 5: Downstream BERT Training\n\nIncludes training a BERT-class model with epoch-wise evaluation. See `train.py`. 
\n\n\n\nExample: Train on a generated dataset (Diversify X) with self-correction for WikiGold: \n\n```bash\npython scripts/train.py \\\n\t--dataset_name 'wiki-gold-no-misc' \\\n\t--generated_dataset_dir_name 'reproduce/correction/dataset/wiki_gold_no_misc/24-02-11_Diversify-X' \\\n\t--few_shot_demo_file 'bio-train-1-shot-shuffled+neg.jsonl' \\\n\t--test_file 'bio-test-all.jsonl' \\\n\t--hf_model_name 'microsoft/deberta-v3-base' \\\n\t--learning_rate 4e-5 \\\n\t--n_epochs 16.0 \\\n\t--train_batch_size 24 \\\n\t--seed 42\n```\n\nTo train on a GPU, use: \n\n```bash\nCUDA_VISIBLE_DEVICES=\u003cyour-gpu-id\u003e python scripts/train.py ...\n```\n\n\n\n### Potential Support\n\nFunctionalities we may support in the future include \n\n1.   Custom shuffle seeds for different sets of n-shot demos \n2.   Customizable templates, including\n     -   (1) diversity config generation, (2) data generation instruction and (3) diversity requirement\n\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fstefanheng%2Fproggen","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fstefanheng%2Fproggen","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fstefanheng%2Fproggen/lists"}