{"id":19168389,"url":"https://github.com/bloomberg/mixce-acl2023","last_synced_at":"2025-05-07T14:41:50.518Z","repository":{"id":169481281,"uuid":"642073995","full_name":"bloomberg/MixCE-acl2023","owner":"bloomberg","description":"Implementation of MixCE method described in ACL 2023 paper by Zhang et al.","archived":false,"fork":false,"pushed_at":"2023-05-29T20:21:39.000Z","size":1020,"stargazers_count":19,"open_issues_count":1,"forks_count":3,"subscribers_count":7,"default_branch":"main","last_synced_at":"2025-04-19T23:31:58.794Z","etag":null,"topics":["language-model","machine-learning","nlp","python","pytorch","transformer"],"latest_commit_sha":null,"homepage":"","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"apache-2.0","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/bloomberg.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2023-05-17T19:03:33.000Z","updated_at":"2024-06-20T10:35:31.000Z","dependencies_parsed_at":null,"dependency_job_id":"3e7a372c-0492-4c08-bbd4-f7b4ccc79037","html_url":"https://github.com/bloomberg/MixCE-acl2023","commit_stats":null,"previous_names":["bloomberg/mixce-acl2023"],"tags_count":0,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/bloomberg%2FMixCE-acl2023","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/bloomberg%2FMixCE-acl2023/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/bloomberg%2FMixCE-acl2023/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/bloomberg%2FMixCE-acl
2023/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/bloomberg","download_url":"https://codeload.github.com/bloomberg/MixCE-acl2023/tar.gz/refs/heads/main","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":252897422,"owners_count":21821433,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["language-model","machine-learning","nlp","python","pytorch","transformer"],"created_at":"2024-11-09T09:42:30.289Z","updated_at":"2025-05-07T14:41:50.493Z","avatar_url":"https://github.com/bloomberg.png","language":"Python","funding_links":[],"categories":[],"sub_categories":[],"readme":"# MixCE\n\nThis repository contains the code and data for the following paper:\n\n[MixCE: Training Autoregressive Language Models by Mixing the Forward and Reverse Cross-Entropies](https://arxiv.org/abs/2305.16958)\n\n```\n@inproceedings{zhang2023mixce,\n  title={MixCE: Training Autoregressive Language Models by Mixing Forward and Reverse Cross-Entropies},\n  author={Zhang, Shiyue and Wu, Shijie and İrsoy, Ozan and Lu, Steven and Bansal, Mohit and Dredze, Mark and Rosenberg, David},\n  booktitle={Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics},\n  year={2023}\n}\n```\n\n**code author:** Shiyue Zhang\n\n## Requirements\n\n- Python 3 (tested with Python 3.9.5)\n- Install required packages:\n\n```bash\npython -m pip install -r requirements.txt\n```\n\nOptional: To avoid any version clashes with existing packages, you may want to perform the installation\nunder a virtual environment:\n\n```bash\npython -m 
venv yourenv\n. yourenv/bin/activate  # for bash, might be something else for your particular shell\npython -m pip install -r requirements.txt\n```\n\n## Synthetic Experiments\n\n### Quick Start\n\n[**synthetic.py**](./synthetic.py) is the script for running synthetic experiments.\nTo run an experiment, simply execute:\n\n```\npython synthetic.py\n```\n\nConfigurations (like seed, vocab size, etc.) can be set within the script, under `if __name__ == '__main__':`.\n\n### Configurations\n\nThere are a few important configurations within [synthetic.py](./synthetic.py) that determine what kind of synthetic\nexperiments you can run:\n\n**real_dataset**: if it is `None`, the transition matrix will be randomly initialized; if it is `'webtext'`,\nthe transition matrix will be initialized from the pre-computed [transition matrices on webtext](./data/webtext_transition_matrices.pkl).\n\n**zero_percent**: determines how many values in the transition matrix are 0. For example, if `zero_percent==0.5`, then 50% of the probabilities in the transition matrix are 0.\n\n**vocab_size**: the vocabulary size. We test 21, 51, 101, 501, or 1001. Note that 21 means we have 20\nnormal tokens (including EOS) and 1 PAD token.\n\n**seed**: We run 5 seeds (7, 42, 777, 4222, 99999) for each experiment.\n\n**loss_func**: We test 4 loss functions: (1) `'two_xens'`: it is denoted as MixCE\\* in our paper and uses the gold data\ndistribution P and Gumbel softmax; (2) `'qlogq_mix'`: it is our approximated MixCE loss function; (3) `'two_kls'`:\nthe mixture of two KL divergences; (4) `'js'`: the JS divergence.\n\n**train_eta**: The mixing ratio for those loss functions. If `train_eta==1.0` for `'two_xens'`, it is MLE. If `train_eta==1.0`\nfor `'two_kls'`, it is the forward KL (which is equivalent to MLE). 
If `train_eta==0.0` for `'two_kls'`, it is the reverse KL.\nWe use a general definition of JS (see [this paper](https://arxiv.org/abs/1511.05101) for more details), and\nJS converges to 0 when `train_eta` gets closer to 0.0 or 1.0. When `train_eta==0.5`, it is the normal definition of\nJS divergence.\n\n### Metrics \u0026 Evaluation\n\nWe evaluate synthetically trained bigram LMs by comparing the learned transition matrix against the gold\ntransition matrix. We use two metrics:\n\n(1) **avg. js**: we compute the JS divergence between each row of the gold and learned\ntransition matrices and average across rows.\n\n(2) **avg. 0s**: we take the values from the learned matrix at the positions where the gold probability is 0 and then average them.\n\nThe `compare_parameters()` function in [synthetic.py](./synthetic.py) is used for computing these two metrics.\n\n### Models\n\nModels will all be saved under the `synthetic_logs/` directory.\nEach model directory's name starts with the datetime at which the experiment was run. Under the model\ndirectory, you will also find the TensorBoard event files, as well as an `all_best_metrics.json` that stores the best metric scores\nfor each mixing ratio. See examples under [synthetic_logs/](./synthetic_logs).\n\nModel evaluation is conducted after each epoch, and the best checkpoint is selected based on the loss on the dev set.\n\n### Get results\n\nFinally, for each experiment, we average the results from 5 seeds; and for each\nobjective, we choose the best mixing ratio based on avg. js.\n\n`get_synthetic_results()` in [results.py](./results.py) is a function used to average results from 5 seeds\nand sort results of different mixing ratios according to avg. js.\n\nTo use `get_synthetic_results()`, you need to first prepare [synthetic_models.json](./synthetic_logs/synthetic_models.json)\nto specify the model directories. 
An example is shown in [synthetic_models.json](./synthetic_logs/synthetic_models.json).\nThen you can get the result of the experiment that uses the webtext-initialized transition matrix, vocab=20 and objective=two_kls\nby running `get_synthetic_results('webtext', '20', 'two_kls')`.\n\n## GPT-2 Experiments\n\n### Preparation\n\n#### Prepare data\n\n**Detokenizer.** You first need to download `detokenizer.perl` from Moses [here](https://github.com/moses-smt/mosesdecoder/blob/master/scripts/tokenizer/detokenizer.perl),\nand place it under the path `data/detokenizer.perl` because the following Python scripts depend on it.\n\nThen:\n\n```\ncd data\npython wikitext_data.py\npython webtext_data.py\ncurl https://dl.fbaipublicfiles.com/fairseq/data/writingPrompts.tar.gz | tar xvzf -\npython writingprompts_data.py\n```\n\nThe preprocessed data will be saved under `data/wikitext`, `data/webtext`, and `data/writingPrompts`.\n\n#### Download GPT-2 models\n\nClone GPT-2 models using `git lfs` following the instructions provided by Hugging Face.\n\n```\ngit lfs install\ngit clone https://huggingface.co/gpt2\n```\n\ngpt2 is the smallest GPT-2 model. We also experiment with gpt2-medium and gpt2-large, and gpt2-large is additionally used in computing MAUVE, so please download them too:\n\n```\ngit clone https://huggingface.co/gpt2-medium\ngit clone https://huggingface.co/gpt2-large\n```\n\nMake a copy of gpt2-large for MAUVE:\n\n```\ncp -r gpt2-large gpt2-large-mauve\n```\n\nThe copy is needed because we will write directly to gpt2-large, which would otherwise affect MAUVE computation.\n\n## Quick start\n\nYou can start running experiments with:\n\n```\npython run.py\n```\n\nConfigurations can be manually specified within `run.py`. 
See an example under `if __name__ == '__main__'`.\n\n## Configurations \u0026 Files\n\nThere are a few important configurations in [**run.py**](./run.py):\n\n**training_size**: The training data size. We test `'10K'`, `'25K'`, `'50K'`, and `'100K'`; by default we use `'50K'`.\n\n**model**: It can be `'gpt2'`, `'gpt2-medium'`, or `'gpt2-large'`.\n\n**dataset**: It can be `\"wikitext\"`, `\"webtext\"`, or `\"writingPrompts\"`.\n\n**mixing_ratio**: We search through `[0.0, 0.01, 0.1, 0.3, 0.5, 0.7, 0.9, 0.99, 1.0]` and choose the best `mixing_ratio`\nbased on dev set MAUVE score.\n\n**train_batch_size, accumulation, eval_batch_size**: These configs should be determined by the platform you use. We use\na single Tesla V100 GPU (32G memory), and the recommended configs in this setting are in `run.py`.\n\nThere is one dict and three functions in [**run.py**](./run.py):\n\n**data_sets{}**: It stores the paths of data files.\n\n**run_no_trainer()**: The function used for training and evaluating models.\n\n**run_no_trainer_eval()**: The function used for model evaluation only.\n\n**run_no_trainer_turn_topp()**: The function used for tuning top-p sampling's p.\n\nBesides [**run.py**](./run.py), the other important Python scripts for model training and evaluation are:\n\n[**gpt2.py**](./gpt2.py) (the most essential file) contains a **GPT2MIXModel model class that implements our MixCE loss function**.\n\n[**run_clm_no_trainer.py**](./run_clm_no_trainer.py) is the script to train and evaluate GPT-2 models.\n\n[**run_clm_no_trainer_tune_topp.py**](./run_clm_no_trainer_tune_topp.py) is similar to `run_clm_no_trainer.py`, except that it is only used for tuning the hyperparameter p of top-p sampling.\n\n[**metrics.py**](./metrics.py) contains the metrics we use to evaluate model generations.\n\n## Models\n\nModels will be saved under the `train/` directory.\n\nEach model directory's name starts with the datetime at which the experiment was run.\nUnder the model directory, we save 
the **best** checkpoint (selected based on dev loss).\n\n`dev/test.sample`, `dev/test.sample1`, `dev/test.sample2`, and `dev/test.human` are the 3 unbiased sampling generations and the human text.\n\n`dev/test_results.json` saves the results of perplexity, diversity, and repetition.\n\nAfter tuning p for top-p sampling, `dev/test.topp(p=*)` are top-p sampling generations with different p values.\n\nAfter computing MAUVE and coherence (see next section for details), `dev/test_mauve_coherence_*.json` contain the MAUVE and\ncoherence scores with different max lengths.\n\nAfter computing controlled MAUVE and coherence (see next section for details), `dev/test_controlled_mauve_coherence_*.json`\ncontain the controlled MAUVE and coherence scores with different max lengths.\n\n## Metrics \u0026 Evaluation\n\nWe report the scores of 6 metrics in our paper:\n\n**perplexity** is computed along with model training/evaluation (see [**run_clm_no_trainer.py**](./run_clm_no_trainer.py)).\n\n**diversity** is implemented by the `diversity()` function in [**metrics.py**](./metrics.py), and it is also computed\nalong with model training/evaluation by calling the `compute_diversity_repetition()` function in [**run_clm_no_trainer.py**](./run_clm_no_trainer.py).\nNote that repetition is another metric we implemented but did not report in our paper; it checks what percent of the text is repetition loops and also returns the repetitive phrase length.\n\n**MAUVE** and **coherence** are computed in a post-hoc manner by using saved generation files. `compute_mauve()` and\n`compute_coherence()` in [**metrics.py**](./metrics.py) are two helper functions to compute MAUVE and coherence.\nThey are called by the `compute_mauve_coherence()` function in [**results.py**](./results.py). 
To use `compute_mauve_coherence()`,\nyou must first prepare the [**models.json**](./train/models.json) to specify model directory names for evaluation.\n\nSimilarly, **controlled-MAUVE** and **controlled-coherence** can also be computed in a post-hoc manner by the `compute_controlled_mauve_coherence()`\nfunction in [**results.py**](./results.py).\n\n## Pretrained models\n\n| Dataset        | Model Size | Training Data Size | Objective       | Hugging Face hub name                             |\n| -------------- | ---------- | ------------------ | --------------- | ------------------------------------------------ |\n| wikitext       | gpt2-large | 50K                | MLE             | shiyue/wikitext_train50K_gpt2-large_mix1.0       |\n| wikitext       | gpt2-large | 50K                | MixCE (eta=0.1) | shiyue/wikitext_train50K_gpt2-large_mix0.1       |\n| webtext        | gpt2-large | 50K                | MLE             | shiyue/webtext_train50K_gpt2-large_mix1.0        |\n| webtext        | gpt2-large | 50K                | MixCE (eta=0.3) | shiyue/webtext_train50K_gpt2-large_mix0.3        |\n| writingPrompts | gpt2-large | 50K                | MLE             | shiyue/writingPrompts_train50K_gpt2-large_mix1.0 |\n| writingPrompts | gpt2-large | 50K                | MixCE (eta=0.7) | shiyue/writingPrompts_train50K_gpt2-large_mix0.7 |\n\nYou can try the pretrained models as follows:\n\n```\n\u003e\u003e\u003e from gpt2 import GPT2MIXModel\n\u003e\u003e\u003e from transformers import GPT2Tokenizer\n\u003e\u003e\u003e model = GPT2MIXModel.from_pretrained(\"shiyue/wikitext_train50K_gpt2-large_mix1.0\")\n\u003e\u003e\u003e tokenizer = GPT2Tokenizer.from_pretrained('shiyue/wikitext_train50K_gpt2-large_mix1.0')\n\u003e\u003e\u003e text = \"Hey, how are you?\"\n\u003e\u003e\u003e encoded_input = tokenizer(text, return_tensors='pt')\n\u003e\u003e\u003e model.eval()\n\u003e\u003e\u003e out_ids = model.lm.generate(inputs=encoded_input[\"input_ids\"], max_length=50, 
do_sample=True)\n\u003e\u003e\u003e print(tokenizer.batch_decode(out_ids, skip_special_tokens=True))\n```\n\n## Contributions\n\nWe :heart: contributions.\n\nHave you had a good experience with this project? Why not share some love and contribute code, or just let us know about any issues you had with it?\n\nWe welcome issue reports [here](../../issues); be sure to choose the proper issue template for your issue, so that we can be sure you're providing us with the necessary information.\n\nBefore sending a [Pull Request](../../pulls), please make sure you read our\n[Contribution Guidelines](https://github.com/bloomberg/.github/blob/main/CONTRIBUTING.md).\n\n## Notices\n\nThe following two files are borrowed and adapted from the `transformers` repository, and therefore retain their original copyrights.\n\n### **run_clm_no_trainer.py**\n\nThis file was originally taken from https://github.com/huggingface/transformers/blob/main/examples/pytorch/language-modeling/run_clm_no_trainer.py. On top of it, we have applied the following modifications:\n\n- Added the following additional arguments for parsing:\n  - `--test_file`\n  - `--reduction`\n  - `--mixing_ratio`\n  - `--max_length`\n  - `--prompt_length`\n  - `--eval_prompt_length`\n  - `--cache_dir`\n  - `--do_train`\n  - `--do_eval`\n- Commented out some unused code blocks for the \"`push_to_hub`\" option.\n- Handled the tokenizer possibly not having a pad token.\n- Modified the model loading block to use GPT2MIXModel for pretrained GPT2 models.\n- Added logic to append EOS after each text.\n- Used `DataCollatorWithPadding` instead of the default collator.\n- Added evaluation logic for the \"`do_eval`\" option, most of which goes into the new function '`evaluate()`'.\n\n### **run_clm_no_trainer_tune_topp.py**\n\nThis file is further modified from `run_clm_no_trainer.py` (see above) by changing how the `generate()` function is invoked to enable tuning of the `top_p` option.\n\n## Code of Conduct\n\nThis project has adopted a [Code of 
Conduct](https://github.com/bloomberg/.github/blob/main/CODE_OF_CONDUCT.md).\nIf you have any concerns about the Code, or behavior which you have experienced in the project, please\ncontact us at opensource@bloomberg.net.\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fbloomberg%2Fmixce-acl2023","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fbloomberg%2Fmixce-acl2023","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fbloomberg%2Fmixce-acl2023/lists"}