{"id":19381567,"url":"https://github.com/megagonlabs/coop","last_synced_at":"2025-04-23T20:31:57.225Z","repository":{"id":58180810,"uuid":"353862905","full_name":"megagonlabs/coop","owner":"megagonlabs","description":"☘️ Code for Convex Aggregation for Opinion Summarization (Iso et al; Findings of EMNLP 2021)","archived":false,"fork":false,"pushed_at":"2022-12-22T00:07:14.000Z","size":1444,"stargazers_count":35,"open_issues_count":0,"forks_count":9,"subscribers_count":4,"default_branch":"main","last_synced_at":"2025-04-16T17:12:53.800Z","etag":null,"topics":["natural-language-generation","opinion-summarization","summarization","text-generation","unsupervised-learning","vae","variational-autoencoder"],"latest_commit_sha":null,"homepage":"https://aclanthology.org/2021.findings-emnlp.328v2.pdf","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"bsd-3-clause","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/megagonlabs.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null}},"created_at":"2021-04-02T00:40:10.000Z","updated_at":"2025-04-11T21:41:43.000Z","dependencies_parsed_at":"2023-01-30T05:16:03.095Z","dependency_job_id":null,"html_url":"https://github.com/megagonlabs/coop","commit_stats":null,"previous_names":[],"tags_count":0,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/megagonlabs%2Fcoop","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/megagonlabs%2Fcoop/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/megagonlabs%2Fcoop/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/megagonlabs%2Fcoop/manifests","owner_url":"http
s://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/megagonlabs","download_url":"https://codeload.github.com/megagonlabs/coop/tar.gz/refs/heads/main","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":250509698,"owners_count":21442482,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["natural-language-generation","opinion-summarization","summarization","text-generation","unsupervised-learning","vae","variational-autoencoder"],"created_at":"2024-11-10T09:17:35.252Z","updated_at":"2025-04-23T20:31:56.679Z","avatar_url":"https://github.com/megagonlabs.png","language":"Python","funding_links":[],"categories":[],"sub_categories":[],"readme":"# Convex Aggregation for Opinion Summarization\n\n[![Conference](https://img.shields.io/badge/findings_of_emnlp-2021-red)](https://aclanthology.org/2021.findings-emnlp.328)\n[![arXiv](https://img.shields.io/badge/arxiv-2104.01371-success)](https://arxiv.org/abs/2104.01371/)\n[![arXiv](https://img.shields.io/badge/colab-demo-yellow)](https://colab.research.google.com/drive/1kyWw9H6TBfpuVrQH_35ofeScX1E-DSpb?usp=sharing)\n\nCode for [Convex Aggregation for Opinion Summarization](https://arxiv.org/abs/2104.01371).\n\nThe codebase provides an easy-to-use framework that enables the user to train and use text VAE models with different configurations.\n\nYou can also easily configure the architecture of the text VAE model without changing the code at all. 
You only need to use a different Jsonnet file (perhaps with some modification) to train and use a model.\n\n![Coop](./img/overview.png)\n\n## Citations\n```bibtex\n@inproceedings{iso21emnlpfindings,\n    title = {{C}onvex {A}ggregation for {O}pinion {S}ummarization},\n    author = {Hayate Iso and\n              Xiaolan Wang and\n              Yoshihiko Suhara and\n              Stefanos Angelidis and\n              Wang{-}Chiew Tan},\n    booktitle = {Findings of the Conference on Empirical Methods in Natural Language Processing (EMNLP)},\n    month = {November},\n    year = {2021}\n}\n```\n\n## Installation\n```bash\nconda create -n coop python=3.7\nconda activate coop\nconda install -c conda-forge jsonnet sentencepiece # If needed\npip install git+https://github.com/megagonlabs/coop.git\n```\nor\n```bash\ngit clone https://github.com/megagonlabs/coop.git\ncd coop\npip install -e .  # or python setup.py develop\n```\n\n## Quick tour\nOur unsupervised opinion summarization model generates a summary by decoding the aggregated latent vectors of the inputs.\nThe proposed framework, ```coop```, finds the best summary based on the input-output overlap.\nFirst, encode the input reviews, ```reviews```, into the latent vectors, ```z_raw```:\n```python\nfrom typing import List\nimport torch\nfrom coop import VAE, util\n\nmodel_name: str = \"megagonlabs/bimeanvae-yelp\"  # or \"megagonlabs/bimeanvae-amzn\", \"megagonlabs/optimus-yelp\", \"megagonlabs/optimus-amzn\"\nvae = VAE(model_name)\n\nreviews: List[str] = [\n    \"I love this ramen shop!! Highly recommended!!\",\n    \"Here is one of my favorite ramen places! 
You must try!\"\n]\nz_raw: torch.Tensor = vae.encode(reviews) # [num_reviews * latent_size]\n```\nGiven the latent vectors for the input reviews, the model generates one candidate summary from the average of each combination of latent vectors:\n```python\n# All combinations of input reviews\nidxes: List[List[int]] = util.powerset(len(reviews))\n# Taking averages for all combinations of latent vectors\nzs: torch.Tensor = torch.stack([z_raw[idx].mean(dim=0) for idx in idxes]) # [(2^num_reviews - 1) * latent_size]\n\noutputs: List[str] = vae.generate(zs)\noutputs\n```\nThe output looks like this:\n```shell\n['I love this restaurant!! Highly recommended!!',\n 'Here is one of my favorite ramen places! You must try this place!',\n 'I love this place! Food is amazing!!']\n```\nFinally, our framework, Coop, selects the summary based on the input-output overlap:\n```python\n# Input-output overlap is measured by ROUGE-1 F1 score.\nbest: str = max(outputs, key=lambda x: util.input_output_overlap(inputs=reviews, output=x))\nbest\n```\n\nThe selected summary looks like this:\n```shell\n'Here is one of my favorite ramen places! 
You must try this place!'\n```\n\n## Evaluate on Dev/Test set\nYou can generate summaries and evaluate them in about 30 lines of code.\nBefore doing so, download the dev/test sets by running the following commands.\n```bash\n# Download dev and test set for evaluation\npython scripts/get_summ.py yelp data/yelp\npython scripts/get_summ.py amzn data/amzn\n```\n\nThen generate and score the summaries as follows:\n```python\nimport json\nfrom typing import List\nimport pandas as pd\nimport torch\nimport rouge\nfrom coop import VAE, util\n\ntask = \"yelp\"  # or \"amzn\"\nsplit = \"dev\"  # or \"test\"\ndata: List[dict] = json.load(open(f\"./data/{task}/{split}.json\"))\nmodel_name: str = f\"megagonlabs/bimeanvae-{task}\"  # or f\"megagonlabs/optimus-{task}\"\nvae = VAE(model_name)\n\nhypothesis = []\nfor ins in data:\n    reviews: List[str] = ins[\"reviews\"]\n    z_raw: torch.Tensor = vae.encode(reviews)\n    idxes: List[List[int]] = util.powerset(len(reviews))\n    zs: torch.Tensor = torch.stack([z_raw[idx].mean(dim=0) for idx in idxes]) # [(2^num_reviews - 1) * latent_size]\n\n    outputs: List[str] = vae.generate(zs, bad_words=util.BAD_WORDS)  # First-person pronoun blocking\n    best: str = max(outputs, key=lambda x: util.input_output_overlap(inputs=reviews, output=x))\n    hypothesis.append(best)\n\nreference: List[List[str]] = [ins[\"summary\"] for ins in data]\n\nevaluator = rouge.Rouge(metrics=[\"rouge-n\", \"rouge-l\"], max_n=2, limit_length=False, apply_avg=True,\n                        stemming=True, ensure_compatibility=True)\n\nscores = pd.DataFrame(evaluator.get_scores(hypothesis, reference))\nscores\n```\n\n# Available models\nAll models are hosted on the Hugging Face :hugs: model hub (https://huggingface.co/megagonlabs/).\n\n\n| Model name                                                      | Training Data  | Encoder               | Decoder | \n| :-------------------------------------------------------------- | 
:-------------:|:---------------------:|:-------:|\n| [megagonlabs/bimeanvae-yelp](https://huggingface.co/megagonlabs/bimeanvae-yelp) | Yelp           | BiLSTM + Mean Pooling | LSTM    |\n| [megagonlabs/optimus-yelp](https://huggingface.co/megagonlabs/optimus-yelp)     | Yelp           | bert-base-cased       | gpt2    |\n| [megagonlabs/bimeanvae-amzn](https://huggingface.co/megagonlabs/bimeanvae-amzn) | Amazon         | BiLSTM + Mean Pooling | LSTM    |\n| [megagonlabs/optimus-amzn](https://huggingface.co/megagonlabs/optimus-amzn)     | Amazon         | bert-base-cased       | gpt2    |\n\n```VAE``` automatically downloads model checkpoints from the model hub.\n\n## Summarization Performance\n### Yelp dataset [(Chu and Liu, 2019)](https://github.com/sosuperic/MeanSum)\n\n| Model name                                                      | Aggregation | ROUGE-1 F1 | ROUGE-2 F1 | ROUGE-L F1 | \n| :-------------------------------------------------------------- |:-----------:|:----------:|:----------:|:----------:|\n| [megagonlabs/bimeanvae-yelp](https://huggingface.co/megagonlabs/bimeanvae-yelp) | SimpleAvg   | 32.87      | 6.93       | 19.89      |\n| [megagonlabs/bimeanvae-yelp](https://huggingface.co/megagonlabs/bimeanvae-yelp) | Coop        | **35.37**  | **7.35**   | **19.94**  |\n| [megagonlabs/optimus-yelp](https://huggingface.co/megagonlabs/optimus-yelp)     | SimpleAvg   | 31.23      | 6.48       | 18.27      |\n| [megagonlabs/optimus-yelp](https://huggingface.co/megagonlabs/optimus-yelp)     | Coop        | 33.68      | 7.00       | 18.95      |\n\n\n### Amazon dataset [(Bražinskas et al., 2020)](https://github.com/abrazinskas/Copycat-abstractive-opinion-summarizer)\n| Model name                                                      | Aggregation | ROUGE-1 F1 | ROUGE-2 F1 | ROUGE-L F1 | \n| :-------------------------------------------------------------- |:-----------:|:----------:|:----------:|:----------:|\n| 
[megagonlabs/bimeanvae-amzn](https://huggingface.co/megagonlabs/bimeanvae-amzn) | SimpleAvg   | 33.60     | 6.64     | 20.87     |\n| [megagonlabs/bimeanvae-amzn](https://huggingface.co/megagonlabs/bimeanvae-amzn) | Coop        | **36.57** | **7.23** | **21.24** |\n| [megagonlabs/optimus-amzn](https://huggingface.co/megagonlabs/optimus-amzn)     | SimpleAvg   | 33.54     | 6.18     | 19.34     |\n| [megagonlabs/optimus-amzn](https://huggingface.co/megagonlabs/optimus-amzn)     | Coop        | 35.32     | 6.22     | 19.84     |\n\n\n# Reproduction\n\n## Setup\n```shell\n$ unzip coop.zip \u0026\u0026 cd coop\n$ conda create -n coop python=3.7\n$ conda activate coop\n$ conda install -c conda-forge jsonnet sentencepiece  # If needed\n$ pip install -r requirements.txt\n```\n\n## Preparation\n\n### Yelp dataset\n\nDownload the Yelp dataset from [this link](https://www.yelp.com/dataset).  \nYou only need the JSON file (`yelp_dataset.tar`).\n\nMove the file to `data/yelp` and extract it. Only `yelp_academic_dataset_review.json` is required.\n\n```bash\n$ tar -xvf yelp_dataset.tar\n$ YELP_RAW=$(pwd)/yelp_academic_dataset_review.json\n```\n\nRun the following preprocessing script. 
This may take several hours, depending on your machine's specs.\n\n```bash\n$ mkdir -p ./data/yelp\n$ python scripts/preprocess.py yelp $YELP_RAW \u003e ./data/yelp/train.jsonl\n```\n\nAdditionally, you need the reference summaries from [this link](https://s3.us-east-2.amazonaws.com/unsup-sum/summaries_0-200_cleaned.csv) provided by [MeanSum](https://github.com/sosuperic/MeanSum).\n\nRun the following command to download and preprocess them.\nThis will create `dev.json` and `test.json`, which follow the dev/test splits\ndefined in [the original MeanSum paper](https://arxiv.org/abs/1810.05739).\n\n```bash\n$ python scripts/get_summ.py yelp data/yelp\n$ ls data/yelp\ntrain.jsonl\ndev.json\ntest.json\n```\n\n\n### Amazon dataset\n\nDownload the Amazon dataset from [this link](http://jmcauley.ucsd.edu/data/amazon/links.html).\nYou only need the following files for 4 categories:\n- [Clothing_Shoes_and_Jewelry.json.gz](http://snap.stanford.edu/data/amazon/productGraph/categoryFiles/reviews_Clothing_Shoes_and_Jewelry.json.gz)\n- [Electronics.json.gz](http://snap.stanford.edu/data/amazon/productGraph/categoryFiles/reviews_Electronics.json.gz)\n- [Health_and_Personal_Care.json.gz](http://snap.stanford.edu/data/amazon/productGraph/categoryFiles/reviews_Health_and_Personal_Care.json.gz)\n- [Home_and_Kitchen.json.gz](http://snap.stanford.edu/data/amazon/productGraph/categoryFiles/reviews_Home_and_Kitchen.json.gz)\n\nRun the following commands to download the files into the current directory.\nYou **don't need to uncompress** them.\n\n```shell\n$ mkdir amzn_raw \u0026\u0026 cd amzn_raw\n$ wget http://snap.stanford.edu/data/amazon/productGraph/categoryFiles/reviews_Clothing_Shoes_and_Jewelry.json.gz\n$ wget http://snap.stanford.edu/data/amazon/productGraph/categoryFiles/reviews_Electronics.json.gz\n$ wget http://snap.stanford.edu/data/amazon/productGraph/categoryFiles/reviews_Health_and_Personal_Care.json.gz\n$ wget 
http://snap.stanford.edu/data/amazon/productGraph/categoryFiles/reviews_Home_and_Kitchen.json.gz\n$ AMZN_RAW=$(pwd)\n$ ls $AMZN_RAW\nClothing_Shoes_and_Jewelry.json.gz\nElectronics.json.gz\nHealth_and_Personal_Care.json.gz\nHome_and_Kitchen.json.gz\n$ cd -\n```\n\nRun the following preprocessing script. This may take several hours, depending on your machine's specs.\n\n```bash\n$ mkdir -p ./data/amzn\n$ python scripts/preprocess.py amzn $AMZN_RAW \u003e ./data/amzn/train.jsonl\n```\n\nThe reference summaries are provided by CopyCat.\n\nRun the following command to download and preprocess them. This will create `dev.json` and `test.json`, which follow the dev/test splits defined in the original CopyCat paper.\n\n```bash\n$ python scripts/get_summ.py amzn data/amzn\n$ ls data/amzn\ntrain.jsonl\ndev.json\ntest.json\n```\n\n\n## Training\n### Model and Training Configuration\nThe ```config``` directory contains the configuration files used for the experiments. You can copy one of them and edit it to run experiments in different settings.\n\n```jsonnet\nlocal lib = import '../utils.libsonnet';\nlocal data_type = \"yelp\";\nlocal latent_dim = 512;\nlocal free_bit = 0.25;\nlocal num_steps = 100000;\nlocal checkout_step = 1000;\nlocal batch_size = 256;\nlocal lr = 1e-3;\n\n{\n    \"data_dir\": \"./data/%s\" % data_type,\n    \"spm_path\": \"./data/sentencepiece/%s.model\" % data_type,\n    \"model\": lib.BiMeanVAE(latent_dim, free_bit),\n    \"trainer\": lib.VAETrainer(num_steps, checkout_step, batch_size, lr)\n}\n\n```\n\n### Training a model\nTo train the model, run the following script with a ``config`` file and the directory in which to save checkpoints.\n```bash\n$ python train.py \u003cconfig filepath\u003e -s \u003cmodel dir path\u003e\n```\n\nFor example,\n\n```bash\n$ python train.py config/bimeanvae/yelp.jsonnet -s log/bimeanvae/yelp/ex1\n```\n\n## Evaluation\nTo evaluate the model with our proposed framework, ```coop```, you can simply run the 
following:\n```bash\n$ python coop/search.py \u003cmodel dir path\u003e\n```\n\nFor example,\n```bash\n$ python coop/search.py log/bimeanvae/yelp/ex1\n```\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fmegagonlabs%2Fcoop","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fmegagonlabs%2Fcoop","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fmegagonlabs%2Fcoop/lists"}