{"id":31211512,"url":"https://github.com/cyberagentailab/mbr-anomaly","last_synced_at":"2025-09-21T05:30:18.328Z","repository":{"id":229939641,"uuid":"778056174","full_name":"CyberAgentAILab/mbr-anomaly","owner":"CyberAgentAILab","description":"Code for \"On the True Distribution Approximation of Minimum Bayes-Risk Decoding,\" NAACL 2024","archived":false,"fork":false,"pushed_at":"2024-04-02T16:40:12.000Z","size":13057,"stargazers_count":4,"open_issues_count":0,"forks_count":0,"subscribers_count":0,"default_branch":"main","last_synced_at":"2025-09-10T07:42:50.239Z","etag":null,"topics":[],"latest_commit_sha":null,"homepage":null,"language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/CyberAgentAILab.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null,"zenodo":null,"notice":null,"maintainers":null,"copyright":null,"agents":null,"dco":null,"cla":null}},"created_at":"2024-03-27T01:47:32.000Z","updated_at":"2024-06-07T11:46:04.000Z","dependencies_parsed_at":"2025-09-10T09:00:58.045Z","dependency_job_id":null,"html_url":"https://github.com/CyberAgentAILab/mbr-anomaly","commit_stats":null,"previous_names":["cyberagentailab/mbr-anomaly"],"tags_count":0,"template":false,"template_full_name":null,"purl":"pkg:github/CyberAgentAILab/mbr-anomaly","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/CyberAgentAILab%2Fmbr-anomaly","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/CyberAgentAILab%2Fmbr-anomaly/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/CyberAgentAILab%2Fmb
r-anomaly/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/CyberAgentAILab%2Fmbr-anomaly/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/CyberAgentAILab","download_url":"https://codeload.github.com/CyberAgentAILab/mbr-anomaly/tar.gz/refs/heads/main","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/CyberAgentAILab%2Fmbr-anomaly/sbom","scorecard":null,"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":276195627,"owners_count":25601152,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","status":"online","status_checked_at":"2025-09-21T02:00:07.055Z","response_time":72,"last_error":null,"robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":true,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":[],"created_at":"2025-09-21T05:30:17.264Z","updated_at":"2025-09-21T05:30:18.299Z","avatar_url":"https://github.com/CyberAgentAILab.png","language":"Python","funding_links":[],"categories":[],"sub_categories":[],"readme":"# Analyzing MBR decoding via Anomaly Detection Approach\n\nThis repository contains the code to reproduce the analysis results presented in our NAACL 2024 paper:\n\nOn the True Distribution Approximation of Minimum Bayes-Risk Decoding [[Paper](https://arxiv.org/abs/2404.00752)]\n\n## Requirements\n- Python 3.9+\n\n```bash\n# Install required packages\n# Specify the torch version appropriate for your environment in `requirements.txt`\npip install -r requirements.txt\n\n# Install fastBPE here\npip 
install fastBPE\n```\n\n## 1 Reproduce with pre-computed results\n\nIn this section, we reproduce the analysis results presented in our paper using pre-computed translation results and utility matrices that we have already generated.\n\nIn our experiments, we used four translation pairs: English (`en`) \u003c-\u003e German (`de`) and English \u003c-\u003e Russian (`ru`), and the dataset used is newstest2019 from [WMT19](https://www.statmt.org/wmt19/translation-task.html). The NMT model used is the WMT19 winner model ([Ng et al., 2019](https://arxiv.org/abs/1907.06616)).\n\n### 1.1 Prepare generated hypotheses\n\nFollow the steps below to download and extract `hypotheses`:\n\n```bash\ncurl -O https://storage.googleapis.com/ailab-public/mbr-anomaly/hypotheses.tar.gz\ntar -zxvf hypotheses.tar.gz\n```\n\nThe `hypotheses` directory contains the results of translations generated under each setting, saved in separate directories. The naming convention for each directory is as follows:\n\n```bash\nhypotheses/wmt19.newstest2019.\u003csource_lang\u003e-\u003ctarget_lang\u003e.\u003csampling_method\u003enb\u003cnbest\u003eseed\u003cseed\u003e\n```\n\n- `\u003csource_lang\u003e`: The source language in the translation\n- `\u003ctarget_lang\u003e`: The target language in the translation\n- `\u003csampling_method\u003e`: The generation method. For example, the following abbreviations are used:\n    - `bm100`: beam search (beam size = 100)\n    - `ep002`: epsilon sampling (epsilon = 0.02)\n    - `tp09`: nucleus sampling (p = 0.9)\n- `\u003cnbest\u003e`: The number of n-best for generation. Typically set to 100, matching the number of candidates for MBR decoding and the number of pseudo-references\n- `\u003cseed\u003e`: The seed value used during generation. Either 1, 2, or 3\n\nEach directory contains the source sentences, translated sentences, and their likelihoods for each instance saved as `e-1.hypotheses.json`. 
The COMET22 ([Rei et al., 2022](https://aclanthology.org/2022.wmt-1.52/)) scores for each hypothesis are stored as a pandas DataFrame in `e-1.scores.pkl`.\n\n- `e-1.hypotheses.json`\n    ```json\n    {\n        \"0\": {\n            \"source\": \"Welsh AMs worried about 'looking like muppets'\",\n            \"reference\": \"Walisische Ageordnete sorgen sich \\\"wie Dödel auszusehen\\\"\",\n            \"hypotheses\": [\n                {\n                    \"hypothesis_index\": 0,\n                    \"generation_name\": \"wmt19.newstest2019.en-de.tp09nb100.out\",\n                    \"sentence\": \"Walisische AMs besorgt, \\\"wie Muppets auszusehen\\\"\",\n                    \"sum_logprobs\": -10.9131,\n                    \"mean_logprobs\": -0.68206875\n                },\n                ...\n            ]\n        },\n        ...\n    }\n    ```\n- `e-1.scores.pkl`\n    ```python\n    \u003e\u003e\u003e pd.read_pickle(\"hypotheses/wmt19.newstest2019.en-de.tp09nb100seed1/e-1.scores.pkl\")\n                                    comet22_score\n    example_index hypothesis_index               \n    0             0                         0.700\n                  1                         0.712\n                  2                         0.646\n    ...                                       ...\n    1996          97                        0.697\n                  98                        0.818\n                  99                        0.547\n    ```\n\n### 1.2 Compute utility matrix for MBR decoding\n\nFollow the steps below to download and extract `mbrd_output`:\n\n```bash\ncurl -O https://storage.googleapis.com/ailab-public/mbr-anomaly/mbrd_output.tar.gz\ntar -zxvf mbrd_output.tar.gz\n```\n\nThe `mbrd_output` directory contains the MBR decoding results calculated for each combination of candidate and pseudo-reference. 
The data for candidates and pseudo-references are selected from the `hypotheses` prepared in [1.1 Prepare generated hypotheses](#11-prepare-generated-hypotheses). The naming convention for each directory is as follows:\n\n```bash\nmbrd_output/wmt19.newstest2019.\u003csource_lang\u003e-\u003ctarget_lang\u003e.\u003ccandidate_id\u003e.\u003cpseudo_ref_id\u003e\n```\n\n- `\u003csource_lang\u003e`: The source language in the translation\n- `\u003ctarget_lang\u003e`: The target language in the translation\n- `\u003ccandidate_id\u003e` and `\u003cpseudo_ref_id\u003e`: Correspond to `\u003csampling_method\u003enb\u003cnbest\u003eseed\u003cseed\u003e` from [1.1 Prepare generated hypotheses](#11-prepare-generated-hypotheses)\n\nEach directory contains:\n\n- `c100p100.e1000.data.json`: Data for 1,000 randomly selected instances from the dataset, recording 100 sentences for candidates and 100 sentences for pseudo-references\n    ```json\n    {\n        \"0\": {\n            \"source\": \"Welsh AMs worried about 'looking like muppets'\",\n            \"reference\": \"Walisische Ageordnete sorgen sich \\\"wie Dödel auszusehen\\\"\",\n            \"candidates\": [\n                {\n                    \"hypothesis_index\": 0,\n                    \"generation_name\": \"wmt19.newstest2019.en-de.ep002nb100.out\",\n                    \"sentence\": \"Walisische AMs besorgt darüber, \\\"wie Muppets auszusehen\\\"\",\n                    \"sum_logprobs\": -9.650299999999998,\n                    \"mean_logprobs\": -0.5676647058823528\n                },\n                ...\n            ],\n            \"pseudo_refs\": [\n                {\n                    \"hypothesis_index\": 0,\n                    \"generation_name\": \"wmt19.newstest2019.en-de.tp09nb100.out\",\n                    \"sentence\": \"Walisische AMs besorgt, \\\"wie Muppets auszusehen\\\"\",\n                    \"sum_logprobs\": -10.9131,\n                    \"mean_logprobs\": -0.68206875\n                
},\n                ...\n            ]\n        },\n        ...\n    }\n    ```\n\n- `c100p100.e1000.utilities.pkl`: COMET22 scores recorded for each candidate against all pseudo-references for each instance.\n    ```python\n    \u003e\u003e\u003e pd.read_pickle(\"mbrd_output/wmt19.newstest2019.en-de.ep002nb100seed1.tp09nb100seed1/c100p100.e1000.utilities.pkl\")\n                                                      utilities  expected_utility\n    example_index candidate_index\n    0             0                [0.93, 0.91, 0.87, 0.88, ...              0.82\n                  1                [0.85, 0.77, 0.89, 0.91, ...              0.70\n                  2                [0.85, 0.77, 0.89, 0.91, ...              0.70\n    ...                                                     ...               ...\n    998           97               [0.88, 0.89, 0.87, 0.86, ...              0.85\n                  98               [0.79, 0.79, 0.79, 0.78, ...              0.79\n                  99               [0.86, 0.86, 0.85, 0.86, ...              0.85\n    ```\n\n\n### 1.3 Analysis of MBR decoding performance\n\nThe experimental scripts for Section 3 and Section 5 in our paper are compiled in `experiments.ipynb`. 
By following the steps in `experiments.ipynb`, you can reproduce our experimental results using the prepared hypotheses and utility matrix.\n\nRegarding Anomaly Detection distances, the covariance matrix for Mahalanobis distance and the computation for Local Outlier Factor (LOF) are time-consuming, thus they have been computed in advance.\n\nThe computation results are saved in `mbrd_output/*/c100p100.e1000.example_covariance.pkl` and `mbrd_output/*/c100p100.e1000.example_lof.pkl`.\n\nIn `experiments.ipynb`, these files can be loaded later for analysis using Mahalanobis distance and LOF.\n\n## 2 Reproduce from scratch\n\nThis section explains, with concrete examples, how to use scripts to perform calculations (e.g., generation of translation hypotheses or computation of utility matrices) that were skipped in Reproduce with pre-computed results.\n\n### 2.1 Prepare generated hypotheses\n\nHere, we prepare the hypotheses data as in [1.1 Prepare generated hypotheses](#11-prepare-generated-hypotheses). Hypotheses can be generated in the following four steps:\n\n#### 2.1.1 Download dataset and model\n\nFollow the instructions in [[datasets/README.md](datasets/README.md)] and [[models/README.md](models/README.md)] to download and unpack newstest2019 and the NMT models for each language pair.\n\n#### 2.1.2 Preprocess dataset\n\nUse the `preprocess_newstest2019.sh` script to tokenize and apply BPE to each language pair data of newstest2019.\n\nFor example, to preprocess the en → de pair data:\n\n```bash\n# Usage: bash preprocess_newstest2019.sh \u003csource_lang\u003e \u003ctarget_lang\u003e\nbash preprocess_newstest2019.sh en de\n```\n\nPreprocessed data is saved in `data-bin/newstest2019/ende`.\n\n#### 2.1.3 Generate translations\n\nUse the preprocessed data and NMT model to generate translations. 
The `fairseq-generate` command is used for generation.\n\nFor example, to translate the en -\u003e de pair using nucleus sampling (p=0.9):\n\n```bash\ndata_bin_prefix=\"data-bin/newstest2019/ende\"\nmodel_dpath=\"models/wmt19.en-de.ensemble\"\n\nfairseq-generate \\\n    ${data_bin_prefix} \\\n    --path ${model_dpath}/model4.pt \\\n    --source-lang en \\\n    --target-lang de \\\n    --tokenizer moses \\\n    --bpe fastbpe \\\n    --bpe-codes ${model_dpath}/bpecodes \\\n    --batch-size 2 \\\n    --sampling \\\n    --sampling-topp 0.9 \\\n    --temperature 1.0 \\\n    --beam 100 \\\n    --nbest 100 \\\n    --seed 1 \\\n    \u003e wmt19.newstest2019.en-de.tp09nb100seed1.out\n```\n\n#### 2.1.4 Postprocess generated translations\n\nUse `postprocess_hypotheses.py` to postprocess the generated translations. At this time, the quality of each hypothesis is evaluated by specified utility functions.\n\nFor example, to evaluate and postprocess `wmt19.newstest2019.en-de.tp09nb100seed1.out` created above with COMET22:\n\n```bash\npython postprocess_hypotheses.py \\\n    --generation-output wmt19.newstest2019.en-de.tp09nb100seed1.out \\\n    --result-prefix hypotheses/wmt19.newstest2019.en-de.tp09nb100seed1/e-1 \\\n    --eval-metrics comet22 \\\n    --world-size 8\n```\n\nThis creates the directory `hypotheses/wmt19.newstest2019.en-de.tp09nb100seed1` containing `e-1.hypotheses.json` and `e-1.scores.pkl`.\n\n### 2.2 Compute utility matrix for MBR decoding\n\nUse `mbr_decoding.py` to compute the utility matrix for MBR decoding from the hypotheses created in [2.1 Prepare generated hypotheses](#21-prepare-generated-hypotheses).\n\nFor example, if using COMET22 as the utility function, and epsilon-sampling (`ep002nb100seed1`) and nucleus-sampling (`tp09nb100seed1`) generated hypotheses as the candidate set and pseudo-reference set, respectively:\n\n```bash\npython mbr_decoding.py \\\n    --hypotheses-prefix-for-candidates hypotheses/wmt19.newstest2019.en-de.ep002nb100seed1/e-1 \\\n 
   --hypotheses-prefix-for-pseudo-refs hypotheses/wmt19.newstest2019.en-de.tp09nb100seed1/e-1 \\\n    --result-prefix mbrd_output/wmt19.newstest2019.en-de.ep002nb100seed1.tp09nb100seed1/c100p100.e1000 \\\n    --num-examples 1000 \\\n    --num-candidates 100 \\\n    --num-pseudo-refs 100 \\\n    --metric comet22 \\\n    --world-size 8\n```\n\nThis creates the directory `mbrd_output/wmt19.newstest2019.en-de.ep002nb100seed1.tp09nb100seed1` containing `c100p100.e1000.data.json` and `c100p100.e1000.utilities.pkl`.\n\n### 2.3 Analysis of MBR decoding performance\n\nBefore running `experiments.ipynb`, compute the covariance matrix for Mahalanobis distance and LOF scores as follows:\n\n#### 2.3.1 Covariance matrix\n\nUse the `compute_covariance.py` script to compute the covariance matrix from the utility matrix calculated in [2.2 Compute utility matrix for MBR decoding](#22-compute-utility-matrix-for-mbr-decoding) as follows:\n\n```bash\npython compute_covariance.py \\\n    --mbrd_prefix mbrd_output/wmt19.newstest2019.en-de.ep002nb100seed1.tp09nb100seed1/c100p100.e1000\n```\n\nThe calculated covariance matrix is saved under the `--mbrd_prefix` directory as `c100p100.e1000.example_covariance.pkl`.\n\n#### 2.3.2 LOF\n\nUse the `compute_lof.py` script to compute the LOF object (sklearn.neighbors.LocalOutlierFactor) fitted on the utility matrix data calculated in [2.2 Compute utility matrix for MBR decoding](#22-compute-utility-matrix-for-mbr-decoding):\n\n```bash\npython compute_lof.py \\\n    --mbrd_prefix mbrd_output/wmt19.newstest2019.en-de.ep002nb100seed1.tp09nb100seed1/c100p100.e1000\n```\n\nThe calculated LOF object is saved under the `--mbrd_prefix` directory as `c100p100.e1000.example_lof.pkl`.\n\n## Citation\n```bibtex\n@article{ohashi2024true,\n    title={On the True Distribution Approximation of Minimum Bayes-Risk Decoding},\n    author={Ohashi, Atsumoto and Honda, Ukyo and Morimura, Tetsuro and Jinnai, Yuu},\n    journal={arXiv preprint arXiv:2404.00752},\n    
year={2024},\n}\n```\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fcyberagentailab%2Fmbr-anomaly","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fcyberagentailab%2Fmbr-anomaly","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fcyberagentailab%2Fmbr-anomaly/lists"}