{"id":28795553,"url":"https://github.com/ukplab/useb","last_synced_at":"2025-08-09T21:14:14.134Z","repository":{"id":62586709,"uuid":"373593720","full_name":"UKPLab/useb","owner":"UKPLab","description":"Heterogenous, Task- and Domain-Specific Benchmark for Unsupervised Sentence Embeddings used in the TSDAE paper: https://arxiv.org/abs/2104.06979.","archived":false,"fork":false,"pushed_at":"2022-01-04T22:22:44.000Z","size":37,"stargazers_count":32,"open_issues_count":1,"forks_count":2,"subscribers_count":4,"default_branch":"main","last_synced_at":"2025-06-18T03:09:16.685Z","etag":null,"topics":["benchmark","domain-adaptation","information-retrieval","nlp","paraphrase-identification","pytorch","reranking","sbert","sentence-embeddings","transformer","unsupervised-learning"],"latest_commit_sha":null,"homepage":"","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"apache-2.0","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/UKPLab.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null}},"created_at":"2021-06-03T17:43:21.000Z","updated_at":"2024-06-28T14:42:16.000Z","dependencies_parsed_at":"2022-11-03T22:10:04.174Z","dependency_job_id":null,"html_url":"https://github.com/UKPLab/useb","commit_stats":null,"previous_names":["kwang2049/useb"],"tags_count":0,"template":false,"template_full_name":null,"purl":"pkg:github/UKPLab/useb","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/UKPLab%2Fuseb","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/UKPLab%2Fuseb/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/UKPLab%2Fuseb/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/UKPLab%2
Fuseb/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/UKPLab","download_url":"https://codeload.github.com/UKPLab/useb/tar.gz/refs/heads/main","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/UKPLab%2Fuseb/sbom","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":260477931,"owners_count":23015066,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["benchmark","domain-adaptation","information-retrieval","nlp","paraphrase-identification","pytorch","reranking","sbert","sentence-embeddings","transformer","unsupervised-learning"],"created_at":"2025-06-18T03:09:17.082Z","updated_at":"2025-08-09T21:14:14.122Z","avatar_url":"https://github.com/UKPLab.png","language":"Python","readme":"# Unsupervised Sentence Embedding Benchmark (USEB)\nThis repository hosts the data and the evaluation script for reproducing the results reported in the paper: \"[TSDAE: Using Transformer-based Sequential Denoising Auto-Encoder for Unsupervised Sentence Embedding Learning](https://arxiv.org/abs/2104.06979)\" (EMNLP 2021 Findings). This benchmark (USEB) contains four heterogeneous, task- and domain-specific datasets: [AskUbuntu](https://github.com/taolei87/askubuntu), [CQADupStack](https://github.com/D1Doris/CQADupStack), [TwitterPara](https://www.aclweb.org/anthology/D17-1126/) and [SciDocs](https://github.com/allenai/scidocs). It works directly with [SBERT](https://github.com/UKPLab/sentence-transformers). 
For details, please refer to the paper.\n\n## Install\n```bash\npip install useb  # Or git clone and pip install .\npython -m useb.downloading all  # Download both training and evaluation data\n```\n\n## Usage \u0026 Example\nAfter downloading the data, one can either run (takes ~8 minutes on a GPU)\n```bash\npython -m useb.examples.eval_sbert\n```\nto evaluate an [SBERT](https://github.com/UKPLab/sentence-transformers) model (an excellent library for sentence embeddings; its latest models perform much better) on all the datasets, or run the equivalent code below:\n```python\nfrom useb import run\nfrom sentence_transformers import SentenceTransformer  # SentenceTransformer is an awesome library providing SOTA sentence embedding methods. TSDAE is also integrated into it.\nimport torch\n\nsbert = SentenceTransformer('bert-base-nli-mean-tokens')  # Build an SBERT model\n\n# The only thing needed for the evaluation: a function mapping a list of sentences to a batch of vectors (torch.Tensor)\n@torch.no_grad()\ndef semb_fn(sentences) -\u003e torch.Tensor:\n    return torch.Tensor(sbert.encode(sentences, show_progress_bar=False))\n\nresults, results_main_metric = run(\n    semb_fn_askubuntu=semb_fn,\n    semb_fn_cqadupstack=semb_fn,\n    semb_fn_twitterpara=semb_fn,\n    semb_fn_scidocs=semb_fn,\n    eval_type='test',\n    data_eval_path='data-eval'  # Path to the data-eval folder\n)\n\nassert round(results_main_metric['avg'], 1) == 47.6\n```\nEvaluating on a single dataset is also supported (please see [useb/examples/eval_sbert_askubuntu.py](useb/examples/eval_sbert_askubuntu.py)):\n```bash\npython -m useb.examples.eval_sbert_askubuntu\n```\n\n## Data Organization\n```bash\n.\n├── data-eval  # For evaluation usage. 
One can refer to ./unsupse_benchmark/evaluators to learn how these data are loaded.\n│   ├── askubuntu\n│   │   ├── dev.txt\n│   │   ├── test.txt\n│   │   └── text_tokenized.txt\n│   ├── cqadupstack\n│   │   ├── corpus.json\n│   │   └── retrieval_split.json\n│   ├── scidocs\n│   │   ├── cite\n│   │   │   ├── test.qrel\n│   │   │   └── val.qrel\n│   │   ├── cocite\n│   │   │   ├── test.qrel\n│   │   │   └── val.qrel\n│   │   ├── coread\n│   │   │   ├── test.qrel\n│   │   │   └── val.qrel\n│   │   ├── coview\n│   │   │   ├── test.qrel\n│   │   │   └── val.qrel\n│   │   └── data.json\n│   └── twitterpara\n│       ├── Twitter_URL_Corpus_test.txt\n│       ├── test.data\n│       └── test.label\n├── data-train  # For training usage.\n│   ├── askubuntu\n│   │   ├── supervised  # For supervised training. *.org and *.para are parallel files; their lines are aligned, each pair forming a gold relevant sentence pair (to work with MultipleNegativesRankingLoss in the SBERT repo).\n│   │   │   ├── train.org\n│   │   │   └── train.para\n│   │   └── unsupervised  # For unsupervised training. Each line is a sentence.\n│   │       └── train.txt\n│   ├── cqadupstack\n│   │   ├── supervised\n│   │   │   ├── train.org\n│   │   │   └── train.para\n│   │   └── unsupervised\n│   │       └── train.txt\n│   ├── scidocs\n│   │   ├── supervised\n│   │   │   ├── train.org\n│   │   │   └── train.para\n│   │   └── unsupervised\n│   │       └── train.txt\n│   └── twitter  # For supervised training on TwitterPara, the float labels are also available (to work with CosineSimilarityLoss in the SBERT repo). 
As reported in the paper, using the float labels achieves higher performance.\n│       ├── supervised\n│       │   ├── train.lbl\n│       │   ├── train.org\n│       │   ├── train.para\n│       │   ├── train.s1\n│       │   └── train.s2\n│       └── unsupervised\n│           └── train.txt\n└── tree.txt\n```\n\n## Citation\nIf you use the code for evaluation, feel free to cite our publication [TSDAE: Using Transformer-based Sequential Denoising Auto-Encoder for Unsupervised Sentence Embedding Learning](https://arxiv.org/abs/2104.06979):\n```bibtex\n@inproceedings{wang-etal-2021-tsdae-using,\n    title = \"{TSDAE}: Using Transformer-based Sequential Denoising Auto-Encoder for Unsupervised Sentence Embedding Learning\",\n    author = \"Wang, Kexin and Reimers, Nils and Gurevych, Iryna\",\n    booktitle = \"Findings of the Association for Computational Linguistics: EMNLP 2021\",\n    month = nov,\n    year = \"2021\",\n    address = \"Punta Cana, Dominican Republic\",\n    publisher = \"Association for Computational Linguistics\",\n    url = \"https://aclanthology.org/2021.findings-emnlp.59\",\n    doi = \"10.18653/v1/2021.findings-emnlp.59\",\n    pages = \"671--688\",\n}\n```\n\nContact person and main contributor: [Kexin Wang](https://kwang2049.github.io/), kexin.wang.2049@gmail.com\n\n[https://www.ukp.tu-darmstadt.de/](https://www.ukp.tu-darmstadt.de/)\n\n[https://www.tu-darmstadt.de/](https://www.tu-darmstadt.de/)\n\nDon't hesitate to send us an e-mail or report an issue if something is broken (and it shouldn't be) or if you have further questions.\n\n\u003e This repository contains experimental software and is published for the sole purpose of giving additional background details on the respective 
publication.\n","funding_links":[],"categories":[],"sub_categories":[],"project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fukplab%2Fuseb","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fukplab%2Fuseb","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fukplab%2Fuseb/lists"}