{"id":13754343,"url":"https://github.com/PhilipMay/stsb-multi-mt","last_synced_at":"2025-05-09T22:32:04.228Z","repository":{"id":106915692,"uuid":"349548651","full_name":"PhilipMay/stsb-multi-mt","owner":"PhilipMay","description":"Machine translated multilingual STS benchmark dataset.","archived":false,"fork":false,"pushed_at":"2023-12-21T15:14:38.000Z","size":7648,"stargazers_count":30,"open_issues_count":4,"forks_count":9,"subscribers_count":2,"default_branch":"main","last_synced_at":"2025-04-03T16:23:20.779Z","etag":null,"topics":["dataset","multilingual","nlp"],"latest_commit_sha":null,"homepage":"","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"other","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/PhilipMay.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2021-03-19T20:33:48.000Z","updated_at":"2025-03-24T02:23:35.000Z","dependencies_parsed_at":"2023-12-21T18:40:29.036Z","dependency_job_id":null,"html_url":"https://github.com/PhilipMay/stsb-multi-mt","commit_stats":null,"previous_names":[],"tags_count":0,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/PhilipMay%2Fstsb-multi-mt","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/PhilipMay%2Fstsb-multi-mt/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/PhilipMay%2Fstsb-multi-mt/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/PhilipMay%2Fstsb-multi-mt/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/PhilipMay","download_url
":"https://codeload.github.com/PhilipMay/stsb-multi-mt/tar.gz/refs/heads/main","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":253335861,"owners_count":21892749,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["dataset","multilingual","nlp"],"created_at":"2024-08-03T09:01:55.578Z","updated_at":"2025-05-09T22:31:59.181Z","avatar_url":"https://github.com/PhilipMay.png","language":"Python","funding_links":[],"categories":["NLP语料和数据集"],"sub_categories":["其他_文本生成、文本对话"],"readme":"# STSb Multi MT\nMachine translated multilingual STS benchmark dataset.\n\nThese are different multilingual translations and the English original of the [STSbenchmark dataset](https://ixa2.si.ehu.es/stswiki/index.php/STSbenchmark). 
Translation has been done with [deepl.com](https://www.deepl.com/).\n\n- Available languages are: de, en, es, fr, it, ja, nl, pl, pt, ru, zh\n- Dataset splits are called: train, dev, test\n\nIt can be used to train [sentence embeddings](https://github.com/UKPLab/sentence-transformers) like [T-Systems-onsite/cross-en-de-roberta-sentence-transformer](https://huggingface.co/T-Systems-onsite/cross-en-de-roberta-sentence-transformer).\n\nPlease [open an issue](https://github.com/PhilipMay/stsb-multi-mt/issues/new) if you have questions or want to report problems.\n\nThis dataset provides pairs of sentences and a score of their similarity.\n\nscore | 2 example sentences | explanation\n------|---------|------------\n5 | *The bird is bathing in the sink.\u003cbr/\u003eBirdie is washing itself in the water basin.* | The two sentences are completely equivalent, as they mean the same thing.\n4 | *Two boys on a couch are playing video games.\u003cbr/\u003eTwo boys are playing a video game.* | The two sentences are mostly equivalent, but some unimportant details differ.\n3 | *John said he is considered a witness but not a suspect.\u003cbr/\u003e“He is not a suspect anymore.” John said.* | The two sentences are roughly equivalent, but some important information differs or is missing.\n2 | *They flew out of the nest in groups.\u003cbr/\u003eThey flew into the nest together.* | The two sentences are not equivalent, but share some details.\n1 | *The woman is playing the violin.\u003cbr/\u003eThe young lady enjoys listening to the guitar.* | The two sentences are not equivalent, but are on the same topic.\n0 | *The black dog is running through the snow.\u003cbr/\u003eA race car driver is driving his car through the mud.* | The two sentences are completely dissimilar.\n\n## Content\n- folder `raw-data`: the raw data as it was translated with deepl.com\n- folder `data`: the converted data with the columns sentence1, sentence2, similarity_score\n- `convert.py`: script to convert data from `raw-data` to `data`\n\n## 
Examples of Use\n```python\nimport csv\n\n# Set filepath to one of the files in the `data` folder before running.\nwith open(filepath, newline=\"\", encoding=\"utf-8\") as csvfile:\n    csv_dict_reader = csv.DictReader(\n        csvfile,\n        dialect='excel',\n        fieldnames=[\"sentence1\", \"sentence2\", \"similarity_score\"],\n    )\n    for row in csv_dict_reader:\n        print(row)\n```\n\n## Known Issues\nnone\n\n## Manual Testing of Datasets\nLanguage | 1st train | 1000th train | last train        | 1st dev | 1000th dev | last dev | 1st test | 1000th test | last test\n---------|-----------|--------------|-------------------|---------|------------|----------|----------|-------------|----------\nde       | ok        | ok           | ok                | ok      | ok         | ok       | ok       | ok          | ok\nen       | ok        | ok           | ok                | ok      | ok         | ok       | ok       | ok          | ok\nes       |           |              |                   |         |            |          |          |             |\nfr       |           |              |                   |         |            |          |          |             |\nit       |           |              |                   |         |            |          |          |             |\nja       |           |              |                   |         |            |          |          |             |\nnl       | ok        | ok           | partially English | ok      | ok         | ok       | ok       | ok          | poor grammar\npl       |           |              |                   |         |            |          |          |             |\npt       |           |              |                   |         |            |          |          |             |\nru       |           |              |                   |         |            |          |          |             |\nzh       |           |              |                   |         |            |          |          |             
|\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2FPhilipMay%2Fstsb-multi-mt","html_url":"https://awesome.ecosyste.ms/projects/github.com%2FPhilipMay%2Fstsb-multi-mt","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2FPhilipMay%2Fstsb-multi-mt/lists"}