{"id":18960437,"url":"https://github.com/google-research-datasets/wit","last_synced_at":"2026-02-13T10:41:05.962Z","repository":{"id":38196901,"uuid":"342013716","full_name":"google-research-datasets/wit","owner":"google-research-datasets","description":"WIT (Wikipedia-based Image Text) Dataset is a large multimodal multilingual dataset comprising 37M+ image-text sets with 11M+ unique images across 100+ languages.","archived":false,"fork":false,"pushed_at":"2024-09-27T20:55:42.000Z","size":4751,"stargazers_count":1059,"open_issues_count":1,"forks_count":44,"subscribers_count":37,"default_branch":"main","last_synced_at":"2025-06-08T16:06:20.955Z","etag":null,"topics":["cc-by-sa-3","machine-learning","multilingual","multimodal","nlp","wikipedia"],"latest_commit_sha":null,"homepage":"https://github.com/google-research-datasets/wit","language":null,"has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"other","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/google-research-datasets.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":"CONTRIBUTING.md","funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null,"zenodo":null}},"created_at":"2021-02-24T19:35:36.000Z","updated_at":"2025-06-04T02:40:33.000Z","dependencies_parsed_at":"2025-05-25T20:08:10.334Z","dependency_job_id":"7d18a78d-c339-40eb-a060-d33595971b02","html_url":"https://github.com/google-research-datasets/wit","commit_stats":null,"previous_names":[],"tags_count":0,"template":false,"template_full_name":null,"purl":"pkg:github/google-research-datasets/wit","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/google-research-datasets%2Fwit","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHu
b/repositories/google-research-datasets%2Fwit/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/google-research-datasets%2Fwit/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/google-research-datasets%2Fwit/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/google-research-datasets","download_url":"https://codeload.github.com/google-research-datasets/wit/tar.gz/refs/heads/main","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/google-research-datasets%2Fwit/sbom","scorecard":{"id":436058,"data":{"date":"2025-08-11","repo":{"name":"github.com/google-research-datasets/wit","commit":"68b38670d984a7ddfb9e131596d56568ac987313"},"scorecard":{"version":"v5.2.1-40-gf6ed084d","commit":"f6ed084d17c9236477efd66e5b258b9d4cc7b389"},"score":3,"checks":[{"name":"Token-Permissions","score":-1,"reason":"No tokens found","details":null,"documentation":{"short":"Determines if the project's workflows follow the principle of least privilege.","url":"https://github.com/ossf/scorecard/blob/f6ed084d17c9236477efd66e5b258b9d4cc7b389/docs/checks.md#token-permissions"}},{"name":"Maintained","score":0,"reason":"0 commit(s) and 0 issue activity found in the last 90 days -- score normalized to 0","details":null,"documentation":{"short":"Determines if the project is \"actively maintained\".","url":"https://github.com/ossf/scorecard/blob/f6ed084d17c9236477efd66e5b258b9d4cc7b389/docs/checks.md#maintained"}},{"name":"Binary-Artifacts","score":10,"reason":"no binaries found in the repo","details":null,"documentation":{"short":"Determines if the project has generated executable (binary) artifacts in the source repository.","url":"https://github.com/ossf/scorecard/blob/f6ed084d17c9236477efd66e5b258b9d4cc7b389/docs/checks.md#binary-artifacts"}},{"name":"Code-Review","score":0,"reason":"Found 1/29 approved changesets -- score normalized to 
0","details":null,"documentation":{"short":"Determines if the project requires human code review before pull requests (aka merge requests) are merged.","url":"https://github.com/ossf/scorecard/blob/f6ed084d17c9236477efd66e5b258b9d4cc7b389/docs/checks.md#code-review"}},{"name":"Packaging","score":-1,"reason":"packaging workflow not detected","details":["Warn: no GitHub/GitLab publishing workflow detected."],"documentation":{"short":"Determines if the project is published as a package that others can easily download, install, easily update, and uninstall.","url":"https://github.com/ossf/scorecard/blob/f6ed084d17c9236477efd66e5b258b9d4cc7b389/docs/checks.md#packaging"}},{"name":"Dangerous-Workflow","score":-1,"reason":"no workflows found","details":null,"documentation":{"short":"Determines if the project's GitHub Action workflows avoid dangerous patterns.","url":"https://github.com/ossf/scorecard/blob/f6ed084d17c9236477efd66e5b258b9d4cc7b389/docs/checks.md#dangerous-workflow"}},{"name":"CII-Best-Practices","score":0,"reason":"no effort to earn an OpenSSF best practices badge detected","details":null,"documentation":{"short":"Determines if the project has an OpenSSF (formerly CII) Best Practices Badge.","url":"https://github.com/ossf/scorecard/blob/f6ed084d17c9236477efd66e5b258b9d4cc7b389/docs/checks.md#cii-best-practices"}},{"name":"Fuzzing","score":0,"reason":"project is not fuzzed","details":["Warn: no fuzzer integrations found"],"documentation":{"short":"Determines if the project uses fuzzing.","url":"https://github.com/ossf/scorecard/blob/f6ed084d17c9236477efd66e5b258b9d4cc7b389/docs/checks.md#fuzzing"}},{"name":"Vulnerabilities","score":10,"reason":"0 existing vulnerabilities detected","details":null,"documentation":{"short":"Determines if the project has open, known unfixed vulnerabilities.","url":"https://github.com/ossf/scorecard/blob/f6ed084d17c9236477efd66e5b258b9d4cc7b389/docs/checks.md#vulnerabilities"}},{"name":"Signed-Releases","score":-1,"reason":"no 
releases found","details":null,"documentation":{"short":"Determines if the project cryptographically signs release artifacts.","url":"https://github.com/ossf/scorecard/blob/f6ed084d17c9236477efd66e5b258b9d4cc7b389/docs/checks.md#signed-releases"}},{"name":"License","score":9,"reason":"license file detected","details":["Info: project has a license file: LICENSE:0","Warn: project license file does not contain an FSF or OSI license."],"documentation":{"short":"Determines if the project has defined a license.","url":"https://github.com/ossf/scorecard/blob/f6ed084d17c9236477efd66e5b258b9d4cc7b389/docs/checks.md#license"}},{"name":"Branch-Protection","score":0,"reason":"branch protection not enabled on development/release branches","details":["Warn: branch protection not enabled for branch 'main'"],"documentation":{"short":"Determines if the default and release branches are protected with GitHub's branch protection settings.","url":"https://github.com/ossf/scorecard/blob/f6ed084d17c9236477efd66e5b258b9d4cc7b389/docs/checks.md#branch-protection"}},{"name":"Security-Policy","score":0,"reason":"security policy file not detected","details":["Warn: no security policy file detected","Warn: no security file to analyze","Warn: no security file to analyze","Warn: no security file to analyze"],"documentation":{"short":"Determines if the project has published a security policy.","url":"https://github.com/ossf/scorecard/blob/f6ed084d17c9236477efd66e5b258b9d4cc7b389/docs/checks.md#security-policy"}},{"name":"Pinned-Dependencies","score":-1,"reason":"no dependencies found","details":null,"documentation":{"short":"Determines if the project has declared and pinned the dependencies of its build process.","url":"https://github.com/ossf/scorecard/blob/f6ed084d17c9236477efd66e5b258b9d4cc7b389/docs/checks.md#pinned-dependencies"}},{"name":"SAST","score":0,"reason":"SAST tool is not run on all commits -- score normalized to 0","details":["Warn: 0 commits out of 2 are checked with a SAST 
tool"],"documentation":{"short":"Determines if the project uses static code analysis.","url":"https://github.com/ossf/scorecard/blob/f6ed084d17c9236477efd66e5b258b9d4cc7b389/docs/checks.md#sast"}}]},"last_synced_at":"2025-08-19T04:43:25.689Z","repository_id":38196901,"created_at":"2025-08-19T04:43:25.689Z","updated_at":"2025-08-19T04:43:25.689Z"},"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":272988919,"owners_count":25026961,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","status":"online","status_checked_at":"2025-08-31T02:00:09.071Z","response_time":79,"last_error":null,"robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":true,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["cc-by-sa-3","machine-learning","multilingual","multimodal","nlp","wikipedia"],"created_at":"2024-11-08T14:06:57.879Z","updated_at":"2026-02-13T10:41:00.931Z","avatar_url":"https://github.com/google-research-datasets.png","language":null,"funding_links":[],"categories":["多模态大模型","Others","Datasets","**Datasets**"],"sub_categories":["资源传输下载","2023","Miscellaneous NLP Datasets"],"readme":"# WIT : Wikipedia-based Image Text Dataset\n\n**Wikipedia-based Image Text (WIT) Dataset** is a large **multimodal\nmultilingual** dataset. WIT is composed of a curated set of 37.6 million entity\nrich image-text examples with 11.5 million unique images across 108 Wikipedia\nlanguages. 
Its size enables WIT to be used as a pretraining dataset for\nmultimodal machine learning models.\n\n## Key Advantages\n\nA few unique advantages of WIT:\n\n-   The largest multimodal dataset (publicly available at the time of this writing) by the number of image-text examples.\n-   A massively multilingual dataset (first of its kind) with coverage for 108 languages.\n-   The first image-text dataset with page-level metadata and contextual information.\n-   A diverse collection of concepts and real-world entities.\n-   Brings forth challenging real-world test sets.\n\nYou can learn more about the WIT dataset from our\n[arXiv paper](https://arxiv.org/abs/2103.01913).\n\n## Latest Updates\n\n2021 April: We are happy to share that our paper was accepted at the [SIGIR Conference](https://sigir.org/sigir2021/call-for-resource-papers/). On the ACM site, you can find our [paper, slides and presentation](https://dl.acm.org/doi/abs/10.1145/3404835.3463257).\n\n2021 September: [WIT Image-Text Competition](https://www.kaggle.com/c/wikipedia-image-caption/overview) is live on Kaggle. Our collaborators from Wikimedia Research [blogged](https://techblog.wikimedia.org/2021/09/09/the-wikipedia-image-caption-matching-challenge-and-a-huge-release-of-image-data-for-research/) about this, and they have made available the raw pixels and ResNet-50 embeddings for the images in this set. Here is our [Google AI blog post](https://ai.googleblog.com/2021/09/announcing-wit-wikipedia-based-image.html).\n\n2022 April: We are happy to share that the WIT paper and dataset were awarded the **Wikimedia Foundation's Research Award of the Year** ([tweet 1](https://twitter.com/WikiResearch/status/1518640500000972800), [tweet 2](https://twitter.com/wikiworkshop/status/1518639913813565441)). We are deeply honored and thank you for the recognition.\n\n2022 May: We have released the WIT validation set and test set. 
Please see the [data](DATA.md) page for download links.\n\n2022 Oct: [Authoring Tools for Multimedia Content](https://trec.nist.gov/pubs/call2023.html) proposal accepted at TREC 2023\n\n2023 Apr: [AToMiC](https://arxiv.org/abs/2304.01961) accepted at SIGIR 2023.\n\n2023 Apr: [WikiWeb2M Dataset](wikiweb2m.md) released.\n\n2023 May: Accepted submissions at [WikiWorkshop 2023](https://wikiworkshop.org/2023/).\n\n-  WikiWeb2M: A Page-Level Multimodal Wikipedia Dataset ([pdf](https://wikiworkshop.org/2023/papers/WikiWorkshop2023_paper_10.pdf), [arXiv](https://arxiv.org/abs/2305.05432))\n-  Building Authoring Tools for Multimedia Content with Human-in-the-loop Relevance Annotations ([pdf](https://wikiworkshop.org/2023/papers/WikiWorkshop2023_paper_57.pdf))\n-  Characterizing Image Accessibility on Wikipedia across Languages ([pdf](https://wikiworkshop.org/2023/papers/WikiWorkshop2023_paper_25.pdf))\n\n\n## WIT Example\n\n## Wikipedia Page\n\nFor example, let's take the Wikipedia page for\n[Half Dome, Yosemite in CA](https://en.wikipedia.org/wiki/Half_Dome).\n\n![WIT Wikipedia Half Dome Image](images/wit_half_dome_wiki.png)\n\n[From the Wikipedia page for Half Dome : Photo by DAVID ILIFF. 
License: CC BY-SA 3.0](https://en.wikipedia.org/wiki/Half_Dome#/media/File:Half_Dome_from_Glacier_Point,_Yosemite_NP_-_Diliff.jpg)\n\n## Wikipedia Page with Annotations of what we can extract\n\nFrom this page, we highlight the various key pieces of data that we can\nextract: images, their respective text snippets, and some contextual metadata.\n\n![WIT Half Dome Page with Annotations](images/wit_take2_half_dome_with_annotations.png)\n\nBy extracting and filtering these carefully, we get a clean, high-quality\nimage-text example that can be used in multimodal modeling.\n\n\u003c!-- ![WIT Half Dome Data](images/wit_half_dome_wiki_and_wit.png) --\u003e\n\n## Motivation\n\nMultimodal visio-linguistic models rely on a rich dataset to help them learn to\nmodel the relationship between images and texts. Having large image-text\ndatasets can significantly improve performance, as shown by recent works.\nFurthermore, the lack of language coverage in existing datasets (which are mostly\nEnglish-only) also impedes research in the multilingual multimodal space – we\nconsider this a lost opportunity given the potential shown in leveraging images\n(as a language-agnostic medium) to help improve our multilingual textual\nunderstanding.\n\nTo address these challenges and advance research on multilingual, multimodal\nlearning, we created the Wikipedia-based Image Text (WIT) Dataset. WIT was created\nby extracting multiple texts associated with an image (e.g., as shown\nin the above image) from Wikipedia articles and Wikimedia image links. 
This was\naccompanied by rigorous filtering to only retain high quality image-text sets.\n\nThe resulting dataset contains over 37.6 million image-text sets – making WIT\nthe largest multimodal dataset (publicly available at the time of this writing) \nwith unparalleled multilingual coverage – with 12K+ examples in each of \n108 languages (53 languages have 100K+ image-text pairs).\n\n## WIT: Dataset Numbers\n\nType          | Train  | Val    | Test   | Total / Unique\n------------- | ------ | ------ | ------ | --------------\nRows / Tuples | 37.13M | 261.8K | 210.7K | 37.6M\nUnique Images | 11.4M  | 58K    | 57K    | 11.5M\nRef. Text     | 16.9M  | 150K   | 104K   | 17.2M / 16.7M\nAttr. Text    | 34.8M  | 193K   | 200K   | 35.2M / 10.9M\nAlt Text      | 5.3M   | 29K    | 29K    | 5.4M / 5.3M\nContext Texts | -      | -      | -      | 119.8M\n\n### WIT: Image-Text Stats by Language\n\nImage-Text   | # Lang | Uniq. Images  | # Lang\n------------ | ------ | ------------- | ------\ntotal \u003e 1M   | 9      | images \u003e 1M   | 6\ntotal \u003e 500K | 10     | images \u003e 500K | 12\ntotal \u003e 100K | 36     | images \u003e 100K | 35\ntotal \u003e 50K  | 15     | images \u003e 50K  | 17\ntotal \u003e 14K  | 38     | images \u003e 13K  | 38\n\n## Get WIT\n\nWe believe that such a powerful diverse dataset will aid researchers in building\nbetter multimodal multilingual models and in identifying better learning and\nrepresentation techniques leading to improvement of Machine Learning models in\nreal-world tasks over visio-linguistic data.\n\nWIT Dataset is now available for download. 
Please check the [data](DATA.md) page.\n\n## Citing WIT\n\nIf you use the WIT dataset, you can cite our work as follows.\n\n```\n@inproceedings{10.1145/3404835.3463257,\nauthor = {Srinivasan, Krishna and Raman, Karthik and Chen, Jiecao and Bendersky, Michael and Najork, Marc},\ntitle = {WIT: Wikipedia-Based Image Text Dataset for Multimodal Multilingual Machine Learning},\nyear = {2021},\nisbn = {9781450380379},\npublisher = {Association for Computing Machinery},\naddress = {New York, NY, USA},\nurl = {https://doi.org/10.1145/3404835.3463257},\ndoi = {10.1145/3404835.3463257},\nbooktitle = {Proceedings of the 44th International ACM SIGIR Conference on Research and Development in Information Retrieval},\npages = {2443–2449},\nnumpages = {7},\nkeywords = {dataset, multimodal, machine learning, wikipedia, multilingual, image-text retrieval, neural networks},\nlocation = {Virtual Event, Canada},\nseries = {SIGIR '21}\n}\n```\n\n## License\n\nThis data is available under the [Creative Commons Attribution-ShareAlike 3.0 Unported](LICENSE) license.\n\n## Projects using WIT\n\nSee [MURAL](https://github.com/google-research-datasets/wit/tree/main/mural) for information regarding the MURAL (Multimodal, Multitask Retrieval Across Languages) paper, accepted at EMNLP 2021.\n\n## Contact\n\nFor any questions, please contact wit-dataset@google.com. For questions to the first author, Krishna, please reach out via their personal page [krishna2.com](https://krishna2.com) for contact information.\n\nIf the WIT dataset is useful to you, please do write to us about it. Whether it is a blog post, a research project, or a paper, we would be delighted to learn about it.\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fgoogle-research-datasets%2Fwit","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fgoogle-research-datasets%2Fwit","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fgoogle-research-datasets%2Fwit/lists"}