{"id":15034089,"url":"https://github.com/juand-r/entity-recognition-datasets","last_synced_at":"2025-05-14T12:08:15.762Z","repository":{"id":37926461,"uuid":"147035039","full_name":"juand-r/entity-recognition-datasets","owner":"juand-r","description":"A collection of corpora for named entity recognition (NER) and entity recognition tasks. These annotated datasets cover a variety of languages, domains and entity types.","archived":false,"fork":false,"pushed_at":"2024-11-29T05:05:17.000Z","size":2729,"stargazers_count":1533,"open_issues_count":8,"forks_count":248,"subscribers_count":40,"default_branch":"master","last_synced_at":"2025-04-12T03:44:44.776Z","etag":null,"topics":["annotations","corpora","datasets","entity-extraction","entity-recognition","named-entity-recognition","natural-language-processing","ner","nlp","nlp-resources"],"latest_commit_sha":null,"homepage":"","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/juand-r.png","metadata":{"files":{"readme":"README.rst","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2018-09-01T21:47:11.000Z","updated_at":"2025-04-09T05:27:20.000Z","dependencies_parsed_at":"2025-01-09T19:32:08.407Z","dependency_job_id":"c457520a-36ad-4f4a-91d5-172f8236e05c","html_url":"https://github.com/juand-r/entity-recognition-datasets","commit_stats":{"total_commits":246,"total_committers":12,"mean_commits":20.5,"dds":"0.34959349593495936","last_synced_commit":"f2516d3b0a7f69204ff39223ab96f5602773716f"},"previous_names":[],"tags_count":0,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hos
ts/GitHub/repositories/juand-r%2Fentity-recognition-datasets","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/juand-r%2Fentity-recognition-datasets/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/juand-r%2Fentity-recognition-datasets/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/juand-r%2Fentity-recognition-datasets/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/juand-r","download_url":"https://codeload.github.com/juand-r/entity-recognition-datasets/tar.gz/refs/heads/master","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":254140756,"owners_count":22021219,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["annotations","corpora","datasets","entity-extraction","entity-recognition","named-entity-recognition","natural-language-processing","ner","nlp","nlp-resources"],"created_at":"2024-09-24T20:23:53.650Z","updated_at":"2025-05-14T12:08:15.730Z","avatar_url":"https://github.com/juand-r.png","language":"Python","funding_links":[],"categories":["Python"],"sub_categories":[],"readme":"===============================\nDatasets for Entity Recognition\n===============================\n\nThis repository contains datasets from several domains\nannotated with a variety of entity types, useful for entity recognition and\nnamed entity recognition (NER) tasks.\n\n\n**NOTE: I am no longer actively adding datasets to this list -- there are likely more NER datasets that have appeared since 2020. 
However, I am happy to add more datasets via issues or pull requests.**\n\nDatasets for NER in English\n===========================\n\n.. |check| unicode:: 0x2714\n\nThe following table shows the list of datasets for English-language entity recognition (for a list of NER datasets in other languages, see below). The `data` directory\ncontains information on where to obtain those datasets which could not be shared\ndue to licensing restrictions, as well as code to convert them (if necessary)\nto the CoNLL 2003 format. Links to NER corpora in other languages\nare also listed below.\n\n============== =============== ======================= =============================== ==================================\nDataset         Domain            License                 Reference                       Availability\n============== =============== ======================= =============================== ==================================\nCONLL 2003      News               DUA                  Sang and Meulder, 2003          `Easy \u003chttps://github.com/patverga/torch-ner-nlp-from-scratch/tree/master/data/conll2003/\u003e`_ `to \u003chttps://github.com/synalp/NER/tree/master/corpus/CoNLL-2003\u003e`_ `find \u003chttps://github.com/glample/tagger/tree/master/dataset\u003e`_\nNIST-IEER       News               None                 NIST 1999 IE-ER                 `NLTK data \u003chttps://raw.githubusercontent.com/nltk/nltk_data/gh-pages/packages/corpora/ieer.zip\u003e`_\nMUC-6           News               LDC                  Grishman and Sundheim, 1996     `LDC 2003T13 \u003chttps://catalog.ldc.upenn.edu/LDC2003T13\u003e`_\nOntoNotes 5     Various            LDC                  Weischedel et al., 2013         `LDC 2013T19 \u003chttps://catalog.ldc.upenn.edu/LDC2013T19\u003e`_\nBBN             Various            LDC                  Weischedel and Brunstein, 2005    `LDC 2005T33 \u003chttps://catalog.ldc.upenn.edu/LDC2005T33\u003e`_\nGMB-1.0.0       Various            None      
           Bos et al., 2017                `http://gmb.let.rug.nl/data.php \u003chttp://gmb.let.rug.nl/releases/gmb-1.0.0.zip\u003e`_\nGUM-3.1.0       Wiki               Several (*2)         Zeldes, 2016                    |check| Included here\nwikigold        Wikipedia          CC-BY 4.0            Balasuriya et al., 2009         |check| Included here\nRitter          Twitter            None                 Ritter et al., 2011             `No split \u003chttps://github.com/aritter/twitter_nlp/blob/master/data/annotated/ner.txt\u003e`_ , `Train/test/dev split \u003chttps://github.com/aritter/twitter_nlp/tree/master/data/annotated/wnut16/data\u003e`_\nBTC             Twitter            CC-BY 4.0            Derczynski et al., 2016         |check| Included here\nWNUT17          Social media       CC-BY 4.0            Derczynski et al., 2017         |check| Included here\ni2b2-2006       Medical            DUA                  Uzuner et al., 2007             `http://www.i2b2.org \u003chttps://www.i2b2.org/NLP/DataSets/Main.php\u003e`_\ni2b2-2014       Medical            DUA                  Stubbs et al., 2015             `http://www.i2b2.org \u003chttps://www.i2b2.org/NLP/DataSets/Main.php\u003e`_\nCADEC           Medical            CSIRO                Karimi et al., 2015             http://data.csiro.au/\nAnEM            Anatomical         CC-BY-SA 3.0         Ohta et al., 2012               |check| Included here\nMITRestaurant   Queries            None                 Liu et al., 2013a               `http://groups.csail.mit.edu/sls/ \u003chttps://groups.csail.mit.edu/sls/downloads/restaurant/\u003e`_\nMITMovie        Queries            None                 Liu et al., 2013b               `http://groups.csail.mit.edu/sls/ \u003chttps://groups.csail.mit.edu/sls/downloads/movie/\u003e`_\nMalwareTextDB   Malware            None                 Lim et al., 2017                `http://www.statnlp.org/ 
\u003chttp://www.statnlp.org/research/re/MalwareTextDB-1.0.zip\u003e`_\nre3d            Defense            Several (*1)         DSTL, 2017                      |check| Included here\nSEC-filings     Finance            CC-BY 3.0            Alvarado et al., 2015           |check| Included here\nAssembly        Robotics           X                    Costa et al., 2017              X\nWikiNEuRal      Wikipedia          CC BY-SA-NC 4.0      Tedeschi et al., 2021           https://github.com/Babelscape/wikineural\nMultiNERD       Wikipedia          CC BY-SA-NC 4.0      Tedeschi et al., 2022           https://github.com/Babelscape/multinerd\nHIPE-2022       Historical         CC BY-SA-NC 4.0      Ehrmann et al., 2022            https://github.com/hipe-eval/HIPE-2022-data\nMusic-NER       Music              MIT                  Epure and Hennequin, 2023       https://github.com/deezer/music-ner-eacl2023\nWIESP2022-NER   Astrophysics       CC BY-SA-NC 4.0      Grezes et al., 2022             https://huggingface.co/datasets/adsabs/WIESP2022-NER\nNNE             News               CC 4.0 / LDC         Ringland et al., 2019           https://github.com/nickyringland/nested_named_entities\nWorldWide       News               CC BY-SA-NC 4.0      Shan et al., 2023               https://github.com/stanfordnlp/en-worldwide-newswire   https://arxiv.org/abs/2404.13465\n============== =============== ======================= =============================== ==================================\n\nLicenses\n========\n\nNotes on licenses:\n\n(1) re3d (\"Relationship and Entity Extraction Evaluation Dataset\") contains\nseveral datasets, with different licenses. 
These are:\n\n  - CC-BY-SA 3.0 (Wikipedia dataset)\n  - CC BY-NC 3.0 (BBC_Online dataset)\n  - CC BY 3.0 AU (Australian_Department_of_Foreign_Affairs dataset)\n  - public domain (US_State_Department dataset, CENTCOM dataset)\n  - UK Open Government Licence v3.0 (UK_Government dataset)\n  - Delegation_of_the_European_Union_to_Syria: see\n    https://eeas.europa.eu/delegations/syria/8157/legal-notice_en\n\n(2) GUM 3.1.0 comprises three datasets, with licenses CC-BY 3.0, CC-BY-SA 3.0 and\n    CC-BY-NC-SA 3.0. The annotations are licensed under CC-BY 4.0.\n\nMore detailed license information for each dataset can be found in\nthe corresponding subdirectory.\n\nLater ...\n- Tabassum et al., Code and Named Entity Recognition in StackOverflow https://cocoxu.github.io/publications/ACL2020_stackoverflow_NER.pdf\n- LitBank: https://github.com/dbamman/litbank (Bamman, Popat and Shen, An Annotated Dataset of Literary Entities, NAACL 2019)\n- NNE: A Dataset for Nested Named Entity Recognition in English Newswire, 2019 https://github.com/nickyringland/nested_named_entities\n- Mars Target Encyclopedia - LPSC abstracts labeled data set:  https://zenodo.org/record/1048419#.W5a2CBwnZhE\n- Best Buy queries: https://www.kaggle.com/dataturks/best-buy-ecommerce-ner-dataset/home\n- Resume entities for NER: https://www.kaggle.com/dataturks/resume-entities-for-ner/home\n- FEW-NERD: A Few-shot Named Entity Recognition Dataset https://aclanthology.org/2021.acl-long.248/\n\n\n\nDatasets for NER in other languages\n===================================\n\nLexical Named Entity resources\n------------------------------\n\n- HeiNER: http://heiner.cl.uni-heidelberg.de/index.shtml\n- NECKAr: https://event.ifi.uni-heidelberg.de/?page_id=532#Wikidata_NE_dataset\n\nCode-Switching\n--------------\n\n- English-Spanish tweets (CALCS 2018): https://code-switching.github.io/2018/ ; https://code-switching.github.io/2018/files/spa-eng/Release.zip ; http://www.aclweb.org/anthology/W18-3219\n- Arabic-Egyptian 
tweets (CALCS 2018): https://code-switching.github.io/2018/ ; https://code-switching.github.io/2018/files/msa-egy/ArabicTweetsTokenAssigner.zip ; http://www.aclweb.org/anthology/W18-3219\n- Hindi-English social media text: https://github.com/SilentFlame/Named-Entity-Recognition ; http://aclweb.org/anthology/W18-2405\n- EMNLP 2014 Shared Task - Code-Switched Tweets (Nepali-English, Spanish-English, Mandarin-English, Arabic-Arabic dialects): http://emnlp2014.org/workshops/CodeSwitch/call.html\n\nGerman\n------\n\n- CoNLL 2003 (English, German): https://www.clips.uantwerpen.be/conll2003/ner/\n- GermEval 2014: https://sites.google.com/site/germeval2014ner/data\n- Tübingen Treebank of Written German (TüBa-D/Z): http://www.sfs.uni-tuebingen.de/en/ascl/resources/corpora/tueba-dz.html\n- Europeana Newspapers (Dutch, French, German): https://github.com/EuropeanaNewspapers/ner-corpora ; http://lab.kb.nl/dataset/europeana-newspapers-ner#access\n- German EUROPARL transcripts (subset): https://nlpado.de/~sebastian/software/ner_german.shtml\n- Named Entity Model for German, Politics (NEMGP): https://www.thomas-zastrow.de/nlp/\n- WikiNER: https://figshare.com/articles/Learning_multilingual_named_entity_recognition_from_Wikipedia/5462500\n- WikiNEuRal: https://github.com/Babelscape/wikineural\n- MultiNERD: https://github.com/Babelscape/multinerd\n- DFKI SmartData Corpus (geo-entities): https://dfki-lt-re-group.bitbucket.io/smartdata-corpus/ (A German Corpus for Fine-Grained Named Entity Recognition and Relation Extraction of Traffic and Industry Events. Martin Schiersch, Veselina Mironova, Maximilian Schmitt, Philippe Thomas, Aleksandra Gabryszak, Leonhard Hennig. 
Proceedings of LREC, 2018)\n- DBpedia abstract corpus (English, German, Dutch, French, Italian, Japanese): http://downloads.dbpedia.org/2015-04/ext/nlp/abstracts/\n- DAWT dataset - Densely Annotated Wikipedia Texts across multiple languages (English, Spanish, French, Italian, German, Arabic): https://github.com/klout/opendata/tree/master/wiki_annotation\n- Elena Leitner, Georg Rehm, Julián Moreno-Schneider, A Dataset of German Legal Documents for Named Entity Recognition, LREC 2020: http://georg-re.hm/pdf/LREC-2020-Leitner-et-al-preprint.pdf ; Data: https://github.com/elenanereiss/Legal-Entity-Recognition\n- HIPE-2022, named entity recognition and entity linking in multilingual historical documents: https://hipe-eval.github.io/HIPE-2022/ https://github.com/hipe-eval/HIPE-2022-data\n\nDutch\n-----\n\n- CoNLL 2002 (Spanish, Dutch): https://www.clips.uantwerpen.be/conll2002/ner/\n- Europeana Newspapers (Dutch, French, German): https://github.com/EuropeanaNewspapers/ner-corpora ; http://lab.kb.nl/dataset/europeana-newspapers-ner#access\n- MEANTIME Corpus (Parallel corpus: English, Spanish, Italian, Dutch): http://www.newsreader-project.eu/results/data/wikinews/\n- WikiNER: https://figshare.com/articles/Learning_multilingual_named_entity_recognition_from_Wikipedia/5462500\n- WikiNEuRal: https://github.com/Babelscape/wikineural\n- MultiNERD: https://github.com/Babelscape/multinerd\n- DBpedia abstract corpus (English, German, Dutch, French, Italian, Japanese): http://downloads.dbpedia.org/2015-04/ext/nlp/abstracts/\n- Dutch parliamentary documents 2015-2016, from 1848.nl (Jonkers, Named Entity Recognition on Dutch Parliamentary Documents using Frog, thesis, University of Amsterdam, 2016): https://github.com/Poezedoez/NER/blob/master/Code/data/lobby/golden_standard\n- SONAR 1 - Desmet and Hoste, Fine-grained Dutch named entity recognition, 2014 (hierarchy of classes)\n- Corpus-SONAR books and Corpus Gutenberg Dutch: http://blog.namescape.nl/?page_id=85 ; 
http://portal.clarin.nl/node/1940\n\nAfrikaans\n---------\n\n- NCHLT Afrikaans Named Entity Annotated Corpus: https://repo.sadilar.org/handle/20.500.12185/299\n\nSpanish\n-------\n\n- CoNLL 2002 (Spanish, Dutch): https://www.clips.uantwerpen.be/conll2002/ner/\n- AnCora (Spanish, Catalan): http://clic.ub.edu/corpus/en\n- DEFT Spanish Treebank (LDC2018T01): https://catalog.ldc.upenn.edu/LDC2018T01\n- PANACEA (LAB): http://panacea-lr.eu/en/info-for-researchers/data-sets/dependency-parsed-corpora/dependency-lab-es\n- PANACEA (ENV): http://panacea-lr.eu/en/info-for-researchers/data-sets/dependency-parsed-corpora/dependency-env-es\n- MEANTIME Corpus (Parallel corpus: English, Spanish, Italian, Dutch): http://www.newsreader-project.eu/results/data/wikinews/\n- ACE 2007 (Spanish and Arabic): https://catalog.ldc.upenn.edu/LDC2014T18\n- WikiNER: https://figshare.com/articles/Learning_multilingual_named_entity_recognition_from_Wikipedia/5462500\n- WikiNEuRal: https://github.com/Babelscape/wikineural\n- MultiNERD: https://github.com/Babelscape/multinerd\n- http://www.grupolys.org/~marcos/pub/lrec16.tar.bz2 (used in \"Incorporating Lexico-semantic Heuristics into Coreference Resolution Sieves for Named Entity Recognition at Document-level\")\n- Multilingual corpora with coreferential annotation of person entities (Spanish, Galician, Portuguese): http://gramatica.usc.es/~marcos/lrec.tar.bz2 \n- DrugSemantics Gold Standard (Moreno et al., DrugSemantics: A corpus for Named Entity Recognition in Spanish Summaries of Product Characteristics, 2017): https://data.mendeley.com/datasets/fwc7jrc5jr/1\n- DBpedia abstract corpus (English, German, Dutch, French, Italian, Japanese): http://downloads.dbpedia.org/2015-04/ext/nlp/abstracts/\n- DAWT dataset - Densely Annotated Wikipedia Texts across multiple languages (English, Spanish, French, Italian, German, Arabic): https://github.com/klout/opendata/tree/master/wiki_annotation\n- CANTEMIST (CANcer TExt Mining Shared Task – tumor named entity 
recognition) - named entity recognition of a critical type of concept related to cancer, namely tumor morphology in Spanish medical texts: https://temu.bsc.es/cantemist/\n\nCatalan\n-------\n\n- AnCora (Spanish, Catalan): http://clic.ub.edu/corpus/en\n\nGalician\n--------\n\n- Galician NER corpus: https://gramatica.usc.es/~marcos/resources/corpus_gal_nec.txt.gz\n- Multilingual corpora with coreferential annotation of person entities (Spanish, Galician, Portuguese): http://gramatica.usc.es/~marcos/lrec.tar.bz2 \n\nBasque\n------\n\n- Basque Named Entities Corpus (EIEC): http://ixa.eus/node/4486?language=en\n- Basque Disambiguated Named Entities Corpus (EDIEC): http://ixa.si.ehu.es/node/4485?language=en\n- Egunkaria 2000 corpus (383 newswire texts), mentioned in http://qtleap.eu/wp-content/uploads/2014/04/QTLEAP-2013-D5.1.pdf\n\nPortuguese\n----------\n\n- HAREM: https://www.linguateca.pt/aval_conjunta/HAREM/harem_ing.html\n- CINTIL corpus: http://cintil.ul.pt/cintilfeatures.html#corpus\n- WikiNER: https://figshare.com/articles/Learning_multilingual_named_entity_recognition_from_Wikipedia/5462500\n- WikiNEuRal: https://github.com/Babelscape/wikineural\n- MultiNERD: https://github.com/Babelscape/multinerd\n- Multilingual corpora with coreferential annotation of person entities (Spanish, Galician, Portuguese): http://gramatica.usc.es/~marcos/lrec.tar.bz2 \n- Bosque 8.0 EAGLES format: https://gramatica.usc.es/~marcos/resources/corpora_FLpt.tgz\n- LeNER-Br (Brazilian legal documents): https://cic.unb.br/~teodecampos/LeNER-Br/\n- Paramopama: a Brazilian-Portuguese Corpus for Named Entity Recognition\n\nFrench\n------\n\n- ESTER: http://catalogue.elra.info/en-us/repository/browse/ELRA-S0241/\n- ESTER 2: http://catalogue.elra.info/en-us/repository/browse/ELRA-S0338/\n- ETAPE: http://catalogue.elra.info/en-us/repository/browse/ELRA-E0046/\n- Europeana Newspapers (Dutch, French, German): https://github.com/EuropeanaNewspapers/ner-corpora ; 
http://lab.kb.nl/dataset/europeana-newspapers-ner#access\n- QUAERO French Medical Corpus: https://quaerofrenchmed.limsi.fr/\n- Quaero Broadcast News Extended Named Entity Corpus: http://catalog.elra.info/en-us/repository/browse/ELRA-S0349/\n- Quaero Old Press Extended Named Entity corpus: http://catalog.elra.info/en-us/repository/browse/ELRA-W0073/ \n- WikiNER: https://figshare.com/articles/Learning_multilingual_named_entity_recognition_from_Wikipedia/5462500\n- WikiNER-fr-gold  https://arxiv.org/abs/2411.00030  https://huggingface.co/datasets/danrun/WikiNER-fr-gold\n- WikiNEuRal: https://github.com/Babelscape/wikineural\n- MultiNERD: https://github.com/Babelscape/multinerd\n- DBpedia abstract corpus (English, German, Dutch, French, Italian, Japanese): http://downloads.dbpedia.org/2015-04/ext/nlp/abstracts/\n- DAWT dataset - Densely Annotated Wikipedia Texts across multiple languages (English, Spanish, French, Italian, German, Arabic): https://github.com/klout/opendata/tree/master/wiki_annotation\n- CAp 2017 - (Twitter data), Lopez et al., CAp 2017 challenge: Twitter Named Entity Recognition, 2017: http://cap2017.imag.fr/competition.html\n- HIPE-2022, named entity recognition and entity linking in multilingual historical documents: https://hipe-eval.github.io/HIPE-2022/ https://github.com/hipe-eval/HIPE-2022-data\n\n\nItalian\n-------\n\n- KIND: https://github.com/dhfbk/KIND\n- Evalita: http://www.evalita.it/2009/tasks/entity\n- MEANTIME Corpus (Parallel corpus: English, Spanish, Italian, Dutch): http://www.newsreader-project.eu/results/data/wikinews/\n- PANACEA (ENV): http://panacea-lr.eu/en/info-for-researchers/data-sets/dependency-parsed-corpora/dependency-env-it\n- PANACEA (LAB): http://panacea-lr.eu/en/info-for-researchers/data-sets/dependency-parsed-corpora/dependency-lab-it\n- WikiNER: https://figshare.com/articles/Learning_multilingual_named_entity_recognition_from_Wikipedia/5462500\n- WikiNEuRal: https://github.com/Babelscape/wikineural\n- MultiNERD: 
https://github.com/Babelscape/multinerd\n- DBpedia abstract corpus (English, German, Dutch, French, Italian, Japanese): http://downloads.dbpedia.org/2015-04/ext/nlp/abstracts/\n- DAWT dataset - Densely Annotated Wikipedia Texts across multiple languages (English, Spanish, French, Italian, German, Arabic): https://github.com/klout/opendata/tree/master/wiki_annotation\n\nRomanian\n--------\n\n- RONEC (Dumitrescu and Avram, Introducing RONEC - the Romanian Named Entity Corpus. LREC 2020). Paper: https://arxiv.org/pdf/1909.01247.pdf Data: https://github.com/dumitrescustefan/ronec\n- Romanian journalistic corpus (ROCO): http://metashare.elda.org/repository/browse/romanian-journalistic-corpus-roco/038baa80dc7311e5aa0b00237df3e3583781d7c0f2084057aa018a2d63d987e9/\n- Romanian Balanced Corpus (ROMBAC): http://metashare.elda.org/repository/browse/romanian-balanced-corpus-rombac/0a7dd85edc7311e5aa0b00237df3e35873a0d662435d42dd94fba48c29dc0065/\n\nGreek\n-----\n\n- PANACEA (ENV): http://panacea-lr.eu/en/info-for-researchers/data-sets/dependency-parsed-corpora/dependency-env-el\n- PANACEA (LAB): http://panacea-lr.eu/en/info-for-researchers/data-sets/dependency-parsed-corpora/dependency-lab-el\n\nHungarian\n---------\n\n- Hungarian Named Entity Corpora: http://rgai.inf.u-szeged.hu/index.php?lang=en\u0026page=corpus_ne\n- hunNERwiki: http://hlt.sztaki.hu/resources/hunnerwiki.html\n- NYTK: https://github.com/nytud/NYTK-NerKor\n\nCzech\n-----\n\n- Czech Named Entity Corpus: http://ufal.mff.cuni.cz/cnec\n- BSNLP 2017 (Croatian, Czech, Polish, Russian, Slovak, Slovene, Ukrainian): http://bsnlp-2017.cs.helsinki.fi/shared_task_results.html\n- CzEng 1.0 (Parallel corpus: Czech-English): http://ufal.mff.cuni.cz/czeng/czeng10\n- PERO OCR NER (Czech historical OCR chronicles): https://github.com/roman-janik/PONER  https://dspace.vut.cz/items/6092e1b0-3d75-4451-8582-28573ac30404\n\nPolish\n------\n\n- The Polish Sejm Corpus: http://clip.ipipan.waw.pl/PSC\n- BSNLP 2017 (Croatian, Czech, 
Polish, Russian, Slovak, Slovene, Ukrainian): http://bsnlp-2017.cs.helsinki.fi/shared_task_results.html\n- Polish Coreference Corpus: http://zil.ipipan.waw.pl/PolishCoreferenceCorpus\n- WikiNER: https://figshare.com/articles/Learning_multilingual_named_entity_recognition_from_Wikipedia/5462500\n- WikiNEuRal: https://github.com/Babelscape/wikineural\n- MultiNERD: https://github.com/Babelscape/multinerd\n- Corpus of Economic News (CEN Corpus): http://www.nlp.pwr.wroc.pl/narzedzia-i-zasoby/zasoby/cen\n- KPWr (Korpus Języka Polskiego Politechniki Wrocławskiej/Polish Corpus of Wrocław University of Technology): http://plwordnet.pwr.wroc.pl/index.php?option=com_content\u0026view=article\u0026id=35\u0026Itemid=181\u0026lang=pl ; http://plwordnet.pwr.wroc.pl/attachments/article/35/kpwr-1.1.7z (Broda et al., KPWr: Towards a Free Corpus of Polish, 2012)\n- NKJP: http://clip.ipipan.waw.pl/NationalCorpusOfPolish?action=AttachFile\u0026do=view\u0026target=NKJP-PodkorpusMilionowy-1.2.tar.gz\n\nCroatian\n--------\n\n- hr500k 1.0:  http://hdl.handle.net/11356/1183\n- BSNLP 2017 (Croatian, Czech, Polish, Russian, Slovak, Slovene, Ukrainian): http://bsnlp-2017.cs.helsinki.fi/shared_task_results.html\n- ReLDI-NormTagNER-hr (Croatian tweets): http://hdl.handle.net/11356/1170\n\nSlovak\n------\n\n- BSNLP 2017 (Croatian, Czech, Polish, Russian, Slovak, Slovene, Ukrainian): http://bsnlp-2017.cs.helsinki.fi/shared_task_results.html\n- Slovak Categorized News Corpus: https://nlp.web.tuke.sk/pages/categorizednews\n\nSlovene\n-------\n\n- BSNLP 2017 (Croatian, Czech, Polish, Russian, Slovak, Slovene, Ukrainian): http://bsnlp-2017.cs.helsinki.fi/shared_task_results.html\n- ssj500k:  http://www.slovenscina.eu/tehnologije/ucni-korpus ; http://eng.slovenscina.eu/tehnologije/ucni-korpus ; https://www.clarin.si/repository/xmlui/handle/11356/1029 ;  NOTE: for v 2.2 see: http://hdl.handle.net/11356/1210\n- Slovene news: http://zitnik.si/mediawiki/index.php?title=Datasets#Slovene_news ; 
http://zitnik.si/mediawiki/images/7/7d/Rtvslo_dec2011.tsv ; http://zitnik.si/mediawiki/images/5/5e/Rtvslo_dec2011_v2.tsv\n- Janes-Tag 2.0 (social media text): https://www.clarin.si/repository/xmlui/handle/11356/1123 ; see also: Fišer et al., The Janes project: language resources and tools for Slovene user generated content, 2018.\n\nUkrainian\n---------\n\n- BSNLP 2017 (Croatian, Czech, Polish, Russian, Slovak, Slovene, Ukrainian): http://bsnlp-2017.cs.helsinki.fi/shared_task_results.html\n- Ukrainian Brown NER Corpus: https://github.com/lang-uk/ner-uk ; http://lang.org.ua/en/corpora/\n\nSerbian\n-------\n\n- SETimes.SR: http://hdl.handle.net/11356/1200\n- Named Entities evaluation corpus for Serbian: http://www.korpus.matf.bg.ac.rs/SrpNEval/\n- ReLDI-NormTagNER-sr (Serbian tweets): http://hdl.handle.net/11356/1171\n\nBulgarian\n---------\n\n- BulTreeBank (BTB)\n\nIcelandic\n---------\n\n- MIM-GOLD-NER (Ingólfsdóttir, Svanhvít Lilja, Sigurjón Þorsteinsson, and Hrafn Loftsson. \"Towards High Accuracy Named Entity Recognition for Icelandic.\" Proceedings of the 22nd Nordic Conference on Computational Linguistics. 2019): http://www.malfong.is/index.php?pg=mim_gold_ner\n\nDanish\n------\n\n- DaNE: Hvingelby et al., `DaNE: A Named Entity Resource for Danish \u003chttp://www.lrec-conf.org/proceedings/lrec2020/pdf/2020.lrec-1.565.pdf\u003e`_, LREC 2020: https://github.com/alexandrainst/danlp/\n- Danish Propbank (DPB): http://catalog.elra.info/en-us/repository/browse/ELRA-W0117/\n- Arboretum treebank: http://catalog.elra.info/en-us/repository/browse/ELRA-W0084/\n\nNorwegian\n---------\n\n- Bjarte Johansen, Named-Entity Recognition for Norwegian, Proceedings of the 22nd Nordic Conference on Computational Linguistics. 2019 (https://www.aclweb.org/anthology/W19-6123.pdf) Data: https://github.com/ljos/navnkjenner\n- Fredrik Jørgensen et al., NorNE: Annotating Named Entities for Norwegian, 2019 (https://arxiv.org/pdf/1911.12146.pdf). 
Data: https://github.com/ltgoslo/norne/ ; https://www.nb.no/sprakbanken/show?serial=oai%3Anb.no%3Asbr-49\n\nSwedish\n-------\n\n- Stockholm Internet Corpus: https://www.ling.su.se/english/nlp/corpora-and-resources/sic\n- SUC 3.0: https://spraakbanken.gu.se/eng/resource/suc3\n- Swedish manually annotated NER: https://github.com/klintan/swedish-ner-corpus/\n- Medical Wikipedia data (Almgren et al., Named Entity Recognition in Swedish Health Records with Character-Based Deep Bidirectional LSTMs, 2016): https://github.com/olofmogren/biomedical-ner-data-swedish  \n- HIPE-2022, named entity recognition and entity linking in multilingual historical documents: https://hipe-eval.github.io/HIPE-2022/ https://github.com/hipe-eval/HIPE-2022-data\n\n\nFinnish\n-------\n\n- Data sets for Finnish Named Entity Recognition: https://github.com/mpsilfve/finer-data\n- Turku NER corpus: https://github.com/TurkuNLP/turku-ner-corpus\n- HIPE-2022, named entity recognition and entity linking in multilingual historical documents: https://hipe-eval.github.io/HIPE-2022/ https://github.com/hipe-eval/HIPE-2022-data\n\nEstonian\n--------\n\n- Estonian NER corpus: https://metashare.ut.ee/repository/browse/estonian-ner-corpus/88d030c0acde11e2a6e4005056b40024f1def472ed254e77a8952e1003d9f81e/\n\nLatvian and Lithuanian\n----------------------\n\n- https://github.com/accurat-toolkit/TildeNER/tree/master/TEST (Pinnis, Latvian and Lithuanian Named Entity Recognition with TildeNER, LREC 2012)\n- Training data for the LV Tagger: https://github.com/PeterisP/LVTagger/tree/master/NerTrainingData\n\nTurkish\n-------\n\n- Küçük and Can, A Tweet Dataset Annotated for Named Entity Recognition and Stance Detection, 2019: https://github.com/dkucuk/Tweet-Dataset-NER-SD\n- Küçük et al., Named Entity Recognition on Turkish Tweets: http://optima.jrc.it/Resources/2014_JRC_Twitter_TR_NER-dataset.zip\n- English/Turkish Wikipedia Named-Entity Recognition and Text Categorization Dataset 
(http://arxiv.org/abs/1702.02363): https://data.mendeley.com/datasets/cdcztymf4k/1\n- Çoban et al., Named Entity Recognition over FBNER: A New Facebook Dataset in Turkish: https://ieeexplore.ieee.org/document/9598971 (data available for research purposes on request)\n\nKazakh\n------\n\n- KazNERD: https://arxiv.org/pdf/2111.13419.pdf, https://github.com/IS2AI/KazNERD\n\nUyghur\n------\n\n- Uyghur Named Entity Relation corpus: https://github.com/kaharjan/UyNeRel (Abiderexiti et al., Annotation Schemes for Constructing Uyghur Named Entity Relation Corpus. IALP 2016)\n\nArmenian\n--------\n\n- pioNER (gold-standard and silver-standard datasets): https://github.com/ispras-texterra/pioner (Ghukasyan et al., pioNER: Datasets and Baselines for Armenian Named Entity Recognition, 2018)\n- ArmTDP-NER: https://github.com/myavrum/ArmTDP-NER\n\nCoptic\n------\n\n- The Coptic Universal Dependency Treebank: https://github.com/UniversalDependencies/UD_Coptic-Scriptorium/tree/dev (see also https://copticscriptorium.org/treebank.html). 
This contains 46,000 tokens of nested (non-)named and Wikified entities from Sahidic Coptic texts.\n\nAmharic\n-------\n\n- SAY corpus (see \"Named entity recognition for Amharic using deep learning\"): https://github.com/geezorg/data/tree/master/amharic/tagged/nmsu-say ; http://data.geez.org/\n\nArabic\n------\n\n- AQMAR Arabic Wikipedia Named Entity Corpus: http://www.cs.cmu.edu/~ark/ArabicNER/\n- NE3L named entities Arabic corpus (Arabic, Chinese, Russian): http://catalog.elra.info/en-us/repository/browse/ELRA-W0078/\n- REFLEX Entity Translation (Parallel corpus: English, Arabic, Chinese): https://catalog.ldc.upenn.edu/LDC2009T11\n- ANERCorp: http://users.dsic.upv.es/~ybenajiba/downloads.html (See also: http://alias-i.com/lingpipe/demos/tutorial/ne/read-me.html)\n- ACE 2003 (English, Chinese, Arabic): https://catalog.ldc.upenn.edu/LDC2004T09\n- ACE 2004 (English, Chinese, Arabic): https://catalog.ldc.upenn.edu/LDC2005T09\n- ACE 2005 (English, Chinese, Arabic): https://catalog.ldc.upenn.edu/LDC2006T06\n- ACE 2007 (Spanish and Arabic): https://catalog.ldc.upenn.edu/LDC2014T18\n- OntoNotes 5 (English, Arabic, Chinese): https://catalog.ldc.upenn.edu/LDC2013T19\n- DAWT dataset - Densely Annotated Wikipedia Texts across multiple languages (English, Spanish, French, Italian, German, Arabic): https://github.com/klout/opendata/tree/master/wiki_annotation\n- Wojood - 2022 Nested Arabic Named Entity Corpus.  https://dlnlp.ai/st/wojood/  https://aclanthology.org/2022.lrec-1.387.pdf  https://codalab.lisn.upsaclay.fr/competitions/11740\n\nPersian\n-------\n\n- ArmanPersoNERCorpus: http://islrn.org/resources/399-379-640-828-6/ ; https://github.com/HaniehP/PersianNER\n\nSindhi\n------\n\n- SiNER: https://aclanthology.org/2020.lrec-1.361/, https://github.com/AliWazir/SiNER-dataset\n\nUrdu\n----\n\n- IJCNLP 2008 SSEAL: http://ltrc.iiit.ac.in/ner-ssea-08/index.cgi?topic=5\n- UNER Dataset (Khan et al., Named Entity Dataset for Urdu Named Entity Recognition Task, 2016). 
Available at http://www.iiu.edu.pk/?page_id=5181\n- MK-PUCIT: https://www.dropbox.com/sh/1ivw7ykm2tugg94/AAB9t5wnN7FynESpo7TjJW8la ; see: Kanwal et al., Urdu Named Entity Recognition: Corpus Generation and Deep Learning Applications, 2019\n\nIndic\n-----\n\n- Naamapadam: Named Entity Recognition (NER) dataset for 11 major Indian languages from two language families.  https://research.ibm.com/publications/naamapadam-a-large-scale-named-entity-annotated-data-for-indic-languages   https://ai4bharat.iitm.ac.in/naamapadam\n\nHindi\n-----\n\n- HiNER: https://github.com/cfiltnlp/HiNER\n- Hindi Health Dataset: https://www.kaggle.com/aijain/hindi-health-dataset/home\n- FIRE 2015, ESM-IL (English, Hindi, Tamil, Malayalam) : http://au-kbc.org/nlp/ESM-FIRE2015/#traincorpus\n- FIRE NER 2013 (English, Hindi, Tamil, Malayalam, Bengali): http://au-kbc.org/nlp/NER-FIRE2013/\n- IJCNLP 2008 SSEAL: http://ltrc.iiit.ac.in/ner-ssea-08/index.cgi?topic=5\n\nBengali\n-------\n\n- FIRE NER 2013 (English, Hindi, Tamil, Malayalam, Bengali): http://au-kbc.org/nlp/NER-FIRE2013/\n- IJCNLP 2008 SSEAL: http://ltrc.iiit.ac.in/ner-ssea-08/index.cgi?topic=5\n- Bengali-NER: https://github.com/Rifat1493/Bengali-NER, https://ieeexplore.ieee.org/document/8944804\n- NER-Bangla: https://github.com/MISabic/NER-Bangla-Dataset, https://content.iospress.com/articles/journal-of-intelligent-and-fuzzy-systems/ifs179349\n\nTelugu\n------\n\n- NER_Telugu: https://github.com/anikethjr/NER_Telugu\n- IJCNLP 2008 SSEAL: http://ltrc.iiit.ac.in/ner-ssea-08/index.cgi?topic=5\n- Named Entity Annotated Corpora for Telugu: http://www.tdil-dc.in/index.php?option=com_download\u0026task=showresourceDetails\u0026toolid=982\u0026lang=en\n\nMaithili\n--------\n\n- The first named entity recognizer in Maithili: Resource creation and system development: https://content.iospress.com/articles/journal-of-intelligent-and-fuzzy-systems/ifs210051\n\nNepali\n------\n\n- EverestNER: https://journals.flvc.org/FLAIRS/article/view/130725, 
https://github.com/nowalab/everest-ner\n\nMarathi\n-------\n\n- Named Entity Annotated Corpora for Marathi: http://www.tdil-dc.in/index.php?option=com_download\u0026task=showresourceDetails\u0026toolid=979\u0026lang=en\n- L3Cube MahaNER: https://arxiv.org/abs/2204.06029  https://github.com/l3cube-pune/MarathiNLP\n\nPunjabi\n-------\n\n- Named Entity Annotated Corpora for Punjabi: http://www.tdil-dc.in/index.php?option=com_download\u0026task=showresourceDetails\u0026toolid=980\u0026lang=en\n\nTamil\n-----\n\n- FIRE 2015, ESM-IL (English, Hindi, Tamil, Malayalam) : http://au-kbc.org/nlp/ESM-FIRE2015/#traincorpus\n- FIRE NER 2013 (English, Hindi, Tamil, Malayalam, Bengali): http://au-kbc.org/nlp/NER-FIRE2013/\n\nMalayalam\n---------\n\n- FIRE 2015, ESM-IL (English, Hindi, Tamil, Malayalam) : http://au-kbc.org/nlp/ESM-FIRE2015/#traincorpus\n- FIRE NER 2013 (English, Hindi, Tamil, Malayalam, Bengali): http://au-kbc.org/nlp/NER-FIRE2013/\n\nOriya/Odia\n----------\n\n- IJCNLP 2008 SSEAL: http://ltrc.iiit.ac.in/ner-ssea-08/index.cgi?topic=5\n\nSinhala/Sinhalese\n-----------------\n\n- LORELEI (LDC2018E57)\n\nThai\n----\n\n- thai-named-entity-recognition-data: https://github.com/PyThaiNLP/thai-named-entity-recognition-data\n- Thai named entity corpora: http://pioneer.chula.ac.th/~awirote/resources/corpora--data.html ; http://pioneer.chula.ac.th/~awirote/Data-Nutcha.zip ; http://pioneer.chula.ac.th/~awirote/Data-Sasiwimon.zip ; http://pioneer.chula.ac.th/~awirote/Data-Nattadaporn.zip\n- LST20: https://huggingface.co/datasets/lst20 ; https://arxiv.org/abs/2008.05055\n- Thai-NNER: https://github.com/vistec-AI/Thai-NNER , https://aclanthology.org/2022.findings-acl.116\n\nIndonesian\n----------\n\n- IDENTIC: http://metashare.elda.org/repository/browse/identic/fed3fada7ef111e5aa3b001dd8b71c66c98eee36eabd42f18ffd9a95da9104cc/\n- https://github.com/yohanesgultom/nlp-experiments/tree/master/data/ner\n- indonesia-ner: Syaifudin \u0026 Nurwidyantoro  
https://ieeexplore.ieee.org/document/7828656  https://github.com/yusufsyaifudin/Indonesia-ner\n- idner-news-2k: A dataset of Indonesian News for Named-Entity Recognition task.  Reannotation of Syaifudin \u0026 Nurwidyantoro https://dl.acm.org/doi/10.1145/3592854#fn8  https://github.com/khairunnisaor/idner-news-2k/\n- NERP and NER-grit: two Indonesian datasets from IndoNLP/IndoNLU   https://github.com/IndoNLP/indonlu/tree/master/dataset  https://aclanthology.org/2020.aacl-main.85/\n\nVietnamese\n----------\n\n- VLSP 2016: http://vlsp.org.vn/resources-vlsp2016 ; https://github.com/undertheseanlp/ner\n- VLSP 2018: http://vlsp.org.vn/resources-vlsp2018 ; https://github.com/undertheseanlp/ner\n- PhoNER_COVID19: https://github.com/VinAIResearch/PhoNER_COVID19\n\nJapanese\n--------\n\n- IREX: https://nlp.cs.nyu.edu/irex/Package/\n- MET-2 (Japanese, Chinese): https://www-nlpir.nist.gov/related_projects/muc/\n- BCCWJ Basic NE corpus: https://sites.google.com/site/projectnextnlpne/en (Iwakura et al., Constructing a Japanese Basic Named Entity Corpus of Various Genres, NEWS 2016)\n- DBpedia abstract corpus (English, German, Dutch, French, Italian, Japanese): http://downloads.dbpedia.org/2015-04/ext/nlp/abstracts/\n- Data from: Mai et al., An Empirical Study on Fine-Grained Named Entity Recognition, COLING 2018 (English, Japanese): https://fgner.alt.ai/duc/ene/testsets/comp/\n- Wikipedia NER Corpus: https://github.com/stockmarkteam/ner-wikipedia-dataset\n- WikiANN: https://elisa-ie.github.io/wikiann/  \n- GSD: Conversion of the UD GSD dataset to named entities by Megagon Labs  https://github.com/megagonlabs/UD_Japanese-GSD\n- KWDLC: Kyoto University Web Document Leads Corpus   https://nlp.ist.i.kyoto-u.ac.jp/EN/index.php?KWDLC  https://github.com/ku-nlp/KWDLC  https://nagisa.readthedocs.io/en/latest/tutorial_ner.html\n\nKorean\n------\n\n- National Institute of Korean Language (ROK) - NER Corpus: https://github.com/digitalprk/KoreaNER ; 
https://ithub.korean.go.kr/user/total/referenceView.do?boardSeq=5\u0026articleSeq=118\u0026boardGb=T\u0026isInsUpd\u0026boardType=CORPUS\n- KMOU NER - https://github.com/kmounlp/NER\n- Korean Language Understanding Evaluation - KLUE NER - https://klue-benchmark.com/tasks/69/overview/description\n- https://github.com/songys/entity\n- HLCT 2016 corpus, with updates - https://github.com/machinereading/KoreanNERCorpus\n\nChinese\n-------\n\n- ACE 2003 (English, Chinese, Arabic): https://catalog.ldc.upenn.edu/LDC2004T09\n- ACE 2004 (English, Chinese, Arabic): https://catalog.ldc.upenn.edu/LDC2005T09\n- ACE 2005 (English, Chinese, Arabic): https://catalog.ldc.upenn.edu/LDC2006T06\n- OntoNotes 5 (English, Arabic, Chinese): https://catalog.ldc.upenn.edu/LDC2013T19\n- MET-2 (Japanese, Chinese): https://www-nlpir.nist.gov/related_projects/muc/\n- REFLEX Entity Translation (Parallel corpus: English, Arabic, Chinese): https://catalog.ldc.upenn.edu/LDC2009T11\n- NE3L named entities Chinese corpus (Arabic, Chinese, Russian): http://catalogue.elra.info/en-us/repository/browse/ELRA-W0079/\n- Original Short-Message Data Collation I in Chinese (named entities): http://catalog.elra.info/en-us/repository/browse/ELRA-W0045_04/ \n- Original Short-Message Data Collation II in Chinese (named entities): http://catalog.elra.info/en-us/repository/browse/ELRA-W0045_08/\n- ERE DEFT Corpora (Parallel corpus: English, Chinese): Mott et al., Parallel Chinese-English Entities, Relations and Events Corpora, 2016 (LDC2015E78 , LDC2014E114)\n- Chinese Weibo: DEFT ERE style annotations for named and nominal mentions on Chinese social media (Weibo): https://github.com/hltcoe/golden-horse\n- Chinese EduNER: 2023 dataset in the Education domain:  https://link.springer.com/article/10.1007/s00521-023-08635-5  https://github.com/anonymous-xl/eduner\n- Chinese Aerospace NER: https://www.nature.com/articles/s41598-023-50705-0   https://github.com/Coder-XIAOKAI/Aerospace_NERdatasets\n- SciCN: A Chinese Dataset 
and Benchmark for Scientific Information Extraction   https://file.techscience.com/files/cmc/2024/TSP_CMC-78-3/TSP_CMC_35594/TSP_CMC_35594.pdf  https://github.com/yangjingla/SciCN\n- EMP NER: Historical Chinese  https://aclanthology.org/2024.lrec-main.35.pdf https://gitlab.com/enpchina/ENP-NER\n\nTagalog\n-------\n\n- TLUnified: https://arxiv.org/abs/2311.07161 https://huggingface.co/datasets/ljvmiranda921/tlunified-ner\n\nRussian\n-------\n\n- BSNLP 2017 (Croatian, Czech, Polish, Russian, Slovak, Slovene, Ukrainian): http://bsnlp-2017.cs.helsinki.fi/shared_task_results.html\n- NE3L named entities Russian corpus (Arabic, Chinese, Russian): https://catalog.elra.info/en-us/repository/browse/ELRA-W0080/\n- WikiNER: https://figshare.com/articles/Learning_multilingual_named_entity_recognition_from_Wikipedia/5462500\n- WikiNEuRal: https://github.com/Babelscape/wikineural\n- MultiNERD: https://github.com/Babelscape/multinerd\n- factRuEval-2016: https://github.com/dialogue-evaluation/factRuEval-2016\n- RuREBus 2020 (Russian Relation Extraction for Business) corpus https://github.com/dialogue-evaluation/RuREBus\n\nYoruba\n------\n\n- GV-Yorùbá-NER. Data: https://github.com/ajesujoba/YorubaTwi-Embedding/tree/master/Yoruba/Yor%C3%B9b%C3%A1-NER ; Data statement: https://drive.google.com/file/d/177xu-O2FTJ7VJQ-0ohCWjVd1qu61Tvml/view Paper: Jesujoba O Alabi, Kwabena Amponsah-Kaakyire, David I Adelani, and Cristina España-Bonet. Massive vs. curated word embeddings for low-resourced languages. The case of Yorùbá and Twi. In LREC, 2020 (https://arxiv.org/abs/1912.02481)\n\nSwahili\n-------\n\n- Helsinki Corpus of Swahili 2.0 (HCS 2.0) Annotated Version: http://metashare.csc.fi/repository/browse/helsinki-corpus-of-swahili-20-hcs-20-annotated-version/232c1910b9eb11e5915e005056be118e59fb2e920f1f4c0cafc94915fc6f5cac/ See: Shah et al., 2010. 
SYNERGY: A Named Entity Recognition System for Resource-scarce Languages such as Swahili using Online Machine Translation\n\nIgbo\n----\n\n- IgboNER: https://aclanthology.org/2022.lrec-1.547/  https://github.com/Chiamakac/IgboNER-Models later updated in https://openreview.net/pdf?id=tHUS9-vmUfC  from https://sites.google.com/view/africanlp2023/home\n\nisiNdebele\n----------\n\n- NCHLT isiNdebele Named Entity Annotated Corpus: https://repo.sadilar.org/handle/20.500.12185/306\n\nXhosa\n-----\n\n- NCHLT isiXhosa Named Entity Annotated Corpus: https://repo.sadilar.org/handle/20.500.12185/312\n\nZulu\n----\n\n- NCHLT isiZulu Named Entity Annotated Corpus: https://repo.sadilar.org/handle/20.500.12185/319\n\nSepedi\n------\n\n- NCHLT Sepedi Named Entity Annotated Corpus: https://repo.sadilar.org/handle/20.500.12185/328\n\nSesotho\n-------\n\n- NCHLT Sesotho Named Entity Annotated Corpus: https://repo.sadilar.org/handle/20.500.12185/334\n\nSetswana \n--------\n\n- NCHLT Setswana Named Entity Annotated Corpus: https://repo.sadilar.org/handle/20.500.12185/341\n\nSiswati\n-------\n \n- NCHLT Siswati Named Entity Annotated Corpus: https://repo.sadilar.org/handle/20.500.12185/346\n\nVenda\n-----\n\n- NCHLT Tshivenda Named Entity Annotated Corpus: https://repo.sadilar.org/handle/20.500.12185/355\n- MPHAYANER: Named Entity Recognition for Tshivenḓa: https://openreview.net/pdf?id=0nneuL3bSLt https://github.com/rendanim/MphayaNER  from https://sites.google.com/view/africanlp2023/home\n\nXitsonga\n--------\n\n- NCHLT Xitsonga Named Entity Annotated Corpus: https://repo.sadilar.org/handle/20.500.12185/362\n\nLatin\n-----\n\n- Herodotos Project: https://github.com/alexerdmann/Herodotos_Project_Annotation\n\n\nA long list can be found here: http://damien.nouvels.net/resourcesen/corpora.html\n\nReferences\n==========\n\n[Alvarado et al., 2015] Alvarado, Julio Cesar Salinas, Karin Verspoor,\nand Timothy Baldwin. 
Domain adaption of named entity recognition to support\ncredit risk assessment. In Proceedings of the Australasian Language Technology\nAssociation Workshop 2015, pp. 84-90. 2015.\nAccessed: August 2018.\n\n[Balasuriya et al., 2009] Balasuriya, Dominic, Nicky Ringland, Joel Nothman,\nTara Murphy, and James R. Curran. Named entity recognition in wikipedia. In\nProceedings of the 2009 Workshop on The People's Web Meets NLP: Collaboratively\nConstructed Semantic Resources, pp. 10-18. Association for Computational\nLinguistics, 2009\n\n[Bos et al., 2017] Bos, Johan, Valerio Basile, Kilian Evang,\nNoortje J. Venhuizen, and Johannes Bjerva. The Groningen meaning bank.\nIn Handbook of linguistic annotation, pp. 463-496. Springer, Dordrecht, 2017.\n\n[Derczynski et al., 2016] Derczynski, Leon, Kalina Bontcheva, and Ian Roberts.\nBroad twitter corpus: A diverse named entity recognition resource. In\nProceedings of COLING 2016, the 26th International Conference on Computational\nLinguistics: Technical Papers, pp. 1169-1179. 2016.\nAvailable at: https://github.com/GateNLP/broad_twitter_corpus\nAccessed: August 2018.\n\n[Derczynski et al., 2017] Leon Derczynski, Eric Nichols, Marieke van Erp,\nNut Limsopatham (2017) Results of the WNUT2017 Shared Task on Novel and\nEmerging Entity Recognition, in Proceedings of the 3rd Workshop on Noisy,\nUser-generated Text.\nAvailable at: https://noisy-text.github.io/2017/emerging-rare-entities.html\n\n[DSTL, 2017] Defence Science and Technology Laboratory. 2017. Relationship and\nEntity Extraction Evaluation Dataset.  https://github.com/dstl/re3d.\nAccessed: January 2018.\n\n[Grishman and Sundheim, 1996] Ralph Grishman and Beth Sundheim. 1996.\nMessage understanding conference- 6: A brief history. In COLING 1996 Volume 1:\nThe 16th International Conference on Computational Linguistics.\n\n[Karimi et al., 2015] Sarvnaz Karimi, Alejandro Metke-Jimenez, Madonna Kemp,\nand Chen Wang. 2015. 
Cadec: A corpus of adverse drug event annotations.\nJournal of biomedical informatics, 55:73-81. Available at https://data.csiro.au\nAccessed: November 2017.\n\n[Lim et al., 2017] Lim, Swee Kiat, Aldrian Obaja Muis, Wei Lu, and\nChen Hui Ong. MalwareTextDB: A database for annotated malware articles.\nIn Proceedings of the 55th Annual Meeting of the Association for Computational\nLinguistics (Volume 1: Long Papers), vol. 1, pp. 1557-1567. 2017.\n\n[Liu et al., 2013a] Jingjing Liu, Panupong Pasupat, Scott Cyphers, and\nJim Glass. 2013. Asgard: A portable architecture for multilingual dialogue\nsystems. In Acoustics, Speech and Signal Processing (ICASSP), 2013 IEEE\nInternational Conference on, pages 8386-8390. IEEE.\nAvailable at https://groups.csail.mit.edu/sls/downloads/restaurant/\nAccessed: January 2018\n\n[Liu et al., 2013b] Jingjing Liu, Panupong Pasupat, Yining Wang, Scott Cyphers,\nand Jim Glass. 2013. Query understanding enhanced by hierarchical parsing\nstructures. In Automatic Speech Recognition and Understanding (ASRU),\n2013 IEEE Workshop on, pages 72-77. IEEE.\nAvailable at https://groups.csail.mit.edu/sls/downloads/movie/\nWe used the trivia10k13 portion. Accessed: January 2018\n\n[NIST, 1999 IE-ER] NIST. 1999. Information Extraction - Entity Recognition\nEvaluation. http://www.nist.gov/speech/tests/ieer/er_99/er_99.htm.\nThe newswire development test data only (included in the NLTK package).\n\n[Ohta et al., 2012] Tomoko Ohta, Sampo Pyysalo, Jun'ichi Tsujii and Sophia\nAnaniadou. 2012. Open-domain Anatomical Entity Mention Detection. In\nProceedings of ACL 2012 Workshop on Detecting Structure in Scholarly Discourse\n(DSSD), pp. 27-36.\nAvailable at: http://www.nactem.ac.uk/anatomy/ and\nhttps://github.com/openbiocorpora/anem Accessed: November 2017.\n\n[Ritter et al., 2011] Alan Ritter, Sam Clark, Mausam, and Oren Etzioni. 2011.\nNamed entity recognition in tweets: An experimental study. 
In Proceedings of\nthe 2011 Conference on Empirical Methods in Natural Language Processing,\npages 1524-1534, Edinburgh, Scotland, UK., July. Association for Computational\nLinguistics.\nAccessed January 2018.\n\n[Sang and Meulder, 2003] Erik F. Tjong Kim Sang and Fien De Meulder. 2003.\nIntroduction to the CoNLL-2003 shared task: Language-independent named entity\nrecognition. In Proceedings of the Seventh Conference on Natural Language\nLearning at HLT-NAACL 2003.\n\n[Stubbs et al., 2015] Amber Stubbs and Ozlem Uzuner. 2015. Annotating\nlongitudinal clinical narratives for de-identification: The 2014 i2b2/UTHealth\ncorpus. Journal of biomedical informatics, 58:S20-S29. Available at\nhttps://www.i2b2.org/NLP/DataSets/ Accessed: February 2018.\n\n[Uzuner et al., 2007] Ozlem Uzuner, Yuan Luo, and Peter Szolovits. 2007.\nEvaluating the state-of-the-art in automatic de-identification. Journal of the\nAmerican Medical Informatics Association, 14(5):550-563. Available at\nhttps://www.i2b2.org/NLP/DataSets/ Accessed: February 2018.\n\n[Weischedel and Brunstein, 2005] Ralph Weischedel and Ada Brunstein. 2005.\nBBN pronoun coreference and entity type corpus. Linguistic Data Consortium,\nPhiladelphia.\n\n[Weischedel et al., 2013] Weischedel, Ralph, Martha Palmer, Mitchell Marcus,\nEduard Hovy, Sameer Pradhan, Lance Ramshaw, Nianwen Xue et al. Ontonotes\nrelease 5.0 ldc2013t19. Linguistic Data Consortium, Philadelphia, PA (2013).\n\n[Zeldes, 2017] Amir Zeldes. 2017. The GUM corpus: creating multilayer\nresources in the classroom. 
Language Resources and Evaluation, 51(3):581-612.\nAvailable at https://github.com/amir-zeldes/gum/tree/master/coref/tsv/\nAccessed: November 2017.\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fjuand-r%2Fentity-recognition-datasets","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fjuand-r%2Fentity-recognition-datasets","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fjuand-r%2Fentity-recognition-datasets/lists"}