{"id":20114812,"url":"https://github.com/gatenlp/corpusconversion-tiger","last_synced_at":"2026-05-10T08:51:47.187Z","repository":{"id":145770867,"uuid":"80838897","full_name":"GateNLP/corpusconversion-tiger","owner":"GateNLP","description":"Tool to convert the German Tiger corpus and other corpora in Tiger format to GATE","archived":false,"fork":false,"pushed_at":"2017-02-03T14:59:31.000Z","size":7,"stargazers_count":0,"open_issues_count":0,"forks_count":0,"subscribers_count":17,"default_branch":"master","last_synced_at":"2025-03-09T19:55:11.601Z","etag":null,"topics":["conversion","corpus","gate","gatenlp","nlp"],"latest_commit_sha":null,"homepage":null,"language":"Groovy","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":null,"status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/GateNLP.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":null,"code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2017-02-03T14:59:06.000Z","updated_at":"2017-02-03T14:59:33.000Z","dependencies_parsed_at":"2023-06-09T01:30:44.989Z","dependency_job_id":null,"html_url":"https://github.com/GateNLP/corpusconversion-tiger","commit_stats":null,"previous_names":[],"tags_count":0,"template":false,"template_full_name":null,"purl":"pkg:github/GateNLP/corpusconversion-tiger","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/GateNLP%2Fcorpusconversion-tiger","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/GateNLP%2Fcorpusconversion-tiger/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/GateNLP%2Fcorpusconversion-tiger/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/GateN
LP%2Fcorpusconversion-tiger/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/GateNLP","download_url":"https://codeload.github.com/GateNLP/corpusconversion-tiger/tar.gz/refs/heads/master","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/GateNLP%2Fcorpusconversion-tiger/sbom","scorecard":null,"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":286079811,"owners_count":27282121,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","status":"online","status_checked_at":"2025-11-27T02:00:05.795Z","response_time":58,"last_error":null,"robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":true,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["conversion","corpus","gate","gatenlp","nlp"],"created_at":"2024-11-13T18:32:10.086Z","updated_at":"2025-11-27T08:02:19.810Z","avatar_url":"https://github.com/GateNLP.png","language":"Groovy","funding_links":[],"categories":[],"sub_categories":[],"readme":"# Tool to convert the Tiger XML format to GATE\n\nFor a description of the format, see http://www.ims.uni-stuttgart.de/forschung/ressourcen/werkzeuge/TIGERSearch/doc/html/TigerXML.html\n\n## How to run\n\n* Make sure you have lots of RAM - this conversion is currently done by loading the whole input XML file into memory! 
The German Tiger corpus currently needs about 3.5G of RAM and this is what is used by default in the conversion script.\n* make sure the uncompressed/unpacked corpus file is accessible, this script expects the format of\n   file tiger_release_aug07.corrected.16012013.xml\n* make sure convert.sh is executable, groovy is installed and GATE_HOME is set\n* create some output directory to hold the generated GATE files\n* ./convert.sh [-n 1] tigerInfile outDir\n* this should fill the output directory with documents which contain 1 sentence each. The\n  parameter \"-n 99\" can be used to specify a different number of sentences per document.\n\n## Metadata\n\nWe extract some metadata from the header and add it to every GATE document as document features.\nThis is done for convenience and for legal reasons, so that it is clear from looking at each \ndocument that it was converted from the Tiger corpus.\n\nThe following fields are converted to document features:\n* meta.name\n* meta.author\n* meta.date\n* meta.description\n* meta.history\n\n## Document data\n\nThe corpus has been collected from complete articles of Frankfurter Rundschau, but there seems \nto be no explicit indication where one article ends and another starts, let alone any identification\nof article metadata.\nAn indirect indication of article end could be a sentence of the form \"sch\" \"/\" \"rtr\" or of the form\n\"GABRIELE\" \"VENZKY\".\nif \"MARTIN DAHMS (Göttingen)\"\n\nBut there are other boundaries which cannot be detected like this, e.g. between sentences s177 and s178\n\nMaybe also \"Mit ... sprach .. FR-Redakteur/in ...\"?\n\nAfter looking through more of the corpus, it becomes clear that none of those heuristics will work,\nsometimes the author is missing, sometimes specified at the top of the article etc. 
\nSo for now we just ignore this and split the corpus up so that each document contains a maximum \nof 50 sentences.\n\n## Converter strategy \n\n* go through the XML \n* if in head, parse and store metainformation we want to add to each document\n* if in body, iterate over the s elements\n* for each s element, get the graph element\n* from the graph element, get the terminals element first and iterate over the t elements to get the words\n  Also get the id of the root from the graph element\n* for each t element, create a feature map first, map the id to the number in the feature map array.\n  Store the string in a second array of word strings\n* Each feature map gets the lemma, pos, morph, case, number, gender, person, degree, tense and mood features\n* from the graph element get the nonterminals element and iterate over the nt elements\n* for each nt element get all the contained edge elements\n  Represent the tree somehow, one possibility is:\n  * each nt is an annotation that covers all the annotations associated with the edge nodes \n  * each edge is an annotation of type label covering the destination node and having an id field for\n    from and to node annotations\n  * or each nt annotation contains, for each label, a list of ids for edges with that label, e.g. NK_ids=[23,55]\n* NOTE in the first version for just testing the lemmatizer, we can ignore the nonterminals\n\n\n \n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fgatenlp%2Fcorpusconversion-tiger","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fgatenlp%2Fcorpusconversion-tiger","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fgatenlp%2Fcorpusconversion-tiger/lists"}