{"id":13741367,"url":"https://github.com/juditacs/wikt2dict","last_synced_at":"2025-12-18T00:58:13.902Z","repository":{"id":8858122,"uuid":"10568086","full_name":"juditacs/wikt2dict","owner":"juditacs","description":"Wiktionary parser tool for many language editions.","archived":false,"fork":false,"pushed_at":"2022-08-17T11:39:48.000Z","size":168,"stargazers_count":53,"open_issues_count":5,"forks_count":13,"subscribers_count":6,"default_branch":"master","last_synced_at":"2024-11-15T11:36:24.028Z","etag":null,"topics":[],"latest_commit_sha":null,"homepage":null,"language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"lgpl-3.0","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/juditacs.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null}},"created_at":"2013-06-08T12:06:25.000Z","updated_at":"2023-08-22T21:13:13.000Z","dependencies_parsed_at":"2022-08-27T20:52:15.723Z","dependency_job_id":null,"html_url":"https://github.com/juditacs/wikt2dict","commit_stats":null,"previous_names":[],"tags_count":2,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/juditacs%2Fwikt2dict","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/juditacs%2Fwikt2dict/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/juditacs%2Fwikt2dict/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/juditacs%2Fwikt2dict/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/juditacs","download_url":"https://codeload.github.com/juditacs/wikt2dict/tar.gz/refs/heads/master","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_coun
t":253153163,"owners_count":21862318,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":[],"created_at":"2024-08-03T04:00:58.322Z","updated_at":"2025-12-18T00:58:13.847Z","avatar_url":"https://github.com/juditacs.png","language":"Python","funding_links":[],"categories":["Software"],"sub_categories":["Utilities"],"readme":"# wikt2dict\n\nWiktionary translation parser tool for many language editions.\n\nWikt2dict parses only the translation sections.\nIt also has a triangulation mode which combines the extracted translation pairs to\ngenerate new ones. \n\n## News\n\nWikt2dict changed completely, hope for the better. If you would like to keep using the old one:\nhttps://github.com/juditacs/wikt2dict/tree/a08cc896c22dc78db62e1b790c3ec157d00ad08f\n\n* Changed interface. See below for details (April 2014)\n* Added support for German Wiktionary (Aug 2013)\n* Had a poster at the Building and Using Comparable Corpora Workshop (BUCC) at ACL13, updated Bibtex accordingly\n  * The paper is available here: http://www.aclweb.org/anthology/W/W13/W13-2507.pdf\n* \u003cs\u003eAll dictionaries are available here: http://nessie.nytud.hu/dict\u003c/s\u003e\n\n## Requirements\n\nWikt2dict should run on any mainstream Linux distribution. 
It needs Python 2.7 and basic command line\ntools that should be found on most Linux distributions (wget, bzcat).\nIf you're working with large Wiktionaries such as the English Wiktionary, you need at least 10GB of\nfree space, preferably more.\nFor all Wiktionary editions supported, you need about 35GB of free space.\n\n## Installation\n\n    git clone https://github.com/juditacs/wikt2dict.git\n    cd wikt2dict\n    sudo pip install -e .\n\nYou can install wikt2dict in virtualenv if you do not have root access.\n\nA very quick guide to virtualenv:\n\n    virtualenv w2d_env\n    source w2d_env/bin/activate\n    git clone https://github.com/juditacs/wikt2dict.git\n    cd wikt2dict\n    pip install -e .\n\nNote that this way wikt2dict can only be used once the virtualenv has been activated.\nYou need to run source w2d\\_env/bin/activate every time you log in.\n\n## Very quick start\n\nWikt2dict's basic functionality can be accessed using the w2d.py script (which should be directly callable after running pip install).\n\n    $ w2d.py -h\n    Wikt2Dict\n    \n    Usage:\n      w2d.py (download|extract|triangulate|all) (--wikicodes=file|\u003cwc\u003e...)\n    \n    Options:\n      -h --help              Show this screen.\n      --version              Show version.\n      -w, --wikicodes=file   File containing a list of wikicodes.\n\nW2d.py currently supports 3+1 actions. All actions need a list of Wiktionary codes to work with.\nYou can either list the codes manually or provide them in a file (--wikicodes option).\n\nThe actions are:\n\n1. download: download the Wiktionary dumps. Convert them from XML to plaintext with a special page separator.\nThe files are saved in the directory specified in config.py:wiktionary\\_defaults['dump\\_path\\_base'].\nThe default is wikt2dict/dat/wiktionary/\u003clanguage name\u003e\n1. 
extract: extract translations.\nThe translations are saved to the file specified in config.py:wiktionary\\_defaults['output\\_path'].\nBy default this file is wikt2dict/dat/wiktionary/\u003clanguage name\u003e/translation\\_pairs.\n1. triangulate: use triangulation to generate more translations.\nTriangles are saved to the directory config.py:wiktionary\\_defaults['triangle\\_dir'] in separate files\nnamed as \u003cwc1\u003e\\_\u003cwc2\u003e\\_\u003cwc3\u003e. This file would contain pairs in wc1-wc3 languages triangulated via wc2.\nFor more information on triangulating, see: http://aclweb.org/anthology/W/W13/W13-2507.pdf\nNote that triangulating only makes sense if you specify at least 3 languages.\n1. all: do all of the above.\n\nLet's try it out on a few small Wiktionary editions.\n\nDownloading the Slovak, the Slovenian and the Limburgish Wiktionaries:\n\n    w2d.py download sk sl li\n\nThe downloaded and textified Wiktionaries should appear in dat/wiktionary/\u003clanguage name\u003e/\u003cwikicode\u003ewiktionary.txt\n\nExtracting translations:\n\n    w2d.py extract sk sl li\n\nThe extracted translations should appear in dat/wiktionary/\u003clanguage name\u003e/translation\\_pairs.\n\nNow let's try triangulating to get a bunch of new translations:\n\n    w2d.py triangulate sk sl li\n\nThe results should appear in dat/triangle/ arranged in subdirectories with a maximum of 1000 files per directory\nto avoid filesystem problems.\nUsing only 3 such small editions for triangulating does not make much sense (it yielded 4 pairs on the April 2014 dumps).\n\nOr do all of it at once:\n\n    w2d.py all sk sl li\n\n## Output\n\nThe output is a tab-separated file. 
\nIf you only want the translation pairs you should just cut the first 4 columns:\n    \n    cut -f1-4 \u003coutput_file\u003e \u003e \u003cdictionary\u003e\n\nOr without Wiktionary codes:\n\n    cut -f2,4 \u003coutput_file\u003e \u003e \u003cdictionary\u003e\n\nWhere \u003coutput\\_file\u003e should be replaced by the output of either the Wiktionary extraction\nor the triangulating, and \u003cdictionary\u003e is the file where the filtered columns are saved.\n\nThe columns are explained in detail below.\n\nThe one extracted from the Wiktionaries has the following columns:\n\n1. Wiktionary code 1 (language 1)\n2. Word or expression in language 1\n3. Wiktionary code 2 (language 2)\n4. Word or expression in language 2\n5. Wiktionary code of the Wiktionary from which the pair was extracted\n6. Article from which the pair was extracted\n7. Type of parser used (you probably don't need this)\n\nAn example:\n\n    en      dog     fr      chien   en      dog     defaultparser\n\nThe triangulating output has the following columns:\n\n1. Wiktionary code 1 (language 1)\n2. Word or expression in language 1\n3. Wiktionary code 2 (language 2)\n4. Word or expression in language 2\n5. 5-10. The articles and their source Wiktionary that were used to generate this pair\n\n    hu      kutya   oc      chin    hu      kutya   el      σκύλος  oc      chin\n\nThe pairs are listed with all possible ways they were found. 
I provided a little script to \nsort, unify and count the number of times one pair appears.\nUsage (from wikt2dict base directory):\n\n    cat \u003ctriangle_files_to_merge\u003e | bash bin/merge_triangle.sh \u003e output_file\n\nTo use with all triangle files:\n\n    cat \u003ctriangle_dir\u003e/*/* | bash bin/merge_triangle.sh \u003e output_file\n\nwhere the \u003ctriangle\\_dir\u003e should be replaced with the directory where the individual triangle files are\nstored (triangle\\_dir option).\n\nCongratulations, you have successfully finished the test tutorial of wikt2dict.\nPlease send your feedback to judit@sch.bme.hu.\n\n## Cite\n\nPlease cite:\n\n    @InProceedings{acs-pajkossy-kornai:2013:BUCC,  \n      author    = {Acs, Judit  and  Pajkossy, Katalin  and  Kornai, Andras},  \n      title     = {Building basic vocabulary across 40 languages},  \n      booktitle = {Proceedings of the Sixth Workshop on Building and Using Comparable Corpora},  \n      month     = {August},  \n      year      = {2013},  \n      address   = {Sofia, Bulgaria},  \n      publisher = {Association for Computational Linguistics},  \n      pages     = {52--58},  \n      url       = {http://www.aclweb.org/anthology/W13-2507}  \n    }  \n\nOr this one:\n\n    @InProceedings{CS14.864,\n    author = {Judit Ács},\n    title = {Pivot-based multilingual dictionary building using Wiktionary},\n    booktitle = {Proceedings of the Ninth International Conference on Language Resources and Evaluation (LREC'14)},\n    year = {2014},\n    month = {may},\n    date = {26-31},\n    address = {Reykjavik, Iceland},\n    editor = {Nicoletta Calzolari (Conference Chair) and Khalid Choukri and Thierry Declerck and Hrafn Loftsson and Bente Maegaard and Joseph Mariani and Asuncion Moreno and Jan Odijk and Stelios Piperidis},\n    publisher = {European Language Resources Association (ELRA)},\n    isbn = {978-2-9517408-8-4},\n    language = {english}\n    }\n\n    \n## Known Bugs\n\n* FIXED - Lithuanian and a few 
other Wiktionaries have translation tables in many articles,\nnot only for Lithuanian words, and these are parsed as if they were Lithuanian words. \nLanguage detection for all articles should be added. This issue is fixed but configuration\nshould be updated.\n\n* Logging is not always accurate\n\n## Upcoming\n\n* 4lang coverage, finding translations for a list of words\n\n  * Check out our basic vocabulary at: http://hlt.sztaki.hu/resources/4lang/\n\n\u003c!---\nYou can create statistics of the coverage of 4lang and uroboros by calling:\n\n    cat ../dat/lang/*/res/word_pairs | python fourlang_coverage.py ../res/4lang/coverage\n\nThis would take all translations extracted from the Wiktionaries and compute\nthe coverage of 4lang and uroboros based on each language of 4lang and all of them\ncombined as well.\nThe statistics are saved in ../res/4lang/ with the coverage prefix.\n--\u003e\n\n\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fjuditacs%2Fwikt2dict","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fjuditacs%2Fwikt2dict","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fjuditacs%2Fwikt2dict/lists"}