{"id":18754045,"url":"https://github.com/dcavar/treebankparser","last_synced_at":"2025-06-11T10:08:19.073Z","repository":{"id":150221167,"uuid":"153116740","full_name":"dcavar/TreebankParser","owner":"dcavar","description":"Parser for treebanks based on Penn Treebank type of encoding that generates Probabilistic Context Free Grammars","archived":false,"fork":false,"pushed_at":"2018-10-17T23:31:14.000Z","size":190,"stargazers_count":3,"open_issues_count":0,"forks_count":3,"subscribers_count":3,"default_branch":"master","last_synced_at":"2024-11-07T17:36:26.808Z","etag":null,"topics":["bnf","bnfc","context-free-grammar","lexical-functional-grammar","parser","penn-treebank","probabilistic-context-free-grammar","syntax","treebank"],"latest_commit_sha":null,"homepage":"http://damir.cavar.me/","language":"C","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"apache-2.0","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/dcavar.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2018-10-15T13:21:21.000Z","updated_at":"2023-08-23T11:26:19.000Z","dependencies_parsed_at":"2023-06-26T01:18:14.570Z","dependency_job_id":null,"html_url":"https://github.com/dcavar/TreebankParser","commit_stats":null,"previous_names":[],"tags_count":0,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/dcavar%2FTreebankParser","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/dcavar%2FTreebankParser/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/dcavar%2FTreebankParser/releases","manifests_url":"https://r
epos.ecosyste.ms/api/v1/hosts/GitHub/repositories/dcavar%2FTreebankParser/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/dcavar","download_url":"https://codeload.github.com/dcavar/TreebankParser/tar.gz/refs/heads/master","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":231690700,"owners_count":18411507,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["bnf","bnfc","context-free-grammar","lexical-functional-grammar","parser","penn-treebank","probabilistic-context-free-grammar","syntax","treebank"],"created_at":"2024-11-07T17:27:54.790Z","updated_at":"2024-12-29T00:55:19.476Z","avatar_url":"https://github.com/dcavar.png","language":"C","funding_links":[],"categories":[],"sub_categories":[],"readme":"# TreebankParser\n\n(C) 2016-2018 by [Damir Cavar] \u003c[dcavar@iu.edu](mailto:dcavar@iu.edu)\u003e\n\nThis code and the binaries are made available under the\n[Apache License, Version 2.0, January 2004](http://www.apache.org/licenses/). For details see the included\n*LICENSE.txt* file.\n\n\n\nThis is a tool that reads treebank files and generates a probabilistic grammar for use in [FLE].\n\nCurrently it can generate all Context-free Grammar rules from a treebank in the Penn-treebank format.\n\nTake for example the *test1.txt* file in the current source repository. You can run treebankparser to generate a frequency profile of the rules:\n\n\t./treebankparser -y S test1.txt\n\nThe *-y S* parameter generates an S-symbol for empty root nodes, as in *test1.txt*. 
The default is to generate *ROOT* as the label for such root nodes.\n\nThe output should look like this:\n\n\t1\tADJP --\u003e JJ\n\t1\tIP-HLN --\u003e VP\n\t1\tJJ --\u003e 重要\n\t1\tNN --\u003e 企业\n\t1\tNN --\u003e 增长点\n\t1\tNN --\u003e 外商\n\t1\tNN --\u003e 外贸\n\t1\tNN --\u003e 投资\n\t2\tNP --\u003e NN\n\t1\tNP --\u003e NP\n\t1\tNP-OBJ --\u003e NP\n\t1\tNP-PN --\u003e NR\n\t1\tNP-SBJ --\u003e NN NN NN\n\t1\tNR --\u003e 中国\n\t1\tS --\u003e IP-HLN\n\t1\tVP --\u003e NP-OBJ\n\t1\tVV --\u003e 成为\n\nThe absolute frequency is separated from the rule by a tab. A relative frequency can be generated instead, as a float, using the *-r* parameter:\n\n\t./treebankparser -r -y S test1.txt \u003e res.log\n\nThe output should look like this:\n\n\t0.0555556       ADJP --\u003e JJ\n\t0.0555556       IP-HLN --\u003e VP\n\t0.0555556       JJ --\u003e 重要\n\t0.0555556       NN --\u003e 企业\n\t0.0555556       NN --\u003e 增长点\n\t0.0555556       NN --\u003e 外商\n\t0.0555556       NN --\u003e 外贸\n\t0.0555556       NN --\u003e 投资\n\t0.111111        NP --\u003e NN\n\t0.0555556       NP --\u003e NP\n\t0.0555556       NP-OBJ --\u003e NP\n\t0.0555556       NP-PN --\u003e NR\n\t0.0555556       NP-SBJ --\u003e NN NN NN\n\t0.0555556       NR --\u003e 中国\n\t0.0555556       S --\u003e IP-HLN\n\t0.0555556       VP --\u003e NP-OBJ\n\t0.0555556       VV --\u003e 成为\n\n\nThe rules are printed to standard output with absolute or relative frequencies.\n\nI am adding more features, e.g.:\n \n- reloading existing grammars (multi-batch cycles for larger corpus collections)\n- elimination of terminal rules\n- parsing alternative coding formats for syntactic trees or treebanks (e.g. 
XML, TEI XML)\n- output probabilities for Left-hand-side symbols only, rather than rules\n- generation of a Weighted Finite State Transducer representation, as coded in [FLE]\n\nIf you have ideas or suggestions, let me know.\n\n\n\n\n## Prerequisites\n\nThe tool is written in [C++11] and requires the following libraries:\n\n- [Boost]\n- [Xerces-C++]\n\n\n## Compile\n\nUse [CLion] or otherwise run:\n\n\tcmake CMakeLists.txt\n\tmake\n\n\n\n[Damir Cavar]: http://damir.cavar.me/ \"Damir Cavar\"\n[CLion]: https://www.jetbrains.com/clion/ \"CLion IDE\"\n[Boost]: http://www.boost.org/ \"Boost C++ Libraries\"\n[C++11]: https://en.wikipedia.org/wiki/C%2B%2B11 \"C++11\"\n[Xerces-C++]: https://xerces.apache.org/xerces-c/ \"Xerces-C++ XML Parser\"\n[FLE]: http://gorilla.linguistlist.org/fle/ \"Free Linguistic Environment\"\n\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fdcavar%2Ftreebankparser","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fdcavar%2Ftreebankparser","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fdcavar%2Ftreebankparser/lists"}