{"id":24754181,"url":"https://github.com/harisont/l2-ud","last_synced_at":"2025-07-25T03:17:06.013Z","repository":{"id":154623193,"uuid":"506247711","full_name":"harisont/L2-UD","owner":"harisont","description":"Tools for working with UD treebanks of learner texts (and parallel treebanks in general).","archived":false,"fork":false,"pushed_at":"2025-02-14T22:23:25.000Z","size":2176,"stargazers_count":0,"open_issues_count":9,"forks_count":0,"subscribers_count":2,"default_branch":"main","last_synced_at":"2025-02-14T23:25:15.814Z","etag":null,"topics":["bea-workshop","nodalida"],"latest_commit_sha":null,"homepage":"","language":"Haskell","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/harisont.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2022-06-22T12:56:25.000Z","updated_at":"2023-10-19T23:10:10.000Z","dependencies_parsed_at":"2024-05-10T16:56:18.590Z","dependency_job_id":"41733d49-20b0-4583-bf13-4a739ed57ef5","html_url":"https://github.com/harisont/L2-UD","commit_stats":null,"previous_names":[],"tags_count":5,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/harisont%2FL2-UD","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/harisont%2FL2-UD/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/harisont%2FL2-UD/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/harisont%2FL2-UD/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/harisont","download_url":"h
ttps://codeload.github.com/harisont/L2-UD/tar.gz/refs/heads/main","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":245056906,"owners_count":20553856,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["bea-workshop","nodalida"],"created_at":"2025-01-28T11:39:14.276Z","updated_at":"2025-03-23T05:13:40.653Z","avatar_url":"https://github.com/harisont.png","language":"Haskell","funding_links":[],"categories":[],"sub_categories":[],"readme":"# L2-UD\nTools for working with [L1-L2 parallel UD treebanks](https://aclanthology.org/W17-6306.pdf).\n\n## Installation\n(requires [the Haskell Tool Stack](https://docs.haskellstack.org/en/stable/))\n\n1. clone this repository\n2. 
move inside the corresponding folder and run\n   ```\n   stack install\n   ```\n\nIf you are trying to install L2-UD on Windows and the command above does not work, try following [these instructions](win.md).\n\n## Usage\n\n### Querying parallel L1-L2 treebanks\nTo return the set of parallel L1-L2 sentences matching an error pattern, run\n\n```\nl2-ud match L1-TB L2-TB PATTERNS [OPTS]\n```\n\nwhere:\n\n- `L1-TB` is the CoNLL-U file containing correction hypotheses\n- `L2-TB` is the CoNLL-U file containing original learner sentences\n- `PATTERNS` is a list of space-separated [L1-L2 patterns](#l1-l2-patterns) or the path to a file containing an L1-L2 pattern per line (see the [saved queries folder](queries) for examples).\n\nBy default, the `match` command prints the IDs of the sentences matching the pattern.\n\nAvailable `OPTS`:\n\n- `--help`, `-h`: show usage instructions\n- `--markdown`, `-m`: rather than sentence IDs, output a markdown report showing the sentences with matches highlighted in bold, like [this one](results/sv/S-FinV-example.md)\n- `--conllu=DIR`, `-cDIR`: on top of printing sentence IDs to the standard output, extract the pairs of subtrees matching the pattern and write them to an `L1.conllu` and an `L2.conllu` file in the given `DIR`ectory (if no directory is specified, files are created in the current folder)\n\n### Extracting error patterns (__CURRENTLY UNDER DEVELOPMENT__)\nReturn the [error patterns](#l1-l2-patterns) contained in an L1-L2 treebank.\n\n```\nl2-ud extract L1-TB L2-TB [OPTS]\n```\n\nWhere:\n\n- `L1-TB` is the CoNLL-U file containing correction hypotheses\n- `L2-TB` is the CoNLL-U file containing original learner sentences\n\nAvailable `OPTS`:\n\n- `--help`, `-h`: show usage instructions\n- `--markdown`, `-m`: rather than sentence IDs, output a markdown report showing the sentences with errors highlighted in bold next to the error patterns that were detected\n- `--conllu=DIR`, `-cDIR`: on top of printing the error 
patterns to the standard output, extract the pairs of subtrees where the errors were found and write them to an `L1.conllu` and an `L2.conllu` file in the given `DIR`ectory (if no directory is specified, files are created in the current folder)\n\n### Retrieving similar examples\nGiven an L1-L2 sentence pair, return similar examples from an L1-L2 treebank, by:\n\n1. annotating the sentences in UD using UDPipe 2 (requires an internet connection)\n2. extracting error patterns from them (analogous to [`l2-ud extract`](#extracting-error-patterns-currently-under-development))\n3. querying the treebank with those error patterns (analogous to [`l2-ud match`](#querying-parallel-l1-l2-treebanks))\n\n```\nl2-ud example L1-TB L2-TB L1-SENTENCE L2-SENTENCE LANG [OPTS]\n```\n\nWhere:\n\n- `L1-TB` is the CoNLL-U file containing correction hypotheses\n- `L2-TB` is the CoNLL-U file containing original learner sentences\n- `L1-SENTENCE` is a correct sentence\n- `L2-SENTENCE` is a sentence containing one or more grammatical errors\n- `LANG` is the name of the [UDPipe 2 model](https://ufal.mff.cuni.cz/udpipe/2/models) to be used for annotating the sentences. The default model for each language is named after the language's English name, lowercased, e.g. `swedish`\n\nAvailable `OPTS`:\n\n- `--help`, `-h`: show usage instructions\n- `--verbose`, `-v`: show intermediate results (UD-annotated example sentences and extracted patterns)\n- `--markdown`, `-m`: rather than sentence IDs, show the similar examples found in the treebank in a markdown report.\n\n## L1-L2 patterns\nAn L1-L2 error pattern is a \"parallel\" [`gf-ud`](https://github.com/GrammaticalFramework/gf-ud) pattern[^1], i.e. essentially a pair of UD patterns.\nFor conciseness, instead of writing full pairs of patterns, L1-L2 patterns are written as single UD patterns with discrepancies enclosed in curly braces. 
For instance, the pattern\n\n```\nAND [POS \"DET\", FEATS_ \"Gender={Masc-\u003eFem}\"]\n```\n\nreads as \"feminine determiners corrected with their masculine form\", or \"feminine determiners that should have been masculine\" and is expanded to two `gf-ud` patterns:\n\n- `AND [POS \"DET\", FEATS_ \"Gender=Masc\"]`, to be looked for in the L1 treebank of corrections\n- `AND [POS \"DET\", FEATS_ \"Gender=Fem\"]`, to be looked for in the L2 treebank of original learner sentences.\n\n### L2-only patterns\nFor some types of error, an L2 pattern alone is sufficient to describe the error concisely.\nWhen that is the case, it is possible to write a single UD pattern `P`, which is expanded to a pair $\\langle$ `TRUE`, `P` $\\rangle$.\n\n### Variables (__EXPERIMENTAL__)\nTo avoid enumerating all combinations of values for categorical attributes, it is possible to use variables, i.e. capital letters preceded by a `$` sign.\n\nFor example, rather than writing \n\n```\nAND [POS \"DET\", FEATS_ \"Gender={Masc-\u003eFem}\"]\nAND [POS \"DET\", FEATS_ \"Gender={Fem-\u003eMasc}\"]\n```\n\nit is possible to write\n\n```\nAND [POS \"DET\", FEATS_ \"Gender={$A-\u003e$B}\"]\n```\n\nwhere `A` is assumed to be different from `B`.\n\nWriting\n\n```\nAND [POS \"$A\", FEATS_ \"Gender={$A-\u003e$B}\"]\n```\n\n(using the identifier `A` twice for two different attributes) is not a problem. The `$A` in `POS \"$A\"` will be replaced with every possible UPOS tag.\n\nVariables are currently supported for morphological features, Universal POS tags and dependency relations with no subtypes. \n\nFor attributes with many possible values, like `POS`, querying can become very slow. 
\n\n### Example queries\n- generalized determiner-noun gender agreement error:\n  ```\n  TREE_ (AND [POS \"NOUN\", FEATS_ \"Gender=$A\"]) [AND [POS \"DET\", FEATS_ \"Gender={$A-\u003e$B}\"]]\n  ```\n- missing determiner with possessives (Italian):\n  ```\n  TREE (POS \"NOUN\") [{DEPREL \"det\", -\u003e } DEPREL \"det:poss\"]\n  ``` \n  (mind the position of the comma, see [issue #1](https://github.com/harisont/L2-UD/issues/1) for details).\n- V2 order violation when the first token is an adverb (Swedish):\n  ```\n  SEQUENCE [POS \"ADV\", OR [POS \"VERB\", POS \"AUX\"], DEPREL_ \"nsubj\"]\n  ```\n\nFor more examples, check the [saved queries folder](queries).\n\n## Citation\nIf you use this software in your research, you are welcome to cite\n\n```\n@inproceedings{masciolini-2023-query,\n    title = \"A query engine for {L}1-{L}2 parallel dependency treebanks\",\n    author = \"Masciolini, Arianna\",\n    booktitle = \"Proceedings of the 24th Nordic Conference on Computational Linguistics (NoDaLiDa)\",\n    month = may,\n    year = \"2023\",\n    address = \"T{\\'o}rshavn, Faroe Islands\",\n    publisher = \"University of Tartu Library\",\n    url = \"https://aclanthology.org/2023.nodalida-1.57\",\n    pages = \"574--587\",\n    abstract = \"L1-L2 parallel dependency treebanks are learner corpora with interoperability as their main design goal. They consist of sentences produced by learners of a second language (L2) paired with native-like (L1) correction hypotheses. Rather than explicitly labelled for errors, these are annotated following the Universal Dependencies standard. This implies relying on tree queries for error retrieval. Work in this direction is, however, limited. 
We present a query engine for L1-L2 treebanks and evaluate it on two corpora, one manually validated and one automatically parsed.\",\n}\n```\n\nand\n\n```\n@inproceedings{masciolini-etal-2023-towards,\n    title = \"Towards automatically extracting morphosyntactical error patterns from {L}1-{L}2 parallel dependency treebanks\",\n    author = \"Masciolini, Arianna  and\n      Volodina, Elena  and\n      Dann{\\'e}lls, Dana\",\n    booktitle = \"Proceedings of the 18th Workshop on Innovative Use of NLP for Building Educational Applications (BEA 2023)\",\n    month = jul,\n    year = \"2023\",\n    address = \"Toronto, Canada\",\n    publisher = \"Association for Computational Linguistics\",\n    url = \"https://aclanthology.org/2023.bea-1.50\",\n    doi = \"10.18653/v1/2023.bea-1.50\",\n    pages = \"585--597\",\n    abstract = \"L1-L2 parallel dependency treebanks are UD-annotated corpora of learner sentences paired with correction hypotheses. Automatic morphosyntactical annotation has the potential to remove the need for explicit manual error tagging and improve interoperability, but makes it more challenging to locate grammatical errors in the resulting datasets. We therefore propose a novel method for automatically extracting morphosyntactical error patterns and perform a preliminary bilingual evaluation of its first implementation through a similar example retrieval task. The resulting pipeline is also available as a prototype CALL application.\",\n}\n```\n\n[^1]: The syntax of `gf-ud`'s pattern matching language is described extensively [here](https://github.com/GrammaticalFramework/gf-ud/blob/master/doc/patterns.md).\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fharisont%2Fl2-ud","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fharisont%2Fl2-ud","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fharisont%2Fl2-ud/lists"}