{"id":17614047,"url":"https://github.com/gsarti/interpreting-complexity","last_synced_at":"2026-03-02T08:34:28.290Z","repository":{"id":109884825,"uuid":"319024065","full_name":"gsarti/interpreting-complexity","owner":"gsarti","description":"Materials for the MSc Thesis \"Interpreting Neural Language Models for Linguistic Complexity Assessment\" and related works.","archived":false,"fork":false,"pushed_at":"2022-02-16T13:02:10.000Z","size":88,"stargazers_count":4,"open_issues_count":0,"forks_count":2,"subscribers_count":1,"default_branch":"main","last_synced_at":"2025-04-15T08:47:54.747Z","etag":null,"topics":["evaluation","garden-path-sentences","huggingface","interpretability","interpreting-models","linguistic-complexity","natural-language-processing","neural-language-model","neural-language-models","probing-task","readability","reading-comprehension","representational-similarity"],"latest_commit_sha":null,"homepage":"https://gsarti.com/thesis/introduction.html","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"apache-2.0","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/gsarti.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2020-12-06T12:08:44.000Z","updated_at":"2023-09-14T10:05:05.000Z","dependencies_parsed_at":null,"dependency_job_id":"fb779f85-66b9-4d5c-8520-b11b0e3d4dfb","html_url":"https://github.com/gsarti/interpreting-complexity","commit_stats":null,"previous_names":[],"tags_count":0,"template":false,"template_full_name":null,"purl":"pkg:github/gsarti/interpreting-complexity","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/gsar
ti%2Finterpreting-complexity","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/gsarti%2Finterpreting-complexity/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/gsarti%2Finterpreting-complexity/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/gsarti%2Finterpreting-complexity/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/gsarti","download_url":"https://codeload.github.com/gsarti/interpreting-complexity/tar.gz/refs/heads/main","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/gsarti%2Finterpreting-complexity/sbom","scorecard":null,"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":286080680,"owners_count":29995912,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2026-03-02T01:47:34.672Z","status":"online","status_checked_at":"2026-03-02T02:00:07.342Z","response_time":60,"last_error":null,"robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":true,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["evaluation","garden-path-sentences","huggingface","interpretability","interpreting-models","linguistic-complexity","natural-language-processing","neural-language-model","neural-language-models","probing-task","readability","reading-comprehension","representational-similarity"],"created_at":"2024-10-22T18:22:08.966Z","updated_at":"2026-03-02T08:34:28.251Z","avatar_url":"https://github.com/gsarti.png","language":"Python","funding_links":[],"categories":[],"sub_categories":[],"readme
":"# Interpreting Models of Linguistic Complexity\n\nThis repository contains data and code implementations for reproducing all the experiments for:\n\n**Interpreting Neural Language Models for Linguistic Complexity Assessment**, [Gabriele Sarti](https://gsarti.com), *Data Science and Scientific Computing MSc Thesis, University of Trieste, 2020* [[Gitbook]](https://gsarti.com/thesis/introduction.html) [[Slides (Long)](https://drive.google.com/file/d/1mb_Wlzrvog5-eds6hcSrm7gHSj9PO6qw/view?usp=sharing)] [[Slides (Short)](https://drive.google.com/file/d/1j2zCavx4EzomRIoTwmtvvmGbWizKmHEA/view?usp=sharing)]\n\n**UmBERTo-MTSA @ AcCompl-It: Improving Complexity and Acceptability Prediction with Multi-task Learning on Self-Supervised Annotations**, [Gabriele Sarti](https://gsarti.com), *Proceedings of Seventh Evaluation Campaign of Natural Language Processing and Speech Tools for Italian*, [[ArXiv](https://arxiv.org/abs/2011.05197)] [CEUR](http://ceur-ws.org/Vol-2765/paper163.pdf) [Video](https://vimeo.com/487817662)\n\n**That Looks Hard: Characterizing Linguistic Complexity in Humans and Language Models**, [Gabriele Sarti](https://gsarti.com) and [Dominique Brunato](https://scholar.google.com/citations?user=JJV9ay4AAAAJ\u0026hl=it) and [Felice Dell'Orletta](https://scholar.google.com/citations?user=uhInFTQAAAAJ\u0026hl=it), *Proceeding of the Workshop on Cognitive Modeling and Computational Linguistics at NAACL 2021* [ACL Anthology]\n\nIf you find these resource useful for your research, please consider citing one or more following works:\n\n```bibtex\n@mastersthesis{sarti-2020-interpreting,\n    author = {Sarti, Gabriele},\n    institution = {University of Trieste},\n    school = {University of Trieste},\n    title = {Interpreting Neural Language Models for Linguistic Complexity Assessment},\n    year = 2020\n}\n\n@inproceedings{sarti-2020-umbertomtsa,\n    author = {Sarti, Gabriele},\n    title = {{UmBERTo-MTSA @ AcCompl-It}: Improving Complexity and Acceptability 
Prediction with Multi-task Learning on Self-Supervised Annotations},\n    booktitle = {Proceedings of Seventh Evaluation Campaign of Natural Language Processing and Speech Tools for Italian. Final Workshop (EVALITA 2020)},\n    editor = {Basile, Valerio and Croce, Danilo and Di Maro, Maria and Passaro, Lucia C.},\n    publisher = {CEUR.org},\n    year = {2020},\n    address = {Online}\n}\n\n@inproceedings{sarti-etal-2021-looks,\n    title = \"That Looks Hard: Characterizing Linguistic Complexity in Humans and Language Models\",\n    author = \"Sarti, Gabriele and\n    Brunato, Dominique and\n    Dell'Orletta, Felice\",\n    booktitle = \"Proceedings of the Workshop on Cognitive Modeling and Computational Linguistics\",\n    month = jun,\n    year = \"2021\",\n    address = \"Mexico City, Mexico\",\n    publisher = \"Association for Computational Linguistics\",\n    url = \"TBD\",\n    doi = \"TBD\",\n    pages = \"TBD\",\n}\n```\n\n## Overview\n\n⚠️ TODO: Short summary and images ⚠️\n\n## Installation\n\n**Prerequisites**\n\n- Python \u003e= 3.6 is required to run the scripts provided in this repository. PyTorch should be installed using the wheels available on the PyTorch website that are compatible with your CUDA version.\n\n- For CUDA 10 and Python 3.6, we used the wheel torch-1.3.0-cp36-cp36m-linux_x86_64.whl.\n\n- Python \u003e= 3.7 is required to run SyntaxGym-related scripts.\n\n**Main dependencies**\n\n- `torch == 1.6.0`\n- `farm == 0.5.0`\n- `transformers == 3.3.1`\n- `syntaxgym`\n\n**Setup procedure**\n\n```shell\npython3 -m venv env\nsource env/bin/activate\npip install --upgrade pip\n./scripts/setup.sh\n```\n\nRun `scripts/setup.sh` from the main project folder. This will install dependencies, download data and create the repository structure. If you want to download ZuCo MAT files (30GB), edit `setup.sh` setting `DOWNLOAD_ZUCO_MAT_FILES=false`.\n\nYou need to manually download the original perceived complexity dataset presented in [Brunato et al. 
2018](https://www.aclweb.org/anthology/D18-1289/) from the [ItaliaNLP Lab website](http://www.italianlp.it/resources/corpus-of-sentences-rated-with-human-complexity-judgments/download-english-sentences/) and place it in the `data/complexity` folder.\n\nThe AcCompl-It campaign data and the Dundee corpus cannot be redistributed due to copyright restrictions.\n\nAfter all datasets are in their respective folders, run `python scripts/preprocess.py --all` from the main project folder to preprocess the datasets. Refer to the [Getting Started](#getting-started) section for further steps.\n\n## Code Overview\n\n**Repository structure**\n\n- `data` contains the subfolders for all data used throughout the study:\n\n    - `complexity`: the Perceived Complexity corpus by [Brunato et al. 2018](https://www.aclweb.org/anthology/D18-1289/).\n    - `eyetracking`: Eye-tracking corpora (Dundee, GECO, ZuCo 1 \u0026 2).\n    - `eval`: SST dataset used for representational similarity evaluation.\n    - `garden_paths`: three test suites taken from the [SyntaxGym](https://syntaxgym.org/) benchmark.\n    - `readability`: OneStopEnglish corpus paragraphs by reading level.\n    - `preprocessed`: The preprocessed versions of each corpus produced by `scripts/preprocess.py`.\n\n- `src/lingcomp` is the library built for this work, composed of:\n  - `data_utils`: Eye-tracking processors and utils.\n  - `farm`: Custom extension of the FARM library to add token-level regression, better multitask learning for NLMs and the GPT-2 model.\n  - `similarity`: Methods used for representational similarity evaluation.\n  - `syntaxgym`: Methods used to perform evaluation over SyntaxGym test suites.\n\n- `scripts`: Used to carry out the analysis and modeling experiments:\n  - `shortcuts`: **in development**, scripts calling other scripts multiple times to provide a quick interface.\n  - `analyze_linguistic_features`: Produces a report containing correlations across various complexity metrics and linguistic features.\n  
- `compute_sentence_baselines`: Computes sentence-level avg., binned avg. and SVM baselines for complexity scores using cross-validation.\n  - `compute_similarity`: Evaluates the representational similarity of embeddings produced by neural language models using different methods.\n  - `evaluate_garden_paths`: Allows using custom metrics (surprisal, gaze metrics prediction) to estimate the presence of atypical constructions in SyntaxGym test suites.\n  - `finetune_sentence_level`: Trains NLMs on sentence-level regression or classification tasks in single or multi-task settings.\n  - `finetune_token_regression`: Trains NLMs on token-level regression in single or multi-task settings.\n  - `get_surprisals`: Computes surprisal scores produced by NLMs for sentences.\n  - `preprocess`: Performs initial preprocessing and train/test splitting.\n\n## Getting Started\n\n**Preprocessing**\n\n```shell\n# Generate sentence-level dataset for eyetracking\npython scripts/preprocess.py \\\n    --all \\\n    --do_features \\\n    --eyetracking_mode sentence \\\n    --do_train_test_split\n```\n\n⚠️ TODO: Examples for all experiments ⚠️\n\n## Contacts\n\nIf you have any questions, feel free to contact me through email ([gabriele.sarti996@gmail.com](mailto:gabriele.sarti996@gmail.com)) or raise a GitHub issue in the repository!\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fgsarti%2Finterpreting-complexity","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fgsarti%2Finterpreting-complexity","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fgsarti%2Finterpreting-complexity/lists"}