{"id":19025105,"url":"https://github.com/divefish/transselect","last_synced_at":"2026-06-25T11:31:04.658Z","repository":{"id":133812712,"uuid":"445221484","full_name":"DiveFish/TransSelect","owner":"DiveFish","description":null,"archived":false,"fork":false,"pushed_at":"2022-01-06T15:46:00.000Z","size":1731,"stargazers_count":0,"open_issues_count":0,"forks_count":0,"subscribers_count":2,"default_branch":"main","last_synced_at":"2025-02-21T19:13:35.259Z","etag":null,"topics":[],"latest_commit_sha":null,"homepage":null,"language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":null,"status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/DiveFish.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":null,"code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2022-01-06T15:39:56.000Z","updated_at":"2022-01-06T15:46:02.000Z","dependencies_parsed_at":"2023-03-14T20:00:16.909Z","dependency_job_id":null,"html_url":"https://github.com/DiveFish/TransSelect","commit_stats":null,"previous_names":[],"tags_count":0,"template":false,"template_full_name":null,"purl":"pkg:github/DiveFish/TransSelect","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/DiveFish%2FTransSelect","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/DiveFish%2FTransSelect/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/DiveFish%2FTransSelect/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/DiveFish%2FTransSelect/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/DiveFish","download_url":"https://codeload.github.com/DiveFish/TransSelect/tar.gz/refs/heads/ma
in","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/DiveFish%2FTransSelect/sbom","scorecard":null,"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":286080680,"owners_count":34773841,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2026-05-26T15:22:16.424Z","status":"online","status_checked_at":"2026-06-25T02:00:05.521Z","response_time":101,"last_error":null,"robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":true,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":[],"created_at":"2024-11-08T20:41:51.372Z","updated_at":"2026-06-25T11:31:04.640Z","avatar_url":"https://github.com/DiveFish.png","language":"Python","funding_links":[],"categories":[],"sub_categories":[],"readme":"# Transformer Selectional Preferences for Predictions\nBy [janniss91](https://github.com/janniss91/)\n\n\nThe objective of this project is to investigate whether Transformer models make use of selectional preferences when making predictions. E.g. there is a strong association between the words `eat` and `spaghetti` if `spaghetti` is an object. However, there is a weak association if it is a subject.\n\nThe question is whether Transformers make use of the same kind of information for their predictions.\n\n## Setup\n\nFor the setup of this repository simply type:\n\n    make\n\nThis will\n\n- set up a virtual environment for this repository,\n- install all necessary project dependencies.\n\nMake sure that Python 3.7 or higher is installed in your virtual environment. 
Otherwise, installing the `transformers` library may fail.\n\n## Virtual Environment\n\nRunning the `make` command creates a virtual environment.  \nAlways work in this environment so that you use the correct Python interpreter and have access to the relevant dependencies.\n\nTo enter the environment in a shell, type:\n\n    . env/bin/activate\n\nIf you work in an IDE such as PyCharm, make sure that it also uses the correct Python interpreter.\n\nTo deactivate the virtual environment, type:\n\n    deactivate\n\n## Clean and Re-install\n\nTo reset the repository to its initial state, type:\n\n    make dist-clean\n\nThis will remove the virtual environment and all dependencies.  \nWith the `make` command you can re-install them.\n\nTo remove temporary files such as .pyc or .pyo files, type:\n\n    make clean\n\n## Pipeline (Recommended Usage)\n\nYou can run the whole pipeline of this repository with a single command:\n\n    ./scripts/pipeline.sh your_input_file.tsv language search_term\n\n**your_input_file**: a TSV gold file from the SORTS repository  \n**language**: german or dutch  \n**search_term**: name for sentence type (e.g. vlight or base-acc)\n\nThis includes the following steps:\n\n1. Filter for the specific sentence type.\n2. Predict the next word for the filtered sentences with all models.\n3. Generate statistics for the predictions that were just made for this sentence type.\n4. Generate the files with binary information about whether the correct prediction was found.\n5. 
Test significance for all models.\n\n## Data Set and Data Preparation\n\nThe data used in this project comes from the SORTS project (https://github.com/DiveFish/SORTS).\n\n### German\n\nMore specifically, the `german_part-ambiguous_gold.tsv` file was filtered for sentence entries with light verbs.\nThe filtering was done with the script `data/data_prep_tools/filter_data.py`.\nThe filtered data can be found in the `data` folder; it is called `filtered_vlight_data.tsv`.\n\n### Dutch\n\nFor Dutch, the `dutch_part-ambiguous_gold.tsv` file was filtered for sentence entries with light verbs.\n\nTo regenerate the filtered data, simply run:\n\n\u003cpre\u003e\npython3 data/data_prep_tools/filter_data.py \u003ci\u003epath_to_sorts_data\u003c/i\u003e.tsv data/\u003ci\u003efiltered_file\u003c/i\u003e.tsv \u003ci\u003elanguage\u003c/i\u003e \u003ci\u003esearch_term\u003c/i\u003e\n\u003c/pre\u003e\n\nAdapt the file names to your case.\n\n## Predicting\n\n### German\n\nPredictions were made for the following transformer models:\n\n- xlm-roberta-base\n- xlm-roberta-large\n- bert-base-german-dbmdz-cased\n- bert-base-german-dbmdz-uncased\n- bert-base-german-cased\n- bert-base-multilingual-uncased\n- bert-base-multilingual-cased\n\n### Dutch\n\nPredictions were made for the following transformer models:\n\n- xlm-roberta-base\n- xlm-roberta-large\n- wietsedv/bert-base-dutch-cased\n- bert-base-multilingual-uncased\n- bert-base-multilingual-cased\n\nTo make a prediction with a single one of these models, run e.g.:\n\n\u003cpre\u003e\npython3 scripts/selective_preferences.py data/\u003ci\u003efiltered_file\u003c/i\u003e.tsv -m xlm-roberta-base -o predictions/\u003ci\u003ename_of_output_file\u003c/i\u003e.json\n\u003c/pre\u003e\n\nIf you want to print the results instead of storing them in a JSON file, remove the `-o` flag and the output file name from the command.\n\nTo make predictions for all models, run:\n\u003cpre\u003e\n./scripts/predict_all_models_pipeline.sh 
data/\u003ci\u003efiltered_file\u003c/i\u003e.tsv \u003ci\u003elanguage\u003c/i\u003e \u003ci\u003esearch_term\u003c/i\u003e\n\u003c/pre\u003e\n\n## Setting up Prediction Statistics\n\nTo set up prediction statistics, run e.g.:\n\n\u003cpre\u003e\npython3 scripts/generate_stats.py data/\u003ci\u003efiltered_file\u003c/i\u003e.tsv \u003ci\u003elanguage\u003c/i\u003e \u003ci\u003esearch_term\u003c/i\u003e\n\u003c/pre\u003e\n\n## Creating Binary Output Files for Subsequent Significance Testing\n\nTo create the binary file for one model, run:\n\n\u003cpre\u003e\npython3 binary_output_per_sentence.py statistics/binary_results/prediction_file.json data/\u003ci\u003efiltered_file\u003c/i\u003e.tsv \u003ci\u003ebinary_output_file\u003c/i\u003e\n\u003c/pre\u003e\n\nYou can also create the binary files for all models at once:\n\n\u003cpre\u003e\n./write_all_binary_files.sh data/\u003ci\u003efiltered_file\u003c/i\u003e.tsv \u003ci\u003elanguage\u003c/i\u003e \u003ci\u003esearch_term\u003c/i\u003e\n\u003c/pre\u003e\n\n## Significance Testing\n\nYou can test whether one model's output is superior to another's by running a significance test. To do this, run:\n\n\u003cpre\u003e\n./scripts/significance_test_all_files.sh \u003ci\u003elanguage\u003c/i\u003e \u003ci\u003esearch_term\u003c/i\u003e\n\u003c/pre\u003e\n\nThe output can be found in the file ```statistics/significance_test_results.tsv```.  \nIt has four columns: model1, model2, p-value, and is_significant.  \nA p-value below 0.05 is conventionally treated as a significant result.  \nAnything above that is considered not significant.\n\nNote that the script ```scripts/testSignificance.py``` has been taken from this repository: https://github.com/rtmdrr/testSignificanceNLP.git\n\nIt has been amended (drastically shortened) for the purposes of this project and only makes use of the Wilcoxon significance test.  \n\nIMPORTANT: The script was written in Python 2.7. 
It can be run with Python 3.8, but if you encounter any problems, try running it with a Python 2.7 interpreter outside of the virtual environment (make sure the library ```scipy``` is installed).\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fdivefish%2Ftransselect","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fdivefish%2Ftransselect","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fdivefish%2Ftransselect/lists"}