{"id":23859299,"url":"https://github.com/strubell/preprocess-conll05","last_synced_at":"2026-04-02T02:03:26.932Z","repository":{"id":69099516,"uuid":"149311751","full_name":"strubell/preprocess-conll05","owner":"strubell","description":"Scripts for preprocessing the CoNLL-2005 SRL dataset.","archived":false,"fork":false,"pushed_at":"2019-03-28T17:32:53.000Z","size":22,"stargazers_count":23,"open_issues_count":3,"forks_count":6,"subscribers_count":3,"default_branch":"master","last_synced_at":"2025-09-08T07:41:03.570Z","etag":null,"topics":["conll-2005","dataset","nlp","nlp-resources","preprocessing","semantic-role-labeling"],"latest_commit_sha":null,"homepage":null,"language":"Shell","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":null,"status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/strubell.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":null,"code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2018-09-18T15:33:38.000Z","updated_at":"2025-01-17T13:16:40.000Z","dependencies_parsed_at":null,"dependency_job_id":"82beb34c-ff1a-4c95-991e-fd9b8533e5b7","html_url":"https://github.com/strubell/preprocess-conll05","commit_stats":null,"previous_names":[],"tags_count":0,"template":false,"template_full_name":null,"purl":"pkg:github/strubell/preprocess-conll05","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/strubell%2Fpreprocess-conll05","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/strubell%2Fpreprocess-conll05/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/strubell%2Fpreprocess-conll05/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories
/strubell%2Fpreprocess-conll05/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/strubell","download_url":"https://codeload.github.com/strubell/preprocess-conll05/tar.gz/refs/heads/master","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/strubell%2Fpreprocess-conll05/sbom","scorecard":null,"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":286080680,"owners_count":31294388,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2026-04-02T01:43:37.129Z","status":"online","status_checked_at":"2026-04-02T02:00:08.535Z","response_time":89,"last_error":null,"robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":true,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["conll-2005","dataset","nlp","nlp-resources","preprocessing","semantic-role-labeling"],"created_at":"2025-01-03T03:35:12.536Z","updated_at":"2026-04-02T02:03:26.911Z","avatar_url":"https://github.com/strubell.png","language":"Shell","funding_links":[],"categories":[],"sub_categories":[],"readme":"# preprocess-conll05\nScripts for preprocessing the CoNLL-2005 SRL dataset.\n\n### Requirements:\n- Python 3\n- Bash\n- A copy of the [Penn TreeBank](https://catalog.ldc.upenn.edu/LDC99T42)\n\n## Basic CoNLL-2005 pre-processing \nThese pre-processing steps download the CoNLL-2005 data and gather gold part-of-speech \nand parse info from your copy of the PTB. 
The output will look like:\n```\nThe         DT    (S(NP-SBJ-1(NP*  *    -   -      (A1*      \neconomy     NN    *                *    -   -      *      \n's          POS   *)               *    -   -      *      \ntemperature NN    *)               *    -   -      *)     \nwill        MD    (VP*             *    -   -      (AM-MOD*)     \nbe          VB    (VP*             *    -   -      *      \ntaken       VBN   (VP*             *    01  take   (V*) \n```\n\n- Field 1: word form\n- Field 2: gold part-of-speech tag\n- Field 3: gold syntax\n- Field 4: placeholder\n- Field 5: verb sense\n- Field 6: predicate (infinitive form)\n- Field 7+: for each predicate, a column representing the labeled arguments of the predicate.\n\nFirst, set up paths to existing data:\n```bash\nexport WSJ=\"/your/path/to/wsj/\"\nexport BROWN=\"/your/path/to/brown\"\n```\n\nDownload CoNLL-2005 data and scripts:\n```bash\n./bin/basic/get_data.sh\n```\n\nExtract pos/parse info from gold data:\n```bash\n./bin/basic/extract_train_from_ptb.sh\n./bin/basic/extract_dev_from_ptb.sh\n./bin/basic/extract_test_from_ptb.sh\n./bin/basic/extract_test_from_brown.sh\n```\n\nFormat into combined output files:\n```bash\n./bin/basic/make-trainset.sh\n./bin/basic/make-devset.sh \n./bin/basic/make-wsj-test.sh\n./bin/basic/make-brown-test.sh \n```\n\n## Further pre-processing (e.g. for [LISA](https://github.com/strubell/LISA))\nSometimes it's nice to convert constituencies to dependency parses and provide automatic\npart-of-speech tags, e.g. if you wish to train a parsing model. BIO format is also a \nmore standard way of representing spans than the default CoNLL-2005 format. This pre-processing\nconverts the constituency parses to Stanford dependencies (v3.5), assigns automatic part-of-speech\ntags from the Stanford left3words tagger, and converts SRL spans to BIO format. 
The output will look like:\n\n```\nconll05 0       0       The         DT      DT      2       det         _       -       -       -       -       O       B-A1\nconll05 0       1       economy     NN      NN      4       poss        _       -       -       -       -       O       I-A1\nconll05 0       2       's          POS     POS     2       possessive  _       -       -       -       -       O       I-A1\nconll05 0       3       temperature NN      NN      7       nsubjpass   _       -       -       -       -       O       I-A1\nconll05 0       4       will        MD      MD      7       aux         _       -       -       -       -       O       B-AM-MOD\nconll05 0       5       be          VB      VB      7       auxpass     _       -       -       -       -       O       O\nconll05 0       6       taken       VBN     VBN     0       root        _       01      take    -       -       O       B-V\n```\n\n- Field 1: domain placeholder\n- Field 2: sentence id\n- Field 3: token id\n- Field 4: word form\n- Field 5: gold part-of-speech tag\n- Field 6: auto part-of-speech tag\n- Field 7: dependency parse head\n- Field 8: dependency parse label\n- Field 9: placeholder\n- Field 10: verb sense\n- Field 11: predicate (infinitive form)\n- Field 12: placeholder\n- Field 13: placeholder\n- Field 14: NER placeholder\n- Field 15+: for each predicate, a column representing the labeled arguments of the predicate.\n\nFirst, set up paths to Stanford parser and part-of-speech tagger:\n```bash\nexport STANFORD_PARSER=\"/your/path/to/stanford-parser-full-2017-06-09\"\nexport STANFORD_POS=\"/your/path/to/stanford-postagger-full-2017-06-09\"\n```\n\nThe following script will then convert dependencies, tag, and reformat the data. This will create a new file in the\n`$CONLL05` directory with the same name as the input and suffix `.parse.sdeps.combined`. 
\nIf `$CONLL05` is not set, you should set it to the `conll05st-release` directory.\n```bash\n./bin/preprocess_conll05_sdeps.sh $CONLL05/train-set.gz\n./bin/preprocess_conll05_sdeps.sh $CONLL05/dev-set.gz\n./bin/preprocess_conll05_sdeps.sh $CONLL05/test.wsj.gz\n./bin/preprocess_conll05_sdeps.sh $CONLL05/test.brown.gz\n```\n\nNow all that remains is to convert fields to BIO format. The following script will create a new file\nin the same directory as the old file with the suffix `.bio`:\n```bash\n./bin/convert-bio.sh $CONLL05/train-set.gz.parse.sdeps.combined\n./bin/convert-bio.sh $CONLL05/dev-set.gz.parse.sdeps.combined\n./bin/convert-bio.sh $CONLL05/test.wsj.gz.parse.sdeps.combined\n./bin/convert-bio.sh $CONLL05/test.brown.gz.parse.sdeps.combined\n```\n\nYou may also want to generate a matrix of transition probabilities for performing Viterbi inference at test time. You\ncan use the following to do so:\n```bash\npython3 bin/compute_transition_probs.py --in_file_name $CONLL05/train-set.gz.parse.sdeps.combined.bio \u003e $CONLL05/transition_probs.tsv\n```\n\n## Pre-processing for evaluation scripts\n\nTo evaluate using the CoNLL `eval.pl` and `srl-eval.pl` scripts, you'll need files in a different\nformat to evaluate against. 
To generate files for parse evaluation (`eval.pl`), use the following script:\n```bash\npython3 bin/eval/extract_conll_parse_file.py --input_file $CONLL05/dev-set.gz.parse.sdeps.combined --id_field 2 --word_field 3 --pos_field 4 --head_field 6 --label_field 7 \u003e $CONLL05/conll2005-dev-gold-parse.txt\n```\n\nFor SRL evaluation, use the following: \n```bash\npython3 bin/eval/extract_conll_prop_file.py --input_file $CONLL05/dev-set.gz.parse.sdeps.combined --take_last --word_field 3 --pred_field 10 --first_prop_field 14 \u003e $CONLL05/conll2005-dev-gold-props.txt\n```\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fstrubell%2Fpreprocess-conll05","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fstrubell%2Fpreprocess-conll05","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fstrubell%2Fpreprocess-conll05/lists"}