{"id":15014059,"url":"https://github.com/explosion/princetondh","last_synced_at":"2025-10-19T14:31:56.964Z","repository":{"id":152188178,"uuid":"611817347","full_name":"explosion/princetondh","owner":"explosion","description":"Code for our presentation in Princeton DH 2023 April.","archived":false,"fork":false,"pushed_at":"2023-04-26T10:34:14.000Z","size":35,"stargazers_count":4,"open_issues_count":0,"forks_count":2,"subscribers_count":3,"default_branch":"master","last_synced_at":"2025-01-29T18:38:17.872Z","etag":null,"topics":[],"latest_commit_sha":null,"homepage":null,"language":"Jupyter Notebook","has_issues":false,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":null,"status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/explosion.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":null,"code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2023-03-09T15:54:01.000Z","updated_at":"2023-12-19T09:24:10.000Z","dependencies_parsed_at":null,"dependency_job_id":"83b6e76d-dd0c-49c5-8b71-0e7669842289","html_url":"https://github.com/explosion/princetondh","commit_stats":{"total_commits":24,"total_committers":2,"mean_commits":12.0,"dds":0.04166666666666663,"last_synced_commit":"36295c45fb8938d638731eadf55e0b7ab8341f49"},"previous_names":[],"tags_count":0,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/explosion%2Fprincetondh","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/explosion%2Fprincetondh/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/explosion%2Fprincetondh/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/explosi
on%2Fprincetondh/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/explosion","download_url":"https://codeload.github.com/explosion/princetondh/tar.gz/refs/heads/master","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":237152776,"owners_count":19263780,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":[],"created_at":"2024-09-24T19:45:08.155Z","updated_at":"2025-10-19T14:31:56.607Z","avatar_url":"https://github.com/explosion.png","language":"Jupyter Notebook","readme":"# Hi there\n\n\nWelcome, Digital Humanists! This is the repo where you can find\nall code used for the Princeton University online workshop titled\n\"spaCy: A Python Library for Natural Language Processing\", held\nTue, Apr 4, 2023, 4:30 PM – 6 PM EDT (GMT-4). We hope those of you who participated\nhad a good time, and for everyone else, we hope you'll find this repository useful!\n\n\nP.S.: We use the [radicli](https://github.com/explosion/radicli) library for creating command line\ninterfaces, which we will be using in spaCy soon, as well as in the VS Code plugin for spaCy, which\nis also coming soon!\n\n\n## Notebooks\n\nThe [`notebooks`](notebooks) directory has three Jupyter notebooks:\n\n1. [`intro_to_spacy.ipynb`](notebooks/intro_to_spacy.ipynb) is a short introduction to working with spaCy and a whirlwind tour of many of the tools spaCy provides.\n2. 
[`casestudy_1.ipynb`](notebooks/casestudy_1.ipynb) walks through building a pipeline to extract information from restaurant reviews by identifying spans of interest such as mentions of cuisines or ratings. The pipeline is a blend of rule-based and learning-based techniques, and there is an exercise to build your own rules.\n3. [`casestudy_2.ipynb`](notebooks/casestudy_2.ipynb) focuses only on learned pipelines and the various tools spaCy provides to find spans in texts. It runs some parts of the [`litbank_pipeline`](litbank_pipeline) project.\n\n\n## LitBank pipeline\n\nThe [LitBank dataset](https://github.com/dbamman/litbank/) is a collection of 100 works of fiction\npublicly available from Project Gutenberg, the majority of which were published between 1852 and 1911.\nEach document is approximately the first 2000 words of the novel, leading to a total of\n210532 tokens in the entire data set.\n\nThe [`litbank_pipeline`](litbank_pipeline) project downloads LitBank and trains models on the Named Entity and Event\nannotations. To learn about the entity annotations, please check out\n[this paper](https://people.ischool.berkeley.edu/~dbamman/pubs/pdf/naacl2019_literary_entities.pdf),\nand see [this one](https://aclanthology.org/P19-1353.pdf) for the event annotations.\n\nMost config files in the [`litbank_pipeline/configs`](litbank_pipeline/configs) directory\nwere generated with an appropriate\n[`init config`](https://spacy.io/api/cli#init-config) command.\n\nThe preprocessing commands are in\n[`litbank_pipeline/scripts/prepare.py`](litbank_pipeline/scripts/prepare.py).\nFor event trigger detection we wrote a special scoring function that computes the\nprecision, recall and F1 score only for the positive class, i.e. the tokens that\nhave the `EVENT` label. 
You can find the scorer in [`litbank_pipeline/scripts/positive_tagger_scorer.py`](litbank_pipeline/scripts/positive_tagger_scorer.py).\n\n\n\nFor the named entity recognition tasks, there are config files to train\n[`ner`](https://spacy.io/api/entityrecognizer), [`spancat` or `spancat_singlelabel`](https://spacy.io/api/spancategorizer) components with either the default\n[Convolutional Network](https://spacy.io/api/architectures#MaxoutWindowEncoder)\nor a [Recurrent Network](https://spacy.io/api/architectures#TorchBiLSTMEncoder) encoder.\n\nThe `ner` component does only a single left-to-right pass over the document to find\nall entities, while `spancat` classifies each possible span. This means that `ner` is\nmuch more efficient than `spancat`, but `spancat` is more flexible. For a comparison between\nthe two, check out this [blog post](https://explosion.ai/blog/spancat).\n\n\n## Homework\n\nAs an exercise to get more familiar with spaCy, we recommend training the different\narchitectures with the different encoders and seeing how they compare in terms of accuracy, speed,\nand the kinds of mistakes they make.
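\n\nThe encoder is selected in the `[components.tok2vec.model.encode]` block of a training config. As a minimal, illustrative sketch (the values below are common defaults, not values taken from this repo's configs), the default CNN encoder looks roughly like:\n\n```ini\n[components.tok2vec.model.encode]\n@architectures = "spacy.MaxoutWindowEncoder.v2"\nwidth = 96\ndepth = 4\nwindow_size = 1\nmaxout_pieces = 3\n```\n\nTo try the recurrent encoder instead (it requires PyTorch), replace the block with `@architectures = "spacy.TorchBiLSTMEncoder.v1"` and its `width`, `depth` and `dropout` settings, then re-run [`spacy train`](https://spacy.io/api/cli#train) with each config to compare.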
\n\nWe also think it would be a useful exercise to train a pipeline that has a single\n`tok2vec` component providing representations both to a `tagger` component for\nevent detection and to a `ner`, `spancat` or `spancat_singlelabel` component for entity recognition.\nTo learn more about shared `tok2vec` layers, please check out https://spacy.io/usage/embeddings-transformers#embedding-layers.\n\n\n## References\n\n- [An annotated dataset of literary entities](https://aclanthology.org/N19-1220) (Bamman et al., NAACL 2019)\n- [Literary Event Detection](https://aclanthology.org/P19-1353) (Sims et al., ACL 2019)\n","funding_links":[],"categories":[],"sub_categories":[],"project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fexplosion%2Fprincetondh","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fexplosion%2Fprincetondh","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fexplosion%2Fprincetondh/lists"}