{"id":17985013,"url":"https://github.com/prasoonvarshney/scientific-entity-recognition","last_synced_at":"2025-04-04T02:22:06.476Z","repository":{"id":126477186,"uuid":"550525961","full_name":"prasoonvarshney/scientific-entity-recognition","owner":"prasoonvarshney","description":"End-to-end pipeline for (1) automatic scraping and parsing of NLP research papers, (2) token-level entity annotations in Label Studio, and (3) BERT-based models for span identification and entity recognition","archived":false,"fork":false,"pushed_at":"2022-11-03T02:47:04.000Z","size":3071,"stargazers_count":1,"open_issues_count":0,"forks_count":0,"subscribers_count":2,"default_branch":"main","last_synced_at":"2025-02-09T13:43:57.296Z","etag":null,"topics":["bert","data-annotation","entity-recognition","token-classification"],"latest_commit_sha":null,"homepage":"","language":"Jupyter Notebook","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"gpl-3.0","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/prasoonvarshney.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2022-10-12T23:13:41.000Z","updated_at":"2023-04-30T19:41:36.000Z","dependencies_parsed_at":"2023-06-16T23:00:33.488Z","dependency_job_id":null,"html_url":"https://github.com/prasoonvarshney/scientific-entity-recognition","commit_stats":null,"previous_names":[],"tags_count":0,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/prasoonvarshney%2Fscientific-entity-recognition","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/prasoonvarshney%2Fscientific-entity-recognit
ion/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/prasoonvarshney%2Fscientific-entity-recognition/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/prasoonvarshney%2Fscientific-entity-recognition/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/prasoonvarshney","download_url":"https://codeload.github.com/prasoonvarshney/scientific-entity-recognition/tar.gz/refs/heads/main","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":247108357,"owners_count":20884883,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["bert","data-annotation","entity-recognition","token-classification"],"created_at":"2024-10-29T18:23:42.205Z","updated_at":"2025-04-04T02:22:06.456Z","avatar_url":"https://github.com/prasoonvarshney.png","language":"Jupyter Notebook","funding_links":[],"categories":[],"sub_categories":[],"readme":"# Advanced NLP: From Scratch: Fall 2022\n[nlp-from-scratch-assignment](https://github.com/neubig/nlp-from-scratch-assignment-2022/) for 11-711, Advanced NLP course at Carnegie Mellon University.\n\n## Run Instructions\nTo run the model training and evaluation pipeline, cd into the root folder of the repository and run:\n\n`python code/model_pipeline/sciner.py --model-checkpoint KISTI-AI/scideberta-cs --lr 5e-5 --epochs 10 --weight_decay 1e-6 --batch_size 8`\n\n# Assignment Description\nThe objective is to recognize scientific entities in research papers. \nThis repository contains all the components required for the task: \n1. 
Scraping scripts to fetch research paper PDFs at code/data_collection\n2. Parsing scripts to parse the PDFs at code/data_collection\n3. A collection of 32 manually annotated papers (gold-standard) at data/annotated\n    a. Scripts to split and create train and dev sets are located at data/created_data_train_test_splits\n    b. Held out test set is located at data/test\n4. Model training pipeline that achieves 0.626 F1 on the held-out set.\n\n\n## Directory Tree Structure\n\nadvanced-nlp-f22-hw2\\\n├── code\\\n│   ├── data_collection\\\n│   │   ├── annotation_scripts\\\n│   │   │   ├── bert.conll\\\n│   │   │   ├── bert_copy.json\\\n│   │   │   ├── bert_edited.json\\\n│   │   │   ├── bert.json\\\n│   │   │   ├── bert_min.json\\\n│   │   │   ├── generate_mapping.py\\\n│   │   │   ├── lda_edited.json\\\n│   │   │   ├── lda.json\\\n│   │   │   ├── mapping.json\\\n│   │   │   ├── reverse_mapping.json\\\n│   │   │   └── rule_based.py\\\n│   │   ├── constants.py\\\n│   │   ├── example_scipdf_output.json\\\n│   │   ├── labelstudio_collector.py\\\n│   │   ├── parser.py\\\n│   │   ├── pdf_urls.txt\\\n│   │   ├── random_pdf.py\\\n│   │   ├── sampled_urls.txt\\\n│   │   ├── scrape_all.py\\\n│   │   └── scrape.py\\\n│   ├── model_pipeline\\\n│   │   ├── constants.py\\\n│   │   ├── dataloader.py\\\n│   │   ├── pipeline.py\\\n│   │   └── sciner.py\\\n│   └── notebooks\\\n│       ├── ANLP_NER_BertyBoy.ipynb\\\n│       ├── ANLP_NER_deberta.ipynb\\\n│       ├── ANLP_NER_SCIBERT.ipynb\\\n│       ├── ANLP_NER_scideberta.ipynb\\\n│       └── NER_Model_Pipeline.ipynb\\\n├── data\\\n│   ├── annotated\\\n|   |       (our annotated conll files by paper as exported from Label Studio)\\\n│   ├── created_data_train_test_splits\\\n|   |       (our annotations split into train.conll and test.conll)\\\n│   ├── parsed_pdfs\\\n|   |       (txt files of parsed papers)\\\n│   ├── summary_of_parsed_files.json\\\n│   ├── test\\\n|   |       (held out test set files and our model predictions on 
them)\\\n├── github.txt\\\n├── LICENSE\\\n├── README.md\\\n├── requirements.txt\\\n└── setup.sh\\","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fprasoonvarshney%2Fscientific-entity-recognition","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fprasoonvarshney%2Fscientific-entity-recognition","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fprasoonvarshney%2Fscientific-entity-recognition/lists"}