{"id":17894105,"url":"https://github.com/blainerothrock/nlp-group-2","last_synced_at":"2025-10-13T18:44:37.867Z","repository":{"id":53530760,"uuid":"234200243","full_name":"blainerothrock/nlp-group-2","owner":"blainerothrock","description":null,"archived":false,"fork":false,"pushed_at":"2021-03-25T23:21:32.000Z","size":51607,"stargazers_count":1,"open_issues_count":1,"forks_count":1,"subscribers_count":3,"default_branch":"master","last_synced_at":"2025-04-03T04:29:02.208Z","etag":null,"topics":[],"latest_commit_sha":null,"homepage":null,"language":"Jupyter Notebook","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":null,"status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/blainerothrock.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":null,"code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null}},"created_at":"2020-01-16T00:24:38.000Z","updated_at":"2020-03-28T17:28:12.000Z","dependencies_parsed_at":"2022-09-21T00:52:28.378Z","dependency_job_id":null,"html_url":"https://github.com/blainerothrock/nlp-group-2","commit_stats":null,"previous_names":[],"tags_count":0,"template":false,"template_full_name":null,"purl":"pkg:github/blainerothrock/nlp-group-2","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/blainerothrock%2Fnlp-group-2","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/blainerothrock%2Fnlp-group-2/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/blainerothrock%2Fnlp-group-2/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/blainerothrock%2Fnlp-group-2/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/blainerothrock","download_url":"https://codeload.github.com/blainerothrock/nlp-group-2/tar.gz/refs/heads/master","sb
om_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/blainerothrock%2Fnlp-group-2/sbom","scorecard":null,"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":279016601,"owners_count":26085852,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","status":"online","status_checked_at":"2025-10-13T02:00:06.723Z","response_time":61,"last_error":null,"robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":true,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":[],"created_at":"2024-10-28T15:00:02.641Z","updated_at":"2025-10-13T18:44:37.850Z","avatar_url":"https://github.com/blainerothrock.png","language":"Jupyter Notebook","funding_links":[],"categories":[],"sub_categories":[],"readme":"# nlp-group-2\n\n## To Install\n* Install required dependencies\n    - `pip install -r requirements.txt`\n* Execute main\n    - `python main.py`\n\n## Configurations\nConfigurations/parameters can be found in `constants.py`\n\n## Project Structure To Do:\n- [ ] Configure main method to take parameters based on the task to be run\n    * **For example**: `python main.py build_corpus`\n    \n## Assignment 1 Checklist\n### Task 1\n- [x] Generated raw Wikipedia Text\n- [x] Write a Query that results in ~5 million tokens\n\n### Task 2\n- [x] Strip HTML\n- [x] Added `\u003cs\u003e` \u0026 `\u003c/s\u003e` boundaries\n- [x] Tokenized using `nltk`\n    - [x] Remove punctuation\n    - [x] Remove \"[citation needed]\" - could just adjust current regex code in prepare-corpus branch\n- [x] Generated train, test \u0026 
validation text files\n- [x] Remove tokens with frequency \u003c 3\n- [x] Add punctuation back in!\n\n### Task 3\n- [x] Construct a vocabulary from the training set\n- [x] Replace out-of-vocabulary words in test and validation with `\u003cunk\u003e`\n- [x] Remove all one-character tokens that are not 'a' (see group2.test.txt for examples)\n- [x] Save Python list of vocabulary\n- [x] Save dictionary `{ [WORD] : [IDX] }`\n- [x] Construct integer representation of training, validation and test corpora, save as lists\n- [x] Don't forget to write integer representations to pickle files\n\n### Task 4\n- [x] Insert tags for years\n- [x] Insert tags for real numbers\n- [x] Keep other numbers in as tokens (don't tag or remove them)\n- [x] Insert tags for country name\n- [x] Insert tags for month name\n- [x] Add these tags to the `vocab` list before making integer representation\n- [x] Construct integer representation of training, validation and test corpora, save as lists\n\n### Task 5\n- [x] Prepare statistical summary of corpus\n    - number of tokens\n    - vocabulary size\n        - untagged\n        - tagged corpus\n        - each of the 4 word classes\n\n### Task 6\n- [x] [Summary](https://docs.google.com/document/d/1dFqweNHXq2So4Abm2NIHZwCo0SC6XjwCwIdu5MjohlQ/edit) describing each task, including ambiguity and decisions we made \n    \n## List of Questions for David\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fblainerothrock%2Fnlp-group-2","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fblainerothrock%2Fnlp-group-2","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fblainerothrock%2Fnlp-group-2/lists"}