{"id":15912950,"url":"https://github.com/x-tabdeveloping/language-analytics-assignment1","last_synced_at":"2025-07-28T14:40:32.359Z","repository":{"id":222636851,"uuid":"757935204","full_name":"x-tabdeveloping/language-analytics-assignment1","owner":"x-tabdeveloping","description":"First assignment for language analytics course.","archived":false,"fork":false,"pushed_at":"2024-05-10T12:03:40.000Z","size":202,"stargazers_count":0,"open_issues_count":0,"forks_count":0,"subscribers_count":1,"default_branch":"main","last_synced_at":"2025-02-08T17:14:03.374Z","etag":null,"topics":[],"latest_commit_sha":null,"homepage":null,"language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/x-tabdeveloping.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2024-02-15T09:42:03.000Z","updated_at":"2024-05-10T12:03:43.000Z","dependencies_parsed_at":"2024-05-02T08:35:44.707Z","dependency_job_id":"467ef1a2-b3cb-469e-a56a-06f8ae5bb09d","html_url":"https://github.com/x-tabdeveloping/language-analytics-assignment1","commit_stats":null,"previous_names":["x-tabdeveloping/language-analytics-assignment1"],"tags_count":0,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/x-tabdeveloping%2Flanguage-analytics-assignment1","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/x-tabdeveloping%2Flanguage-analytics-assignment1/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/x-tabdeveloping%2Flanguage-analytics-assignment1/releases","manifests
_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/x-tabdeveloping%2Flanguage-analytics-assignment1/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/x-tabdeveloping","download_url":"https://codeload.github.com/x-tabdeveloping/language-analytics-assignment1/tar.gz/refs/heads/main","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":246927843,"owners_count":20856198,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":[],"created_at":"2024-10-06T16:21:57.743Z","updated_at":"2025-04-03T03:15:56.452Z","avatar_url":"https://github.com/x-tabdeveloping.png","language":"Python","readme":"# language-analytics-assignment1\nFirst assignment for language analytics course.\n\nThe assignment is about extracting POS tag and NER data from the Uppsala Student English Corpus using the SpaCy NLP framework.\nThe data can be downloaded from the [official website](https://ota.bodleian.ox.ac.uk/repository/xmlui/handle/20.500.12024/2457).\n\n## Setup:\n\nThe corpus needs to be in the `data/` folder, where the USEcorpus folder should contain all the subcorpora in its subfolders:\n\nThe file hierarchy should follow this structure:\n```\n- data\n  - USEcorpus\n    - a1\n      - 1011.a1.txt\n        ...\n      - 5031.a1.txt\n    ...\n    - c1\n```\n\nInstall the requirements of the scripts:\n\n```bash\npip install -r requirements.txt\n```\n\n## Usage\n\nRun the script:\n\n```bash\npython3 src/run_analysis.py\n```\n\nThis will produce a bunch of `.csv` files in the `output/` folder for each subcorpus.\n\n```\n- output\n  - 
a1.csv\n  ...\n  - c1.csv\n```\n\nEvery row of the tables contains the results for one file in the corpus: the relative frequencies of UPOS tags per 10000 words and the number of unique named entities per category.\n\n\u003e Additionally, the script will produce a CSV file with the CO2 emissions of the subtasks in the code (`emissions/`).\n\u003e This is necessary for Assignment 5, and is not directly relevant to this assignment.\n\n\u003e Note: The `emissions/emissions.csv` file should be ignored, because codecarbon can't track process and task emissions at the same time.\n\n## Potential Limitations\n\nThe code in this repository utilizes the `en_core_web_sm` SpaCy model. Results are likely to be slightly inaccurate, as this model is not the most performant out of all English SpaCy models. A transformer-based pipeline would likely outperform this model at POS tagging and named entity recognition.\nEfficiency could also be improved by disabling unnecessary components in the pipeline, such as the parser or the lemmatizer.\n","funding_links":[],"categories":[],"sub_categories":[],"project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fx-tabdeveloping%2Flanguage-analytics-assignment1","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fx-tabdeveloping%2Flanguage-analytics-assignment1","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fx-tabdeveloping%2Flanguage-analytics-assignment1/lists"}