https://github.com/x-tabdeveloping/language-analytics-assignment1
First assignment for language analytics course.
- Host: GitHub
- URL: https://github.com/x-tabdeveloping/language-analytics-assignment1
- Owner: x-tabdeveloping
- License: MIT
- Created: 2024-02-15T09:42:03.000Z (almost 2 years ago)
- Default Branch: main
- Last Pushed: 2024-05-10T12:03:40.000Z (almost 2 years ago)
- Last Synced: 2025-02-08T17:14:03.374Z (about 1 year ago)
- Language: Python
- Size: 197 KB
- Stars: 0
- Watchers: 1
- Forks: 0
- Open Issues: 0
Metadata Files:
- Readme: README.md
- License: LICENSE
# language-analytics-assignment1
First assignment for language analytics course.
The assignment is about extracting POS-tag and NER data from the Uppsala Student English Corpus using the spaCy NLP framework.
The data can be downloaded from the [official website](https://ota.bodleian.ox.ac.uk/repository/xmlui/handle/20.500.12024/2457).
## Setup
The corpus needs to be placed in the `data/` folder, where the `USEcorpus` folder should contain all the subcorpora in its subfolders. The file hierarchy should follow this structure:
```
- data
- USEcorpus
- a1
- 1011.a1.txt
...
- 5031.a1.txt
...
- c1
```
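Under this layout, each subdirectory of `USEcorpus` is one subcorpus and each `.txt` file is one document. A minimal `pathlib` sketch of how the hierarchy could be traversed (the snippet builds a tiny mock corpus in a temporary directory so it is self-contained; the file names are illustrative, not the script's actual logic):

```python
import tempfile
from pathlib import Path

# Build a tiny mock of the expected hierarchy so this sketch runs
# without the real corpus (file names here are illustrative only).
root = Path(tempfile.mkdtemp())
for sub, files in {"a1": ["1011.a1.txt"], "c1": ["2001.c1.txt"]}.items():
    d = root / "data" / "USEcorpus" / sub
    d.mkdir(parents=True)
    for name in files:
        (d / name).write_text("sample text")

# Map each subcorpus name to the sorted list of its document files.
corpus_dir = root / "data" / "USEcorpus"
subcorpora = {d.name: sorted(d.glob("*.txt")) for d in sorted(corpus_dir.iterdir())}
print({name: [f.name for f in files] for name, files in subcorpora.items()})
```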
Install the requirements of the scripts:
```bash
pip install -r requirements.txt
```
## Usage
Run the script:
```bash
python3 src/run_analysis.py
```
This will produce one `.csv` file per subcorpus in the `output/` folder:
```
- output
- a1.csv
...
- c1.csv
```
Every row of the tables contains the results for one file in the corpus: relative frequencies of UPOS tags per 10,000 words and the number of unique named entities per category.
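As a rough sketch of how such a row could be computed (the function and input shapes here are hypothetical, not taken from the script):

```python
from collections import Counter

def summarize(pos_tags, entities, per=10_000):
    """Relative UPOS frequencies per `per` words plus unique-entity counts.

    `pos_tags` is a list of UPOS tag strings for one document;
    `entities` is a list of (label, text) pairs from NER.
    """
    total = len(pos_tags)
    # Relative frequency of each UPOS tag, scaled to occurrences per 10,000 words.
    rel_freq = {tag: count / total * per for tag, count in Counter(pos_tags).items()}
    # Count *unique* entity strings per NER category.
    unique_ents = {}
    for label, text in entities:
        unique_ents.setdefault(label, set()).add(text)
    return rel_freq, {label: len(texts) for label, texts in unique_ents.items()}

rel, ents = summarize(
    ["NOUN", "VERB", "NOUN", "ADJ"],
    [("PERSON", "Alice"), ("PERSON", "Alice"), ("GPE", "Uppsala")],
)
# "Alice" appears twice but is counted once per category.
```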
> Additionally, the script produces a CSV file with the CO2 emissions of the subtasks in the code (`emissions/`).
> This is needed for Assignment 5 and is not directly relevant to this assignment.
> Note: the `emissions/emissions.csv` file should be ignored, because codecarbon cannot track process-level and task-level emissions at the same time.
## Potential Limitations
The code in this repository uses the `en_core_web_sm` spaCy model. Results are likely to be slightly inaccurate, as this is not the most performant of the English spaCy models. A transformer-based pipeline would likely outperform it at both POS tagging and named entity recognition.
Efficiency could also be improved by disabling unnecessary components in the pipeline, such as the parser or the lemmatizer.