https://github.com/x-tabdeveloping/language-analytics-assignment1
First assignment for language analytics course.
- Host: GitHub
- URL: https://github.com/x-tabdeveloping/language-analytics-assignment1
- Owner: x-tabdeveloping
- License: MIT
- Created: 2024-02-15T09:42:03.000Z (almost 2 years ago)
- Default Branch: main
- Last Pushed: 2024-05-10T12:03:40.000Z (almost 2 years ago)
- Last Synced: 2025-02-08T17:14:03.374Z (about 1 year ago)
- Language: Python
- Size: 197 KB
- Stars: 0
- Watchers: 1
- Forks: 0
- Open Issues: 0
Metadata Files:
- Readme: README.md
- License: LICENSE
# language-analytics-assignment1
First assignment for language analytics course.
The assignment is about extracting POS-tag and NER data from the Uppsala Student English Corpus using the spaCy NLP framework.
The data can be downloaded from the [official website](https://ota.bodleian.ox.ac.uk/repository/xmlui/handle/20.500.12024/2457).
## Setup
The corpus needs to be placed in the `data/` folder, where the `USEcorpus` folder should contain all the subcorpora in its subfolders. The file hierarchy should follow this structure:
```
- data
- USEcorpus
- a1
- 1011.a1.txt
...
- 5031.a1.txt
...
- c1
```
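Under this layout, each subdirectory of `USEcorpus` is one subcorpus and each `.txt` file is one document. A minimal `pathlib` sketch of how the hierarchy could be traversed (the snippet builds a tiny mock corpus in a temporary directory so it is self-contained; the file names are illustrative, not the script's actual logic):

```python
import tempfile
from pathlib import Path

# Build a tiny mock of the expected hierarchy so this sketch runs
# without the real corpus (file names here are illustrative only).
root = Path(tempfile.mkdtemp())
for sub, files in {"a1": ["1011.a1.txt"], "c1": ["2001.c1.txt"]}.items():
    d = root / "data" / "USEcorpus" / sub
    d.mkdir(parents=True)
    for name in files:
        (d / name).write_text("sample text")

# Map each subcorpus name to the sorted list of its document files.
corpus_dir = root / "data" / "USEcorpus"
subcorpora = {d.name: sorted(d.glob("*.txt")) for d in sorted(corpus_dir.iterdir())}
print({name: [f.name for f in files] for name, files in subcorpora.items()})
```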
Install the requirements of the scripts:
```bash
pip install -r requirements.txt
```
## Usage
Run the script:
```bash
python3 src/run_analysis.py
```
This will produce one `.csv` file per subcorpus in the `output/` folder:
```
- output
- a1.csv
...
- c1.csv
```
Every row of the tables contains the results for one file in the corpus: relative frequencies of UPOS tags per 10,000 words and the number of unique named entities per category.
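As a rough sketch of how such a row could be computed (the function and input shapes here are hypothetical, not taken from the script):

```python
from collections import Counter

def summarize(pos_tags, entities, per=10_000):
    """Relative UPOS frequencies per `per` words plus unique-entity counts.

    `pos_tags` is a list of UPOS tag strings for one document;
    `entities` is a list of (label, text) pairs from NER.
    """
    total = len(pos_tags)
    # Relative frequency of each UPOS tag, scaled to occurrences per 10,000 words.
    rel_freq = {tag: count / total * per for tag, count in Counter(pos_tags).items()}
    # Count *unique* entity strings per NER category.
    unique_ents = {}
    for label, text in entities:
        unique_ents.setdefault(label, set()).add(text)
    return rel_freq, {label: len(texts) for label, texts in unique_ents.items()}

rel, ents = summarize(
    ["NOUN", "VERB", "NOUN", "ADJ"],
    [("PERSON", "Alice"), ("PERSON", "Alice"), ("GPE", "Uppsala")],
)
# "Alice" appears twice but is counted once per category.
```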
> Additionally, the script produces a CSV file with the CO2 emissions of the subtasks in the code (`emissions/`).
> This is needed for Assignment 5 and is not directly relevant to this assignment.
> Note: the `emissions/emissions.csv` file should be ignored, because codecarbon cannot track process-level and task-level emissions at the same time.
## Potential Limitations
The code in this repository uses the `en_core_web_sm` spaCy model. Results are likely to be slightly inaccurate, as this is not the most performant of the English spaCy models. A transformer-based pipeline would likely outperform it at both POS tagging and named entity recognition.
Efficiency could also be improved by disabling unnecessary components in the pipeline, such as the parser or the lemmatizer.