https://github.com/centre-for-humanities-computing/word-associations
Word associations in Indre Mission and Kirkeligt Samfund
- Host: GitHub
- URL: https://github.com/centre-for-humanities-computing/word-associations
- Owner: centre-for-humanities-computing
- License: mit
- Created: 2024-05-01T10:02:03.000Z (about 2 years ago)
- Default Branch: main
- Last Pushed: 2025-03-19T11:19:49.000Z (over 1 year ago)
- Last Synced: 2025-09-09T23:59:05.732Z (9 months ago)
- Language: Python
- Size: 345 KB
- Stars: 0
- Watchers: 0
- Forks: 1
- Open Issues: 0
Metadata Files:
- Readme: README.md
- License: LICENSE
README
# word-associations
Word associations in Indre Mission and Kirkeligt Samfund.
## Usage
Folder structure:
```
- dataset/
    - data/
        - pr983_204.txt
        ...
    - Stopord.txt
    - metadata_nordveck.csv
```
Install requirements:
```bash
pip install -r requirements.txt
```
### Preprocessing
Preprocess the corpus (lemmatization, stop word removal, normalization):
```bash
python3 src/clean_texts.py
```
This writes the cleaned corpus to a CSV file with `id`, `text` and `clean_text` columns:
```
- dataset/
    - clean_data.csv
```
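To give a rough idea of what the normalization and stop word removal steps do, here is a minimal sketch. It is not the code in `src/clean_texts.py`: the function names and the tiny stop word set are hypothetical, lemmatization is left out because it needs a language model, and the real stop word list is assumed to come from `Stopord.txt`.

```python
import re


def normalize(text: str) -> list[str]:
    """Lowercase the text and split it into alphabetic tokens."""
    # [^\W\d_]+ matches runs of Unicode letters (including æ, ø, å),
    # so punctuation, digits and underscores are dropped.
    return re.findall(r"[^\W\d_]+", text.lower())


def remove_stop_words(tokens: list[str], stop_words: set[str]) -> list[str]:
    """Drop every token that appears in the stop word set."""
    return [token for token in tokens if token not in stop_words]


def clean_text(text: str, stop_words: set[str]) -> str:
    """Normalize a text and remove stop words (no lemmatization here)."""
    return " ".join(remove_stop_words(normalize(text), stop_words))


# Hypothetical stop words, for illustration only.
stop_words = {"og", "i", "er"}
print(clean_text("Kirken og Troen er i Danmark!", stop_words))
# kirken troen danmark
```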
### Word count
You can use the `src/word_count.py run` CLI to extract the most common words.
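As an illustration of the idea, counting the most common words over cleaned texts could look like the sketch below. It assumes whitespace-separated tokens such as those in the `clean_text` column; the function name `most_common_words` is hypothetical and the actual script may work differently.

```python
from collections import Counter


def most_common_words(clean_texts, top_k=10):
    """Return the top_k (word, count) pairs across all cleaned texts."""
    counts = Counter()
    for text in clean_texts:
        # Cleaned texts are assumed to be whitespace-separated tokens.
        counts.update(text.split())
    return counts.most_common(top_k)


docs = ["kirke tro kirke", "tro mission kirke"]
print(most_common_words(docs, top_k=2))
# [('kirke', 3), ('tro', 2)]
```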
### Collect collocations
You can use the `src/cooccurrences.py run` CLI to extract the highest-scoring collocations of a target word, ranked by pointwise mutual information (PMI).
#### Arguments
| Argument | Description | Type | Default |
|----------|-------------|------|---------|
| `seed_word` | Seed word to start off from. | str | - |
| `-h`, `--help` | Show help message and exit. | | |
| `--group_by GROUP_BY`, `-g GROUP_BY` | Metadata column to group results by. | str | None |
| `--out_file OUT_FILE`, `-o OUT_FILE` | JSON file to output results to. | str | results/coocurrences.json |
| `--top_k TOP_K`, `-k TOP_K` | Top K ranking cooccurring words to output. | int | 50 |
| `--n_context N_CONTEXT`, `-n N_CONTEXT` | Number of context words to consider in each direction. | int | 5 |
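To show how `seed_word`, `n_context` and `top_k` interact, here is a minimal sketch of window-based PMI ranking. It is an assumption about the general technique, not the implementation in `src/cooccurrences.py`: the function name `pmi_collocations` is hypothetical, and it uses the simple estimate PMI(w, seed) = log2(c(w, seed) * N / (c(w) * c(seed))), where N is the total number of tokens.

```python
import math
from collections import Counter


def pmi_collocations(documents, seed_word, n_context=5, top_k=50):
    """Rank words co-occurring with seed_word by PMI.

    documents: list of token lists.
    n_context: number of context words considered in each direction.
    """
    word_counts = Counter()
    pair_counts = Counter()
    for tokens in documents:
        word_counts.update(tokens)
        for i, token in enumerate(tokens):
            if token != seed_word:
                continue
            # Collect up to n_context tokens on each side of the seed word.
            left = tokens[max(0, i - n_context):i]
            right = tokens[i + 1:i + 1 + n_context]
            pair_counts.update(left + right)

    total = sum(word_counts.values())
    seed_count = word_counts[seed_word]
    if seed_count == 0:
        return []  # The seed word never occurs in the corpus.

    scores = {
        word: math.log2(joint * total / (word_counts[word] * seed_count))
        for word, joint in pair_counts.items()
    }
    ranked = sorted(scores.items(), key=lambda item: item[1], reverse=True)
    return ranked[:top_k]


docs = [["gud", "er", "god"], ["gud", "er", "stor"], ["kirke", "er", "stor"]]
for word, score in pmi_collocations(docs, "gud", n_context=2, top_k=3):
    print(word, round(score, 2))
```

In this toy corpus, `god` ranks first because it only ever appears next to the seed word, while `er` and `stor` also occur elsewhere. PMI tends to favour rare words in this way, so real pipelines often apply a minimum frequency threshold.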