https://github.com/centre-for-humanities-computing/word-associations

Word associations in Indre Mission and Kirkeligt Samfund

Host: GitHub
URL: https://github.com/centre-for-humanities-computing/word-associations
Owner: centre-for-humanities-computing
License: mit
Created: 2024-05-01T10:02:03.000Z
Default Branch: main
Last Pushed: 2025-03-19T11:19:49.000Z
Last Synced: 2025-09-09T23:59:05.732Z
Language: Python
Size: 345 KB
Stars: 0
Watchers: 0
Forks: 1
Open Issues: 0
Metadata Files:
- Readme: README.md
- License: LICENSE

# word-associations
Word associations in Indre Mission and Kirkeligt Samfund.

## Usage

Folder structure:
```
- dataset/
  - data/
    - pr983_204.txt
    ...
  - Stopord.txt
  - metadata_nordveck.csv
```

Install requirements:
```bash
pip install -r requirements.txt
```

### Preprocessing

Preprocess the corpus (lemmatization, stop word removal, normalization):
```bash
python3 src/clean_texts.py
```

This writes the cleaned corpus to a CSV file with `id`, `text`, and `clean_text` columns:
```
- dataset/
  - clean_data.csv
```
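
As a rough illustration of two of the preprocessing steps named above (normalization and stop word removal), here is a minimal sketch. It is not the code in `src/clean_texts.py`: the `clean_text` function and the tiny `STOP_WORDS` set are hypothetical, and lemmatization is left out because it needs a language model.

```python
import re

# Hypothetical stop word set for illustration only;
# the repository keeps its own list in Stopord.txt.
STOP_WORDS = {"og", "i", "at", "det", "en"}


def clean_text(text, stop_words=STOP_WORDS):
    """Lowercase the text, keep alphabetic tokens, and drop stop words."""
    # [^\W\d_]+ matches runs of Unicode letters, so æ, ø and å are kept.
    tokens = re.findall(r"[^\W\d_]+", text.lower())
    return " ".join(token for token in tokens if token not in stop_words)


example = clean_text("Det er Guds ord, og det er en gave.")
print(example)
```

A real pipeline would add a lemmatization step between tokenization and stop word filtering.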

### Word count
You can use the `src/word_count.py run` CLI to extract the most common words.
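
The idea behind a word count can be sketched in a few lines. This is an assumption about the general approach, not the implementation in `src/word_count.py`; the `most_common_words` function and the sample documents are hypothetical, and the input is assumed to be whitespace-separated cleaned text such as the `clean_text` column.

```python
from collections import Counter


def most_common_words(clean_texts, top_k=10):
    """Count whitespace-separated tokens over all documents."""
    counts = Counter()
    for text in clean_texts:
        counts.update(text.split())
    # most_common returns (word, count) pairs, highest count first.
    return counts.most_common(top_k)


# Made-up cleaned documents, for illustration only.
docs = ["gud ord gave", "gud kirke ord", "gud"]
top_words = most_common_words(docs, top_k=2)
print(top_words)
```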

### Collect collocations

You can use the `src/cooccurrences.py run` CLI to extract the highest-scoring collocations of a target word, ranked by pointwise mutual information (PMI).

#### Arguments

| Argument | Description | Type | Default |
|----------|-------------|------|---------|
| `seed_word` | Seed word to start off from. | str | - |
| `-h`, `--help` | Show help message and exit. | | |
| `--group_by GROUP_BY`, `-g GROUP_BY` | Metadata column to group results by. | str | None |
| `--out_file OUT_FILE`, `-o OUT_FILE` | JSON file to output results to. | str | results/coocurrences.json |
| `--top_k TOP_K`, `-k TOP_K` | Top K ranking cooccurring words to output. | int | 50 |
| `--n_context N_CONTEXT`, `-n N_CONTEXT` | Number of context words to consider in each direction. | int | 5 |
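
To make the roles of `seed_word`, `n_context` and `top_k` concrete, here is a minimal sketch of window-based PMI ranking. It is one possible formulation, not necessarily the one used in `src/cooccurrences.py`: the `top_collocations` function is hypothetical, and PMI is estimated here as log2 of P(word | seed window) divided by P(word).

```python
import math
from collections import Counter


def top_collocations(docs, seed_word, n_context=5, top_k=50):
    """Rank words found within n_context tokens of seed_word by PMI."""
    word_counts = Counter()
    cooc_counts = Counter()
    for doc in docs:
        tokens = doc.split()
        word_counts.update(tokens)
        for i, token in enumerate(tokens):
            if token != seed_word:
                continue
            # Take up to n_context tokens on each side of the seed word.
            left = tokens[max(0, i - n_context):i]
            right = tokens[i + 1:i + 1 + n_context]
            cooc_counts.update(left + right)
    n_tokens = sum(word_counts.values())
    n_pairs = sum(cooc_counts.values())
    scores = {}
    for word, count in cooc_counts.items():
        p_word_given_seed = count / n_pairs
        p_word = word_counts[word] / n_tokens
        scores[word] = math.log2(p_word_given_seed / p_word)
    ranked = sorted(scores.items(), key=lambda item: item[1], reverse=True)
    return ranked[:top_k]


# Made-up cleaned documents, for illustration only.
docs = ["gud naade kirke", "gud naade synd", "kirke synd kirke"]
narrow = top_collocations(docs, "gud", n_context=1)
wide = top_collocations(docs, "gud", n_context=2)
print(narrow)
print(wide)
```

A wider window picks up more candidate words but dilutes the score of each one, which is why `n_context` changes the ranking and not just its length.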
