https://github.com/centre-for-humanities-computing/word-associations


# word-associations
Word associations in Indre Mission and Kirkeligt Samfund.

## Usage

Folder structure:
```
- dataset/
    - data/
        - pr983_204.txt
        ...
    - Stopord.txt
    - metadata_nordveck.csv
```

Install requirements:
```bash
pip install -r requirements.txt
```

### Preprocessing

Preprocess the corpus (lemmatization, stop word removal, normalization):
```bash
python3 src/clean_texts.py
```

This writes the cleaned corpus to a CSV file with `id`, `text` and `clean_text` columns:
```
- dataset/
    - clean_data.csv
```
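
The normalization and stop word removal steps could look roughly like the sketch below. It is not the code in `src/clean_texts.py`: the `clean_text` function and the inline stop word set are hypothetical, and lemmatization is left out because it depends on an external language model.

```python
import re


def clean_text(text: str, stop_words: set) -> str:
    """Lowercase the text, keep letter-only tokens, and drop stop words."""
    # [^\W\d_]+ matches runs of Unicode letters (including æ, ø, å),
    # so digits and punctuation are discarded.
    tokens = re.findall(r"[^\W\d_]+", text.lower())
    return " ".join(tok for tok in tokens if tok not in stop_words)


# Illustrative stop words only; the real list would be read from Stopord.txt.
stop_words = {"og", "i", "det"}
print(clean_text("Det er Guds ord, og det står fast i 1890.", stop_words))
# → er guds ord står fast
```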

### Word count
You can use the `src/word_count.py run` CLI to extract the most common words.
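
As a rough illustration of what counting the most common words involves, the sketch below tallies whitespace-separated tokens from the `clean_text` column. The `most_common_words` helper and the in-memory sample are assumptions made for this example, not the actual implementation of `src/word_count.py`.

```python
import csv
import io
from collections import Counter


def most_common_words(rows, top_k=10):
    """Count tokens in the clean_text column and return the top_k (word, count) pairs."""
    counts = Counter()
    for row in rows:
        counts.update(row["clean_text"].split())
    return counts.most_common(top_k)


# Tiny in-memory stand-in for dataset/clean_data.csv (made-up rows).
sample = io.StringIO(
    "id,text,clean_text\n"
    "1,Guds ord,gud ord\n"
    "2,Ordet er Guds,ord gud\n"
    "3,Ordet,ord\n"
)
print(most_common_words(csv.DictReader(sample), top_k=2))
# → [('ord', 3), ('gud', 2)]
```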

### Collect collocations

You can use the `src/cooccurrences.py run` CLI to extract the highest-scoring collocations of a target word, ranked by pointwise mutual information (PMI).

#### Arguments

| Argument | Description | Type | Default |
|----------|-------------|------|---------|
| `seed_word` | Seed word to start off from. | str | - |
| `-h`, `--help` | Show help message and exit. | | |
| `--group_by GROUP_BY`, `-g GROUP_BY` | Metadata column to group results by. | str | None |
| `--out_file OUT_FILE`, `-o OUT_FILE` | JSON file to output results to. | str | results/coocurrences.json |
| `--top_k TOP_K`, `-k TOP_K` | Top K ranking cooccurring words to output. | int | 50 |
| `--n_context N_CONTEXT`, `-n N_CONTEXT` | Number of context words to consider in each direction. | int | 5 |
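
To make the roles of `seed_word`, `n_context` and `top_k` concrete, here is a minimal sketch of PMI-based collocation ranking. It assumes one common estimate, PMI(seed, w) = log2(P(w | within the seed's window) / P(w)), and a hypothetical `pmi_collocations` function. The script's actual scoring, smoothing and output format may differ.

```python
import math
from collections import Counter


def pmi_collocations(docs, seed_word, n_context=5, top_k=50):
    """Rank words that co-occur with seed_word within n_context tokens by PMI."""
    word_counts = Counter()   # corpus-wide token frequencies
    pair_counts = Counter()   # frequencies inside the seed word's context windows
    total = 0
    for doc in docs:
        tokens = doc.split()
        word_counts.update(tokens)
        total += len(tokens)
        for i, tok in enumerate(tokens):
            if tok != seed_word:
                continue
            # n_context tokens to the left and to the right of the seed word
            window = tokens[max(0, i - n_context):i] + tokens[i + 1:i + 1 + n_context]
            pair_counts.update(w for w in window if w != seed_word)
    n_pairs = sum(pair_counts.values())
    if n_pairs == 0:
        return []
    scores = {
        word: math.log2((joint / n_pairs) / (word_counts[word] / total))
        for word, joint in pair_counts.items()
    }
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)[:top_k]


# Made-up, already cleaned documents for illustration.
docs = ["gud nåde ord", "gud nåde", "ord ord ord"]
for word, score in pmi_collocations(docs, "gud", n_context=2, top_k=5):
    print(f"{word}\t{score:.3f}")
```

A word that appears near the seed word more often than its overall frequency would predict gets a positive score. Raw PMI tends to favour rare words, so real pipelines often add a minimum frequency threshold.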