https://github.com/miserman/dictionary_builder
Web tool to help build dictionaries
- Host: GitHub
- URL: https://github.com/miserman/dictionary_builder
- Owner: miserman
- License: MIT
- Created: 2023-12-16T22:06:19.000Z (over 2 years ago)
- Default Branch: main
- Last Pushed: 2025-08-10T09:34:03.000Z (11 months ago)
- Last Synced: 2025-08-10T11:37:36.337Z (11 months ago)
- Language: TypeScript
- Homepage: https://miserman.github.io/dictionary_builder/
- Size: 112 MB
- Stars: 0
- Watchers: 1
- Forks: 0
- Open Issues: 0
Metadata Files:
- Readme: README.md
- License: LICENSE
README
React app to help create and analyze text analysis dictionaries.
# Features
**Import** an existing dictionary (such as from [osf.io/y6g5b](https://osf.io/y6g5b/)), or start from scratch.
Imported dictionaries are saved locally with every edit, and the saved copies can be encrypted.
As part of _creation_, you can...
- add fixed or fuzzy (glob or regex) terms,
- add suggested terms based on word-form matching, embeddings-based similarity, or wordnet-based similarity,
- and assign terms a sense and category weights (directly, or automatically based on similarity to a given set of terms).
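The difference between fixed and fuzzy terms can be sketched as follows. This is a minimal illustration, not the app's actual implementation: the `DictionaryTerm` type, the helper names, and the assumption that `*` is the only glob wildcard are all hypothetical.

```typescript
// Illustrative types; the app's real data model may differ.
type TermKind = "fixed" | "glob" | "regex";

interface DictionaryTerm {
  pattern: string;
  kind: TermKind;
  // Hypothetical category weights, e.g. { positive: 1 }.
  weights: Record<string, number>;
}

// Escape regex metacharacters so literal text only matches itself.
function escapeRegExp(text: string): string {
  return text.replace(/[.*+?^${}()|[\]\\]/g, "\\$&");
}

// Turn a term into an anchored, case-insensitive RegExp.
// Assumption: in glob terms, `*` is the only wildcard and matches any run of characters.
function termToRegExp(term: DictionaryTerm): RegExp {
  let body: string;
  if (term.kind === "fixed") {
    body = escapeRegExp(term.pattern);
  } else if (term.kind === "glob") {
    body = term.pattern.split("*").map(escapeRegExp).join(".*");
  } else {
    body = term.pattern;
  }
  return new RegExp(`^(?:${body})$`, "i");
}

// List the words from a vocabulary that a term matches.
function expandTerm(term: DictionaryTerm, vocabulary: string[]): string[] {
  const pattern = termToRegExp(term);
  return vocabulary.filter((word) => pattern.test(word));
}

const vocabulary = ["happy", "happier", "happiness", "unhappy", "sad"];
const fuzzy: DictionaryTerm = { pattern: "happ*", kind: "glob", weights: { positive: 1 } };
console.log(expandTerm(fuzzy, vocabulary)); // the words that start with "happ"
```

Anchoring the pattern at both ends means a glob such as `happ*` matches whole words only, so "unhappy" is excluded unless the term is written as `*happ*`.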
As part of _analysis_, the tool will...
- expand fuzzy terms using a word list extracted from embeddings,
- suggest senses from a wordnet, ranked by similarity to other terms that share a category,
- and calculate similarity to terms within select categories to visualize as a graph.
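Similarity to the terms in a category can be illustrated with cosine similarity over embedding vectors. The sketch below is an assumption about how such a score might be computed, not the app's own code: the function names are hypothetical, the two-dimensional vectors are made up for the example, and averaging over category terms is just one plausible way to aggregate.

```typescript
// Toy embeddings; real vectors would come from a pre-trained space.
type Embeddings = Map<string, number[]>;

// Cosine similarity between two vectors of equal length.
function cosine(a: number[], b: number[]): number {
  let dot = 0;
  let normA = 0;
  let normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  const denom = Math.sqrt(normA) * Math.sqrt(normB);
  return denom === 0 ? 0 : dot / denom;
}

// Mean similarity of a candidate to a category's terms,
// skipping the candidate itself and terms missing from the space.
function categorySimilarity(candidate: string, categoryTerms: string[], space: Embeddings): number {
  const target = space.get(candidate);
  if (!target) return 0;
  let total = 0;
  let count = 0;
  for (const term of categoryTerms) {
    const vector = space.get(term);
    if (term === candidate || !vector) continue;
    total += cosine(target, vector);
    count++;
  }
  return count ? total / count : 0;
}

// Rank candidates from most to least similar to the category.
function rankCandidates(candidates: string[], categoryTerms: string[], space: Embeddings) {
  return candidates
    .map((term) => ({ term, score: categorySimilarity(term, categoryTerms, space) }))
    .sort((a, b) => b.score - a.score);
}

const space: Embeddings = new Map([
  ["happy", [1, 0]],
  ["glad", [0.8, 0.6]],
  ["cheerful", [0.6, 0.8]],
  ["gloomy", [0, 1]],
]);
console.log(rankCandidates(["gloomy", "cheerful"], ["happy", "glad"], space));
```

The same pairwise scores could serve as edge weights when terms are drawn as a graph.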
**Export** dictionaries in common formats, such as those accepted by [lingmatch](https://miserman.github.io/lingmatch/) for processing in R, and [adicat](https://miserman.github.io/adicat/highlight/) for processing in the browser.
# Sources
Term associations come from the pre-trained embedding spaces available at [osf.io/489he](https://osf.io/489he/).
Synsets are from the [Open English WordNet](https://github.com/globalwordnet/english-wordnet), with some added information:
- clusters from a [Coarse Sense Inventory](https://sapienzanlp.github.io/csi/)
- frequencies from an [evaluation framework](http://lcl.uniroma1.it/wsdeval/), which come from [SemCor](https://web.eecs.umich.edu/~mihalcea/downloads.html#semcor) and [OMSTI](https://www.comp.nus.edu.sg/~nlp/corpora.html)
- BabelNet IDs from another [evaluation framework](https://sapienzanlp.github.io/xl-wsd/docs/data/), as mapped from SemCor labels
Additional associated terms are from [ConceptNet](https://github.com/commonsense/conceptnet5).
The [preprocess.R](/preprocess.R) script was used to make the resources from these sources that are used within the app.
Some background to this tool is discussed in [Introduction to Dictionary Creation](https://miserman.github.io/lingmatch/articles/dictionary_creation.html).
# Testing
Tests depend on a running dev server, which can be started with the `test-serve` command:
```sh
npm run test-serve
```
After the site has been compiled, the tests can be run:
```sh
npm run test
```
Note: visiting `http://localhost:3000/dictionary_builder` triggers compilation, so open it once in a browser before running the tests.
See [current coverage](https://miserman.github.io/dictionary_builder/coverage/).