https://github.com/jakesteam/wordnettosqlite
Python script for converting WordNet into a SQLite database with types and definitions
data dictionary python sqlite wordnet
- Host: GitHub
- URL: https://github.com/jakesteam/wordnettosqlite
- Owner: JakeSteam
- License: other
- Created: 2024-12-14T21:25:31.000Z
- Default Branch: main
- Last Pushed: 2024-12-14T22:46:46.000Z
- Last Synced: 2024-12-14T23:19:19.391Z
- Topics: data, dictionary, python, sqlite, wordnet
- Language: Python
- Homepage:
- Size: 9.71 MB
- Stars: 0
- Watchers: 1
- Forks: 0
- Open Issues: 0
Metadata Files:
- Readme: README.md
- License: LICENSE
# WordNet to SQLite
This repo provides a Python script to convert WordNet's word data (`/wordnet-data`) into a SQLite database (`words.db`) with 70,433 unique combinations of word & type, with structure:
```
words (word TEXT, type TEXT, definitions TEXT)
```

The intended purpose is for a word game, so non-words, proper nouns, and profanity have been removed where possible.
## Sample contents
Rows where `word` contains `article`, in alphabetical order:
| word | type | definitions |
| :------------ | :-------- | :----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
| antiparticle | noun | a particle that has the same mass as another particle but has opposite values for its other properties |
| article | noun | one of a class of artifacts#nonfictional prose forming an independent part of a publication#(grammar) a determiner that may indicate the specificity of reference of a noun phrase#a separate section of a legal document (as a statute or contract or will) |
| article | verb | bind by a contract |
| articled | adjective | bound by contract |
| particle | noun | a function word that can be used in English to form phrasal verbs#a body having finite mass and internal structure but negligible dimensions#(nontechnical usage) a tiny piece of anything |
| quasiparticle | noun | a quantum of energy (in a crystal lattice or other system) that has position and momentum and can in some respects be regarded as a particle |

## Notes on contents
Definitions for the same word and `type` are combined into a single row, with a `#` between definitions (e.g. the noun `article` has four definitions in one row, while the verb `article` is a separate row).
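As a rough illustration of reading the combined definitions back out, the sketch below builds an in-memory stand-in for `words.db` from two rows of the sample table and splits `definitions` on `#`. The `lookup` helper is hypothetical (it is not part of this repo), and it assumes no individual definition contains a `#` of its own.

```python
import sqlite3

# In-memory stand-in for words.db, using the schema and two rows from the sample table.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE words (word TEXT, type TEXT, definitions TEXT)")
conn.executemany(
    "INSERT INTO words VALUES (?, ?, ?)",
    [
        (
            "article",
            "noun",
            "one of a class of artifacts"
            "#nonfictional prose forming an independent part of a publication"
            "#(grammar) a determiner that may indicate the specificity of reference of a noun phrase"
            "#a separate section of a legal document (as a statute or contract or will)",
        ),
        ("article", "verb", "bind by a contract"),
    ],
)


def lookup(conn, word):
    """Hypothetical helper: return {type: [definition, ...]} for a word."""
    rows = conn.execute(
        "SELECT type, definitions FROM words WHERE word = ? ORDER BY type", (word,)
    )
    # Assumption: "#" only ever appears as the separator between definitions.
    return {word_type: definitions.split("#") for word_type, definitions in rows}


result = lookup(conn, "article")
print(len(result["noun"]), result["verb"])  # 4 ['bind by a contract']
```

To run this against the real database, replace `":memory:"` with the path to `words.db` and drop the `CREATE TABLE` / `INSERT` statements.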
- `word`:
  - Any words with uppercase letters (e.g. proper nouns) are removed.
  - Any 1 character words are removed.
  - Any words with numbers are removed.
  - Any words with other characters (apostrophes, spaces) are removed.
  - Most profane words (133) are removed.
  - Roman numerals are removed (e.g. `XVII`).
- `type`:
  - Always `adjective` / `adverb` / `noun` / `verb`.
- `definitions`:
  - Definition of the word. Contains multiple definitions separated by `#` if the word appears as a synonym for another word.
  - Most profane definitions (385) are replaced with empty space.
  - May contain bracketed usage information, e.g. `(dated)`.
  - May contain special characters like `'`, `$`, `!`, `<`, `[`, etc.

Profanity removal (90% of the processing time) is performed using [better_profanity 0.6.1](https://github.com/snguyenthanh/better_profanity) (with a whitelist for the word "horny", which is only used in a lizard context). This isn't perfect for biological words, but works quite well on the higher priority slurs. A full list of removed words and definitions is available in [removed-data.txt](/notes/removed-data.txt).
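The non-profanity `word` rules listed above can be approximated with a single pattern. This is a minimal sketch, not necessarily how `wordnet-to-sqlite.py` does it: the `is_valid_word` helper and the candidate list are made up for illustration, and it assumes that "other characters" means anything outside lowercase `a`-`z`. The profanity step is left out.

```python
import re

# Assumption: a valid word is two or more lowercase ASCII letters and nothing else.
# This covers the uppercase, 1 character, number, and "other characters" rules;
# Roman numerals such as XVII fail because they are uppercase.
VALID_WORD = re.compile(r"[a-z]{2,}")


def is_valid_word(word):
    """Hypothetical helper: True if the word passes the non-profanity rules."""
    return VALID_WORD.fullmatch(word) is not None


# Illustrative candidates only, not taken from the real data files.
candidates = ["article", "XVII", "a", "Paris", "9mm", "o'clock", "unknown_region"]
kept = [w for w in candidates if is_valid_word(w)]
print(kept)  # ['article']
```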
## Reproducing results
If you wish to recreate `words.db` from scratch, you can:
1. Download `WNdb-3.0.tar.gz` from [WordNet](https://wordnet.princeton.edu/download/current-version).
2. Extract it, and place the `data.x` files in `/wordnet-data/`.
3. Run `py wordnet-to-sqlite.py`.

The raw data looks like this ("unknown" is the only valid noun to extract):
```
08632096 15 n 03 unknown 0 unknown_region 0 terra_incognita 0 001 @ 08630985 n 0000 | an unknown and unexplored region; "they came like angels out the unknown"
```

This script takes 10-15 seconds on an average laptop. Efficiency is not a priority (profanity removal takes the majority of the time), as the output database only needs to be generated once.
Notes on WordNet's data files [are here](https://wordnet.princeton.edu/documentation/wndb5wn). This repo just does a "dumb" parse, then filters out numerical data.
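To make the "dumb" parse concrete, here is a hedged sketch of pulling the words, type, and definition out of the raw line shown above. It follows the field order in WordNet's data file notes (the fourth field is a hexadecimal word count, and each word is followed by a `lex_id`). The `parse_data_line` function is hypothetical rather than copied from `wordnet-to-sqlite.py`, and two choices are assumptions: mapping the `s` (adjective satellite) code to `adjective`, and cutting the gloss at the first quoted usage example.

```python
# Map WordNet ss_type codes to the type names used in words.db.
# Assumption: "s" (adjective satellite) is treated as a plain adjective.
SS_TYPES = {"n": "noun", "v": "verb", "a": "adjective", "s": "adjective", "r": "adverb"}


def parse_data_line(line):
    """Hypothetical helper: return (words, type, definition) for one data.x line."""
    head, _, gloss = line.partition(" | ")
    fields = head.split()
    word_type = SS_TYPES[fields[2]]
    word_count = int(fields[3], 16)  # w_cnt is a two-digit hexadecimal number
    # Each word is followed by its lex_id, so take every second field.
    words = fields[4:4 + 2 * word_count:2]
    # Assumption: drop the quoted usage examples that follow the definition.
    definition = gloss.split('; "')[0].strip()
    return words, word_type, definition


SAMPLE = (
    "08632096 15 n 03 unknown 0 unknown_region 0 terra_incognita 0 "
    "001 @ 08630985 n 0000 | an unknown and unexplored region; "
    '"they came like angels out the unknown"'
)

words, word_type, definition = parse_data_line(SAMPLE)
print(words)  # ['unknown', 'unknown_region', 'terra_incognita']
print(word_type, "|", definition)  # noun | an unknown and unexplored region
```

Of the three words, `unknown_region` and `terra_incognita` contain underscores (WordNet's stand-in for spaces), which is why only `unknown` survives the word rules. The licence header at the top of each real `data.x` file is not handled here.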