Ecosyste.ms: Awesome

An open API service indexing awesome lists of open source software.

Awesome Lists | Featured Topics | Projects

https://github.com/jakesteam/wordnettosqlite

Python script for converting WordNet into a SQLite database with types and definitions
https://github.com/jakesteam/wordnettosqlite

data dictionary python sqlite wordnet

Last synced: 4 days ago
JSON representation

Python script for converting WordNet into a SQLite database with types and definitions

Awesome Lists containing this project

README

        

# WordNet to SQLite

This repo provides a Python script to convert WordNet's word data (`/wordnet-data`) into a SQLite database (`words.db`) with 70,433 unique combinations of word & type, with structure:

```
words (word TEXT, type TEXT, definitions TEXT)
```

The intended purpose is for a word game, so non-words, proper nouns, and profanity have been removed where possible.

## Sample contents

Filtering `word` to `article`, alphabetical order:

| word | type | definitions |
| :------------ | :-------- | :----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
| antiparticle | noun | a particle that has the same mass as another particle but has opposite values for its other properties |
| article | noun | one of a class of artifacts#nonfictional prose forming an independent part of a publication#(grammar) a determiner that may indicate the specificity of reference of a noun phrase#a separate section of a legal document (as a statute or contract or will) |
| article | verb | bind by a contract |
| articled | adjective | bound by contract |
| particle | noun | a function word that can be used in English to form phrasal verbs#a body having finite mass and internal structure but negligible dimensions#(nontechnical usage) a tiny piece of anything |
| quasiparticle | noun | a quantum of energy (in a crystal lattice or other system) that has position and momentum and can in some respects be regarded as a particle |

## Notes on contents

Word definitions for the same `type` are combined (e.g. with the noun `article`, but not the verb) with a `#` between definitions.

- `word`:
- Any words with uppercase letters (e.g. proper nouns) are removed.
- Any 1 character words are removed.
- Any words with numbers are removed.
- Any words with other characters (apostrophes, spaces) are removed.
- Most profane words (133) are removed.
- Roman numerals are removed (e.g. `XVII`).
- `type`:
- Always `adjective` / `adverb` / `noun` / `verb`.
- `definition`:
- Definition of the word, will contain multiple separated by `#` if the word appears as a synonym for another word.
- Most profane definitions (385) are replaced with empty space.
- May contain bracketed usage information, e.g. `(dated)`.
- May contain special characters like `'`, `$`, `!`, `<`, `[`, etc.

Profanity removal (90% of the processing time) is performed using [better_profanity 0.6.1](https://github.com/snguyenthanh/better_profanity) (with a whitelist for the word "horny", only used in a lizard context). This isn't perfect for biological words, but works quite well on the higher priority slurs. A full list of removed words and definitions is available in [removed-data.txt](/notes/removed-data.txt).

## Reproducing results

If you wish to recreate `words.db` from scratch, you can:

1. Download `WNdb-3.0.tar.gz` from [WordNet](https://wordnet.princeton.edu/download/current-version).
2. Extract it, and place the `data.x` files in `/wordnet-data/`.
3. Run `py wordnet-to-sqlite.py`.

The raw data looks like this ("unknown" is the only valid noun to extract):

```
08632096 15 n 03 unknown 0 unknown_region 0 terra_incognita 0 001 @ 08630985 n 0000 | an unknown and unexplored region; "they came like angels out the unknown"
```

This script takes 10-15 seconds on an average laptop. Efficiency is not a priority (with profanity removal taking the majority of the time), as the output database only needs generating once ever.

Notes on WordNet's data files [are here](https://wordnet.princeton.edu/documentation/wndb5wn), this repo just does a "dumb" parse then filters out numerical data.