https://github.com/andrianllmm/eng-dictionary-parser
Parser for an English XML dictionary
dictionary english parser text-mining xml-parser
- Host: GitHub
- URL: https://github.com/andrianllmm/eng-dictionary-parser
- Owner: andrianllmm
- License: apache-2.0
- Created: 2024-08-09T10:29:54.000Z (almost 2 years ago)
- Default Branch: main
- Last Pushed: 2025-06-22T18:32:41.000Z (about 1 year ago)
- Last Synced: 2025-07-09T12:06:51.697Z (12 months ago)
- Topics: dictionary, english, parser, text-mining, xml-parser
- Language: Python
- Homepage:
- Size: 23.1 MB
- Stars: 1
- Watchers: 1
- Forks: 0
- Open Issues: 0
Metadata Files:
- Readme: README.md
- License: LICENSE
# English Dictionary Parser
**A Python script that parses an English dictionary in XML format and converts
it into several useful formats**
## About
This script parses the [GCIDE](https://www.ibiblio.org/webster/) dictionary,
which is distributed in XML format, and writes three outputs: a
[JSON dictionary](output/eng_dictionary.json), a
[frequency list](output/eng_freqlist.csv), and a
[word list](output/eng_wordlist.txt).
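As a rough illustration of the XML-to-JSON step, here is a minimal sketch using only the standard library. The element names (`dictionary`, `entry`, `word`, `pos`, `def`), the sample entries, and the `parse_entries` helper are all hypothetical placeholders; the real GCIDE markup uses its own tag set, and this is not the repository's actual implementation.

```python
import json
import xml.etree.ElementTree as ET

# Hypothetical, simplified markup for illustration only.
# The real GCIDE XML files use a different and richer tag set.
SAMPLE_XML = """
<dictionary>
  <entry>
    <word>run</word>
    <pos>v.</pos>
    <def>To move quickly on foot.</def>
  </entry>
  <entry>
    <word>apple</word>
    <pos>n.</pos>
    <def>A round fruit.</def>
  </entry>
</dictionary>
"""


def parse_entries(xml_text):
    """Turn each <entry> element into a word/attributes record."""
    root = ET.fromstring(xml_text.strip())
    entries = []
    for entry in root.iter("entry"):
        entries.append({
            "word": entry.findtext("word", default="").strip(),
            "attributes": [{
                "pos": entry.findtext("pos", default="").strip(),
                "definition": entry.findtext("def", default="").strip(),
            }],
        })
    # Sort alphabetically by word, ignoring case.
    return sorted(entries, key=lambda e: e["word"].lower())


entries = parse_entries(SAMPLE_XML)
print(json.dumps(entries, indent=2))
```

Only two of the attributes are filled in here; the remaining ones would be extracted the same way once the corresponding source tags are known.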
## Output
> 103,396 words collected (as
> of 08/09/2024)
| Resource | Format | Link |
| -------------- | ------ | -------------------------------------------------------- |
| Dictionary | json | [output/eng_dictionary.json](output/eng_dictionary.json) |
| Frequency list | csv | [output/eng_freqlist.csv](output/eng_freqlist.csv) |
| Word list | txt | [output/eng_wordlist.txt](output/eng_wordlist.txt) |
### JSON Dictionary
The JSON dictionary is a list of entries. Each entry pairs a word with a list
of attribute sets, one per sense. Each attribute set holds the part of speech,
definition, etymology, classification, synonyms, antonyms, example sentences,
inflections, and sources. Entries are sorted alphabetically.
```json
[
  {
    "word": "The word itself",
    "attributes": [
      {
        "pos": "Simplified part of speech",
        "definition": "The definition",
        "origin": "The etymology",
        "classification": "Any classification",
        "similar": ["List of synonyms"],
        "opposite": ["List of antonyms"],
        "examples": ["List of example sentences that use the word"],
        "inflections": ["List of inflected forms"],
        "sources": ["List of sources"]
      }
    ]
  }
]
```
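The structure above lends itself to a simple word lookup. The following is a minimal sketch, assuming the keys shown in the schema; the sample entry and the `build_index` helper are made up for illustration and are not part of the repository.

```python
import json

# Made-up sample entry that follows the schema shown above.
SAMPLE_JSON = """
[
  {
    "word": "apple",
    "attributes": [
      {
        "pos": "noun",
        "definition": "A round fruit.",
        "origin": "",
        "classification": "",
        "similar": [],
        "opposite": [],
        "examples": ["She ate an apple."],
        "inflections": ["apples"],
        "sources": []
      }
    ]
  }
]
"""


def build_index(entries):
    """Map each word to its list of attribute sets for constant-time lookup."""
    index = {}
    for entry in entries:
        # setdefault/extend tolerates a word that appears in more than one entry.
        index.setdefault(entry["word"], []).extend(entry["attributes"])
    return index


entries = json.loads(SAMPLE_JSON)
index = build_index(entries)

for attrs in index.get("apple", []):
    print(attrs["pos"], "-", attrs["definition"])
```

To work with the real data, replace `json.loads(SAMPLE_JSON)` with `json.load(f)` on `output/eng_dictionary.json` opened with `encoding="utf-8"`.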
### Frequency list
The frequency list pairs each word with a frequency value derived from the
[Leipzig Corpora Collection Dataset (2021 Wikipedia 100k corpus)](https://wortschatz.uni-leipzig.de/en/download/English).
Each row has the form `word,frequency`, and rows are sorted from highest to
lowest frequency.
```csv
the,101717
of,43438
in,37524
```
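Reading the frequency list could look like the sketch below. It assumes two columns per row and no header row, as in the sample above; the `load_frequencies` helper is hypothetical.

```python
import csv
import io

# The three sample rows shown above.
SAMPLE_CSV = "the,101717\nof,43438\nin,37524\n"


def load_frequencies(stream):
    """Read word,frequency rows into a dict, preserving file order."""
    freqs = {}
    for row in csv.reader(stream):
        if len(row) != 2:
            continue  # skip blank or malformed rows
        word, count = row
        freqs[word] = int(count)
    return freqs


freqs = load_frequencies(io.StringIO(SAMPLE_CSV))

# Relative frequency of a word within the loaded rows.
total = sum(freqs.values())
print("the", freqs["the"] / total)
```

For the full file, pass `open("output/eng_freqlist.csv", newline="", encoding="utf-8")` instead of the `io.StringIO` sample.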
### Word list
The word list is simply the list of words sorted alphabetically.
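A word list like this is handy for membership checks such as spell checking. The sketch below assumes one word per line, which the README does not state explicitly; the sample words and the `load_words` helper are illustrative only.

```python
import io

# Made-up sample; assumes the file stores one word per line.
SAMPLE_WORDLIST = "aardvark\napple\nzebra\n"


def load_words(stream):
    """Collect non-empty lines into a set for fast membership tests."""
    return {line.strip() for line in stream if line.strip()}


words = load_words(io.StringIO(SAMPLE_WORDLIST))

for candidate in ("apple", "aple"):
    status = "known" if candidate in words else "unknown"
    print(candidate, status)
```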