Ecosyste.ms: Awesome

An open API service indexing awesome lists of open source software.
Awesome Lists | Featured Topics | Projects
https://github.com/jqhoogland/wiktionary-api

Python client to read & parse wiktionary entries & their semantic relations.
https://github.com/jqhoogland/wiktionary-api
Last synced: 10 days ago
JSON representation
Python client to read & parse wiktionary entries & their semantic relations.
Host: GitHub
URL: https://github.com/jqhoogland/wiktionary-api
Owner: jqhoogland
License: gpl-3.0
Created: 2022-07-10T20:06:38.000Z (over 2 years ago)
Default Branch: main
Last Pushed: 2022-07-10T20:08:56.000Z (over 2 years ago)
Last Synced: 2024-11-19T18:55:05.264Z (2 months ago)
Language: Python
Size: 47.9 KB
Stars: 1
Watchers: 2
Forks: 0
Open Issues: 0
Metadata Files:
- Readme: README.md
- License: LICENSE
Awesome Lists containing this project

README

        # Wiktionary API 

> This is a **non-official** Wiktionary API client written in Python. 

> It is very non-stable. Expect many breaking changes.

I'm actually focusing on building out a Typescript client instead, as part of [Open Dictionary](https://github.com/jqhoogland/open-dictionary/).

If you want me to put more work in this, let me know. For now, it has fulfilled its role as sandbox.

## ❓ How it works

Wiktionary articles contain lots of internal structure in the form of [templates](https://meta.wikimedia.org/wiki/Help:Template).

With a bit of manual pruning, we convert these templates into Wiktionary-agnostic ["semantic triples"](https://en.wikipedia.org/wiki/SemanticTriple). (These use a new [ontology for natural language](/ontology)).

(Note: they're not perfect triples yet.)

Then, with [SPARQL](https://en.wikipedia.org/wiki/SPARQL) and tools built on top of it ([GraphQL-LD](https://github.com/rubensworks/GraphQL-LD.js)),

we can query the data consistently.

In actual fact, there's a bit more going on: we look at context (so section headings & also crawl standard wikilinks and non-templated information).

## The Problem

There is an actual [Wiktionary API](https://en.wiktionary.org/w/api.php).

Unfortunately, it only returns pages in `html` or `wikitext` (Wikipedia's internal markup language), which is not the friendliest for computers to read. 

Fortunately, `wikitext` has a decent bit of internal structure and is ~~easy~~ not impossible to work with.

This library provides a client that uses [wikitextparser](https://github.com/5j9/wikitextparser) (plus some custom logic) to convert wikitext into 

straightforward `json` objects (read: Python dictionaries) that you can use at your leisure.

It's much more comprehensive and accurate than [WiktionaryParser](https://github.com/Suyash458/WiktionaryParser) since it works with the raw wikitext rather than the generated html.

## One Wiktionary to rule them all.

There's one problem: Wiktionary is not one thing. 

No [there are 183 different Wiktionaries in 183 different languages](https://meta.wikimedia.org/wiki/Wiktionary#ListOfWiktionaries). 

Pretty much every single Wiktionary has its own, non-interoperable standards. So you have to have to build unique parsing logic for each Wiktionary.

(These are identified by the an [ISO 639 code](https://www.iso.org/iso-639-language-codes.html) subdomain (as in `en.wiktionary.com`))

So... we're starting with the English-language Wiktionary. Depending on how that goes, we might move on to additional wiktionaries.

# Structure

The current return structure is as follows:

```ts

{

  word: string,

  lang: string, // ISO language code

  altForms: LinkedWord[],

  etymology: LinkedWord[],

  pronunciations: (Pronunciation | Qualifier)[][],

  definitions: LinkedWord   

}[]

```

There is an entry for each 

# Example

Using the CLI to retrieve the English entries for "foo"

```sh

python wiktionary/cli.py foo en

```

Returns the following object, a list of entries (one for each unique etymology):

```json

[

  {

    "word": "foo",

    "lang": "en",

    "altForms": [],

    "etymology": [

      {

        "@id": "derived",

        "lang": "en",

        "srcLang": "cmn",

        "src": "府",

        "transliteration": "fǔ"

      },

      {

        "@id": "label",

        "lang": "en",

        "2": "historical",

        "3": "obsolete"

      }

    ],

    "pronunciations": [

      [

        {

          "@id": "dialect",

          "dialects": [

            "UK"

          ]

        },

        {

          "@id": "ipa",

          "lang": "en",

          "pronunciations": [

            {

              "ipa": "/fuː/"

            }

          ]

        }

      ],

      [

        {

          "@id": "En-au-foo.ogg",

          "lang": "en",

          "url": "Audio (AU)"

        }

      ],

      [

        {

          "@id": "rhymes",

          "lang": "en",

          "s": "1",

          "rhymes": [

            {

              "rhyme": "uː"

            }

          ]

        }

      ],

      [

        {

          "@id": "homophones",

          "lang": "en",

          "rhymes": [

            {

              "homophone": "-fu"

            }

          ]

        }

      ]

    ],

    "glyphOrigin": null,

    "description": null,

    "definitions": [

      {

        "@id": "references",

        "linked": [],

        "category": null,

        "data": "\n* [[rfc:3092]], ''Etymology of \"Foo\"'', {{w|Internet Engineering Task Force}} (IETF)\n\n"

      },

      {

        "@id": "anagrams",

        "linked": [

          {

            "@id": "anagram",

            "lang": "en",

            "alphagram": "foo",

            "anagrams": [

              "oof"

            ]

          }

        ]

      },

      {

        "@id": "noun",

        "data": []

      }

    ]

  },

  {

    "word": "foo",

    "lang": "en",

    "altForms": [],

    "etymology": [

      {

        "@id": "derived",

        "lang": "en",

        "srcLang": "zh",

        "src": "福",

        "alt": "",

        "gloss": "[[fortunate]]; [[prosperity]], [[good]] [[luck]]",

        "transliteration": "fú"

      },

      {

        "@id": "mention",

        "lang": "zh",

        "src": "福星",

        "alt": "",

        "gloss": "[[Jupiter]]",

        "transliteration": "Fúxīng"

      },

      {

        "@id": "mention",

        "lang": "en",

        "src": "om mani padme hum"

      },

      {

        "@id": "mention",

        "lang": "en",

        "src": "FUBAR"

      },

      {

        "@id": "label",

        "lang": "en",

        "2": "programming"

      },

      {

        "@id": "label",

        "lang": "en",

        "2": "fandom slang"

      },

      {

        "@id": "link",

        "lang": "en",

        "src": "foobar"

      },

      {

        "@id": "link",

        "lang": "en",

        "src": "FUBAR"

      }

    ],

    "pronunciations": [

      [

        {

          "@id": "dialect",

          "dialects": [

            "UK"

          ]

        },

        {

          "@id": "ipa",

          "lang": "en",

          "pronunciations": [

            {

              "ipa": "/fuː/"

            }

          ]

        }

      ],

      [

        {

          "@id": "En-au-foo.ogg",

          "lang": "en",

          "url": "Audio (AU)"

        }

      ],

      [

        {

          "@id": "rhymes",

          "lang": "en",

          "s": "1",

          "rhymes": [

            {

              "rhyme": "uː"

            }

          ]

        }

      ],

      [

        {

          "@id": "homophones",

          "lang": "en",

          "rhymes": [

            {

              "homophone": "-fu"

            }

          ]

        }

      ]

    ],

    "glyphOrigin": null,

    "description": null,

    "definitions": [

      {

        "@id": "references",

        "linked": [],

        "category": null,

        "data": "\n* [[rfc:3092]], ''Etymology of \"Foo\"'', {{w|Internet Engineering Task Force}} (IETF)\n\n"

      },

      {

        "@id": "anagrams",

        "linked": [

          {

            "@id": "anagram",

            "lang": "en",

            "alphagram": "foo",

            "anagrams": [

              "oof"

            ]

          }

        ]

      },

      {

        "@id": "noun",

        "data": [

          {

            "@id": "derivedTerms",

            "linked": [

              {

                "@id": "link",

                "lang": "en",

                "src": "foobar"

              }

            ]

          },

          {

            "@id": "relatedTerms",

            "linked": [

              {

                "@id": "link",

                "lang": "en",

                "src": "FUBAR"

              }

            ]

          }

        ]

      },

      {

        "@id": "derivedTerms",

        "linked": [

          {

            "@id": "link",

            "lang": "en",

            "src": "foobar"

          }

        ]

      },

      {

        "@id": "relatedTerms",

        "linked": [

          {

            "@id": "link",

            "lang": "en",

            "src": "FUBAR"

          }

        ]

      }

    ]

  },

  {

    "word": "foo",

    "lang": "en",

    "altForms": [],

    "etymology": [

      {

        "@id": "mention",

        "lang": "en",

        "src": "fuck"

      },

      {

        "@id": "sense",

        "sense": "expression of disgust"

      },

      {

        "@id": "link",

        "lang": "en",

        "src": "darn"

      },

      {

        "@id": "link",

        "lang": "en",

        "src": "drat"

      }

    ],

    "pronunciations": [

      [

        {

          "@id": "dialect",

          "dialects": [

            "UK"

          ]

        },

        {

          "@id": "ipa",

          "lang": "en",

          "pronunciations": [

            {

              "ipa": "/fuː/"

            }

          ]

        }

      ],

      [

        {

          "@id": "En-au-foo.ogg",

          "lang": "en",

          "url": "Audio (AU)"

        }

      ],

      [

        {

          "@id": "rhymes",

          "lang": "en",

          "s": "1",

          "rhymes": [

            {

              "rhyme": "uː"

            }

          ]

        }

      ],

      [

        {

          "@id": "homophones",

          "lang": "en",

          "rhymes": [

            {

              "homophone": "-fu"

            }

          ]

        }

      ]

    ],

    "glyphOrigin": null,

    "description": null,

    "definitions": [

      {

        "@id": "references",

        "linked": [],

        "category": null,

        "data": "\n* [[rfc:3092]], ''Etymology of \"Foo\"'', {{w|Internet Engineering Task Force}} (IETF)\n\n"

      },

      {

        "@id": "anagrams",

        "linked": [

          {

            "@id": "anagram",

            "lang": "en",

            "alphagram": "foo",

            "anagrams": [

              "oof"

            ]

          }

        ]

      },

      {

        "@id": "interjection",

        "data": [

          {

            "@id": "synonyms",

            "linked": [

              {

                "@id": "sense",

                "sense": "expression of disgust"

              },

              {

                "@id": "link",

                "lang": "en",

                "src": "darn"

              },

              {

                "@id": "link",

                "lang": "en",

                "src": "drat"

              }

            ]

          }

        ]

      },

      {

        "@id": "synonyms",

        "linked": [

          {

            "@id": "sense",

            "sense": "expression of disgust"

          },

          {

            "@id": "link",

            "lang": "en",

            "src": "darn"

          },

          {

            "@id": "link",

            "lang": "en",

            "src": "drat"

          }

        ]

      }

    ]

  },

  {

    "word": "foo",

    "lang": "en",

    "altForms": [

      {

        "word": {

          "@id": "link",

          "lang": "en",

          "src": "foo'"

        },

        "qualifiers": []

      }

    ],

    "etymology": [

      {

        "@id": "link",

        "lang": "en",

        "src": "foo'"

      },

      {

        "@id": "label",

        "lang": "en",

        "2": "slang"

      }

    ],

    "pronunciations": [

      [

        {

          "@id": "dialect",

          "dialects": [

            "UK"

          ]

        },

        {

          "@id": "ipa",

          "lang": "en",

          "pronunciations": [

            {

              "ipa": "/fuː/"

            }

          ]

        }

      ],

      [

        {

          "@id": "En-au-foo.ogg",

          "lang": "en",

          "url": "Audio (AU)"

        }

      ],

      [

        {

          "@id": "rhymes",

          "lang": "en",

          "s": "1",

          "rhymes": [

            {

              "rhyme": "uː"

            }

          ]

        }

      ],

      [

        {

          "@id": "homophones",

          "lang": "en",

          "rhymes": [

            {

              "homophone": "-fu"

            }

          ]

        }

      ]

    ],

    "glyphOrigin": null,

    "description": null,

    "definitions": [

      {

        "@id": "references",

        "linked": [],

        "category": null,

        "data": "\n* [[rfc:3092]], ''Etymology of \"Foo\"'', {{w|Internet Engineering Task Force}} (IETF)\n\n"

      },

      {

        "@id": "anagrams",

        "linked": [

          {

            "@id": "anagram",

            "lang": "en",

            "alphagram": "foo",

            "anagrams": [

              "oof"

            ]

          }

        ]

      },

      {

        "@id": "noun",

        "data": []

      }

    ]

  }

]

```