{"id":17464184,"url":"https://github.com/onlyphantom/elang","last_synced_at":"2025-04-15T14:57:51.756Z","repository":{"id":45205685,"uuid":"239681057","full_name":"onlyphantom/elang","owner":"onlyphantom","description":"Word Embedding utilities for Language Models (English \u0026 Indonesian)","archived":false,"fork":false,"pushed_at":"2021-02-18T11:40:05.000Z","size":15036,"stargazers_count":39,"open_issues_count":0,"forks_count":36,"subscribers_count":5,"default_branch":"master","last_synced_at":"2025-04-15T14:57:45.239Z","etag":null,"topics":[],"latest_commit_sha":null,"homepage":"https://elang.readthedocs.io/en/latest/","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"cc0-1.0","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/onlyphantom.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null}},"created_at":"2020-02-11T05:07:40.000Z","updated_at":"2024-03-18T07:19:33.000Z","dependencies_parsed_at":"2022-09-02T10:11:28.542Z","dependency_job_id":null,"html_url":"https://github.com/onlyphantom/elang","commit_stats":null,"previous_names":[],"tags_count":0,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/onlyphantom%2Felang","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/onlyphantom%2Felang/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/onlyphantom%2Felang/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/onlyphantom%2Felang/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/onlyphantom","download_url":"https://codeload.github.com/onlyphantom/elang/tar.gz/refs/heads/master","host":{"name":"GitHub
","url":"https://github.com","kind":"github","repositories_count":249094938,"owners_count":21211836,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":[],"created_at":"2024-10-18T10:44:55.283Z","updated_at":"2025-04-15T14:57:51.736Z","avatar_url":"https://github.com/onlyphantom.png","language":"Python","readme":"# Word Embedding utilities for Language Models\n[![PyPI version](https://img.shields.io/pypi/v/elang?color=green)](https://badge.fury.io/py/elang) [![PyPI license](https://img.shields.io/pypi/l/Elang?color=red)](https://pypi.python.org/pypi/elang/) [![Activity](https://img.shields.io/github/commit-activity/m/onlyphantom/elang)](https://github.com/onlyphantom/elang) [![maintained](https://img.shields.io/maintenance/yes/2020)](https://github.com/onlyphantom/elang/graphs/commit-activity) [![PyPI format](https://img.shields.io/pypi/format/elang)](https://pypi.org/project/elang/) [![pypi downloads](https://img.shields.io/pypi/dm/elang)](https://pypi.org/project/elang/) [![Documentation Status](https://readthedocs.org/projects/elang/badge/?version=latest)](https://elang.readthedocs.io/en/latest/?badge=latest)\n\n\nElang is an acronym that combines the phrases **Embedding (E)** and **Language (Lang) Models**. Its goal is to help NLP (natural language processing) researchers, Word2Vec practitioners, educators and data scientists be more productive in training language models and explaining key concepts in word embeddings. 
\n\nKey features as of the 0.1 release can be grouped as follows:\n\n- **Corpus-building utility**\n    - [x] `build_from_wikipedia_random`: Build English / Indonesian corpus using random articles from Wikipedia\n    - [x] `build_from_wikipedia_branch`: Build English / Indonesian corpus by building a \"topic branch\" off Wikipedia\n\n- **Text processing utility**\n    - [x] `remove_stopwords_id`: Remove stopwords (Indonesian)\n    - [x] `remove_region_id`: Remove region entities (Indonesian)\n    - [x] `remove_calendar_id`: Remove calendar words (Indonesian)\n    - [x] `remove_vulgarity_id`: Remove vulgarity (Indonesian)\n\n- **Embedding Visualization Utility** (see illustration below)\n    - [x] `plot2d`: 2D plot with emphasis on words of interest\n    - [x] `plotNeighbours`: 2D plot with neighbors of words\n\n\u003cimg align=\"left\" width=\"35%\" src=\"https://github.com/onlyphantom/elangdev/blob/master/assets/elang_light.png?raw=true\" style=\"margin-right:10%\"\u003e\n\n## Elang\nElang also means \"eagle\" in Bahasa Indonesia, and the _elang Jawa_ (Javan hawk-eagle) is the national bird of Indonesia, more commonly referred to as Garuda. \n\nThe package provides a collection of utility functions and tools that interface with `gensim`, `matplotlib` and `scikit-learn`, as well as curated negative lists for Bahasa Indonesia (kata kasar / vulgar words, _stopwords_, etc.) and useful preprocessing functions. 
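The text-processing utilities listed above all follow the same pattern: take a string, strip a curated word list out of it, and return the cleaned text. As a minimal sketch of what `remove_stopwords_id` does (the three-word stopword list and the function signature here are illustrative only; the real function ships elang's own curated Indonesian list):

```python
# Sketch of Indonesian stopword removal, in the spirit of remove_stopwords_id.
# STOPWORDS_ID below is a toy list for illustration; elang bundles a much
# larger curated one.
STOPWORDS_ID = {"yang", "dan", "di"}

def remove_stopwords_id(text):
    """Return `text` with Indonesian stopwords removed."""
    return " ".join(w for w in text.split() if w.lower() not in STOPWORDS_ID)

print(remove_stopwords_id("elang adalah burung yang hidup di hutan"))
# elang adalah burung hidup hutan
```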
It abstracts away the mundane tasks so you can train your Word2Vec model faster, and obtain visual feedback on your model more quickly.\n\n# Quick Demo\n\n### 2-d Word Embedding Visualization\nInstall the latest version of `elang`:\n```bash\npip install --upgrade elang\n```\n\nVisualizing your word embeddings takes just **2 lines of code**:\n```py\nfrom elang.plot.utils import plot2d\nfrom gensim.models import Word2Vec\n\nmodel = Word2Vec.load(\"path.to.model\")\nplot2d(model)\n# output:\n```\n\n\u003cimg width=\"60%\" src=\"https://github.com/onlyphantom/elangdev/raw/master/assets/embedding.png\"\u003e\n\nIt even looks like a soaring eagle with its outstretched wings!\n\n### Visualizing Neighbors in 2-dimensional space\n\n`elang` also includes visualization methods to help you visualize a user-defined number (_k_) of nearest neighbors for each word. When `draggable` is set to `True`, you will obtain a legend that you can move around in the resulting plot.\n\n```py\nfrom elang.plot.utils import plotNeighbours\n\nwords = ['bca', 'hitam', 'hutan', 'pisang', 'mobil', 'cinta', 'pejabat', 'android', 'kompas']\n\nplotNeighbours(model,\n    words,\n    method=\"TSNE\",\n    k=15,\n    draggable=True)\n```\n\n\u003cimg width=\"60%\" src=\"https://github.com/onlyphantom/elangdev/raw/master/assets/neighbors.png\"\u003e\n\n\nThe plot above shows the 15 nearest neighbors for each word in the supplied `words` argument, rendered with a draggable legend.\n\n### Scikit-Learn Compatibility\nBecause the dimensionality reduction procedure is handled by the underlying `sklearn` code, you can use any of the valid [parameters](https://scikit-learn.org/stable/modules/generated/sklearn.manifold.TSNE.html) in the function call to `plot2d` and `plotNeighbours` and they will be handed off to the underlying method. 
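To make that hand-off concrete, here is roughly what happens before anything is drawn (a sketch, not elang's actual code): the model's word vectors are reduced to 2 dimensions by the chosen `sklearn` method, and the resulting coordinates are scattered with `matplotlib`. The numpy-only PCA below stands in for the sklearn estimator:

```python
import numpy as np

# Sketch of the reduction step behind plot2d (not elang's actual code):
# word vectors -> 2-D coordinates, here via a plain top-2-component PCA
# instead of the sklearn estimator that elang delegates to.
rng = np.random.default_rng(0)
vectors = rng.normal(size=(50, 100))   # 50 "words", 100-dim embeddings

centered = vectors - vectors.mean(axis=0)
# Right singular vectors of the centered matrix are the principal axes.
_, _, vt = np.linalg.svd(centered, full_matrices=False)
coords = centered @ vt[:2].T           # project onto the top-2 axes

print(coords.shape)                    # (50, 2) -> one (x, y) per word
```

`plot2d` would then scatter these `(x, y)` pairs, labelling the words of interest.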
Common examples are the `perplexity`, `n_iter` and `random_state` parameters:\n\n```py\nmodel = Word2Vec.load(\"path.to.model\")\nbca = model.wv.most_similar(\"bca\", topn=14)\nsimilar_bca = [w[0] for w in bca]\nplot2d(\n    model,\n    method=\"TSNE\",\n    targets=similar_bca,\n    perplexity=20,\n    early_exaggeration=50,\n    n_iter=2000,\n    random_state=0,\n)\n```\n\nOutput:\n\n\u003cimg width=\"60%\" src=\"https://github.com/onlyphantom/elangdev/raw/master/assets/tsne.png\"\u003e\n\n### Building a Word2Vec model from Wikipedia\nInstall the `requests` package (`pip install requests`) to use the builder functions below:\n\n```py\nfrom elang.word2vec.builder import build_from_wikipedia\n# a convenient wrapper to build_from_wikipedia_random or build_from_wikipedia_branch\nmodel1 = build_from_wikipedia(n=3, lang=\"id\")\nmodel2 = build_from_wikipedia(slug=\"Koronavirus\", lang=\"id\", levels=2)\nprint(model1)\n# returns: Word2Vec(vocab=190, size=100, alpha=0.025)\n```\n\nThe code above constructs two Word2Vec models, `model1` and `model2`. For `model1`, the function builds a corpus from 3 (`n`) random articles on id.wikipedia.org (`lang=\"id\"`). The corpus can optionally be saved by passing the `save=True` argument to the function call. \n\nFor `model2`, the function starts off by looking at the article `https://id.wikipedia.org/wiki/Koronavirus` (determined by `lang` and `slug`), then finds all related articles (level 1), and subsequently all articles related to those related articles (level 2). 
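The "topic branch" walk just described is a breadth-first expansion over article links, which can be sketched as follows (a toy in-memory link graph stands in for live Wikipedia requests; the function and variable names are illustrative, not elang's internals):

```python
# Breadth-first sketch of the "topic branch" walk behind
# build_from_wikipedia_branch. LINKS is a toy stand-in for the links
# found on each live Wikipedia page.
LINKS = {
    "Koronavirus": ["Virus", "Influenza"],
    "Virus": ["DNA"],
    "Influenza": ["Vaksin"],
}

def collect_branch(slug, levels):
    """Return all article slugs reachable within `levels` hops of `slug`."""
    seen, frontier = {slug}, [slug]
    for _ in range(levels):
        frontier = [link for page in frontier
                    for link in LINKS.get(page, []) if link not in seen]
        seen.update(frontier)
    return seen

print(sorted(collect_branch("Koronavirus", levels=2)))
# ['DNA', 'Influenza', 'Koronavirus', 'Vaksin', 'Virus']
```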
A corpus is built using all articles it finds along this search branch (`levels`).\n\nYou can now use the other utility functions with the `model1` object you created above:\n\n```python\nfrom elang.plot.utils import plot2d\nplot2d(model1)\n```\n\nOr find the most similar words to any word in your Word2Vec model:\n```python\nmodel1.wv.most_similar(\"koronavirus\")\n# returns:\n[('subtipe', 0.9947343468666077), ('mers', 0.9941919445991516), ('influenza', 0.9937061667442322), ('flu', 0.993574857711792), ('galur', 0.9933352470397949), ('hanta', 0.9925214052200317), ('sindrom', 0.992496907711029), ('hku', 0.9921219944953918), ('herpes', 0.9921203851699829), ('adenovirus', 0.9920581579208374)]\n```\n\n#### Building a Corpus from Wikipedia (without Word2Vec model)\n\nIf you would like to build a corpus, but not have the function _return_ a Word2Vec model, simply pass `model=False` and `save=True`. The `save` argument will create a `/corpus` directory and save the corpus in a `.txt` file. \n\n```py\nbuild_from_wikipedia(n=10, lang=\"en\", save=True)\n```\n\nThe function call above will create a corpus from the English version of Wikipedia and save it to the following file in your working directory: `corpus/wikipedia_random_10_en.txt`\n","funding_links":[],"categories":[],"sub_categories":[],"project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fonlyphantom%2Felang","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fonlyphantom%2Felang","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fonlyphantom%2Felang/lists"}