{"id":19118875,"url":"https://github.com/akhvorov/vgram","last_synced_at":"2025-06-28T01:37:30.554Z","repository":{"id":62587352,"uuid":"138793942","full_name":"akhvorov/vgram","owner":"akhvorov","description":"Feature extraction from sequential data","archived":false,"fork":false,"pushed_at":"2019-07-04T10:51:18.000Z","size":558,"stargazers_count":7,"open_issues_count":4,"forks_count":0,"subscribers_count":2,"default_branch":"master","last_synced_at":"2025-04-15T19:47:45.439Z","etag":null,"topics":["byte-pair-encoding","feature-extraction","natural-language-processing","sequential-data","text-classification","vgram","word-segmentation"],"latest_commit_sha":null,"homepage":"","language":"C++","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/akhvorov.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null}},"created_at":"2018-06-26T21:12:03.000Z","updated_at":"2019-11-29T13:44:30.000Z","dependencies_parsed_at":"2022-11-03T22:09:50.065Z","dependency_job_id":null,"html_url":"https://github.com/akhvorov/vgram","commit_stats":null,"previous_names":[],"tags_count":0,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/akhvorov%2Fvgram","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/akhvorov%2Fvgram/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/akhvorov%2Fvgram/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/akhvorov%2Fvgram/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/akhvorov","download_url":"https://codeload.github.com/akhvorov/vgram/tar.gz/refs/heads/master"
,"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":252516038,"owners_count":21760710,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["byte-pair-encoding","feature-extraction","natural-language-processing","sequential-data","text-classification","vgram","word-segmentation"],"created_at":"2024-11-09T05:07:58.995Z","updated_at":"2025-05-05T14:40:49.472Z","avatar_url":"https://github.com/akhvorov.png","language":"C++","funding_links":[],"categories":[],"sub_categories":[],"readme":"[![Documentation Status](https://readthedocs.org/projects/vgram/badge/?version=latest)](https://vgram.readthedocs.io/en/latest/?badge=latest)\n[![PyPI version](https://badge.fury.io/py/vgram.svg)](https://badge.fury.io/py/vgram)\n[![License: MIT](https://img.shields.io/badge/License-MIT-blue.svg)](https://opensource.org/licenses/MIT)\n\n# vgram\n\nThis is an implementation of the CIKM'18 paper [Construction of Efficient V-Gram Dictionary for Sequential Data Analysis](https://dl.acm.org/citation.cfm?id=3271789).\n\n`vgram` is a Python package that provides an sklearn-like fit-transform interface for easy integration into your pipeline.\n\n`vgram` is similar to BPE (Sennrich et al., 2016), but instead of relying on frequencies it takes into account the informativeness of subwords.\nThis allows you to compress the dictionary to several thousand elements while improving accuracy.\n\n## Install\n\n```bash\npip install vgram\n``` \nYou also need Python 3 and cmake installed. 
You may see errors about pybind11 not being installed; these can be ignored.\n\n## Examples\n\n### Basic usage\n\nFit a vgram dictionary and transform the texts:\n\n```python\nfrom vgram import VGram\n\ntexts = [\"hello world\", \"a cat sat on the mat\"]\nvgram = VGram(size=20, iter_num=300)\nvgram.fit(texts)\nresult = vgram.transform(texts)\n\n```\n\n### Working with integer sequences\n\n`vgram` can be applied not only to text but also to integer sequences.\nThis generalization lets you work with non-textual data or convert text to tokens yourself.\nThis example is equivalent to the previous one.\n\n```python\nfrom vgram import IntVGram, CharTokenizer\n\ntexts = [\"hello world\", \"a cat sat on the mat\"]\ntokenizer = CharTokenizer()\n\n# convert text to token ids and pass them to IntVGram\ntok_texts = tokenizer.fit_transform(texts)\nvgram = IntVGram(size=10000, iter_num=10)\nvgram.fit(tok_texts)\nresult = tokenizer.decode(vgram.transform(tok_texts))\n\n``` \n\n### Custom tokenization\n\nBy default, `VGram` lowercases all text, removes all non-alphanumeric symbols, and splits on characters. \nThis normalization is not suitable for many tasks, so you can fit a vgram dictionary with custom normalization and tokenization. \nTo do so, derive a class from `BaseTokenizer` and implement the `normalize` and `tokenize` methods. \n\nNote: This feature is not stable; for custom tokenizers, use the previous approach (tokenize the text yourself and pass the result to `IntVGram`).\n\n```python\nfrom vgram import VGram, BaseTokenizer\n\nclass MyTokenizer(BaseTokenizer):\n    def normalize(self, s):\n        # leave the text unchanged\n        return s\n\n    def tokenize(self, s):\n        # split on spaces so that words are the symbols\n        return s.split(' ')\n\n\ntexts = [\"hello world\", \"a cat sat on the mat\"]\ntokenizer = MyTokenizer()\ntok_vgram = VGram(size=10000, iter_num=10, tokenizer=tokenizer)\ntok_vgram.fit(texts)\ntok_result = tok_vgram.transform(texts)\n\n```\nWith a word-level tokenizer like this one, you can build a dictionary of v-grams in which words act as the symbols.  
\n\n### Classification pipeline\n\nA basic example of classification on the 20 Newsgroups dataset:\n\n```python\nimport numpy as np\nfrom sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer\nfrom sklearn.linear_model import SGDClassifier\nfrom sklearn.datasets import fetch_20newsgroups\nfrom sklearn.pipeline import Pipeline\nfrom vgram import VGram\n\n# fetch data\ntrain, test = fetch_20newsgroups(subset='train'), fetch_20newsgroups(subset='test')\ndata = train.data + test.data\n\n# make vgram pipeline and fit it\nvgram = Pipeline([\n    (\"vgb\", VGram(size=10000, iter_num=10)),\n    (\"vect\", CountVectorizer())\n])\n# this is fine: vgram is fitted only once\nvgram.fit(data)\n\n# fit classifier and get score\npipeline = Pipeline([\n    (\"features\", vgram),\n    ('tfidf', TfidfTransformer(sublinear_tf=True)),\n    ('clf', SGDClassifier(loss='hinge', penalty='l2', alpha=1e-4, max_iter=100, random_state=42))\n])\npipeline.fit(train.data, train.target)\nprint(\"train accuracy: \", np.mean(pipeline.predict(train.data) == train.target))\nprint(\"test accuracy: \", np.mean(pipeline.predict(test.data) == test.target))\n\n# show first ten elements of constructed vgram dictionary,\n# read from the \"vgb\" step (the only VGram step in this pipeline)\nalpha = vgram.named_steps[\"vgb\"].alphabet()\nprint(\"First 10 alphabet elements:\", alpha[:10])\n```\n\nV-Gram is an unsupervised method, which is why we fit vgram on all the data.\nOnce fitted, vgram does not fit again, so fitting the classifier pipeline afterwards does not retrain the dictionary.  \nThe last two lines show how to get the dictionary alphabet and print some of its elements.\n\nRead the full [documentation](https://vgram.readthedocs.io) for more information.\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fakhvorov%2Fvgram","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fakhvorov%2Fvgram","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fakhvorov%2Fvgram/lists"}