{"id":13936323,"url":"https://github.com/raghakot/keras-text","last_synced_at":"2025-04-05T07:03:07.377Z","repository":{"id":62573976,"uuid":"101573647","full_name":"raghakot/keras-text","owner":"raghakot","description":"Text Classification Library in Keras","archived":false,"fork":false,"pushed_at":"2018-06-04T19:44:35.000Z","size":12114,"stargazers_count":420,"open_issues_count":17,"forks_count":97,"subscribers_count":21,"default_branch":"master","last_synced_at":"2025-03-29T06:04:39.320Z","etag":null,"topics":["deep-learning","keras","machine-learning","neural-network","tensorflow","text-classification","theano"],"latest_commit_sha":null,"homepage":"https://raghakot.github.io/keras-text/","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/raghakot.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null}},"created_at":"2017-08-27T18:59:02.000Z","updated_at":"2024-02-29T04:50:00.000Z","dependencies_parsed_at":"2022-11-03T17:30:33.845Z","dependency_job_id":null,"html_url":"https://github.com/raghakot/keras-text","commit_stats":null,"previous_names":[],"tags_count":1,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/raghakot%2Fkeras-text","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/raghakot%2Fkeras-text/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/raghakot%2Fkeras-text/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/raghakot%2Fkeras-text/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/raghakot","download_url":"https://codeload.github.
com/raghakot/keras-text/tar.gz/refs/heads/master","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":247299831,"owners_count":20916190,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["deep-learning","keras","machine-learning","neural-network","tensorflow","text-classification","theano"],"created_at":"2024-08-07T23:02:34.116Z","updated_at":"2025-04-05T07:03:07.363Z","avatar_url":"https://github.com/raghakot.png","language":"Python","funding_links":[],"categories":["Python"],"sub_categories":[],"readme":"# Keras Text Classification Library\n[![Build Status](https://travis-ci.org/raghakot/keras-text.svg?branch=master)](https://travis-ci.org/raghakot/keras-text)\n[![license](https://img.shields.io/github/license/mashape/apistatus.svg?maxAge=2592000)](https://github.com/raghakot/keras-text/blob/master/LICENSE)\n[![Slack](https://img.shields.io/badge/slack-discussion-E01563.svg)](https://join.slack.com/t/keras-text/shared_invite/MjMzNDU3NDAxODMxLTE1MDM4NTg0MTktNzgxZTNjM2E4Zg)\n\nkeras-text is a one-stop text classification library implementing various state-of-the-art models with a clean and \nextensible interface for implementing custom architectures.\n\n## Quick start\n\n### Create a tokenizer to build your vocabulary\n\n- To represent your dataset as `(docs, words)` use `WordTokenizer`\n- To represent your dataset as `(docs, sentences, words)` use `SentenceWordTokenizer`\n- To create arbitrary hierarchies, extend `Tokenizer` and implement the `token_generator` method.\n\n```python\nfrom keras_text.processing import WordTokenizer\n\n\ntokenizer = 
WordTokenizer()\ntokenizer.build_vocab(texts)\n```\n\nWant to tokenize with character tokens to leverage character models? Use `CharTokenizer`.\n\n\n### Build a dataset\n\nA dataset encapsulates the tokenizer, X, y and the test set. This allows you to focus your efforts on \ntrying various architectures/hyperparameters without having to worry about inconsistent evaluation. A dataset can be \nsaved to and loaded from disk.\n\n```python\nfrom keras_text.data import Dataset\n\n\nds = Dataset(X, y, tokenizer=tokenizer)\nds.update_test_indices(test_size=0.1)\nds.save('dataset')\n```\n\nThe `update_test_indices` method automatically stratifies multi-class or multi-label data correctly.\n\n### Build text classification models\n\nSee the `tests/` folder for usage.\n\n#### Word based models\n\nWhen the dataset is represented as `(docs, words)`, word based models can be created using `TokenModelFactory`.\n\n```python\nfrom keras_text.models import TokenModelFactory\nfrom keras_text.models import YoonKimCNN, AttentionRNN, StackedRNN\n\n\n# RNN models can use `max_tokens=None` to allow a variable number of words per mini-batch.\nfactory = TokenModelFactory(1, tokenizer.token_index, max_tokens=100, embedding_type='glove.6B.100d')\nword_encoder_model = YoonKimCNN()\nmodel = factory.build_model(token_encoder_model=word_encoder_model)\nmodel.compile(optimizer='adam', loss='categorical_crossentropy')\nmodel.summary()\n``` \n\nCurrently supported models include:\n\n- [Yoon Kim CNN](https://arxiv.org/abs/1408.5882)\n- Stacked RNNs\n- Attention (with/without context) based RNN encoders.\n\n`TokenModelFactory.build_model` uses the provided word encoder, whose output is then classified via a `Dense` block.\n\n#### Sentence based models\n\nWhen the dataset is represented as `(docs, sentences, words)`, sentence based models can be created using `SentenceModelFactory`.\n\n```python\nfrom keras_text.models import SentenceModelFactory\nfrom keras_text.models import YoonKimCNN, AttentionRNN, StackedRNN, AveragingEncoder\n\n\n# 
Pad max sentences per doc to 500 and max words per sentence to 200.\n# Can also use `max_sents=None` to allow variable sized max_sents per mini-batch.\nfactory = SentenceModelFactory(10, tokenizer.token_index, max_sents=500, max_tokens=200, embedding_type='glove.6B.100d')\nword_encoder_model = AttentionRNN()\nsentence_encoder_model = AttentionRNN()\n\n# Allows you to compose an arbitrary word encoder followed by a sentence encoder.\nmodel = factory.build_model(word_encoder_model, sentence_encoder_model)\nmodel.compile(optimizer='adam', loss='categorical_crossentropy')\nmodel.summary()\n``` \n\nCurrently supported models include:\n\n- [Yoon Kim CNN](https://arxiv.org/abs/1408.5882)\n- Stacked RNNs\n- Attention (with/without context) based RNN encoders.\n\n`SentenceModelFactory.build_model` creates a tiered model where the words within a sentence are first encoded using  \n`word_encoder_model`. All such encodings per sentence are then encoded using `sentence_encoder_model`.\n\n- [Hierarchical attention networks](http://www.cs.cmu.edu/~./hovy/papers/16HLT-hierarchical-attention-networks.pdf) \n(HANs) can be built by composing two attention based RNN models. This is useful when a document is very large.\n- For smaller documents, a reasonable way to encode a sentence is to average the words within it. This can be done by using\n`token_encoder_model=AveragingEncoder()`\n- Mix and match encoders as you see fit for your problem.\n\n\n## Resources\n\nTODO: Update documentation and add notebook examples.\n\nStay tuned for better documentation and examples. \nUntil then, the best resource is the [API docs](https://raghakot.github.io/keras-text/)\n\n\n## Installation\n\n1) Install [keras](https://github.com/fchollet/keras/blob/master/README.md#installation) \nwith theano or tensorflow backend. 
Note that this library requires Keras \u003e 2.0.\n\n2) Install keras-text\n\u003e From sources\n```bash\nsudo python setup.py install\n```\n\n\u003e PyPI package\n```bash\nsudo pip install keras-text\n```\n\n3) Download the target spaCy model\n\nkeras-text uses the excellent spaCy library for tokenization. See the instructions on how to \n[download a model](https://spacy.io/docs/usage/models#download) for the target language.\n\n\n## Citation\n\nPlease cite keras-text in your publications if it helped your research. Here is an example BibTeX entry:\n\n```\n@misc{raghakotkerastext,\n  title={keras-text},\n  author={Kotikalapudi, Raghavendra and contributors},\n  year={2017},\n  publisher={GitHub},\n  howpublished={\\url{https://github.com/raghakot/keras-text}},\n}\n```\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fraghakot%2Fkeras-text","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fraghakot%2Fkeras-text","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fraghakot%2Fkeras-text/lists"}