{"id":13585144,"url":"https://github.com/AlexGidiotis/Document-Classifier-LSTM","last_synced_at":"2025-04-07T06:32:42.159Z","repository":{"id":24444179,"uuid":"100877828","full_name":"AlexGidiotis/Document-Classifier-LSTM","owner":"AlexGidiotis","description":"A bidirectional LSTM with attention for multiclass/multilabel text classification.","archived":false,"fork":false,"pushed_at":"2024-08-30T23:53:12.000Z","size":84,"stargazers_count":171,"open_issues_count":8,"forks_count":52,"subscribers_count":6,"default_branch":"master","last_synced_at":"2024-11-06T02:38:46.922Z","etag":null,"topics":["arxiv","attention-mechanism","hierarchical-attention-networks","keras","lstm","multilabel-multiclass","recurrent-neural-networks","tensorflow","text-classification"],"latest_commit_sha":null,"homepage":"","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/AlexGidiotis.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2017-08-20T17:32:57.000Z","updated_at":"2024-10-22T05:35:46.000Z","dependencies_parsed_at":"2024-11-06T03:03:19.339Z","dependency_job_id":"3e83e0ee-d4e8-4972-8913-0e5cd36e58c2","html_url":"https://github.com/AlexGidiotis/Document-Classifier-LSTM","commit_stats":null,"previous_names":[],"tags_count":0,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/AlexGidiotis%2FDocument-Classifier-LSTM","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/AlexGidiotis%2FDocument-Classifier-LSTM/tags","releases_url":"https://repos.ecosyste.ms/
api/v1/hosts/GitHub/repositories/AlexGidiotis%2FDocument-Classifier-LSTM/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/AlexGidiotis%2FDocument-Classifier-LSTM/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/AlexGidiotis","download_url":"https://codeload.github.com/AlexGidiotis/Document-Classifier-LSTM/tar.gz/refs/heads/master","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":247607550,"owners_count":20965942,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["arxiv","attention-mechanism","hierarchical-attention-networks","keras","lstm","multilabel-multiclass","recurrent-neural-networks","tensorflow","text-classification"],"created_at":"2024-08-01T15:04:45.820Z","updated_at":"2025-04-07T06:32:42.146Z","avatar_url":"https://github.com/AlexGidiotis.png","language":"Python","funding_links":[],"categories":["Python"],"sub_categories":[],"readme":"# Document-Classifier-LSTM\nRecurrent Neural Networks for multiclass, multilabel classification of texts. The models learn to tag small texts with 169 different tags from arXiv. 
\n\nclassifier.py implements a standard BLSTM network with attention.\n\nhatt_classifier.py implements [Hierarchical Attention Networks for Document Classification](https://www.cs.cmu.edu/~hovy/papers/16HLT-hierarchical-attention-networks.pdf).\n\nThe neural networks were built using Keras and TensorFlow.\n\nThe best-performing model is the attention BLSTM, which achieves a micro F-score of 0.67 on the test set.\n\nThe Hierarchical Attention Network achieves only a 0.65 micro F-score.\n\nThe training data consists of 500k paper abstracts from arXiv. To download your own data, refer to the [arXiv OAI API](https://arxiv.org/help/bulk_data).\n\nPretrained word embeddings can be used, either GloVe or Word2Vec. You can download [GoogleNews-vectors-negative300.bin](https://code.google.com/archive/p/word2vec) or the [GloVe embeddings](https://nlp.stanford.edu/projects/glove). \n\n\n## Usage:\n\n1) To train your own model, first prepare your data set using the data_prep.py script. The preprocessing converts text to lower case, tokenizes it, and removes very short words. 
The preprocessed files and label files should be saved in a /data folder.\n\n2) You can now run classifier.py or hatt_classifier.py to build and train the models.\n\n3) The trained models are exported to JSON and the weights to HDF5 (.h5) files for later use.\n\n4) You can use utils.visualize_attention to visualize the attention weights.\n\n\n## Requirements\n\n- Python\n- NLTK\n- NumPy\n- Pandas\n- SciPy\n- OpenCV\n- scikit-learn\n- [TensorFlow](https://github.com/tensorflow/tensorflow)\n- [Keras](https://github.com/fchollet/keras)\n\nRun `pip install -r requirements.txt` to install the requirements.","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2FAlexGidiotis%2FDocument-Classifier-LSTM","html_url":"https://awesome.ecosyste.ms/projects/github.com%2FAlexGidiotis%2FDocument-Classifier-LSTM","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2FAlexGidiotis%2FDocument-Classifier-LSTM/lists"}