{"id":15643254,"url":"https://github.com/harshgeek4coder/multilabel_document_categorization_","last_synced_at":"2025-03-29T22:40:46.758Z","repository":{"id":106560594,"uuid":"384196137","full_name":"harshgeek4coder/Multilabel_Document_Categorization_","owner":"harshgeek4coder","description":"This Repository consists of work done for performing Multilabel document categorization using Semi-Supervised Learning","archived":false,"fork":false,"pushed_at":"2021-07-23T18:19:02.000Z","size":14595,"stargazers_count":0,"open_issues_count":0,"forks_count":0,"subscribers_count":1,"default_branch":"main","last_synced_at":"2025-02-05T00:05:24.957Z","etag":null,"topics":["lda","neural-network","nlp","nmf","semi-supervised-learning","svd-matrix-factorisation","topic-modeling"],"latest_commit_sha":null,"homepage":"","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":null,"status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/harshgeek4coder.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":null,"code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2021-07-08T17:08:15.000Z","updated_at":"2021-07-23T18:20:00.000Z","dependencies_parsed_at":null,"dependency_job_id":"a541f824-6c97-4f07-9e2b-61d207dc0f6c","html_url":"https://github.com/harshgeek4coder/Multilabel_Document_Categorization_","commit_stats":null,"previous_names":[],"tags_count":0,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/harshgeek4coder%2FMultilabel_Document_Categorization_","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/harshgeek4coder%2FMultilabel_Document_Categorization_/tags","releases_url":"htt
ps://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/harshgeek4coder%2FMultilabel_Document_Categorization_/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/harshgeek4coder%2FMultilabel_Document_Categorization_/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/harshgeek4coder","download_url":"https://codeload.github.com/harshgeek4coder/Multilabel_Document_Categorization_/tar.gz/refs/heads/main","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":246254100,"owners_count":20747948,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["lda","neural-network","nlp","nmf","semi-supervised-learning","svd-matrix-factorisation","topic-modeling"],"created_at":"2024-10-03T11:59:42.024Z","updated_at":"2025-03-29T22:40:46.738Z","avatar_url":"https://github.com/harshgeek4coder.png","language":"Python","funding_links":[],"categories":[],"sub_categories":[],"readme":"# Multilabel Document Categorization\n\nThis repository contains work on multilabel document categorization using both unsupervised and supervised learning.\n\nThe work consists of two subtasks :\n- Subtask I  : Unsupervised topic modelling\n- Subtask II : Learning a supervised multi-topic classifier\n\n### Running :\n- Clone the repo.\n- Activate your virtualenv.\n- Run the following commands in your shell :\n```\npip install -r requirements.txt\npython main.py\n```\n- Please NOTE : \n  - Place the file ``` glove.6B.300d.txt ``` in the ```glove``` directory before running 
```main.py```. Please refer to this [Readme](https://github.com/harshgeek4coder/Multilabel_Document_Categorization_/blob/main/glove/README.md) for further instructions.\n  - You also need to download the following NLTK resources (if not already downloaded):\n  ```\n  import nltk\n  nltk.download('stopwords')\n  nltk.download('punkt')\n  nltk.download('wordnet')\n  ```\n\n### Root Folder Structure : \n```\n│   data_process.py\n│   get_features.py\n│   inference.py\n│   main.py\n│   post_process.py\n│   prepare_embed_matrix.py\n│   process_supervised_data.py\n│   requirements.txt\n│   save_n_load_state.py\n│   supervised_models.py\n│   tokenize_n_padding.py\n│   unsupervised_models.py\n│   utils.py\n│\n├───datasets\n│       pre_processed_df.csv [This file will be automatically added once you run main.py]\n│       sentisum-assessment-dataset.csv\n│\n└───glove\n        glove.6B.300d.txt [After downloading and putting this file in glove directory.]\n```\n\n### Visual Analysis :\n- Flow Chart of subtask 1 and 2 : \u003cbr\u003e\n\n\u003cimg src=\"https://github.com/harshgeek4coder/Multilabel_Document_Categorization_/blob/main/visuals/flowchart.png\"\u003e\n\n- Model Architecture : \u003cbr\u003e\n\n\u003cimg src=\"https://github.com/harshgeek4coder/Multilabel_Document_Categorization_/blob/main/visuals/model%20plot.jpg\"\u003e\n\n- Model Performance : \u003cbr\u003e\n\n\u003cimg src=\"https://github.com/harshgeek4coder/Multilabel_Document_Categorization_/blob/main/visuals/model%20performance%20plot.jpg\"\u003e\n\n- Word Cloud : \u003cbr\u003e\n\n\u003cimg src=\"https://github.com/harshgeek4coder/Multilabel_Document_Categorization_/blob/main/visuals/word%20cloud.png\"\u003e\n\n- Top 10 Words : \u003cbr\u003e\n\n\u003cimg src=\"https://github.com/harshgeek4coder/Multilabel_Document_Categorization_/blob/main/visuals/top%2010%20words.png\"\u003e\n\n### Results :\n\n- For Unsupervised Multi Label Classifier - GridSearchCV on LDA Model : \u003cbr\u003e\n```\nFitting 3 folds for each 
of 12 candidates, totalling 36 fits\n[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.\n[Parallel(n_jobs=1)]: Done  36 out of  36 | elapsed: 10.7min finished\nBest model's params:  {'learning_decay': 0.75, 'learning_offset': 30, 'n_components': 12}\nBest log likelihood score:  -83160.21471701778\nModel perplexity:  55.26047238754284\n```\n\n- For Supervised Multi Label Classifier : \u003cbr\u003e\n```\n|                             Model                             \t| Accuracy  \t|  Loss \t|\n|:-------------------------------------------------------------:\t|:---------:\t|:-----:\t|\n|  Bidirectional LSTMs + Convnet 1-D + Latent Embeddings - 256-D \t|  76.22 %  \t| 0.808 \t|\n|  Bidirectional LSTMs + Convnet 1-D + GloVe Embeddings - 300-D \t|  80.46 %  \t| 0.667 \t|\n```\n### Inferences :\n\nFor Supervised Inference : \u003cbr\u003e\n\n\u003cimg src=\"https://github.com/harshgeek4coder/Multilabel_Document_Categorization_/blob/main/visuals/inference%20supervised.jpg\"\u003e\n\nFor Unsupervised Inference : \u003cbr\u003e\n\n\u003cimg src=\"https://github.com/harshgeek4coder/Multilabel_Document_Categorization_/blob/main/visuals/inference%20unsupervised.jpg\"\u003e\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fharshgeek4coder%2Fmultilabel_document_categorization_","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fharshgeek4coder%2Fmultilabel_document_categorization_","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fharshgeek4coder%2Fmultilabel_document_categorization_/lists"}