{"id":19054704,"url":"https://github.com/ssbuild/multi-label-text-classification","last_synced_at":"2025-11-12T11:02:53.428Z","repository":{"id":117215168,"uuid":"409056181","full_name":"ssbuild/Multi-Label-Text-Classification","owner":"ssbuild","description":null,"archived":false,"fork":false,"pushed_at":"2021-09-22T03:48:54.000Z","size":261,"stargazers_count":1,"open_issues_count":0,"forks_count":1,"subscribers_count":0,"default_branch":"master","last_synced_at":"2025-02-22T01:19:52.238Z","etag":null,"topics":[],"latest_commit_sha":null,"homepage":null,"language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"apache-2.0","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/ssbuild.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":".github/FUNDING.yml","license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null},"funding":{"github":null,"patreon":null,"open_collective":null,"ko_fi":null,"tidelift":null,"community_bridge":null,"liberapay":null,"issuehunt":null,"otechie":null,"custom":["https://github.com/RandolphVI/Multi-Label-Text-Classification/blob/master/.github/Alipay.jpeg","https://github.com/RandolphVI/Multi-Label-Text-Classification/blob/master/.github/Wechat.jpeg"]}},"created_at":"2021-09-22T03:48:43.000Z","updated_at":"2023-05-10T14:48:11.000Z","dependencies_parsed_at":null,"dependency_job_id":"df8a83b0-f913-42e6-9189-785efe9429cf","html_url":"https://github.com/ssbuild/Multi-Label-Text-Classification","commit_stats":null,"previous_names":[],"tags_count":0,"template":false,"template_full_name":null,"purl":"pkg:github/ssbuild/Multi-Label-Text-Classification","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/ssbuild%2FMulti-L
abel-Text-Classification","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/ssbuild%2FMulti-Label-Text-Classification/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/ssbuild%2FMulti-Label-Text-Classification/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/ssbuild%2FMulti-Label-Text-Classification/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/ssbuild","download_url":"https://codeload.github.com/ssbuild/Multi-Label-Text-Classification/tar.gz/refs/heads/master","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/ssbuild%2FMulti-Label-Text-Classification/sbom","scorecard":null,"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":284021192,"owners_count":26933845,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","status":"online","status_checked_at":"2025-11-12T02:00:06.336Z","response_time":59,"last_error":null,"robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":true,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":[],"created_at":"2024-11-08T23:39:27.472Z","updated_at":"2025-11-12T11:02:53.257Z","avatar_url":"https://github.com/ssbuild.png","language":"Python","funding_links":["https://github.com/RandolphVI/Multi-Label-Text-Classification/blob/master/.github/Alipay.jpeg","https://github.com/RandolphVI/Multi-Label-Text-Classification/blob/master/.github/Wechat.jpeg"],"categories":[],"sub_categories":[],"readme":"# Deep Learning for Multi-Label Text 
Classification\n\n[![Python Version](https://img.shields.io/badge/language-python3.6-blue.svg)](https://www.python.org/downloads/) [![Build Status](https://travis-ci.org/RandolphVI/Multi-Label-Text-Classification.svg?branch=master)](https://travis-ci.org/RandolphVI/Multi-Label-Text-Classification) [![Codacy Badge](https://api.codacy.com/project/badge/Grade/c45aac301b244316830b00b9b0985e3e)](https://www.codacy.com/app/chinawolfman/Multi-Label-Text-Classification?utm_source=github.com\u0026amp;utm_medium=referral\u0026amp;utm_content=RandolphVI/Multi-Label-Text-Classification\u0026amp;utm_campaign=Badge_Grade) [![License](https://img.shields.io/github/license/RandolphVI/Multi-Label-Text-Classification.svg)](https://www.apache.org/licenses/LICENSE-2.0) [![Issues](https://img.shields.io/github/issues/RandolphVI/Multi-Label-Text-Classification.svg)](https://github.com/RandolphVI/Multi-Label-Text-Classification/issues)\n\nThis repository is my research project and also a study of TensorFlow and deep learning models (FastText, CNN, LSTM, etc.).\n\nThe main objective of the project is to solve the multi-label text classification problem with deep neural networks. Because each sample can carry several labels at once, the labels are encoded as a multi-hot vector such as [0, 1, 0, ..., 1, 1].\n\n## Requirements\n\n- Python 3.6\n- Tensorflow 1.15.0\n- Tensorboard 1.15.0\n- Sklearn 0.19.1\n- Numpy 1.16.2\n- Gensim 3.8.3\n- Tqdm 4.49.0\n\n## Project\n\nThe project structure is below:\n\n```text\n.\n├── Model\n│   ├── test_model.py\n│   ├── text_model.py\n│   └── train_model.py\n├── data\n│   ├── word2vec_100.model.* [Need Download]\n│   ├── Test_sample.json\n│   ├── Train_sample.json\n│   └── Validation_sample.json\n├── utils\n│   ├── checkmate.py\n│   ├── data_helpers.py\n│   └── param_parser.py\n├── LICENSE\n├── README.md\n└── requirements.txt\n```\n\n\n\n## Innovation\n\n### Data part\n1. Support both **Chinese** and English data (using `jieba` or `nltk` for word segmentation).\n2. 
Support **your own pre-trained word vectors** (e.g. trained with `gensim`).\n3. Add embedding visualization with **TensorBoard** (you need to create `metadata.tsv` first).\n\n### Model part\n1. Add the correct **L2 loss** calculation.\n2. Add **gradient clipping** to prevent exploding gradients.\n3. Add exponential **learning rate decay**.\n4. Add a new **Highway Layer** (which proved useful judging by model performance).\n5. Add a **Batch Normalization Layer**.\n\n### Code part\n1. Can choose to **train** the model directly or **restore** the model from a checkpoint in `train.py`.\n2. Can predict the labels via **threshold** and **top-K** in `train.py` and `test.py`.\n3. Can calculate the evaluation metrics **AUC** and **AUPRC**.\n4. Can create a prediction file, which includes the predicted values and predicted labels of the test set, in `test.py`.\n5. Add other useful data preprocessing functions in `data_helpers.py`.\n6. Use `logging` to record run information (including **parameters display**, **model training info**, etc.).\n7. Provide the ability to save the best n checkpoints in `checkmate.py`, whereas `tf.train.Saver` can only keep the last n checkpoints.\n\n## Data\n\nSee the data format in the `/data` folder, which includes the data sample files. 
For example:\n\n```json\n{\"testid\": \"3935745\", \"features_content\": [\"pore\", \"water\", \"pressure\", \"metering\", \"device\", \"incorporating\", \"pressure\", \"meter\", \"force\", \"meter\", \"influenced\", \"pressure\", \"meter\", \"device\", \"includes\", \"power\", \"member\", \"arranged\", \"control\", \"pressure\", \"exerted\", \"pressure\", \"meter\", \"force\", \"meter\", \"applying\", \"overriding\", \"force\", \"pressure\", \"meter\", \"stop\", \"influence\", \"force\", \"meter\", \"removing\", \"overriding\", \"force\", \"pressure\", \"meter\", \"influence\", \"force\", \"meter\", \"resumed\"], \"labels_index\": [526, 534, 411], \"labels_num\": 3}\n```\n\n- **\"testid\"**: the id of the record.\n- **\"features_content\"**: the segmented words of the text (after removing stopwords).\n- **\"labels_index\"**: the label indices of the record.\n- **\"labels_num\"**: the number of labels.\n\n### Text Segment\n\n1. Use the `nltk` package if you are working with English text data.\n\n2. Use the `jieba` package if you are working with Chinese text data.\n\n### Data Format\n\nThis repository can be used with other text classification datasets in two ways:\n1. Convert your dataset into the same format as [the sample](https://github.com/RandolphVI/Multi-Label-Text-Classification/blob/master/data).\n2. Modify the data preprocessing code in `data_helpers.py`.\n\nWhich one to choose depends on your data and task.\n\n**🤔 Before you open a new issue about the data format, please check `data_sample.json` and read the other open issues first, because someone may have already asked the same question. 
For example:**\n\n- [What does the input file format look like?](https://github.com/RandolphVI/Multi-Label-Text-Classification/issues/1)\n- [Where is the dataset for training?](https://github.com/RandolphVI/Multi-Label-Text-Classification/issues/7)\n- [What are content.txt and metadata.tsv in data_helpers.py, what is their exact format, and could you provide a sample?](https://github.com/RandolphVI/Multi-Label-Text-Classification/issues/12)\n\n### Pre-trained Word Vectors\n\n**You can download the [Word2vec model file](https://drive.google.com/file/d/1S33iejwuQOIaNQfXW7fA_6zBwHHClT--/view?usp=sharing) (dim=100). Make sure the files are unzipped and placed under the `/data` folder.**\n\nYou can pre-train your own word vectors (based on your corpus) in several ways:\n- Use the `gensim` package to pre-train the vectors.\n- Use the `glove` tools to pre-train the vectors.\n- You can even use a **fasttext** network to pre-train the vectors.\n\n## Usage\n\nSee [Usage](https://github.com/RandolphVI/Multi-Label-Text-Classification/blob/master/Usage.md).\n\n## Network Structure\n\n### FastText\n\n![](https://farm2.staticflickr.com/1917/45609842012_30f370a0ee_o.png)\n\nReferences:\n\n- [Bag of Tricks for Efficient Text Classification](https://arxiv.org/pdf/1607.01759.pdf)\n\n---\n\n### TextANN\n\n![](https://farm2.staticflickr.com/1965/44745949305_50f831a579_o.png)\n\nReferences:\n\n- **Personal ideas 🙃**\n\n---\n\n### TextCNN\n\n![](https://farm2.staticflickr.com/1927/44935475604_1d6b8f71a3_o.png)\n\nReferences:\n\n- [Convolutional Neural Networks for Sentence Classification](http://arxiv.org/abs/1408.5882)\n- [A Sensitivity Analysis of (and Practitioners' Guide to) Convolutional Neural Networks for Sentence Classification](http://arxiv.org/abs/1510.03820)\n\n---\n\n### TextRNN\n\n**Warning: The model is usable but not finished yet 🤪!**\n\n![](https://farm2.staticflickr.com/1925/30719666177_6665038ea2_o.png)\n\n#### TODO\n1. Add a BN-LSTM cell unit.\n2. 
Add attention.\n\nReferences:\n\n- [Recurrent Neural Network for Text Classification with Multi-Task Learning](http://www.aaai.org/ocs/index.php/AAAI/AAAI15/paper/download/9745/9552)\n\n---\n\n### TextCRNN\n\n![](https://farm2.staticflickr.com/1915/43842346360_e4660c5921_o.png)\n\nReferences:\n\n- **Personal ideas 🙃**\n\n---\n\n### TextRCNN\n\n![](https://farm2.staticflickr.com/1950/31788031648_b5cba7bbf0_o.png)\n\nReferences:\n\n- **Personal ideas 🙃**\n\n---\n\n### TextHAN\n\nReferences:\n\n- [Hierarchical Attention Networks for Document Classification](https://www.cs.cmu.edu/~diyiy/docs/naacl16.pdf)\n\n---\n\n### TextSANN\n\n**Warning: The model is usable but not finished yet 🤪!**\n\n#### TODO\n1. Add attention penalization loss.\n2. Add visualization.\n\nReferences:\n\n- [A Structured Self-Attentive Sentence Embedding](https://arxiv.org/pdf/1703.03130.pdf)\n\n---\n\n## About Me\n\n黄威，Randolph\n\nSCU SE Bachelor; USTC CS Ph.D.\n\nEmail: chinawolfman@hotmail.com\n\nMy Blog: [randolph.pro](http://randolph.pro)\n\nLinkedIn: [randolph's linkedin](https://www.linkedin.com/in/randolph-%E9%BB%84%E5%A8%81/)\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fssbuild%2Fmulti-label-text-classification","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fssbuild%2Fmulti-label-text-classification","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fssbuild%2Fmulti-label-text-classification/lists"}