{"id":27231223,"url":"https://github.com/cryptum169/textcnn-for-chinese","last_synced_at":"2026-01-24T04:36:23.603Z","repository":{"id":40927608,"uuid":"139683834","full_name":"Cryptum169/textCNN-for-Chinese","owner":"Cryptum169","description":"An implementation of textCNN text classification task for Chinese. 2 Versions with and without pre-trained Doc2Vec model.","archived":false,"fork":false,"pushed_at":"2022-06-22T17:20:18.000Z","size":486,"stargazers_count":4,"open_issues_count":0,"forks_count":0,"subscribers_count":2,"default_branch":"master","last_synced_at":"2023-04-28T13:36:52.111Z","etag":null,"topics":["chinese","classification","nlp","textcnn"],"latest_commit_sha":null,"homepage":"","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":null,"status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/Cryptum169.png","metadata":{"files":{"readme":"readme.md","changelog":null,"contributing":null,"funding":null,"license":null,"code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null}},"created_at":"2018-07-04T07:18:02.000Z","updated_at":"2023-03-31T16:13:42.000Z","dependencies_parsed_at":"2022-09-20T19:50:58.293Z","dependency_job_id":null,"html_url":"https://github.com/Cryptum169/textCNN-for-Chinese","commit_stats":null,"previous_names":[],"tags_count":null,"template":null,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/Cryptum169%2FtextCNN-for-Chinese","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/Cryptum169%2FtextCNN-for-Chinese/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/Cryptum169%2FtextCNN-for-Chinese/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/Cryptum169%2FtextCNN-for-Chinese/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/o
wners/Cryptum169","download_url":"https://codeload.github.com/Cryptum169/textCNN-for-Chinese/tar.gz/refs/heads/master","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":248226443,"owners_count":21068204,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["chinese","classification","nlp","textcnn"],"created_at":"2025-04-10T13:44:06.879Z","updated_at":"2026-01-24T04:36:23.562Z","avatar_url":"https://github.com/Cryptum169.png","language":"Python","funding_links":[],"categories":[],"sub_categories":[],"readme":"# Notice\nThis repository is no longer maintained and likely relies on a legacy Keras interface.\n\n# An implementation of textCNN for Chinese with and without gensim Doc2Vec\nWith a pre-trained Doc2Vec model, the textCNN model size shrinks from around 200 MB to 30 MB, and training is significantly faster. On my dataset, textCNN with Doc2Vec reaches around 80% accuracy in the first epoch, with an average step time of 1 ms. Vanilla textCNN takes several more epochs to converge, with an average step time of about 70 ms.\n\nBe aware, though, that the Doc2Vec model may take up several gigabytes.\n\n# Requirements\nRun ```pip install -r requirements.txt```\n\n# Creating a Gensim Doc2Vec model\nUse ```doc2vec_model_gen.py``` under ```src``` to generate a gensim Doc2Vec model. Run doc2vec_model_gen.py directly from a terminal, with the corpus file named \"Content.txt\" in the same directory. Content.txt must contain one article per line, i.e. 
Article1 \\n Article2 \\n Article3.\n\nPut all four model files under the model/doc2vec directory.\n\n# Un-uploaded files\nThe pre-generated doc2vec model and the pre-trained textCNN models (with and without doc2vec) can be found ~~here~~ still uploading. Training data can be found at this [link](https://pan.baidu.com/s/1CqSusnOfBFXtjG7o2NRcWQ).\n\nAfter downloading, place ```training_data``` under ```data/Sogou/training_data```. The local model directory can be replaced directly with the downloaded model folder.\n\n# Sample Code for training the model\n## Sample textCNN without Doc2Vec:\n```\nfrom src.textCNN import textCNN\nfrom src.data_helpers import load_data\n\n# Load all .txt files under the directory\nx, y, vocabulary, vocabulary_inv = load_data('data/Sogou/training_data/')\n\n# Construct the textCNN object\nclassifier = textCNN(\n    sequence_length=x.shape[1],\n    vocabulary_size=len(vocabulary_inv),\n    num_classifier=y.shape[1]\n)\n\n# Instantiate the model\nclassifier.construct_model()\n\n# Start training; the function already wraps a random shuffler\nclassifier.train(x, y, checkpoint_path='model/textCNN/classification.hdf5', epochs=20)\n```\n\n## Sample textCNN with Doc2Vec model:\n```\nfrom src.textCNN import textCNN\nfrom src.data_helpers_doc2vec import load_data_doc2vec\n\n# Load the data as a matrix of Doc2Vec vectors\nx, y = load_data_doc2vec(doc2vec_directory='model/doc2vec/d2v.model')\n\n# Construct the textCNN object\nclassifier_doc2vec = textCNN(\n    sequence_length=x.shape[1],\n    num_classifier=y.shape[1],\n    embedding_dim=x.shape[2],\n    doc2vec=True)\n\n# Instantiate the model\nclassifier_doc2vec.construct_model()\n\n# Train\nclassifier_doc2vec.train(\n    x, y, checkpoint_path='model/textCNN_with_doc2vec/classification.hdf5', epochs=20)\n```\n\n# Prediction Using the Model\n```sample.py``` contains a code snippet showing how to predict with and verify the model. 
A wrapper for this will be added in the future.\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fcryptum169%2Ftextcnn-for-chinese","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fcryptum169%2Ftextcnn-for-chinese","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fcryptum169%2Ftextcnn-for-chinese/lists"}