{"id":13910575,"url":"https://github.com/Lipairui/textgo","last_synced_at":"2025-07-18T09:32:08.808Z","repository":{"id":57474666,"uuid":"279469334","full_name":"Lipairui/textgo","owner":"Lipairui","description":"Text preprocessing, representation, similarity calculation, text search and classification. Let's go and play with text!","archived":false,"fork":false,"pushed_at":"2022-03-27T05:02:07.000Z","size":545,"stargazers_count":43,"open_issues_count":0,"forks_count":2,"subscribers_count":1,"default_branch":"master","last_synced_at":"2024-11-11T01:51:51.491Z","etag":null,"topics":["bert","nlp","text-classification","text-preprocessing","text-representation","text-search","text-similarity"],"latest_commit_sha":null,"homepage":"","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/Lipairui.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null}},"created_at":"2020-07-14T03:15:45.000Z","updated_at":"2024-05-27T06:50:54.000Z","dependencies_parsed_at":"2022-09-10T02:21:46.730Z","dependency_job_id":null,"html_url":"https://github.com/Lipairui/textgo","commit_stats":null,"previous_names":[],"tags_count":0,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/Lipairui%2Ftextgo","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/Lipairui%2Ftextgo/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/Lipairui%2Ftextgo/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/Lipairui%2Ftextgo/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/Lipairui","download_url":"https:
//codeload.github.com/Lipairui/textgo/tar.gz/refs/heads/master","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":226388647,"owners_count":17617311,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["bert","nlp","text-classification","text-preprocessing","text-representation","text-search","text-similarity"],"created_at":"2024-08-07T00:01:35.104Z","updated_at":"2024-11-25T19:31:19.851Z","avatar_url":"https://github.com/Lipairui.png","language":"Python","funding_links":[],"categories":["Python"],"sub_categories":[],"readme":"# TextGo\n\n*TextGo* is a Python package that helps you work with text data conveniently and efficiently. It is an NLP toolkit that provides APIs for text preprocessing, representation, similarity calculation, text search, and classification, and it supports both English and Chinese.\n\n## Highlights\n* Supports both English and Chinese in text preprocessing\n* Provides various text representation algorithms including BOW, TF-IDF, LDA, LSA, PCA, Word2Vec/GloVe/FastText, BERT...\n* Supports fast text search based on [Faiss](https://github.com/facebookresearch/faiss)\n* Supports various text classification algorithms including FastText, TextCNN, TextRNN, TextRCNN, TextRCNN_Att, Bert, XLNet\n* Easy to use in just a few lines of code\n\n## Installing\nInstall and update using pip:      \n`pip install textgo`\n\nNote: tested successfully on Python 3.     
\nTip: the fasttext package must be installed manually, as follows:\n\n```\ngit clone https://github.com/facebookresearch/fastText.git\ncd fastText\nmake\npip install .\n```\n\n## Getting Started\n### 1. Text preprocessing\n   \n**Clean text**\n\n```\nfrom textgo import Preprocess\n# Chinese\ntp1 = Preprocess(lang='zh')\ntexts1 = [\"\u003ctext\u003e自然语言处理是计算机科学领域与人工智能领域中的一个重要方向。\u003c\\text\u003e\", \"??文本预处理~其实很简单！\"]\nptexts1 = tp1.clean(texts1)\nprint(ptexts1)\n```\n\nOutput: `['自然语言处理是计算机科学领域与人工智能领域中的一个重要方向', '文本预处理其实很简单']`\n  \n```\n# English\ntp2 = Preprocess(lang='en')\ntexts2 = [\"\u003ctext\u003eNatural Language Processing, usually shortened as NLP, is a branch of artificial intelligence that deals with the interaction between computers and humans using the natural language\u003c\\text\u003e\"]\nptexts2 = tp2.clean(texts2)\nprint(ptexts2)\n```\nOutput: `['natural language processing usually shortened as nlp is a branch of artificial intelligence that deals with the interaction between computers and humans using the natural language']`\n\n**Tokenize and drop stopwords**\n```\n# Chinese\ntokens1 = tp1.tokenize(ptexts1)\nprint(tokens1)\n```\nOutput: `[['自然语言', '处理', '计算机科学', '领域', '人工智能', '领域', '中', '重要', '方向'], ['文本', '预处理', '其实', '很', '简单']]`\n\n```\n# English\ntokens2 = tp2.tokenize(ptexts2)\nprint(tokens2)\n```\nOutput: `[['natural', 'language', 'processing', 'usually', 'shortened', 'nlp', 'branch', 'artificial', 'intelligence', 'deals', 'interaction', 'computers', 'humans', 'using', 'natural', 'language']]`\n\n**Preprocess (Clean + Tokenize + Remove stopwords + Join words)**\n```\n# Chinese\nptexts1 = tp1.preprocess(texts1)\nprint(ptexts1)\n```\nOutput: `['自然语言 处理 计算机科学 领域 人工智能 领域 中 重要 方向', '文本 预处理 其实 很 简单']`\n\n```\n# English\nptexts2 = tp2.preprocess(texts2)\nprint(ptexts2)\n```\nOutput: `['natural language processing usually shortened nlp branch artificial intelligence deals interaction computers humans using natural language']`\n\n### 2. 
Text representation\n```\nfrom textgo import Embeddings\nptexts = ['自然语言 处理 计算机科学 领域 人工智能 领域 中 重要 方向', '文本 预处理 其实 很 简单']\nemb = Embeddings()\n# BOW\nbow_emb = emb.bow(ptexts)\n\n# TF-IDF\ntfidf_emb = emb.tfidf(ptexts)\n\n# LDA\nlda_emb = emb.lda(ptexts, dim=2)\n\n# LSA\nlsa_emb = emb.lsa(ptexts, dim=2)\n\n# PCA\npca_emb = emb.pca(ptexts, dim=2)\n\n# Word2Vec\nw2v_emb = emb.word2vec(ptexts, method='word2vec', model_path='model/word2vec.bin')\n\n# GloVe\nglove_emb = emb.word2vec(ptexts, method='glove', model_path='model/glove.bin')\n\n# FastText\nft_emb = emb.word2vec(ptexts, method='fasttext', model_path='model/fasttext.bin')\n\n# BERT\nbert_emb = emb.bert(ptexts, model_path='model/bert-base-chinese')\n\n```\nTip: For methods like Word2Vec and BERT, you can load the model once and then compute embeddings, which avoids reloading the model on every call. Take BERT for example:\n```\nemb.load_model(method=\"bert\", model_path='model/bert-base-chinese')\nbert_emb1 = emb.bert(ptexts1)\nbert_emb2 = emb.bert(ptexts2)\n```\n\n### 3. Similarity calculation\n\nTextGo can calculate the similarity or distance between texts using any of the text representations above. For example, we can use BERT sentence embeddings to compute the cosine similarity between two lists of sentences, pair by pair.\n```\nfrom textgo import TextSim\ntexts1 = [\"她的笑渐渐变少了。\",\"最近天气晴朗适合出去玩！\"]\ntexts2 = [\"她变得越来越不开心了。\",\"近来总是风雨交加没法外出！\"]\n\nts = TextSim(lang='zh', method='bert', model_path='model/bert-base-chinese')\nsim = ts.similarity(texts1, texts2, mutual=False)\nprint(sim)\n```   \n\nOutput: `[0.9143135, 0.7350756]`\n\nWe can also calculate the similarity between every pair of sentences across the two datasets by setting `mutual=True`.\n```\nsim = ts.similarity(texts1, texts2, mutual=True)\nprint(sim)\n```\n\nOutput: `\narray([[0.9143138 , 0.772496  ],\n       [0.704296  , 0.73507595]], dtype=float32)\n`\n       \n### 4. Text search\nTextGo also supports searching for query texts in a large text database based on cosine similarity or Euclidean distance. 
It provides two implementations: a normal one, suitable for small datasets, and an optimized one, based on Faiss and suitable for large datasets.\n```\nfrom textgo import TextSim\n# query texts\ntexts1 = [\"A soccer game with multiple males playing.\"]\n# database\ntexts2 = [\"Some men are playing a sport.\", \"A man is driving down a lonely road.\", \"A happy woman in a fairy costume holds an umbrella.\"]\nts = TextSim(lang='en', method='word2vec', model_path='model/word2vec.bin')\n```\n\n**Normal search**\n```\nres = ts.get_similar_res(texts1, texts2, metric='cosine', threshold=0.5, topn=2)\nprint(res)\n```\nOutput: `[[(0, 'Some men are playing a sport.', 0.828474), (1, 'A man is driving down a lonely road.', 0.60927737)]]`\n\n**Fast search**\n```\nts.build_index(texts2, metric='cosine')\nres = ts.search(texts1, threshold=0.5, topn=2)\nprint(res)\n```\nOutput: `[[(0, 'Some men are playing a sport.', 0.828474), (1, 'A man is driving down a lonely road.', 0.60927737)]]`\n\n### 5. Text classification\nTrain a text classifier in just a few lines. Models supported: FastText, TextCNN, TextRNN, TextRCNN, TextRCNN_Att, Bert, XLNet.\n```\nfrom textgo import Classifier\n\n# Prepare data\nX = [text1, text2, ... textn]\ny = [label1, label2, ... labeln]\n\n# load config\nconfig_path = \"./config.ini\"  # Include all model parameters\nmodel_name = \"Bert\" # Supported models: FastText, TextCNN, TextRNN, TextRCNN, TextRCNN_Att, Bert, XLNet\nargs = load_config(config_path, model_name) \nargs['model_name'] = model_name \nargs['save_path'] = \"output/%s\"%model_name\n\n# train \nclf = Classifier(args) \nclf.train(X, y, evaluate_test=False) # If evaluate_test=True, 10% of the data is held out as a test set and the model is evaluated on it. \n\n# predict\npredclass = clf.predict(X) \n```\n\n## Resources\n### 1. Pretrained word embeddings\n#### Chinese\n1. Various Chinese word vectors: https://github.com/Embedding/Chinese-Word-Vectors\n2. 
Tencent AI Lab Chinese word embeddings: https://ai.tencent.com/ailab/nlp/en/embedding.html\n#### English\n1. GloVe: https://nlp.stanford.edu/projects/glove/\n2. FastText: https://fasttext.cc/docs/en/english-vectors.html\n3. Word2Vec: https://drive.google.com/file/d/0B7XkCwpI5KDYNlNUTTlSS21pQmM/edit\n### 2. Pretrained models\nhttps://huggingface.co/models \n\n## LICENSE\nTextGo is MIT-licensed.\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2FLipairui%2Ftextgo","html_url":"https://awesome.ecosyste.ms/projects/github.com%2FLipairui%2Ftextgo","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2FLipairui%2Ftextgo/lists"}