{"id":13609937,"url":"https://github.com/textvec/textvec","last_synced_at":"2025-04-05T22:06:54.067Z","repository":{"id":40986877,"uuid":"129260609","full_name":"textvec/textvec","owner":"textvec","description":"Text vectorization tool to outperform TFIDF for classification tasks","archived":false,"fork":false,"pushed_at":"2024-06-17T22:44:04.000Z","size":818,"stargazers_count":193,"open_issues_count":5,"forks_count":26,"subscribers_count":7,"default_branch":"master","last_synced_at":"2025-03-29T20:06:56.281Z","etag":null,"topics":["machine-learning","natural-language-processing","nlp","python","text-analysis","text-classification","text-processing","tf-idf"],"latest_commit_sha":null,"homepage":"","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/textvec.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2018-04-12T14:03:53.000Z","updated_at":"2025-01-21T03:43:50.000Z","dependencies_parsed_at":"2024-01-13T03:30:07.850Z","dependency_job_id":"4527afba-98c2-4b0e-a0e2-ac3dcbee25f3","html_url":"https://github.com/textvec/textvec","commit_stats":{"total_commits":58,"total_committers":7,"mean_commits":8.285714285714286,"dds":0.5689655172413793,"last_synced_commit":"afb9cc6955a01a4904680341b7a3fc5c1246b68a"},"previous_names":[],"tags_count":1,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/textvec%2Ftextvec","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/textvec%2Ftextvec/tags","releases_url":"https://repos.ecosyste.ms/api/v1/host
s/GitHub/repositories/textvec%2Ftextvec/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/textvec%2Ftextvec/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/textvec","download_url":"https://codeload.github.com/textvec/textvec/tar.gz/refs/heads/master","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":247406088,"owners_count":20933803,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["machine-learning","natural-language-processing","nlp","python","text-analysis","text-classification","text-processing","tf-idf"],"created_at":"2024-08-01T19:01:39.484Z","updated_at":"2025-04-05T22:06:54.038Z","avatar_url":"https://github.com/textvec.png","language":"Python","readme":"![textvec logo](/examples/images/logo.png?raw=true)\n## WHAT: Supervised text vectorization tool\n\nTextvec is a text vectorization tool that aims to implement all the \"classic\" text vectorization NLP methods in Python. The main idea of this project is to show alternatives to the excellent TFIDF method, which is heavily overused for supervised tasks. 
All interfaces are similar to [scikit-learn](https://github.com/scikit-learn/scikit-learn), so you should be able to test the performance of these supervised methods with just a few changes.\n\nTextvec is compatible with __Python 2.7-3.7__.\n\n------------------\n\n## WHY: Comparison with TFIDF\nAs the referenced articles\u003csup\u003e1,2\u003c/sup\u003e report, supervised methods outperform unsupervised ones on almost every dataset.\nYet most text classification examples on the internet ignore that fact.\n\n|          |      IMDB_bin      |   RT_bin   |  Airlines Sentiment_bin  | Airlines Sentiment_multiclass | 20news_multiclass |\n|----------|--------------------|------------|--------------------------|-------------------------------|-------------------|\n| TF       |       0.8984       |   0.7571   |          0.9194          |            0.8084             |       0.8206      |\n| TFIDF    |       0.9052       |   0.7717   |        __0.9259__        |            0.8118             |     __0.8575__    |\n| TFPF     |       0.8813       |   0.7403   |          0.9212          |              NA               |         NA        |\n| TFRF     |       0.8797       |   0.7412   |          0.9194          |              NA               |         NA        |\n| TFICF    |       0.8984       |   0.7642   |          0.9199          |          __0.8125__           |       0.8292      |\n| TFBINICF |       0.8984       |   0.7571   |          0.9194          |              NA               |         NA        |\n| TFCHI2   |       0.8898       |   0.7398   |          0.9108          |              NA               |         NA        |\n| TFGR     |       0.8850       |   0.7065   |          0.8956          |              NA               |         NA        |\n| TFRRF    |       0.8879       |   0.7506   |          0.9194          |              NA               |         NA        |\n| TFOR     |     __0.9092__     | __0.7806__ |          0.9207          |             
 NA               |         NA        |\n\nHere is a comparison for binary classification on the IMDB sentiment dataset. Labels are sorted by accuracy score, and the heatmap shows the correlation between the different approaches. As you can see, some methods are good candidates for ensembling models or performing feature selection.\n\n![Binary comparison](/examples/images/imdb_bin.png?raw=true)\n\nFor more dataset benchmarks (Rotten Tomatoes, airline sentiment), see [Binary classification quality comparison](/examples/binary_comparison.ipynb).\n\n------------------\n\n## Install:\nWith pip:\n```\npip install textvec\n```\n\nFrom source:\n```\ngit clone https://github.com/textvec/textvec\ncd textvec\npip install .\n```\n\n------------------\n\n## HOW: Examples\nThe usage is similar to scikit-learn:\n``` python\nfrom sklearn.feature_extraction.text import CountVectorizer\nfrom textvec.vectorizers import TfBinIcfVectorizer\n\n# train_data.text holds the raw training documents; y holds their class labels\ncvec = CountVectorizer().fit(train_data.text)\n\n# Supervised vectorizers are fitted on the count matrix together with the labels\ntficf_vec = TfBinIcfVectorizer(sublinear_tf=True)\ntficf_vec.fit(cvec.transform(train_data.text), y)\n```\nFor more detailed examples, see [Basic example](/examples/basic_usage.ipynb) and the other notebooks in [Examples](/examples).\n\n### Currently implemented methods:\n\n- TfIcfVectorizer\n- TforVectorizer\n- TfgrVectorizer\n- TfigVectorizer\n- Tfchi2Vectorizer\n- TfrfVectorizer\n- TfrrfVectorizer\n- TfBinIcfVectorizer\n- TfpfVectorizer\n- SifVectorizer\n- TfbnsVectorizer\n\nMost of the vectorization techniques are described in the articles\u003csup\u003e1,2,3\u003c/sup\u003e. If you see any method with a wrong name or reference, please contribute a fix!\n\n------------------\n\n## TODO\n- [ ] Docs\n\n------------------\n\n## REFERENCE\n- [1] [Deqing Wang and Hui Zhang] [Inverse-Category-Frequency based Supervised Term Weighting Schemes for Text Categorization](https://arxiv.org/pdf/1012.2609.pdf)\n- [2] [M. Lan, C. L. Tan, J. Su, and Y. 
Lu] [Supervised and traditional term weighting methods for automatic text categorization](http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.151.3665\u0026rep=rep1\u0026type=pdf)\n- [3] [Sanjeev Arora, Yingyu Liang and Tengyu Ma] [A Simple But Tough-To-Beat Baseline For Sentence Embeddings](https://openreview.net/pdf?id=SyK00v5xx)\n- [4] Thanks to [aysent](https://aysent.github.io/2015/10/21/supervised-term-weighting.html#motivation-for-text-classification-tasks) for the inspiration\n","funding_links":[],"categories":["文本数据和NLP"],"sub_categories":[],"project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Ftextvec%2Ftextvec","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Ftextvec%2Ftextvec","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Ftextvec%2Ftextvec/lists"}