{"id":13557287,"url":"https://github.com/jbesomi/texthero","last_synced_at":"2025-05-14T09:08:55.308Z","repository":{"id":39376389,"uuid":"253535629","full_name":"jbesomi/texthero","owner":"jbesomi","description":"Text preprocessing, representation and visualization from zero to hero.","archived":false,"fork":false,"pushed_at":"2023-08-29T08:45:13.000Z","size":23184,"stargazers_count":2904,"open_issues_count":82,"forks_count":240,"subscribers_count":43,"default_branch":"master","last_synced_at":"2025-04-11T04:09:15.032Z","etag":null,"topics":["machine-learning","nlp","nlp-pipeline","text-clustering","text-mining","text-preprocessing","text-representation","text-visualization","texthero","word-embeddings"],"latest_commit_sha":null,"homepage":"https://texthero.org","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/jbesomi.png","metadata":{"files":{"readme":"README.md","changelog":"CHANGELOG.md","contributing":"CONTRIBUTING.md","funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null}},"created_at":"2020-04-06T15:16:05.000Z","updated_at":"2025-04-09T05:27:29.000Z","dependencies_parsed_at":"2024-01-12T04:54:46.104Z","dependency_job_id":"b61e415a-eb71-4dc9-96cb-ae6307cfcec5","html_url":"https://github.com/jbesomi/texthero","commit_stats":{"total_commits":248,"total_committers":25,"mean_commits":9.92,"dds":0.5282258064516129,"last_synced_commit":"25728bb0670e6410c76f2a9bbe6f1dba9f54fa67"},"previous_names":[],"tags_count":6,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/jbesomi%2Ftexthero","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/jbesomi%2Ftexthero/tags","releases_url":"
https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/jbesomi%2Ftexthero/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/jbesomi%2Ftexthero/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/jbesomi","download_url":"https://codeload.github.com/jbesomi/texthero/tar.gz/refs/heads/master","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":253609606,"owners_count":21935556,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["machine-learning","nlp","nlp-pipeline","text-clustering","text-mining","text-preprocessing","text-representation","text-visualization","texthero","word-embeddings"],"created_at":"2024-08-01T12:04:15.682Z","updated_at":"2025-05-14T09:08:50.290Z","avatar_url":"https://github.com/jbesomi.png","language":"Python","funding_links":[],"categories":["Python","Data Processing","文本数据和NLP","APIs and Libraries","nlp","🐍 Python","Frameworks and libraries"],"sub_categories":["Data Representation","Knowledge Graphs","Useful Python Tools for Data Analysis",":snake: Python"],"readme":"\u003cp align=\"center\"\u003e\n   \u003ca href=\"https://github.com/jbesomi/texthero/stargazers\"\u003e\n    \u003cimg src=\"https://img.shields.io/github/stars/jbesomi/texthero.svg?colorA=orange\u0026colorB=orange\u0026logo=github\"\n         alt=\"Github stars\"\u003e\n   \u003c/a\u003e\n   \u003ca href=\"https://pypi.org/search/?q=texthero\"\u003e\n      \u003cimg src=\"https://img.shields.io/pypi/v/texthero.svg?colorB=brightgreen\"\n           alt=\"pip package\"\u003e\n   
\u003c/a\u003e\n   \u003ca href=\"https://pypi.org/project/texthero/\"\u003e\n      \u003cimg alt=\"pip downloads\" src=\"https://img.shields.io/pypi/dm/texthero\"\u003e\n   \u003c/a\u003e\n   \u003ca href=\"https://github.com/jbesomi/texthero/issues\"\u003e\n        \u003cimg src=\"https://img.shields.io/github/issues/jbesomi/texthero.svg\"\n             alt=\"Github issues\"\u003e\n   \u003c/a\u003e\n   \u003ca href=\"https://github.com/jbesomi/texthero/blob/master/LICENSE\"\u003e\n        \u003cimg src=\"https://img.shields.io/github/license/jbesomi/texthero.svg\"\n             alt=\"Github license\"\u003e\n   \u003c/a\u003e   \n\u003c/p\u003e\n\n\u003cp align=\"center\"\u003e\n    \u003cimg src=\"https://github.com/jbesomi/texthero/raw/master/github/logo.png\"\u003e\n\u003c/p\u003e\n\n\u003cp style=\"font-size: 20px;\" align=\"center\"\u003eText preprocessing, representation and visualization from zero to hero.\u003c/p\u003e\n\n\n\u003cp align=\"center\"\u003e\n   \u003ca href=\"#from-zero-to-hero\"\u003eFrom zero to hero\u003c/a\u003e •\n   \u003ca href=\"#installation\"\u003eInstallation\u003c/a\u003e •\n   \u003ca href=\"#getting-started\"\u003eGetting Started\u003c/a\u003e •\n   \u003ca href=\"#examples\"\u003eExamples\u003c/a\u003e •\n   \u003ca href=\"#api\"\u003eAPI\u003c/a\u003e •\n   \u003ca href=\"#faq\"\u003eFAQ\u003c/a\u003e •\n   \u003ca href=\"#contributions\"\u003eContributions\u003c/a\u003e\n\u003c/p\u003e\n\n\n\u003cp align=\"center\"\u003e\n    \u003cimg src=\"https://github.com/jbesomi/texthero/raw/master/github/screencast.gif\"\u003e\n\u003c/p\u003e\n\n\u003ch2 align=\"center\"\u003eFrom zero to hero\u003c/h2\u003e\n\nTexthero is a python toolkit to work with text-based dataset quickly and effortlessly. Texthero is very simple to learn and designed to be used on top of Pandas. Texthero has the same expressiveness and power of Pandas and is extensively documented. 
Texthero is modern and designed for programmers of the 2020s with little or no knowledge of linguistics. \n\nYou can think of Texthero as a tool to help you _understand_ and work with text-based datasets. Given a tabular dataset, it's easy to _grasp the main concept_. Given a text dataset, however, it's harder to get quick insights into the underlying data. With Texthero, preprocessing text data, mapping it into vectors, and visualizing the resulting vector space takes just a couple of lines.\n\nTexthero includes tools for:\n* Text preprocessing: it offers out-of-the-box solutions and is also flexible enough for custom ones.\n* Natural Language Processing: keyphrase and keyword extraction, and named entity recognition.\n* Text representation: TF-IDF, term frequency, and custom word-embeddings (wip)\n* Vector space analysis: clustering (K-means, Meanshift, DBSCAN and Hierarchical), topic modeling (wip) and interpretation.\n* Text visualization: vector space visualization, place localization on maps (wip).\n\nTexthero is free, open-source and [well documented](https://texthero.org/docs) (and that's what we love most by the way!). \n\nWe hope you will enjoy working with Texthero as much as we enjoyed developing it.\n\n\u003ch2 align=\"center\"\u003eHablas español? क्या आप हिंदी बोलते हैं? 日本語が話せるのか？\u003c/h2\u003e\n\nTexthero has been developed for the whole NLP community. We know how hard it is to deal with different NLP tools (NLTK, SpaCy, Gensim, TextBlob, Sklearn): that's why we developed Texthero, to simplify things.\n\nNow, the next main milestone is to provide *multilingual support* and for this big step, we need the help of all of you. ¿Hablas español? Sie sprechen Deutsch? 你会说中文？ 日本語が話せるのか？ Fala português? Parli Italiano? Вы говорите по-русски? If so, or if you speak another language not mentioned here, you can help us develop multilingual support! 
Even if you haven't contributed before or have just started with NLP, contact us or open a GitHub issue: there is always a first time :) We promise you will learn a lot, and, ... who knows? It might help you find your new job as an NLP developer!\n\nTo improve the Python toolkit and provide an even better experience, your help and feedback are crucial. If you have any problem or suggestion, please open a GitHub [issue](https://github.com/jbesomi/texthero/issues); we will be glad to support you.\n\n\n\u003ch2 align=\"center\"\u003eBeta version\u003c/h2\u003e\n\nTexthero's community is growing fast. Texthero, though, is still in beta; soon, a faster and better version will be released, and it will bring some major changes.\n\nFor instance, to give more granular control over the pipeline, starting with the next version, all `preprocessing` functions will require already tokenized text as their argument. This will be a major change.\n\nOnce the stable version (Texthero 2.0) is released, backward compatibility will be respected. Until then, backward compatibility will be maintained, but with weaker guarantees.\n\nIf you want to be part of this fast-growing movement, do not hesitate to contribute: [CONTRIBUTING](./CONTRIBUTING.md)!\n\n\u003ch2 align=\"center\"\u003eInstallation\u003c/h2\u003e\n\nInstall texthero via `pip`:\n\n```bash\npip install texthero\n```\n\n\u003e ☝️Under the hood, Texthero makes use of multiple NLP and machine learning toolkits such as Gensim, NLTK, SpaCy and scikit-learn. You don't need to install them all separately; pip will take care of that.\n\n\u003e For faster performance, make sure you have installed SpaCy version \u003e= 2.2. Also, make sure you have a recent version of Python: the newer, the better.\n\n\u003ch2 align=\"center\"\u003eGetting started\u003c/h2\u003e\n\nThe best way to learn Texthero is through the \u003ca href=\"https://texthero.org/docs/getting-started\"\u003eGetting Started\u003c/a\u003e docs. 
\n\nIf you are an advanced Python user, `help(texthero)` should do the job.\n\n\u003ch2 align=\"center\"\u003eExamples\u003c/h2\u003e\n\n\u003ch3\u003e1. Text cleaning, TF-IDF representation and Visualization\u003c/h3\u003e\n\n\n```python\nimport texthero as hero\nimport pandas as pd\n\ndf = pd.read_csv(\n   \"https://github.com/jbesomi/texthero/raw/master/dataset/bbcsport.csv\"\n)\n\ndf['pca'] = (\n   df['text']\n   .pipe(hero.clean)\n   .pipe(hero.tfidf)\n   .pipe(hero.pca)\n)\nhero.scatterplot(df, 'pca', color='topic', title=\"PCA BBC Sport news\")\n```\n\n\u003cp align=\"center\"\u003e\n   \u003cimg src=\"https://github.com/jbesomi/texthero/raw/master/github/scatterplot_bbcsport.svg\"\u003e\n\u003c/p\u003e\n\n\u003ch3\u003e2. Text preprocessing, TF-IDF, K-means and Visualization\u003c/h3\u003e\n\n```python\nimport texthero as hero\nimport pandas as pd\n\ndf = pd.read_csv(\n    \"https://github.com/jbesomi/texthero/raw/master/dataset/bbcsport.csv\"\n)\n\ndf['tfidf'] = (\n    df['text']\n    .pipe(hero.clean)\n    .pipe(hero.tfidf)\n)\n\ndf['kmeans_labels'] = (\n    df['tfidf']\n    .pipe(hero.kmeans, n_clusters=5)\n    .astype(str)\n)\n\ndf['pca'] = df['tfidf'].pipe(hero.pca)\n\nhero.scatterplot(df, 'pca', color='kmeans_labels', title=\"K-means BBC Sport news\")\n```\n\n\u003cp align=\"center\"\u003e\n   \u003cimg src=\"https://github.com/jbesomi/texthero/raw/master/github/scatterplot_bbcsport_kmeans.svg\"\u003e\n\u003c/p\u003e\n\n\u003ch3\u003e3. Simple pipeline for text cleaning\u003c/h3\u003e\n\n```python\n\u003e\u003e\u003e import texthero as hero\n\u003e\u003e\u003e import pandas as pd\n\u003e\u003e\u003e text = \"This sèntencé    (123 /) needs to [OK!] be cleaned!   \"\n\u003e\u003e\u003e s = pd.Series(text)\n\u003e\u003e\u003e s\n0    This sèntencé    (123 /) needs to [OK!] 
be cleane...\ndtype: object\n```\n\nRemove all digits:\n\n```python\n\u003e\u003e\u003e s = hero.remove_digits(s)\n\u003e\u003e\u003e s\n0    This sèntencé    (  /) needs to [OK!] be cleaned!\ndtype: object\n```\n\n\u003e `remove_digits` removes only standalone blocks of digits, so the digits in the string \"hello123\" will not be removed. To remove all digits, set `only_blocks` to `False`.\n\nRemove all types of brackets and their content.\n\n```python\n\u003e\u003e\u003e s = hero.remove_brackets(s)\n\u003e\u003e\u003e s \n0    This sèntencé    needs to  be cleaned!\ndtype: object\n```\n\nRemove diacritics.\n\n```python\n\u003e\u003e\u003e s = hero.remove_diacritics(s)\n\u003e\u003e\u003e s \n0    This sentence    needs to  be cleaned!\ndtype: object\n```\n\nRemove punctuation.\n\n```python\n\u003e\u003e\u003e s = hero.remove_punctuation(s)\n\u003e\u003e\u003e s \n0    This sentence    needs to  be cleaned\ndtype: object\n```\n\nRemove extra white-spaces.\n\n```python\n\u003e\u003e\u003e s = hero.remove_whitespace(s)\n\u003e\u003e\u003e s \n0    This sentence needs to be cleaned\ndtype: object\n```\n\nSometimes we also want to get rid of stop-words.\n\n```python\n\u003e\u003e\u003e s = hero.remove_stopwords(s)\n\u003e\u003e\u003e s\n0    This sentence needs cleaned\ndtype: object\n```\n\n\u003ch2 align=\"center\"\u003eAPI\u003c/h2\u003e\n\nTexthero is composed of four modules: [preprocessing.py](/texthero/preprocessing.py), [nlp.py](/texthero/nlp.py), [representation.py](/texthero/representation.py) and [visualization.py](/texthero/visualization.py).\n\n\u003ch3\u003e1. Preprocessing\u003c/h3\u003e\n\n**Scope:** prepare **text** data for further analysis.\n\nFull documentation: [preprocessing](https://texthero.org/docs/api-preprocessing)\n\n\u003ch3\u003e2. 
NLP\u003c/h3\u003e\n\n**Scope:** provide classic natural language processing tools such as `named_entity` and `noun_phrases`.\n\nFull documentation: [nlp](https://texthero.org/docs/api-nlp)\n\n\n\u003ch3\u003e3. Representation\u003c/h3\u003e\n\n**Scope:** map text data into vectors, cluster them, and perform dimensionality reduction.\n\nSupported **representation** algorithms:\n1. Term frequency (`count`)\n1. Term frequency-inverse document frequency (`tfidf`)\n\nSupported **clustering** algorithms:\n1. K-means (`kmeans`)\n1. Density-Based Spatial Clustering of Applications with Noise (`dbscan`)\n1. Meanshift (`meanshift`)\n\nSupported **dimensionality reduction** algorithms:\n1. Principal component analysis (`pca`)\n1. t-distributed stochastic neighbor embedding (`tsne`)\n1. Non-negative matrix factorization (`nmf`)\n\nFull documentation: [representation](https://texthero.org/docs/api-representation)\n\n\u003ch3\u003e4. Visualization\u003c/h3\u003e\n\n**Scope:** summarize the main facts about the text data and visualize it. This module is opinionated. It's handy for anyone who needs a quick way to visualize text data on screen, for instance during exploratory data analysis (EDA) of text.\n\nSupported functions:\n   - Text scatterplot (`scatterplot`)\n   - Most common words (`top_words`)\n\nFull documentation: [visualization](https://texthero.org/docs/api-visualization)\n\n\u003ch2 align=\"center\"\u003eFAQ\u003c/h2\u003e\n\n\u003ch5\u003eWhy Texthero\u003c/h5\u003e\n\nSometimes we just want things done, right? Texthero helps with that. It makes things easier and gives developers more time to focus on their own requirements. We believe that cleaning text should take just a minute. The same goes for finding the most important part of a text and for representing it.\n\nPragmatically, Texthero has just one goal: to save the developer time. Working with text data can be a pain, and in most cases a default pipeline is a good enough starting point. 
There is always time to come back and improve previous work.\n\n\n\u003ch2 align=\"center\"\u003eContributions\u003c/h2\u003e\n\n\u003e \"Texthero has been developed by a member of the NLP community for the whole NLP-community\"\n\nTexthero is for all of us NLP developers, and it can continue to exist thanks to the precious contributions of the community.\n\nYour level of expertise in Python and NLP does not matter: anyone can help and anyone is more than welcome to contribute!\n\n**Are you an NLP expert?**\n\n- [open an issue](https://github.com/jbesomi/texthero/issues) and tell us what you like and dislike about Texthero and what we can do better!\n\n**Are you good at creating websites?**\n\nThe website will soon be moved from Docusaurus to Sphinx: read the [open issue there](https://github.com/jbesomi/texthero/issues/40). Good news: the website will look the same as it does now :) Average news: we need to do some web development to adapt [this Sphinx template](https://github.com/jbesomi/texthero/issues/40) to our needs. Can you help us?\n\n**Are you good at writing?**\n\nThis is probably the most important piece missing from Texthero right now: more tutorials and more \"Getting Started\" guides. \n\nIf you are good at writing, you can help us! Why don't you start by [adding a FAQ page to the website](https://github.com/jbesomi/texthero/issues/41) or explaining how to [create a custom pipeline](https://github.com/jbesomi/texthero/issues/38)? Need help? We are here for you.\n\n**Are you good at Python?**\n\nThere are a lot of [open issues](https://github.com/jbesomi/texthero/issues) for technically minded contributors. 
Which one do you choose?\n\nIf you have any other questions or inquiries, drop me a line at jonathanbesomi__AT__gmail.com\n\n\u003ch3\u003eContributors (in chronological order)\u003c/h3\u003e\n\n- [Selim Al Awwa](https://github.com/selimelawwa/)\n- [Parth Gandhi](https://github.com/ParthGandhi)\n- [Dan Keefe](https://github.com/Peritract)\n- [Christian Claus](https://github.com/cclauss)\n- [bobfang1992](https://github.com/bobfang1992)\n- [Ishan Arora](https://github.com/ishanarora04)\n- [Vidya P](https://github.com/vidyap-xgboost)\n- [Cedric Conol](https://github.com/cedricconol)\n- [Rich Ramalho](https://github.com/richecr)\n\n\n\u003ch2 align=\"center\"\u003e\u003ca href=\"./LICENSE\"\u003eLicense\u003c/a\u003e\u003c/h2\u003e\n\nThe MIT License (MIT)\n\nCopyright (c) 2020 Texthero\n\nPermission is hereby granted, free of charge, to any person obtaining a copy\nof this software and associated documentation files (the \"Software\"), to deal\nin the Software without restriction, including without limitation the rights\nto use, copy, modify, merge, publish, distribute, sublicense, and/or sell\ncopies of the Software, and to permit persons to whom the Software is\nfurnished to do so, subject to the following conditions:\n\nThe above copyright notice and this permission notice shall be included in all\ncopies or substantial portions of the Software.\n\nTHE SOFTWARE IS PROVIDED \"AS IS\", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR\nIMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,\nFITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. 
IN NO EVENT SHALL THE\nAUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER\nLIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,\nOUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE\nSOFTWARE.\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fjbesomi%2Ftexthero","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fjbesomi%2Ftexthero","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fjbesomi%2Ftexthero/lists"}