{"id":13482365,"url":"https://github.com/JasonKessler/scattertext","last_synced_at":"2025-03-27T13:31:36.015Z","repository":{"id":37444598,"uuid":"63827736","full_name":"JasonKessler/scattertext","owner":"JasonKessler","description":"Beautiful visualizations of how language differs among document types.","archived":false,"fork":false,"pushed_at":"2024-09-23T05:24:52.000Z","size":42656,"stargazers_count":2288,"open_issues_count":22,"forks_count":292,"subscribers_count":54,"default_branch":"master","last_synced_at":"2025-03-25T22:06:40.189Z","etag":null,"topics":["computational-social-science","d3","eda","exploratory-data-analysis","japanese-language","machine-learning","natural-language-processing","nlp","scatter-plot","semiotic-squares","sentiment","stylometric","stylometry","text-as-data","text-mining","text-visualization","topic-modeling","visualization","word-embeddings","word2vec"],"latest_commit_sha":null,"homepage":"","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"apache-2.0","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/JasonKessler.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2016-07-21T01:47:12.000Z","updated_at":"2025-03-19T16:42:17.000Z","dependencies_parsed_at":"2023-02-17T10:01:09.823Z","dependency_job_id":"b8536270-57d8-43b8-b24a-cc5971efd479","html_url":"https://github.com/JasonKessler/scattertext","commit_stats":{"total_commits":352,"total_committers":15,"mean_commits":"23.466666666666665","dds":"0.46590909090909094","last_synced_commit":"c7af5d59a65e3e528cbe31553ead293d66be6924"},"previous_names":[],"tags_count":39,"template":false,"template
_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/JasonKessler%2Fscattertext","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/JasonKessler%2Fscattertext/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/JasonKessler%2Fscattertext/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/JasonKessler%2Fscattertext/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/JasonKessler","download_url":"https://codeload.github.com/JasonKessler/scattertext/tar.gz/refs/heads/master","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":245854351,"owners_count":20683341,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["computational-social-science","d3","eda","exploratory-data-analysis","japanese-language","machine-learning","natural-language-processing","nlp","scatter-plot","semiotic-squares","sentiment","stylometric","stylometry","text-as-data","text-mining","text-visualization","topic-modeling","visualization","word-embeddings","word2vec"],"created_at":"2024-07-31T17:01:01.282Z","updated_at":"2025-03-27T13:31:35.980Z","avatar_url":"https://github.com/JasonKessler.png","language":"Python","funding_links":[],"categories":["Python","Libraries","Data Visualization","Visualization","🐍 Python","文本数据和NLP","APIs and Libraries","nlp","函式庫","Frameworks and libraries","Packages"],"sub_categories":["Videos and Online Courses","Data Management","Useful Python Tools for Data Analysis","Knowledge Graphs","書籍",":snake: 
Python","Libraries"],"readme":"[![Build Status](https://travis-ci.org/JasonKessler/scattertext.svg?branch=master)](https://travis-ci.org/JasonKessler/scattertext)\n[![PyPI](https://img.shields.io/pypi/v/scattertext.svg)]()\n[![Gitter Chat](https://img.shields.io/badge/GITTER-join%20chat-green.svg)](https://gitter.im/scattertext/Lobby)\n[![Twitter Follow](https://img.shields.io/twitter/follow/espadrine.svg?style=social\u0026label=Follow)](https://twitter.com/jasonkessler)\n\n# Scattertext 0.2.2\n\nA tool for finding distinguishing terms in corpora and displaying them in an\ninteractive HTML scatter plot. Points corresponding to terms are selectively labeled\nso that they don't overlap with other labels or points.\n\nCite as: Jason S. Kessler. Scattertext: a Browser-Based Tool for Visualizing how Corpora Differ. ACL System\nDemonstrations. 2017.\n\nBelow is an example of using Scattertext to visualize terms used in the 2012 American\npolitical conventions. The 2,000 most party-associated unigrams are displayed as\npoints in the scatter plot. 
Their x- and y-coordinates are the dense ranks of their usage by\nRepublican and Democratic speakers respectively.\n\n```pydocstring\nimport scattertext as st\n\ndf = st.SampleCorpora.ConventionData2012.get_data().assign(\n    parse=lambda df: df.text.apply(st.whitespace_nlp_with_sentences)\n)\n\ncorpus = st.CorpusFromParsedDocuments(\n    df, category_col='party', parsed_col='parse'\n).build().get_unigram_corpus().compact(st.AssociationCompactor(2000))\n\nhtml = st.produce_scattertext_explorer(\n    corpus,\n    category='democrat',\n    category_name='Democratic',\n    not_category_name='Republican',\n    minimum_term_frequency=0, \n    pmi_threshold_coefficient=0,\n    width_in_pixels=1000, \n    metadata=corpus.get_df()['speaker'],\n    transform=st.Scalers.dense_rank,\n    include_gradient=True,\n    left_gradient_term='More Republican',\n    middle_gradient_term='Metric: Dense Rank Difference',\n    right_gradient_term='More Democratic',\n)\nopen('./demo_compact.html', 'w').write(html)\n```\n\nThe HTML file written would look like the image below. Click on it for the actual interactive visualization.\n[![demo_compact.html](https://raw.githubusercontent.com/JasonKessler/jasonkessler.github.io/master/demo_compact.png)](https://jasonkessler.github.io/demo_compact.html)\n\n## Citation\n\nJason S. Kessler. Scattertext: a Browser-Based Tool for Visualizing how Corpora Differ. ACL System Demonstrations. 
2017.\nLink to paper: [arxiv.org/abs/1703.00565](https://arxiv.org/abs/1703.00565)\n\n```\n@inproceedings{kessler2017scattertext,\n  author    = {Kessler, Jason S.},\n  title     = {Scattertext: a Browser-Based Tool for Visualizing how Corpora Differ},\n  booktitle = {Proceedings of ACL-2017 System Demonstrations},\n  year      = {2017},\n  address   = {Vancouver, Canada},\n  publisher = {Association for Computational Linguistics},\n}\n```\n\n**Table of Contents**\n\n- [Installation](#installation)\n- [Overview](#overview)\n- [Customizing the Visualization and Plotting Dispersion](#customizing-the-visualization-and-plotting-dispersion)\n- [Tutorial](#tutorial)\n    - [Help! I don't know Python but I still want to use Scattertext](#help-i-dont-know-python-but-i-still-want-to-use-scattertext)\n    - [Using Scattertext as a text analysis library: finding characteristic terms and their associations](#using-scattertext-as-a-text-analysis-library-finding-characteristic-terms-and-their-associations)\n    - [Visualizing term associations](#visualizing-term-associations)\n    - [Visualizing phrase associations](#visualizing-phrase-associations)\n    - [Adding color gradients to explain scores](#adding-color-gradients-to-explain-scores)\n    - [Visualizing Empath topics and categories](#visualizing-empath-topics-and-categories)\n    - [Visualizing the Moral Foundations 2.0 Dictionary](#visualizing-the-moral-foundations-2.0-dictionary)\n    - [Ordering Terms by Corpus Characteristicness](#ordering-terms-by-corpus-characteristicness)\n    - [Document-Based Scatterplots](#document-based-scatterplots)\n    - [Using Cohen's d or Hedge's g to visualize effect size](#using-cohens-d-or-hedges-g-to-visualize-effect-size)\n    - [Using Cliff's Delta to visualize effect size](#using-cliffs-delta-to-visualize-effect-size)\n    - [Using Bi-Normal Separation (BNS) to score terms](#using-bi-normal-separation-bns-to-score-terms)\n    - [Using correlations to explain 
classifiers](#using-correlations-to-explain-classifiers)\n    - [Using Custom Background Word Frequencies](#using-custom-background-word-frequencies)\n    - [Plotting word productivity](#plotting-word-productivity)\n- [Understanding Scaled F-Score](#understanding-scaled-f-score)\n- [Alternative term scoring methods](#alternative-term-scoring-methods)\n- [The position-select-plot process](#the-position-select-plot-process)\n- [Advanced Uses](#advanced-uses)\n    - [Visualizing differences based on only term frequencies](#visualizing-differences-based-on-only-term-frequencies)\n    - [Visualizing query-based categorical differences](#visualizing-query-based-categorical-differences)\n    - [Visualizing any kind of term score](#visualizing-any-kind-of-term-score)\n    - [Custom term positions](#custom-term-positions)\n    - [Emoji analysis](#emoji-analysis)\n    - [Visualizing SentencePiece tokens](#visualizing-sentencepiece-tokens)\n    - [Visualizing scikit-learn text classification weights](#visualizing-scikit-learn-text-classification-weights)\n    - [Creating lexicalized semiotic squares](#creating-lexicalized-semiotic-squares)\n    - [Visualizing topic models](#visualizing-topic-models)\n    - [Creating T-SNE-style word embedding projection plots](#creating-T-SNE-style-word-embedding-projection-plots)\n    - [Using SVD to visualize any kind of word embeddings](#using-svd-to-visualize-any-kind-of-word-embeddings)\n    - [Exporting plot to matplotlib](#exporting-plot-to-matplotlib)\n    - [Using the same scale for both axes](#using-the-same-scale-for-both-axes)\n\n- [Examples](#examples)\n- [A note on chart layout](#a-note-on-chart-layout)\n- [What's new](#whats-new)\n- [Sources](#sources)\n\n## Installation\n\nInstall Python 3.11 or higher and run:\n\n`$ pip install scattertext`\n\nIf you cannot (or don't want to) install spaCy, substitute `nlp = spacy.load('en')` lines with\n`nlp = scattertext.WhitespaceNLP.whitespace_nlp`. 
Note that this is not compatible\nwith `word_similarity_explorer`, and the tokenization and sentence boundary detection\nwill be handled by low-performance regular expressions. See `demo_without_spacy.py`\nfor an example.\n\nIt is recommended you install `jieba`, `spacy`, `empath`, `astropy`, `flashtext`, `gensim` and `umap-learn` in order to\ntake full advantage of Scattertext.\n\nScattertext should mostly work with Python 2.7, but it may not.\n\nThe HTML outputs look best in Chrome and Safari.\n\n## Style Guide\n\nThe name of this project is Scattertext.  \"Scattertext\" is written as a single word\nand should be capitalized. When used in Python, the package `scattertext` should be bound\nto the name `st`, i.e., `import scattertext as st`.\n\n## Overview\n\nThis is a tool that's intended for visualizing what words and phrases\nare more characteristic of a category than others.\n\nConsider the example at the top of the page.\n\nLooking at this may seem overwhelming. In fact, it's a relatively simple visualization of word use\nduring the 2012 political conventions. Each dot corresponds to a word or phrase mentioned by Republicans or Democrats\nduring their conventions. The closer a dot is to the top of the plot, the more frequently it was used by\nDemocrats. The further right a dot, the more that word or phrase was used by Republicans. Words frequently\nused by both parties, like \"of\" and \"the\" and even \"Mitt\", tend to occur in the upper-right-hand corner. Although very\nlow-frequency words have been hidden to preserve computing resources, a word that neither party used, like \"giraffe\",\nwould be in the bottom-left-hand corner.\n\nThe interesting things happen close to the upper-left and lower-right corners. In the upper-left corner,\nwords like \"auto\" (as in auto bailout) and \"millionaires\" are frequently used by Democrats but infrequently or never\nused by Republicans. 
Likewise, terms frequently used by Republicans and infrequently by Democrats occupy the\nbottom-right corner. These include \"big government\" and \"olympics\", referring to the Salt Lake City Olympics in which\nGov. Romney was involved.\n\nTerms are colored by their association. Those that are more associated with Democrats are blue, and those\nmore associated with Republicans are red.\n\nTerms that are most characteristic of both sets of documents are displayed\non the far right of the visualization.\n\nThe inspiration for this visualization came from Dataclysm (Rudder, 2014).\n\nScattertext is designed to help you build these graphs and efficiently label points on them.\n\nThe documentation (including this readme) is a work in\nprogress. Please see the tutorial below as well as\nthe [PyData 2017 Tutorial](https://github.com/JasonKessler/Scattertext-PyData).\n\nPoking around the code and tests should give you a good idea of how things work.\n\nThe library covers some novel and effective term-importance formulas, including **Scaled F-Score**.\n\n## Customizing the Visualization and Plotting Dispersion\n\nNew in Scattertext 0.1.0, one can use a dataframe for term/metadata positions and other term-specific data. We\ncan also use it to determine term-specific information which is shown after a term is clicked.\n\nNote that it is possible to disable the use of document categories in Scattertext, as we shall see in this example.\n\nThis example covers plotting term dispersion against word frequency and identifying the terms which are most and least\ndispersed given their frequencies. Using Rosengren's S dispersion measure (Gries 2021), terms tend to increase in\ntheir dispersion scores as they get more frequent. 
We'll see how we can both plot this effect and factor out the effect\nof frequency.\n\nThis, along with a number of other dispersion metrics presented in Gries (2021), is available and documented\nin the `Dispersion` class, which we'll use later in the section.\n\nLet's start by creating a Convention corpus, but we'll use the `CorpusWithoutCategoriesFromParsedDocuments` factory\nto ensure that no categories are included in the corpus. If we try to find document categories, we'll see that\nall documents have the category '_'.\n\n```python\nimport scattertext as st\n\ndf = st.SampleCorpora.ConventionData2012.get_data().assign(\n    parse=lambda df: df.text.apply(st.whitespace_nlp_with_sentences))\ncorpus = st.CorpusWithoutCategoriesFromParsedDocuments(\n    df, parsed_col='parse'\n).build().get_unigram_corpus().remove_infrequent_words(minimum_term_count=6)\n\ncorpus.get_categories()\n# Returns ['_']\n```\n\nNext, we'll create a dataframe for all terms we'll plot. We'll just start by creating a dataframe where we capture\nthe frequency of each term and various dispersion metrics. These will be shown after a term is activated in the plot.\n\n```python\ndispersion = st.Dispersion(corpus)\n\ndispersion_df = dispersion.get_df()\ndispersion_df.head(3)\n```\n\nWhich returns\n\n```\n       Frequency  Range         SD        VC  Juilland's D  Rosengren's S        DP   DP norm  KL-divergence  Dissemination\nthank        363    134   3.108113  1.618274      0.707416       0.694898  0.391548  0.391560       0.748808       0.972954\nyou         1630    177  12.383708  1.435902      0.888596       0.898805  0.233627  0.233635       0.263337       0.963905\nso           549    155   3.523380  1.212967      0.774299       0.822244  0.283151  0.283160       0.411750       0.986423\n```\n\nThese are discussed in detail in [Gries 2021](http://www.stgries.info/research/ToApp_STG_Dispersion_PHCL.pdf).\nDissemination is presented in Altmann et al. 
(2011).\n\nWe'll use Rosengren's S to find the dispersion of each term. It is a metric designed for corpus parts\n(convention speeches in our case) of varying length. In the formula below, n is the number of documents in the corpus, s_i is the\nfraction of tokens in the corpus found in document i, v_i is the term's count in document i, and f is the total number\nof tokens of that term in the corpus.\n\nRosengren's S:\n\n$$S = \\frac{\\left(\\sum_{i=1}^{n}\\sqrt{s_i \\cdot v_i}\\right)^2}{f}$$\n\nIn order to start plotting, we'll need to add coordinates for each term to the data frame.\n\nTo use the `dataframe_scattertext` function, you need, at a minimum, a dataframe with 'X' and 'Y' columns.\n\nThe `Xpos` and `Ypos` columns indicate the positions of the original `X` and `Y` values on the scatterplot, and\nneed to be between 0 and 1. Functions in `st.Scalers` perform this scaling. 
Absent `Xpos` or `Ypos`,\n`st.Scalers.scale` would be used.\n\nHere is a sample of the available scaling functions:\n\n* `st.Scalers.scale(vec)` Rescales the vector so that the minimum value is 0 and the maximum is 1.\n* `st.Scalers.log_scale(vec)` Rescales the log of the vector\n* `st.Scalers.dense_rank(vec)` Rescales the dense rank of the vector\n* `st.Scalers.scale_center_zero_abs(vec)` Rescales a vector with both positive and negative values such that the 0 value\n  in the original vector is plotted at 0.5, negative values are projected from [-max(abs(vec)), 0] to [0, 0.5] and\n  positive values are projected from [0, max(abs(vec))] to [0.5, 1].\n\n```python\ndispersion_df = dispersion_df.assign(\n    X=lambda df: df.Frequency,\n    Xpos=lambda df: st.Scalers.log_scale(df.X),\n    Y=lambda df: df[\"Rosengren's S\"],\n    Ypos=lambda df: st.Scalers.scale(df.Y),\n)\n```\n\nNote that the `Ypos` column here is not necessary since `Y` would automatically be scaled.\n\nFinally, since we are not distinguishing between categories, we can set `ignore_categories=True`.\n\nWe can now plot this graph using the `dataframe_scattertext` function:\n\n```python\nhtml = st.dataframe_scattertext(\n    corpus,\n    plot_df=dispersion_df,\n    metadata=corpus.get_df()['speaker'] + ' (' + corpus.get_df()['party'].str.upper() + ')',\n    ignore_categories=True,\n    x_label='Log Frequency',\n    y_label=\"Rosengren's S\",\n    y_axis_labels=['Less Dispersion', 'Medium', 'More Dispersion'],\n)\n```\n\nWhich yields (click for an interactive version):\n[![dispersion-basic.html](https://jasonkessler.github.io/dispersion-basic.png)](https://jasonkessler.github.io/dispersion-basic.html)\n\nNote that we can see various dispersion statistics under a term's name, in addition to the standard usage statistics. 
To\ncustomize the statistics which are displayed, set the `term_description_column=[...]` parameter with a list of column\nnames to be displayed.\n\nOne issue in this dispersion chart, which tends to be common to dispersion metrics in general, is that dispersion\nand frequency tend to have a high correlation, but with a complex, non-linear curve. Depending on the metric,\nthis correlation curve could follow a power law, be linear or sigmoidal, or, typically, be something else.\n\nIn order to factor out this correlation, we can predict the dispersion from frequency using a non-parametric regressor,\nand see which terms have the highest and lowest residuals with respect to their expected dispersions based on their\nfrequencies.\n\nIn this case, we'll use a KNN regressor with 10 neighbors to predict Rosengren's S (`dispersion_df.Y`) from term\nfrequencies (`dispersion_df.X`), and compute the residual.\n\nWe'll use the residual to color points, with a neutral color for residuals around 0 and other colors for positive and\nnegative values. We'll add a column in the data frame for point colors, and call it ColorScore. It is populated\nwith values between 0 and 1, with 0.5 as a neutral color on the `d3 interpolateWarm` color scale. We use\n`st.Scalers.scale_center_zero_abs`, discussed above, to make this transformation.\n\n```python\nfrom sklearn.neighbors import KNeighborsRegressor\n\ndispersion_df = dispersion_df.assign(\n    Expected=lambda df: KNeighborsRegressor(n_neighbors=10).fit(\n        df.X.values.reshape(-1, 1), df.Y\n    ).predict(df.X.values.reshape(-1, 1)),\n    Residual=lambda df: df.Y - df.Expected,\n    ColorScore=lambda df: st.Scalers.scale_center_zero_abs(df.Residual)\n)\n```\n\nNow we are ready to plot our colored dispersion chart. 
We assign the ColorScore column name to the `color_score_column`\nparameter in `dataframe_scattertext`.\n\nAdditionally, we'd like to populate the two term lists on the\nleft with terms that have high and low residual values, indicating terms which have the most and the least dispersion\nrelative to their frequency-expected level. We can do this by setting the `left_list_column` parameter. We can specify\nthe upper and lower term list names using the `header_names` parameter. Finally, we can spiff up the plot by\nadding an appealing background color.\n\n```python\nhtml = st.dataframe_scattertext(\n    corpus,\n    plot_df=dispersion_df,\n    metadata=corpus.get_df()['speaker'] + ' (' + corpus.get_df()['party'].str.upper() + ')',\n    ignore_categories=True,\n    x_label='Log Frequency',\n    y_label=\"Rosengren's S\",\n    y_axis_labels=['Less Dispersion', 'Medium', 'More Dispersion'],\n    color_score_column='ColorScore',\n    header_names={'upper': 'Lower than Expected', 'lower': 'More than Expected'},\n    left_list_column='Residual',\n    background_color='#e5e5e3'\n)\n```\n\nWhich yields (click for an interactive version):\n[![dispersion-residual.html](https://jasonkessler.github.io/dispersion-residual.png)](https://jasonkessler.github.io/dispersion-residual.html)\n\n\n## Tutorial\n\n### Help! I don't know Python but I still want to use Scattertext.\n\nWhile you should learn Python to fully use Scattertext, I've put some of the basic\nfunctionality in a commandline tool. The tool is installed when you follow the procedure laid out\nabove.\n\nRun `$ scattertext --help` from the commandline to see the full usage information. Here's a quick example of\nhow to use vanilla Scattertext on a CSV file. The file needs to have at least two columns,\none containing the text to be analyzed, and another containing the category. 
In the example CSV below,\nthe columns are text and party, respectively.\n\nThe example below processes the CSV file and writes the resulting HTML visualization to cli_demo.html.\n\nNote that the parameter `--minimum_term_frequency=8` omits terms that occur fewer than 8\ntimes, and `--regex_parser` indicates a simple regular expression parser should\nbe used in place of spaCy. The flag `--one_use_per_doc` indicates that term frequency\nshould be calculated by counting no more than one occurrence of a term in a document.\n\nIf you'd like to parse non-English text, you can use the `--spacy_language_model` argument to configure which\nspaCy language model the tool will use. The default is 'en' and you can see the others available at\n[https://spacy.io/docs/api/language-models](https://spacy.io/docs/api/language-models).\n\n```bash\n$ curl -s https://cdn.rawgit.com/JasonKessler/scattertext/master/scattertext/data/political_data.csv | head -2\nparty,speaker,text\ndemocrat,BARACK OBAMA,\"Thank you. Thank you. Thank you. Thank you so much.Thank you.Thank you so much. Thank you. Thank you very much, everybody. 
Thank you.\n$\n$ scattertext --datafile=https://cdn.rawgit.com/JasonKessler/scattertext/master/scattertext/data/political_data.csv \\\n\u003e --text_column=text --category_column=party --metadata_column=speaker --positive_category=democrat \\\n\u003e --category_display_name=Democratic --not_category_display_name=Republican --minimum_term_frequency=8 \\\n\u003e --one_use_per_doc --regex_parser --outputfile=cli_demo.html\n```\n\n### Using Scattertext as a text analysis library: finding characteristic terms and their associations\n\nThe following code creates a stand-alone HTML file that analyzes words\nused by Democrats and Republicans in the 2012 party conventions, and outputs some notable\nterm associations.\n\nFirst, import Scattertext and spaCy.\n\n```pydocstring\n\u003e\u003e\u003e import scattertext as st\n\u003e\u003e\u003e import spacy\n\u003e\u003e\u003e from pprint import pprint\n```\n\nNext, assemble the data you want to analyze into a Pandas data frame. It should have\nat least two columns: the text you'd like to analyze, and the category you'd like to\nstudy. Here, the `text` column contains convention speeches while the `party` column\ncontains the party of the speaker. We'll eventually use the `speaker` column\nto label snippets in the visualization.\n\n```pydocstring\n\u003e\u003e\u003e convention_df = st.SampleCorpora.ConventionData2012.get_data()  \n\u003e\u003e\u003e convention_df.iloc[0]\nparty                                               democrat\nspeaker                                         BARACK OBAMA\ntext       Thank you. Thank you. Thank you. Thank you so ...\nName: 0, dtype: object\n```\n\nTurn the data frame into a Scattertext Corpus to begin analyzing it. To look for differences\nin parties, set the `category_col` parameter to `'party'`, and use the speeches,\npresent in the `text` column, as the texts to analyze by setting the `text_col`\nparameter to `'text'`. 
Finally, pass a spaCy model in to the `nlp` argument and call `build()` to construct the corpus.\n\n```pydocstring\n# Turn it into a Scattertext Corpus \n\u003e\u003e\u003e nlp = spacy.load('en')\n\u003e\u003e\u003e corpus = st.CorpusFromPandas(convention_df, \n...                              category_col='party', \n...                              text_col='text',\n...                              nlp=nlp).build()\n```\n\nLet's see characteristic terms in the corpus, and terms that are most associated with Democrats and\nRepublicans. See slides\n[52](http://www.slideshare.net/JasonKessler/turning-unstructured-content-into-kernels-of-ideas/52)\nto [59](http://www.slideshare.net/JasonKessler/turning-unstructured-content-into-kernels-of-ideas/59) of\nthe [Turning Unstructured Content into Kernels of Ideas](http://www.slideshare.net/JasonKessler/turning-unstructured-content-into-kernels-of-ideas/)\ntalk for more details on these approaches.\n\nHere are the terms that differentiate the corpus from a general English corpus.\n\n```pydocstring\n\u003e\u003e\u003e pprint(list(corpus.get_scaled_f_scores_vs_background().index[:10]))\n['obama',\n 'romney',\n 'barack',\n 'mitt',\n 'obamacare',\n 'biden',\n 'romneys',\n 'hardworking',\n 'bailouts',\n 'autoworkers']\n```\n\nHere are the terms that are most associated with Democrats:\n\n```pydocstring\n\u003e\u003e\u003e term_freq_df = corpus.get_term_freq_df()\n\u003e\u003e\u003e term_freq_df['Democratic Score'] = corpus.get_scaled_f_scores('democrat')\n\u003e\u003e\u003e pprint(list(term_freq_df.sort_values(by='Democratic Score', ascending=False).index[:10]))\n['auto',\n 'america forward',\n 'auto industry',\n 'insurance companies',\n 'pell',\n 'last week',\n 'pell grants',\n \"women 's\",\n 'platform',\n 'millionaires']\n```\n\nAnd Republicans:\n\n```pydocstring\n\u003e\u003e\u003e term_freq_df['Republican Score'] = corpus.get_scaled_f_scores('republican')\n\u003e\u003e\u003e pprint(list(term_freq_df.sort_values(by='Republican 
Score', ascending=False).index[:10]))\n['big government',\n \"n't build\",\n 'mitt was',\n 'the constitution',\n 'he wanted',\n 'hands that',\n 'of mitt',\n '16 trillion',\n 'turned around',\n 'in florida']\n```\n\n### Visualizing term associations\n\nNow, let's write the scatter plot to a stand-alone HTML file. We'll make the y-axis category \"democrat\", and name\nthe category \"Democratic\" with a capital \"D\" for presentation\npurposes. We'll name the other category \"Republican\" with a capital \"R\". All documents in the corpus without\nthe category \"democrat\" will be considered Republican. We set the width of the visualization in pixels, and label\neach excerpt with the speaker using the `metadata` parameter. Finally, we write the visualization to an HTML file.\n\n```pydocstring\n\u003e\u003e\u003e html = st.produce_scattertext_explorer(corpus,\n...          category='democrat',\n...          category_name='Democratic',\n...          not_category_name='Republican',\n...          width_in_pixels=1000,\n...          metadata=convention_df['speaker'])\n\u003e\u003e\u003e open(\"Convention-Visualization.html\", 'wb').write(html.encode('utf-8'))\n```\n\nBelow is what the webpage looks like. Click it and wait a few minutes for the interactive version.\n[![Conventions-Visualization.html](https://jasonkessler.github.io/2012conventions0.0.2.2.png)](https://jasonkessler.github.io/Conventions-Visualization.html)\n\n### Visualizing phrase associations\n\nScattertext can also be used to visualize the category association of a variety of different phrase types. The word\n\"phrase\" denotes any single or multi-word collocation.\n\n#### Using PyTextRank\n\n[PyTextRank](https://github.com/DerwenAI/pytextrank), created by Paco Nathan, is an implementation of\na modified version of the TextRank algorithm (Mihalcea and Tarau 2004). It uses a graph centrality\nalgorithm to extract a scored list of the most prominent phrases in a document. 
Here,\nwe use named entities recognized by spaCy. As of spaCy version 2.2, these are from an NER system trained on\n[Ontonotes 5](https://catalog.ldc.upenn.edu/LDC2013T19).\n\nPlease install pytextrank (`$ pip3 install pytextrank`) before continuing with this tutorial.\n\nTo use, build a corpus as normal, but make sure you use spaCy to parse each document as opposed to a built-in\n`whitespace_nlp`-type tokenizer. Note that adding PyTextRank to the spaCy pipeline is not needed, as it\nwill be run separately by the `PyTextRankPhrases` object. We'll reduce the number of phrases displayed in the\nchart to 2000 using the `AssociationCompactor`. The phrases generated will be treated like non-textual features\nsince their document scores will not correspond to word counts.\n\n```pydocstring\nimport pytextrank, spacy\nimport scattertext as st\n\nnlp = spacy.load('en')\nnlp.add_pipe(\"textrank\", last=True)\n\nconvention_df = st.SampleCorpora.ConventionData2012.get_data().assign(\n    parse=lambda df: df.text.apply(nlp),\n    party=lambda df: df.party.apply({'democrat': 'Democratic', 'republican': 'Republican'}.get)\n)\ncorpus = st.CorpusFromParsedDocuments(\n    convention_df,\n    category_col='party',\n    parsed_col='parse',\n    feats_from_spacy_doc=st.PyTextRankPhrases()\n).build(\n).compact(\n    st.AssociationCompactor(2000, use_non_text_features=True)\n)\n```\n\nNote that the terms present in the corpus are named entities, and, as opposed to frequency counts, their scores\nare the eigencentrality scores assigned to them by the TextRank algorithm. Running `corpus.get_metadata_freq_df('')`\nwill return, for each category, the sums of terms' TextRank scores. 
The dense ranks of these scores will be used to\nconstruct the scatter plot.\n\n```pydocstring\nterm_category_scores = corpus.get_metadata_freq_df('')\nprint(term_category_scores)\n'''\n                                         Democratic  Republican\nterm\nour future                                 1.113434    0.699103\nyour country                               0.314057    0.000000\ntheir home                                 0.385925    0.000000\nour government                             0.185483    0.462122\nour workers                                0.199704    0.210989\nher family                                 0.540887    0.405552\nour time                                   0.510930    0.410058\n...\n'''\n```\n\nBefore we construct the plot, let's define some helper variables. Since the aggregate TextRank scores aren't particularly\ninterpretable, we'll display the per-category rank of each score in the `metadata_description` field. These will be\ndisplayed after a term is clicked.\n\n```pydocstring\nimport numpy as np\nimport pandas as pd\n\nterm_ranks = pd.DataFrame(\n    np.argsort(np.argsort(-term_category_scores, axis=0), axis=0) + 1,\n    columns=term_category_scores.columns,\n    index=term_category_scores.index)\n\nmetadata_descriptions = {\n    term: '\u003cbr/\u003e' + '\u003cbr/\u003e'.join(\n        '\u003cb\u003e%s\u003c/b\u003e TextRank score rank: %s/%s' % (cat, term_ranks.loc[term, cat], corpus.get_num_metadata())\n        for cat in corpus.get_categories())\n    for term in corpus.get_metadata()\n}\n```\n\nWe can construct term scores in a couple of ways. One is a standard dense-rank difference, a score which is used in most\nof the two-category contrastive plots here, which will give us the most category-associated phrases. Another is to use\nthe maximum category-specific score; this will give us the most prominent phrases in each category, regardless of the\nprominence in the other category. 
We'll take both approaches in this tutorial. Let's first compute the second kind of score,\nthe category-specific prominence, below.\n\n```pydocstring\ncategory_specific_prominence = term_category_scores.apply(\n    lambda r: r.Democratic if r.Democratic \u003e r.Republican else -r.Republican,\n    axis=1\n)\n```\n\nNow we're ready to output this chart. Note that we use a `dense_rank` transform, which places identically scaled phrases\natop each other. We use `category_specific_prominence` as scores, and set `sort_by_dist` as `False` to ensure the\nphrases displayed on the right-hand side of the chart are ranked by the scores and not distance to the upper-left or\nlower-right corners. Since matching phrases are treated as non-text features, we encode them as single-phrase topic\nmodels and set the `topic_model_preview_size` to `0` to indicate the topic model list shouldn't be shown. Finally,\nwe set `use_full_doc=True` to ensure the full documents are displayed. Note the documents will be displayed in order of phrase-specific score.\n\n```pydocstring\nhtml = st.produce_scattertext_explorer(\n    corpus,\n    category='Democratic',\n    not_category_name='Republican',\n    minimum_term_frequency=0,\n    pmi_threshold_coefficient=0,\n    width_in_pixels=1000,\n    transform=st.dense_rank,\n    metadata=corpus.get_df()['speaker'],\n    scores=category_specific_prominence,\n    sort_by_dist=False,\n    use_non_text_features=True,\n    topic_model_term_lists={term: [term] for term in corpus.get_metadata()},\n    topic_model_preview_size=0,\n    metadata_descriptions=metadata_descriptions,\n    use_full_doc=True\n)\n```\n\n[![PyTextRankProminenceScore.html](https://jasonkessler.github.io/PyTextRankProminence.png)](https://jasonkessler.github.io/PyTextRankProminenceScore.html)\n\nThe most associated terms in each category make some sense, at least on a post hoc analysis. 
When referring to (then)\nGovernor Romney, Democrats used his surname \"Romney\" in their most central mentions of him, while Republicans used the\nmore familiar and humanizing \"Mitt\". As for President Obama, the phrase \"Obama\" didn't show up as a top term\nin either party's speeches, but the first name \"Barack\" was one of the most central phrases in Democratic speeches,\nmirroring \"Mitt.\"\n\nAlternatively, we can use the Dense Rank Difference in scores to color phrase-points and determine the top phrases to be\ndisplayed on the right-hand side of the chart. Instead of setting `scores` as category-specific prominence scores,\nwe set `term_scorer=RankDifference()` to inject a way of determining term scores into the scatter plot creation process.\n\n```pydocstring\nhtml = st.produce_scattertext_explorer(\n    corpus,\n    category='Democratic',\n    not_category_name='Republican',\n    minimum_term_frequency=0,\n    pmi_threshold_coefficient=0,\n    width_in_pixels=1000,\n    transform=st.dense_rank,\n    use_non_text_features=True,\n    metadata=corpus.get_df()['speaker'],\n    term_scorer=st.RankDifference(),\n    sort_by_dist=False,\n    topic_model_term_lists={term: [term] for term in corpus.get_metadata()},\n    topic_model_preview_size=0,\n    metadata_descriptions=metadata_descriptions,\n    use_full_doc=True\n)\n```\n\n[![PyTextRankRankDiff.html](https://jasonkessler.github.io/PyTextRankRankDiff.png)](https://jasonkessler.github.io/PyTextRankRankDiff.html)\n\n#### Using Phrasemachine to find phrases.\n\nPhrasemachine from [AbeHandler](https://github.com/AbeHandler) (Handler et al. 2016) uses regular expressions over\nsequences of part-of-speech tags to identify noun phrases. 
This has an advantage over using spaCy's NP-chunking\nin that it tends to isolate meaningful, large noun phrases which are free of appositives.\n\nAs opposed to PyTextRank, we'll just use counts of these phrases, treating them like any other term.\n\n```pydocstring\nimport spacy\nfrom scattertext import SampleCorpora, PhraseMachinePhrases, dense_rank, RankDifference, AssociationCompactor, produce_scattertext_explorer\nfrom scattertext.CorpusFromPandas import CorpusFromPandas\n\ncorpus = (CorpusFromPandas(SampleCorpora.ConventionData2012.get_data(),\n                           category_col='party',\n                           text_col='text',\n                           feats_from_spacy_doc=PhraseMachinePhrases(),\n                           nlp=spacy.load('en', parser=False))\n          .build().compact(AssociationCompactor(4000)))\n\nhtml = produce_scattertext_explorer(corpus,\n                                    category='democrat',\n                                    category_name='Democratic',\n                                    not_category_name='Republican',\n                                    minimum_term_frequency=0,\n                                    pmi_threshold_coefficient=0,\n                                    transform=dense_rank,\n                                    metadata=corpus.get_df()['speaker'],\n                                    term_scorer=RankDifference(),\n                                    width_in_pixels=1000)\n```\n\n[![Phrasemachine.html](https://jasonkessler.github.io/PhraseMachine.png)](https://jasonkessler.github.io/Phrasemachine.html)\n\n### Adding color gradients to explain scores\n\nIn Scattertext, various metrics, including term associations, are often shown in two ways. The first,\nand most important, is the position in the chart. The second is the color of a point or text. In Scattertext 0.2.21, a \nway of visualizing the semantics of these scores is introduced: the gradient as key.  
\n\nThe gradient, by default, follows the `d3_color_scale` parameter of `produce_scattertext_explorer` which is \n`d3.interpolateRdYlBu` by default. \n\nThe following additional parameters to `produce_scattertext_explorer` (and similar functions) allow for the manipulation of\ngradients.\n\n- `include_gradient: bool` (`False` by default) is a flag that triggers the appearance of a gradient.\n- `left_gradient_term: Optional[str]` indicates the text written on the far-left side of the gradient. It is written in `gradient_text_color` and is `category_name` by default.\n- `right_gradient_term: Optional[str]` indicates the text written on the far-right side of the gradient. It is written in `gradient_text_color` and is `not_category_name` by default.\n- `middle_gradient_term: Optional[str]` indicates the text written in the middle of the gradient. It is the opposite color of the center gradient color and is empty by default.\n- `gradient_text_color: Optional[str]` indicates the fixed color of the text written on the gradient. If None, it defaults to the opposite color of the gradient.\n- `left_text_color: Optional[str]` overrides `gradient_text_color` for the left gradient term\n- `middle_text_color: Optional[str]` overrides `gradient_text_color` for the middle gradient term\n- `right_text_color: Optional[str]` overrides `gradient_text_color` for the right gradient term\n- `gradient_colors: Optional[List[str]]` list of hex colors, including '#', (e.g., `['#0000ff', '#980067', '#cc3300', '#32cd00']`) which describe the gradient. If given, these override `d3_color_scale`.\n\nA straightforward example is as follows. Term colors are defined as a mapping between a term name and a `#RRGGBB` color \nas part of the `term_colors` parameter, and the color gradient is defined in `gradient_colors`. 
\n\n```python\nimport matplotlib as mpl\nimport matplotlib.pyplot as plt\nimport numpy as np\nimport scattertext as st\n\ndf = st.SampleCorpora.ConventionData2012.get_data().assign(\n    parse=lambda df: df.text.apply(st.whitespace_nlp_with_sentences)\n)\n\ncorpus = st.CorpusFromParsedDocuments(\n    df, category_col='party', parsed_col='parse'\n).build().get_unigram_corpus().compact(st.AssociationCompactor(2000))\n\nhtml = st.produce_scattertext_explorer(\n    corpus,\n    category='democrat',\n    category_name='Democratic',\n    not_category_name='Republican',\n    minimum_term_frequency=0,\n    pmi_threshold_coefficient=0,\n    width_in_pixels=1000,\n    metadata=corpus.get_df()['speaker'],\n    transform=st.Scalers.dense_rank,\n    include_gradient=True,\n    left_gradient_term=\"More Democratic\",\n    right_gradient_term=\"More Republican\",\n    middle_gradient_term='Metric: Dense Rank Difference',\n    gradient_text_color=\"white\",\n    term_colors=dict(zip(\n        corpus.get_terms(),\n        [\n            mpl.colors.to_hex(x) for x in plt.get_cmap('brg')(\n                st.Scalers.scale_center_zero_abs(\n                    st.RankDifferenceScorer(corpus).set_categories('democrat').get_scores()).values\n            )\n        ]\n    )),\n    gradient_colors=[mpl.colors.to_hex(x) for x in plt.get_cmap('brg')(np.arange(1., 0., -0.01))],\n)\n```\n[![demo_gradient.html](https://jasonkessler.github.io/gradient.png)](https://jasonkessler.github.io/demo_gradient.html)\n\n\n### Visualizing Empath topics and categories\n\nIn order to visualize Empath (Fast et al., 2016) topics and categories instead of terms, we'll need to\ncreate a `Corpus` of extracted topics and categories rather than unigrams and\nbigrams. To do so, use the `FeatsFromOnlyEmpath` feature extractor. See the source code for\nexamples of how to make your own.\n\nWhen creating the visualization, pass the `use_non_text_features=True` argument into\n`produce_scattertext_explorer`. 
This will instruct it to use the labeled Empath\ntopics and categories instead of looking for terms. Since the documents returned\nwhen a topic or category label is clicked will be in order of the document-level\ncategory-association strength, setting `use_full_doc=True` makes sense, unless you have\nenormous documents. Otherwise, the first 300 characters will be shown.\n\n(New in 0.0.26). Ensure you include `topic_model_term_lists=feat_builder.get_top_model_term_lists()`\nin `produce_scattertext_explorer` to ensure it bolds passages of snippets that match the\ntopic model.\n\n```pydocstring\n\u003e\u003e\u003e feat_builder = st.FeatsFromOnlyEmpath()\n\u003e\u003e\u003e empath_corpus = st.CorpusFromParsedDocuments(convention_df,\n...                                              category_col='party',\n...                                              feats_from_spacy_doc=feat_builder,\n...                                              parsed_col='text').build()\n\u003e\u003e\u003e html = st.produce_scattertext_explorer(empath_corpus,\n...                                        category='democrat',\n...                                        category_name='Democratic',\n...                                        not_category_name='Republican',\n...                                        width_in_pixels=1000,\n...                                        metadata=convention_df['speaker'],\n...                                        use_non_text_features=True,\n...                                        use_full_doc=True,\n...                                        
topic_model_term_lists=feat_builder.get_top_model_term_lists())\n\u003e\u003e\u003e open(\"Convention-Visualization-Empath.html\", 'wb').write(html.encode('utf-8'))\n``` \n\n[![Convention-Visualization-Empath.html](https://jasonkessler.github.io/Convention-Visualization-Empath.png)](https://jasonkessler.github.io/Convention-Visualization-Empath.html)\n\nScattertext also includes a feature builder to explore the relationship between General Inquirer Tag Categories\nand Document Categories. We'll use a slightly different approach, looking at the relationship of GI Tag Categories to\npolitical parties by using the\nZ-Scores of the Log-Odds-Ratio with Uninformative Dirichlet Priors (Monroe 2008). We'll use\nthe `produce_frequency_explorer` plot\nvariation to visualize this relationship, setting the x-axis as the number of times a word in the tag category occurs,\nand the y-axis as the z-score.\n\nFor more information on the General Inquirer, please see\nthe [General Inquirer Home Page](http://www.wjh.harvard.edu/~inquirer/).\n\nWe'll use the same data set as before, except we'll use the `FeatsFromGeneralInquirer` feature builder.\n\n```pydocstring\n\u003e\u003e\u003e general_inquirer_feature_builder = st.FeatsFromGeneralInquirer()\n\u003e\u003e\u003e corpus = st.CorpusFromPandas(convention_df,\n...                              category_col='party',\n...                              text_col='text',\n...                              nlp=st.whitespace_nlp_with_sentences,\n...                              feats_from_spacy_doc=general_inquirer_feature_builder).build()\n```\n\nNext, we'll call `produce_frequency_explorer` in a way similar to how we called `produce_scattertext_explorer` in the previous\nsection.\nThere are a few differences, however. First, we specify the `LogOddsRatioUninformativeDirichletPrior` term scorer, which\nscores the relationships between the categories. 
The `grey_threshold` indicates the points scoring between [-1.96, 1.96]\n(i.e., p \u003e 0.05) should be colored gray. The\nargument `metadata_descriptions=general_inquirer_feature_builder.get_definitions()`\nindicates that a dictionary mapping the tag name to a string definition is passed. When a tag is clicked, the definition\nin the dictionary will be shown below the plot, as shown in the image following the snippet.\n\n```pydocstring\n\u003e\u003e\u003e html = st.produce_frequency_explorer(corpus,\n...                                      category='democrat',\n...                                      category_name='Democratic',\n...                                      not_category_name='Republican',\n...                                      metadata=convention_df['speaker'],\n...                                      use_non_text_features=True,\n...                                      use_full_doc=True,\n...                                      term_scorer=st.LogOddsRatioUninformativeDirichletPrior(),\n...                                      grey_threshold=1.96,\n...                                      width_in_pixels=1000,\n...                                      topic_model_term_lists=general_inquirer_feature_builder.get_top_model_term_lists(),\n...                                      metadata_descriptions=general_inquirer_feature_builder.get_definitions())\n```\n\nHere's the resulting chart.  
\n[![demo_general_inquirer_frequency_plot.html](https://jasonkessler.github.io/general_inquirer.png)](https://jasonkessler.github.io/demo_general_inquirer_frequency_plot.html)\n\n[![demo_general_inquirer_frequency_plot.html](https://jasonkessler.github.io/general_inquirer2.png)](https://jasonkessler.github.io/demo_general_inquirer_frequency_plot.html)\n\n### Visualizing the Moral Foundations 2.0 Dictionary\n\nThe  [[Moral Foundations Theory]](https://moralfoundations.org/) proposes six psychological constructs\nas building blocks of moral thinking, as described in Graham et al. (2013). These foundations are,\nas described on [[moralfoundations.org]](https://moralfoundations.org/): care/harm, fairness/cheating, loyalty/betrayal,\nauthority/subversion, sanctity/degradation, and liberty/oppression. Please see the site for a more in-depth discussion\nof these foundations.\n\nFrimer et al. (2019) created the Moral Foundations Dictionary 2.0, or a lexicon of terms which invoke a moral foundation\nas a virtue (favorable toward the foundation) or a vice (in opposition to the foundation).\n\nThis dictionary can be used in the same way as the General Inquirer. 
In this example, we can plot the Cohen's d scores\nof\nfoundation-word counts relative to the frequencies words involving those foundations were invoked.\n\nWe can first load the corpus as normal, and use `st.FeatsFromMoralFoundationsDictionary()` to extract features.\n\n```python\nimport scattertext as st\n\nconvention_df = st.SampleCorpora.ConventionData2012.get_data()\nmoral_foundations_feats = st.FeatsFromMoralFoundationsDictionary()\ncorpus = st.CorpusFromPandas(convention_df,\n                             category_col='party',\n                             text_col='text',\n                             nlp=st.whitespace_nlp_with_sentences,\n                             feats_from_spacy_doc=moral_foundations_feats).build()\n```\n\nNext, let's use Cohen's d term scorer to analyze the corpus, and describe a set of Cohen's d association scores.\n\n```python\ncohens_d_scorer = st.CohensD(corpus).use_metadata()\nterm_scorer = cohens_d_scorer.set_categories('democrat', ['republican']).term_scorer.get_score_df()\n```\n\nWhich yields the following data frame:\n\n|                  |   cohens_d |   cohens_d_se |   cohens_d_z |   cohens_d_p |   hedges_g | hedges_g_se | hedges_g_z |  hedges_g_p |         m1 |         m2 |   count1 |   count2 |   docs1 |   docs2 |\n|:-----------------|-----------:|--------------:|-------------:|-------------:|-----------:|------------:|-----------:|------------:|-----------:|-----------:|---------:|---------:|--------:|--------:|\n| care.virtue      |  0.662891  |      0.149425 |     4.43629  |  4.57621e-06 |   0.660257 |    0.159049 |    4.15129 | 1.65302e-05 | 0.195049   | 0.12164    |      760 |      379 |     115 |      54 |\n| care.vice        |  0.24435   |      0.146025 |     1.67335  |  0.0471292   |   0.243379 |    0.152654 |    1.59432 |   0.0554325 | 0.0580005  | 0.0428358  |      244 |      121 |      80 |      41 |\n| fairness.virtue  |  0.176794  |      0.145767 |     1.21286  |  0.112592    |   0.176092 |    0.152164 |    
1.15725 |    0.123586 | 0.0502469  | 0.0403369  |      225 |      107 |      71 |      39 |\n| fairness.vice    |  0.0707162 |      0.145528 |     0.485928 |  0.313509    |  0.0704352 |    0.151711 |   0.464273 |    0.321226 | 0.00718627 | 0.00573227 |       32 |       14 |      21 |      10 |\n| authority.virtue | -0.0187793 |      0.145486 |    -0.12908  |  0.551353    | -0.0187047 |     0.15163 |  -0.123357 |    0.549088 | 0.358192   | 0.361191   |     1281 |      788 |     122 |      66 |\n| authority.vice   | -0.0354164 |      0.145494 |    -0.243422 |  0.596161    | -0.0352757 |    0.151646 |  -0.232619 |    0.591971 | 0.00353465 | 0.00390602 |       20 |       14 |      14 |      10 |\n| sanctity.virtue  | -0.512145  |      0.147848 |    -3.46399  |  0.999734    |   -0.51011 |    0.156098 |   -3.26788 |    0.999458 | 0.0587987  | 0.101677   |      265 |      309 |      74 |      48 |    \n| sanctity.vice    | -0.108011  |      0.145589 |    -0.74189  |  0.770923    |  -0.107582 |    0.151826 |  -0.708585 |    0.760709 | 0.00845048 | 0.0109339  |       35 |       28 |      23 |      20 |\n| loyalty.virtue   | -0.413696  |      0.147031 |    -2.81367  |  0.997551    |  -0.412052 |    0.154558 |     -2.666 |    0.996162 | 0.259296   | 0.309776   |     1056 |      717 |     119 |      66 |\n| loyalty.vice     | -0.0854683 |      0.145549 |    -0.587213 |  0.72147     | -0.0851287 |    0.151751 |  -0.560978 |    0.712594 | 0.00124518 | 0.00197022 |        5 |        5 |       5 |       4 |\n\nThis data frame gives us Cohen's d scores (and their standard errors and z-scores), Hedge's $g$ scores (ditto),\nthe mean document-length normalized topic usage per category (where the in-focus category is m1 [in this case Democrats]\nand the out-of-focus is m2), the raw number of words used in for each topic (count1 and count2), and the number of\ndocuments\nin each category with the topic (docs1 and docs2).\n\nNote that Cohen's d is the difference of m1 and m2 divided by 
their pooled standard deviation.\n\nNow, let's plot the d-scores of foundations vs. their frequencies.\n\n```python\nhtml = st.produce_frequency_explorer(\n    corpus,\n    category='democrat',\n    category_name='Democratic',\n    not_category_name='Republican',\n    metadata=convention_df['speaker'],\n    use_non_text_features=True,\n    use_full_doc=True,\n    term_scorer=st.CohensD(corpus).use_metadata(),\n    grey_threshold=0,\n    width_in_pixels=1000,\n    topic_model_term_lists=moral_foundations_feats.get_top_model_term_lists(),\n    metadata_descriptions=moral_foundations_feats.get_definitions()\n)\n```\n\n[![demo_moral_foundations.html](https://jasonkessler.github.io/demo_moral_foundations.png)](https://jasonkessler.github.io/demo_moral_foundations.html)\n\n### Ordering Terms by Corpus Characteristicness\n\nOften the terms of most interest are ones that are characteristic of the corpus as a whole. These are terms which occur\nfrequently in all sets of documents being studied, but are relatively infrequent according to general term frequencies.\n\nWe can produce a plot with a characteristic score on the x-axis and class-association scores on the y-axis using the\nfunction `produce_characteristic_explorer`.\n\nCorpus characteristicness is the difference in dense term ranks between the words in all of the documents in the study\nand a general English-language frequency list. 
See\nthis [Talk on Term-Class Association Scores](http://nbviewer.jupyter.org/github/JasonKessler/PuPPyTalk/blob/master/notebooks/Class-Association-Scores.ipynb)\nfor a more thorough explanation.\n\n```python\nimport scattertext as st\n\ncorpus = (st.CorpusFromPandas(st.SampleCorpora.ConventionData2012.get_data(),\n                              category_col='party',\n                              text_col='text',\n                              nlp=st.whitespace_nlp_with_sentences)\n          .build()\n          .get_unigram_corpus()\n          .compact(st.ClassPercentageCompactor(term_count=2,\n                                               term_ranker=st.OncePerDocFrequencyRanker)))\nhtml = st.produce_characteristic_explorer(\n    corpus,\n    category='democrat',\n    category_name='Democratic',\n    not_category_name='Republican',\n    metadata=corpus.get_df()['speaker']\n)\nopen('demo_characteristic_chart.html', 'wb').write(html.encode('utf-8'))\n```\n\n[![demo_characteristic_chart.html](https://jasonkessler.github.io/demo_characteristic_chart.png)](https://jasonkessler.github.io/demo_characteristic_chart.html)\n\n### Document-Based Scatterplots\n\nIn addition to words, phrases and topics, we can make each point correspond to a document. Let's first create\na corpus object for the 2012 Conventions data set. 
This explanation follows `demo_pca_documents.py`.\n\n```python\nimport pandas as pd\nfrom sklearn.feature_extraction.text import TfidfTransformer\nimport scattertext as st\nfrom scipy.sparse.linalg import svds\n\nconvention_df = st.SampleCorpora.ConventionData2012.get_data()\nconvention_df['parse'] = convention_df['text'].apply(st.whitespace_nlp_with_sentences)\ncorpus = (st.CorpusFromParsedDocuments(convention_df,\n                                       category_col='party',\n                                       parsed_col='parse')\n          .build()\n          .get_stoplisted_unigram_corpus())\n```\n\nNext, let's add the document names as meta data in the corpus object. The `add_doc_names_as_metadata` function\ntakes an array of document names, and populates a new corpus' meta data with those names. If two documents have the\nsame name, it appends a number (starting with 1) to the name.\n\n```python\ncorpus = corpus.add_doc_names_as_metadata(corpus.get_df()['speaker'])\n```\n\nNext, we find tf.idf scores for the corpus' term-document matrix, run sparse SVD, and add the results to a projection\ndata frame, making the x and y-axes the first two left singular vectors, and indexing it on the corpus' meta data, which\ncorresponds to the document names.\n\n```python\nembeddings = TfidfTransformer().fit_transform(corpus.get_term_doc_mat())\nu, s, vt = svds(embeddings, k=3, maxiter=20000, which='LM')\nprojection = pd.DataFrame({'term': corpus.get_metadata(), 'x': u.T[0], 'y': u.T[1]}).set_index('term')\n```\n\nFinally, set scores as 1 for Democrats and 0 for Republicans, rendering Republican documents as red points and\nDemocratic documents as blue. 
For more on the `produce_pca_explorer` function,\nsee [Using SVD to visualize any kind of word embeddings](#using-svd-to-visualize-any-kind-of-word-embeddings).\n\n```python\ncategory = 'democrat'\nscores = (corpus.get_category_ids() == corpus.get_categories().index(category)).astype(int)\nhtml = st.produce_pca_explorer(corpus,\n                               category=category,\n                               category_name='Democratic',\n                               not_category_name='Republican',\n                               metadata=convention_df['speaker'],\n                               width_in_pixels=1000,\n                               show_axes=False,\n                               use_non_text_features=True,\n                               use_full_doc=True,\n                               projection=projection,\n                               scores=scores,\n                               show_top_terms=False)\n```\n\nClick for an interactive version\n[![demo_pca_documents.html](https://jasonkessler.github.io/doc_pca.png)](https://jasonkessler.github.io/demo_pca_documents.html)\n\n### Using Cohen's d or Hedge's g to visualize effect size.\n\nCohen's d is a popular metric used to measure effect size. The definitions of Cohen's d and Hedge's $g$\nfrom (Shinichi and Cuthill 2017) are implemented in Scattertext.\n\n```python\n\u003e\u003e\u003e convention_df = st.SampleCorpora.ConventionData2012.get_data()\n\u003e\u003e\u003e corpus = (st.CorpusFromPandas(convention_df,\n...                               
category_col='party',\n...                               text_col='text',\n...                               nlp=st.whitespace_nlp_with_sentences)\n...           .build()\n...           .get_unigram_corpus())\n```\n\nWe can create a term scorer object to examine the effect sizes and other metrics.\n\n```python\n\u003e\u003e\u003e term_scorer = st.CohensD(corpus).set_categories('democrat', ['republican'])\n\u003e\u003e\u003e term_scorer.get_score_df().sort_values(by='cohens_d', ascending=False).head()\n           cohens_d  cohens_d_se  cohens_d_z     cohens_d_p  hedges_g  hedges_g_se  hedges_g_z  hedges_g_p        m1        m2\nobama      1.187378     0.024588   48.290444   0.000000e+00  1.187322     0.018419   64.461363         0.0  0.007778  0.002795\nclass      0.855859     0.020848   41.052045   0.000000e+00  0.855818     0.017227   49.677688         0.0  0.002222  0.000375\nmiddle     0.826895     0.020553   40.232746   0.000000e+00  0.826857     0.017138   48.245626         0.0  0.002316  0.000400\npresident  0.820825     0.020492   40.056541   0.000000e+00  0.820786     0.017120   47.942661         0.0  0.010231  0.005369\nbarack     0.730624     0.019616   37.245725  6.213052e-304  0.730589     0.016862   43.327800         0.0  0.002547  0.000725\n```\n\nOur calculation of Cohen's d is not directly based on term counts. Rather, we divide each document's term counts by the\ntotal number of terms in the document before calculating the statistics.  `m1` and `m2` are, respectively, the mean portions of words\nin speeches made by Democrats and Republicans that were the term in question. The effect size (`cohens_d`) is the\ndifference between these means divided by the pooled standard deviation.  `cohens_d_se` is the standard error\nof the statistic, while `cohens_d_z` and `cohens_d_p` are the Z-scores and p-values indicating the statistical\nsignificance of the effect. 
Corresponding columns are present for Hedge's $g$, a version of Cohen's d adjusted for data set size.\n\n```python\n\u003e\u003e\u003e st.produce_frequency_explorer(\n    corpus,\n    category='democrat',\n    category_name='Democratic',\n    not_category_name='Republican',\n    term_scorer=st.CohensD(corpus),\n    metadata=convention_df['speaker'],\n    grey_threshold=0\n)\n```  \n\nClick for an interactive version.\n[![demo_cohens_d.html](https://jasonkessler.github.io/cohen_d.png)](https://jasonkessler.github.io/demo_cohens_d.html)\n\n### Using Cliff's Delta to visualize effect size\n\nCliff's Delta (Cliff 1993) uses a non-parametric approach to computing effect size. In our setting, the term's frequency \npercentage of each document in the focus set is compared with that of the background set. For each pair of documents,\na score of 1 is given if the focus document's frequency percentage is larger than the background's, 0 if identical, and -1 \nif smaller. Note that this assumes document lengths are similarly distributed across the focus and background corpora.\n\nSee [https://real-statistics.com/non-parametric-tests/mann-whitney-test/cliffs-delta/] for the formulas used in `CliffsDelta`.\n\nBelow is an example of how to use `CliffsDelta` to find and plot term scores: \n\n```python\nimport spacy\nimport scattertext as st\n\nnlp = spacy.blank('en')\nnlp.add_pipe('sentencizer')\nconvention_df = st.SampleCorpora.ConventionData2012.get_data().assign(\n   party = lambda df: df.party.apply(\n       lambda x: {'democrat': 'Dem', 'republican': 'Rep'}[x]),\n   SpacyParse=lambda df: df.text.apply(nlp)\n)\ncorpus = st.CorpusFromParsedDocuments(convention_df, category_col='party', parsed_col='SpacyParse').build(\n).remove_terms_used_in_less_than_num_docs(10)\nst.CliffsDelta(corpus).set_categories('Dem').get_score_df().sort_values(by='Metric', ascending=False).iloc[:10]\n```\n\n| term            |   Metric |    Stddev |   Low-5.0% CI |   High-5.0% CI |   TermCount1 |   TermCount2 |   DocCount1 |   DocCount2 
|\n|:----------------|---------:|----------:|--------------:|---------------:|-------------:|-------------:|------------:|------------:|\n| obama           | 0.597191 | 0.0266606 |     -1.35507  |      -1.03477  |          537 |          165 |         113 |          40 |\n| president obama | 0.565903 | 0.0314348 |     -2.37978  |      -1.74131  |          351 |           78 |         100 |          30 |\n| president       | 0.426337 | 0.0293418 |      1.22784  |       0.909226 |          740 |          301 |         113 |          53 |\n| middle          | 0.417591 | 0.0267365 |      1.10791  |       0.840932 |          164 |           27 |          68 |          12 |\n| class           | 0.415373 | 0.0280622 |      1.09032  |       0.815649 |          161 |           25 |          69 |          14 |\n| barack          | 0.406997 | 0.0281692 |      1.00765  |       0.750963 |          202 |           46 |          76 |          16 |\n| barack obama    | 0.402562 | 0.027512  |      0.965359 |       0.723403 |          164 |           45 |          76 |          16 |\n| that 's         | 0.384085 | 0.0227344 |      0.809747 |       0.634705 |          236 |           91 |          89 |          31 |\n| obama .         | 0.356245 | 0.0237453 |      0.664688 |       0.509631 |           70 |            5 |          49 |           4 |\n| for             | 0.35526  | 0.0364138 |      0.70142  |       0.46487  |         1020 |          542 |         119 |          62 |\n\n\nWe can elegantly display the Cliff's delta scores using `dataframe_scattertext`, and describe the point coloring scheme\nusing the `include_gradient=True` parameter. We set the `left_gradient_term`, `middle_gradient_term`, and `right_gradient_term`\nparameters to strings which will appear in their corresponding positions on the gradient. 
\n\n```python\nplot_df = st.CliffsDelta(\n    corpus\n).set_categories(\n    category_name='Dem'\n).get_score_df().rename(columns={'Metric': 'CliffsDelta'}).assign(\n    Frequency=lambda df: df.TermCount1 + df.TermCount2,\n    X=lambda df: df.Frequency,\n    Y=lambda df: df.CliffsDelta,\n    Xpos=lambda df: st.Scalers.dense_rank(df.X),\n    Ypos=lambda df: st.Scalers.scale_center_zero_abs(df.Y),\n    ColorScore=lambda df: df.Ypos,\n)\n\nhtml = st.dataframe_scattertext(\n    corpus,\n    plot_df=plot_df,\n    category='Dem',\n    category_name='Dem',\n    not_category_name='Rep',\n    width_in_pixels=1000,\n    ignore_categories=False,\n    metadata=lambda corpus: corpus.get_df()['speaker'],\n    color_score_column='ColorScore',\n    left_list_column='ColorScore',\n    show_characteristic=False,\n    y_label=\"Cliff's Delta\",\n    x_label='Frequency Ranks',\n    y_axis_labels=[f'More Rep: delta={plot_df.CliffsDelta.max():.3f}',\n                   '',\n                   f'More Dem: delta={-plot_df.CliffsDelta.max():.3f}'],\n    tooltip_columns=['Frequency', 'CliffsDelta'],\n    term_description_columns=['CliffsDelta', 'Stddev', 'Low-95.0% CI', 'High-95.0% CI'],\n    header_names={'upper': 'Top Dem', 'lower': 'Top Reps'},\n    horizontal_line_y_position=0,\n    include_gradient=True,\n    left_gradient_term='More Republican',\n    right_gradient_term='More Democratic',\n    middle_gradient_term=\"Metric: Cliff's Delta\",\n)\n```\n\n[![demo_cliffs_delta.html](https://jasonkessler.github.io/cliffsdelta.png)](https://jasonkessler.github.io/demo_cliffs_delta.html)\n\n\n### Using Bi-Normal Separation (BNS) to score terms\n\nBi-Normal Separation (BNS) (Forman, 2008) was added in version 0.1.8. A variation of BNS is used\nwhere $F^{-1}(tpr) - F^{-1}(fpr)$ is not used as an absolute value, but kept as a difference. 
This allows\nterms strongly indicative of true positives or of false positives to have a high or low score, respectively.\nNote that tpr and fpr are scaled to lie in $[\\alpha, 1-\\alpha]$, where\n$\\alpha \\in [0, 1]$. In Forman (2008) and earlier literature, $\\alpha=0.0005$. In personal correspondence with Forman,\nhe kindly suggested using $\\frac{1.}{\\mbox{minimum(positives, negatives)}}$. I have implemented this as\n$\\alpha=\\frac{1.}{\\mbox{number of documents in the least frequent category}}$.\n\n```python\ncorpus = (st.CorpusFromPandas(convention_df,\n                              category_col='party',\n                              text_col='text',\n                              nlp=st.whitespace_nlp_with_sentences)\n          .build()\n          .get_unigram_corpus()\n          .remove_infrequent_words(3, term_ranker=st.OncePerDocFrequencyRanker))\n\nterm_scorer = (st.BNSScorer(corpus).set_categories('democrat'))\nprint(term_scorer.get_score_df().sort_values(by='democrat BNS'))\n\nhtml = st.produce_frequency_explorer(\n    corpus,\n    category='democrat',\n    category_name='Democratic',\n    not_category_name='Republican',\n    scores=term_scorer.get_score_df()['democrat BNS'].reindex(corpus.get_terms()).values,\n    metadata=lambda c: c.get_df()['speaker'],\n    minimum_term_frequency=0,\n    grey_threshold=0,\n    y_label=f'Bi-normal Separation (alpha={term_scorer.prior_counts})'\n)\n```\n\nBNS-scored terms using an algorithmically found alpha.\n[![BNS](https://raw.githubusercontent.com/JasonKessler/jasonkessler.github.io/master/demo_bi_normal_separation.png)](https://raw.githubusercontent.com/JasonKessler/jasonkessler.github.io/master/demo_bi_normal_separation.html)\n\n### Using correlations to explain classifiers\n\nWe can train a classifier to produce a prediction score for each document. 
Often, classifiers or regressors\nuse features beyond the ones represented by Scattertext, be they n-gram, topic,\nextra-linguistic, neural, etc.\n\nWe can use Scattertext to visualize the correlations between unigrams (or really any feature representation) and\nthe document scores produced by a model.\n\nIn the following example, we train a linear SVM using unigram and bigram features on the entire convention data set,\nuse the model to make a prediction on each document, and finally use Pearson's $r$ to correlate unigram features\nwith the distance from the SVM decision boundary.\n\n```python\nfrom sklearn.svm import LinearSVC\n\nimport scattertext as st\n\ndf = st.SampleCorpora.ConventionData2012.get_data().assign(\n    parse=lambda df: df.text.apply(st.whitespace_nlp_with_sentences)\n)\n\ncorpus = st.CorpusFromParsedDocuments(\n    df, category_col='party', parsed_col='parse'\n).build()\n\nX = corpus.get_term_doc_mat()\ny = corpus.get_category_ids()\n\nclf = LinearSVC()\nclf.fit(X=X, y=y == corpus.get_categories().index('democrat'))\ndoc_scores = clf.decision_function(X=X)\n\ncompactcorpus = corpus.get_unigram_corpus().compact(st.AssociationCompactor(2000))\n\nplot_df = st.Correlations().set_correlation_type(\n    'pearsonr'\n).get_correlation_df(\n    corpus=compactcorpus,\n    document_scores=doc_scores\n).reindex(compactcorpus.get_terms()).assign(\n    X=lambda df: df.Frequency,\n    Y=lambda df: df['r'],\n    Xpos=lambda df: st.Scalers.dense_rank(df.X),\n    Ypos=lambda df: st.Scalers.scale_center_zero_abs(df.Y),\n    SuppressDisplay=False,\n    ColorScore=lambda df: df.Ypos,\n)\n\nhtml = st.dataframe_scattertext(\n    compactcorpus,\n    plot_df=plot_df,\n    category='democrat',\n    category_name='Democratic',\n    not_category_name='Republican',\n    width_in_pixels=1000,\n    metadata=lambda c: c.get_df()['speaker'],\n    unified_context=False,\n    ignore_categories=False,\n    color_score_column='ColorScore',\n    
left_list_column='ColorScore',\n    y_label=\"Pearson r (correlation to SVM document score)\",\n    x_label='Frequency Ranks',\n    header_names={'upper': 'Top Democratic',\n                  'lower': 'Top Republican'},\n)\n```\n\n[![Pearson correlations](https://raw.githubusercontent.com/JasonKessler/jasonkessler.github.io/master/pearsons.png)](https://raw.githubusercontent.com/JasonKessler/jasonkessler.github.io/master/pearsons.html)\n\n### Using Custom Background Word Frequencies\n\nScattertext relies on a set of general-domain English word frequencies when computing unigram characteristic  \nscores. When running Scattertext on non-English data or in a specific domain, the quality of these scores\nwill degrade.\n\nEnsure that you are on Scattertext 0.1.6 or higher.\n\nTo remedy the degradation, one can add a custom set of background scores to a Corpus-like object,\nusing the `Corpus.set_background_corpus` function. The function takes a `pd.Series` object, indexed on\nterms with numeric count values.\n\nBy default, [Scaled F-Score](#understanding-scaled-f-score) is used to rank how characteristic\nterms are.\n\nThe example below illustrates using Polish background word frequencies.\n\nFirst, we produce a Series object mapping Polish words to their frequencies using a list from\nthe [most-common-words-by-language](https://github.com/oprogramador/most-common-words-by-language) repo.\n\n```python\nimport pandas as pd\n\npolish_word_frequencies = pd.read_csv(\n    'https://raw.githubusercontent.com/hermitdave/FrequencyWords/master/content/2016/pl/pl_50k.txt',\n    sep=' ',\n    names=['Word', 'Frequency']\n).set_index('Word')['Frequency']\n```\n\nNote the composition of the Series:\n\n```python\n\u003e\u003e\u003e polish_word_frequencies\nWord\nnie    5875385\nto     4388099\nsię    3507076\nw      2723767\nna     2309765\nName: Frequency, dtype: int64\n```\n\nNext, we build a DataFrame, `review_df`, consisting of documents which appear (to a non-Polish speaker) to be\npositive and negative hotel reviews from the 
[PolEmo2.0](https://klejbenchmark.com/tasks/) corpus\n(Kocoń et al. 2019). Note this data is under a CC BY-NC-SA 4.0 license. These are labeled as\n\"__label__meta_plus_m\" and \"__label__meta_minus_m\". We will use Scattertext to compare those\nreviews and determine which terms are characteristic of each.\n\n```python\nimport io\nfrom urllib.request import urlopen\nfrom zipfile import ZipFile\n\nimport pandas as pd\nimport spacy\n\nnlp = spacy.blank('pl')\nnlp.add_pipe('sentencizer')\n\nwith ZipFile(io.BytesIO(urlopen(\n        'https://klejbenchmark.com/static/data/klej_polemo2.0-in.zip'\n).read())) as zf:\n    review_df = pd.read_csv(zf.open('train.tsv'), sep='\\t')[\n        lambda df: df.target.isin(['__label__meta_plus_m', '__label__meta_minus_m'])\n    ].assign(\n        Parse=lambda df: df.sentence.apply(nlp)\n    )\n```\n\nNext, we wish to create a `ParsedCorpus` object from `review_df`. In preparation, we first assemble a\nlist of Polish stopwords from the [stopwords](https://github.com/bieli/stopwords/) repository. We also\ncreate the `not_a_word` regular expression to filter out terms which consist entirely of non-word characters\n(such as punctuation).\n\n```python\nimport re\n\npolish_stopwords = {\n    stopword for stopword in\n    urlopen(\n        'https://raw.githubusercontent.com/bieli/stopwords/master/polish.stopwords.txt'\n    ).read().decode('utf-8').split('\\n')\n    if stopword.strip()\n}\n\nnot_a_word = re.compile(r'^\\W+$')\n```\n\nWith these present, we can build a corpus from `review_df` with the category being the binary\n\"target\" column. We reduce the term space to unigrams and then run `filter_out`, which\ntakes a function to determine if a term should be removed from the corpus. The function identifies\nterms which are in the Polish stoplist or consist entirely of non-word characters. 
Finally, terms occurring\nfewer than 20 times in the corpus are removed.\n\nWe set the background frequency Series we created earlier as the background corpus.\n\n```python\ncorpus = st.CorpusFromParsedDocuments(\n    review_df,\n    category_col='target',\n    parsed_col='Parse'\n).build(\n).get_unigram_corpus(\n).filter_out(\n    lambda term: term in polish_stopwords or not_a_word.match(term) is not None\n).remove_infrequent_words(\n    minimum_term_count=20\n).set_background_corpus(\n    polish_word_frequencies\n)\n```\n\nNote that a minimum word count of 20 was chosen to ensure that only around 2,000 terms would be displayed.\n\n```python\n\u003e\u003e\u003e corpus.get_num_terms()\n2023\n```\n\nRunning `get_term_and_background_counts` shows us total term counts in the corpus compared to background\nfrequency counts. We limit this to terms which occur in the corpus.\n\n```python\n\u003e\u003e\u003e corpus.get_term_and_background_counts()[\n...     lambda df: df.corpus \u003e 0\n... ].sort_values(by='corpus', ascending=False)\n\n           background  corpus\nm         341583838.0  4819.0\nhotelu        33108.0  1812.0\nhotel     297974790.0  1651.0\ndoktor       154840.0  1534.0\npolecam           0.0  1438.0\n...               ...     ...\nszoku             0.0    21.0\nbadaniem          0.0    21.0\nbalkonu           0.0    21.0\nstopnia           0.0    21.0\nwobec             0.0    21.0\n```\n\nInterestingly, the term \"polecam\" appears very frequently in the corpus, but does not appear at all\nin the background corpus, making it highly characteristic. 
Judging from Google Translate, it appears to\nmean something related to \"recommend\".\n\nWe are now ready to display the plot.\n\n```python\nhtml = st.produce_scattertext_explorer(\n    corpus,\n    category='__label__meta_plus_m',\n    category_name='Plus-M',\n    not_category_name='Minus-M',\n    minimum_term_frequency=1,\n    width_in_pixels=1000,\n    transform=st.Scalers.dense_rank\n)\n```\n\n[![Polish Scattertext](https://raw.githubusercontent.com/JasonKessler/jasonkessler.github.io/master/polish_pos_neg_scattertext.png)](https://raw.githubusercontent.com/JasonKessler/jasonkessler.github.io/master/polish_pos_neg_scattertext.html)\n\nWe can change the formula which is used to produce the Characteristic scores\nusing the `characteristic_scorer` parameter to `produce_scattertext_explorer`.\n\nIt takes an instance of a descendant of the `CharacteristicScorer` class. See\n[DenseRankCharacteristicness.py](https://github.com/JasonKessler/scattertext/blob/8ddff82f670aa2ed40312b2cdd077e7f0a98a873/scattertext/characteristic/DenseRankCharacteristicness.py#L36)\nfor an example of how to make your own.\n\nExample of plotting with a modified characteristic scorer:\n\n```python\nhtml = st.produce_scattertext_explorer(\n    corpus,\n    category='__label__meta_plus_m',\n    category_name='Plus-M',\n    not_category_name='Minus-M',\n    minimum_term_frequency=1,\n    transform=st.Scalers.dense_rank,\n    characteristic_scorer=st.DenseRankCharacteristicness(),\n    term_ranker=st.termranking.AbsoluteFrequencyRanker,\n    term_scorer=st.ScaledFScorePresets(beta=1, one_to_neg_one=True)\n)\n```\n\n[![Polish Scattertext DenseRank](https://raw.githubusercontent.com/JasonKessler/jasonkessler.github.io/master/polish_dense_rank_characteristic.png)](https://raw.githubusercontent.com/JasonKessler/jasonkessler.github.io/master/polish_dense_rank_characteristic.png)\n\nNote that numbers show up as more characteristic using the Dense Rank Difference. 
It may be that they occur\nunusually frequently in this corpus, or perhaps the background word frequencies undercounted numbers.\n\n### Plotting word productivity\n\nWord productivity is one strategy for plotting word-based charts describing an uncategorized corpus.\n\nProductivity is defined in Schumann (2016) (Jason: check this) as the entropy of n-grams\nwhich contain a term. For the entropy computation, the probability of an n-gram with respect to the term whose\nproductivity is being calculated is the frequency of the n-gram divided by the term's frequency.\n\nSince productivity highly correlates with frequency, the recommended metric to plot is the dense rank difference between\nfrequency and productivity.\n\nThe snippet below plots words in the convention corpus based on their log frequency and their productivity.\n\nThe function `st.whole_corpus_productivity_scores` returns a DataFrame giving each word's productivity.\n\nProductivity scores should be calculated on a `Corpus`-like object which contains a complete set of unigrams and at\nleast bigrams. 
This corpus should not be compacted before the productivity score calculation.\n\nThe terms with lower productivity have more limited usage (e.g., \"thank\" for \"thank you\", \"united\"\nfor \"united states\") while the terms with higher productivity occur in a wider variety of contexts (\"getting\",\n\"actually\", \"political\", etc.).\n\n```python\nimport spacy\nimport scattertext as st\n\ncorpus_no_cat = st.CorpusWithoutCategoriesFromParsedDocuments(\n    st.SampleCorpora.ConventionData2012.get_data().assign(\n        Parse=lambda df: [x for x in spacy.load('en_core_web_sm').pipe(df.text)]),\n    parsed_col='Parse'\n).build()\n\ncompact_corpus_no_cat = corpus_no_cat.get_stoplisted_unigram_corpus().remove_infrequent_words(9)\n\nplot_df = st.whole_corpus_productivity_scores(corpus_no_cat).assign(\n    RankDelta=lambda df: st.RankDifference().get_scores(\n        a=df.Productivity,\n        b=df.Frequency\n    )\n).reindex(\n    compact_corpus_no_cat.get_terms()\n).dropna().assign(\n    X=lambda df: df.Frequency,\n    Xpos=lambda df: st.Scalers.log_scale(df.Frequency),\n    Y=lambda df: df.RankDelta,\n    Ypos=lambda df: st.Scalers.scale(df.RankDelta),\n)\n\nhtml = st.dataframe_scattertext(\n    compact_corpus_no_cat.whitelist_terms(plot_df.index),\n    plot_df=plot_df,\n    metadata=lambda df: df.get_df()['speaker'],\n    ignore_categories=True,\n    x_label='Rank Frequency',\n    y_label=\"Productivity\",\n    left_list_column='Ypos',\n    color_score_column='Ypos',\n    y_axis_labels=['Least Productive', 'Average Productivity', 'Most Productive'],\n    header_names={'upper': 'Most Productive', 'lower': 'Least Productive', 'right': 'Characteristic'},\n    horizontal_line_y_position=0\n)\n\n```\n\n[![Productivity](https://raw.githubusercontent.com/JasonKessler/jasonkessler.github.io/master/convention_single_category_productivity.png)](https://raw.githubusercontent.com/JasonKessler/jasonkessler.github.io/master/convention_single_category_productivity.html)\n\n### 
Understanding Scaled F-Score\n\nLet's now turn our attention to a novel term scoring metric, Scaled F-Score. We'll examine this on a unigram\nversion of the Rotten Tomatoes corpus (Pang et al. 2002). It contains excerpts of\npositive and negative movie reviews.\n\nPlease\nsee [Scaled F Score Explanation](http://nbviewer.jupyter.org/github/JasonKessler/GlobalAI2018/blob/master/notebook/Scaled-F-Score-Explanation.ipynb)\nfor a notebook version of this analysis.\n\n![Scaled F-Score Explanation 1](https://raw.githubusercontent.com/JasonKessler/jasonkessler.github.io/master/scaledfscoreimgs/sfs1.png)\n\n```python\nfrom scipy.stats import hmean\n\nterm_freq_df = corpus.get_unigram_corpus().get_term_freq_df()[['Positive freq', 'Negative freq']]\nterm_freq_df = term_freq_df[term_freq_df.sum(axis=1) \u003e 0]\n\nterm_freq_df['pos_precision'] = (term_freq_df['Positive freq'] * 1. /\n                                 (term_freq_df['Positive freq'] + term_freq_df['Negative freq']))\n\nterm_freq_df['pos_freq_pct'] = (term_freq_df['Positive freq'] * 1.\n                                / term_freq_df['Positive freq'].sum())\n\nterm_freq_df['pos_hmean'] = (term_freq_df\n                             .apply(lambda x: (hmean([x['pos_precision'], x['pos_freq_pct']])\n                                               if x['pos_precision'] \u003e 0 and x['pos_freq_pct'] \u003e 0\n                                               else 0), axis=1))\nterm_freq_df.sort_values(by='pos_hmean', ascending=False).iloc[:10]\n```\n\n![SFS2](https://raw.githubusercontent.com/JasonKessler/jasonkessler.github.io/master/scaledfscoreimgs/sfs2.png)\n\nIf we plot term frequency on the x-axis and the percentage of a term's occurrences\nwhich are in positive documents (i.e., its precision) on the y-axis, we can see\nthat low-frequency terms have a much higher variation in the precision. Given these terms have\nlow frequencies, the harmonic means are low. 
Thus, the only terms which have a high harmonic mean\nare extremely frequent words which tend to all have near-average precisions.\n\n```python\nfrom IPython.display import IFrame\n\nfreq = term_freq_df.pos_freq_pct.values\nprec = term_freq_df.pos_precision.values\nhtml = st.produce_scattertext_explorer(\n    corpus.remove_terms(set(corpus.get_terms()) - set(term_freq_df.index)),\n    category='Positive',\n    not_category_name='Negative',\n    not_categories=['Negative'],\n\n    x_label='Portion of words used in positive reviews',\n    original_x=freq,\n    x_coords=(freq - freq.min()) / freq.max(),\n    x_axis_values=[int(freq.min() * 1000) / 1000.,\n                   int(freq.max() * 1000) / 1000.],\n\n    y_label='Portion of documents containing word that are positive',\n    original_y=prec,\n    y_coords=(prec - prec.min()) / prec.max(),\n    y_axis_values=[int(prec.min() * 1000) / 1000.,\n                   int((prec.max() / 2.) * 1000) / 1000.,\n                   int(prec.max() * 1000) / 1000.],\n    scores=term_freq_df.pos_hmean.values,\n\n    sort_by_dist=False,\n    show_characteristic=False\n)\nfile_name = 'not_normed_freq_prec.html'\nopen(file_name, 'wb').write(html.encode('utf-8'))\nIFrame(src=file_name, width=1300, height=700)\n```\n\n![SFS3](https://raw.githubusercontent.com/JasonKessler/jasonkessler.github.io/master/scaledfscoreimgs/sfs3.png)\n\n![SFS4](https://raw.githubusercontent.com/JasonKessler/jasonkessler.github.io/master/scaledfscoreimgs/sfs4.png)\n\n```python\nfrom scipy.stats import norm\n\n\ndef normcdf(x):\n    return norm.cdf(x, x.mean(), x.std())\n\n\nterm_freq_df['pos_precision_normcdf'] = normcdf(term_freq_df.pos_precision)\n\nterm_freq_df['pos_freq_pct_normcdf'] = normcdf(term_freq_df.pos_freq_pct.values)\n\nterm_freq_df['pos_scaled_f_score'] = hmean(\n    [term_freq_df['pos_precision_normcdf'], term_freq_df['pos_freq_pct_normcdf']])\n\nterm_freq_df.sort_values(by='pos_scaled_f_score', 
ascending=False).iloc[:10]\n```\n\n![SFS5](https://raw.githubusercontent.com/JasonKessler/jasonkessler.github.io/master/scaledfscoreimgs/sfs5.png)\n\n```python\nfreq = term_freq_df.pos_freq_pct_normcdf.values\nprec = term_freq_df.pos_precision_normcdf.values\nhtml = st.produce_scattertext_explorer(\n    corpus.remove_terms(set(corpus.get_terms()) - set(term_freq_df.index)),\n    category='Positive',\n    not_category_name='Negative',\n    not_categories=['Negative'],\n\n    x_label='Portion of words used in positive reviews (norm-cdf)',\n    original_x=freq,\n    x_coords=(freq - freq.min()) / freq.max(),\n    x_axis_values=[int(freq.min() * 1000) / 1000.,\n                   int(freq.max() * 1000) / 1000.],\n\n    y_label='documents containing word that are positive (norm-cdf)',\n    original_y=prec,\n    y_coords=(prec - prec.min()) / prec.max(),\n    y_axis_values=[int(prec.min() * 1000) / 1000.,\n                   int((prec.max() / 2.) * 1000) / 1000.,\n                   int(prec.max() * 1000) / 1000.],\n    scores=term_freq_df.pos_scaled_f_score.values,\n\n    sort_by_dist=False,\n    show_characteristic=False\n)\n```\n\n![SFS6](https://raw.githubusercontent.com/JasonKessler/jasonkessler.github.io/master/scaledfscoreimgs/sfs6.png)\n\n![SFS7](https://raw.githubusercontent.com/JasonKessler/jasonkessler.github.io/master/scaledfscoreimgs/sfs7.png)\n\n```python\nterm_freq_df['neg_precision_normcdf'] = normcdf((term_freq_df['Negative freq'] * 1. 
/\n                                                 (term_freq_df['Negative freq'] + term_freq_df['Positive freq'])))\n\nterm_freq_df['neg_freq_pct_normcdf'] = normcdf((term_freq_df['Negative freq'] * 1.\n                                                / term_freq_df['Negative freq'].sum()))\n\nterm_freq_df['neg_scaled_f_score'] = hmean(\n    [term_freq_df['neg_precision_normcdf'], term_freq_df['neg_freq_pct_normcdf']])\n\nterm_freq_df['scaled_f_score'] = 0\nterm_freq_df.loc[term_freq_df['pos_scaled_f_score'] \u003e term_freq_df['neg_scaled_f_score'],\n                 'scaled_f_score'] = term_freq_df['pos_scaled_f_score']\nterm_freq_df.loc[term_freq_df['pos_scaled_f_score'] \u003c term_freq_df['neg_scaled_f_score'],\n                 'scaled_f_score'] = 1 - term_freq_df['neg_scaled_f_score']\nterm_freq_df['scaled_f_score'] = 2 * (term_freq_df['scaled_f_score'] - 0.5)\nterm_freq_df.sort_values(by='scaled_f_score', ascending=True).iloc[:10]\n```\n\n![SFS8](https://raw.githubusercontent.com/JasonKessler/jasonkessler.github.io/master/scaledfscoreimgs/sfs8.png)\n\n```python\nis_pos = term_freq_df.pos_scaled_f_score \u003e term_freq_df.neg_scaled_f_score\nfreq = term_freq_df.pos_freq_pct_normcdf * is_pos - term_freq_df.neg_freq_pct_normcdf * ~is_pos\nprec = term_freq_df.pos_precision_normcdf * is_pos - term_freq_df.neg_precision_normcdf * ~is_pos\n\n\ndef scale(ar):\n    return (ar - ar.min()) / (ar.max() - ar.min())\n\n\ndef close_gap(ar):\n    ar[ar \u003e 0] -= ar[ar \u003e 0].min()\n    ar[ar \u003c 0] -= ar[ar \u003c 0].max()\n    return ar\n\n\nhtml = st.produce_scattertext_explorer(\n    corpus.remove_terms(set(corpus.get_terms()) - set(term_freq_df.index)),\n    category='Positive',\n    not_category_name='Negative',\n    not_categories=['Negative'],\n\n    x_label='Frequency',\n    original_x=freq,\n    x_coords=scale(close_gap(freq)),\n    x_axis_labels=['Frequent in Neg',\n                   'Not Frequent',\n                   'Frequent in Pos'],\n\n    
y_label='Precision',\n    original_y=prec,\n    y_coords=scale(close_gap(prec)),\n    y_axis_labels=['Neg Precise',\n                   'Imprecise',\n                   'Pos Precise'],\n\n    scores=(term_freq_df.scaled_f_score.values + 1) / 2,\n    sort_by_dist=False,\n    show_characteristic=False\n)\n```\n\n![SFS9](https://raw.githubusercontent.com/JasonKessler/jasonkessler.github.io/master/scaledfscoreimgs/sfs9.png)\n\nWe can use `st.ScaledFScorePresets` as a term scorer to display terms' Scaled F-Score on the y-axis and\nterm frequencies on the x-axis.\n\n```python\nhtml = st.produce_frequency_explorer(\n    corpus.remove_terms(set(corpus.get_terms()) - set(term_freq_df.index)),\n    category='Positive',\n    not_category_name='Negative',\n    not_categories=['Negative'],\n    term_scorer=st.ScaledFScorePresets(beta=1, one_to_neg_one=True),\n    metadata=rdf['movie_name'],\n    grey_threshold=0\n)\n```\n\n![SFS10](https://raw.githubusercontent.com/JasonKessler/jasonkessler.github.io/master/scaledfscoreimgs/sfs10.png)\n\n### Alternative term scoring methods\n\nScaled F-Score is not the only scoring method included in Scattertext. Please click on one of the links below to\nview a notebook which describes how other class association scores work and can be visualized through Scattertext.\n\n* [Google Colab Notebook](https://colab.research.google.com/drive/1snxAP8X6EIDi42FugJ_h5U-fBGDCqtyS) (recommended).\n* [Jupyter Notebook via NBViewer](https://colab.research.google.com/drive/1snxAP8X6EIDi42FugJ_h5U-fBGDCqtyS).\n\nNew in 0.0.2.73 is the delta JS-Divergence scorer, `DeltaJSDivergence` (Gallagher et al. 2020), and its\ncorresponding compactor (`JSDCompactor`). See `demo_deltajsd.py` for an example usage.\n\n### The position-select-plot process\n\nNew in 0.0.2.72\n\nScattertext was originally set up to visualize corpus objects, which are connected sets of documents and\nterms to visualize. 
The \"compaction\" process allows users to eliminate terms which may not be associated with a\ncategory using a variety of feature selection methods. The issue with this is that the terms eliminated during\nthe selection process are not taken into account when scaling term positions.\n\nThis issue can be mitigated by using the position-select-plot process, where term positions are pre-determined\nbefore the selection process is made.\n\nLet's first use the 2012 conventions corpus, update the category names, and create a unigram corpus.\n\n```python\nimport scattertext as st\nimport numpy as np\n\ndf = st.SampleCorpora.ConventionData2012.get_data().assign(\n    parse=lambda df: df.text.apply(st.whitespace_nlp_with_sentences)\n).assign(party=lambda df: df['party'].apply({'democrat': 'Democratic', 'republican': 'Republican'}.get))\n\ncorpus = st.CorpusFromParsedDocuments(\n    df, category_col='party', parsed_col='parse'\n).build().get_unigram_corpus()\n\ncategory_name = 'Democratic'\nnot_category_name = 'Republican'\n```\n\nNext, let's create a dataframe consisting of the original counts and their log-scale positions.\n\n```python\ndef get_log_scale_df(corpus, y_category, x_category):\n    term_coord_df = corpus.get_term_freq_df('')\n\n    # Log scale term counts (with a smoothing constant) as the initial coordinates\n    coord_columns = []\n    for category in [y_category, x_category]:\n        col_name = category + '_coord'\n        term_coord_df[col_name] = np.log(term_coord_df[category] + 1e-6) / np.log(2)\n        coord_columns.append(col_name)\n\n    # Scale these coordinates to between 0 and 1\n    min_offset = term_coord_df[coord_columns].min(axis=0).min()\n    for coord_column in coord_columns:\n        term_coord_df[coord_column] -= min_offset\n    max_offset = term_coord_df[coord_columns].max(axis=0).max()\n    for coord_column in coord_columns:\n        term_coord_df[coord_column] /= max_offset\n    return term_coord_df\n\n\n# Get term coordinates from 
original corpus\nterm_coordinates = get_log_scale_df(corpus, category_name, not_category_name)\nprint(term_coordinates)\n```\n\nHere is a preview of the `term_coordinates` dataframe. The `Democratic` and\n`Republican` columns contain the term counts, while the `_coord` columns\ncontain their logged coordinates. Visualizing 7,973 terms is difficult (but\npossible) for people running Scattertext on most computers.\n\n```\n          Democratic  Republican  Democratic_coord  Republican_coord\nterm\nthank            158         205          0.860166          0.872032\nyou              836         794          0.936078          0.933729\nso               337         212          0.894681          0.873562\nmuch              84          76          0.831380          0.826820\nvery              62          75          0.817543          0.826216\n...              ...         ...               ...               ...\nprecinct           0           2          0.000000          0.661076\ngodspeed           0           1          0.000000          0.629493\nbeauty             0           1          0.000000          0.629493\nbumper             0           1          0.000000          0.629493\nsticker            0           1          0.000000          0.629493\n\n[7973 rows x 4 columns]\n```\n\nWe can visualize this full data set by running the following code block. We'll create a custom\nJavaScript function to populate the tooltip with the original term counts, and create a\nScattertext Explorer where the x and y coordinates and original values are specified from the data\nframe. Additionally, we can use `show_diagonal=True` to draw a dashed diagonal line across the plot area.\n\nYou can click the chart below to see the interactive version. Note that it will take a while to load.\n\n```python\n# The tooltip JS function. 
Note that d is the term data object, and ox and oy are the original x- and y-\n# axis counts.\nget_tooltip_content = ('(function(d) {return d.term + \"\u003cbr/\u003e' + not_category_name + ' Count: \" ' +\n                       '+ d.ox +\"\u003cbr/\u003e' + category_name + ' Count: \" + d.oy})')\n\n\nhtml_orig = st.produce_scattertext_explorer(\n    corpus,\n    category=category_name,\n    not_category_name=not_category_name,\n    minimum_term_frequency=0,\n    pmi_threshold_coefficient=0,\n    width_in_pixels=1000,\n    metadata=corpus.get_df()['speaker'],\n    show_diagonal=True,\n    original_y=term_coordinates[category_name],\n    original_x=term_coordinates[not_category_name],\n    x_coords=term_coordinates[not_category_name + '_coord'],\n    y_coords=term_coordinates[category_name + '_coord'],\n    max_overlapping=3,\n    use_global_scale=True,\n    get_tooltip_content=get_tooltip_content,\n)\n```\n\n[![demo_global_scale_log_orig.png](https://raw.githubusercontent.com/JasonKessler/jasonkessler.github.io/master/demo_global_scale_log_orig.png)](https://jasonkessler.github.io/demo_global_scale_log_orig.html)\n\nNext, we can visualize the compacted version of the corpus. The compaction, using `ClassPercentageCompactor`,\nselects terms which occur frequently in each category. The `term_count` parameter, set to 2, is used to determine\nthe percentage threshold for terms to keep in a particular category. This is done by calculating the\npercentile of terms (types) in each category which appear more than two times. We find the smallest percentile,\nand only include terms which occur above that percentile in a given category.\n\nNote that this compaction leaves only 2,828 terms. 
This number is much easier for Scattertext to display\nin a browser.\n\n```python\n# Select terms which appear a minimum threshold in both corpora\ncompact_corpus = corpus.compact(st.ClassPercentageCompactor(term_count=2))\n\n# Only take term coordinates of terms remaining in corpus\nterm_coordinates = term_coordinates.loc[compact_corpus.get_terms()]\n\nhtml_compact = st.produce_scattertext_explorer(\n    compact_corpus,\n    category=category_name,\n    not_category_name=not_category_name,\n    minimum_term_frequency=0,\n    pmi_threshold_coefficient=0,\n    width_in_pixels=1000,\n    metadata=corpus.get_df()['speaker'],\n    show_diagonal=True,\n    original_y=term_coordinates[category_name],\n    original_x=term_coordinates[not_category_name],\n    x_coords=term_coordinates[not_category_name + '_coord'],\n    y_coords=term_coordinates[category_name + '_coord'],\n    max_overlapping=3,\n    use_global_scale=True,\n    get_tooltip_content=get_tooltip_content,\n)\n```\n\n[![demo_global_scale_log.png](https://raw.githubusercontent.com/JasonKessler/jasonkessler.github.io/master/demo_global_scale_log.png)](https://jasonkessler.github.io/demo_global_scale_log.html)\n\n\n## Advanced uses\n\n### Visualizing differences based on only term frequencies\n\nOccasionally, only term frequency statistics are available. This may happen in the case of very large,\nlost, or proprietary data sets. `TermCategoryFrequencies` is a corpus representation that can accept this\nsort of data, along with any categorized documents that happen to be available.\n\nLet's use the [Corpus of Contemporary American English](https://corpus.byu.edu/coca/) as an example.  
\nWe'll construct a visualization\nto analyze the difference between spoken American English and English that occurs in fiction.\n\n```python\nimport pandas as pd\n\ndf = (pd.read_excel('https://www.wordfrequency.info/files/genres_sample.xls')\n      .dropna()\n      .set_index('lemma')[['SPOKEN', 'FICTION']]\n      .iloc[:1000])\ndf.head()\n'''\n       SPOKEN    FICTION\nlemma\nthe    3859682.0  4092394.0\nI      1346545.0  1382716.0\nthey   609735.0   352405.0\nshe    212920.0   798208.0\nwould  233766.0   229865.0\n'''\n```\n\nTransforming this into a visualization is extremely easy. Just pass a dataframe indexed on\nterms with columns indicating category-counts into the `TermCategoryFrequencies` constructor.\n\n```python\nterm_cat_freq = st.TermCategoryFrequencies(df)\n```\n\nAnd call `produce_scattertext_explorer` normally:\n\n```python\nhtml = st.produce_scattertext_explorer(\n    term_cat_freq,\n    category='SPOKEN',\n    category_name='Spoken',\n    not_category_name='Fiction',\n)\n```\n\n[![demo_category_frequencies.html](https://jasonkessler.github.io/demo_category_frequencies.png)](https://jasonkessler.github.io/demo_category_frequencies.html)\n\nIf you'd like to incorporate some documents into the visualization, you can add them to the\n`TermCategoryFrequencies` object.\n\nFirst, let's extract some example Fiction and Spoken documents from the sample COCA corpus.\n\n```python\nimport requests, zipfile, io\n\ncoca_sample_url = 'http://corpus.byu.edu/cocatext/samples/text.zip'\nzip_file = zipfile.ZipFile(io.BytesIO(requests.get(coca_sample_url).content))\n\ndocument_df = pd.DataFrame(\n    [{'text': zip_file.open(fn).read().decode('utf-8'),\n      'category': 'SPOKEN'}\n     for fn in zip_file.filelist if fn.filename.startswith('w_spok')][:2]\n    + [{'text': zip_file.open(fn).read().decode('utf-8'),\n        'category': 'FICTION'}\n       for fn in zip_file.filelist if fn.filename.startswith('w_fic')][:2])\n```\n\nAnd we'll pass the `document_df` dataframe into 
`TermCategoryFrequencies` via the `document_category_df`\nparameter. Ensure the dataframe has two columns, 'text' and 'category'. Afterward, we can\ncall `produce_scattertext_explorer` (or your visualization function of choice) normally.\n\n```python\ndoc_term_cat_freq = st.TermCategoryFrequencies(df, document_category_df=document_df)\n\nhtml = st.produce_scattertext_explorer(\n    doc_term_cat_freq,\n    category='SPOKEN',\n    category_name='Spoken',\n    not_category_name='Fiction',\n)\n```\n\n### Visualizing query-based categorical differences\n\nWord representations have recently become a hot topic in NLP. While lots of work has been done visualizing\nhow terms relate to one another given their scores\n(e.g., [http://projector.tensorflow.org/](http://projector.tensorflow.org/)),\nnone to my knowledge has been done visualizing how we can use these to examine how\ndocument categories differ.\n\nIn this example, given the query term \"jobs\", we can see how Republicans and\nDemocrats talk about it differently.\n\nIn this configuration of Scattertext, words are colored by their similarity to a query phrase.  \nThis is done using [spaCy](https://spacy.io/)-provided GloVe word vectors (trained on\nthe Common Crawl corpus). The cosine distance between vectors is used,\nwith mean vectors used for phrases.\n\nThe calculation of the most similar terms associated with each category is a simple heuristic. First,\nsets of terms closely associated with a category are found. Second, these terms are ranked\nbased on their similarity to the query, and the top-ranked terms are displayed to the right of the\nscatterplot.\n\nA term is considered associated if its p-value is less than 0.05. P-values are\ndetermined using Monroe et al. (2008)'s difference in the weighted log-odds-ratios with an\nuninformative Dirichlet prior. This is the only model-based method discussed in Monroe et al.\nthat does not rely on a large, in-domain background corpus. 
Since we are scoring\nbigrams in addition to the unigrams scored by Monroe, the size of the corpus would have to be larger\nto have high enough bigram counts for proper penalization. This function\nrelies on the Dirichlet distribution's parameter alpha, a vector, which is uniformly set to 0.01.\n\nHere is the code to produce such a visualization.\n\n```pydocstring\n\u003e\u003e\u003e from scattertext import word_similarity_explorer\n\u003e\u003e\u003e html = word_similarity_explorer(corpus,\n...                                 category='democrat',\n...                                 category_name='Democratic',\n...                                 not_category_name='Republican',\n...                                 target_term='jobs',\n...                                 minimum_term_frequency=5,\n...                                 pmi_threshold_coefficient=4,\n...                                 width_in_pixels=1000,\n...                                 metadata=convention_df['speaker'],\n...                                 alpha=0.01,\n...                                 max_p_val=0.05,\n...                                 save_svg_button=True)\n\u003e\u003e\u003e open(\"Convention-Visualization-Jobs.html\", 'wb').write(html.encode('utf-8'))\n``` \n\n[![Convention-Visualization-Jobs.html](https://jasonkessler.github.io/Convention-Visualization-Jobs.png)](https://jasonkessler.github.io/Convention-Visualization-Jobs.html)\n\n#### Developing and using bespoke word representations\n\nScattertext can interface with Gensim Word2Vec models. For example, here's a snippet from `demo_gensim_similarity.py`\nwhich illustrates how to train and use a word2vec model on a corpus. 
Note the similarities produced\nreflect quirks of the corpus, e.g., \"8\" tends to refer to the 8% unemployment rate at the time of the\nconvention.\n\n```python\nimport spacy\nfrom gensim.models import word2vec\nfrom scattertext import SampleCorpora, word_similarity_explorer_gensim, Word2VecFromParsedCorpus\nfrom scattertext.CorpusFromParsedDocuments import CorpusFromParsedDocuments\n\nnlp = spacy.en.English()\nconvention_df = SampleCorpora.ConventionData2012.get_data()\nconvention_df['parsed'] = convention_df.text.apply(nlp)\ncorpus = CorpusFromParsedDocuments(convention_df, category_col='party', parsed_col='parsed').build()\nmodel = word2vec.Word2Vec(size=300,\n                          alpha=0.025,\n                          window=5,\n                          min_count=5,\n                          max_vocab_size=None,\n                          sample=0,\n                          seed=1,\n                          workers=1,\n                          min_alpha=0.0001,\n                          sg=1,\n                          hs=1,\n                          negative=0,\n                          cbow_mean=0,\n                          iter=1,\n                          null_word=0,\n                          trim_rule=None,\n                          sorted_vocab=1)\nhtml = word_similarity_explorer_gensim(corpus,\n                                       category='democrat',\n                                       category_name='Democratic',\n                                       not_category_name='Republican',\n                                       target_term='jobs',\n                                       minimum_term_frequency=5,\n                                       pmi_threshold_coefficient=4,\n                                       width_in_pixels=1000,\n                                       metadata=convention_df['speaker'],\n                                       word2vec=Word2VecFromParsedCorpus(corpus, model).train(),\n                      
                 max_p_val=0.05,\n                                       save_svg_button=True)\nopen('./demo_gensim_similarity.html', 'wb').write(html.encode('utf-8'))\n```\n\nHow Democrats and Republicans talked differently about \"jobs\" in their 2012 convention speeches.\n[![Convention-Visualization-Jobs.html](https://jasonkessler.github.io/demo_gensim_similarity.png)](https://jasonkessler.github.io/demo_gensim_similarity.html)\n\n### Visualizing any kind of term score\n\nWe can use Scattertext to visualize alternative types of word scores, and ensure that 0 scores are greyed out. Use\nthe `sparse_explorer` function to accomplish this, and see its source code for more details.\n\n```pydocstring\n\u003e\u003e\u003e from sklearn.linear_model import Lasso\n\u003e\u003e\u003e from scattertext import sparse_explorer\n\u003e\u003e\u003e html = sparse_explorer(corpus,\n...                        category='democrat',\n...                        category_name='Democratic',\n...                        not_category_name='Republican',\n...                        scores = corpus.get_regression_coefs('democrat', Lasso(max_iter=10000)),\n...                        minimum_term_frequency=5,\n...                        pmi_threshold_coefficient=4,\n...                        width_in_pixels=1000,\n...                        metadata=convention_df['speaker'])\n\u003e\u003e\u003e open('./Convention-Visualization-Sparse.html', 'wb').write(html.encode('utf-8'))\n```\n\n[![Convention-Visualization-Sparse.html](https://jasonkessler.github.io/Convention-Visualization-Sparse.png)](https://jasonkessler.github.io/Convention-Visualization-Sparse.html)\n\n### Custom term positions\n\nYou can also use custom term positions and axis labels. For example, you can base terms' y-axis\npositions on a regression coefficient and their x-axis on term frequency and label the axes\naccordingly. 
The one catch is that axis positions must be scaled between 0 and 1.\n\nFirst, let's define two scaling functions: `scale` to project positive values to \\[0,1\\], and\n`zero_centered_scale` to project real values to \\[0,1\\], with negative values always \\\u003c0.5, and\npositive values always \\\u003e0.5.\n\n```pydocstring\n\u003e\u003e\u003e def scale(ar):\n...     return (ar - ar.min()) / (ar.max() - ar.min())\n...\n\u003e\u003e\u003e def zero_centered_scale(ar):\n...     ar[ar \u003e 0] = scale(ar[ar \u003e 0])\n...     ar[ar \u003c 0] = -scale(-ar[ar \u003c 0])\n...     return (ar + 1) / 2.\n```\n\nNext, let's compute and scale term frequencies and L2-penalized regression coefficients. We'll\nhang on to the original coefficients and allow users to view them by mousing over terms.\n\n```pydocstring\n\u003e\u003e\u003e from sklearn.linear_model import LogisticRegression\n\u003e\u003e\u003e import numpy as np\n\u003e\u003e\u003e\n\u003e\u003e\u003e frequencies_scaled = scale(np.log(term_freq_df.sum(axis=1).values))\n\u003e\u003e\u003e scores = corpus.get_logreg_coefs('democrat',\n...                                  LogisticRegression(penalty='l2', C=10, max_iter=10000, n_jobs=-1))\n\u003e\u003e\u003e scores_scaled = zero_centered_scale(scores)\n```\n\nFinally, we can write the visualization. Note the use of the `x_coords` and `y_coords`\nparameters to store the respective coordinates, the `scores` and `sort_by_dist` arguments\nto register the original coefficients and use them to rank the terms in the r","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2FJasonKessler%2Fscattertext","html_url":"https://awesome.ecosyste.ms/projects/github.com%2FJasonKessler%2Fscattertext","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2FJasonKessler%2Fscattertext/lists"}