{"id":22647772,"url":"https://github.com/davidhintelmann/natural-language-processing","last_synced_at":"2025-03-29T06:48:28.018Z","repository":{"id":156030355,"uuid":"302273854","full_name":"davidhintelmann/Natural-Language-Processing","owner":"davidhintelmann","description":"Natural Language Processing Using python's Scapy Library","archived":false,"fork":false,"pushed_at":"2020-10-11T03:42:00.000Z","size":7088,"stargazers_count":0,"open_issues_count":0,"forks_count":0,"subscribers_count":1,"default_branch":"main","last_synced_at":"2025-02-03T20:03:21.073Z","etag":null,"topics":[],"latest_commit_sha":null,"homepage":null,"language":"Jupyter Notebook","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":null,"status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/davidhintelmann.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":null,"code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2020-10-08T08:14:15.000Z","updated_at":"2020-10-11T03:42:03.000Z","dependencies_parsed_at":null,"dependency_job_id":"7a91f2c3-2f53-4258-9965-551010998f64","html_url":"https://github.com/davidhintelmann/Natural-Language-Processing","commit_stats":null,"previous_names":[],"tags_count":0,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/davidhintelmann%2FNatural-Language-Processing","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/davidhintelmann%2FNatural-Language-Processing/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/davidhintelmann%2FNatural-Language-Processing/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/davidhin
telmann%2FNatural-Language-Processing/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/davidhintelmann","download_url":"https://codeload.github.com/davidhintelmann/Natural-Language-Processing/tar.gz/refs/heads/main","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":246150409,"owners_count":20731419,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":[],"created_at":"2024-12-09T07:34:33.918Z","updated_at":"2025-03-29T06:48:27.992Z","avatar_url":"https://github.com/davidhintelmann.png","language":"Jupyter Notebook","funding_links":[],"categories":[],"sub_categories":[],"readme":"# Natural Language Processing With the spaCy Library for Python\n\nThis notebook starts by using [PRAW](https://praw.readthedocs.io/en/latest/#), \"The Python Reddit API Wrapper\", which will be used to download comments from the subreddit [r/news/](https://www.reddit.com/r/news/).\n\nPRAW can be used to create chatbots on Reddit or just to scrape data from it to gain insights into online social media. 
I will then use [spaCy](https://spacy.io/) to perform natural language processing, since this Python library already has pretrained models for this task.\n\nWe will try to create an algorithm to detect online harassment, and in particular to flag if a comment has a high likelihood of containing hate speech.\n\n# Import Libraries\n\n\n```python\nimport spacy\nimport praw\nimport pandas as pd\nimport json\nimport numpy as np\nimport matplotlib.pyplot as plt\nfrom collections import Counter\nfrom sklearn.metrics import plot_confusion_matrix\nfrom sklearn.preprocessing import LabelBinarizer\nfrom sklearn.preprocessing import OneHotEncoder\nfrom sklearn.model_selection import train_test_split\nfrom sklearn.naive_bayes import GaussianNB\nfrom sklearn.neighbors import KNeighborsClassifier\nfrom sklearn.model_selection import GridSearchCV\nfrom sklearn.ensemble import RandomForestClassifier\n```\n\n# Reddit PRAW\n\nGo to this [page](https://www.reddit.com/prefs/apps) to create an app on Reddit's API page.\n\n\nRules for Reddit's API can be found [here](https://github.com/reddit-archive/reddit/wiki/API).\n\nThe instructions below for creating a Reddit app have been taken from [Felippe Rodrigues's](https://www.storybench.org/how-to-scrape-reddit-with-python/) post on [storybench.org](https://www.storybench.org/).\n\n![image](https://www.storybench.org/wp-content/uploads/2018/03/Screen-Shot-2018-02-28-at-5.37.01-PM.png)\n\nThis form should open up:\n![image](https://www.storybench.org/wp-content/uploads/2018/03/Screen-Shot-2018-02-28-at-6.55.38-PM.png)\n\nPick a name for your application and add a description for reference. Also make sure you select the “script” option and don’t forget to put http://localhost:8080 in the redirect URI field. 
If you have any doubts, refer to the [PRAW documentation](https://praw.readthedocs.io/en/latest/getting_started/authentication.html#script-application).\n\nHit create app and now you are ready to use the OAuth2 authorization to connect to the API and start scraping. Copy and paste your 14-character personal use script and 27-character secret key somewhere safe. Your application should look like this:\n\n![image](https://www.storybench.org/wp-content/uploads/2018/03/Screen-Shot-2018-02-28-at-7.02.45-PM.png)\n\nNote that the values of the fields `client_id`, `client_secret`, `user_agent`, `username`, and `password` below need to be replaced with your own from the Reddit API. I keep a reddit.json file in my parent directory and read the values from this file, which are then fed into the `praw.Reddit()` function.\n\n\n```python\nid_ = ''\nsecret_ = ''\nagent_ = ''\nusername_ = ''\npassword_ = ''\n\nwith open('../reddit.json', 'r') as file:\n    lines = json.load(file)\n    id_ += lines['client_id']\n    secret_ += lines['client_secret']\n    agent_ += lines['user_agent']\n    username_ += lines['username']\n    password_ += lines['password']\n```\n\n\n```python\nreddit = praw.Reddit(client_id=id_,\n                     client_secret=secret_,\n                     user_agent=agent_,\n                     username=username_,\n                     password=password_)\n```\n\n## [r/news/](https://www.reddit.com/r/news/) subreddit.\n\nThe subreddit of interest is `r/news/`, though one can edit the first line of the cell below to change which subreddit to look at. 
We will only be looking at one article in this subreddit.\n\n\n```python\nsubreddit = reddit.subreddit('news')\ntop_subreddit = subreddit.top(limit=10)\n```\n\n\n```python\ntopics_dict = {\"title\":[], \"score\":[], \"id\":[], \"url\":[], \"comms_num\": [], \"created\": [], \"body\":[]}\nfor n in top_subreddit:\n    topics_dict[\"title\"].append(n.title)\n    topics_dict[\"score\"].append(n.score)\n    topics_dict[\"id\"].append(n.id)\n    topics_dict[\"url\"].append(n.url)\n    topics_dict[\"comms_num\"].append(n.num_comments)\n    topics_dict[\"created\"].append(n.created)\n    topics_dict[\"body\"].append(n.selftext)\n```\n\n\n```python\ndf = pd.DataFrame.from_dict(topics_dict)\n```\n\n\n```python\ndf.head()\n```\n\n\n\n\n\u003cdiv\u003e\n\u003cstyle scoped\u003e\n    .dataframe tbody tr th:only-of-type {\n        vertical-align: middle;\n    }\n\n    .dataframe tbody tr th {\n        vertical-align: top;\n    }\n\n    .dataframe thead th {\n        text-align: right;\n    }\n\u003c/style\u003e\n\u003ctable border=\"1\" class=\"dataframe\"\u003e\n  \u003cthead\u003e\n    \u003ctr style=\"text-align: right;\"\u003e\n      \u003cth\u003e\u003c/th\u003e\n      \u003cth\u003etitle\u003c/th\u003e\n      \u003cth\u003escore\u003c/th\u003e\n      \u003cth\u003eid\u003c/th\u003e\n      \u003cth\u003eurl\u003c/th\u003e\n      \u003cth\u003ecomms_num\u003c/th\u003e\n      \u003cth\u003ecreated\u003c/th\u003e\n      \u003cth\u003ebody\u003c/th\u003e\n    \u003c/tr\u003e\n  \u003c/thead\u003e\n  \u003ctbody\u003e\n    \u003ctr\u003e\n      \u003cth\u003e0\u003c/th\u003e\n      \u003ctd\u003eBlizzard Employees Staged a Walkout After the ...\u003c/td\u003e\n      \u003ctd\u003e226332\u003c/td\u003e\n      \u003ctd\u003edfn3yi\u003c/td\u003e\n      \u003ctd\u003ehttps://www.thedailybeast.com/blizzard-employe...\u003c/td\u003e\n      \u003ctd\u003e9608\u003c/td\u003e\n      \u003ctd\u003e1.570683e+09\u003c/td\u003e\n      \u003ctd\u003e\u003c/td\u003e\n    \u003c/tr\u003e\n 
   \u003ctr\u003e\n      \u003cth\u003e1\u003c/th\u003e\n      \u003ctd\u003ePresident Donald Trump says he has tested posi...\u003c/td\u003e\n      \u003ctd\u003e226153\u003c/td\u003e\n      \u003ctd\u003ej3oj21\u003c/td\u003e\n      \u003ctd\u003ehttps://www.cnbc.com/2020/10/02/president-dona...\u003c/td\u003e\n      \u003ctd\u003e34754\u003c/td\u003e\n      \u003ctd\u003e1.601644e+09\u003c/td\u003e\n      \u003ctd\u003e\u003c/td\u003e\n    \u003c/tr\u003e\n    \u003ctr\u003e\n      \u003cth\u003e2\u003c/th\u003e\n      \u003ctd\u003eKobe Bryant killed in helicopter crash in Cali...\u003c/td\u003e\n      \u003ctd\u003e213684\u003c/td\u003e\n      \u003ctd\u003eeubjfc\u003c/td\u003e\n      \u003ctd\u003ehttps://www.fox5dc.com/news/kobe-bryant-killed...\u003c/td\u003e\n      \u003ctd\u003e20656\u003c/td\u003e\n      \u003ctd\u003e1.580096e+09\u003c/td\u003e\n      \u003ctd\u003e\u003c/td\u003e\n    \u003c/tr\u003e\n    \u003ctr\u003e\n      \u003cth\u003e3\u003c/th\u003e\n      \u003ctd\u003eScientist Stephen Hawking has died aged 76\u003c/td\u003e\n      \u003ctd\u003e188177\u003c/td\u003e\n      \u003ctd\u003e84aebi\u003c/td\u003e\n      \u003ctd\u003ehttp://news.sky.com/story/scientist-stephen-ha...\u003c/td\u003e\n      \u003ctd\u003e6913\u003c/td\u003e\n      \u003ctd\u003e1.521028e+09\u003c/td\u003e\n      \u003ctd\u003e\u003c/td\u003e\n    \u003c/tr\u003e\n    \u003ctr\u003e\n      \u003cth\u003e4\u003c/th\u003e\n      \u003ctd\u003eJeffrey Epstein's autopsy more consistent with...\u003c/td\u003e\n      \u003ctd\u003e186237\u003c/td\u003e\n      \u003ctd\u003edp5lr1\u003c/td\u003e\n      \u003ctd\u003ehttps://www.foxnews.com/us/forensic-pathologis...\u003c/td\u003e\n      \u003ctd\u003e10043\u003c/td\u003e\n      \u003ctd\u003e1.572465e+09\u003c/td\u003e\n      \u003ctd\u003e\u003c/td\u003e\n    \u003c/tr\u003e\n  \u003c/tbody\u003e\n\u003c/table\u003e\n\u003c/div\u003e\n\n\n\n\n```python\nsubmission = 
reddit.submission(id=\"eubjfc\")\nsubmission.comment_sort = \"new\"\nlen(submission.comments.list())\n```\n\n\n\n\n    294\n\n\n\nWe will inspect only one article for hate speech. Since this article alone has over 20,000 comments, which would take a long time for my computer to process, I will limit how far PRAW 'scrolls' down the page to load more comments with the `limit=99` parameter. The cell below initializes a dictionary of lists, one per key, which is filled with values from each comment and then used to create a pandas dataframe.\n\n\n```python\ncomments_dict = {\"author\":[], \"score\":[], \"id\":[], \"replies\":[], \"edited\": [], \"created\": [], \"body\":[]}\n\nsubmission.comments.replace_more(limit=99)\nfor comment in submission.comments.list():\n    comments_dict[\"author\"].append(comment.author)\n    comments_dict[\"score\"].append(comment.score)\n    comments_dict[\"id\"].append(comment.id)\n    comments_dict[\"replies\"].append(comment.replies)\n    comments_dict[\"edited\"].append(comment.edited)\n    comments_dict[\"created\"].append(comment.created_utc)\n    #body_ = comment.body.replace\n    comments_dict[\"body\"].append(comment.body)\n```\n\n\n```python\ndf_ = pd.DataFrame.from_dict(comments_dict)\n```\n\n\n```python\ndf_.head(2)\n```\n\n\n\n\n\u003cdiv\u003e\n\u003cstyle scoped\u003e\n    .dataframe tbody tr th:only-of-type {\n        vertical-align: middle;\n    }\n\n    .dataframe tbody tr th {\n        vertical-align: top;\n    }\n\n    .dataframe thead th {\n        text-align: right;\n    }\n\u003c/style\u003e\n\u003ctable border=\"1\" class=\"dataframe\"\u003e\n  \u003cthead\u003e\n    \u003ctr style=\"text-align: right;\"\u003e\n      \u003cth\u003e\u003c/th\u003e\n      \u003cth\u003eauthor\u003c/th\u003e\n      \u003cth\u003escore\u003c/th\u003e\n      \u003cth\u003eid\u003c/th\u003e\n      \u003cth\u003ereplies\u003c/th\u003e\n      \u003cth\u003eedited\u003c/th\u003e\n      \u003cth\u003ecreated\u003c/th\u003e\n      
\u003cth\u003ebody\u003c/th\u003e\n    \u003c/tr\u003e\n  \u003c/thead\u003e\n  \u003ctbody\u003e\n    \u003ctr\u003e\n      \u003cth\u003e0\u003c/th\u003e\n      \u003ctd\u003ehoosakiwi\u003c/td\u003e\n      \u003ctd\u003e1\u003c/td\u003e\n      \u003ctd\u003effnzwqh\u003c/td\u003e\n      \u003ctd\u003e()\u003c/td\u003e\n      \u003ctd\u003e1.58007e+09\u003c/td\u003e\n      \u003ctd\u003e1.580073e+09\u003c/td\u003e\n      \u003ctd\u003eLet's try and keep the comments here civil. Yo...\u003c/td\u003e\n    \u003c/tr\u003e\n    \u003ctr\u003e\n      \u003cth\u003e1\u003c/th\u003e\n      \u003ctd\u003erandy88moss\u003c/td\u003e\n      \u003ctd\u003e1\u003c/td\u003e\n      \u003ctd\u003efx335fo\u003c/td\u003e\n      \u003ctd\u003e()\u003c/td\u003e\n      \u003ctd\u003eFalse\u003c/td\u003e\n      \u003ctd\u003e1.594030e+09\u003c/td\u003e\n      \u003ctd\u003eThinking about you, Mamba!\u003c/td\u003e\n    \u003c/tr\u003e\n  \u003c/tbody\u003e\n\u003c/table\u003e\n\u003c/div\u003e\n\n\n\n\n```python\ntxt = ''\nfor n in df_['body'].values:\n    n = n.replace(\"\\n\",\"\").strip()\n    txt +=n\n```\n\n### Frequency Counter\n\nWe can now start to use natural language processing to count the frequency of the words occurring in the text above. Words that appear more often can tell us whether the title of the news article, \"Kobe Bryant killed in helicopter crash in California\", matches the content of the comments. 
\n\nThen we will go on to create a model to predict whether a comment is hateful speech or not.\n\n\n```python\nnlp = spacy.load(\"en_core_web_sm\")\n\nall_comments = nlp(txt)\nwords = [chunk.text for chunk in all_comments if not chunk.is_stop and not chunk.is_punct]\n```\n\n\n```python\ncomment_counts = Counter(words)\n```\n\n\n```python\ncomment_counts.most_common(10)\n```\n\n\n\n\n    [('Kobe', 1160),\n     (' ', 1116),\n     ('people', 772),\n     ('like', 739),\n     ('daughter', 577),\n     ('helicopter', 556),\n     ('know', 531),\n     ('basketball', 450),\n     ('family', 364),\n     ('time', 363)]\n\n\n\n### Lemmatization\n\nFrom [Wikipedia](https://en.wikipedia.org/wiki/Lemmatisation),\n\n\u003eLemmatisation (or lemmatization) in linguistics is the process of grouping together the inflected forms of a word so they can be analysed as a single item, identified by the word's lemma, or dictionary form.\n\nFor example, organizes, organized and organizing are all forms of organize. Here, organize is the lemma. The inflection of a word allows you to express different grammatical categories like tense (organized vs organize), number (trains vs train), and so on. Lemmatization is necessary because it helps you reduce the inflected forms of a word so that they can be analyzed as a single item. It can also help you normalize the text.\n\nspaCy has the attribute `lemma_` on the `Token` class. 
This attribute has the lemmatized form of a token:\n\n\n```python\nlemmas = [txt.lemma_ for txt in all_comments if not txt.is_stop and not txt.is_punct]\n```\n\n\n```python\nlemma_counts = Counter(lemmas)\nlemma_counts.most_common(10)\n```\n\n\n\n\n    [('Kobe', 1159),\n     (' ', 1116),\n     ('like', 808),\n     ('daughter', 806),\n     ('people', 782),\n     ('know', 742),\n     ('helicopter', 735),\n     ('die', 589),\n     ('say', 556),\n     ('go', 514)]\n\n\n\nWe still have the issue of a space value being the second most common word in the comments text.\n\n### Preprocess Text\n\n\n```python\ndef is_token_allowed(token):\n    '''\n     Only allow valid tokens which are not stop words\n     and punctuation symbols.\n    '''\n    if (not token or not token.text.strip() or token.is_stop or token.is_punct):\n        return False\n    return True\n\ndef preprocess_token(token):\n    # Reduce token to its lowercase lemma form\n    return token.lemma_.strip().lower()\n\n```\n\n\n```python\ntext_processed = [preprocess_token(txt) for txt in all_comments if is_token_allowed(txt)]\ntext_processed_counter = Counter(text_processed)\ntext_processed_counter.most_common(10)\n```\n\n\n\n\n    [('kobe', 1320),\n     ('people', 845),\n     ('daughter', 809),\n     ('like', 808),\n     ('helicopter', 743),\n     ('know', 742),\n     ('die', 589),\n     ('say', 556),\n     ('go', 514),\n     ('basketball', 487)]\n\n\n\nWe see an improvement above: the space value that showed up as the second most common word in the previous two frequency counters is gone.  
\nThough we can only visually inspect if the comments are related to the article heading, we will move on to creating a machine learning model to predict if a comment is hateful or not.\n\n# Clean Train Data\n\nText data used for this classification problem is from a [GitHub repository](https://github.com/t-davidson/hate-speech-and-offensive-language) for the [paper](https://arxiv.org/abs/1703.04009) \"Automated Hate Speech Detection and the Problem of Offensive Language\", ICWSM 2017. Dataset was created as described by the paper:\n\n\u003eWe begin with a hate speech lexicon containing words and phrases identified by internet users as hate speech, compiled by Hatebase.org... Workers were asked to label each tweet as one of three categories: hate speech, offensive but not hate speech, or neither offensive nor hate speech. They were provided with our definition along with a paragraph explaining it in further detail. Users were asked to think not just about the words appearing in a given tweet but about the context in which they were used. They were instructed that the presence of a particular word, however offensive, did not necessarily indicate a tweet is hate speech. 
Each tweet was coded by three or more people.\n\nWe will use this dataset to create a supervised machine learning model to predict the classification of the text in a Reddit comment.\n\n\n```python\ndf_twitter = pd.read_csv('labeled_data.csv')\n```\n\n\n```python\ndf_twitter.head()\n```\n\n\n\n\n\u003cdiv\u003e\n\u003cstyle scoped\u003e\n    .dataframe tbody tr th:only-of-type {\n        vertical-align: middle;\n    }\n\n    .dataframe tbody tr th {\n        vertical-align: top;\n    }\n\n    .dataframe thead th {\n        text-align: right;\n    }\n\u003c/style\u003e\n\u003ctable border=\"1\" class=\"dataframe\"\u003e\n  \u003cthead\u003e\n    \u003ctr style=\"text-align: right;\"\u003e\n      \u003cth\u003e\u003c/th\u003e\n      \u003cth\u003eUnnamed: 0\u003c/th\u003e\n      \u003cth\u003ecount\u003c/th\u003e\n      \u003cth\u003ehate_speech\u003c/th\u003e\n      \u003cth\u003eoffensive_language\u003c/th\u003e\n      \u003cth\u003eneither\u003c/th\u003e\n      \u003cth\u003eclass\u003c/th\u003e\n      \u003cth\u003etweet\u003c/th\u003e\n    \u003c/tr\u003e\n  \u003c/thead\u003e\n  \u003ctbody\u003e\n    \u003ctr\u003e\n      \u003cth\u003e0\u003c/th\u003e\n      \u003ctd\u003e0\u003c/td\u003e\n      \u003ctd\u003e3\u003c/td\u003e\n      \u003ctd\u003e0\u003c/td\u003e\n      \u003ctd\u003e0\u003c/td\u003e\n      \u003ctd\u003e3\u003c/td\u003e\n      \u003ctd\u003e2\u003c/td\u003e\n      \u003ctd\u003e!!! RT @mayasolovely: As a woman you shouldn't...\u003c/td\u003e\n    \u003c/tr\u003e\n    \u003ctr\u003e\n      \u003cth\u003e1\u003c/th\u003e\n      \u003ctd\u003e1\u003c/td\u003e\n      \u003ctd\u003e3\u003c/td\u003e\n      \u003ctd\u003e0\u003c/td\u003e\n      \u003ctd\u003e3\u003c/td\u003e\n      \u003ctd\u003e0\u003c/td\u003e\n      \u003ctd\u003e1\u003c/td\u003e\n      \u003ctd\u003e!!!!! 
RT @mleew17: boy dats cold...tyga dwn ba...\u003c/td\u003e\n    \u003c/tr\u003e\n    \u003ctr\u003e\n      \u003cth\u003e2\u003c/th\u003e\n      \u003ctd\u003e2\u003c/td\u003e\n      \u003ctd\u003e3\u003c/td\u003e\n      \u003ctd\u003e0\u003c/td\u003e\n      \u003ctd\u003e3\u003c/td\u003e\n      \u003ctd\u003e0\u003c/td\u003e\n      \u003ctd\u003e1\u003c/td\u003e\n      \u003ctd\u003e!!!!!!! RT @UrKindOfBrand Dawg!!!! RT @80sbaby...\u003c/td\u003e\n    \u003c/tr\u003e\n    \u003ctr\u003e\n      \u003cth\u003e3\u003c/th\u003e\n      \u003ctd\u003e3\u003c/td\u003e\n      \u003ctd\u003e3\u003c/td\u003e\n      \u003ctd\u003e0\u003c/td\u003e\n      \u003ctd\u003e2\u003c/td\u003e\n      \u003ctd\u003e1\u003c/td\u003e\n      \u003ctd\u003e1\u003c/td\u003e\n      \u003ctd\u003e!!!!!!!!! RT @C_G_Anderson: @viva_based she lo...\u003c/td\u003e\n    \u003c/tr\u003e\n    \u003ctr\u003e\n      \u003cth\u003e4\u003c/th\u003e\n      \u003ctd\u003e4\u003c/td\u003e\n      \u003ctd\u003e6\u003c/td\u003e\n      \u003ctd\u003e0\u003c/td\u003e\n      \u003ctd\u003e6\u003c/td\u003e\n      \u003ctd\u003e0\u003c/td\u003e\n      \u003ctd\u003e1\u003c/td\u003e\n      \u003ctd\u003e!!!!!!!!!!!!! RT @ShenikaRoberts: The shit you...\u003c/td\u003e\n    \u003c/tr\u003e\n  \u003c/tbody\u003e\n\u003c/table\u003e\n\u003c/div\u003e\n\n\n\nThe tweet column above has a lot of formatting issues that should be fixed before it is fed into an ML model. Since tweets can be nested, we need to clean them up a bit. A rudimentary fix has been made below, but more time spent cleaning up the data would create a better model down the road.\n\n\n```python\ndf_twitter['tweet'] = df_twitter['tweet'].str.replace('\u0026amp;','and').str.strip('! RT')\n```\n\n\n```python\ndf_twitter.head()\n```\n\n\n\n\n\u003cdiv\u003e\n\u003cstyle scoped\u003e\n    .dataframe tbody tr th:only-of-type {\n        vertical-align: middle;\n    }\n\n    .dataframe tbody tr th {\n        vertical-align: top;\n    }\n\n    .dataframe thead th {\n        text-align: right;\n    }\n\u003c/style\u003e\n\u003ctable border=\"1\" class=\"dataframe\"\u003e\n  \u003cthead\u003e\n    \u003ctr style=\"text-align: right;\"\u003e\n      \u003cth\u003e\u003c/th\u003e\n      \u003cth\u003eUnnamed: 0\u003c/th\u003e\n      \u003cth\u003ecount\u003c/th\u003e\n      \u003cth\u003ehate_speech\u003c/th\u003e\n      \u003cth\u003eoffensive_language\u003c/th\u003e\n      \u003cth\u003eneither\u003c/th\u003e\n      \u003cth\u003eclass\u003c/th\u003e\n      \u003cth\u003etweet\u003c/th\u003e\n    \u003c/tr\u003e\n  \u003c/thead\u003e\n  \u003ctbody\u003e\n    \u003ctr\u003e\n      \u003cth\u003e0\u003c/th\u003e\n      \u003ctd\u003e0\u003c/td\u003e\n      \u003ctd\u003e3\u003c/td\u003e\n      \u003ctd\u003e0\u003c/td\u003e\n      \u003ctd\u003e0\u003c/td\u003e\n      \u003ctd\u003e3\u003c/td\u003e\n      \u003ctd\u003e2\u003c/td\u003e\n      \u003ctd\u003e@mayasolovely: As a woman you shouldn't compla...\u003c/td\u003e\n    \u003c/tr\u003e\n    \u003ctr\u003e\n      \u003cth\u003e1\u003c/th\u003e\n      \u003ctd\u003e1\u003c/td\u003e\n      \u003ctd\u003e3\u003c/td\u003e\n      \u003ctd\u003e0\u003c/td\u003e\n      \u003ctd\u003e3\u003c/td\u003e\n      \u003ctd\u003e0\u003c/td\u003e\n      \u003ctd\u003e1\u003c/td\u003e\n      \u003ctd\u003e@mleew17: boy dats cold...tyga dwn bad for cuf...\u003c/td\u003e\n    \u003c/tr\u003e\n    \u003ctr\u003e\n      \u003cth\u003e2\u003c/th\u003e\n      \u003ctd\u003e2\u003c/td\u003e\n      \u003ctd\u003e3\u003c/td\u003e\n      \u003ctd\u003e0\u003c/td\u003e\n      \u003ctd\u003e3\u003c/td\u003e\n      \u003ctd\u003e0\u003c/td\u003e\n      \u003ctd\u003e1\u003c/td\u003e\n      \u003ctd\u003e@UrKindOfBrand Dawg!!!! 
RT @80sbaby4life: You ...\u003c/td\u003e\n    \u003c/tr\u003e\n    \u003ctr\u003e\n      \u003cth\u003e3\u003c/th\u003e\n      \u003ctd\u003e3\u003c/td\u003e\n      \u003ctd\u003e3\u003c/td\u003e\n      \u003ctd\u003e0\u003c/td\u003e\n      \u003ctd\u003e2\u003c/td\u003e\n      \u003ctd\u003e1\u003c/td\u003e\n      \u003ctd\u003e1\u003c/td\u003e\n      \u003ctd\u003e@C_G_Anderson: @viva_based she look like a tranny\u003c/td\u003e\n    \u003c/tr\u003e\n    \u003ctr\u003e\n      \u003cth\u003e4\u003c/th\u003e\n      \u003ctd\u003e4\u003c/td\u003e\n      \u003ctd\u003e6\u003c/td\u003e\n      \u003ctd\u003e0\u003c/td\u003e\n      \u003ctd\u003e6\u003c/td\u003e\n      \u003ctd\u003e0\u003c/td\u003e\n      \u003ctd\u003e1\u003c/td\u003e\n      \u003ctd\u003e@ShenikaRoberts: The shit you hear about me mi...\u003c/td\u003e\n    \u003c/tr\u003e\n  \u003c/tbody\u003e\n\u003c/table\u003e\n\u003c/div\u003e\n\n\n\n\n```python\nhate_speech = round(len(df_twitter[df_twitter['class']==0])/len(df_twitter)*100,2)\noff_speech = round(len(df_twitter[df_twitter['class']==1])/len(df_twitter)*100,2)\nneither_speech = round(len(df_twitter[df_twitter['class']==2])/len(df_twitter)*100,2)\n\nprint('The ratio of Hate Speech labelled rows to total rows is {}%\\n\\\nThe ratio of Offensive Speech labelled rows to total rows is {}%\\n\\\nThe ratio of Neutral labelled rows to total rows is {}%'.format(hate_speech,off_speech,neither_speech))\n```\n\n    The ratio of Hate Speech labelled rows to total rows is 5.77%\n    The ratio of Offensive Speech labelled rows to total rows is 77.43%\n    The ratio of Neutral labelled rows to total rows is 16.8%\n\n\n\n```python\nlen(df_twitter)\n```\n\n\n\n\n    24783\n\n\n\nSince encoding the tweets into a bag of words takes a very long time on my computer, I will reduce the size of this data set while keeping the ratio of the three classes of speech constant.\n\n\n```python\nclass_zero = df_twitter[df_twitter['class']==0][:143] # Hate Speech\nclass_one = 
df_twitter[df_twitter['class']==1][:1919] # Offensive Language\nclass_two = df_twitter[df_twitter['class']==2][:416] # Neither\n```\n\n\n```python\ndf_twitter = pd.concat([class_zero, class_one, class_two])\n```\n\nBelow, the Twitter data is split into a training set and a test set, with 1/4 of the dataset going to the test set. This allows us to validate the hate speech model before moving on to a harder dataset, for example Reddit or any other online source.\n\n\n```python\nX_train, X_test, Y_train, Y_test = train_test_split(df_twitter['tweet'], df_twitter['class'], test_size=0.25, random_state=42)\n```\n\nNow the training text will be concatenated into one large string, whose words are used to fit scikit-learn's `LabelBinarizer()`.\n\n\n```python\ntwitter_text_train = ''\nfor n in X_train:\n    twitter_text_train += n + ' '\n```\n\n\n```python\ntwitter_encoder = LabelBinarizer()\ntwitter_encoder.fit(twitter_text_train.split())\n```\n\n\n\n\n    LabelBinarizer()\n\n\n\n\n```python\ncol_list = range(len(twitter_encoder.classes_)) # columns from classes created by LabelBinarizer\ntrain_list = np.empty((len(X_train),len(twitter_encoder.classes_))) #create empty arrays to populate\ntest_list = np.empty((len(X_test),len(twitter_encoder.classes_)))\n```\n\n\n```python\nfor i, n in enumerate(X_train,0):\n    transformed = twitter_encoder.transform(n.split())\n    tmp_ = transformed.sum(axis=0).T\n    train_list[i] = tmp_\n    if i % 200 == 199:\n        print(i+1)\n```\n\n    200\n    400\n    600\n    800\n    1000\n    1200\n    1400\n    1600\n    1800\n\n\n\n```python\nfor i, n in enumerate(X_test,0):\n    transformed = twitter_encoder.transform(n.split())\n    tmp_ = transformed.sum(axis=0).T\n    test_list[i] = tmp_\n    if i % 100 == 99:\n        print(i+1)\n```\n\n    100\n    200\n    300\n    400\n    500\n    600\n\n\n\n```python\nTwitter_train_df = pd.DataFrame(train_list)\nTwitter_test_df = 
pd.DataFrame(test_list)\n```\n\n\n```python\nTwitter_train_df.to_csv('train_list.csv', index=False)\nTwitter_test_df.to_csv('test_list.csv', index=False)\n```\n\n\n```python\nTwitter_test_df = Twitter_test_df.reindex(columns = Twitter_train_df.columns, fill_value=0)\n```\n\n\n```python\nTwitter_train_df = pd.read_csv('train_list.csv')\nTwitter_test_df = pd.read_csv('test_list.csv')\n```\n\n# Dimension Reduction\n\n\n```python\n#twitter_text = nlp(twitter_text)\n#text_processed = [preprocess_token(txt) for txt in twitter_text if is_token_allowed(txt)]\n```\n\n# Predict if Comment is Hateful or Not\n\nDifferent machine learning models will be investigated below to find the one that generalizes best to the validation set, which is 25% of the Twitter CSV data. Then Reddit comments from `r/news/` will be transformed with the `LabelBinarizer()` fitted above, producing a 'bag of words' from which an ML model can predict a class for each Reddit comment.  \n\nLet us begin by creating a model with Naive Bayes:\n\n## Naive Bayes\n\n\n```python\nnb = GaussianNB()\nnb_fit = nb.fit(Twitter_train_df,Y_train)\n```\n\n\n```python\nnb_fit.score(Twitter_test_df,Y_test)\n```\n\n\n\n\n    0.7645161290322581\n\n\n\nWe see above that we can only predict the class of a tweet (hate speech, offensive speech, or neutral) with an accuracy of about 76.45%.\n\n\n```python\nparameters_knn = {\n    'n_neighbors':[3,4,5,6,7],\n    'weights':['uniform','distance'],\n    'algorithm':['auto','ball_tree','kd_tree','brute'],\n}\n\nknn = KNeighborsClassifier()\nclf_knn = GridSearchCV(knn, parameters_knn, cv=5, n_jobs=-1, verbose=10)\nclf_knn.fit(Twitter_train_df,Y_train)\n```\n\n    Fitting 5 folds for each of 40 candidates, totalling 200 fits\n\n\n    [Parallel(n_jobs=-1)]: Using backend LokyBackend with 8 concurrent workers.\n    [Parallel(n_jobs=-1)]: Done   2 tasks      | elapsed:   24.1s\n    [Parallel(n_jobs=-1)]: Done   9 tasks      | elapsed:   43.7s\n    
[Parallel(n_jobs=-1)]: Done  16 tasks      | elapsed:   44.2s\n    [Parallel(n_jobs=-1)]: Done  25 tasks      | elapsed:  1.4min\n    [Parallel(n_jobs=-1)]: Done  34 tasks      | elapsed:  1.8min\n    [Parallel(n_jobs=-1)]: Done  45 tasks      | elapsed:  2.1min\n    [Parallel(n_jobs=-1)]: Done  56 tasks      | elapsed:  2.5min\n    [Parallel(n_jobs=-1)]: Done  69 tasks      | elapsed:  3.1min\n    [Parallel(n_jobs=-1)]: Done  82 tasks      | elapsed:  3.7min\n    [Parallel(n_jobs=-1)]: Done  97 tasks      | elapsed:  4.3min\n    [Parallel(n_jobs=-1)]: Done 112 tasks      | elapsed:  4.7min\n    [Parallel(n_jobs=-1)]: Done 129 tasks      | elapsed:  5.6min\n    [Parallel(n_jobs=-1)]: Done 146 tasks      | elapsed:  6.0min\n    [Parallel(n_jobs=-1)]: Done 165 tasks      | elapsed:  6.2min\n    [Parallel(n_jobs=-1)]: Done 184 tasks      | elapsed:  6.3min\n    [Parallel(n_jobs=-1)]: Done 200 out of 200 | elapsed:  6.3min finished\n\n\n\n\n\n    GridSearchCV(cv=5, estimator=KNeighborsClassifier(), n_jobs=-1,\n                 param_grid={'algorithm': ['auto', 'ball_tree', 'kd_tree', 'brute'],\n                             'n_neighbors': [3, 4, 5, 6, 7],\n                             'weights': ['uniform', 'distance']},\n                 verbose=10)\n\n\n\n\n```python\nx = pd.DataFrame(clf_knn.cv_results_)\nx[['mean_test_score','std_test_score','rank_test_score']].sort_values(by='rank_test_score').head()\n```\n\n\n\n\n\u003cdiv\u003e\n\u003cstyle scoped\u003e\n    .dataframe tbody tr th:only-of-type {\n        vertical-align: middle;\n    }\n\n    .dataframe tbody tr th {\n        vertical-align: top;\n    }\n\n    .dataframe thead th {\n        text-align: right;\n    }\n\u003c/style\u003e\n\u003ctable border=\"1\" class=\"dataframe\"\u003e\n  \u003cthead\u003e\n    \u003ctr style=\"text-align: right;\"\u003e\n      \u003cth\u003e\u003c/th\u003e\n      \u003cth\u003emean_test_score\u003c/th\u003e\n      \u003cth\u003estd_test_score\u003c/th\u003e\n      
\u003cth\u003erank_test_score\u003c/th\u003e\n    \u003c/tr\u003e\n  \u003c/thead\u003e\n  \u003ctbody\u003e\n    \u003ctr\u003e\n      \u003cth\u003e37\u003c/th\u003e\n      \u003ctd\u003e0.790101\u003c/td\u003e\n      \u003ctd\u003e0.012655\u003c/td\u003e\n      \u003ctd\u003e1\u003c/td\u003e\n    \u003c/tr\u003e\n    \u003ctr\u003e\n      \u003cth\u003e29\u003c/th\u003e\n      \u003ctd\u003e0.788488\u003c/td\u003e\n      \u003ctd\u003e0.012594\u003c/td\u003e\n      \u003ctd\u003e2\u003c/td\u003e\n    \u003c/tr\u003e\n    \u003ctr\u003e\n      \u003cth\u003e9\u003c/th\u003e\n      \u003ctd\u003e0.788488\u003c/td\u003e\n      \u003ctd\u003e0.012594\u003c/td\u003e\n      \u003ctd\u003e2\u003c/td\u003e\n    \u003c/tr\u003e\n    \u003ctr\u003e\n      \u003cth\u003e4\u003c/th\u003e\n      \u003ctd\u003e0.788487\u003c/td\u003e\n      \u003ctd\u003e0.012499\u003c/td\u003e\n      \u003ctd\u003e4\u003c/td\u003e\n    \u003c/tr\u003e\n    \u003ctr\u003e\n      \u003cth\u003e24\u003c/th\u003e\n      \u003ctd\u003e0.788487\u003c/td\u003e\n      \u003ctd\u003e0.012499\u003c/td\u003e\n      \u003ctd\u003e4\u003c/td\u003e\n    \u003c/tr\u003e\n  \u003c/tbody\u003e\n\u003c/table\u003e\n\u003c/div\u003e\n\n\n\n\n```python\nclf_knn.best_params_\n```\n\n\n\n\n    {'algorithm': 'brute', 'n_neighbors': 6, 'weights': 'distance'}\n\n\n\n\n```python\nclf_knn.best_score_\n```\n\n\n\n\n    0.7901008607947135\n\n\n\n\n```python\nclf_knn.best_estimator_.score(Twitter_test_df,Y_test)\n```\n\n\n\n\n    0.8080645161290323\n\n\n\nThere is a slight improvement using a KNN model to classify text speech, however not by much. 
Let's try a random forest.\n\n\n```python\nparameters_rf = {\n    'n_estimators':[50,75,100,125,150],\n    'criterion':['gini','entropy'],\n    'min_samples_split':[2,3,4],\n    'min_samples_leaf':[1,2,3]\n}\n\nrf = RandomForestClassifier()\nclf_rf = GridSearchCV(rf, parameters_rf, cv=5, n_jobs=-1, verbose=10)\nclf_rf.fit(Twitter_train_df,Y_train)\n```\n\n    Fitting 5 folds for each of 90 candidates, totalling 450 fits\n\n\n    [Parallel(n_jobs=-1)]: Using backend LokyBackend with 8 concurrent workers.\n    [Parallel(n_jobs=-1)]: Done   2 tasks      | elapsed:    4.6s\n    [Parallel(n_jobs=-1)]: Done   9 tasks      | elapsed:    8.9s\n    [Parallel(n_jobs=-1)]: Done  16 tasks      | elapsed:   12.7s\n    [Parallel(n_jobs=-1)]: Done  25 tasks      | elapsed:   19.7s\n    [Parallel(n_jobs=-1)]: Done  34 tasks      | elapsed:   24.4s\n    [Parallel(n_jobs=-1)]: Done  45 tasks      | elapsed:   34.0s\n    [Parallel(n_jobs=-1)]: Done  56 tasks      | elapsed:   40.9s\n    [Parallel(n_jobs=-1)]: Done  69 tasks      | elapsed:   52.1s\n    [Parallel(n_jobs=-1)]: Done  82 tasks      | elapsed:   57.9s\n    [Parallel(n_jobs=-1)]: Done  97 tasks      | elapsed:  1.1min\n    [Parallel(n_jobs=-1)]: Done 112 tasks      | elapsed:  1.2min\n    [Parallel(n_jobs=-1)]: Done 129 tasks      | elapsed:  1.3min\n    [Parallel(n_jobs=-1)]: Done 146 tasks      | elapsed:  1.4min\n    [Parallel(n_jobs=-1)]: Done 165 tasks      | elapsed:  1.5min\n    [Parallel(n_jobs=-1)]: Done 184 tasks      | elapsed:  1.5min\n    [Parallel(n_jobs=-1)]: Done 205 tasks      | elapsed:  1.6min\n    [Parallel(n_jobs=-1)]: Done 226 tasks      | elapsed:  1.8min\n    [Parallel(n_jobs=-1)]: Done 249 tasks      | elapsed:  2.0min\n    [Parallel(n_jobs=-1)]: Done 272 tasks      | elapsed:  2.3min\n    [Parallel(n_jobs=-1)]: Done 297 tasks      | elapsed:  2.6min\n    [Parallel(n_jobs=-1)]: Done 322 tasks      | elapsed:  2.7min\n    [Parallel(n_jobs=-1)]: Done 349 tasks      | elapsed:  2.9min\n    
[Parallel(n_jobs=-1)]: Done 376 tasks      | elapsed:  3.0min\n    [Parallel(n_jobs=-1)]: Done 405 tasks      | elapsed:  3.1min\n    [Parallel(n_jobs=-1)]: Done 434 tasks      | elapsed:  3.3min\n    [Parallel(n_jobs=-1)]: Done 450 out of 450 | elapsed:  3.4min finished\n\n\n\n\n\n    GridSearchCV(cv=5, estimator=RandomForestClassifier(), n_jobs=-1,\n                 param_grid={'criterion': ['gini', 'entropy'],\n                             'min_samples_leaf': [1, 2, 3],\n                             'min_samples_split': [2, 3, 4],\n                             'n_estimators': [50, 75, 100, 125, 150]},\n                 verbose=10)\n\n\n\n\n```python\nx = pd.DataFrame(clf_rf.cv_results_)\nx[['mean_test_score','std_test_score','rank_test_score']].sort_values(by='rank_test_score').head()\n```\n\n\n\n\n\u003cdiv\u003e\n\u003cstyle scoped\u003e\n    .dataframe tbody tr th:only-of-type {\n        vertical-align: middle;\n    }\n\n    .dataframe tbody tr th {\n        vertical-align: top;\n    }\n\n    .dataframe thead th {\n        text-align: right;\n    }\n\u003c/style\u003e\n\u003ctable border=\"1\" class=\"dataframe\"\u003e\n  \u003cthead\u003e\n    \u003ctr style=\"text-align: right;\"\u003e\n      \u003cth\u003e\u003c/th\u003e\n      \u003cth\u003emean_test_score\u003c/th\u003e\n      \u003cth\u003estd_test_score\u003c/th\u003e\n      \u003cth\u003erank_test_score\u003c/th\u003e\n    \u003c/tr\u003e\n  \u003c/thead\u003e\n  \u003ctbody\u003e\n    \u003ctr\u003e\n      \u003cth\u003e1\u003c/th\u003e\n      \u003ctd\u003e0.817543\u003c/td\u003e\n      \u003ctd\u003e0.010441\u003c/td\u003e\n      \u003ctd\u003e1\u003c/td\u003e\n    \u003c/tr\u003e\n    \u003ctr\u003e\n      \u003cth\u003e6\u003c/th\u003e\n      \u003ctd\u003e0.815389\u003c/td\u003e\n      \u003ctd\u003e0.010508\u003c/td\u003e\n      \u003ctd\u003e2\u003c/td\u003e\n    \u003c/tr\u003e\n    \u003ctr\u003e\n      \u003cth\u003e8\u003c/th\u003e\n      \u003ctd\u003e0.813234\u003c/td\u003e\n      
\u003ctd\u003e0.012193\u003c/td\u003e\n      \u003ctd\u003e3\u003c/td\u003e\n    \u003c/tr\u003e\n    \u003ctr\u003e\n      \u003cth\u003e2\u003c/th\u003e\n      \u003ctd\u003e0.813231\u003c/td\u003e\n      \u003ctd\u003e0.013584\u003c/td\u003e\n      \u003ctd\u003e4\u003c/td\u003e\n    \u003c/tr\u003e\n    \u003ctr\u003e\n      \u003cth\u003e48\u003c/th\u003e\n      \u003ctd\u003e0.812160\u003c/td\u003e\n      \u003ctd\u003e0.011136\u003c/td\u003e\n      \u003ctd\u003e5\u003c/td\u003e\n    \u003c/tr\u003e\n  \u003c/tbody\u003e\n\u003c/table\u003e\n\u003c/div\u003e\n\n\n\n\n```python\nclf_rf.best_params_\n```\n\n\n\n\n    {'criterion': 'gini',\n     'min_samples_leaf': 1,\n     'min_samples_split': 2,\n     'n_estimators': 75}\n\n\n\n\n```python\nclf_rf.best_score_\n```\n\n\n\n\n    0.8175434020230125\n\n\n\n\n```python\nclf_rf.best_estimator_.score(Twitter_test_df,Y_test)\n```\n\n\n\n\n    0.8419354838709677\n\n\n\nThere is an even smaller improvement using a Random Forest model to classify text speech.\n\n## Confusion Matrix\n\n\n```python\nlabels = ['Hate','Off','Neu']\nplot_confusion_matrix(nb_fit, Twitter_test_df, Y_test, normalize='true', display_labels = labels)\nplt.title('Naive Bayes Confusion Matrix')\nplt.show()\n```\n\n\n![png](img/output_83_0.png)\n\n\n\n```python\nlabels = ['Hate','Off','Neu']\nplot_confusion_matrix(clf_knn.best_estimator_, Twitter_test_df, Y_test, normalize='true', display_labels = labels)\nplt.title('K Nearest Neighbours Confusion Matrix')\nplt.show()\n```\n\n\n![png](img/output_84_0.png)\n\n\n\n```python\nlabels = ['Hate','Off','Neu']\nplot_confusion_matrix(clf_rf.best_estimator_, Twitter_test_df, Y_test, normalize='true', display_labels = labels)\nplt.title('Random Forest Confusion Matrix')\nplt.show()\n```\n\n\n![png](img/output_85_0.png)\n\n\n# Predict Reddit Comment Classification\n\n\n```python\ndf_reddit = df_.copy()\n```\n\n\n```python\ndf_reddit.head()\n```\n\n\n\n\n\u003cdiv\u003e\n\u003cstyle scoped\u003e\n    .dataframe 
tbody tr th:only-of-type {\n        vertical-align: middle;\n    }\n\n    .dataframe tbody tr th {\n        vertical-align: top;\n    }\n\n    .dataframe thead th {\n        text-align: right;\n    }\n\u003c/style\u003e\n\u003ctable border=\"1\" class=\"dataframe\"\u003e\n  \u003cthead\u003e\n    \u003ctr style=\"text-align: right;\"\u003e\n      \u003cth\u003e\u003c/th\u003e\n      \u003cth\u003eauthor\u003c/th\u003e\n      \u003cth\u003escore\u003c/th\u003e\n      \u003cth\u003eid\u003c/th\u003e\n      \u003cth\u003ereplies\u003c/th\u003e\n      \u003cth\u003eedited\u003c/th\u003e\n      \u003cth\u003ecreated\u003c/th\u003e\n      \u003cth\u003ebody\u003c/th\u003e\n    \u003c/tr\u003e\n  \u003c/thead\u003e\n  \u003ctbody\u003e\n    \u003ctr\u003e\n      \u003cth\u003e0\u003c/th\u003e\n      \u003ctd\u003ehoosakiwi\u003c/td\u003e\n      \u003ctd\u003e1\u003c/td\u003e\n      \u003ctd\u003effnzwqh\u003c/td\u003e\n      \u003ctd\u003e()\u003c/td\u003e\n      \u003ctd\u003e1.58007e+09\u003c/td\u003e\n      \u003ctd\u003e1.580073e+09\u003c/td\u003e\n      \u003ctd\u003eLet's try and keep the comments here civil. Yo...\u003c/td\u003e\n    \u003c/tr\u003e\n    \u003ctr\u003e\n      \u003cth\u003e1\u003c/th\u003e\n      \u003ctd\u003erandy88moss\u003c/td\u003e\n      \u003ctd\u003e1\u003c/td\u003e\n      \u003ctd\u003efx335fo\u003c/td\u003e\n      \u003ctd\u003e()\u003c/td\u003e\n      \u003ctd\u003eFalse\u003c/td\u003e\n      \u003ctd\u003e1.594030e+09\u003c/td\u003e\n      \u003ctd\u003eThinking about you, Mamba!\u003c/td\u003e\n    \u003c/tr\u003e\n    \u003ctr\u003e\n      \u003cth\u003e2\u003c/th\u003e\n      \u003ctd\u003erockinchanks\u003c/td\u003e\n      \u003ctd\u003e3\u003c/td\u003e\n      \u003ctd\u003efvn9x00\u003c/td\u003e\n      \u003ctd\u003e()\u003c/td\u003e\n      \u003ctd\u003eFalse\u003c/td\u003e\n      \u003ctd\u003e1.592841e+09\u003c/td\u003e\n      \u003ctd\u003eR.I.P. 
It’s halfway through 2020 and it only r...\u003c/td\u003e\n    \u003c/tr\u003e\n    \u003ctr\u003e\n      \u003cth\u003e3\u003c/th\u003e\n      \u003ctd\u003epspotboy\u003c/td\u003e\n      \u003ctd\u003e1\u003c/td\u003e\n      \u003ctd\u003efvb2l2x\u003c/td\u003e\n      \u003ctd\u003e()\u003c/td\u003e\n      \u003ctd\u003eFalse\u003c/td\u003e\n      \u003ctd\u003e1.592542e+09\u003c/td\u003e\n      \u003ctd\u003eMaybe my\\nMind just doesn’t have Kobe\u003c/td\u003e\n    \u003c/tr\u003e\n    \u003ctr\u003e\n      \u003cth\u003e4\u003c/th\u003e\n      \u003ctd\u003eoooOoh42069\u003c/td\u003e\n      \u003ctd\u003e1\u003c/td\u003e\n      \u003ctd\u003efv5ckt5\u003c/td\u003e\n      \u003ctd\u003e()\u003c/td\u003e\n      \u003ctd\u003eFalse\u003c/td\u003e\n      \u003ctd\u003e1.592419e+09\u003c/td\u003e\n      \u003ctd\u003eI like how this post got made me smile awards ...\u003c/td\u003e\n    \u003c/tr\u003e\n  \u003c/tbody\u003e\n\u003c/table\u003e\n\u003c/div\u003e\n\n\n\n\n```python\nReddit_list = np.empty((len(df_reddit['body']),len(twitter_encoder.classes_)))\n```\n\n\n```python\nfor i, n in enumerate(df_reddit['body'],0):\n    transformed = twitter_encoder.transform(n.split())\n    tmp_ = transformed.sum(axis=0).T\n    Reddit_list[i] = tmp_\n    if i % 1000 == 999:\n        print(i+1)\n```\n\n    1000\n    2000\n    3000\n    4000\n    5000\n    6000\n    7000\n\n\n\n```python\n# Build the Reddit feature frame and align its columns with the training features\nReddit_test_df = pd.DataFrame(Reddit_list)\nReddit_test_df = Reddit_test_df.reindex(columns = Twitter_train_df.columns, fill_value=0)\n```\n\n\n```python\nReddit_predictions = clf_rf.best_estimator_.predict(Reddit_test_df)\n```\n\n\n```python\ndf_reddit['body'][0], Reddit_predictions[0]\n```\n\n\n\n\n    (\"Let's try and keep the comments here civil. You don't have to like Kobe, but celebrating his death, making jokes about it, and wishing death/threatening other people still violates our rules. 
\\n\\nUsers who threaten other people will be banned on sight.\",\n     1)\n\n\n\n\n```python\ndf_reddit['body'][1], Reddit_predictions[1]\n```\n\n\n\n\n    ('Thinking about you, Mamba!', 1)\n\n\n\n\n```python\ndf_reddit['body'][2], Reddit_predictions[2]\n```\n\n\n\n\n    ('R.I.P. It’s halfway through 2020 and it only really hit me that he’s gone. i didn’t even know you but you still hold a special place in my heart. i hope you, Gigi and the other passengers all have a nice stay in heaven. 😔',\n     1)\n\n\n\n\n```python\ndf_reddit['body'][3], Reddit_predictions[3]\n```\n\n\n\n\n    ('Maybe my\\nMind just doesn’t have Kobe', 1)\n\n\n\n\n```python\ndf_reddit['body'][4], Reddit_predictions[4]\n```\n\n\n\n\n    ('I like how this post got made me smile awards and wholesome awards from people',\n     1)\n\n\n\n\n```python\ndf_reddit['body'][21], Reddit_predictions[21]\n```\n\n\n\n\n    ('Nice bru 👊 I’ve started following him on YouTube now. I’m interested to hopefully understand the pilot’s thought process in these circumstances and conditions. It seems he was trained as an instructor to use Instruments and had deep knowledge of the route he was flying. Why did he drop nearly 400 feet in 14 seconds? What was his plan B?',\n     2)\n\n\n\n\n```python\nnp.where(Reddit_predictions == 0)\n```\n\n\n\n\n    (array([362]),)\n\n\n\n\n```python\ndf_reddit['body'][362], Reddit_predictions[362]\n```\n\n\n\n\n    (\"Death is a strange thing because it is so hard to wrap your head around it, the most common reaction is to go to how we feel about it, but Id encourage everone to place yourselves for a moment in the victim's shoes,  the fear of plummeting to earth holding your loved ones knowing your going to die is unfathomable to me, my thoughts and prayers\",\n     0)\n\n\n\n# Results\n\nText classification plays an important role in today's information age, especially on social media. 
Because of the large amount of data available on the web, this data must be organised according to its content so that it can be retrieved easily, for example by grouping text into hate speech, offensive speech and neutral speech. The expansion of social media leads to many different kinds of language being used on the web, which adds complexity to the text classification task. Unfortunately, a model trained on the Twitter data from Thomas Davidson et al., 2017 does *not* generalize well to classifying text data from Reddit. The Predict Reddit Comment Classification section above shows that most of the predictions are class 1 (offensive speech), yet the accompanying text does not seem offensive.\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fdavidhintelmann%2Fnatural-language-processing","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fdavidhintelmann%2Fnatural-language-processing","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fdavidhintelmann%2Fnatural-language-processing/lists"}