{"id":22211626,"url":"https://github.com/kwokhing/sentimentanalysis-python-demo","last_synced_at":"2025-07-27T11:32:25.104Z","repository":{"id":189295791,"uuid":"208187112","full_name":"KwokHing/SentimentAnalysis-Python-Demo","owner":"KwokHing","description":"Submission of an in-class NLP sentiment analysis competition held at Microsoft AI Singapore group. This submission entry explores the performance of both lexicon \u0026 machine-learning based models","archived":false,"fork":false,"pushed_at":"2022-10-10T12:53:07.000Z","size":12504,"stargazers_count":12,"open_issues_count":0,"forks_count":9,"subscribers_count":4,"default_branch":"master","last_synced_at":"2023-08-19T08:42:09.610Z","etag":null,"topics":["bert-model","cross-validation","data-analysis","decision-trees","deep-learning","lexicon-based","lstm-sentiment-analysis","machine-learning","machine-learning-algorithms","naive-bayes","naive-bayes-classifier","random-forest","sentiment-analysis","sentiment-classification","sklearn","supervised-learning","supervised-machine-learning","support-vector-machines","svm","vader-sentiment-analysis"],"latest_commit_sha":null,"homepage":"","language":"Jupyter 
Notebook","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"unlicense","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/KwokHing.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null}},"created_at":"2019-09-13T03:13:31.000Z","updated_at":"2023-08-19T08:42:11.181Z","dependencies_parsed_at":"2023-09-07T15:01:47.198Z","dependency_job_id":null,"html_url":"https://github.com/KwokHing/SentimentAnalysis-Python-Demo","commit_stats":null,"previous_names":["kwokhing/sentimentanalysis-python-demo"],"tags_count":null,"template":null,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/KwokHing%2FSentimentAnalysis-Python-Demo","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/KwokHing%2FSentimentAnalysis-Python-Demo/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/KwokHing%2FSentimentAnalysis-Python-Demo/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/KwokHing%2FSentimentAnalysis-Python-Demo/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/KwokHing","download_url":"https://codeload.github.com/KwokHing/SentimentAnalysis-Python-Demo/tar.gz/refs/heads/master","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":227801436,"owners_count":17822016,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_n
ames","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["bert-model","cross-validation","data-analysis","decision-trees","deep-learning","lexicon-based","lstm-sentiment-analysis","machine-learning","machine-learning-algorithms","naive-bayes","naive-bayes-classifier","random-forest","sentiment-analysis","sentiment-classification","sklearn","supervised-learning","supervised-machine-learning","support-vector-machines","svm","vader-sentiment-analysis"],"created_at":"2024-12-02T20:35:44.106Z","updated_at":"2024-12-02T20:35:44.551Z","avatar_url":"https://github.com/KwokHing.png","language":"Jupyter Notebook","funding_links":[],"categories":[],"sub_categories":[],"readme":"## Exploration of Sentiment Analysis\n\nThis repo contains the submission entry for an in-class NLP sentiment analysis competition held at the Microsoft AI Singapore group. It uses techniques learned in class to classify text as expressing positive or negative sentiment.\n\n![jpg](images/inclass-competition.jpg)\n\nWe recommend installing [Anaconda](https://www.anaconda.com/products/distribution), a pre-packaged Python distribution that contains all of the necessary libraries and software for this project. 
Alternatively, you can use [Google Colaboratory](https://colab.research.google.com/), which allows you to write and execute Python code in your browser.\n\n**Data**\n\nData for this in-class competition comes from the [Sentiment140](https://www.kaggle.com/datasets/kazanova/sentiment140) dataset; the training and test sets are random samples of 10% and 5% of it, respectively.\n\n## Getting started using Lexicon and Machine Learning (ML) based methods\nOpen `SentimentAnalysis.ipynb` in a Jupyter notebook environment, or [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://githubtocolab.com/KwokHing/SentimentAnalysis-Python-Demo/blob/master/SentimentAnalysis.ipynb)\n\n- VADER (valence-based sentiment analyzer) [67%]\n- Naive Bayes\n- Linear SVM (Support Vector Machine) [80%]\n- Decision Tree\n- Random Forest\n- Extra Trees\n- SVC [80%]\n\n## Exploring using Deep Learning Techniques (LSTM)\nOpen `SentimentAnalysis_RNN.ipynb` in a Jupyter notebook environment, or [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://githubtocolab.com/KwokHing/SentimentAnalysis-Python-Demo/blob/master/SentimentAnalysis_RNN.ipynb)\n\nThe LSTM deep learning method [79%] did not outperform the SVC/SVM methods [80%].\n\u003cbr/\u003e\n\n## How about the BERT Transformers model?\nOpen `SentimentAnalysis_BERT.ipynb` in a Jupyter notebook environment, or [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://githubtocolab.com/KwokHing/SentimentAnalysis-Python-Demo/blob/master/SentimentAnalysis_BERT.ipynb)\n\nThe state-of-the-art transformer model performs slightly better, at [82%] accuracy.\n\n\u003c!---\n# Walk-through of the submission entry:\n\n\n## 1. 
Adding imports \u0026 installing necessary packages ##\n\n\n```python\n### run this if using google colab to mount google drive as local storage\n\nfrom google.colab import drive\nimport os\ndrive.mount('/content/gdrive')\n\nrepo_path = '/content/gdrive/My Drive/colab/NLP-Bootcamp/'\n```\n\n \n\n\n```python\nimport pandas as pd\nimport collections\n%matplotlib inline\n\n# Import modules to calculate accuracy and confusion matrix\nfrom sklearn.metrics import confusion_matrix, accuracy_score\n```\n\n## 2. Loading Data ##\n\n\n```python\n### run below 2 lines of code for setting train \u0026 test data path on google colab\n'''\ntrainData = os.path.join(repo_path, 'data/sentiment140_160k_tweets_train.csv')\ntestData = os.path.join(repo_path, 'data/sentiment140_test.csv')\n'''\n\n### run below 3 lines of code for setting train \u0026 test data path on local machine\nDATA = './data/'\ntrainData = DATA + 'sentiment140_160k_tweets_train.csv'\ntestData =  DATA + 'sentiment140_test.csv'\n\ntrain = pd.read_csv(trainData)\ntest = pd.read_csv(testData)\n\ntrain.head()\n```\n\n\n\n\n\u003cdiv\u003e\n\n\u003ctable border=\"1\" class=\"dataframe\"\u003e\n  \u003cthead\u003e\n    \u003ctr style=\"text-align: right;\"\u003e\n      \u003cth\u003e\u003c/th\u003e\n      \u003cth\u003etarget\u003c/th\u003e\n      \u003cth\u003eids\u003c/th\u003e\n      \u003cth\u003euser\u003c/th\u003e\n      \u003cth\u003etext\u003c/th\u003e\n    \u003c/tr\u003e\n  \u003c/thead\u003e\n  \u003ctbody\u003e\n    \u003ctr\u003e\n      \u003cth\u003e0\u003c/th\u003e\n      \u003ctd\u003ep\u003c/td\u003e\n      \u003ctd\u003e1978186076\u003c/td\u003e\n      \u003ctd\u003eceruleanbreeze\u003c/td\u003e\n      \u003ctd\u003e@nocturnalie Anyway, and now Abby and I share ...\u003c/td\u003e\n    \u003c/tr\u003e\n    \u003ctr\u003e\n      \u003cth\u003e1\u003c/th\u003e\n      \u003ctd\u003ep\u003c/td\u003e\n      \u003ctd\u003e1994697891\u003c/td\u003e\n      \u003ctd\u003eenthusiasticjen\u003c/td\u003e\n      
\u003ctd\u003e@JoeGigantino Few times I'm trying to leave co...\u003c/td\u003e\n    \u003c/tr\u003e\n    \u003ctr\u003e\n      \u003cth\u003e2\u003c/th\u003e\n      \u003ctd\u003ep\u003c/td\u003e\n      \u003ctd\u003e2191885992\u003c/td\u003e\n      \u003ctd\u003eLifeRemixed\u003c/td\u003e\n      \u003ctd\u003e@AngieGriffin Good Morning Angie  I'll be in t...\u003c/td\u003e\n    \u003c/tr\u003e\n    \u003ctr\u003e\n      \u003cth\u003e3\u003c/th\u003e\n      \u003ctd\u003ep\u003c/td\u003e\n      \u003ctd\u003e1753662211\u003c/td\u003e\n      \u003ctd\u003elovemandy\u003c/td\u003e\n      \u003ctd\u003ehad a good day driving up mountains, visiting ...\u003c/td\u003e\n    \u003c/tr\u003e\n    \u003ctr\u003e\n      \u003cth\u003e4\u003c/th\u003e\n      \u003ctd\u003ep\u003c/td\u003e\n      \u003ctd\u003e2177442789\u003c/td\u003e\n      \u003ctd\u003e_LOVELYmanu\u003c/td\u003e\n      \u003ctd\u003edownloading some songs  i love lady GaGa.\u003c/td\u003e\n    \u003c/tr\u003e\n  \u003c/tbody\u003e\n\u003c/table\u003e\n\u003c/div\u003e\n\n\n\nLooking at distribution of *'positives'* \u0026 *'negatives'* samples in train dataset \n\n\n```python\ncollections.Counter(train['target'])\n```\n\n\n\n\n    Counter({'n': 79985, 'p': 80000})\n\n\n\n\n```python\ntrain.groupby('target').size().plot(kind='bar')\n```\n\n\n\n![png](images/output_7_1.png)\n\n\nWe will find that it is a relatively well-balanced dataset\n\n## 3. 
Data (Text) Preprocessing ##\n\n\n```python\n### mapping a dictionary of apostrophe words\n\nappos = {\n\"aren't\" : \"are not\",\n\"can't\" : \"cannot\",\n\"cant\" : \"cannot\",\n\"couldn't\" : \"could not\",\n\"didn't\" : \"did not\",\n\"doesn't\" : \"does not\",\n\"don't\" : \"do not\",\n\"hadn't\" : \"had not\",\n\"hasn't\" : \"has not\",\n\"haven't\" : \"have not\",\n\"he'd\" : \"he would\",\n\"he'll\" : \"he will\",\n\"he's\" : \"he is\",\n\"i'd\" : \"I had\",\n\"i'll\" : \"I will\",\n\"i'm\" : \"I am\",\n\"im\" : \"I am\",\n\"isn't\" : \"is not\",\n\"it's\" : \"it is\",\n\"it'll\":\"it will\",\n\"i've\" : \"I have\",\n\"let's\" : \"let us\",\n\"mightn't\" : \"might not\",\n\"mustn't\" : \"must not\",\n\"shan't\" : \"shall not\",\n\"she'd\" : \"she would\",\n\"she'll\" : \"she will\",\n\"she's\" : \"she is\",\n\"shouldn't\" : \"should not\",\n\"that's\" : \"that is\",\n\"there's\" : \"there is\",\n\"they'd\" : \"they would\",\n\"they'll\" : \"they will\",\n\"they're\" : \"they are\",\n\"they've\" : \"they have\",\n\"we'd\" : \"we would\",\n\"we're\" : \"we are\",\n\"weren't\" : \"were not\",\n\"we've\" : \"we have\",\n\"what'll\" : \"what will\",\n\"what're\" : \"what are\",\n\"what's\" : \"what is\",\n\"what've\" : \"what have\",\n\"where's\" : \"where is\",\n\"who'd\" : \"who would\",\n\"who'll\" : \"who will\",\n\"who're\" : \"who are\",\n\"who's\" : \"who is\",\n\"who've\" : \"who have\",\n\"won't\" : \"will not\",\n\"wouldn't\" : \"would not\",\n\"you'd\" : \"you would\",\n\"you'll\" : \"you will\",\n\"you're\" : \"you are\",\n\"you've\" : \"you have\",\n\"'re\": \" are\",\n\"wasn't\": \"was not\",\n\"we'll\": \"we will\",\n\"gg\" : \"going\"\n}\n```\n\n\n```python\nimport re\n\ndef preprocess_text(sentence):\n    text = re.sub('((www\\.[^\\s]+)|(https?://[^\\s]+))','', sentence['text'])\n    text = re.sub('@[^\\s]+','', text)\n    text = text.lower().split()\n    reformed = [appos[word] if word in appos 
else word for word in text]\n    reformed = \" \".join(reformed) \n    text = re.sub('\u0026[^\\s]+;', '', reformed)\n    text = re.sub('[^a-zA-Zа-яА-Я1-9]+', ' ', text)\n    text = re.sub(' +',' ', text)\n    #text = re.sub(' [\\w] ', ' ', text)\n    return text.strip()\n\npreprocess = train\npreprocess['ugc'] = preprocess.apply(preprocess_text, axis=1)\n\npreprocess.head()\n```\n\n\n\n\n\u003cdiv\u003e\n\n\u003ctable border=\"1\" class=\"dataframe\"\u003e\n  \u003cthead\u003e\n    \u003ctr style=\"text-align: right;\"\u003e\n      \u003cth\u003e\u003c/th\u003e\n      \u003cth\u003etarget\u003c/th\u003e\n      \u003cth\u003eids\u003c/th\u003e\n      \u003cth\u003euser\u003c/th\u003e\n      \u003cth\u003etext\u003c/th\u003e\n      \u003cth\u003eugc\u003c/th\u003e\n    \u003c/tr\u003e\n  \u003c/thead\u003e\n  \u003ctbody\u003e\n    \u003ctr\u003e\n      \u003cth\u003e0\u003c/th\u003e\n      \u003ctd\u003ep\u003c/td\u003e\n      \u003ctd\u003e1978186076\u003c/td\u003e\n      \u003ctd\u003eceruleanbreeze\u003c/td\u003e\n      \u003ctd\u003e@nocturnalie Anyway, and now Abby and I share ...\u003c/td\u003e\n      \u003ctd\u003eanyway and now abby and i share all our crops ...\u003c/td\u003e\n    \u003c/tr\u003e\n    \u003ctr\u003e\n      \u003cth\u003e1\u003c/th\u003e\n      \u003ctd\u003ep\u003c/td\u003e\n      \u003ctd\u003e1994697891\u003c/td\u003e\n      \u003ctd\u003eenthusiasticjen\u003c/td\u003e\n      \u003ctd\u003e@JoeGigantino Few times I'm trying to leave co...\u003c/td\u003e\n      \u003ctd\u003efew times I am trying to leave comments in you...\u003c/td\u003e\n    \u003c/tr\u003e\n    \u003ctr\u003e\n      \u003cth\u003e2\u003c/th\u003e\n      \u003ctd\u003ep\u003c/td\u003e\n      \u003ctd\u003e2191885992\u003c/td\u003e\n      \u003ctd\u003eLifeRemixed\u003c/td\u003e\n      \u003ctd\u003e@AngieGriffin Good Morning Angie  I'll be in t...\u003c/td\u003e\n      \u003ctd\u003egood morning angie I will be in the atl july 8...\u003c/td\u003e\n    \u003c/tr\u003e\n  
  \u003ctr\u003e\n      \u003cth\u003e3\u003c/th\u003e\n      \u003ctd\u003ep\u003c/td\u003e\n      \u003ctd\u003e1753662211\u003c/td\u003e\n      \u003ctd\u003elovemandy\u003c/td\u003e\n      \u003ctd\u003ehad a good day driving up mountains, visiting ...\u003c/td\u003e\n      \u003ctd\u003ehad a good day driving up mountains visiting k...\u003c/td\u003e\n    \u003c/tr\u003e\n    \u003ctr\u003e\n      \u003cth\u003e4\u003c/th\u003e\n      \u003ctd\u003ep\u003c/td\u003e\n      \u003ctd\u003e2177442789\u003c/td\u003e\n      \u003ctd\u003e_LOVELYmanu\u003c/td\u003e\n      \u003ctd\u003edownloading some songs  i love lady GaGa.\u003c/td\u003e\n      \u003ctd\u003edownloading some songs i love lady gaga\u003c/td\u003e\n    \u003c/tr\u003e\n  \u003c/tbody\u003e\n\u003c/table\u003e\n\u003c/div\u003e\n\n\n\n## 4. Sentiment Analysis using Lexicon-based Method\n\nThere are two types of lexicon-based sentiment analysis approaches: _polarity_-based and _valence_-based.\n\n_VADER_ is a _valence_-based sentiment analyzer.\n\nThe *valence*-based approach takes into account the \"intensity\" of a word, not only its polarity (+ve or -ve). For example, \"Great\" is treated as more +ve than \"Good\".\n\nReferences:\nhttp://comp.social.gatech.edu/papers/icwsm14.vader.hutto.pdf\n\nClassification scale used, based on the compound value:\n\n1. Positive = \u003e=0\n2. 
Negative = \u003c0\n\n\n\n```python\npip install vaderSentiment\n```\n\n\n```python\nfrom vaderSentiment.vaderSentiment import SentimentIntensityAnalyzer\n\nanalyzer = SentimentIntensityAnalyzer()\n```\n\n\n```python\ndef print_sentiment_scores(ugc):\n    snt = analyzer.polarity_scores(ugc['ugc'])  # Calling the polarity analyzer\n    return snt['compound']\n```\n\n\n```python\ncompound = train\ncompound['VADER']=compound.apply(print_sentiment_scores, axis=1)\n\ncompound.head()\n```\n\n\n\n\n\u003cdiv\u003e\n\n\u003ctable border=\"1\" class=\"dataframe\"\u003e\n  \u003cthead\u003e\n    \u003ctr style=\"text-align: right;\"\u003e\n      \u003cth\u003e\u003c/th\u003e\n      \u003cth\u003etarget\u003c/th\u003e\n      \u003cth\u003eids\u003c/th\u003e\n      \u003cth\u003euser\u003c/th\u003e\n      \u003cth\u003etext\u003c/th\u003e\n      \u003cth\u003eugc\u003c/th\u003e\n      \u003cth\u003eVADER\u003c/th\u003e\n    \u003c/tr\u003e\n  \u003c/thead\u003e\n  \u003ctbody\u003e\n    \u003ctr\u003e\n      \u003cth\u003e0\u003c/th\u003e\n      \u003ctd\u003ep\u003c/td\u003e\n      \u003ctd\u003e1978186076\u003c/td\u003e\n      \u003ctd\u003eceruleanbreeze\u003c/td\u003e\n      \u003ctd\u003e@nocturnalie Anyway, and now Abby and I share ...\u003c/td\u003e\n      \u003ctd\u003eanyway and now abby and i share all our crops ...\u003c/td\u003e\n      \u003ctd\u003e0.6361\u003c/td\u003e\n    \u003c/tr\u003e\n    \u003ctr\u003e\n      \u003cth\u003e1\u003c/th\u003e\n      \u003ctd\u003ep\u003c/td\u003e\n      \u003ctd\u003e1994697891\u003c/td\u003e\n      \u003ctd\u003eenthusiasticjen\u003c/td\u003e\n      \u003ctd\u003e@JoeGigantino Few times I'm trying to leave co...\u003c/td\u003e\n      \u003ctd\u003efew times I am trying to leave comments in you...\u003c/td\u003e\n      \u003ctd\u003e-0.0258\u003c/td\u003e\n    \u003c/tr\u003e\n    \u003ctr\u003e\n      \u003cth\u003e2\u003c/th\u003e\n      \u003ctd\u003ep\u003c/td\u003e\n      \u003ctd\u003e2191885992\u003c/td\u003e\n      
\u003ctd\u003eLifeRemixed\u003c/td\u003e\n      \u003ctd\u003e@AngieGriffin Good Morning Angie  I'll be in t...\u003c/td\u003e\n      \u003ctd\u003egood morning angie I will be in the atl july 8...\u003c/td\u003e\n      \u003ctd\u003e0.4404\u003c/td\u003e\n    \u003c/tr\u003e\n    \u003ctr\u003e\n      \u003cth\u003e3\u003c/th\u003e\n      \u003ctd\u003ep\u003c/td\u003e\n      \u003ctd\u003e1753662211\u003c/td\u003e\n      \u003ctd\u003elovemandy\u003c/td\u003e\n      \u003ctd\u003ehad a good day driving up mountains, visiting ...\u003c/td\u003e\n      \u003ctd\u003ehad a good day driving up mountains visiting k...\u003c/td\u003e\n      \u003ctd\u003e0.7717\u003c/td\u003e\n    \u003c/tr\u003e\n    \u003ctr\u003e\n      \u003cth\u003e4\u003c/th\u003e\n      \u003ctd\u003ep\u003c/td\u003e\n      \u003ctd\u003e2177442789\u003c/td\u003e\n      \u003ctd\u003e_LOVELYmanu\u003c/td\u003e\n      \u003ctd\u003edownloading some songs  i love lady GaGa.\u003c/td\u003e\n      \u003ctd\u003edownloading some songs i love lady gaga\u003c/td\u003e\n      \u003ctd\u003e0.6369\u003c/td\u003e\n    \u003c/tr\u003e\n  \u003c/tbody\u003e\n\u003c/table\u003e\n\u003c/div\u003e\n\n\n\n\n```python\nconfusion_matrix(compound['target'], compound['predict'])\naccuracy_score(compound['target'], compound['predict'])\n```\n\n\n\n\n    0.6673063099665594\n\n\n\n\n```python\ndef custom_predict(ugc):\n    snt = analyzer.polarity_scores(ugc['ugc'])  # Calling the polarity analyzer\n    if snt['neg'] \u003e snt['pos']:\n        return 'n'\n    elif snt['pos'] \u003e snt['neg']:\n        return 'p'\n    else:\n        return 'p'\n\nvader = train\nvader['predict']=vader.apply(custom_predict, axis=1)\n```\n\n\n```python\nconfusion_matrix(vader['target'], vader['predict'])\naccuracy_score(vader['target'], vader['predict'])\n```\n\n\n\n\n    0.6673063099665594\n\n\n\n## 5. 
Sentiment Analysis using Machine Learning-based Method: Naive Bayes\n\n\n```python\n#Import feature engineering modules and test_train_split\nfrom sklearn.pipeline import Pipeline\nfrom sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer, TfidfVectorizer\nfrom sklearn.model_selection import train_test_split, GridSearchCV\n\n#Import classification algorithm\nfrom sklearn.naive_bayes import MultinomialNB\nfrom sklearn.svm import SVC\nfrom sklearn.svm import LinearSVC\nfrom sklearn.tree import DecisionTreeClassifier\nfrom sklearn.ensemble import RandomForestClassifier\nfrom sklearn.ensemble import ExtraTreesClassifier\nfrom xgboost import XGBClassifier\n\n#Import modules to calculate accuracy and confusion matrix\nfrom sklearn.metrics import confusion_matrix, accuracy_score\nfrom sklearn.metrics import classification_report\n```\n\nNaive Bayes with TF-IDF on original text data\n\n\n```python\ntv = TfidfVectorizer(ngram_range=(1,3),max_features=20000,stop_words='english') \nX = tv.fit_transform(train['text'])\n\nXtrain, Xtest, ytrain, ytest = train_test_split(X, train['target'],\n                                               test_size = 0.2, shuffle=True)\n\nnb = MultinomialNB(alpha=6.5, fit_prior=False)\nnb.fit(Xtrain,ytrain)\npred = nb.predict(Xtest)\n\nprint(accuracy_score(ytest,pred))\nprint(confusion_matrix(ytest,pred))\nprint(classification_report(ytest,pred))\n```\n\n    0.753820670687877\n    [[12287  3768]\n     [ 4109 11833]]\n                  precision    recall  f1-score   support\n    \n               n       0.75      0.77      0.76     16055\n               p       0.76      0.74      0.75     15942\n    \n        accuracy                           0.75     31997\n       macro avg       0.75      0.75      0.75     31997\n    weighted avg       0.75      0.75      0.75     31997\n    \n\n\nNaive Bayes with TF-IDF on pre-processed text data - achieved very minimal accuracy improvement\n\n\n```python\ntv = 
TfidfVectorizer(ngram_range=(1,3),max_features=20000,stop_words='english') \nX = tv.fit_transform(preprocess['ugc'])\n\nXtrain, Xtest, ytrain, ytest = train_test_split(X, preprocess['target'],\n                                               test_size = 0.2, shuffle=True)\n\nnb = MultinomialNB(alpha=6.5, fit_prior=False)\nnb.fit(Xtrain,ytrain)\npred = nb.predict(Xtest)\n\nprint(accuracy_score(ytest,pred))\nprint(confusion_matrix(ytest,pred))\nprint(classification_report(ytest,pred))\n```\n\n    0.7545707410069694\n    [[12184  3730]\n     [ 4123 11960]]\n                  precision    recall  f1-score   support\n    \n               n       0.75      0.77      0.76     15914\n               p       0.76      0.74      0.75     16083\n    \n        accuracy                           0.75     31997\n       macro avg       0.75      0.75      0.75     31997\n    weighted avg       0.75      0.75      0.75     31997\n    \n\n\nNaive Bayes with Grid Search Hyperparameter Tuning \u0026 10-Fold Cross Validation - achieving higher accuracy than the model without hyperparameter tuning\n\n\n```python\ntext_clf = Pipeline([('vect', CountVectorizer()),\n                     ('tfidf', TfidfTransformer()),\n                     ('clf', MultinomialNB())])\ntuned_parameters = {\n    'vect__ngram_range': [(1, 1), (1, 2), (2, 2)],\n    'tfidf__use_idf': (True, False),\n    'tfidf__norm': ('l1', 'l2'),\n    'clf__alpha': [1, 1e-1, 1e-2]\n}\n```\n\n\n```python\nx_train, x_test, y_train, y_test = train_test_split(train['text'], train['target'],\n                                               test_size = 0.2, shuffle=True)\n```\n\n\n```python\nfrom sklearn.metrics import classification_report\nclf = GridSearchCV(text_clf, tuned_parameters, cv=10)\nclf.fit(x_train, y_train)\n\nprint(classification_report(y_test, clf.predict(x_test), digits=4))\nprint(accuracy_score(y_test, clf.predict(x_test)))\nprint(confusion_matrix(y_test, clf.predict(x_test)))\n```\n\n                  precision    
recall  f1-score   support\n    \n               n     0.7525    0.8386    0.7932     15876\n               p     0.8209    0.7284    0.7719     16121\n    \n        accuracy                         0.7831     31997\n       macro avg     0.7867    0.7835    0.7825     31997\n    weighted avg     0.7870    0.7831    0.7825     31997\n    \n    0.7830734131324811\n    [[13314  2562]\n     [ 4379 11742]]\n\n\n\n```python\n# x_train, x_test, y_train, y_test = train_test_split(data, labels, test_size=0.2, random_state=42)\nx_train, x_test, y_train, y_test = train_test_split(preprocess['ugc'], preprocess['target'],\n                                               test_size = 0.2, shuffle=True)\n```\n\n\n```python\nfrom sklearn.metrics import classification_report\nclf = GridSearchCV(text_clf, tuned_parameters, cv=10)\nclf.fit(x_train, y_train)\n\nprint(classification_report(y_test, clf.predict(x_test), digits=4))\nprint(accuracy_score(y_test, clf.predict(x_test)))\nprint(confusion_matrix(y_test, clf.predict(x_test)))\n```\n\n                  precision    recall  f1-score   support\n    \n               n     0.7571    0.8380    0.7955     16035\n               p     0.8177    0.7299    0.7713     15962\n    \n        accuracy                         0.7841     31997\n       macro avg     0.7874    0.7840    0.7834     31997\n    weighted avg     0.7873    0.7841    0.7834     31997\n    \n    0.7840735068912711\n    [[13438  2597]\n     [ 4312 11650]]\n\n\n\n```python\nprint(\"Best Score: \", clf.best_score_)\nprint(\"Best Params: \", clf.best_params_)\n```\n\n    Best Score:  0.7837531643591586\n    Best Params:  {'clf__alpha': 0.1, 'tfidf__norm': 'l1', 'tfidf__use_idf': False, 'vect__ngram_range': (1, 2)}\n\n\n\n```python\nfrom google.colab import files ### remove this line of code if not using colab\n\ntest['ugc'] = test.apply(preprocess_text, axis=1)\ny_kaggle = clf.predict((test['ugc']))\ntest['target'] = pd.DataFrame(y_kaggle.tolist())\ntest[['target', 
'ids']].to_csv(\"nb_submission.csv\", index=False)\n\nfiles.download('nb_submission.csv') ### remove this line of code if not using colab\n\n```\n\n\n```python\nfrom sklearn.metrics import classification_report\nclf = GridSearchCV(text_clf, tuned_parameters, cv=10)\nclf.fit(x_train, y_train)\n\nprint(classification_report(y_test, clf.predict(x_test), digits=4))\nprint(accuracy_score(y_test, clf.predict(x_test)))\nprint(confusion_matrix(y_test, clf.predict(x_test)))\n```\n\n                  precision    recall  f1-score   support\n    \n               n     0.7734    0.8187    0.7954     16002\n               p     0.8073    0.7600    0.7829     15995\n    \n        accuracy                         0.7894     31997\n       macro avg     0.7904    0.7893    0.7892     31997\n    weighted avg     0.7904    0.7894    0.7892     31997\n    \n    0.7893552520548801\n    [[13101  2901]\n     [ 3839 12156]]\n\n\n## 6. Sentiment Analysis using Machine Learning-based Method: Linear SVM ##\nwith Grid Search Hyperparameter Tuning \u0026 10-Fold Cross Validation\n\n\n```python\ntext_clf = Pipeline([('vect', CountVectorizer()),\n                     ('tfidf', TfidfTransformer()),\n                     ('clf', LinearSVC())])\ntuned_parameters = {\n    'vect__ngram_range': [(1, 2), (1, 3), (1, 4)],\n    'tfidf__use_idf': (True, False),\n    #'tfidf__norm': ('l1', 'l2'),\n    'clf__tol': [1, 1e-1, 1e-2, 1e-3]\n}\n```\n\n\n```python\nx_train, x_test, y_train, y_test = train_test_split(preprocess['ugc'], preprocess['target'],\n                                               test_size = 0.2, shuffle=True)\n```\n\n\n```python\nclf = GridSearchCV(text_clf, tuned_parameters, cv=10)\nclf.fit(x_train, y_train)\n\nprint(classification_report(y_test, clf.predict(x_test), digits=4))\nprint(accuracy_score(y_test, clf.predict(x_test)))\nprint(confusion_matrix(y_test, clf.predict(x_test)))\n\nprint(\"Best Score: \", clf.best_score_)\nprint(\"Best Params: \", clf.best_params_)\n```\n\n            
      precision    recall  f1-score   support\n    \n               n     0.7939    0.8293    0.8112     15870\n               p     0.8243    0.7882    0.8058     16127\n    \n        accuracy                         0.8086     31997\n       macro avg     0.8091    0.8087    0.8085     31997\n    weighted avg     0.8092    0.8086    0.8085     31997\n    \n    0.8085758039816233\n    [[13161  2709]\n     [ 3416 12711]]\n    Best Score:  0.8010751007906991\n    Best Params:  {'clf__tol': 0.1, 'tfidf__use_idf': False, 'vect__ngram_range': (1, 4)}\n\n\n\n```python\nfrom google.colab import files ### remove this line of code if not using colab\n\ntest['ugc'] = test.apply(preprocess_text, axis=1)\ny_kaggle = clf.predict((test['ugc']))\ntest['target'] = pd.DataFrame(y_kaggle.tolist())\ntest[['target', 'ids']].to_csv(\"l_svm_submission.csv\", index=False)\n\nfiles.download('l_svm_submission.csv') ### remove this line of code if not using colab\n```\n\n## 7. Sentiment Analysis using Machine Learning-based Method: XGBoost\n\n\n```python\npip install xgboost\n```\n\n```python\ntv = TfidfVectorizer(ngram_range=(1,2), max_features=20000, stop_words='english', min_df=.0025, max_df=0.25) \nX = tv.fit_transform(preprocess['ugc'])\n\nx_train, x_test, y_train, y_test = train_test_split(X, preprocess['target'],\n                                               test_size = 0.2, shuffle=True)\n```\n\n\n```python\nxgb = XGBClassifier(max_depth=10, n_estimators=400, learning_rate=0.3, objective='binary:logistic')\nxgb.fit(x_train, y_train)\npred = xgb.predict(x_test)\n```\n\n\n```python\nprint(accuracy_score(y_test, pred))\nprint(confusion_matrix(y_test, pred))\nprint(classification_report(y_test, pred))\n```\n\n    0.7501015720223771\n    [[11302  4860]\n     [ 3136 12699]]\n                  precision    recall  f1-score   support\n    \n               n       0.78      0.70      0.74     16162\n               p       0.72      0.80      0.76     15835\n    \n        accuracy           
                0.75     31997\n       macro avg       0.75      0.75      0.75     31997\n    weighted avg       0.75      0.75      0.75     31997\n    \n\n\n## 8. Sentiment Analysis using Machine Learning-based Method: Decision Tree\n\n\n```python\ntv = TfidfVectorizer(ngram_range=(1,2), max_features=20000, stop_words='english') \nX = tv.fit_transform(preprocess['ugc'])\n\nx_train, x_test, y_train, y_test = train_test_split(X, preprocess['target'],\n                                               test_size = 0.2, shuffle=True)\n```\n\n\n```python\ndt = DecisionTreeClassifier()\ndt.fit(Xtrain,ytrain)\npred = dt.predict(Xtest)\n```\n\n\n```python\nprint(accuracy_score(ytest,pred))\nprint(confusion_matrix(ytest,pred))\nprint(classification_report(ytest,pred))\n```\n\n    0.6914398224833578\n    [[10944  5018]\n     [ 4855 11180]]\n                  precision    recall  f1-score   support\n    \n               n       0.69      0.69      0.69     15962\n               p       0.69      0.70      0.69     16035\n    \n        accuracy                           0.69     31997\n       macro avg       0.69      0.69      0.69     31997\n    weighted avg       0.69      0.69      0.69     31997\n    \n\n\n## 9. Sentiment Analysis using Machine Learning-based Method: Random Forest\n\n\n```python\nrf = RandomForestClassifier()\nrf.fit(Xtrain,ytrain)\npred = rf.predict(Xtest)\n```\n\n\n```python\nprint(accuracy_score(ytest,pred))\nprint(confusion_matrix(ytest,pred))\nprint(classification_report(ytest,pred))\n```\n\n## 10. 
Sentiment Analysis using Machine Learning-based Method: Extra Trees\n\n\n```python\netc=ExtraTreesClassifier()\netc.fit(Xtrain,ytrain)\npred=etc.predict(Xtest)\n```\n\n\n\n```python\nprint(accuracy_score(ytest,pred))\nprint(confusion_matrix(ytest,pred))\nprint(classification_report(ytest,pred))\n```\n\n    0.7307872613057474\n    [[11874  4088]\n     [ 4526 11509]]\n                  precision    recall  f1-score   support\n    \n               n       0.72      0.74      0.73     15962\n               p       0.74      0.72      0.73     16035\n    \n        accuracy                           0.73     31997\n       macro avg       0.73      0.73      0.73     31997\n    weighted avg       0.73      0.73      0.73     31997\n    \n\n\n## 11. Sentiment Analysis using Machine Learning-based Method: SVC ##\n\n_Warning - approximately 3hrs of processing_ \n\n\n```python\n#Import feature engineering modules and test_train_split\nfrom sklearn.feature_extraction.text import TfidfVectorizer\nfrom sklearn.model_selection import train_test_split\nfrom sklearn.metrics import classification_report\n\ntv = TfidfVectorizer(ngram_range=(1,3)) \nX = tv.fit_transform(preprocess['ugc'])\n\nXtrain, Xtest, ytrain, ytest = train_test_split(X, preprocess['target'],\n                                               test_size = 0.2, shuffle=True)\n```\n\n\n```python\nfrom sklearn.svm import SVC\n\nsvm = SVC(kernel='linear')\nsvm.fit(Xtrain,ytrain)\npred = svm.predict(Xtest)\n\nprint(accuracy_score(ytest,pred))\nprint(confusion_matrix(ytest,pred))\nprint(classification_report(ytest,pred))\n```\n\n    0.805137981685783\n    [[12754  3333]\n     [ 2902 13008]]\n                  precision    recall  f1-score   support\n    \n               n       0.81      0.79      0.80     16087\n               p       0.80      0.82      0.81     15910\n    \n        accuracy                           0.81     31997\n       macro avg       0.81      0.81      0.81     31997\n    weighted avg       0.81     
 0.81      0.81     31997\n    \n\n\n\n```python\n# Uncomment and run the line below if using Google Colab\n# from google.colab import files\n\ntest['ugc'] = test.apply(preprocess_text, axis=1)\ny_kaggle = svm.predict(tv.transform(test['ugc']))\ntest['target'] = pd.DataFrame(y_kaggle.tolist())\ntest[['target', 'ids']].to_csv(\"svc_submission.csv\", index=False)\n\n# Uncomment and run the line below if using Google Colab\n# files.download('svc_submission.csv')\n```\n\n## Text Pre-processing Steps - References ##\n\nhttps://www.topbots.com/text-preprocessing-for-machine-learning-nlp/\n\n## Further - Text Preprocessing: Porter Stemmer ##\n\n\n```python\nimport nltk\nfrom nltk.tokenize import sent_tokenize, word_tokenize\nfrom nltk.stem import PorterStemmer\nfrom nltk.stem import LancasterStemmer\n\n#nltk.download('punkt') \n\n# Create one instance of each stemmer (only the Porter stemmer is used below)\nporter = PorterStemmer()\nlancaster=LancasterStemmer()\n\n```\n\n\n```python\ndef stemSentence(sentence):\n    token_words = word_tokenize(sentence['ugc'])\n    stem_sentence = []\n    for word in token_words:\n        stem_sentence.append(porter.stem(word))\n        stem_sentence.append(\" \")\n    return \"\".join(stem_sentence)\n  \npreprocess['stem'] = preprocess.apply(stemSentence, axis=1)\npreprocess.head()\n```\n\n\n\n\n\u003cdiv\u003e\n\n\u003ctable border=\"1\" class=\"dataframe\"\u003e\n  \u003cthead\u003e\n    \u003ctr style=\"text-align: right;\"\u003e\n      \u003cth\u003e\u003c/th\u003e\n      \u003cth\u003etarget\u003c/th\u003e\n      \u003cth\u003eids\u003c/th\u003e\n      \u003cth\u003euser\u003c/th\u003e\n      \u003cth\u003etext\u003c/th\u003e\n      \u003cth\u003eugc\u003c/th\u003e\n    \u003c/tr\u003e\n  \u003c/thead\u003e\n  \u003ctbody\u003e\n    \u003ctr\u003e\n      \u003cth\u003e0\u003c/th\u003e\n      \u003ctd\u003ep\u003c/td\u003e\n      \u003ctd\u003e1978186076\u003c/td\u003e\n      \u003ctd\u003eceruleanbreeze\u003c/td\u003e\n      \u003ctd\u003e@nocturnalie Anyway, 
and now Abby and I share ...\u003c/td\u003e\n      \u003ctd\u003eanyway and now abbi and i share all our crop w...\u003c/td\u003e\n    \u003c/tr\u003e\n    \u003ctr\u003e\n      \u003cth\u003e1\u003c/th\u003e\n      \u003ctd\u003ep\u003c/td\u003e\n      \u003ctd\u003e1994697891\u003c/td\u003e\n      \u003ctd\u003eenthusiasticjen\u003c/td\u003e\n      \u003ctd\u003e@JoeGigantino Few times I'm trying to leave co...\u003c/td\u003e\n      \u003ctd\u003efew time i m tri to leav comment in your blog ...\u003c/td\u003e\n    \u003c/tr\u003e\n    \u003ctr\u003e\n      \u003cth\u003e2\u003c/th\u003e\n      \u003ctd\u003ep\u003c/td\u003e\n      \u003ctd\u003e2191885992\u003c/td\u003e\n      \u003ctd\u003eLifeRemixed\u003c/td\u003e\n      \u003ctd\u003e@AngieGriffin Good Morning Angie  I'll be in t...\u003c/td\u003e\n      \u003ctd\u003egood morn angi i ll be in the atl juli 8th 1 t...\u003c/td\u003e\n    \u003c/tr\u003e\n    \u003ctr\u003e\n      \u003cth\u003e3\u003c/th\u003e\n      \u003ctd\u003ep\u003c/td\u003e\n      \u003ctd\u003e1753662211\u003c/td\u003e\n      \u003ctd\u003elovemandy\u003c/td\u003e\n      \u003ctd\u003ehad a good day driving up mountains, visiting ...\u003c/td\u003e\n      \u003ctd\u003ehad a good day drive up mountain visit kati ea...\u003c/td\u003e\n    \u003c/tr\u003e\n    \u003ctr\u003e\n      \u003cth\u003e4\u003c/th\u003e\n      \u003ctd\u003ep\u003c/td\u003e\n      \u003ctd\u003e2177442789\u003c/td\u003e\n      \u003ctd\u003e_LOVELYmanu\u003c/td\u003e\n      \u003ctd\u003edownloading some songs  i love lady GaGa.\u003c/td\u003e\n      \u003ctd\u003edownload some song i love ladi gaga\u003c/td\u003e\n    \u003c/tr\u003e\n  
\u003c/tbody\u003e\n\u003c/table\u003e\n\u003c/div\u003e\n\n--\u003e\n\n\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fkwokhing%2Fsentimentanalysis-python-demo","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fkwokhing%2Fsentimentanalysis-python-demo","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fkwokhing%2Fsentimentanalysis-python-demo/lists"}