{"id":21546488,"url":"https://github.com/gui-sitton/detectnegativereviews","last_synced_at":"2025-03-18T00:53:34.995Z","repository":{"id":200577195,"uuid":"705841643","full_name":"Gui-Sitton/DetectNegativeReviews","owner":"Gui-Sitton","description":"Create a model to classify reviews as positive and negative.","archived":false,"fork":false,"pushed_at":"2023-11-29T17:34:49.000Z","size":1203,"stargazers_count":0,"open_issues_count":0,"forks_count":0,"subscribers_count":1,"default_branch":"main","last_synced_at":"2025-01-24T09:12:25.112Z","etag":null,"topics":["catboost","logisticregression","nltk","nltk-python","random-forest-classifier","text-analysis","text-classification","tfidf-vectorizer","xgboost"],"latest_commit_sha":null,"homepage":"","language":"Jupyter Notebook","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":null,"status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/Gui-Sitton.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":null,"code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2023-10-16T19:47:18.000Z","updated_at":"2023-11-03T17:54:08.000Z","dependencies_parsed_at":null,"dependency_job_id":"e60f8eea-6885-4a29-8546-b861bb2b7edd","html_url":"https://github.com/Gui-Sitton/DetectNegativeReviews","commit_stats":null,"previous_names":["gui-sitton/detectnegativereviews"],"tags_count":0,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/Gui-Sitton%2FDetectNegativeReviews","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/Gui-Sitton%2FDetectNegativeReviews/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/Gui-Sitton%2FDetectNega
tiveReviews/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/Gui-Sitton%2FDetectNegativeReviews/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/Gui-Sitton","download_url":"https://codeload.github.com/Gui-Sitton/DetectNegativeReviews/tar.gz/refs/heads/main","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":244135905,"owners_count":20403797,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["catboost","logisticregression","nltk","nltk-python","random-forest-classifier","text-analysis","text-classification","tfidf-vectorizer","xgboost"],"created_at":"2024-11-24T06:12:18.449Z","updated_at":"2025-03-18T00:53:34.973Z","avatar_url":"https://github.com/Gui-Sitton.png","language":"Jupyter Notebook","funding_links":[],"categories":[],"sub_categories":[],"readme":"# DetectNegativeReviews\n\n## Intro\nThe Film Junky Union, a new community for classic film enthusiasts, is developing a system to filter and categorize film reviews. The aim is to train a model that automatically detects negative reviews. You will use a dataset of IMDB movie reviews with polarity labels to build a model that classifies reviews as positive or negative. The model needs to reach an F1 score of at least 0.85.\n\n### Data description\nThe data was provided by Andrew L. Maas, Raymond E. Daly, Peter T. Pham, Dan Huang, Andrew Y. Ng and Christopher Potts. (2011). Learning Word Vectors for Sentiment Analysis. 
The 49th Annual Meeting of the Association for Computational Linguistics (ACL 2011).\n\n**Selected fields:**\n\n* review: the text of the review\n* pos: the target, '0' for negative and '1' for positive\n* ds_part: 'train'/'test' for the training/testing part of the dataset, respectively\n\nThere are other fields in the dataset.\n\n## Libraries used\n\nimport math\n\nimport re\n\nimport numpy as np\n\nimport pandas as pd\n\nfrom sklearn.model_selection import train_test_split\n\nimport matplotlib\n\nimport matplotlib.pyplot as plt\n\nimport matplotlib.dates as mdates\n\nimport seaborn as sns\n\nfrom nltk.corpus import stopwords as nltk_stopwords\n\nfrom sklearn.feature_extraction.text import TfidfVectorizer\n\nfrom sklearn.metrics import accuracy_score, f1_score\n\nfrom tqdm.auto import tqdm\n\nfrom sklearn.linear_model import LogisticRegression\n\nimport nltk\n\nnltk.download('stopwords')\n\nfrom lightgbm import LGBMClassifier\n\nfrom sklearn.ensemble import RandomForestClassifier\n\nimport xgboost as xgb\n\nfrom catboost import CatBoostClassifier\n\n## Conclusion\n\n* Logistic regression and LGBM were the best models, while model 0 performed like a random model.\n* Logistic regression performed slightly better with the first type of text processing, normalization, reaching an accuracy of 88% and an F1 score of 0.88. 
With the second type of text processing, the APS on the test set dropped from 0.95 to 0.94 and the F1 score was also lower, at 0.87.\n* On my own reviews, every model I tested performed no better than a random model, with roughly 50% accuracy. I believe the dataframe of my reviews was too small and did not contain enough examples.\n\n\n# Detect negative reviews\n\n## Introduction\nThe Film Junky Union, a new community for classic film enthusiasts, is developing a system to filter and categorize film reviews. The aim is to train a model that automatically detects negative reviews. You will use a dataset of IMDB movie reviews with polarity labels to build a model that classifies reviews as positive or negative. The model needs to reach an F1 score of at least 0.85.\n\n### Data description\nThe data was provided by Andrew L. Maas, Raymond E. Daly, Peter T. Pham, Dan Huang, Andrew Y. Ng and Christopher Potts. (2011). Learning Word Vectors for Sentiment Analysis. 
The 49th Annual Meeting of the Association for Computational Linguistics (ACL 2011).\n\n**Selected fields:**\n\n* review: the text of the review\n* pos: the target, '0' for negative and '1' for positive\n* ds_part: 'train'/'test' for the training/testing part of the dataset, respectively\n\nThere are other fields in the dataset.\n\n## Libraries used\n\nimport math\n\nimport re\n\nimport numpy as np\n\nimport pandas as pd\n\nfrom sklearn.model_selection import train_test_split\n\nimport matplotlib\n\nimport matplotlib.pyplot as plt\n\nimport matplotlib.dates as mdates\n\nimport seaborn as sns\n\nfrom nltk.corpus import stopwords as nltk_stopwords\n\nfrom sklearn.feature_extraction.text import TfidfVectorizer\n\nfrom sklearn.metrics import accuracy_score, f1_score\n\nfrom tqdm.auto import tqdm\n\nfrom sklearn.linear_model import LogisticRegression\n\nimport nltk\n\nnltk.download('stopwords')\n\nfrom lightgbm import LGBMClassifier\n\nfrom sklearn.ensemble import RandomForestClassifier\n\nimport xgboost as xgb\n\nfrom catboost import CatBoostClassifier\n\n## Conclusion\n\n* Logistic regression and LGBM were the best models, while model 0 performed like a random model.\n* Logistic regression performed slightly better with the first type of text processing, normalization, reaching an accuracy of 88% and an F1 score of 0.88. 
With the second type of text processing, the APS on the test set dropped from 0.95 to 0.94 and the F1 score was also lower, at 0.87.\n* On my own reviews, every model I tested performed no better than a random model, with roughly 50% accuracy. I believe the dataframe of my reviews was too small and did not contain enough examples.\n\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fgui-sitton%2Fdetectnegativereviews","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fgui-sitton%2Fdetectnegativereviews","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fgui-sitton%2Fdetectnegativereviews/lists"}