{"id":24560196,"url":"https://github.com/gpsyrou/text_analysis_of_consumer_reviews","last_synced_at":"2025-08-17T10:07:27.132Z","repository":{"id":45071094,"uuid":"344601354","full_name":"gpsyrou/Text_Analysis_of_Consumer_Reviews","owner":"gpsyrou","description":"Natural Language Processing (NLP) and analysis on reviews about delivery companies in the UK based on reviews extracted from the Trustpilot website","archived":false,"fork":false,"pushed_at":"2023-01-28T22:27:57.000Z","size":10310,"stargazers_count":2,"open_issues_count":13,"forks_count":2,"subscribers_count":1,"default_branch":"main","last_synced_at":"2025-08-02T16:39:09.579Z","etag":null,"topics":["latent-dirichlet-allocation","nlp","python","topic-extraction","topic-modeling"],"latest_commit_sha":null,"homepage":"","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":null,"status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/gpsyrou.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":null,"code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null}},"created_at":"2021-03-04T20:30:34.000Z","updated_at":"2023-05-03T14:23:35.000Z","dependencies_parsed_at":"2023-02-15T19:16:04.303Z","dependency_job_id":null,"html_url":"https://github.com/gpsyrou/Text_Analysis_of_Consumer_Reviews","commit_stats":null,"previous_names":[],"tags_count":0,"template":false,"template_full_name":null,"purl":"pkg:github/gpsyrou/Text_Analysis_of_Consumer_Reviews","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/gpsyrou%2FText_Analysis_of_Consumer_Reviews","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/gpsyrou%2FText_Analysis_of_Consumer_Reviews/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/gpsyrou%2FText_Analysis_of_Consumer_Reviews/relea
ses","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/gpsyrou%2FText_Analysis_of_Consumer_Reviews/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/gpsyrou","download_url":"https://codeload.github.com/gpsyrou/Text_Analysis_of_Consumer_Reviews/tar.gz/refs/heads/main","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/gpsyrou%2FText_Analysis_of_Consumer_Reviews/sbom","scorecard":null,"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":270832481,"owners_count":24653553,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","status":"online","status_checked_at":"2025-08-17T02:00:09.016Z","response_time":129,"last_error":null,"robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":true,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["latent-dirichlet-allocation","nlp","python","topic-extraction","topic-modeling"],"created_at":"2025-01-23T07:16:05.441Z","updated_at":"2025-08-17T10:07:26.996Z","avatar_url":"https://github.com/gpsyrou.png","language":"Python","funding_links":[],"categories":[],"sub_categories":[],"readme":"# Topic Modelling with NLP \u0026 Latent Dirichlet Allocation on Customer Reviews\n![Python](https://img.shields.io/badge/-Python-000?\u0026logo=Python) ![Jupyter Notebook](https://img.shields.io/badge/Jupyter-Notebook-orange?\u0026logo=Jupyter)\n\nPurpose of this project is to leverage reviews about major delivery companies that are operating in the UK, and perform NLP tasks to analyze different aspects of 
the reviews, such as the sentiment, the most common words, probability distributions across word sequences, and more.\n\n## Introduction\n\nIn this project we are going to explore the world of logistics companies and the issues that they might be facing. Specifically, we are going to focus on analyzing data regarding a few of the most well-known delivery companies in the UK, namely \u003ca href=\"https://en.wikipedia.org/wiki/Deliveroo\" style=\"text-decoration:none\"\u003e Deliveroo\u003c/a\u003e, \u003ca href=\"https://en.wikipedia.org/wiki/UberEats\" style=\"text-decoration:none\"\u003e UberEats\u003c/a\u003e, \u003ca href=\"https://en.wikipedia.org/wiki/Just_Eat\" style=\"text-decoration:none\"\u003e Just Eat\u003c/a\u003e and \u003ca href=\"https://stuart.com/\" style=\"text-decoration:none\"\u003e Stuart\u003c/a\u003e. To do that, we are going to use the reviews that can be found on many different online platforms - especially platforms that specialize in collecting reviews and opinions of customers for a plethora of companies and services. \n\nThe first iteration of this project uses the reviews that can be found on the well-known consumer review website \u003ca href=\"https://en.wikipedia.org/wiki/Trustpilot\" style=\"text-decoration:none\"\u003e TrustPilot\u003c/a\u003e. Even though the website already provides some API functionality, we are going to write our own web-scraping tool to retrieve the data in the format that we want. We will attempt to collect as many reviews as possible and then use them to identify interesting findings in the text. For example, we will try to identify the overall sentiment across all reviews for a specific company, the most common words and bigrams (i.e. pairs of words that tend to appear next to each other) in the reviews, and more. 
Finally, we will implement a \u003ca href=\"https://en.wikipedia.org/wiki/Latent_Dirichlet_allocation\" style=\"text-decoration:none\"\u003e Latent Dirichlet Allocation\u003c/a\u003e model to try to identify the topics that these reviews correspond to. Note that the LDA model is going to be implemented twice, once for the negative and once for the positive reviews.\n\n\n## Project Roadmap\n\n```mermaid\ngraph   LR\n    A[Build a tool to connect to web sources APIs] --\u003e|Get reviews from web| B[Clean reviews]\n    B --\u003e D[Knowledge Graphs]\n    B --\u003e F[Unsupervised Clustering]\n    B --\u003e C(Sentiment Analysis)\n    B --\u003e |Identify topic of review| E[Topic Extraction]\n    E --\u003e  |Train Model| I[Assign Topic to new instances]\n    C --\u003e |Train Model| J[Sentiment Classifier]\n    I --\u003e K[Build UI]\n    J --\u003e K[Build UI]\n```\n\n### Version 1.0: (Most recent version of the Notebook can be found here: \u003ca href=\"https://github.com/gpsyrou/Text_Analysis_of_Consumer_Reviews/blob/main/jupyter_notebook/reviews.ipynb\"\u003eV1.0 Notebook\u003c/a\u003e)\n\n- [x] Implemented v1.0 of the web scraper and data collection API\n- [x] Developed a standard LDA model for topic identification\n- [x] Created first version of visualizations to present the results\n\n## Web-Scraping Tool and Data Retrieval\n\nIn order to collect the reviews directly from the TrustPilot website, we have created a web-scraping tool that allows us to automate this process across different companies \u0026 their corresponding reviews. This tool iterates across different pages of the website and collects the reviews and any other relevant information, with the output being stored in CSV files. Moreover, we have packaged the tool into a Python library. 
Hence, if you are thinking of working on a similar project where you need to retrieve data from TrustPilot, you can install the package, which you can find \u003ca href=\"https://github.com/gpsyrou/Text_Analysis_of_Consumer_Reviews/blob/main/trustplt.py\" style=\"text-decoration:none\"\u003ehere\u003c/a\u003e. As of January 2023, the package contains the main functionality to collect several pieces of information from the website, such as the reviews, reviewer_id, date of the review, user rating, and more. \n\nFor the first iteration of the project, we have built the aforementioned package with the functionality to retrieve the following information - which will also be the features in our dataset:\n\n1. **Company**: Name of the company that we are examining (e.g. Deliveroo, UberEats, JustEat, Stuart)\n2. **Id**: The unique identifier for the review\n3. **Reviewer_Id**: Unique id for a reviewer/user\n4. **Title**: Title of the review\n5. **Review**: The text of the review, as submitted by the reviewer\n6. **Date**: Day of review submission\n7. 
**Rating**: The rating of the company, as submitted by the reviewer\n\n### Input Schema\n\n| Column/Feature                                  | Type | Description |\n|-------------------------------------------------|---| ---|\n| Company                                         | NVARCHAR | Name of the delivery company |\n| Id                                              | NVARCHAR | Id of the review |\n| Reviewer_Id                                     | NVARCHAR | Id of the reviewer    |\n| Title                                           | NVARCHAR | Title of the review    |\n| Review                                          | NVARCHAR | The review itself - free text field    |\n| Date                                            | DATE     | Day that the review was submitted    |\n| Rating                                          | BIGINT   | Rating (1-5)|\n\n\n## Data Retrieval API\n\nTo get reviews from the TrustPilot website, we use a custom-made web-scraping tool. This tool iterates across different pages of the website and collects the reviews and any other relevant information, with the output being stored in CSV files.\n\n### Running Guide\n\n1. Set up the appropriate configuration in config.json (\u003ca href=\"https://raw.githubusercontent.com/gpsyrou/Text_Analysis_of_Consumer_Reviews/main/config.json\"\u003eexample\u003c/a\u003e). The config needs to be populated with the following metadata:\u003cbr\u003e\n        - \u003cem\u003esource_url\u003c/em\u003e: Main domain URL\u003cbr\u003e\n        - \u003cem\u003estarting_page\u003c/em\u003e: Domain subpath to a specific reviews page\u003cbr\u003e\n        - \u003cem\u003esteps\u003c/em\u003e: Defines the number of pages to iterate over\u003cbr\u003e\n        - \u003cem\u003ecompany\u003c/em\u003e: Company/Service of interest\u003cbr\u003e\n\n2. 
Execute the Python retriever script\u003cbr\u003e\n        `python data_retriever.py`\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fgpsyrou%2Ftext_analysis_of_consumer_reviews","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fgpsyrou%2Ftext_analysis_of_consumer_reviews","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fgpsyrou%2Ftext_analysis_of_consumer_reviews/lists"}