{"id":27306273,"url":"https://github.com/micheldpd24/cust_review_bertopic","last_synced_at":"2026-05-16T21:02:38.629Z","repository":{"id":287336942,"uuid":"964061458","full_name":"micheldpd24/cust_review_bertopic","owner":"micheldpd24","description":"Topic Modeling of Customer Reviews using BERTopic","archived":false,"fork":false,"pushed_at":"2025-04-21T16:51:35.000Z","size":4083,"stargazers_count":0,"open_issues_count":0,"forks_count":0,"subscribers_count":1,"default_branch":"main","last_synced_at":"2025-07-10T13:42:26.110Z","etag":null,"topics":["bertopic","coherence-score","dash","dashboard","docker","embeddings","similarity-score","topic-modeling","transformers"],"latest_commit_sha":null,"homepage":"","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/micheldpd24.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE.md","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null,"zenodo":null}},"created_at":"2025-04-10T16:19:51.000Z","updated_at":"2025-04-21T16:51:38.000Z","dependencies_parsed_at":"2025-04-11T09:08:19.732Z","dependency_job_id":"ece888d8-6698-4f54-a847-9f54746bc31c","html_url":"https://github.com/micheldpd24/cust_review_bertopic","commit_stats":null,"previous_names":["micheldpd24/cust_review_bertopic"],"tags_count":0,"template":false,"template_full_name":null,"purl":"pkg:github/micheldpd24/cust_review_bertopic","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/micheldpd24%2Fcust_review_bertopic","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/micheldpd24%2Fcust_review_bertopic/tags","releases_url":"https://repos.ecosyste.ms/api
/v1/hosts/GitHub/repositories/micheldpd24%2Fcust_review_bertopic/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/micheldpd24%2Fcust_review_bertopic/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/micheldpd24","download_url":"https://codeload.github.com/micheldpd24/cust_review_bertopic/tar.gz/refs/heads/main","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/micheldpd24%2Fcust_review_bertopic/sbom","scorecard":null,"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":286080680,"owners_count":33118950,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2026-05-16T18:38:32.183Z","status":"ssl_error","status_checked_at":"2026-05-16T18:38:29.903Z","response_time":115,"last_error":"SSL_read: unexpected eof while reading","robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":false,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["bertopic","coherence-score","dash","dashboard","docker","embeddings","similarity-score","topic-modeling","transformers"],"created_at":"2025-04-12T03:59:16.619Z","updated_at":"2026-05-16T21:02:38.611Z","avatar_url":"https://github.com/micheldpd24.png","language":"Python","funding_links":[],"categories":[],"sub_categories":[],"readme":"# **BERTopic Customer Reviews Topic Modeling**\n\n## **Overview**\nThis project uses **BERTopic**, a powerful topic modeling library, to analyze customer reviews and extract meaningful topics. 
The pipeline is designed to be modular, scalable, and easy to integrate into an ETL (Extract, Transform, Load) pipeline. It includes:\n\n- **Data Loading**: Preprocessing raw customer review data.\n- **Topic Modeling**: Training a BERTopic model to identify topics in the reviews.\n- **Evaluation**: Calculating coherence scores and topic diversity metrics to evaluate model performance.\n- **Visualization**: Providing an interactive dashboard to explore topics, their distributions, and related documents.\n- **Docker Integration**: Packaging the application into a Docker container for seamless deployment in production environments.\n\n---\n\n## **Features**\n1. **BERTopic Model**:\n   - Utilizes transformers for embeddings, UMAP for dimensionality reduction, and HDBSCAN for clustering.\n   - Configurable parameters via a YAML configuration file.\n\n2. **Interactive Dashboard**:\n   - Built using **Dash** and **Plotly** for visualizing topics, topic distributions, and sample documents.\n   - Includes tabs for:\n     - Topic overview\n     - Top terms per topic\n     - Similarity Matrix\n     - Document map (2D visualization of topics)\n     - Topic over time\n     - Topic hierarchy\n     - Topic distribution\n\n3. **Model Evaluation**:\n   - Computes coherence scores using Gensim's CoherenceModel.\n   - Calculates topic diversity to measure the uniqueness of topics.\n\n4. 
**Docker Support**:\n   - The application is containerized for easy deployment at the end of an ETL pipeline.\n\n---\n\n## **Project Structure**\nThe project is organized as follows:\n\n```\n├── main.py                  # Main script to run the pipeline\n├── Dockerfile               # Docker configuration for containerization\n├── requirements.txt         # Python dependencies\n├── config.yaml              # Configuration file for pipeline parameters\n├── assets/                  # Dashboard screenshots\n├── data/                    # Directory for input/output data\n│   ├── full/                # Zipped CSV file of customer reviews\n│   └── results/             # Output directory for model artifacts\n└── README.md                # Project documentation\n```\n\n---\n\n## **Dashboard Screenshots**\nBelow are some screenshots of the interactive dashboard generated by the pipeline. They show the topics extracted from customer reviews and how those topics relate to each other.\n\n### **0. Dashboard Overview**\nOverall view of the dashboard layout.\n\n![Dashboard layout](assets/dashboard.png)\n\n### **1. Topic Overview**\nThis tab provides a high-level summary of the identified topics, including their sizes and names.\n\n![Topic Overview](assets/topic_overview.png)\n\n### **2. Top Terms Per Topic**\nA bar chart displaying the top terms for each topic, highlighting the most representative words.\n\n![Top Terms](assets/top_terms.png)\n\n### **3. Topic Similarity**\nA heatmap showing how similar the topics are to each other.\n\n![Similarity Matrix](assets/similarity_matrix.png)\n\n### **4. Document Map**\nA 2D visualization of the topics and their associated documents. Each point represents a document, colored by its assigned topic.\n\n![Document Map](assets/document_map.png)\n\n### **5. 
Topic Over Time**\nA visualization of topic frequency over time.\n\n![Topic Over Time](assets/topic_over_time.png)\n\n---\n\n## **Configuration**\nThe pipeline is configured using a YAML file (`config.yaml`). Below is an example configuration:\n\n```yaml\n# config.yaml\n# -----------\ndata:\n  input_filepath: \"data/full/full_reviews.csv\"  # Path to the CSV file\n  review_column: \"review\"  # Column name for reviews\n  timestamps_column: \"yearMonth\"  # Column name for timestamps\n  sample_size: null  # Optional: Number of reviews to sample. Use null for all data\n\nmodel:\n  transformer_name: \"all-MiniLM-L6-v2\"\n  language: \"french\"\n  min_topic_size: 20\n  nr_topics: \"auto\"\n  top_n_words: 10\n  umap:\n    n_neighbors: 15\n    n_components: 5\n    min_dist: 0.0\n    metric: \"cosine\"\n    random_state: 42\n  hdbscan:\n    min_cluster_size: 15\n    metric: \"euclidean\"\n    cluster_selection_method: \"eom\"\n    prediction_data: true\n  vectorizer:\n    stop_words: None\n    ngram_range: [1, 3]\n\nevaluation:\n  coherence_metrics:    # Coherence metrics to calculate\n    - \"c_v\"  # Coherence Measure V\n    - \"u_mass\"  # UMass Coherence\n    - \"c_npmi\"  # Normalized Pointwise Mutual Information\n\noutput:\n  save_model: true   # Whether to save the trained model\n  output_dir: \"data/results\"  # Directory to save model outputs\n\ndasboard:\n  port: 8050  # Port for the dashboard\n  host: \"0.0.0.0\"\n  debug: True\n```\n\n---\n\n## **Usage**\n\n### **1. Prerequisites**\n- Python 3.8 or higher\n- Docker (for containerization)\n\nInstall the required dependencies:\n```bash\npip install -r requirements.txt\n```\n\n### **2. Running the Pipeline**\nTo run the pipeline locally, execute the following command:\n```bash\npython main.py --config config.yaml\n```\n\nThis will:\n1. Load and preprocess the customer reviews.\n2. Train the BERTopic model.\n3. Evaluate the model using coherence and diversity metrics.\n4. 
Save the trained model and outputs to the specified directory.\n5. Launch the interactive Dash dashboard.\n\n### **3. Accessing the Dashboard**\nOnce the pipeline is running, the dashboard will be accessible at:\n```\nhttp://localhost:8050\n```\n\n### **4. Docker Deployment**\nTo deploy the application using Docker:\n\n#### **Step 0: Unzip the Reviews Data File**\n```bash\nunzip data/full/full_reviews.zip -d data/full/\n```\n\n#### **Step 1: Build the Docker Image**\n```bash\ndocker build -t bertopic-container .\n```\n\n#### **Step 2: Run the Container**\n```bash\ndocker run -p 8051:8050 \\\n           -v \"$PWD/data/full:/data/full\" \\\n           -v \"$PWD/data/results:/data/results\" \\\n           -v \"$PWD:/app\" \\\n           bertopic-container\n```\n\nThe container's port 8050 is published on host port 8051, so access the dashboard at:\n```\nhttp://localhost:8051\n```\n\n---\n\n## **Evaluation Metrics**\nThe pipeline evaluates the topic model using the following metrics:\n1. **Coherence Scores**:\n   - Measure the interpretability of topics.\n   - Supported metrics:\n      - *Coherence Measure V (`c_v`)*:\n\n        Scores a topic from the normalized pointwise mutual information (NPMI) of its top words, estimated from how often those words co-occur within a sliding window over the corpus and compared using cosine similarity.\n\n        Range: Typically between 0 and 1 (higher is better).\n\n      - *UMass Coherence (`u_mass`)*:\n\n        Measures the log conditional probability of word co-occurrences: how likely one top word is to appear in a document given that another top word appears in it.\n\n        Range: Often negative (closer to zero is better).\n\n      - *Normalized Pointwise Mutual Information (`c_npmi`)*:\n\n        Normalizes the PMI score so that coherence values fall within a fixed range, [-1, 1]. A value of 0 means the top words co-occur no more often than chance would predict, and negative values mean they co-occur less often than chance.\n\n        Range: Between -1 and 1 (higher is better).\n\n2. 
**Topic Diversity**:\n   - Measures the uniqueness of topics by calculating the ratio of unique words to total words among the top words of all topics.\n\n      Range: Between 0 and 1 (higher is better).\n\n---\n\n## **References**\n[Maarten Grootendorst. Leveraging BERT and c-TF-IDF to create easily interpretable topics](https://github.com/MaartenGr/BERTopic)\n\n---\n\n## **License**\nThis project is licensed under the **MIT License**. See the `LICENSE.md` file for details.\n\n---\n\nFeel free to reach out with any questions or suggestions! 🚀\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fmicheldpd24%2Fcust_review_bertopic","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fmicheldpd24%2Fcust_review_bertopic","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fmicheldpd24%2Fcust_review_bertopic/lists"}