{"id":31556934,"url":"https://github.com/supsi-deass-cpps/multilingual_thematic_analysis","last_synced_at":"2025-10-04T23:20:00.330Z","repository":{"id":315258434,"uuid":"1058693381","full_name":"SUPSI-DEASS-CPPS/multilingual_thematic_analysis","owner":"SUPSI-DEASS-CPPS","description":"Modular R pipeline for multilingual survey analysis — translate, embed, cluster, and visualize open-ended responses using Google Cloud and tidyverse tools.","archived":false,"fork":false,"pushed_at":"2025-09-26T09:57:19.000Z","size":166,"stargazers_count":0,"open_issues_count":0,"forks_count":0,"subscribers_count":0,"default_branch":"main","last_synced_at":"2025-09-26T11:42:50.575Z","etag":null,"topics":["clustering","data-visualization","linguistics","multilingual-analysis","natural-language-processing","qualitative-research","r","reproducible-research","social-science","survey-data","text-mining","thematic-analysis","translation"],"latest_commit_sha":null,"homepage":"","language":"R","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/SUPSI-DEASS-CPPS.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":"CITATION.cff","codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null,"zenodo":null,"notice":null,"maintainers":null,"copyright":null,"agents":null,"dco":null,"cla":null}},"created_at":"2025-09-17T12:29:34.000Z","updated_at":"2025-09-26T09:57:22.000Z","dependencies_parsed_at":null,"dependency_job_id":"ca3ccc00-6249-41c7-87f2-8e3f97442682","html_url":"https://github.com/SUPSI-DEASS-CPPS/multilingual_thematic_analysis","commit_stats":null,"previous_names":["supsi-deass-cpps/multilingual_thematic_analysis"],"tags_count":3,"template":fa
lse,"template_full_name":null,"purl":"pkg:github/SUPSI-DEASS-CPPS/multilingual_thematic_analysis","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/SUPSI-DEASS-CPPS%2Fmultilingual_thematic_analysis","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/SUPSI-DEASS-CPPS%2Fmultilingual_thematic_analysis/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/SUPSI-DEASS-CPPS%2Fmultilingual_thematic_analysis/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/SUPSI-DEASS-CPPS%2Fmultilingual_thematic_analysis/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/SUPSI-DEASS-CPPS","download_url":"https://codeload.github.com/SUPSI-DEASS-CPPS/multilingual_thematic_analysis/tar.gz/refs/heads/main","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/SUPSI-DEASS-CPPS%2Fmultilingual_thematic_analysis/sbom","scorecard":null,"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":278386557,"owners_count":25978197,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","status":"online","status_checked_at":"2025-10-04T02:00:05.491Z","response_time":63,"last_error":null,"robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":true,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["clustering","data-visualization","linguistics","multilingual-analysis","natural-language-processing","qualitative-research","r","reproducible-research","social-science","survey-data","text-minin
g","thematic-analysis","translation"],"created_at":"2025-10-04T23:19:57.675Z","updated_at":"2025-10-04T23:20:00.315Z","avatar_url":"https://github.com/SUPSI-DEASS-CPPS.png","language":"R","readme":"# Multilingual Thematic Analysis in R\n\nA modular R pipeline for translating, cleaning, and clustering multilingual survey comments.\n\nThe pipeline combines cloud-powered translation, contextual embeddings, unsupervised clustering, and rich visualizations to uncover thematic insights across languages, with reproducibility, modularity, and privacy at its core.\n\n---\n\n## 🛡️ Badges\n\n![R version](https://img.shields.io/badge/R-≥4.2-blue)\n![License: MIT](https://img.shields.io/badge/License-MIT-green)\n![renv](https://img.shields.io/badge/Reproducible%20Environment-renv-yellow)\n![Last Updated](https://img.shields.io/badge/Last%20Updated-September%202025-orange)\n[![DOI](https://zenodo.org/badge/1058693381.svg)](https://doi.org/10.5281/zenodo.17185563)\n[![GitHub Verified](https://img.shields.io/badge/Verified-GitHub-blue?logo=github)](https://github.com/salvatoremaione)\n[![Verified Identity](https://img.shields.io/badge/Verified-ORCID-green?logo=orcid)](https://orcid.org/0000-0002-5944-2589)\n[![Trusted Maintainer](https://img.shields.io/badge/Trusted%20Maintainer-Keybase-8A2BE2?logo=keybase\u0026logoColor=white)](https://keybase.io/salvatore)  \n\n---\n\n## ⚡ Quickstart Summary\n\nFor impatient users, here’s how to get started in six steps:\n\n1. **Clone this repository**\n   ```bash\n   git clone https://github.com/SUPSI-DEASS-CPPS/multilingual_thematic_analysis.git\n   cd multilingual_thematic_analysis\n   ```\n2. **Install R ≥ `4.2`** and the required packages (see [Prerequisites](#-prerequisites)).\n   \n3. **Set up your .Renviron**\n   ```bash\n   cp .Renviron.example .Renviron\n   ```\n   Edit it with your Google Cloud credentials and restart R.\n\n4. 
**Place your survey file** \n   Save your UTF-8 `.tsv` file as `data/comments.tsv`.\n\n5. **Restore the environment**\n   ```r\n   install.packages(\"renv\")\n   renv::restore()\n   ```\n6. **Run the pipeline**\n   ```r\n   source(\"scripts/00_validate_responses.R\")\n   source(\"scripts/01_load_translate.R\")\n   source(\"scripts/02_contextual_embeddings.R\")\n   source(\"scripts/03_clustering.R\")\n   source(\"scripts/04_visualization.R\")\n   ```\n\n---\n\n## 📑 Table of Contents\n- [Quickstart Summary](#-quickstart-summary)\n- [Overview](#-overview)\n- [Research Context](#-research-context)\n- [Features](#-features)\n- [Project Structure](#-project-structure)\n- [Prerequisites](#-prerequisites)\n- [Input Format Specification](#-input-format-specification)\n- [Configuration Guide](#-configuration-guide)\n- [Script Dependencies](#-script-dependencies)\n- [Installation](#-installation)\n- [Usage](#-usage)\n- [Example Output](#-example-output)\n- [How It Works](#-how-it-works)\n- [Troubleshooting](#-troubleshooting)\n- [Data Privacy Note](#-data-privacy-note)\n- [Limitations](#-limitations)\n- [For Developers](#-for-developers)\n- [License](#-license)\n\n---\n\n## 📚 Overview\n\nThis project provides a workflow for thematic analysis of multilingual text data. It includes:\n\n- Validation of raw input  \n- Translation using the polyglotr package or Google Cloud Translation API  \n- Text preprocessing  \n- Clustering with PCA, UMAP, and HDBSCAN (with a k-means fallback)  \n- Visualization of thematic patterns\n\n---\n\n## 🧠 Research Context\n\nOpen-ended survey responses offer rich qualitative insights, but analyzing them at scale — especially across multiple languages — is notoriously difficult. 
Manual coding is time-consuming, inconsistent, and often biased by language fluency.\n\nThis pipeline addresses that challenge by combining:\n- **Automated translation** to unify multilingual feedback\n- **Contextual embeddings** to capture semantic nuance\n- **Unsupervised clustering** to reveal emergent themes\n- **Visualizations** to communicate findings clearly\n\nIt’s designed for researchers, analysts, and institutions who need to process large volumes of multilingual text data — whether for service evaluations, policy feedback, or global user studies — with reproducibility, transparency, and privacy in mind.\n\n---\n\n## ✨ Features\n\n- Filters missing, short, or corrupted responses  \n- Detects actual language of each comment  \n- Translates comments to English via Google Cloud Translation API  \n- Caches translations and embeddings\n- Generates contextual embeddings via Vertex AI\n- Applies PCA + UMAP + HDBSCAN/KMeans clustering\n- Labels clusters using TF-IDF keywords\n- Generates deterministic wordclouds with ggwordcloud\n\n---\n\n## 📂 Project Structure\n\n```\n├── R/ \n│   └── utils.R \n├── config/ \n│   └── config.yml\n├── data/ \n│   └── raw_survey_data_placeholder\n├── output/\n│   ├── csv/\n│   │   ├── 00_flagged_responses.csv\n│   │   ├── 00_clean_comments.csv\n│   │   ├── 01_translated_comments.csv\n│   │   ├── 02_embedding_summary.csv\n│   │   └── cluster/\n│   │       └── 03_clusters_summary.csv\n│   ├── rds/\n│   │   └── embeddings/\n│   │       └── all_embeddings.rds\n│   └── png/\n│       ├── cluster/\n│       └── wordclouds/\n├── cache/\n├── scripts/\n│   ├── 00_validate_responses.R\n│   ├── 01_load_translate.R\n│   ├── 02_contextual_embeddings.R\n│   ├── 03_clustering.R\n│   └── 04_visualization.R\n├── renv/\n│   └── (local renv infrastructure, ignored by git)\n├── renv.lock\n├── .Renviron.example\n├── .gitignore\n└── multilingual_analysis.Rproj\n```\n\n---\n\n## 🧰 Prerequisites\n\nBefore running the pipeline, ensure you have the 
following:\n\n- **R ≥ 4.2** installed on your system  \n- **RStudio** (recommended for easier script execution and environment management)  \n- **Google Cloud account** with access to:\n  - Cloud Translation API\n  - Vertex AI (for embeddings)\n- **Service account keys** for both APIs saved as `.json` files  \n- **Internet access** to connect to Google Cloud services  \n- **UTF-8 encoded survey file** named `comments.tsv` placed in the `data/` folder  \n- **Environment variables** configured in `.Renviron` (see `.Renviron.example`)  \n- **renv** package installed to restore the project environment:\n  ```r\n  install.packages(\"renv\")\n  renv::restore()\n  ```\n\nThese prerequisites ensure reproducibility, secure API access, and compatibility with the pipeline’s modular structure.\n\n---\n\n## 📄 Input Format Specification\n\nThe pipeline expects a UTF-8 encoded tab-separated file named `comments.tsv` placed in the `data/` folder.\n\n### Required structure:\n\n- File format: `.tsv` (tab-separated values)  \n- Encoding: UTF-8  \n- Columns:\n  - `ResponseId`: Unique identifier for each survey response  \n  - `UserLanguage`: Language code or label (e.g., `en`, `fr`, `de`)  \n  - `Q4.2` to `Q4.10`: Open-ended comment fields (can vary depending on your survey)\n\n### Example layout:\n\n| ResponseId | UserLanguage | Q4.2                   | Q4.3         | Q4.4                     | ... | Q4.10       |\n|------------|--------------|------------------------|--------------|--------------------------|-----|-------------|\n| R_001      | en           | I loved the service.   | Very clean.  | Staff was friendly.      | ... | Will return!|\n| R_002      | it           | Il servizio era ottimo.| Molto pulito.| Il personale era gentile.| ... | Tornerò!    |\n\n\n⚠️ The pipeline assumes that the open-ended questions are labeled as `Q4.2` to `Q4.10`.  
\nIf your survey uses different column names, you’ll need to adjust the `comment_cols` variable in the scripts or update the configuration file accordingly.\n\n---\n\n## ⚙️ Configuration Guide\n\nThe pipeline uses a centralized configuration file located at:\n\n```\nconfig/config.yml\n```\n\nThis file controls all major parameters across the scripts, making the pipeline easy to customize without editing code.\n\n### Key sections:\n\n- `paths`: Input/output directories for data, results, and cache  \n- `translation`: Minimum comment length, max allowed issues, and environment variable for API key  \n- `embeddings`: Model ID, batch size, timeout, and environment variables for Vertex AI  \n- `clustering`: UMAP dimensions, minimum documents, clustering method settings, and plot toggle  \n- `visualization`: Wordcloud settings including max words, stopwords, colors, and export formats\n- `question_stopwords`: Question-specific stopwords\n\n### Example snippet:\n\n```yaml\ntranslation:\n  min_comment_length: 10\n  max_allowed_issues: 2\n  google_key_env: \"GOOGLE_TRANSLATE_KEY\"\n\nclustering:\n  min_docs: 5\n  min_pts: 5\n  umap_dims: 50\n  max_kmeans_k: 10\n  make_plots: true\n```\n\nCustomization tips:\n- To change the number of clusters, adjust `max_kmeans_k`\n- To skip wordcloud generation, set `make_plots: false`\n- To use a different embedding model, update `model_id` under `embeddings`\n- To add more stopwords, edit `custom_stopwords` and `regex_stopwords` under `visualization`\n\n⚠️ After modifying `config.yml`, re-run the affected scripts to apply changes.\n\n---\n\n## 🔗 Script Dependencies\n\nThe pipeline is modular but sequential — each script depends on the outputs of the previous one.\n\n### Execution flow:\n```\n00_validate_responses.R\n↓ 01_load_translate.R\n↓ 02_contextual_embeddings.R\n↓ 03_clustering.R\n↓ 04_visualization.R\n```\n\n### Dependency map:\n\n- `00_validate_responses.R`  \n  → Reads `data/comments.tsv`  \n  → Outputs `00_clean_comments.csv` 
and `00_flagged_responses.csv`\n\n- `01_load_translate.R`  \n  → Reads `00_clean_comments.csv`  \n  → Outputs `01_translated_comments.csv`\n\n- `02_contextual_embeddings.R`  \n  → Reads `01_translated_comments.csv`  \n  → Outputs `all_embeddings.rds`, `all_embeddings.csv`, and `02_embedding_summary.csv`\n\n- `03_clustering.R`  \n  → Reads `all_embeddings.rds` and `01_translated_comments.csv`  \n  → Outputs cluster CSVs, UMAP plots, and `03_clusters_summary.csv`\n\n- `04_visualization.R`  \n  → Reads cluster CSVs  \n  → Outputs wordclouds in PNG and HTML formats\n\n⚠️ If you modify any intermediate output (e.g., translated comments), re-run the downstream scripts to reflect those changes.\n\n---\n\n## 🚀 Installation\n\n1. **Clone the repository**\n\t```bash\n\tgit clone https://github.com/SUPSI-DEASS-CPPS/multilingual_thematic_analysis.git\n\tcd multilingual_thematic_analysis\n\t```\n\n2. **Set up environment variables**\n   - Copy `.Renviron.example` to `.Renviron`:\n     ```bash\n     cp .Renviron.example .Renviron\n     ```\n   - Edit `.Renviron` and add your Google Cloud credentials:\n     ```bash\n     GOOGLE_TRANSLATE_KEY=/path/to/google-translate-key.json\n     GOOGLE_VERTEX_KEY=/path/to/google-vertex-key.json\n     GCP_PROJECT=your-gcp-project-id\n     ```\n   - Restart R or RStudio to apply changes.\n\n3. **Restore the R environment**\n   - This project uses [renv](https://rstudio.github.io/renv/) for reproducible package management.\n   - In R, run:\n     ```r\n     install.packages(\"renv\")  # if not already installed\n     renv::restore()\n     ```\n   - This installs the exact package versions listed in `renv.lock`.\n\n4. 
**Open the project**\n   - Use R or RStudio in the cloned project directory.\n\n---\n\n## 🛠 Usage\n\n### Step 1: Prepare your data\n\nExport your survey as a UTF-8 encoded `.tsv` file, with line breaks preserved and answers recorded as choice text rather than numeric codes.\n\nSave it as:\n\n```bash\ndata/comments.tsv\n```\n\n⚠️ The `data/` folder is ignored by Git to protect sensitive data. A placeholder file (`raw_survey_data_placeholder`) is included so the folder exists in the repo. Replace it with your actual survey file, named `comments.tsv`.\n\n### Step 2: Set up Google Cloud Translation API\n\n1. Go to [Google Cloud Console](https://console.cloud.google.com/)  \n2. Create a new project (e.g., `MyProject`)  \n3. Enable the **Cloud Translation API** under APIs \u0026 Services \u003e Library  \n4. Go to **APIs \u0026 Services \u003e Credentials**  \n5. Click **Create Credentials \u003e Service Account**  \n6. Name it (e.g., `translation-service`) and click **Done**  \n7. In the Service Account list, click your new account  \n8. Go to the **Keys** tab  \n9. Click **Add Key \u003e Create new key**, choose **JSON**, and download the file  \n10. Move the file to a secure location, e.g.: `~/.gcloud/translation-key.json`\n\n### Step 3: Configure environment variables\n\nUse the `.Renviron` file to store your keys and project ID.\nCopy `.Renviron.example` to `.Renviron` and edit it with your values:\n\n```bash\nGOOGLE_TRANSLATE_KEY=/path/to/google-translate-key.json\nGOOGLE_VERTEX_KEY=/path/to/google-vertex-key.json\nGCP_PROJECT=your-gcp-project-id\n```\n\nRestart R or RStudio after editing `.Renviron`.\n\n### Step 4: Install dependencies with renv\n\nThis project uses [renv](https://rstudio.github.io/renv/) for reproducible environments.\n\n1. Install renv if not already installed:\n\n```r\ninstall.packages(\"renv\")\n```\n\n2. Initialize or restore the environment:\n\n```r\nrenv::restore()\n```\n\n3. 
This installs the exact package versions listed in `renv.lock`.\n\n### Step 5: Run the pipeline\n\n```r\nsource(\"scripts/00_validate_responses.R\")\nsource(\"scripts/01_load_translate.R\")\nsource(\"scripts/02_contextual_embeddings.R\")\nsource(\"scripts/03_clustering.R\")\nsource(\"scripts/04_visualization.R\")\n```\n\nThe translation script will:\n- Detect the actual language of each comment;\n- Retry failed translations up to 3 times;\n- Skip short or non-linguistic entries;\n- Translate using the correct source language;\n- Cache translations to avoid redundant API calls;\n- Display a progress bar with estimated time remaining.\n\n---\n\n## 📂 Example Output\n\nAfter running the full pipeline, you’ll find the following outputs:\n\n- `output/csv/00_flagged_responses.csv`  \n  → Responses flagged for being missing, too short, or corrupted\n\n- `output/csv/00_clean_comments.csv`  \n  → Validated and cleaned comments ready for translation\n\n- `output/csv/01_translated_comments.csv`  \n  → Comments translated to English using Google Cloud Translation API\n\n- `output/rds/embeddings/all_embeddings.rds`  \n  → Combined contextual embeddings for all questions\n\n- `output/csv/embeddings/all_embeddings.csv`  \n  → Embeddings in CSV format for inspection or reuse\n\n- `output/csv/02_embedding_summary.csv`  \n  → Summary of embedding dimensions and valid rows per question\n\n- `output/csv/cluster/03_clusters_summary.csv`  \n  → Cluster labels, methods used, and silhouette scores\n\n- `output/csv/cluster/clusters_Q4.X_translated.csv`  \n  → Cluster assignments and labels for each question (X = 2 to 10)\n\n- `output/png/wordclouds/04_wordclouds_Q4.X_translated_combined.png`  \n  → Wordclouds summarizing cluster themes\n  \n---\n\n## 🔍 How It Works\n\nThe pipeline consists of five modular R scripts, each performing a distinct stage of multilingual thematic analysis:\n\n1. 
**Validation** (`00_validate_responses.R`)  \n   - Loads raw survey data from `data/comments.tsv`  \n   - Flags missing, short, or corrupted responses  \n   - Filters out low-quality entries  \n   - Outputs `00_clean_comments.csv` and `00_flagged_responses.csv`\n\n2. **Translation** (`01_load_translate.R`)  \n   - Detects the actual language of each comment  \n   - Translates comments to English using the Google Cloud Translation API  \n   - Caches results to avoid redundant API calls  \n   - Outputs `01_translated_comments.csv`\n\n3. **Embedding** (`02_contextual_embeddings.R`)  \n   - Generates contextual embeddings using Google Vertex AI  \n   - Saves per-question and combined embeddings in `.rds` and `.csv` formats  \n   - Outputs `02_embedding_summary.csv` and `all_embeddings.rds`\n\n4. **Clustering** (`03_clustering.R`)  \n   - Reduces embedding dimensions with PCA and UMAP  \n   - Applies HDBSCAN clustering (with KMeans fallback)  \n   - Labels clusters using TF-IDF keywords  \n   - Outputs cluster assignments and visual plots  \n   - Saves `03_clusters_summary.csv` and per-question cluster files\n\n5. 
**Visualization** (`04_visualization.R`)\n   - Cleans and tokenizes text using `stringi` + `quanteda`\n   - Applies both global and question‑specific stopwords (from `config.yml`)\n   - Optionally removes regex‑based stopwords\n   - Aggregates token frequencies and scales weights deterministically\n   - Generates wordclouds for each cluster\n   - Saves outputs in `output/png/wordclouds/`\n\nEach script is standalone and can be run independently, but they are designed to work sequentially for full pipeline execution.\n\n---\n\n## 🧯 Troubleshooting\n\nHere are common issues you might encounter when running the pipeline, along with suggested fixes:\n\n### 🔑 Missing API key or environment variable\n**Error:** `Error: GOOGLE_TRANSLATE_KEY not found`  \n**Fix:**  \n- Ensure `.Renviron` exists in your project root  \n- Add the correct path to your translation key:\n  ```bash\n  GOOGLE_TRANSLATE_KEY=/path/to/google-translate-key.json\n  ```\n- Restart R or RStudio to reload environment variables\n\n### 📦 renv restore fails or hangs\n**Error:** Packages not installing, or `renv::restore()` fails  \n**Fix:**  \n- Ensure you have internet access\n- Run `install.packages(\"renv\")` before restoring\n- Try `renv::diagnostics()` for detailed troubleshooting\n- If needed, delete `renv/` and re-run `renv::init()` followed by `renv::restore()`\n\n### 🌐 Translation API quota exceeded\n**Error:** `403: Quota exceeded` or `API key invalid`  \n**Fix:**  \n- Check your Google Cloud billing and quota settings\n- Ensure the Translation API is enabled for your project\n- Verify that your service account has the correct permissions\n\n### 🧠 Vertex AI embedding timeout\n**Error:** `Request timed out` or `Embedding failed`  \n**Fix:**  \n- Increase `timeout_seconds` in `config.yml` under `embeddings`\n- Reduce `batch_size` to avoid overloading the API\n- Ensure your Vertex AI key and project ID are correctly set in `.Renviron`\n\n### 📁 File not found\n**Error:** `Error in read_csv: file does not 
exist`  \n**Fix:**  \n- Ensure the expected input file (`comments.tsv`) is placed in the `data/` folder\n- Check that filenames match exactly (case-sensitive)\n- Re-run the previous script to regenerate missing outputs\n\n### 📊 Empty or invalid output\n**Error:** Wordclouds or plots are blank  \n**Fix:**  \n- Check that your input data contains enough valid comments\n- Adjust `min_comment_length`, `min_docs`, or `min_freq` in `config.yml`\n- Review logs printed by each script for warnings or skipped entries\n\nIf you encounter other issues, try running each script individually and inspecting intermediate outputs in the `output/` folder.\n\n---\n\n## 🔒 Data Privacy Note\n\nThis pipeline is designed to respect the privacy of survey respondents and prevent accidental exposure of sensitive data.\n\n### Key safeguards:\n\n- The `data/` folder is excluded from version control via `.gitignore`  \n- A placeholder file (`raw_survey_data_placeholder`) is included to preserve folder structure  \n- Raw survey files (e.g., `comments.tsv`) should be stored locally and never committed to GitHub  \n- Translated and cleaned outputs are stored in `output/`, which is also ignored by default  \n- API keys and credentials are stored in `.Renviron`, which is excluded from GitHub  \n- A `.Renviron.example` file is provided to guide collaborators without exposing secrets\n\n⚠️ Always review your data before sharing outputs or publishing results.  
\nIf working with personally identifiable information (PII), consider additional anonymization steps before analysis.\n\n---\n\n## ⚠️ Limitations\n\n- Requires internet access for API queries to Google Cloud Translation and Vertex AI  \n- Google Cloud Translation API may require billing setup for high-volume usage  \n- Embedding generation via Vertex AI may incur costs depending on your quota and usage  \n- Language detection may be less accurate for short, ambiguous, or mixed-language comments  \n- Clustering performance depends on the quality and diversity of responses  \n- Wordclouds may overrepresent common filler terms if stopwords are not fully filtered  \n- Scripts assume a consistent survey structure (e.g., Q4.2 to Q4.10); customization may be needed for other formats\n\n---\n\n## 🧑‍💻 For Developers\n\nThis pipeline is modular by design and easy to extend. Here are a few ways developers can build on it:\n\n### 🔄 Swap out embedding models\n- Replace Vertex AI with Hugging Face models (e.g., `sentence-transformers`)  \n- Adjust `02_contextual_embeddings.R` to use local or open-source alternatives\n\n### 💬 Add sentiment analysis\n- Integrate sentiment scoring after translation  \n- Use packages like `textdata`, `syuzhet`, or `sentimentr`  \n- Visualize sentiment trends alongside thematic clusters\n\n### 🧪 Customize clustering\n- Try alternative methods like DBSCAN, hierarchical clustering, or topic modeling (LDA)  \n- Tune `min_pts`, `umap_dims`, or `max_kmeans_k` in `config.yml`\n\n### 🌍 Support other survey formats\n- Modify `comment_cols` to match different question labels  \n- Add logic to handle matrix-style or nested responses\n\n### 📦 Package the pipeline\n- Convert scripts into an R package with exported functions  \n- Add unit tests and vignettes for reproducibility\n\nIf you build on this pipeline, feel free to fork the repo or open a discussion to share your improvements.\n\n---\n\n## 📜 License\n\nThis project is licensed under the [MIT 
License](LICENSE). For questions or collaboration inquiries, feel free to reach out via GitHub Issues.\n\n**Acknowledgment:**  \nDocumentation and refactoring support were assisted by Microsoft Copilot, an AI companion that helped streamline the pipeline and improve reproducibility.\n","funding_links":[],"categories":[],"sub_categories":[],"project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fsupsi-deass-cpps%2Fmultilingual_thematic_analysis","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fsupsi-deass-cpps%2Fmultilingual_thematic_analysis","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fsupsi-deass-cpps%2Fmultilingual_thematic_analysis/lists"}