{"id":50677241,"url":"https://github.com/ash-datapro/sa-stack","last_synced_at":"2026-06-08T16:05:56.641Z","repository":{"id":331295300,"uuid":"1126068646","full_name":"ash-datapro/sa-stack","owner":"ash-datapro","description":"A production-style NLP pipeline from data ingestion to model serving.","archived":false,"fork":false,"pushed_at":"2026-01-01T04:01:59.000Z","size":108305,"stargazers_count":0,"open_issues_count":0,"forks_count":0,"subscribers_count":1,"default_branch":"main","last_synced_at":"2026-01-05T20:33:27.702Z","etag":null,"topics":["docker","evaluation-metrics","mlops","model-serving","plumber","postgresql","reproducible-research","sentiment-analysis","shiny","sql","text-classification","tidymodels"],"latest_commit_sha":null,"homepage":"https://github.com/ash-datapro/sa-stack/blob/main/media/demo.gif","language":"R","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/ash-datapro.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null,"zenodo":null,"notice":null,"maintainers":null,"copyright":null,"agents":null,"dco":null,"cla":null}},"created_at":"2026-01-01T02:43:08.000Z","updated_at":"2026-01-01T03:59:14.000Z","dependencies_parsed_at":null,"dependency_job_id":null,"html_url":"https://github.com/ash-datapro/sa-stack","commit_stats":null,"previous_names":["ash-datapro/sa-stack"],"tags_count":1,"template":false,"template_full_name":null,"purl":"pkg:github/ash-datapro/sa-stack","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/ash-datapro%2Fsa-stack","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories
/ash-datapro%2Fsa-stack/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/ash-datapro%2Fsa-stack/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/ash-datapro%2Fsa-stack/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/ash-datapro","download_url":"https://codeload.github.com/ash-datapro/sa-stack/tar.gz/refs/heads/main","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/ash-datapro%2Fsa-stack/sbom","scorecard":null,"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":286080680,"owners_count":34069529,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2026-05-26T15:22:16.424Z","status":"online","status_checked_at":"2026-06-08T02:00:07.615Z","response_time":111,"last_error":null,"robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":true,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["docker","evaluation-metrics","mlops","model-serving","plumber","postgresql","reproducible-research","sentiment-analysis","shiny","sql","text-classification","tidymodels"],"created_at":"2026-06-08T16:05:56.543Z","updated_at":"2026-06-08T16:05:56.633Z","avatar_url":"https://github.com/ash-datapro.png","language":"R","funding_links":[],"categories":[],"sub_categories":[],"readme":"# Sentiment Analysis (R + Shiny + Plumber + tidymodels)\n\n\u003cp align=\"center\"\u003e\n  \u003cimg src=\"media/demo.gif\" width=\"100%\" alt=\"Sentiment Explorer demo\"\u003e\n\u003c/p\u003e\n\n\u003cp align=\"center\"\u003e\n  \u003ca 
href=\"https://www.r-project.org/\"\u003e\u003cimg alt=\"R\" src=\"https://img.shields.io/badge/R-276DC3?style=for-the-badge\u0026logo=r\u0026logoColor=white\"\u003e\u003c/a\u003e\n  \u003ca href=\"https://www.tidymodels.org/\"\u003e\u003cimg alt=\"tidymodels\" src=\"https://img.shields.io/badge/tidymodels-0F766E?style=for-the-badge\u0026logo=rstudio\u0026logoColor=white\"\u003e\u003c/a\u003e\n  \u003ca href=\"https://www.rplumber.io/\"\u003e\u003cimg alt=\"Plumber\" src=\"https://img.shields.io/badge/Plumber-111827?style=for-the-badge\u0026logo=r\u0026logoColor=white\"\u003e\u003c/a\u003e\n  \u003ca href=\"https://shiny.posit.co/\"\u003e\u003cimg alt=\"Shiny\" src=\"https://img.shields.io/badge/Shiny-0B2A4C?style=for-the-badge\u0026logo=r\u0026logoColor=white\"\u003e\u003c/a\u003e\n  \u003ca href=\"https://www.docker.com/\"\u003e\u003cimg alt=\"Docker\" src=\"https://img.shields.io/badge/Docker-2496ED?style=for-the-badge\u0026logo=docker\u0026logoColor=white\"\u003e\u003c/a\u003e\n  \u003ca href=\"https://www.postgresql.org/\"\u003e\u003cimg alt=\"PostgreSQL\" src=\"https://img.shields.io/badge/PostgreSQL-316192?style=for-the-badge\u0026logo=postgresql\u0026logoColor=white\"\u003e\u003c/a\u003e\n  \u003ca href=\"https://en.wikipedia.org/wiki/SQL\"\u003e\u003cimg alt=\"SQL\" src=\"https://img.shields.io/badge/SQL-025E8C?style=for-the-badge\u0026logo=databricks\u0026logoColor=white\"\u003e\u003c/a\u003e\n\u003c/p\u003e\n\n**Sentiment Analysis Stack** is a production-style, end-to-end sentiment analysis project that pairs:\n\n- **A clean, user-friendly Shiny dashboard** for exploring the Stanford Sentiment Treebank (SST) and scoring text.\n- **A Plumber backend** that serves a serialized **tidymodels** bundle (`model.rds`) for consistent, repeatable scoring.\n- **Training artifacts** (reports + plots) saved alongside the model so evaluation is not an afterthought.\n\n---\n\n## Dataset snapshot (SST / Treebank)\n\nA quick, analyst-style summary of the dataset used in 
this project:\n\n1. **Two complementary views of sentiment**\n   - *Sentences* (human-readable utterances) and *phrases* (fine-grained fragments) let you analyze sentiment at different granularities.\n\n2. **Broad coverage of text length and structure**\n   - The corpus includes short fragments through longer sentences, which is useful for stress-testing robustness (very short text is often hardest to score reliably).\n\n3. **Labels are inherently “soft”**\n   - Sentiment is represented on a continuous scale in the source data and often binned into classes (binary or 5-class). Borderline examples are expected—uncertainty is a feature of the dataset, not a bug.\n\n\u003cp align=\"center\"\u003e\n  \u003cimg src=\"media/data_snapshot.png\" width=\"85%\" alt=\"SST dataset snapshot\"\u003e\n\u003c/p\u003e\n\n\u003e **Plot:** Sentiment score distribution (with optional binning overlays).  \n\n---\n\n## Why this project matters\n\nMost “demo ML apps” fail in the same places: inconsistent preprocessing between training and serving, unclear label semantics, and no reproducible evaluation trail.\n\nThis project aims to be the opposite:\n\n- **One pipeline, everywhere**: the same recipe/feature steps used in training are embedded in the saved bundle and reused at scoring time.\n- **Clear contract**: the model bundle defines expected inputs and output schema (labels, scores), so downstream code stays stable.\n- **Evaluation-first**: model training produces metrics + plots that travel with the model and can be reviewed alongside the UI.\n- **Separation of concerns**: data exploration and scoring UX live in Shiny; model execution and versioning live in the backend.\n\n---\n\n## Architecture\n\n```text\n          ┌──────────────────────────────┐\n          │        Shiny Frontend        │\n          │  Overview • Exploration      │\n          │  + text scoring UI           │\n          └──────────────┬───────────────┘\n                         │\n                         │ (HTTP 
JSON)\n                         ▼\n          ┌──────────────────────────────┐\n          │        Plumber Backend        │\n          │  /health • /meta • scoring    │\n          │  loads backend/api/model.rds  │\n          └──────────────┬───────────────┘\n                         │\n                         ▼\n          ┌──────────────────────────────┐\n          │   Training + Artifacts        │\n          │   reports/ (metrics + plots)  │\n          │   model.rds (tidymodels)      │\n          └──────────────────────────────┘\n\n      Data: data/sst_treebank.rds (+ optional db scripts in postgresql/)\n```\n\n---\n\n## Features\n\n### Dashboard (Shiny)\n\n* **Overview**: label distribution, score distribution, length distributions, KPI cards, filtering by unit/split/label/score range.\n* **Exploration**: interactive scatter (score vs length), split breakdown, preview table.\n* **Scoring**: load a model bundle and score input text with a friendly UI (no “API jargon” needed).\n* **Downloads**: export filtered data to CSV for analysis.\n\n### Backend (Plumber + tidymodels)\n\n* Loads a **saved model bundle** (`model.rds`) that contains:\n\n  * the fitted model/workflow\n  * preprocessing recipe / tokenization steps\n  * output schema (label levels) and optional threshold metadata\n* Serves metadata and scoring in a stable, testable interface.\n\n### Reports\n\n* Training saves **metrics + plots** under `backend/reports/` so you can inspect performance without rerunning notebooks.\n\n---\n\n## Repository layout\n\n```text\nsentiment/\n  backend/\n    api/\n      Dockerfile\n      model.rds\n      ... (plumber entry + training code)\n    reports/\n      ... (metrics + plots)\n    run-docker/\n\n  data/\n    eda/\n    input-data/\n    sst_treebank.rds\n\n  frontend/\n    app.R\n    Dockerfile\n    requirements\n    test-frontend.R\n    R/\n      about.R\n      ... 
(other UI modules)\n\n  media/\n    demo.gif\n\n  postgresql/\n    00-create-db-steps\n    01-load-sst-db.R\n    02-create-schema.sql\n    03-grant-schema.sql\n\n  docker-compose.yml\n```\n\n---\n\n## Getting started (local)\n\n### Prerequisites\n\n* **R** ≥ 4.3\n* Recommended: **RStudio**\n* Optional: **Docker + docker-compose**\n\n### 1) Run the backend\n\nFrom `backend/api/`, start the Plumber service (exact script name may vary in your repo):\n\n```r\n# backend/api\n# source(\"main.R\") or equivalent plumber entry\n# pr$run(host = \"0.0.0.0\", port = 8000)\n```\n\n### 2) Run the frontend\n\nFrom the repo root:\n\n```r\nsetwd(\"frontend\")\nshiny::runApp(\".\", host = \"0.0.0.0\", port = 8501)\n```\n\nThen open:\n\n* Frontend: `http://127.0.0.1:8501`\n\n---\n\n## Run with Docker Compose\n\nFrom the repo root:\n\n```sh\ndocker compose up --build\n```\n\nExpected services:\n\n* `backend` on `:8000`\n* `frontend` on `:8501`\n\n\u003e If you’re running locally without Docker, keep the backend base URL as `http://127.0.0.1:8000`.\n\n---\n\n## Machine learning behind the scenes\n\nThis repo is intentionally built around a few production-grade ideas:\n\n1. **A saved bundle, not just a model**\n\n   * The deployable artifact is `model.rds`, which includes preprocessing + model + schema metadata.\n   * That means you don’t “recreate features” in the UI or API; you reuse the same pipeline.\n\n2. **Schema-driven outputs**\n\n   * The bundle advertises label levels (binary or multi-class), so the UI and backend can render consistently.\n   * Thresholding is treated as a first-class concept (when applicable), not a hidden constant.\n\n3. **Artifacts travel with the model**\n\n   * Reports/plots in `backend/reports/` are generated at training time and kept for review.\n   * This supports practical workflows like “promote a model version only when artifacts look good.”\n\n4. 
**UI stays user-friendly**\n\n   * The frontend avoids exposing internal endpoint names or ML plumbing.\n   * Users see “score text”, distributions, and explainable outputs (label + confidence).\n\n---\n\n## Common issues\n\n* **Nothing loads / empty charts**\n\n  * Confirm `data/sst_treebank.rds` exists (or update the default path in `frontend/app.R`).\n* **Model won’t load**\n\n  * Ensure `backend/api/model.rds` exists and matches the expected bundle structure (has a scoring function / workflow + schema).\n* **Docker networking**\n\n  * Inside compose, services must refer to each other by service name (e.g., `http://backend:8000`), not `127.0.0.1`.\n\n---\n\n## Roadmap\n\n* Add lightweight model cards (data source, evaluation summary, known limitations).\n* Add batch scoring UX (CSV upload + download scored results) once stability is locked in.\n* Optional: add database-backed exploration using the `postgresql/` scripts.\n\n---\n\n### Built with\n\n* R, Shiny, bslib (Bootstrap 5)\n* tidymodels (workflows, recipes, parsnip)\n* Plumber\n* plotly, DT, dplyr, stringr\n* Docker / docker-compose\n\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fash-datapro%2Fsa-stack","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fash-datapro%2Fsa-stack","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fash-datapro%2Fsa-stack/lists"}