{"id":50933471,"url":"https://github.com/belajarcarabelajar/rasalytics","last_synced_at":"2026-06-17T06:32:55.028Z","repository":{"id":363586436,"uuid":"1263976930","full_name":"belajarcarabelajar/rasalytics","owner":"belajarcarabelajar","description":"A powerful YouTube comments scraper and hybrid sentiment analyzer specifically tuned for English and Indonesian languages.","archived":false,"fork":false,"pushed_at":"2026-06-12T01:34:07.000Z","size":204,"stargazers_count":0,"open_issues_count":0,"forks_count":0,"subscribers_count":0,"default_branch":"master","last_synced_at":"2026-06-12T03:12:16.492Z","etag":null,"topics":["bun","machine-learning","ollama","sentiment-analysis","typescript","youtube-comments"],"latest_commit_sha":null,"homepage":null,"language":"TypeScript","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/belajarcarabelajar.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":null,"code_of_conduct":null,"threat_model":null,"audit":"audit-reports/audit-findings.json","citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null,"zenodo":null,"notice":null,"maintainers":null,"copyright":null,"agents":null,"dco":null,"cla":null}},"created_at":"2026-06-09T12:51:07.000Z","updated_at":"2026-06-12T01:34:12.000Z","dependencies_parsed_at":null,"dependency_job_id":null,"html_url":"https://github.com/belajarcarabelajar/rasalytics","commit_stats":null,"previous_names":["belajarcarabelajar/youtube-comments-scraper","belajarcarabelajar/rasalytics"],"tags_count":null,"template":false,"template_full_name":null,"purl":"pkg:github/belajarcarabelajar/rasalytics","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/belajarcarabelajar%2Frasalytics","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/belajarcarabelajar%2Frasalytics/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/belajarcarabelajar%2Frasalytics/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/belajarcarabelajar%2Frasalytics/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/belajarcarabelajar","download_url":"https://codeload.github.com/belajarcarabelajar/rasalytics/tar.gz/refs/heads/master","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/belajarcarabelajar%2Frasalytics/sbom","scorecard":null,"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":286080680,"owners_count":34437449,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2026-05-26T15:22:16.424Z","status":"online","status_checked_at":"2026-06-17T02:00:05.408Z","response_time":127,"last_error":null,"robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":true,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["bun","machine-learning","ollama","sentiment-analysis","typescript","youtube-comments"],"created_at":"2026-06-17T06:32:54.177Z","updated_at":"2026-06-17T06:32:55.023Z","avatar_url":"https://github.com/belajarcarabelajar.png","language":"TypeScript","funding_links":[],"categories":[],"sub_categories":[],"readme":"# Rasalytics\nA powerful YouTube comments scraper and hybrid sentiment analyzer specifically tuned for English and Indonesian languages.\n\n![Version](https://img.shields.io/badge/version-1.0.0-blue)\n![Bun Version](https://img.shields.io/badge/Bun-v1.3.14-black?logo=bun)\n![TypeScript](https://img.shields.io/badge/TypeScript-5.0-blue?logo=typescript)\n[![License: MIT](https://img.shields.io/badge/License-MIT-yellow.svg)](https://opensource.org/licenses/MIT)\n\n\u003e **Demo / Screenshot**\n\u003e ![Analysis Summary Demo](./demo_summary.jpg)\n\n## Description\nAnalyzing YouTube comments manually can be overwhelming, especially for videos with thousands of interactions. This tool automates the extraction and analysis of YouTube comments, providing deep, actionable insights into audience sentiment. It combines a lexicon-based approach, localized HuggingFace transformer models (SST-2 for English, BERT for Indonesian), and local Ollama Qwen2.5 for accuracy verification. Built-in spam and toxicity filters ensure the resulting data is clean and highly relevant.\n\n## 🌍 Philosophy, Mission, \u0026 Societal Impact\n\n*“Technology without philosophy is just a tool; but technology driven by a profound mission is a catalyst for societal change.”*\n\nWhile sentiment analysis is heavily utilized in the corporate world for brand monitoring and market research, **this repository is built upon a radically different philosophy: democratizing data for political transparency and social accountability.** \n\nIn the modern digital era—where algorithms curate echo chambers and public opinion is easily manipulated—open-source analytical tools must step up to serve the broader society. Our mission focuses on the following pillars:\n\n1. **Defending Digital Democracy \u0026 Transparency**\n   Political discourse on platforms like YouTube is often obscured by algorithmic bias, making it difficult to gauge true public sentiment. This tool empowers citizens, independent journalists, and researchers to bypass \"filter bubbles\" and transparently audit how political campaigns, policies, or figures are actually being received by the public.\n   \n2. **Combatting Astroturfing \u0026 Organized Manipulation (Buzzers)**\n   Political propaganda frequently relies on engineered toxicity and inorganic spam (e.g., coordinated *buzzer* attacks or bot farms) to drown out genuine debate. By integrating rigorous spam and toxicity detection, this tool aims to separate organic citizen feedback from paid manipulation, providing a clearer picture of authentic public discourse.\n\n3. **Mitigating Societal Polarization**\n   Echo chambers thrive on extreme sentiments. By openly mapping and quantifying the spectrum of opinions (Positive, Negative, Mixed, Neutral), we aim to provide objective data that cools down hyper-polarized debates. When society can see the *data-driven reality* of a discussion, it prevents the loudest, most toxic voices from dictating the political narrative.\n\nUltimately, this project is not just a technological achievement in machine learning; it is a **grassroots, open-source movement**. We aim to equip society with the same powerful analytical capabilities once reserved for massive tech conglomerates and political elites, ensuring that the digital public square remains accountable, transparent, and democratic.\n\n## Features\n- **Data Scraping**: Fetches top-level comments and replies using the official YouTube Data API v3.\n- **Hybrid Sentiment Analysis**: Uses HuggingFace Transformers (SST-2 for English, BERT-multilingual for Indonesian), Lexicon-based fallbacks, and Ollama Qwen2.5 for accuracy verification.\n- **Spam \u0026 Toxicity Detection**: Built-in detection for spam URLs/keywords and toxic vocabulary.\n- **Rich Markdown Reports**: Generates a detailed report with actionable insights, summary metrics, and full data export to markdown.\n\n## Tech Stack\n- **Runtime**: [Bun](https://bun.sh) \u0026 TypeScript\n- **Machine Learning**: `@xenova/transformers`, `sentiment` (Lexicon), Local Ollama (qwen2.5:1.5b)\n- **Language Detection \u0026 Preprocessing**: `franc-min`, `emoji-emotion`\n\n## Prerequisites\n- **Bun**: v1.0 or higher.\n- **Ollama**: Running locally with the `qwen2.5:1.5b` model (`ollama run qwen2.5:1.5b`).\n- **YouTube Data API Key**: A valid API key from Google Cloud Console.\n\n## Installation\n1. Clone the repository:\n   ```bash\n   git clone https://github.com/belajarcarabelajar/rasalytics.git\n   cd rasalytics\n   ```\n2. Install dependencies:\n   ```bash\n   bun install\n   ```\n\n## Configuration\nCreate a `.env` file in the root directory and add your YouTube API Key. You can use the `.env.example` file as a template:\n```env\nYOUTUBE_API_KEY=YOUR_YOUTUBE_API_KEY_HERE\n```\n\u003e **Warning**: Do not commit the `.env` file or any real API keys. It is already added to `.gitignore`. Users should create and configure their own YouTube API key via the Google Cloud Console.\n\n## AI Agents / Skills Setup\nIf you are using AI coding assistants (like Cursor, Cline, Claude Code, or Antigravity), this repository contains a unified Single Source of Truth (SSOT) for agent skills. To automatically set up the rules and symlinks for your AI providers, run:\n\n```bash\nbun run setup:skills\n```\nThis will automatically configure `.cursorrules`, `.clinerules`, and the necessary `.agents` and `.claude` directories.\n\n## Cloudflare Website Deployment\nThe project includes a deployable website version using **Cloudflare Pages** (frontend) and **Cloudflare Workers** (backend API). \n**Note:** The Cloudflare API uses an *edge-safe* shared sentiment module that relies exclusively on deterministic lexicon-based and statistical checks (no heavy generative AI or Ollama dependencies) to ensure fast cold starts and security.\n\n### Environment Setup\nUpdate your `.env` (or set via `wrangler` and Cloudflare Dashboard):\n- **Local:** `YOUTUBE_API_KEY` for CLI.\n- **Worker (Backend):** No API keys required for the edge-safe sentiment API. \n- **Pages (Frontend):** Set public variables like `VITE_API_URL` if building a complex frontend framework. The current vanilla HTML setup connects to the API automatically.\n\n### Manual Deployment Steps\n1. Login to Cloudflare:\n   ```bash\n   bunx wrangler login\n   ```\n2. Run the deployment script:\n   ```bash\n   bun run deploy:website\n   # or manually: bash scripts/deploy-website.sh\n   ```\n   \nThis script will:\n- Validate `bun` and `wrangler` installation.\n- Deploy the Worker API (`rasalytics-api`) to Cloudflare Workers.\n- Deploy the static frontend to Cloudflare Pages (`rasalytics-web`).\n\n### Security Notes \u0026 Limitations\n- **ML-Only Limitation:** The Cloudflare Worker API does NOT use `@xenova/transformers` or `Ollama` generative AI due to edge limits and cold starts. It uses a lightweight, deterministic lexicon and rule-based approach.\n- Do NOT commit real secrets to the repository. Use `wrangler secret put \u003cNAME\u003e` for backend secrets.\n- `local_models/` and other offline artifacts are safely excluded from the website deployment.\n\n## Usage\nRun the script by providing a YouTube Video ID. You can also specify the maximum number of comment pages to fetch (default is 5).\n\n```bash\nbun run src/index.ts --videoId=5bKxkW_z408 --maxPages=2\n```\n\n### Example Output (Terminal)\n```text\nStarting comment collection for Video ID: 5bKxkW_z408...\n\n=== SENTIMENT RECAP ===\nMacro F1 requirement: Check test suite (rtk bun test)\nTotal Comments: 125\nPositive: 80\nNegative: 15\nNeutral: 20\nMixed: 0\nSpam: 8\nToxic: 2\n=======================\nFull markdown report saved to: /mnt/c/Users/Tedi Rahmat/Downloads/comments_5bKxkW_z408.md\n```\n\n## Project Structure\n```text\nrasalytics/\n├── src/\n│   ├── index.ts                 # Main scraper and analyzer CLI script\n│   ├── index.test.ts            # Test suite for sentiment and scraping logic\n│   ├── eval.test.ts             # Evaluation tests for sentiment accuracy\n│   ├── lexicons.ts              # Indonesian slang, toxic, and positive/negative lexicons\n│   ├── worker.ts                # Cloudflare Worker backend API\n│   └── shared-sentiment.ts      # Edge-safe sentiment logic for the backend\n├── scripts/\n│   ├── deploy-website.sh        # Deployment script for Cloudflare Worker and Pages\n│   └── setup-skills.ts          # Setup script for AI agents skills\n├── public/                      # Static frontend assets for Cloudflare Pages\n├── docs/                        # API, architecture, and Claude documentation\n├── audit-reports/               # Production-readiness audit reports and findings\n├── package.json                 # Dependencies and scripts\n├── tsconfig.json                # TypeScript configuration\n├── bun.lock                     # Bun lockfile\n├── .env                         # Environment variables (API Key)\n├── local_models/                # Cached transformer models\n├── analyze_offline.ts           # Offline comment analysis tool\n├── evaluate_baseline.ts         # Sentiment baseline evaluator\n└── fix_benchmark.ts             # Benchmark data fixer\n```\n\n## Contributing\nContributions are welcome! Please open an issue or submit a Pull Request if you'd like to improve the sentiment accuracy, add support for more languages, or optimize the scraping process.\n\n## API Reference / Internal Methods\nWhile primarily a CLI tool, the core logic is structured to be modular. Key components inside `src/index.ts` such as sentiment analysis pipelines and markdown report generators can potentially be exported. \n\n### `preprocess(text: string)`\nCleans and normalizes the input text by stripping URLs, converting emojis to text labels, and handling repeating characters.\n- **Returns**: `{ normalized: string, urls: string[] }`\n\n### `analyzeComment(text: string)`\nPerforms hybrid sentiment analysis (Transformers, Lexicon, and Ollama verification) as well as spam and toxicity checks.\n- **Returns**: `Promise\u003c{ score: number, confidence: number, label: string, isSpam: boolean, isToxic: boolean, reasoning: string }\u003e`\n\n### `fetchWithRetry(url: string, retries?: number, backoff?: number)`\nA robust internal network fetch handler that automatically retries API requests on network errors or 500-level HTTP responses with exponential backoff.\n- **Returns**: `Promise\u003cany\u003e`\n\n### `processComment(id: string, snippet: any)`\nProcesses a raw YouTube comment snippet, invokes the preprocessing and analyzer pipelines, and formats the result into a clean `CommentData` object.\n- **Returns**: `Promise\u003cCommentData\u003e`\n\n## Limitations \u0026 Compliance\n- **YouTube Terms of Service**: Users must comply with the [YouTube API Terms of Service](https://developers.google.com/youtube/terms/api-services-terms-of-service) when using this tool.\n- **Quota Limits**: The YouTube Data API v3 has strict quota limits (default 10,000 units per day). Fetching comments consumes quota (e.g., 1 unit per page of comments). Be mindful of your usage to avoid exhaustion.\n- **Privacy Risks**: Storing and redistributing scraped YouTube user comments presents privacy and copyright risks. Do not distribute or publish raw user data sets without verifying compliance obligations and redistribution rights under the YouTube ToS.\n\n## Acknowledgements\n- [Bun](https://bun.sh) for the incredibly fast TS runtime.\n- [Ollama](https://ollama.com/) \u0026 [Qwen2.5](https://qwenlm.github.io/) for advanced NLP sentiment verification.\n- [HuggingFace Transformers](https://huggingface.co/docs/transformers/index) via `@xenova/transformers` for local ML inference.\n- YouTube Data API v3 for the data infrastructure.\n\n## License\nThis project is licensed under the MIT License - see the [LICENSE](LICENSE) file for details.\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fbelajarcarabelajar%2Frasalytics","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fbelajarcarabelajar%2Frasalytics","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fbelajarcarabelajar%2Frasalytics/lists"}