{"id":35318882,"url":"https://github.com/watrall/charm-market-intelligence-engine","last_synced_at":"2026-04-07T07:43:47.622Z","repository":{"id":313398230,"uuid":"1051277604","full_name":"watrall/charm-market-intelligence-engine","owner":"watrall","description":"Cultural resource and heritage management automated market intelligence pipeline: scrapers, NLP entity/skills extraction, Folium maps, Streamlit dashboard, and LLM briefs.","archived":false,"fork":false,"pushed_at":"2026-02-08T18:41:20.000Z","size":744,"stargazers_count":0,"open_issues_count":0,"forks_count":0,"subscribers_count":0,"default_branch":"main","last_synced_at":"2026-04-07T07:43:45.062Z","etag":null,"topics":["archaeology","beautifulsoup","cultural-resource-management","folium","geospatial","google-sheets","heritage-management","job-market-analysis","labor-market-intelligence","n8n","ollama","openai","pandas","plotly","skills-taxonomy","spacy","sqlite","streamlit","web-scraping"],"latest_commit_sha":null,"homepage":"","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"other","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/watrall.png","metadata":{"files":{"readme":"README.md","changelog":"CHANGELOG.md","contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":"docs/SECURITY_CHANGES.md","support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null,"zenodo":null,"notice":null,"maintainers":null,"copyright":null,"agents":null,"dco":null,"cla":null}},"created_at":"2025-09-05T18:08:42.000Z","updated_at":"2026-02-08T18:41:24.000Z","dependencies_parsed_at":"2025-09-05T20:36:00.832Z","dependency_job_id":"2c398529-6c62-452f-ba70-10739437139e","html_url":"https://github.com/watrall/charm-market-intelligence-engine","commit_stats":null,"previous_names":["watrall
/charm-market-intelligence-engine"],"tags_count":0,"template":false,"template_full_name":null,"purl":"pkg:github/watrall/charm-market-intelligence-engine","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/watrall%2Fcharm-market-intelligence-engine","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/watrall%2Fcharm-market-intelligence-engine/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/watrall%2Fcharm-market-intelligence-engine/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/watrall%2Fcharm-market-intelligence-engine/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/watrall","download_url":"https://codeload.github.com/watrall/charm-market-intelligence-engine/tar.gz/refs/heads/main","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/watrall%2Fcharm-market-intelligence-engine/sbom","scorecard":null,"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":286080680,"owners_count":31504897,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2026-04-07T03:10:19.677Z","status":"ssl_error","status_checked_at":"2026-04-07T03:10:13.982Z","response_time":105,"last_error":"SSL_connect returned=1 errno=0 peeraddr=140.82.121.5:443 state=error: unexpected eof while 
reading","robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":false,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["archaeology","beautifulsoup","cultural-resource-management","folium","geospatial","google-sheets","heritage-management","job-market-analysis","labor-market-intelligence","n8n","ollama","openai","pandas","plotly","skills-taxonomy","spacy","sqlite","streamlit","web-scraping"],"created_at":"2025-12-30T20:04:12.221Z","updated_at":"2026-04-07T07:43:47.616Z","avatar_url":"https://github.com/watrall.png","language":"Python","funding_links":[],"categories":[],"sub_categories":[],"readme":"# CHARM - Market Intelligence Engine\n\n[![Docker Pulls](https://img.shields.io/docker/pulls/watrall/charm-market-intelligence-engine?logo=docker\u0026label=Docker%20Pulls)](https://hub.docker.com/r/watrall/charm-market-intelligence-engine)\n[![Explore in DeepWiki](https://img.shields.io/badge/Explore%20in-DeepWiki-blue?logo=data:image/svg+xml;base64,PHN2ZyB4bWxucz0iaHR0cDovL3d3dy53My5vcmcvMjAwMC9zdmciIHdpZHRoPSIyNCIgaGVpZ2h0PSIyNCIgdmlld0JveD0iMCAwIDI0IDI0IiBmaWxsPSJub25lIiBzdHJva2U9IndoaXRlIiBzdHJva2Utd2lkdGg9IjIiIHN0cm9rZS1saW5lY2FwPSJyb3VuZCIgc3Ryb2tlLWxpbmVqb2luPSJyb3VuZCI+PHBhdGggZD0iTTEyIDJhMTAgMTAgMCAxIDAgMTAgMTBIMTIiLz48cGF0aCBkPSJNMTIgMmExMCAxMCAwIDEgMC0xMCAxMGgxMCIvPjxwYXRoIGQ9Ik0xMiAydjIwIi8+PC9zdmc+)](https://deepwiki.com/watrall/charm-market-intelligence-engine)\n\nThis is a clean, runnable reference implementation for automated market analysis in the cultural resource \u0026 heritage management space (designed originally to be hosted and run on a local Synology NAS).\n\nCHARM = Cultural Heritage 
\u0026 Archaeological Resource Management\n\nThe point of CHARM is to guide the investment of resources and the development of undergraduate, graduate, and non-degree/professional programs and curricula, including courses, micro-degrees, and professional certificates. While it was built with cultural heritage and archaeology in mind, the pipeline is intentionally modular and can be adapted to other disciplines with minimal changes.\n\n**Outcomes:** scrape job postings (American Anthropological Association \u0026 American Cultural Resources Association) → clean/dedupe → parse uploaded PDFs (industry reports) → spaCy Natural Language Processing (entity + skill extraction) → sentiment → geocode → analysis → insights → SQLite/CSVs → optional Google Sheets → Streamlit dashboard (Folium + Plotly) → downloadable PDF report.\n\n## Demo Mode\n\n[![Open in Streamlit](https://static.streamlit.io/badges/streamlit_badge_black_white.svg)](https://charm-market-intelligence-engine.streamlit.app/)\n\n**Demo mode (no external services):** set `DEMO_MODE=1` when launching the dashboard to use the bundled synthetic snapshot in `demo/processed/`. Streamlit Cloud should set this env var so it never scrapes or calls paid APIs.\n\n## Quick Start\n\nIf you want the easiest local setup, start the Streamlit app and use the built-in wizard. It can ingest PDFs, run the pipeline, and show results in one place.\n\n```bash\n# 1. Clone and enter the repository\ngit clone https://github.com/YOUR_USERNAME/charm-market-intelligence-engine.git\ncd charm-market-intelligence-engine\n\n# 2. Set up environment\nmake setup\n\n# 3. Create a local .env file\ncp .env.example .env\n\n# 4. Allow runs from the Streamlit wizard\n# This should stay false in shared or hosted environments\n# Open .env and set ALLOW_PIPELINE_RUN=true\n\n# 5. Launch the app\nmake run-dashboard\n```\n\nIf you are comfortable with the command line and prefer to run the pipeline directly, this is the simplest path:\n\n```bash\n# 1. 
Clone and enter the repository\ngit clone https://github.com/YOUR_USERNAME/charm-market-intelligence-engine.git\ncd charm-market-intelligence-engine\n\n# 2. Set up environment (creates venv, installs deps, downloads NLP models)\nmake setup\n\n# 3. Run the pipeline\nmake run-pipeline\n\n# 4. Launch the dashboard\nmake run-dashboard\n```\n\n**Alternative (manual setup):**\n```bash\npython -m venv .venv \u0026\u0026 source .venv/bin/activate\npip install -r requirements.txt\ncp .env.example .env\n\n# Download NLP models\npython -m spacy download en_core_web_sm\npython -c \"import nltk; nltk.download('vader_lexicon')\"\n\n# Run pipeline and dashboard\npython scripts/pipeline.py\nstreamlit run dashboard/app.py\n```\n\nIf you are comfortable with Docker, you can run CHARM with containers:\n\n```bash\ncp .env.example .env\ndocker compose up --build dashboard\n```\n\nTo run the pipeline in Docker:\n\n```bash\ndocker compose run --rm pipeline\n```\n\n\u003e **Cost-safe dry run:** `USE_LLM` and `USE_SHEETS` default to `false` in `.env.example` so you can run the full pipeline locally without triggering OpenAI tokens or Google Sheets API calls. 
Flip them to `true` only after you are ready to authenticate those paid services.\n\n---\n\n## Validate Your Run\n\nAfter running the pipeline, check that these files were created:\n\n| File | What it contains | Success indicator |\n|------|------------------|-------------------|\n| `data/processed/jobs.csv` | Scraped and enriched job postings | File exists, has rows with `title`, `company`, `skills` columns |\n| `data/processed/analysis.json` | Summary statistics | Contains `num_jobs`, `top_skills`, `schema_version` |\n| `data/processed/insights.md` | Human-readable market brief | Contains \"## In-demand Skills\" section |\n| `data/charm.db` | SQLite database | File exists (if `USE_SQLITE=true`) |\n| `data/reports/CHARM_Report_*.pdf` | PDF report | File exists, starts with `%PDF`, \u003e 10 KB |\n\n**Quick validation command:**\n```bash\n# Check that key outputs exist and have content\nls -la data/processed/\nhead -5 data/processed/jobs.csv\ncat data/processed/analysis.json | head -20\n```\n\n**Dashboard validation:**\n1. Run `make run-dashboard`\n2. Open http://localhost:8501 in your browser\n3. You should see key findings cards, a map, and skill charts\n4. A \"Download report\" button in the header generates a PDF on demand\n5. If \"No data yet\" appears, the pipeline hasn't run successfully\n\n---\n\n## Data Directories\n\nCHARM organizes data into specific directories. 
Understanding this structure helps with debugging and customization.\n\n```\ncharm-market-intelligence-engine/\n├── config/                    # Configuration files\n│   ├── .env.example          # Environment variable template\n│   ├── insight_prompt.md     # LLM prompt template (editable)\n│   └── job_patterns.json     # Job type/seniority regex patterns\n├── data/                      # All generated data (gitignored)\n│   ├── cache/                # Cached API responses\n│   │   ├── job_descriptions.json   # Cached job detail pages\n│   │   ├── reports_cache.json      # Cached PDF extractions\n│   │   └── gsheets_jobs_urls.txt   # Synced job URLs\n│   ├── processed/            # Pipeline outputs (dashboard reads these)\n│   │   ├── jobs.csv          # Enriched job postings\n│   │   ├── reports.csv       # Parsed PDF reports\n│   │   ├── analysis.json     # Summary statistics\n│   │   ├── insights.md       # Generated brief\n│   │   └── wordcloud.png     # Visualization\n│   ├── geocache.csv          # Location → lat/lon cache\n│   └── charm.db              # SQLite database\n├── reports/                   # Drop PDFs here for parsing\n├── reports/ (Python)          # PDF report generation package\n│   ├── context.py            # Build report context from pipeline artifacts\n│   ├── pdf_report.py         # Assemble PDF with ReportLab\n│   └── styles.py             # Page layout, fonts, table styles\n├── skills/                    # Taxonomy definitions\n│   └── skills_taxonomy.csv   # Skill aliases → normalized names\n└── secrets/                   # Credentials (gitignored)\n    └── service_account.json  # Google API credentials\n```\n\n**What gets cached (and why):**\n- **Job descriptions**: Avoids re-fetching the same posting detail pages\n- **PDF text**: Avoids re-parsing unchanged PDFs\n- **Geocoding**: Nominatim rate limits require caching location lookups\n\n**To clear caches and force fresh data:**\n```bash\nrm -rf data/cache/\nmake 
run-pipeline\n```\n\n---\n\n## Environment Variables\n\nCopy `.env.example` to `.env` and configure as needed. The file includes detailed comments explaining each variable.\n\n### Core Flags (safe defaults)\n\n- `GOOGLE_SERVICE_ACCOUNT_FILE=ENTER_PATH_TO_SERVICE_ACCOUNT_JSON_HERE`\n  - Path to your service account key file (like `secrets/service_account.json`).\n- `GOOGLE_SHEET_ID=ENTER_GOOGLE_SHEET_ID_HERE`\n  - The ID from your Sheet URL.\n- `OPENAI_API_KEY=ENTER_OPENAI_API_KEY_HERE`\n  - Only needed if `USE_LLM=true` and `LLM_PROVIDER=openai`.\n- `GEOCODE_CONTACT_EMAIL=ENTER_CONTACT_EMAIL_HERE`\n  - Required by Nominatim usage guidelines so your geocoding requests have a contact.\n\n### Quick reference\n| Variable | Default | Purpose |\n| --- | --- | --- |\n| `USE_SQLITE` | `true` | Persist processed jobs/reports into `data/charm.db`. |\n| `USE_SHEETS` | `false` | Append jobs/reports to Google Sheets when credentials are configured. |\n| `USE_LLM` | `false` | Enable the optional LLM brief via `config/insight_prompt.md`. |\n| `ALLOW_PIPELINE_RUN` | `false` | Allow the Streamlit wizard to run the pipeline on this machine. |\n| `PIPELINE_SCRAPE` | `true` | Run the job board scraper step. |\n| `PIPELINE_REPORTS` | `true` | Parse PDFs in the `reports` folder. |\n| `PIPELINE_NLP` | `true` | Run spaCy and skills extraction. |\n| `PIPELINE_SENTIMENT` | `true` | Add sentiment scores. |\n| `PIPELINE_GEOCODE` | `true` | Geocode locations using Nominatim. |\n| `USER_AGENT` | `CHARM/1.0 (research)` | HTTP header for scrapers; include contact info. |\n| `GEOCODE_CONTACT_EMAIL` | _(empty)_ | Injected into the Nominatim UA per policy. |\n| `LLM_PROVIDER`, `LLM_MODEL` | `openai`, `gpt-4o-mini` | Choose an LLM backend/model when `USE_LLM=true`. |\n| `LLM_BASE_URL` | _(empty)_ | Base URL for OpenAI compatible providers when `LLM_PROVIDER=openai_compat`. |\n| `HF_TOKEN`, `HF_MODEL` | _(empty)_ | Use Hugging Face hosted inference when `LLM_PROVIDER=hf_inference`. 
|\n| `GOOGLE_SERVICE_ACCOUNT_FILE`, `GOOGLE_SHEET_ID` | _(empty)_ | Required for Sheets sync/tests. |\n| `SCRAPER_MAX_WORKERS` | `4` | Number of concurrent detail-page fetches; lower for stricter rate limits. |\n| `SCRAPER_REQUEST_INTERVAL` | `0.8` | Minimum seconds between outbound requests (global). Increase to slow the scraper. |\n\nAfter editing, verify:\n```bash\npython scripts/gsheets_test.py   # check Google Sheets access\npython scripts/pipeline.py       # run the end-to-end pipeline\n```\n\n## Architecture\n- **n8n orchestration**: n8n is an open-source workflow automation tool; we use it for Cron/Webhook triggers that run the pipeline via one Execute Command.\n- **Python pipeline**: scraping → cleaning/dedupe → report parsing → Natural Language Processing (NLP) → sentiment → geocoding → analysis → insights → persistence.\n- **Storage**: CSVs for the dashboard + **SQLite** for durable querying; **Google Sheets** for sharing raw rows.\n- **Dashboard**: Streamlit with **Plotly** charts and a **Folium** map (heatmap + clustered markers).\n- **Large Language Model (LLM, optional)**: brief insights when enabled in `.env`.\n\n## n8n Scheduling (Synology)\nImport `n8n/charm_workflow.json` and point Execute Command to:\n```bash\nbash -lc \"cd /data/charm-market-intelligence-engine \u0026\u0026 source .venv/bin/activate \u0026\u0026 python scripts/pipeline.py\"\n```\n\n## Adapting to Other Industries\n- Add parsers in `scripts/scrape_jobs.py` for new job boards.\n- Update `skills/skills_taxonomy.csv` with additional skills/aliases.\n- Expand rules in `scripts/insights.py` to map skills → program formats.\n- Customize `config/job_patterns.json` to tweak job-type/seniority detection; run `python scripts/validate_patterns.py` (or `make validate-patterns`) after edits to ensure regexes compile.\n\n## Google Sheets Integration (opt-in)\n1. Enable **Google Sheets API** and **Google Drive API** in Google Cloud Platform (GCP).\n2. 
Create a **Service Account**, download the JSON key to `secrets/service_account.json` (or your path).\n3. Set `GOOGLE_SHEET_ID` and `GOOGLE_SERVICE_ACCOUNT_FILE` in `.env` (replace the placeholders).\n4. Share the Sheet with the service account email as **Editor**.\n5. Test:\n```bash\npython scripts/gsheets_test.py\n```\n\n## Troubleshooting\n\n### Common issues and fixes\n\n| Symptom | Likely Cause | Fix |\n|---------|--------------|-----|\n| `ModuleNotFoundError: No module named 'spacy'` | Virtual environment not activated | Run `source .venv/bin/activate` (macOS/Linux) or `.venv\\Scripts\\activate` (Windows) |\n| `OSError: [E050] Can't find model 'en_core_web_sm'` | spaCy language model not downloaded | Run `python -m spacy download en_core_web_sm` |\n| No jobs scraped (empty CSV) | CSS selectors out of date OR site blocking requests | Check `scripts/scrape_jobs.py` selectors; increase `SCRAPER_REQUEST_INTERVAL` to add delay between requests |\n| `FileNotFoundError: config/.env` | Environment file missing | Copy `.env.example` to `config/.env` and fill in values |\n| Geocoding extremely slow | Nominatim rate limiting | Normal behavior; geocache (`data/geocache.csv`) speeds up repeat runs |\n| `google.auth.exceptions.DefaultCredentialsError` | Service account JSON missing or path wrong | Verify `GOOGLE_SERVICE_ACCOUNT_FILE` path in `.env` |\n| Sheets append fails silently | Sheet ID incorrect or missing permissions | Double-check `GOOGLE_SHEET_ID`; share sheet with service account email |\n| Dashboard won't start | Port 8501 already in use | Kill other Streamlit processes or use `streamlit run dashboard/app.py --server.port 8502` |\n| `OPENAI_API_KEY` error when `USE_LLM=true` | Key not set or invalid | Add valid key to `.env` or set `USE_LLM=false` |\n\n### Permission issues (NAS / network drives)\nIf you're running on a NAS or shared drive:\n- Ensure write access to `data/` directory\n- SQLite may perform poorly over network mounts; consider running locally\n\n### Checking logs\nMost scripts print 
progress to stdout. For more detail:\n```bash\npython scripts/pipeline.py 2\u003e\u00261 | tee pipeline.log\n```\n\n## Insight prompt (external \u0026 editable)\nThe LLM question set lives in `config/insight_prompt.md`. It’s plain text with **{{variables}}** you can edit:\n- `{{INDUSTRY}}`, `{{DATE_TODAY}}`, `{{NUM_JOBS}}`, `{{UNIQUE_EMPLOYERS}}`, `{{GEOCODED}}`\n- `{{TOP_SKILLS_BULLETS}}` → a bullet list of top skills and counts\n\nTo see the fully rendered prompt (before sending to an LLM):\n```bash\npython scripts/preview_prompt.py\n```\n\nIf the file is missing, the workflow falls back to a concise built-in prompt.\n\n## Dashboard design choices\nDashboard design notes:\n- **Single column rhythm** with clear section spacing; primary actions (filters, downloads) are easy to find.\n- **Subtle cards** for key findings; no heavy boxes or loud colors.\n- **Plotly (plotly_white)** with reduced chart chrome; labels kept concise.\n- **Folium map** with heatmap + clustered markers for fast spatial scanning.\n- Hidden default Streamlit menu/footer to keep focus on data.\n- Sidebar filters drive all sections, so the page stays uncluttered.\n- **PDF export** via a header download button; the report is generated with ReportLab and cached by a content fingerprint so repeated clicks are instant.\n\n## PDF report export\n\nThe dashboard includes a \"Download report\" button that generates a multi-section PDF from the latest pipeline run. The report is built with ReportLab and uses the Inter font family (falls back to Helvetica if the TTFs are missing).\n\n**Sections included:**\n1. Cover page (title, date range, fingerprint)\n2. Executive Summary (5 data-driven bullets)\n3. Key Findings (up to 9 metrics in a 3-column grid)\n4. Trends \u0026 Signals (top 12 skills + emerging skills tables)\n5. Implications \u0026 Opportunities (actionable cards with \"Why it matters\")\n6. Methods \u0026 Governance (data sources, approach, limitations)\n7. 
Appendix (definitions and employer sources)\n\n**How it works:**\n- `reports/context.py` reads `jobs.csv`, `analysis.json`, and `insights.md` from the processed directory and builds a normalized context dict.\n- `reports/pdf_report.py` turns that context into ReportLab flowables and assembles the PDF.\n- `reports/styles.py` defines the page layout, paragraph styles, and table styles.\n- `dashboard/header.py` wires the download button into the Streamlit header; the PDF bytes are cached by a SHA-256 fingerprint of the source files, so the report is only regenerated when data changes.\n\n**Generating a report from the command line:**\n```bash\npython scripts/generate_report.py --proc-dir data/processed --out-dir data/reports\n```\n\nThis writes a `CHARM_Report_\u003cfingerprint\u003e.pdf` to the output directory and runs basic validity checks (PDF header present, file size \u003e 10 KB).\n\n\n## LLM options (opt-in)\nBy default `USE_LLM=false` in `config/.env.example`, so the rules-based brief runs without triggering model calls. Flip it to `true` only when you are ready to authenticate a provider. 
When enabled, the pipeline renders the external prompt (`config/insight_prompt.md`) with your current data.\n\nChoose a provider in `.env`:\n- `LLM_PROVIDER=openai` → set `OPENAI_API_KEY=ENTER_OPENAI_API_KEY_HERE`\n- `LLM_PROVIDER=ollama` → set `OLLAMA_BASE_URL` (default `http://localhost:11434`) and `LLM_MODEL` (e.g., `llama3:instruct`)\n\nIf no key is present or the call fails, the pipeline still produces **rules-based insights**.\n\n\n### Google Sheets - reports worksheet\nThe pipeline can also append a **reports** tab with parsed report metadata:\n- Worksheet name (default): `reports` (configure via `GOOGLE_SHEET_WORKSHEET_REPORTS`)\n- Columns: report_name, word_count, skills (comma-separated)\n\n\n## Makefile quick commands\nUse the included `Makefile` to run common tasks with short commands:\n\n```bash\nmake setup        # venv + requirements + models\nmake run          # scrape → process → analyze → insights → SQLite/CSVs → Sheets\nmake dash         # launch the Streamlit dashboard\nmake sheets-test  # verify Google Sheets setup\nmake prompt       # preview rendered LLM prompt\nmake reset-db     # delete data/charm.db (keeps CSVs)\nmake clean        # clear caches\n```\n\nOn macOS/Linux it works out of the box. On Windows, use **Git Bash** or **WSL**.\n\n## How the workflow runs (step-by-step)\n1. **Scrape job boards** (AAA + ACRA) with pagination → `scripts/scrape_jobs.py`\n   - Collects: `source, title, company, location, date_posted, job_url, description`\n   - Walks “Next” pages safely (limit=10) and de-dupes by `job_url`.\n2. **Clean \u0026 de-duplicate** → `scripts/data_cleaning.py`\n   - Normalizes text, hashes `(title|company|desc-snippet)` to drop dupes.\n   - Extracts **salary** hints (`salary_min`, `salary_max`, `currency`) when present.\n3. **Parse industry reports (PDFs)** → `scripts/parse_reports.py`\n   - Reads PDFs from `/reports/` with PyMuPDF; outputs one row per report.\n4. 
**NLP enrichment (jobs + reports)** → `scripts/nlp_entities.py`\n   - spaCy NER (ORG/GPE/LOC) and **skills taxonomy** matching (`skills/skills_taxonomy.csv`).\n5. **Sentiment** (optional) → `scripts/sentiment_salience.py`\n6. **Geocode locations** with Nominatim + on-disk cache → `scripts/geocode.py`\n7. **Persist** results\n   - CSVs to `data/processed/` (for the dashboard)\n   - **SQLite** to `data/charm.db` (for durable querying and auditing)\n8. **Share** (optional): append **jobs** + **reports** to Google Sheets\n9. **Analyze** → `scripts/analyze.py` (top skills, counts; optional clustering)\n10. **Generate insights** → `scripts/insights.py`\n    - Rules-based recommendations (always)\n    - **LLM brief** using the external prompt (`config/insight_prompt.md`)\n11. **Visualize** → `dashboard/app.py` (Streamlit + Plotly + Folium)\n\n## Working with industry reports (PDFs)\n- Drop **.pdf** files into the `reports/` folder.\n- On the next run, the pipeline will extract text with PyMuPDF, enrich with NER + skills, and:\n  - write `data/processed/reports.csv`\n  - upsert into `data/charm.db` (`reports` table)\n  - append a concise row to Google Sheets (worksheet: `reports`) with `report_name`, `word_count`, and aggregated `skills`.\n- Parsed text is cached in `data/cache/reports_cache.json`, so unchanged PDFs aren't re-read on every execution.\n- Reports are combined with job data in analysis and in the LLM prompt context to surface **trends and gaps**.\n\n## Program mapping \u0026 outcomes\nThe insights module translates demand signals into **program formats**:\n- Undergraduate (online), Graduate (online)\n- Certificate, Post-baccalaureate\n- Workshop, Microlearning\n\nHow it works:\n- `scripts/insights.py` contains simple, transparent **rules** that map top skills to program formats.\n- If `USE_LLM=true`, the external prompt (`config/insight_prompt.md`) requests:\n  - **5 trend statements**, **3 emerging skills**, **3 program gaps/opportunities**, explicitly 
referencing those formats.\n- To tailor outputs for different catalogs or brands, adjust:\n  - `skills/skills_taxonomy.csv` (aliases \u0026 categories)\n  - mapping rules in `scripts/insights.py`\n  - the prompt language in `config/insight_prompt.md`\n\n## Scraping notes \u0026 governance\n- **Pagination:** the scrapers follow “Next” links (rel/aria/title/text) with a safe page limit.\n- **Politeness:** configurable rate limiting + polite User-Agent (see `SCRAPER_MAX_WORKERS` / `SCRAPER_REQUEST_INTERVAL` in `.env`). For example, with the defaults (4 workers, 0.8 s global minimum interval between requests), throughput is capped at roughly 1.25 requests/sec; reduce the worker count or increase the interval if the target site throttles.\n- **Job-description caching:** fetched detail pages are stored in `data/cache/job_descriptions.json` so reruns avoid hammering the same postings. Adjust worker count/interval in `.env` to tune throughput.\n- **Dedupe:** by `job_url` and content hash to avoid churn and inflated counts.\n- **Respect sites:** check each site's **robots.txt** and Terms of Service before scraping; scale cautiously and cache aggressively.\n- **No PII:** the pipeline collects job-level, non-personal data only; avoid ingesting personally identifiable information (PII).\n\n\n## Mattermost notifications\nThe pipeline can post a completion message to a Mattermost channel using an **incoming webhook**.\n\n**Setup:**\n1. Create an **Incoming Webhook** in your Mattermost workspace (bound to a channel).\n2. In `.env`, set:\n\n   - `MATTERMOST_WEBHOOK_URL=ENTER_MATTERMOST_WEBHOOK_URL_HERE`\n\n   - `DASHBOARD_URL=ENTER_DASHBOARD_URL_HERE`  (public or internal URL for your Streamlit app)\n\n3. 
Run `make run` (or `python scripts/pipeline.py`).\n\n**What it sends:**\n- A check-marked completion line\n- A short summary (total postings, employers, geocoded count, top 5 skills)\n- A link to the dashboard\n- An optional short snippet from `insights.md` (“LLM Brief” if present)\n\n**Where to customize:** open `n8n/charm_workflow_mattermost.json`, find the `Notification Config` node, and edit the embedded JavaScript to change the summary payload.\n\n\n\n### Mattermost in n8n (post-run notification)\nImport `n8n/charm_workflow_mattermost.json` for a version of the workflow that **notifies Mattermost after each run**.\n\n**Configure in the “Notification Config” node:**\n- `webhookUrl` → your Mattermost **Incoming Webhook URL**\n- `dashboardUrl` → link to your Streamlit app (public or internal)\n- `mention` → optional (`@channel`, `@here`, or empty)\n- `thresholdSkillsCsv` → comma-separated skills to track (e.g., `ArcGIS,Section 106,NEPA`)\n- `thresholdPercent` → percentage change to trigger an alert (default `20`)\n\n**What the message includes:**\n- ✅ Completion line\n- Totals (postings, employers, geocoded)\n- Top 5 skills\n- Dashboard link\n- **Alerts** section if thresholds are hit (↑/↓ with % change vs previous run)\n- A short **Brief** snippet extracted from `insights.md`\n\n**How thresholds work:**\n- The workflow reads the previous snapshot from `data/processed/analysis_prev.json` (if present)\n- It writes the current `analysis.json` to that file after posting the message, so the next run compares properly\n- Zero results trigger a “No jobs scraped” alert automatically\n\n\n## Large Language Model (LLM) options (self-hosted vs. cloud)\nThis pipeline supports two classes of LLM backends:\n\n**Cloud (commercial): OpenAI**  \n- Set `LLM_PROVIDER=openai` and `OPENAI_API_KEY=ENTER_OPENAI_API_KEY_HERE`  \n- Pros: highest quality, simple setup, scalable.  
\n- Cons: usage cost; data governance requires key management.\n\n**Self-hosted:**\n1) **Ollama** (simple local runner on CPU/GPU; great for demos)\n   - Set `LLM_PROVIDER=ollama`, `OLLAMA_BASE_URL=http://localhost:11434`, `LLM_MODEL=llama3:instruct` (or similar)\n   - Pros: easiest local setup; good developer ergonomics.\n   - Cons: slower on CPU-only; fewer enterprise durability knobs.\n\n2) **OpenAI-compatible server** (e.g., vLLM on GPU)\n   - Set `LLM_PROVIDER=openai_compat`, `LLM_BASE_URL=http://YOUR-HOST:PORT/v1`, `LLM_MODEL=YourModelName`\n   - Pros: production-friendly throughput and token cost control; keeps data in your infra.\n   - Cons: requires GPU provisioning \u0026 ops (e.g., vLLM/TGI deployment).\n\n**Recommendation:** For a Synology/NAS demo, Ollama is the fastest path to a working self-host. For higher throughput or larger prompts, deploy **vLLM** with an OpenAI-compatible endpoint and switch to `LLM_PROVIDER=openai_compat`.\n\n### Why offer both (OpenAI + Ollama)\nSupporting both cloud and local LLMs is practical:\n\n- **Cost control**: Cloud models are metered. Ollama lets you run unlimited local inferences at no extra cost.\n- **Data governance**: Some orgs want text to stay on-prem. A local model keeps everything in your infrastructure.\n- **Portability**: Anyone can clone this repo and get working insights with Ollama -- no API key required.\n- **Failover**: If your cloud quota runs out or there's an outage, the local model keeps the pipeline functional.\n\n## Components \u0026 responsibilities (what each piece does)\n\nThis repository is organized so a reviewer can read it top-down and understand exactly how the system works. Every piece below has a clear, single responsibility.\n\n### Orchestration (n8n)\n- `n8n/charm_workflow.json` - Minimal scheduler/trigger that runs the Python pipeline via **Execute Command** from a Cron or Webhook.\n- `n8n/charm_workflow_mattermost.json` - Same as above, with post-run **Mattermost notifications**. 
Reads `analysis.json` and `insights.md`, composes a short message (totals, top skills, optional alerts, brief), posts to your incoming webhook, and snapshots the current analysis for next-run comparisons.\n\n### Configuration\n- `config/.env.example` - Environment variables with explicit placeholders (e.g., `ENTER_GOOGLE_SHEET_ID_HERE`). Copy to `.env` and fill in. Includes LLM provider switches, user agent, and dashboard URL.\n- `config/insight_prompt.md` - The human-editable prompt template used when `USE_LLM=true`. It’s rendered with live variables (date, counts, top skills) before calling the model.\n\n### Data definitions / taxonomy\n- `skills/skills_taxonomy.csv` - Deterministic mapping of common terms/aliases to normalized skill names and (optionally) categories. This keeps “GIS” vs “ArcGIS” vs “ArcGIS Pro” consistent in analysis.\n\n### Pipeline (Python)\n- `scripts/pipeline.py` - The orchestrator. Runs end-to-end: scrape → clean/dedupe → parse reports → NLP/skills → sentiment → geocode → analyze → insights → persist (CSV/SQLite) → optional Google Sheets append.\n- `scripts/scrape_jobs.py` - Scrapers for ACRA + AAA with **pagination** and per-item description fetching. Uses a configurable `USER_AGENT` and polite defaults.\n- `scripts/data_cleaning.py` - Normalization and duplicate detection (content hashing across title/company/description snippet). Also extracts salary hints when present.\n- `scripts/parse_reports.py` - Reads **PDFs** from `/reports/` with PyMuPDF; emits one record per report with the raw text for downstream NLP.\n- `scripts/nlp_entities.py` - spaCy NER for organizations and locations + taxonomy-based **skill extraction**. 
Produces a comma-separated `skills` column.\n- `scripts/sentiment_salience.py` - Optional lightweight sentiment using VADER (useful for qualitative clustering or future labeling).\n- `scripts/geocode.py` - Geocodes the `location` field using Nominatim with on-disk caching; attaches `lat` and `lon` for mapping.\n- `scripts/analyze.py` - **Pandas-based** summaries (top skills, counts, employers, geocoded totals). Can be extended to clustering or time-series.\n- `scripts/insights.py` - Generates a short, human-readable brief. Always emits rule-based recommendations; when `USE_LLM=true`, renders `config/insight_prompt.md` and calls the selected provider (OpenAI or Ollama).\n- `scripts/gsheets_sync.py` - Appends **jobs** and **reports** to Google Sheets. Handles worksheet creation and de-dupe by URL or name.\n- `scripts/gsheets_test.py` - One-liner connectivity check for Sheets credentials and permissions.\n- `scripts/preview_prompt.py` - Renders the final LLM prompt (with current data) so you can review or paste it elsewhere.\n- `scripts/pandas_examples.py` - Extra recipes for ad-hoc analysis; helpful for quick CSV exports during exploration.\n\n### Storage / outputs\n- `data/charm.db` - SQLite database created on first run (durable auditing and ad-hoc queries).\n- `data/processed/` - CSV and artifacts used by the dashboard: `jobs.csv`, `reports.csv`, `analysis.json`, `insights.md`, and `wordcloud.png`.\n- `docs/sql_examples.sql` - A few ready-to-use SQL queries against `charm.db` (e.g., salary by skill, recent Section 106/NEPA postings).\n- `docs/data_contract.md` - Field-level documentation for each exported file so downstream teams know how to consume them.\n\n### Dashboard (Streamlit + Folium + Plotly)\n- `dashboard/app.py` - Single-page, minimalist UI:\n  - Key findings cards (postings, employers, geocoded)\n  - Top skills bar chart (Plotly)\n  - Job map with heatmap + clustered markers (Folium)\n  - Insights panel and word cloud\n  - Sidebar filters and simple 
download actions (filtered CSV, analysis JSON)\n- `dashboard/header.py` - Header strip with the PDF download button; caches generated bytes by content fingerprint.\n- `.streamlit/config.toml` - Neutral, brand-agnostic theme.\n\n### Report generation (ReportLab)\n- `reports/context.py` - Builds and normalizes a report context dict from pipeline artifacts (`jobs.csv`, `analysis.json`, `insights.md`). Computes a SHA-256 fingerprint for cache invalidation.\n- `reports/pdf_report.py` - Assembles the multi-section PDF from the context dict using ReportLab flowables.\n- `reports/styles.py` - Page layout, paragraph styles, table styles, and font registration (Inter with Helvetica fallback).\n- `scripts/generate_report.py` - CLI script to generate and validate a PDF report without the dashboard.\n\n### Tooling\n- `Makefile` - Short commands for setup, running the pipeline, launching the dashboard, testing Sheets, previewing the prompt, and cleanup.\n- `requirements.txt` - Python dependencies (scraping, NLP, analysis, dashboard, LLM providers).\n- `LICENSE` - MIT license.\n- `CHANGELOG.md` - A concise record of what’s included in this release and why certain decisions were made.\n\n---\n\n## How the pieces interact (at a glance)\n\n1. **n8n** triggers the run (Cron or Webhook) → executes `scripts/pipeline.py` in the repo directory on your NAS.\n2. The **pipeline** scrapes jobs (with pagination), parses any PDFs in `/reports/`, enriches with NLP/skills, geocodes, analyzes, and writes outputs to both CSV and SQLite.\n3. **Google Sheets** (optional) is updated with appended rows for jobs and reports (so stakeholders can view raw structured data).\n4. The **dashboard** reads `data/processed/*` and refreshes automatically when files change.\n5. 
The **n8n Mattermost** workflow (optional) reads the latest outputs, composes a short message (totals, top skills, alerts), posts to your channel, and snapshots the analysis for the next run.\n\nEverything is idempotent: duplicates are filtered, pagination is capped, geocoding is cached, and runs can be scheduled safely.\n\n## Cost \u0026 usage planning\n- **LLM calls (optional):** Each pipeline run with `gpt-4o-mini` costs well under $1 (usually a few cents). The default prompt and response fit comfortably within 1200 tokens. Set `USE_LLM=false` for completely free runs. If you're running this on a schedule, estimate your monthly call volume and budget accordingly.\n- **Google Sheets sync (optional):** Setting `USE_SHEETS=true` turns on both Sheets and Drive APIs. They are metered after the free tier, and every run makes a few dozen append/read calls. Leave it `false` until you create a GCP project, confirm quotas, and budget for increased throughput (e.g., batch jobs nightly instead of per-scrape).\n- **Geocoding:** The built-in Nominatim client is free but rate-limited to 1 request/sec; heavy usage may require hosting your own instance. Because geocoding is cached in `data/geocache.csv`, reruns stay cost-free unless you clear the cache.\n- **Storage/dashboards:** Streamlit + SQLite incur no extra spend -- everything runs locally. When deploying to cloud infrastructure, include VM/storage costs in your overall estimate.\n- **Sheets cache resets:** The Google Sheets sync stores cached job/report IDs under `data/cache/`. If someone edits or deletes rows directly in the Sheet, clear those files before the next run so the pipeline can rebuild its local view of existing rows.\n\nDocument these toggles in your runbook so reviewers understand how to perform a zero-cost demo vs. a production run with LLM + Sheets enabled.\n- With the defaults (4 workers, 0.8 s interval) expect roughly 5 detail-page fetches per second. 
Increase the interval or reduce the worker count if a target board publishes stricter rate limits.\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fwatrall%2Fcharm-market-intelligence-engine","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fwatrall%2Fcharm-market-intelligence-engine","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fwatrall%2Fcharm-market-intelligence-engine/lists"}