https://github.com/watrall/charm-market-intelligence-engine

Cultural resource and heritage management automated market intelligence pipeline: scrapers, NLP entity/skills extraction, Folium maps, Streamlit dashboard, and LLM briefs.

# CHARM - Market Intelligence Engine

[![Docker Pulls](https://img.shields.io/docker/pulls/watrall/charm-market-intelligence-engine?logo=docker&label=Docker%20Pulls)](https://hub.docker.com/r/watrall/charm-market-intelligence-engine)
[![Explore in DeepWiki](https://img.shields.io/badge/Explore%20in-DeepWiki-blue?logo=data:image/svg+xml;base64,PHN2ZyB4bWxucz0iaHR0cDovL3d3dy53My5vcmcvMjAwMC9zdmciIHdpZHRoPSIyNCIgaGVpZ2h0PSIyNCIgdmlld0JveD0iMCAwIDI0IDI0IiBmaWxsPSJub25lIiBzdHJva2U9IndoaXRlIiBzdHJva2Utd2lkdGg9IjIiIHN0cm9rZS1saW5lY2FwPSJyb3VuZCIgc3Ryb2tlLWxpbmVqb2luPSJyb3VuZCI+PHBhdGggZD0iTTEyIDJhMTAgMTAgMCAxIDAgMTAgMTBIMTIiLz48cGF0aCBkPSJNMTIgMmExMCAxMCAwIDEgMC0xMCAxMGgxMCIvPjxwYXRoIGQ9Ik0xMiAydjIwIi8+PC9zdmc+)](https://deepwiki.com/watrall/charm-market-intelligence-engine)

This is a clean, runnable reference implementation for automated market analysis in the cultural resource & heritage management space (designed originally to be hosted and run on a local Synology NAS).

CHARM = Cultural Heritage & Archaeological Resource Management

The point of CHARM is to guide the investment of resources and the development of undergraduate, graduate, and non-degree/professional programs and curricula, including courses, micro-degrees, and professional certificates. While it was built with cultural heritage and archaeology in mind, the pipeline is intentionally modular and can be adapted to other disciplines with minimal changes.

**Outcomes:** scrape job postings (American Anthropological Association & American Cultural Resources Association) → clean/dedupe → parse uploaded PDFs (industry reports) → spaCy Natural Language Processing (entity + skill extraction) → sentiment → geocode → analysis → insights → SQLite/CSVs → optional Google Sheets → Streamlit dashboard (Folium + Plotly) → downloadable PDF report.

## Demo Mode

[![Open in Streamlit](https://static.streamlit.io/badges/streamlit_badge_black_white.svg)](https://charm-market-intelligence-engine.streamlit.app/)

**Demo mode (no external services):** set `DEMO_MODE=1` when launching the dashboard to use the bundled synthetic snapshot in `demo/processed/`. Streamlit Cloud should set this env var so it never scrapes or calls paid APIs.

## Quick Start

If you want the easiest local setup, start the Streamlit app and use the built-in wizard. It can ingest PDFs, run the pipeline, and show results in one place.

```bash
# 1. Clone and enter the repository
git clone https://github.com/YOUR_USERNAME/charm-market-intelligence-engine.git
cd charm-market-intelligence-engine

# 2. Set up environment
make setup

# 3. Create a local .env file
cp .env.example .env

# 4. Allow runs from the Streamlit wizard
# This should stay false in shared or hosted environments
# Open .env and set ALLOW_PIPELINE_RUN=true

# 5. Launch the app
make run-dashboard
```

If you are comfortable with the command line and prefer to run the pipeline directly, this is the simplest path:

```bash
# 1. Clone and enter the repository
git clone https://github.com/YOUR_USERNAME/charm-market-intelligence-engine.git
cd charm-market-intelligence-engine

# 2. Set up environment (creates venv, installs deps, downloads NLP models)
make setup

# 3. Run the pipeline
make run-pipeline

# 4. Launch the dashboard
make run-dashboard
```

**Alternative (manual setup):**
```bash
python -m venv .venv && source .venv/bin/activate
pip install -r requirements.txt
cp .env.example .env

# Download NLP models
python -m spacy download en_core_web_sm
python -c "import nltk; nltk.download('vader_lexicon')"

# Run pipeline and dashboard
python scripts/pipeline.py
streamlit run dashboard/app.py
```

If you are comfortable with Docker, you can run CHARM with containers:

```bash
cp .env.example .env
docker compose up --build dashboard
```

To run the pipeline in Docker:

```bash
docker compose run --rm pipeline
```

> **Cost-safe dry run:** `USE_LLM` and `USE_SHEETS` default to `false` in `.env.example` so you can run the full pipeline locally without consuming OpenAI tokens or making Google Sheets API calls. Flip them to `true` only after you are ready to authenticate those services.

---

## Validate Your Run

After running the pipeline, check that these files were created:

| File | What it contains | Success indicator |
|------|------------------|-------------------|
| `data/processed/jobs.csv` | Scraped and enriched job postings | File exists, has rows with `title`, `company`, `skills` columns |
| `data/processed/analysis.json` | Summary statistics | Contains `num_jobs`, `top_skills`, `schema_version` |
| `data/processed/insights.md` | Human-readable market brief | Contains "## In-demand Skills" section |
| `data/charm.db` | SQLite database | File exists (if `USE_SQLITE=true`) |
| `data/reports/CHARM_Report_*.pdf` | PDF report | File exists, starts with `%PDF`, > 10 KB |

**Quick validation command:**
```bash
# Check that key outputs exist and have content
ls -la data/processed/
head -5 data/processed/jobs.csv
cat data/processed/analysis.json | head -20
```
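
The same checks can be scripted. The sketch below is a minimal, hypothetical helper (`check_outputs` is not part of the repository) that applies the success indicators from the table above. It assumes the default `data/processed` and `data/reports` locations and skips the SQLite check.

```python
import csv
import json
from pathlib import Path


def check_outputs(proc_dir="data/processed", reports_dir="data/reports"):
    """Return a list of problems; an empty list means every check passed."""
    problems = []
    proc = Path(proc_dir)

    # jobs.csv: must exist, expose the documented columns, and hold at least one row.
    jobs = proc / "jobs.csv"
    if not jobs.is_file():
        problems.append("jobs.csv is missing")
    else:
        with jobs.open(newline="", encoding="utf-8") as fh:
            reader = csv.reader(fh)
            header = next(reader, [])
            has_rows = next(reader, None) is not None
        for col in ("title", "company", "skills"):
            if col not in header:
                problems.append(f"jobs.csv lacks column {col!r}")
        if not has_rows:
            problems.append("jobs.csv has no data rows")

    # analysis.json: must contain the documented summary keys.
    analysis = proc / "analysis.json"
    if not analysis.is_file():
        problems.append("analysis.json is missing")
    else:
        data = json.loads(analysis.read_text(encoding="utf-8"))
        for key in ("num_jobs", "top_skills", "schema_version"):
            if key not in data:
                problems.append(f"analysis.json lacks key {key!r}")

    # insights.md: must contain the in-demand skills section.
    insights = proc / "insights.md"
    if not insights.is_file():
        problems.append("insights.md is missing")
    elif "## In-demand Skills" not in insights.read_text(encoding="utf-8"):
        problems.append("insights.md lacks the '## In-demand Skills' section")

    # PDF reports: each must start with %PDF and be larger than 10 KB.
    for pdf in sorted(Path(reports_dir).glob("CHARM_Report_*.pdf")):
        raw = pdf.read_bytes()
        if not raw.startswith(b"%PDF") or len(raw) <= 10 * 1024:
            problems.append(f"{pdf.name} failed the PDF header/size check")

    return problems
```

Calling `check_outputs()` from the repository root returns an empty list when all documented outputs look healthy.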

**Dashboard validation:**
1. Run `make run-dashboard`
2. Open http://localhost:8501 in your browser
3. You should see key findings cards, a map, and skill charts
4. A "Download report" button in the header generates a PDF on demand
5. If "No data yet" appears, the pipeline hasn't run successfully

---

## Data Directories

CHARM organizes data into specific directories. Understanding this structure helps with debugging and customization.

```
charm-market-intelligence-engine/
├── config/                         # Configuration files
│   ├── .env.example                # Environment variable template
│   ├── insight_prompt.md           # LLM prompt template (editable)
│   └── job_patterns.json           # Job type/seniority regex patterns
├── data/                           # All generated data (gitignored)
│   ├── cache/                      # Cached API responses
│   │   ├── job_descriptions.json   # Cached job detail pages
│   │   ├── reports_cache.json      # Cached PDF extractions
│   │   └── gsheets_jobs_urls.txt   # Synced job URLs
│   ├── processed/                  # Pipeline outputs (dashboard reads these)
│   │   ├── jobs.csv                # Enriched job postings
│   │   ├── reports.csv             # Parsed PDF reports
│   │   ├── analysis.json           # Summary statistics
│   │   ├── insights.md             # Generated brief
│   │   └── wordcloud.png           # Visualization
│   ├── geocache.csv                # Location → lat/lon cache
│   └── charm.db                    # SQLite database
├── reports/                        # Drop PDFs here for parsing
├── reports/ (Python)               # PDF report generation package
│   ├── context.py                  # Build report context from pipeline artifacts
│   ├── pdf_report.py               # Assemble PDF with ReportLab
│   └── styles.py                   # Page layout, fonts, table styles
├── skills/                         # Taxonomy definitions
│   └── skills_taxonomy.csv         # Skill aliases → normalized names
└── secrets/                        # Credentials (gitignored)
    └── service_account.json        # Google API credentials
```

**What gets cached (and why):**
- **Job descriptions**: Avoids re-fetching the same posting detail pages
- **PDF text**: Avoids re-parsing unchanged PDFs
- **Geocoding**: Nominatim rate limits require caching location lookups
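
As an illustration of the geocoding cache, here is a minimal sketch of a look-aside cache backed by a CSV file. It is not the code in `scripts/geocode.py`: the column layout (`location,lat,lon`), the lowercase key normalization, and the `lookup` callable (a stand-in for the real Nominatim request) are all assumptions.

```python
import csv
from pathlib import Path


def load_cache(path):
    """Read location -> (lat, lon) pairs from a CSV cache, if the file exists."""
    cache = {}
    p = Path(path)
    if p.is_file():
        with p.open(newline="", encoding="utf-8") as fh:
            for row in csv.DictReader(fh):
                cache[row["location"]] = (float(row["lat"]), float(row["lon"]))
    return cache


def geocode_cached(location, cache, lookup, path):
    """Return (lat, lon), calling `lookup` only when the location is not cached."""
    key = location.strip().lower()  # assumed normalization
    if key in cache:
        return cache[key]
    coords = lookup(location)  # hypothetical wrapper around the geocoder
    if coords is None:
        return None
    cache[key] = coords
    # Append the new entry so later runs can skip the network call.
    p = Path(path)
    p.parent.mkdir(parents=True, exist_ok=True)
    is_new = not p.is_file()
    with p.open("a", newline="", encoding="utf-8") as fh:
        writer = csv.writer(fh)
        if is_new:
            writer.writerow(["location", "lat", "lon"])
        writer.writerow([key, coords[0], coords[1]])
    return coords
```

Because every miss is appended to disk immediately, an interrupted run still keeps the lookups it has already paid for.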

**To clear caches and force fresh data:**
```bash
rm -rf data/cache/
make run-pipeline
```

---

## Environment Variables

Copy `.env.example` to `.env` and configure as needed. The file includes detailed comments explaining each variable.

### Core Flags (safe defaults)

- `GOOGLE_SERVICE_ACCOUNT_FILE=ENTER_PATH_TO_SERVICE_ACCOUNT_JSON_HERE`
  - Path to your service account key file (like `secrets/service_account.json`).
- `GOOGLE_SHEET_ID=ENTER_GOOGLE_SHEET_ID_HERE`
  - The ID from your Sheet URL.
- `OPENAI_API_KEY=ENTER_OPENAI_API_KEY_HERE`
  - Only needed if `USE_LLM=true` and `LLM_PROVIDER=openai`.
- `GEOCODE_CONTACT_EMAIL=ENTER_CONTACT_EMAIL_HERE`
  - Required by Nominatim usage guidelines so your geocoding requests have a contact.

### Quick reference
| Variable | Default | Purpose |
| --- | --- | --- |
| `USE_SQLITE` | `true` | Persist processed jobs/reports into `data/charm.db`. |
| `USE_SHEETS` | `false` | Append jobs/reports to Google Sheets when credentials are configured. |
| `USE_LLM` | `false` | Enable the optional LLM brief via `config/insight_prompt.md`. |
| `ALLOW_PIPELINE_RUN` | `false` | Allow the Streamlit wizard to run the pipeline on this machine. |
| `PIPELINE_SCRAPE` | `true` | Run the job board scraper step. |
| `PIPELINE_REPORTS` | `true` | Parse PDFs in the `reports` folder. |
| `PIPELINE_NLP` | `true` | Run spaCy and skills extraction. |
| `PIPELINE_SENTIMENT` | `true` | Add sentiment scores. |
| `PIPELINE_GEOCODE` | `true` | Geocode locations using Nominatim. |
| `USER_AGENT` | `CHARM/1.0 (research)` | HTTP header for scrapers; include contact info. |
| `GEOCODE_CONTACT_EMAIL` | _(empty)_ | Injected into the Nominatim UA per policy. |
| `LLM_PROVIDER`, `LLM_MODEL` | `openai`, `gpt-4o-mini` | Choose an LLM backend/model when `USE_LLM=true`. |
| `LLM_BASE_URL` | _(empty)_ | Base URL for OpenAI compatible providers when `LLM_PROVIDER=openai_compat`. |
| `HF_TOKEN`, `HF_MODEL` | _(empty)_ | Use Hugging Face hosted inference when `LLM_PROVIDER=hf_inference`. |
| `GOOGLE_SERVICE_ACCOUNT_FILE`, `GOOGLE_SHEET_ID` | _(empty)_ | Required for Sheets sync/tests. |
| `SCRAPER_MAX_WORKERS` | `4` | Number of concurrent detail-page fetches; lower for stricter rate limits. |
| `SCRAPER_REQUEST_INTERVAL` | `0.8` | Minimum seconds between outbound requests (global). Increase to slow the scraper. |

After editing, verify:
```bash
python scripts/gsheets_test.py # check Google Sheets access
python scripts/pipeline.py # run the end-to-end pipeline
```
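
For readers adapting the pipeline, the following sketch shows one way boolean flags like those in the table could be read with safe defaults. It is an assumption-laden illustration, not the project's loader: `env_flag` and `load_flags` are hypothetical names, and the set of accepted truthy spellings is a guess.

```python
import os

# Assumed truthy spellings; the real pipeline may accept a different set.
TRUTHY = {"1", "true", "yes", "on"}


def env_flag(name, default=False, environ=None):
    """Read a boolean flag, falling back to `default` when unset or empty."""
    env = os.environ if environ is None else environ
    raw = env.get(name)
    if raw is None or raw.strip() == "":
        return default
    return raw.strip().lower() in TRUTHY


def load_flags(environ=None):
    """Collect the documented flags using the defaults from the table above."""
    defaults = {
        "USE_SQLITE": True,
        "USE_SHEETS": False,
        "USE_LLM": False,
        "ALLOW_PIPELINE_RUN": False,
        "PIPELINE_SCRAPE": True,
        "PIPELINE_REPORTS": True,
        "PIPELINE_NLP": True,
        "PIPELINE_SENTIMENT": True,
        "PIPELINE_GEOCODE": True,
    }
    return {name: env_flag(name, default, environ) for name, default in defaults.items()}
```

Keeping the paid or externally visible features (`USE_LLM`, `USE_SHEETS`, `ALLOW_PIPELINE_RUN`) off unless explicitly enabled means a missing `.env` never triggers outside calls.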

## Architecture
- **n8n orchestration**: n8n is an open-source workflow automation tool; we use it for Cron/Webhook triggers that run the pipeline via one Execute Command.
- **Python pipeline**: scraping → cleaning/dedupe → report parsing → Natural Language Processing (NLP) → sentiment → geocoding → analysis → insights → persistence.
- **Storage**: CSVs for the dashboard + **SQLite** for durable querying; **Google Sheets** for sharing raw rows.
- **Dashboard**: Streamlit with **Plotly** charts and a **Folium** map (heatmap + clustered markers).
- **Large Language Model (LLM, optional)**: brief insights when enabled in `.env`.

## n8n Scheduling (Synology)
Import `n8n/charm_workflow.json` and point Execute Command to:
```bash
bash -lc "cd /data/charm-market-intelligence-engine && source .venv/bin/activate && python scripts/pipeline.py"
```

## Adapting to Other Industries
- Add parsers in `scripts/scrape_jobs.py` for new job boards.
- Update `skills/skills_taxonomy.csv` with additional skills/aliases.
- Expand rules in `scripts/insights.py` to map skills → program formats.
- Customize `config/job_patterns.json` to tweak job-type/seniority detection; run `python scripts/validate_patterns.py` (or `make validate-patterns`) after edits to ensure regexes compile.

## Google Sheets Integration (opt-in)
1. Enable **Google Sheets API** and **Google Drive API** in Google Cloud Platform (GCP).
2. Create a **Service Account**, download the JSON key to `secrets/service_account.json` (or your path).
3. Set `GOOGLE_SHEET_ID` and `GOOGLE_SERVICE_ACCOUNT_FILE` in `.env` (replace the placeholders).
4. Share the Sheet with the service account email as **Editor**.
5. Test:
```bash
python scripts/gsheets_test.py
```

## Troubleshooting

### Common issues and fixes

| Symptom | Likely Cause | Fix |
|---------|--------------|-----|
| `ModuleNotFoundError: No module named 'spacy'` | Virtual environment not activated | Run `source .venv/bin/activate` (macOS/Linux) or `.venv\Scripts\activate` (Windows) |
| `OSError: [E050] Can't find model 'en_core_web_sm'` | spaCy language model not downloaded | Run `python -m spacy download en_core_web_sm` |
| No jobs scraped (empty CSV) | CSS selectors out of date OR site blocking requests | Check `scripts/scrape_jobs.py` selectors; add delays between requests |
| `FileNotFoundError: config/.env` | Environment file missing | Copy `.env.example` to `config/.env` and fill in values |
| Geocoding extremely slow | Nominatim rate limiting | Normal behavior; geocache (`data/geocache.csv`) speeds up repeat runs |
| `google.auth.exceptions.DefaultCredentialsError` | Service account JSON missing or path wrong | Verify `GOOGLE_SERVICE_ACCOUNT_FILE` path in `.env` |
| Sheets append fails silently | Sheet ID incorrect or missing permissions | Double-check `GOOGLE_SHEET_ID`; share sheet with service account email |
| Dashboard won't start | Port 8501 already in use | Kill other Streamlit processes or use `streamlit run dashboard/app.py --server.port 8502` |
| `OPENAI_API_KEY` error when `USE_LLM=true` | Key not set or invalid | Add valid key to `.env` or set `USE_LLM=false` |

### Permission issues (NAS / network drives)
If you're running on a NAS or shared drive:
- Ensure write access to `data/` directory
- SQLite may perform poorly over network mounts; consider running locally

### Checking logs
Most scripts print progress to stdout. For more detail:
```bash
python scripts/pipeline.py 2>&1 | tee pipeline.log
```

## Insight prompt (external & editable)
The LLM question set lives in `config/insight_prompt.md`. It’s plain text with **{{variables}}** you can edit:
- `{{INDUSTRY}}`, `{{DATE_TODAY}}`, `{{NUM_JOBS}}`, `{{UNIQUE_EMPLOYERS}}`, `{{GEOCODED}}`
- `{{TOP_SKILLS_BULLETS}}` → a bullet list of top skills and counts

To see the fully rendered prompt (before sending to an LLM):
```bash
python scripts/preview_prompt.py
```

If the file is missing, the workflow falls back to a concise built-in prompt.
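
To make the `{{variables}}` mechanism concrete, here is a minimal sketch of placeholder substitution. It is not the project's renderer: `render_prompt` and `skills_bullets` are hypothetical helpers, and leaving unknown placeholders untouched is a design assumption.

```python
import re

# Matches placeholders such as {{INDUSTRY}} or {{ NUM_JOBS }}.
PLACEHOLDER = re.compile(r"\{\{\s*([A-Z_]+)\s*\}\}")


def render_prompt(template, values):
    """Replace known {{NAME}} placeholders; leave unknown ones as written."""
    def substitute(match):
        name = match.group(1)
        return str(values[name]) if name in values else match.group(0)
    return PLACEHOLDER.sub(substitute, template)


def skills_bullets(top_skills):
    """Format (skill, count) pairs as a Markdown bullet list."""
    return "\n".join(f"- {skill}: {count}" for skill, count in top_skills)
```

For example, `render_prompt("{{NUM_JOBS}} postings in {{INDUSTRY}}", {"NUM_JOBS": 42, "INDUSTRY": "CRM"})` yields `42 postings in CRM`.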

## Dashboard design choices
Dashboard design notes:
- **Single column rhythm** with clear section spacing; primary actions (filters, downloads) are easy to find.
- **Subtle cards** for key findings; no heavy boxes or loud colors.
- **Plotly (plotly_white)** with reduced chart chrome; labels kept concise.
- **Folium map** with heatmap + clustered markers for fast spatial scanning.
- Hidden default Streamlit menu/footer to keep focus on data.
- Sidebar filters drive all sections, so the page stays uncluttered.
- **PDF export** via a header download button; the report is generated with ReportLab and cached by a content fingerprint so repeated clicks are instant.

## PDF report export

The dashboard includes a "Download report" button that generates a multi-section PDF from the latest pipeline run. The report is built with ReportLab and uses the Inter font family (falls back to Helvetica if the TTFs are missing).

**Sections included:**
1. Cover page (title, date range, fingerprint)
2. Executive Summary (5 data-driven bullets)
3. Key Findings (up to 9 metrics in a 3-column grid)
4. Trends & Signals (top 12 skills + emerging skills tables)
5. Implications & Opportunities (actionable cards with "Why it matters")
6. Methods & Governance (data sources, approach, limitations)
7. Appendix (definitions and employer sources)

**How it works:**
- `reports/context.py` reads `jobs.csv`, `analysis.json`, and `insights.md` from the processed directory and builds a normalized context dict.
- `reports/pdf_report.py` turns that context into ReportLab flowables and assembles the PDF.
- `reports/styles.py` defines the page layout, paragraph styles, and table styles.
- `dashboard/header.py` wires the download button into the Streamlit header; the PDF bytes are cached by a SHA-256 fingerprint of the source files, so the report is only regenerated when data changes.
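
The fingerprint-based caching described above can be sketched as follows. This is an illustration under assumptions rather than the code in `reports/context.py` or `dashboard/header.py`: `fingerprint`, `cached_report`, and the in-memory `cache` dict are hypothetical, and the real implementation may hash different inputs.

```python
import hashlib
from pathlib import Path


def fingerprint(paths, chunk_size=65536):
    """Return a SHA-256 hex digest over the names and contents of the given files."""
    digest = hashlib.sha256()
    for path in sorted(Path(p) for p in paths):
        digest.update(path.name.encode("utf-8"))
        digest.update(b"\0")  # separator between name and content
        if path.is_file():
            with path.open("rb") as fh:
                for chunk in iter(lambda: fh.read(chunk_size), b""):
                    digest.update(chunk)
        digest.update(b"\0")  # separator between files
    return digest.hexdigest()


def cached_report(paths, build, cache):
    """Return PDF bytes, calling `build()` only when the source files changed."""
    key = fingerprint(paths)
    if key not in cache:
        cache[key] = build()
    return cache[key]
```

Sorting the paths makes the digest independent of argument order, so the same inputs always map to the same cache entry.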

**Generating a report from the command line:**
```bash
python scripts/generate_report.py --proc-dir data/processed --out-dir data/reports
```

This writes a `CHARM_Report_*.pdf` file to the output directory and runs basic validity checks (PDF header present, file size > 10 KB).

## LLM options (opt-in)
By default `USE_LLM=false` in `config/.env.example`, so the rules-based brief runs without triggering model calls. Flip it to `true` only when you are ready to authenticate a provider. When enabled, the pipeline renders the external prompt (`config/insight_prompt.md`) with your current data.

Choose a provider in `.env`:
- `LLM_PROVIDER=openai` → set `OPENAI_API_KEY=ENTER_OPENAI_API_KEY_HERE`
- `LLM_PROVIDER=ollama` → set `OLLAMA_BASE_URL` (default `http://localhost:11434`) and `LLM_MODEL` (e.g., `llama3:instruct`)

If no key is present or the call fails, the pipeline still produces **rules-based insights**.

### Google Sheets - reports worksheet
The pipeline can also append a **reports** tab with parsed report metadata:
- Worksheet name (default): `reports` (configure via `GOOGLE_SHEET_WORKSHEET_REPORTS`)
- Columns: report_name, word_count, skills (comma-separated)

## Makefile quick commands
Use the included `Makefile` to run common tasks with short commands:

```bash
make setup # venv + requirements + models
make run # scrape → process → analyze → insights → SQLite/CSVs → Sheets
make dash # launch the Streamlit dashboard
make sheets-test # verify Google Sheets setup
make prompt # preview rendered LLM prompt
make reset-db # delete data/charm.db (keeps CSVs)
make clean # clear caches
```

On macOS/Linux it works out of the box. On Windows, use **Git Bash** or **WSL**.

## How the workflow runs (step-by-step)
1. **Scrape job boards** (AAA + ACRA) with pagination → `scripts/scrape_jobs.py`
    - Collects: `source, title, company, location, date_posted, job_url, description`
    - Walks “Next” pages safely (limit=10) and de-dupes by `job_url`.
2. **Clean & de-duplicate** → `scripts/data_cleaning.py`
    - Normalizes text, hashes `(title|company|desc-snippet)` to drop dupes.
    - Extracts **salary** hints (`salary_min`, `salary_max`, `currency`) when present.
3. **Parse industry reports (PDFs)** → `scripts/parse_reports.py`
    - Reads PDFs from `/reports/` with PyMuPDF; outputs one row per report.
4. **NLP enrichment (jobs + reports)** → `scripts/nlp_entities.py`
    - spaCy NER (ORG/GPE/LOC) and **skills taxonomy** matching (`skills/skills_taxonomy.csv`).
5. **Sentiment** (optional) → `scripts/sentiment_salience.py`
6. **Geocode locations** with Nominatim + on-disk cache → `scripts/geocode.py`
7. **Persist** results
    - CSVs to `data/processed/` (for the dashboard)
    - **SQLite** to `data/charm.db` (for durable querying and auditing)
8. **Share** (optional): append **jobs** + **reports** to Google Sheets
9. **Analyze** → `scripts/analyze.py` (top skills, counts; optional clustering)
10. **Generate insights** → `scripts/insights.py`
    - Rules-based recommendations (always)
    - **LLM brief** using the external prompt (`config/insight_prompt.md`)
11. **Visualize** → `dashboard/app.py` (Streamlit + Plotly + Folium)
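
The de-duplication in steps 1 and 2 can be sketched like this. It is a simplified illustration, not the code in `scripts/data_cleaning.py`: the 200-character snippet length, the use of SHA-256, and the helper names `content_key` and `dedupe` are assumptions.

```python
import hashlib
import re


def content_key(job, snippet_len=200):
    """Hash the normalized title|company|description-snippet of a posting."""
    def norm(text):
        # Collapse whitespace and lowercase so cosmetic differences do not matter.
        return re.sub(r"\s+", " ", text or "").strip().lower()

    parts = [
        norm(job.get("title")),
        norm(job.get("company")),
        norm(job.get("description"))[:snippet_len],
    ]
    return hashlib.sha256("|".join(parts).encode("utf-8")).hexdigest()


def dedupe(jobs):
    """Keep the first posting for each job_url and each content hash."""
    seen_urls, seen_keys, unique = set(), set(), []
    for job in jobs:
        url = job.get("job_url")
        key = content_key(job)
        if (url and url in seen_urls) or key in seen_keys:
            continue
        if url:
            seen_urls.add(url)
        seen_keys.add(key)
        unique.append(job)
    return unique
```

Checking both the URL and the content hash catches the same posting listed on two boards as well as a posting re-listed under a new URL.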

## Working with industry reports (PDFs)
- Drop **.pdf** files into the `reports/` folder.
- On the next run, the pipeline will extract text with PyMuPDF, enrich with NER + skills, and:
  - write `data/processed/reports.csv`
  - upsert into `data/charm.db` (`reports` table)
  - append a concise row to Google Sheets (worksheet: `reports`) with `report_name`, `word_count`, and aggregated `skills`.
- Parsed text is cached in `data/cache/reports_cache.json`, so unchanged PDFs aren't re-read on every execution.
- Reports are combined with job data in analysis and in the LLM prompt context to surface **trends and gaps**.

## Program mapping & outcomes
The insights module translates demand signals into **program formats**:
- Undergraduate (online), Graduate (online)
- Certificate, Post-baccalaureate
- Workshop, Microlearning

How it works:
- `scripts/insights.py` contains simple, transparent **rules** that map top skills to program formats.
- If `USE_LLM=true`, the external prompt (`config/insight_prompt.md`) requests:
  - **5 trend statements**, **3 emerging skills**, **3 program gaps/opportunities**, explicitly referencing those formats.
- To tailor outputs for different catalogs or brands, adjust:
  - `skills/skills_taxonomy.csv` (aliases & categories)
  - mapping rules in `scripts/insights.py`
  - the prompt language in `config/insight_prompt.md`

## Scraping notes & governance
- **Pagination:** the scrapers follow “Next” links (rel/aria/title/text) with a safe page limit.
- **Politeness:** configurable rate limiting + polite User-Agent (see `SCRAPER_MAX_WORKERS` / `SCRAPER_REQUEST_INTERVAL` in `.env`). For example, the defaults (4 workers, 0.8 s interval) average ~5 detail fetches/sec; lower the worker count or raise the interval if the target site enforces stricter limits.
- **Job-description caching:** fetched detail pages are stored in `data/cache/job_descriptions.json` so reruns avoid hammering the same postings. Adjust worker count/interval in `.env` to tune throughput.
- **Dedupe:** by `job_url` and content hash to avoid churn and inflated counts.
- **Respect sites:** check each site's **robots.txt** and Terms of Service before scraping; scale cautiously and cache aggressively.
- **No PII:** the pipeline collects job-level, non-personal data only; avoid ingesting personally identifiable information (PII).
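
A global minimum interval between requests, as described for `SCRAPER_REQUEST_INTERVAL`, could be enforced across worker threads roughly as below. This is a sketch, not the scraper's actual implementation: `MinIntervalLimiter` is a hypothetical class, and the injectable `clock` and `sleep` parameters exist only to make it testable.

```python
import threading
import time


class MinIntervalLimiter:
    """Space out calls so that consecutive requests are at least `interval` seconds apart."""

    def __init__(self, interval, clock=time.monotonic, sleep=time.sleep):
        self._interval = interval
        self._clock = clock
        self._sleep = sleep
        self._lock = threading.Lock()
        self._next_allowed = 0.0

    def wait(self):
        """Block until this caller's reserved slot; return the delay that was applied."""
        with self._lock:
            now = self._clock()
            # Reserve the earliest free slot, then push the next slot back by one interval.
            slot = max(now, self._next_allowed)
            self._next_allowed = slot + self._interval
        delay = slot - now
        if delay > 0:
            self._sleep(delay)  # sleep outside the lock so other workers can reserve slots
        return delay
```

Each worker would call `limiter.wait()` before every outbound request. In this sketch the interval is shared by all workers, so throughput is bounded by 1 / interval regardless of worker count; whether the real scraper applies the interval globally or per worker is not shown here, and a per-worker interval would raise that ceiling to workers / interval.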

## Mattermost notifications
The pipeline can post a completion message to a Mattermost channel using an **incoming webhook**.

**Setup:**
1. Create an **Incoming Webhook** in your Mattermost workspace (bound to a channel).
2. In `.env`, set:
    - `MATTERMOST_WEBHOOK_URL=ENTER_MATTERMOST_WEBHOOK_URL_HERE`
    - `DASHBOARD_URL=ENTER_DASHBOARD_URL_HERE` (public or internal URL for your Streamlit app)
3. Run `make run` (or `python scripts/pipeline.py`).

**What it sends:**
- A check-marked completion line
- A short summary (total postings, employers, geocoded count, top 5 skills)
- A link to the dashboard
- An optional short snippet from `insights.md` (“LLM Brief” if present)

**Where to customize:** open `n8n/charm_workflow_mattermost.json`, find the `Notification Config` node, and edit the embedded JavaScript to change the summary payload.

### Mattermost in n8n (post-run notification)
Import `n8n/charm_workflow_mattermost.json` for a version of the workflow that **notifies Mattermost after each run**.

**Configure in the “Notification Config” node:**
- `webhookUrl` → your Mattermost **Incoming Webhook URL**
- `dashboardUrl` → link to your Streamlit app (public or internal)
- `mention` → optional (`@channel`, `@here`, or empty)
- `thresholdSkillsCsv` → comma-separated skills to track (e.g., `ArcGIS,Section 106,NEPA`)
- `thresholdPercent` → percentage change to trigger an alert (default `20`)

**What the message includes:**
- ✅ Completion line
- Totals (postings, employers, geocoded)
- Top 5 skills
- Dashboard link
- **Alerts** section if thresholds are hit (↑/↓ with % change vs previous run)
- A short **Brief** snippet extracted from `insights.md`

**How thresholds work:**
- The workflow reads the previous snapshot from `data/processed/analysis_prev.json` (if present)
- It writes the current `analysis.json` to that file after posting the message, so the next run compares properly
- Zero results trigger a “No jobs scraped” alert automatically
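
The threshold comparison can be illustrated in Python, although the workflow itself implements it in JavaScript inside the n8n node. The sketch assumes skill counts are available as `skill → count` mappings for the previous and current runs; the actual layout of `top_skills` in `analysis.json` may differ, and `skill_alerts` is a hypothetical name.

```python
def skill_alerts(previous, current, tracked, threshold_percent=20.0):
    """Return alert lines for tracked skills whose count moved by at least the threshold."""
    alerts = []
    for skill in tracked:
        before = previous.get(skill, 0)
        after = current.get(skill, 0)
        if before == 0:
            continue  # no baseline, so a percentage change is undefined
        change = (after - before) / before * 100.0
        if abs(change) >= threshold_percent:
            arrow = "↑" if change > 0 else "↓"
            alerts.append(f"{arrow} {skill}: {change:+.0f}% ({before} → {after})")
    return alerts
```

With a previous count of 10 and a current count of 13 for ArcGIS, the change is +30%, which exceeds the default 20% threshold and produces an alert line.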

## Large Language Model (LLM) options (self-hosted vs. cloud)
This pipeline supports two classes of LLM backends:

**Cloud (commercial): OpenAI**
- Set `LLM_PROVIDER=openai` and `OPENAI_API_KEY=ENTER_OPENAI_API_KEY_HERE`
- Pros: highest quality, simple setup, scalable.
- Cons: usage cost; data governance requires key management.

**Self-hosted:**
1) **Ollama** (simple local runner on CPU/GPU; great for demos)
   - Set `LLM_PROVIDER=ollama`, `OLLAMA_BASE_URL=http://localhost:11434`, `LLM_MODEL=llama3:instruct` (or similar)
   - Pros: easiest local setup; good developer ergonomics.
   - Cons: slower on CPU-only; fewer enterprise durability knobs.

2) **OpenAI-compatible server** (e.g., vLLM on GPU)
   - Set `LLM_PROVIDER=openai_compat`, `LLM_BASE_URL=http://YOUR-HOST:PORT/v1`, `LLM_MODEL=YourModelName`
   - Pros: production-friendly throughput and token cost control; keeps data in your infra.
   - Cons: requires GPU provisioning & ops (e.g., vLLM/TGI deployment).

**Recommendation:** For a Synology/NAS demo, Ollama is the fastest path to a working self-host. For higher throughput or larger prompts, deploy **vLLM** with an OpenAI-compatible endpoint and switch to `LLM_PROVIDER=openai_compat`.

### Why offer both (OpenAI + Ollama)
Supporting both cloud and local LLMs is practical:

- **Cost control**: Cloud models are metered. Ollama lets you run unlimited local inferences at no extra cost.
- **Data governance**: Some orgs want text to stay on-prem. A local model keeps everything in your infrastructure.
- **Portability**: Anyone can clone this repo and get working insights with Ollama -- no API key required.
- **Failover**: If your cloud quota runs out or there's an outage, the local model keeps the pipeline functional.

## Components & responsibilities (what each piece does)

This repository is organized so a reviewer can read it top-down and understand exactly how the system works. Every piece below has a clear, single responsibility.

### Orchestration (n8n)
- `n8n/charm_workflow.json` - Minimal scheduler/trigger that runs the Python pipeline via **Execute Command** from a Cron or Webhook.
- `n8n/charm_workflow_mattermost.json` - Same as above, with post-run **Mattermost notifications**. Reads `analysis.json` and `insights.md`, composes a short message (totals, top skills, optional alerts, brief), posts to your incoming webhook, and snapshots the current analysis for next-run comparisons.

### Configuration
- `config/.env.example` - Environment variables with explicit placeholders (e.g., `ENTER_GOOGLE_SHEET_ID_HERE`). Copy to `.env` and fill in. Includes LLM provider switches, user agent, and dashboard URL.
- `config/insight_prompt.md` - The human-editable prompt template used when `USE_LLM=true`. It’s rendered with live variables (date, counts, top skills) before calling the model.

### Data definitions / taxonomy
- `skills/skills_taxonomy.csv` - Deterministic mapping of common terms/aliases to normalized skill names and (optionally) categories. This keeps “GIS” vs “ArcGIS” vs “ArcGIS Pro” consistent in analysis.
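
As a rough picture of taxonomy-based matching, the sketch below maps aliases to normalized skill names and scans text for them. The CSV column names (`alias`, `skill`), the example rows, and the helpers `load_taxonomy` and `extract_skills` are assumptions for illustration; the real file and `scripts/nlp_entities.py` may be organized differently.

```python
import csv
import io
import re


def load_taxonomy(fh):
    """Build a lowercase alias -> normalized skill mapping from a CSV file object."""
    return {row["alias"].strip().lower(): row["skill"].strip() for row in csv.DictReader(fh)}


def extract_skills(text, taxonomy):
    """Return a comma-separated, sorted list of normalized skills found in `text`."""
    found = set()
    for alias, skill in taxonomy.items():
        # Lookarounds keep "gis" from matching inside "ArcGIS".
        pattern = r"(?<!\w)" + re.escape(alias) + r"(?!\w)"
        if re.search(pattern, text, flags=re.IGNORECASE):
            found.add(skill)
    return ", ".join(sorted(found))


# Hypothetical taxonomy rows, inlined here instead of reading the real CSV.
SAMPLE = io.StringIO(
    "alias,skill\n"
    "arcgis,ArcGIS\n"
    "arcgis pro,ArcGIS\n"
    "gis,GIS\n"
    "section 106,Section 106\n"
)
taxonomy = load_taxonomy(SAMPLE)
print(extract_skills("Experience with ArcGIS Pro and Section 106 review", taxonomy))
```

Because several aliases collapse onto one normalized name, counts in the analysis reflect skills rather than spelling variants.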

### Pipeline (Python)
- `scripts/pipeline.py` - The orchestrator. Runs end-to-end: scrape → clean/dedupe → parse reports → NLP/skills → sentiment → geocode → analyze → insights → persist (CSV/SQLite) → optional Google Sheets append.
- `scripts/scrape_jobs.py` - Scrapers for ACRA + AAA with **pagination** and per-item description fetching. Uses a configurable `USER_AGENT` and polite defaults.
- `scripts/data_cleaning.py` - Normalization and duplicate detection (content hashing across title/company/description snippet). Also extracts salary hints when present.
- `scripts/parse_reports.py` - Reads **PDFs** from `/reports/` with PyMuPDF; emits one record per report with the raw text for downstream NLP.
- `scripts/nlp_entities.py` - spaCy NER for organizations and locations + taxonomy-based **skill extraction**. Produces a comma-separated `skills` column.
- `scripts/sentiment_salience.py` - Optional lightweight sentiment using VADER (useful for qualitative clustering or future labeling).
- `scripts/geocode.py` - Geocodes the `location` field using Nominatim with on-disk caching; attaches `lat` and `lon` for mapping.
- `scripts/analyze.py` - **Pandas-based** summaries (top skills, counts, employers, geocoded totals). Can be extended to clustering or time-series.
- `scripts/insights.py` - Generates a short, human-readable brief. Always emits rule-based recommendations; when `USE_LLM=true`, renders `config/insight_prompt.md` and calls the selected provider (OpenAI or Ollama).
- `scripts/gsheets_sync.py` - Appends **jobs** and **reports** to Google Sheets. Handles worksheet creation and de-dupe by URL or name.
- `scripts/gsheets_test.py` - One-liner connectivity check for Sheets credentials and permissions.
- `scripts/preview_prompt.py` - Renders the final LLM prompt (with current data) so you can review or paste it elsewhere.
- `scripts/pandas_examples.py` - Extra recipes for ad-hoc analysis; helpful for quick CSV exports during exploration.

### Storage / outputs
- `data/charm.db` - SQLite database created on first run (durable auditing and ad-hoc queries).
- `data/processed/` - CSV and artifacts used by the dashboard: `jobs.csv`, `reports.csv`, `analysis.json`, `insights.md`, and `wordcloud.png`.
- `docs/sql_examples.sql` - A few ready-to-use SQL queries against `charm.db` (e.g., salary by skill, recent Section 106/NEPA postings).
- `docs/data_contract.md` - Field-level documentation for each exported file so downstream teams know how to consume them.

### Dashboard (Streamlit + Folium + Plotly)
- `dashboard/app.py` - Single-page, minimalist UI:
  - Key findings cards (postings, employers, geocoded)
  - Top skills bar chart (Plotly)
  - Job map with heatmap + clustered markers (Folium)
  - Insights panel and word cloud
  - Sidebar filters and simple download actions (filtered CSV, analysis JSON)
- `dashboard/header.py` - Header strip with the PDF download button; caches generated bytes by content fingerprint.
- `.streamlit/config.toml` - Neutral, brand-agnostic theme.

### Report generation (ReportLab)
- `reports/context.py` - Builds and normalizes a report context dict from pipeline artifacts (`jobs.csv`, `analysis.json`, `insights.md`). Computes a SHA-256 fingerprint for cache invalidation.
- `reports/pdf_report.py` - Assembles the multi-section PDF from the context dict using ReportLab flowables.
- `reports/styles.py` - Page layout, paragraph styles, table styles, and font registration (Inter with Helvetica fallback).
- `scripts/generate_report.py` - CLI script to generate and validate a PDF report without the dashboard.

### Tooling
- `Makefile` - Short commands for setup, running the pipeline, launching the dashboard, testing Sheets, previewing the prompt, and cleanup.
- `requirements.txt` - Python dependencies (scraping, NLP, analysis, dashboard, LLM providers).
- `LICENSE` - MIT license.
- `CHANGELOG.md` - A concise record of what’s included in this release and why certain decisions were made.

---

## How the pieces interact (at a glance)

1. **n8n** triggers the run (Cron or Webhook) → executes `scripts/pipeline.py` in the repo directory on your NAS.
2. The **pipeline** scrapes jobs (with pagination), parses any PDFs in `/reports/`, enriches with NLP/skills, geocodes, analyzes, and writes outputs to both CSV and SQLite.
3. **Google Sheets** (optional) is updated with appended rows for jobs and reports (so stakeholders can view raw structured data).
4. The **dashboard** reads `data/processed/*` and refreshes automatically when files change.
5. The **n8n Mattermost** workflow (optional) reads the latest outputs, composes a short message (totals, top skills, alerts), posts to your channel, and snapshots the analysis for the next run.

Everything is idempotent: duplicates are filtered, pagination is capped, geocoding is cached, and runs can be scheduled safely.

## Cost & usage planning
- **LLM calls (optional):** Each pipeline run with `gpt-4o-mini` costs well under $1 (usually a few cents). The default prompt and response fit comfortably within 1200 tokens. Set `USE_LLM=false` for completely free runs. If you're running this on a schedule, estimate your monthly call volume and budget accordingly.
- **Google Sheets sync (optional):** Setting `USE_SHEETS=true` calls both the Sheets and Drive APIs. Both are governed by per-project request quotas rather than per-call billing, and every run makes a few dozen append/read calls. Leave it `false` until you create a GCP project, confirm quotas, and plan for increased throughput (e.g., batch jobs nightly instead of per-scrape).
- **Geocoding:** The built-in Nominatim client is free but rate-limited to 1 request/sec; heavy usage may require hosting your own instance. Because geocoding is cached in `data/geocache.csv`, reruns stay cost-free unless you clear the cache.
- **Storage/dashboards:** Streamlit + SQLite incur no extra spend -- everything runs locally. When deploying to cloud infrastructure, include VM/storage costs in your overall estimate.
- **Sheets cache resets:** The Google Sheets sync stores cached job/report IDs under `data/cache/`. If someone edits or deletes rows directly in the Sheet, clear those files before the next run so the pipeline can rebuild its local view of existing rows.
- **Scraper throughput:** With the defaults (4 workers, 0.8 s interval), expect roughly 5 detail-page fetches per second. Increase the interval or lower the worker count if a target board publishes stricter rate limits.

Document these toggles in your runbook so reviewers understand how to perform a zero-cost demo vs. a production run with LLM + Sheets enabled.