{"id":51277445,"url":"https://github.com/mikkelrask/henryrollins-scraper","last_synced_at":"2026-06-29T22:01:52.549Z","repository":{"id":363592053,"uuid":"1239749365","full_name":"mikkelrask/henryrollins-scraper","owner":"mikkelrask","description":"FANATIC! A dataset of Henry Rollins' listens on his KCRW radio show, with data dating back to 2017 - 496 episodes of weird and rare finds, fast paced punk and frog sounds. Includes a scraper that keeps the data up-to-date with henryrollins.com","archived":false,"fork":false,"pushed_at":"2026-06-23T13:20:12.000Z","size":9602,"stargazers_count":0,"open_issues_count":0,"forks_count":0,"subscribers_count":0,"default_branch":"dev","last_synced_at":"2026-06-23T13:29:05.639Z","etag":null,"topics":["archive","data-analysis","data-visualization","music"],"latest_commit_sha":null,"homepage":"https://fanatic.raske.xyz","language":"HTML","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":null,"status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/mikkelrask.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":null,"code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null,"zenodo":null,"notice":null,"maintainers":null,"copyright":null,"agents":null,"dco":null,"cla":null}},"created_at":"2026-05-15T12:00:02.000Z","updated_at":"2026-06-23T13:20:54.000Z","dependencies_parsed_at":null,"dependency_job_id":null,"html_url":"https://github.com/mikkelrask/henryrollins-scraper","commit_stats":null,"previous_names":["mikkelrask/henryrollins-scraper"],"tags_count":0,"template":false,"template_full_name":null,"purl":"pkg:github/mikkelrask/henryrollins-scraper","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/mikkelrask%2Fhenryrollins-sc
raper","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/mikkelrask%2Fhenryrollins-scraper/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/mikkelrask%2Fhenryrollins-scraper/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/mikkelrask%2Fhenryrollins-scraper/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/mikkelrask","download_url":"https://codeload.github.com/mikkelrask/henryrollins-scraper/tar.gz/refs/heads/dev","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/mikkelrask%2Fhenryrollins-scraper/sbom","scorecard":null,"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":286080680,"owners_count":34944147,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2026-05-26T15:22:16.424Z","status":"online","status_checked_at":"2026-06-29T02:00:05.398Z","response_time":58,"last_error":null,"robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":true,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["archive","data-analysis","data-visualization","music"],"created_at":"2026-06-29T22:01:50.843Z","updated_at":"2026-06-29T22:01:52.541Z","avatar_url":"https://github.com/mikkelrask.png","language":"HTML","funding_links":[],"categories":[],"sub_categories":[],"readme":"# Fanatic 📻\n\n**Henry Rollins Radio Scraper** — extracts track listings from [Henry Rollins' KCRW radio show](https://www.henryrollins.com/radio), two hours of fanatic-level deep cuts every Friday.\n\n| Component | URL | Stack 
|\n|-----------|-----|-------|\n| **Frontend** | https://fanatic.raske.xyz | Svelte 5 + Tailwind + D3 |\n| **API** | https://henryrollins-api.terminal-share.workers.dev | Hono + D1 (TypeScript) |\n| **Database** | Cloudflare D1 | SQLite (5.4 MB, 11 tables) |\n\n## What it does\n\n- **Scrapes** all monthly archive pages from `henryrollins.com/radio` (back to 2017)\n- **Parses** each episode's track listing (Hour 1 / Hour 2, numbered tracks with artist/title/album)\n- **Extracts** Bandcamp links separately (Henry shares a lot of them)\n- **Stores** everything in SQLite for analytics\n- **Resolves** artist names against MusicBrainz via `build_artist_cache.py`\n- **Enriches** artists with country, formed year, genres, tags, and bios from MusicBrainz / Last.fm / Wikipedia\n- **Tags** your local beets library with the `FANATIC` genre via `add_fanatic_genre.py`\n- **Deploys** the full stack to Cloudflare (Pages + D1 + Workers)\n\n## Architecture\n\n```\n                     ┌───────────────────────────────┐\n                     │    Cloudflare Pages            │\n                     │  ┌─────────────────────────┐   │\n                     │  │  Svelte 5 App           │   │\n                     │  │  (web/app/dist/)        │   │\n                     │  └────────┬────────────────┘   │\n                     │  /api/*   │ Pages Function      │\n                     │  ┌────────▼────────────────┐   │\n                     │  │  Hono API Worker         │   │\n                     │  │  (TypeScript, 38 routes) │   │\n                     │  └────────┬────────────────┘   │\n                     └───────────┼────────────────────┘\n                                │ env.DB\n                     ┌──────────▼──────────┐\n                     │  D1: henryrollins    │\n                     │  11 tables, 5.4 MB   │\n                     │  ~34K rows total     │\n                     └──────────────────────┘\n\nScraping / enrichment happen locally (Python). 
Only the data is pushed to\nthe cloud — zero external API calls at request time.\n```\n\n## Quick Start (Local Dev)\n\n### Setup\n\n```bash\n# Python environment\nuv venv .venv\nsource .venv/bin/activate\nuv pip install -r requirements.txt\n\n# Frontend\ncd web/app \u0026\u0026 npm install\n```\n\n### Run locally\n\n```bash\n# Terminal 1: Python API backend\ncd web/api \u0026\u0026 uvicorn main:app --reload --port 8000\n\n# Terminal 2: Svelte dev server (proxies /api → localhost:8000)\ncd web/app \u0026\u0026 npm run dev\n```\n\nThe Svelte dev proxy routes `/api/*` to the Python backend, so the frontend\nworks identically to production — just open `localhost:5173`.\n\n### Scraping\n\n```bash\n# Full historical scrape (back to 2017)\n./scraper\n\n# Just the most recent 3 months\n./scraper --limit-months 3\n\n# Re-export JSON from existing database without re-scraping\n./scraper --export-only\n```\n\n### Analytics\n\n```bash\n# Show summary stats\n./analytics\n\n# Top 20 most-played artists\n./analytics top-artists\n\n# Search for a specific artist\n./analytics artist \"Wire\"\n\n# All bandcamp links\n./analytics bandcamp-links\n```\n\n## Deploying to Cloudflare\n\nThe full stack (Worker + Pages + D1) runs on Cloudflare. Deploying pushes\nlocally scraped + enriched data up without making any API calls from the\nWorker itself.\n\n### Prerequisites\n\n```bash\n# Authenticate wrangler (one-time)\ncd worker \u0026\u0026 npx wrangler login\n\n# Create the D1 database (one-time — if not already done)\nnpx wrangler d1 create henryrollins\n# Then paste the database_id into worker/wrangler.toml\n\n# Set the admin API key (one-time)\necho \"your-secret-key\" | npx wrangler secret put ADMIN_API_KEY\n```\n\n### Full deploy\n\n```bash\n# 1. Scrape new episodes (enrichment happens inline)\n./scraper\n\n# 2. Deploy to Cloudflare (no separate enrichment step)\n./scripts/deploy.sh\n```\n\nOr step by step:\n\n```bash\n# 1. 
Scrape new episodes (MB + Last.fm enrichment done inline)\n./scraper\n\n# 2. Export to D1 SQL seed + push to Cloudflare\npython3 scripts/generate-d1-seed.py seed.sql\ncd worker\nnpx wrangler d1 execute henryrollins --remote --file=../seed.sql\n\n# 3. Deploy API Worker\nnpx wrangler deploy\n\n# 4. Deploy frontend to Pages\ncd ../web/app\nnpx wrangler pages deploy dist --project-name=fanatic --branch=main\n```\n\n\u003e Full last.fm re-enrichment (rarely needed): `python3 scripts/re-enrich-all.py`\n\n### Quick re-deploy (no new data)\n\n```bash\n# UI only: build frontend and deploy Pages (skip Worker + D1)\nnpm run deploy:ui\n\n# API Worker only\ncd worker \u0026\u0026 npx wrangler deploy\n```\n\n## Admin Panel\n\nThe admin panel is at https://fanatic.raske.xyz/#/admin. It's gated by an\nAPI key sent as the `x-admin-key` header.\n\nThe key is set via `wrangler secret put ADMIN_API_KEY` and is also stored\nlocally in `.env` and `worker/.dev.vars`.\n\nTo use the admin API directly:\n\n```bash\ncurl -H 'x-admin-key: your-secret-key' \\\n  https://fanatic.raske.xyz/api/admin/entities\n```\n\n## Cron Automation\n\nAdd this to your crontab to check for new episodes weekly:\n\n```cron\n# Scrape new Henry Rollins episodes every Monday at 9am\n0 9 * * 1 cd /path/to/henryrollins-scraper \u0026\u0026 ./scraper\n\n# For full deploy pipeline on a schedule (if you want auto-publish)\n# 30 9 * * 1 cd /path/to/henryrollins-scraper \u0026\u0026 ./scripts/deploy.sh\n```\n\n## Data Schema (SQLite)\n\n```sql\nepisodes           — id, broadcast (#NNN), url, date, scraped_at\ntracks             — id, episode_id, hour (1|2), position, artist, title, album,\n                     artist_norm, title_norm\nartists            — artist (normalized name), episode_count, track_count\nalbums             — artist, album (normalized), plays, first_seen, last_seen\nlinks              — id, episode_id, url, label (bandcamp links, etc.)\nartist_enrichment  — artist_name, mbid, country, formed_year, 
genres, tags,\n                     bio_summary, wikipedia_url, lastfm_tags, lastfm_bio,\n                     lastfm_listeners, lastfm_playcount, lastfm_url\nalbum_art          — album_name, artist_name, mbid, artwork_url, release_year\ncorrections        — track_id, episode_id, type, original_data, corrected_data\nrelease_group_cache — rg_mbid, data, fetched_at\n```\n\n## How Parsing Works\n\nThe scraper fetches monthly archive pages (`/on-the-radio-all?month=MM-YYYY`),\nwhich contain full track listings for every episode that month. Each `\u003carticle\u003e`\nelement is parsed for:\n\n1. **Broadcast number** from the `RADIO BROADCAST #NNN` header\n2. **URL** from the episode permalink\n3. **Hour markers** (`Hour 1`, `Hour 2`) to split the track listing\n4. **Numbered tracks** matching the pattern: `NN. Artist - Title / Album`\n5. **Bandcamp links** via URL pattern matching\n\nNon-music content (commentary, recommendations, links) is naturally filtered\nout since it doesn't match the numbered track pattern.\n\n## Local Files\n\n```\nhenryrollins-scraper/\n├── scraper              — Compiled scraper binary\n├── scraper.py           — Python scraper source\n├── build-cache          — Artist cache builder (binary)\n├── build_artist_cache.py\n├── analytics.py         — CLI analytics queries\n├── db/\n│   └── henryrollins.db  — Main SQLite database\n├── web/\n│   ├── app/             — Svelte 5 frontend\n│   └── api/             — Python FastAPI (local dev only)\n├── worker/              — Hono + D1 API Worker\n├── scripts/\n│   ├── deploy.sh        — Full deployment pipeline\n│   ├── generate-d1-seed.py  — DB → D1-safe SQL export\n│   └── re-enrich-all.py     — Full last.fm re-enrichment (manual)\n├── .env                 — ADMIN_API_KEY (local copy)\n└── worker/.dev.vars     — ADMIN_API_KEY for wrangler 
dev\n```\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fmikkelrask%2Fhenryrollins-scraper","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fmikkelrask%2Fhenryrollins-scraper","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fmikkelrask%2Fhenryrollins-scraper/lists"}