https://github.com/mikkelrask/henryrollins-scraper
FANATIC! A dataset of the tracks Henry Rollins has played on his KCRW radio show, with data dating back to 2017: 496 episodes of weird and rare finds, fast-paced punk, and frog sounds. Includes a scraper that keeps the data up to date with henryrollins.com.
- Host: GitHub
- URL: https://github.com/mikkelrask/henryrollins-scraper
- Owner: mikkelrask
- Created: 2026-05-15T12:00:02.000Z (about 2 months ago)
- Default Branch: dev
- Last Pushed: 2026-06-23T13:20:12.000Z (7 days ago)
- Last Synced: 2026-06-23T13:29:05.639Z (7 days ago)
- Topics: archive, data-analysis, data-visualization, music
- Language: HTML
- Homepage: https://fanatic.raske.xyz
- Size: 9.16 MB
- Stars: 0
- Watchers: 0
- Forks: 0
- Open Issues: 0
Metadata Files:
- Readme: README.md
# Fanatic 📻
**Henry Rollins Radio Scraper** — extracts track listings from [Henry Rollins' KCRW radio show](https://www.henryrollins.com/radio), two hours of fanatic-level deep cuts every Friday.
| Component | URL | Stack |
|-----------|-----|-------|
| **Frontend** | https://fanatic.raske.xyz | Svelte 5 + Tailwind + D3 |
| **API** | https://henryrollins-api.terminal-share.workers.dev | Hono + D1 (TypeScript) |
| **Database** | Cloudflare D1 | SQLite (5.4 MB, 11 tables) |
## What it does
- **Scrapes** all monthly archive pages from `henryrollins.com/radio` (back to 2017)
- **Parses** each episode's track listing (Hour 1 / Hour 2, numbered tracks with artist/title/album)
- **Extracts** Bandcamp links separately (Henry shares a lot of them)
- **Stores** everything in SQLite for analytics
- **Resolves** artist names against MusicBrainz via `build_artist_cache.py`
- **Enriches** artists with country, formed year, genres, tags, and bios from MusicBrainz / Last.fm / Wikipedia
- **Tags** your local beets library with the `FANATIC` genre via `add_fanatic_genre.py`
- **Deploys** the full stack to Cloudflare (Pages + D1 + Workers)
## Architecture
```
┌───────────────────────────────┐
│       Cloudflare Pages        │
│  ┌─────────────────────────┐  │
│  │      Svelte 5 App       │  │
│  │     (web/app/dist/)     │  │
│  └────────┬────────────────┘  │
│    /api/* │ Pages Function    │
│  ┌────────▼────────────────┐  │
│  │     Hono API Worker     │  │
│  │ (TypeScript, 38 routes) │  │
│  └────────┬────────────────┘  │
└───────────┼───────────────────┘
            │ env.DB
 ┌──────────▼──────────┐
 │  D1: henryrollins   │
 │  11 tables, 5.4 MB  │
 │  ~34K rows total    │
 └─────────────────────┘
Scraping / enrichment happen locally (Python). Only the data is pushed to
the cloud — zero external API calls at request time.
```
## Quick Start (Local Dev)
### Setup
```bash
# Python environment
uv venv .venv
source .venv/bin/activate
uv pip install -r requirements.txt
# Frontend
cd web/app && npm install
```
### Run locally
```bash
# Terminal 1: Python API backend
cd web/api && uvicorn main:app --reload --port 8000
# Terminal 2: Svelte dev server (proxies /api → localhost:8000)
cd web/app && npm run dev
```
The Svelte dev proxy routes `/api/*` to the Python backend, so the frontend
works identically to production — just open `localhost:5173`.
### Scraping
```bash
# Full historical scrape (back to 2017)
./scraper
# Just the most recent 3 months
./scraper --limit-months 3
# Re-export JSON from existing database without re-scraping
./scraper --export-only
```
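As a rough illustration of what a month-limited scrape has to enumerate, the sketch below builds monthly archive URLs, newest first. The base URL, the January 2017 start month, and the `archive_urls` helper are assumptions made for this example; only the `/on-the-radio-all?month=MM-YYYY` path pattern and the "back to 2017" range come from this README, and `scraper.py` may work differently.

```python
from datetime import date

# Assumed values: the README gives only the path pattern and "back to 2017".
BASE_URL = "https://www.henryrollins.com/on-the-radio-all"
FIRST_MONTH = (2017, 1)


def archive_urls(today, limit_months=None):
    """Return monthly archive URLs, newest first (hypothetical helper)."""
    urls = []
    year, month = today.year, today.month
    while (year, month) >= FIRST_MONTH:
        # Stop early when a month limit is given (mirrors --limit-months).
        if limit_months is not None and len(urls) >= limit_months:
            break
        urls.append(f"{BASE_URL}?month={month:02d}-{year}")
        # Step back one calendar month.
        month -= 1
        if month == 0:
            year, month = year - 1, 12
    return urls
```

Calling `archive_urls(date.today(), limit_months=3)` would list the current month and the two before it.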
### Analytics
```bash
# Show summary stats
./analytics
# Top 20 most-played artists
./analytics top-artists
# Search for a specific artist
./analytics artist "Wire"
# All bandcamp links
./analytics bandcamp-links
```
## Deploying to Cloudflare
The full stack (Worker + Pages + D1) runs on Cloudflare. Deploying pushes
locally scraped + enriched data up without making any API calls from the
Worker itself.
### Prerequisites
```bash
# Authenticate wrangler (one-time)
cd worker && npx wrangler login
# Create the D1 database (one-time — if not already done)
npx wrangler d1 create henryrollins
# Then paste the database_id into worker/wrangler.toml
# Set the admin API key (one-time)
echo "your-secret-key" | npx wrangler secret put ADMIN_API_KEY
```
### Full deploy
```bash
# 1. Scrape new episodes (enrichment happens inline)
./scraper
# 2. Deploy to Cloudflare (no separate enrichment step)
./scripts/deploy.sh
```
Or step by step:
```bash
# 1. Scrape new episodes (MB + Last.fm enrichment done inline)
./scraper
# 2. Export to D1 SQL seed + push to Cloudflare
python3 scripts/generate-d1-seed.py seed.sql
cd worker
npx wrangler d1 execute henryrollins --remote --file=../seed.sql
# 3. Deploy API Worker
npx wrangler deploy
# 4. Deploy frontend to Pages
cd ../web/app
npx wrangler pages deploy dist --project-name=fanatic --branch=main
```
> Full Last.fm re-enrichment (rarely needed): `python3 scripts/re-enrich-all.py`
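For readers curious what a "D1-safe" export might involve, here is a minimal sketch that dumps a SQLite database to SQL while dropping explicit transaction and `PRAGMA` statements. It assumes those statements are the main thing a D1 import needs removed. The `d1_safe_dump` helper is hypothetical, and the real `scripts/generate-d1-seed.py` may do more, or something different.

```python
import sqlite3

# Statement prefixes to drop. Assumption: the D1 import should not contain
# explicit transaction control or PRAGMA statements.
SKIP_PREFIXES = ("BEGIN TRANSACTION", "COMMIT", "PRAGMA")


def d1_safe_dump(conn):
    """Return the database as SQL text without the skipped statements."""
    statements = []
    # iterdump() yields one complete SQL statement per item.
    for statement in conn.iterdump():
        if statement.upper().startswith(SKIP_PREFIXES):
            continue
        statements.append(statement)
    return "\n".join(statements) + "\n"
```

Writing the result to `seed.sql` with `d1_safe_dump(sqlite3.connect("db/henryrollins.db"))` would give a file in the shape the `wrangler d1 execute --file` step expects, under the assumptions above.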
### Quick re-deploy (no new data)
```bash
# UI only: build frontend and deploy Pages (skip Worker + D1)
npm run deploy:ui
# API Worker only
cd worker && npx wrangler deploy
```
## Admin Panel
The admin panel is at https://fanatic.raske.xyz/#/admin. It's gated by an
API key sent as the `x-admin-key` header.
The key is set via `wrangler secret put ADMIN_API_KEY` and is also stored
locally in `.env` and `worker/.dev.vars`.
To use the admin API directly:
```bash
curl -H 'x-admin-key: your-secret-key' \
https://fanatic.raske.xyz/api/admin/entities
```
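The same call can be made from Python using only the standard library. This is a sketch: the `build_admin_request` and `fetch_entities` helpers are made up for illustration, and it assumes the endpoint returns JSON, which this README does not state.

```python
import json
import os
import urllib.request

ADMIN_URL = "https://fanatic.raske.xyz/api/admin/entities"


def build_admin_request(api_key, url=ADMIN_URL):
    """Build a GET request carrying the admin key in the x-admin-key header."""
    return urllib.request.Request(url, headers={"x-admin-key": api_key})


def fetch_entities(api_key):
    """Call the admin endpoint and decode the body (assumed to be JSON)."""
    request = build_admin_request(api_key)
    with urllib.request.urlopen(request, timeout=30) as response:
        return json.load(response)


if __name__ == "__main__":
    # Read the key from the environment rather than hard-coding it.
    print(fetch_entities(os.environ["ADMIN_API_KEY"]))
```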
## Cron Automation
Add this to your crontab to check for new episodes weekly:
```cron
# Scrape new Henry Rollins episodes every Monday at 9am
0 9 * * 1 cd /path/to/henryrollins-scraper && ./scraper
# For full deploy pipeline on a schedule (if you want auto-publish)
# 30 9 * * 1 cd /path/to/henryrollins-scraper && ./scripts/deploy.sh
```
## Data Schema (SQLite)
```
episodes — id, broadcast (#NNN), url, date, scraped_at
tracks — id, episode_id, hour (1|2), position, artist, title, album,
artist_norm, title_norm
artists — artist (normalized name), episode_count, track_count
albums — artist, album (normalized), plays, first_seen, last_seen
links — id, episode_id, url, label (bandcamp links, etc.)
artist_enrichment — artist_name, mbid, country, formed_year, genres, tags,
bio_summary, wikipedia_url, lastfm_tags, lastfm_bio,
lastfm_listeners, lastfm_playcount, lastfm_url
album_art — album_name, artist_name, mbid, artwork_url, release_year
corrections — track_id, episode_id, type, original_data, corrected_data
release_group_cache — rg_mbid, data, fetched_at
```
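To show how this schema supports analytics, the sketch below runs a "most-played artists" query against the `tracks` table. The column types, the `demo_connection` and `top_artists` helpers, and the query itself are assumptions for illustration; the actual `analytics` command may compute its rankings differently (for example from the `artists` table).

```python
import sqlite3


def demo_connection():
    """Create an in-memory database with a subset of the tracks columns."""
    conn = sqlite3.connect(":memory:")
    # Column types are assumed; the README lists column names only.
    conn.execute(
        "CREATE TABLE tracks ("
        "id INTEGER PRIMARY KEY, episode_id INTEGER, hour INTEGER, "
        "position INTEGER, artist TEXT, title TEXT, album TEXT, "
        "artist_norm TEXT, title_norm TEXT)"
    )
    return conn


def top_artists(conn, limit=20):
    """Return (artist_norm, track_count, episode_count), most-played first."""
    return conn.execute(
        """
        SELECT artist_norm,
               COUNT(*) AS track_count,
               COUNT(DISTINCT episode_id) AS episode_count
        FROM tracks
        GROUP BY artist_norm
        ORDER BY track_count DESC, artist_norm ASC
        LIMIT ?
        """,
        (limit,),
    ).fetchall()
```

Pointing `sqlite3.connect` at `db/henryrollins.db` instead of the in-memory demo would run the same query on the scraped data, assuming the column names match the listing above.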
## How Parsing Works
The scraper fetches monthly archive pages (`/on-the-radio-all?month=MM-YYYY`),
which contain full track listings for every episode that month. Each episode's
element on the page is parsed for:
1. **Broadcast number** from the `RADIO BROADCAST #NNN` header
2. **URL** from the episode permalink
3. **Hour markers** (`Hour 1`, `Hour 2`) to split the track listing
4. **Numbered tracks** matching the pattern: `NN. Artist - Title / Album`
5. **Bandcamp links** via URL pattern matching
Non-music content (commentary, recommendations, links) is naturally filtered
out since it doesn't match the numbered track pattern.
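The steps above can be sketched with a few regular expressions applied to an episode's plain text. The patterns and the `parse_episode` helper are inferred from the description in this section, not taken from `scraper.py`, so treat them as an approximation. Permalink extraction is left out because it depends on the page markup.

```python
import re

# Patterns inferred from the README's description (assumptions).
BROADCAST_RE = re.compile(r"RADIO BROADCAST #(\d+)", re.IGNORECASE)
HOUR_RE = re.compile(r"^\s*Hour\s+([12])\b", re.IGNORECASE)
# "NN. Artist - Title / Album", where the " / Album" part is optional.
TRACK_RE = re.compile(r"^\s*(\d+)\.\s+(.+?)\s+-\s+(.+?)(?:\s+/\s+(.+?))?\s*$")
BANDCAMP_RE = re.compile(r"https?://[\w.-]*bandcamp\.com\S*")


def parse_episode(text):
    """Parse one episode's text into broadcast number, tracks, and links."""
    broadcast_match = BROADCAST_RE.search(text)
    episode = {
        "broadcast": int(broadcast_match.group(1)) if broadcast_match else None,
        "tracks": [],
        "bandcamp_links": [],
    }
    hour = None
    for line in text.splitlines():
        # Bandcamp links are collected separately from the track listing.
        episode["bandcamp_links"].extend(BANDCAMP_RE.findall(line))
        hour_match = HOUR_RE.match(line)
        if hour_match:
            hour = int(hour_match.group(1))
            continue
        track_match = TRACK_RE.match(line)
        # Lines that are not numbered tracks (commentary etc.) are skipped.
        if track_match and hour is not None:
            position, artist, title, album = track_match.groups()
            episode["tracks"].append(
                {
                    "hour": hour,
                    "position": int(position),
                    "artist": artist,
                    "title": title,
                    "album": album,
                }
            )
    return episode
```

One known limitation of this sketch: an artist name that itself contains ` - `, or a title containing ` / `, would be split at the wrong place, so a real parser likely needs extra handling or manual corrections for such cases.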
## Local Files
```
henryrollins-scraper/
├── scraper — Compiled scraper binary
├── scraper.py — Python scraper source
├── build-cache — Artist cache builder (binary)
├── build_artist_cache.py
├── analytics.py — CLI analytics queries
├── db/
│ └── henryrollins.db — Main SQLite database
├── web/
│ ├── app/ — Svelte 5 frontend
│ └── api/ — Python FastAPI (local dev only)
├── worker/ — Hono + D1 API Worker
├── scripts/
│ ├── deploy.sh — Full deployment pipeline
│ ├── generate-d1-seed.py — DB → D1-safe SQL export
│ └── re-enrich-all.py — Full last.fm re-enrichment (manual)
├── .env — ADMIN_API_KEY (local copy)
└── worker/.dev.vars — ADMIN_API_KEY for wrangler dev
```