{"id":28181381,"url":"https://github.com/ff6347/debateclub-firecrawled","last_synced_at":"2025-07-08T18:06:03.279Z","repository":{"id":288252549,"uuid":"966913102","full_name":"ff6347/debateclub-firecrawled","owner":"ff6347","description":null,"archived":false,"fork":false,"pushed_at":"2025-05-08T09:50:48.000Z","size":308,"stargazers_count":1,"open_issues_count":2,"forks_count":0,"subscribers_count":1,"default_branch":"main","last_synced_at":"2025-07-04T03:55:39.313Z","etag":null,"topics":[],"latest_commit_sha":null,"homepage":null,"language":"TypeScript","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":null,"status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/ff6347.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":null,"code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null,"zenodo":null}},"created_at":"2025-04-15T16:33:11.000Z","updated_at":"2025-05-08T09:50:53.000Z","dependencies_parsed_at":"2025-04-16T14:15:19.431Z","dependency_job_id":null,"html_url":"https://github.com/ff6347/debateclub-firecrawled","commit_stats":null,"previous_names":["ff6347/debateclub-firecrawled"],"tags_count":0,"template":false,"template_full_name":null,"purl":"pkg:github/ff6347/debateclub-firecrawled","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/ff6347%2Fdebateclub-firecrawled","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/ff6347%2Fdebateclub-firecrawled/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/ff6347%2Fdebateclub-firecrawled/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/ff6347%2Fdebateclub-firecrawled/manifests","owner_url":"https://repos.ecosyste.ms/api/v1
/hosts/GitHub/owners/ff6347","download_url":"https://codeload.github.com/ff6347/debateclub-firecrawled/tar.gz/refs/heads/main","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/ff6347%2Fdebateclub-firecrawled/sbom","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":264320949,"owners_count":23590561,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":[],"created_at":"2025-05-16T03:12:46.470Z","updated_at":"2025-07-08T18:06:03.259Z","avatar_url":"https://github.com/ff6347.png","language":"TypeScript","funding_links":[],"categories":[],"sub_categories":[],"readme":"# Debateclub Markdown Link Scraper 🧭🔗🧠\n\nTired of markdown files turning into link graveyards? This tool breathes life back into them! It automatically extracts links, crawls the content, grabs useful metadata, gets an AI-powered summary, and tags everything, storing it all neatly in a Supabase database. ✨\n\nThink of it as your personal web librarian and summarizer for all those interesting links scattered across your notes.\n\n(Actually, this scratches my own itch for work at HBK-BS to provide the students with some links for our upcoming AI debateclub event. 
See the frontend code over here: https://github.com/ff6347/debateclub-fe and the deployed site there 👉🏾 https://debateclub.inpyjamas.dev)\n\n## Features 🚀\n\n* **Link Extraction:** Scans text files in a specified directory for `http/https` links.\n* **Metadata Scraping:** Uses [Firecrawl](https://firecrawl.dev/) to fetch not just the markdown content of linked pages, but also metadata like title, description, keywords, and Open Graph image URLs.\n* **AI Summarization:** Leverages the OpenAI API to generate concise summaries of the crawled content.\n* **Intelligent Tagging:** Queries OpenAI for relevant tags, preferring existing tags from your database and adding only a limited number of new ones when necessary.\n* **Database Storage:** Persists link data, scraped content, metadata, summaries, and tags in a Supabase (PostgreSQL) database using a structured schema (`links`, `tags`, `link_tags` tables).\n* **Concurrency Control:** Uses `p-limit` to manage concurrent scraping and summarization tasks gracefully.\n* **CLI Interface:** Command-line operation with flags to control different stages of the process.\n\n## Tech Stack 🔧\n\n* Node.js / TypeScript\n* Supabase (PostgreSQL) for the database\n* Firecrawl.dev for scraping\n* OpenAI API for summaries \u0026 tagging\n* `direnv` for environment variable management\n\n## Prerequisites 📋\n\n* Node.js (Check `.node-version` for the recommended version, \u003e=23 suggested)\n* `npm` (comes with Node.js)\n* `direnv` ([Installation Guide](https://direnv.net/))\n* Access to a Supabase project (local or cloud)\n* **Note:** The default configuration (`.envrc.example` \u0026 CLI defaults) assumes a **local** Supabase instance is running (via Supabase CLI).\n* An OpenAI API Key\n* (Optional) A Firecrawl API key and/or instance URL, if you are not using the free tier or default endpoint, or are running your own instance.\n* **Note:** The default configuration assumes a **local** Firecrawl instance is running at 
`http://localhost:3002` if `FIRECRAWL_API_URL` is not set.\n\n## Installation \u0026 Setup ⚙️\n\n1.  **Clone the repository:**\n  ```bash\n  git clone https://github.com/ff6347/debateclub-firecrawled.git\n  cd debateclub-firecrawled\n  ```\n2.  **Install dependencies:**\n  ```bash\n  npm install\n  ```\n3.  **Set up Environment Variables:**\n  * Copy the example file: `cp .envrc.example .envrc`\n  * Edit `.envrc` and fill in your actual `SUPABASE_URL`, `SUPABASE_ANON_KEY`, `SUPABASE_SERVICE_ROLE_KEY`, and `OPENAI_API_KEY`. Add `FIRECRAWL_API_KEY` or `FIRECRAWL_API_URL` if needed.\n  * Run your local Firecrawl instance: `docker compose up -d`. See [Firecrawl Docs Self-Host](https://github.com/mendableai/firecrawl/blob/main/SELF_HOST.md) for more details.\n  * Run your local Supabase instance: `supabase start`.\n  * The default values in `.envrc.example` assume local instances of Supabase and Firecrawl.\n  * Enable `direnv` for this directory: `direnv allow`\n\n## Usage 💡\n\nThe main script is run via `node src/cli.ts`.\n\n```bash\n# Run all steps (extraction, crawl, summary)\nnode src/cli.ts\n\n# Specify a different source directory for markdown files\nnode src/cli.ts --source-dir ./path/to/your/markdown\n\n# Skip specific steps\nnode src/cli.ts --skip-extraction\nnode src/cli.ts --skip-crawl\nnode src/cli.ts --skip-summary\n\n# Get help on options\nnode src/cli.ts --help\n```\n\nPlace your source markdown files in the directory specified by `--source-dir` (defaults to `./source-files`).\n\n## Environment Variables 🔑\n\nConfigure these in your `.envrc` file (loaded automatically by `direnv`):\n\n* `FIRECRAWL_API_KEY` (Optional): Your Firecrawl API key if needed.\n* `FIRECRAWL_API_URL` (Optional): Custom Firecrawl API endpoint.\n* `SUPABASE_URL` (**Required**): URL for your Supabase project API.\n* `SUPABASE_ANON_KEY` (**Required**): The public anonymous key for your Supabase project.\n* `OPENAI_API_KEY` (**Required**): Your OpenAI API key for summarization/tagging.\n\n## 
Development 🛠️\n\n* **Watch Mode:** Run the script and automatically restart on file changes:\n  ```bash\n  npm run dev\n  ```\n* **Generate Supabase Types:** After any database schema changes, update the TypeScript types:\n  ```bash\n  supabase gen types --local \u003e src/database.ts\n  # or --project-id \u003cyour-project-ref\u003e if using cloud\n  ```\n* **Run Tests:** (If/when tests are added)\n  ```bash\n  npm test\n  ```\n\n---\n\nHappy Link Exploring! 🎉 ","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fff6347%2Fdebateclub-firecrawled","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fff6347%2Fdebateclub-firecrawled","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fff6347%2Fdebateclub-firecrawled/lists"}