{"id":36631872,"url":"https://github.com/fish-not-phish/pixurebyte","last_synced_at":"2026-01-12T09:38:49.797Z","repository":{"id":321097654,"uuid":"1070449511","full_name":"fish-not-phish/pixurebyte","owner":"fish-not-phish","description":"Pixurebyte is an open source, self-hostable platform for capturing and analyzing websites with ease. It takes screenshots, full HTML source, request/response data, and metadata — all within minutes — without being blocked by Cloudflare challenges or headless browser detection.","archived":false,"fork":false,"pushed_at":"2025-12-24T14:03:49.000Z","size":2671,"stargazers_count":1,"open_issues_count":2,"forks_count":0,"subscribers_count":0,"default_branch":"main","last_synced_at":"2025-12-25T09:03:33.300Z","etag":null,"topics":["automation","aws","cloudflare-bypass","django","docker","nextjs","postgresql","redis","scraping","self-hosted","threat-detection","threat-hunting","threat-intelligence","urlscan","urlscan-io"],"latest_commit_sha":null,"homepage":"","language":"TypeScript","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/fish-not-phish.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null,"zenodo":null,"notice":null,"maintainers":null,"copyright":null,"agents":null,"dco":null,"cla":null}},"created_at":"2025-10-05T23:57:30.000Z","updated_at":"2025-12-24T14:03:52.000Z","dependencies_parsed_at":"2025-10-28T03:00:16.655Z","dependency_job_id":null,"html_url":"https://github.com/fish-not-phish/pixurebyte","commit_stats":null,"previous_names":["fish-not-phish/pixurebyte"],"tags_count":0,"template":false,"template_full_name":null,"purl":"pkg:github/fish-not-phish/pixurebyte","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/fish-not-phish%2Fpixurebyte","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/fish-not-phish%2Fpixurebyte/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/fish-not-phish%2Fpixurebyte/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/fish-not-phish%2Fpixurebyte/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/fish-not-phish","download_url":"https://codeload.github.com/fish-not-phish/pixurebyte/tar.gz/refs/heads/main","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/fish-not-phish%2Fpixurebyte/sbom","scorecard":null,"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":286080680,"owners_count":28337737,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2026-01-12T06:09:07.588Z","status":"ssl_error","status_checked_at":"2026-01-12T06:05:18.301Z","response_time":98,"last_error":"SSL_connect returned=1 errno=0 peeraddr=140.82.121.5:443 state=error: unexpected eof while reading","robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":false,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["automation","aws","cloudflare-bypass","django","docker","nextjs","postgresql","redis","scraping","self-hosted","threat-detection","threat-hunting","threat-intelligence","urlscan","urlscan-io"],"created_at":"2026-01-12T09:38:49.200Z","updated_at":"2026-01-12T09:38:49.784Z","avatar_url":"https://github.com/fish-not-phish.png","language":"TypeScript","funding_links":[],"categories":[],"sub_categories":[],"readme":"# Pixurebyte\n\u003cp align=\"center\"\u003e\n  \u003cimg src=\"https://raw.githubusercontent.com/fish-not-phish/pixurebyte/refs/heads/main/public/pixurebyte-full-light.png\" alt=\"Pixurebyte Logo\" width=\"400\"/\u003e\n\u003c/p\u003e\n\n**Pixurebyte** is an open-source, self-hostable website capture and analysis platform — inspired by URLScan but built for speed, control, and privacy.\n\nUnlike traditional web scanners that struggle with Cloudflare or bot-protection challenges, Pixurebyte allows you to **bypass challenge pages entirely** by hosting your own compute infrastructure.  \nYou get **screenshots, metadata, and raw HTML** of any site in minutes — all while maintaining full control of your data.\n\n[![Stars](https://img.shields.io/github/stars/fish-not-phish/pixurebyte?style=social)](https://github.com/fish-not-phish/pixurebyte/stargazers)\n[![Forks](https://img.shields.io/github/forks/fish-not-phish/pixurebyte?style=social)](https://github.com/fish-not-phish/pixurebyte/network/members)\n\n[![License](https://img.shields.io/badge/license-MIT-green.svg)](LICENSE)\n![Status](https://img.shields.io/badge/status-Alpha-red)\n\n---\n\n## Key Features\n\n- 🧠 **Self-Hostable Architecture**  \n  Run your web app, database, and Redis locally while leveraging AWS for scalable compute and object storage.\n\n- ⚡ **Bypass Cloudflare \u0026 Bot Challenges**  \n  PixureByte leverages multiple libraries, including [Scrapling](https://github.com/D4Vinci/Scrapling) to assist in bypassing Cloudflare related bot challenges.\n\n- 🖼️ **Full Page Screenshots**  \n  Automatically capture high-resolution screenshots of pages.\n\n- 🌐 **Rich Site Metadata Collection**  \n  Collect HTML, response/request data, headers, and more.\n\n- 🧩 **Modular \u0026 Extensible**  \n  Designed for easy integration with your research workflows. New data collectors will be added soon.\n\n- 🔒 **Privacy-Conscious Design**  \n  Everything you scan and store stays within your control — nothing is sent to external services.\n\n---\n\n## AWS Compute Model\n\nPixurebyte uses **AWS ECS Fargate** to launch short-lived scan containers on demand.\n\n- Primary **capacity provider:** `FARGATE_SPOT`\n- Fallback: **Standard FARGATE** (if Spot is unavailable)\n- Task size: **0.5 vCPU / 2 GB RAM**\n- Typical runtime: **~1–3 minutes per scan**\n\nThis configuration balances **performance, reliability, and cost efficiency** — giving you full browser capabilities at minimal expense.\n\n---\n\n## Cost Breakdown — How Inexpensive It Really Is\n\nPixurebyte was designed with a **minimal AWS footprint** in mind.  \nAll heavy-lifting services (database, web, Redis, API) run **locally** using Docker Compose — meaning you only pay for **ephemeral AWS tasks** and **S3 storage**.\n\nBelow is an approximate cost estimate for a modest personal/research deployment:\n\n| Resource | Description | Est. Monthly Cost (USD) |\n|-----------|--------------|--------------------------|\n| **ECS Fargate Spot Tasks** | 0.5 vCPU / 2 GB RAM per scan, ~2 min average. Spot pricing ≈ $0.0008/min. | **$0.02 – $0.05 per scan** |\n| **Fallback Fargate (on-demand)** | Used only if Spot capacity unavailable (~2× cost). | **$0.04 – $0.10 per scan** |\n| **S3 Storage** | Screenshots, HTML, and JSON metadata (≈ 10–20 MB per scan). | **$0.02 – $0.10 / month** for hundreds of scans |\n| **CloudFront (optional)** | CDN delivery for public image access. | **Free (under free tier)** or ~$0.01/GB |\n| **AWS Data Transfer** | Negligible due to CDN usage. | **\u003c$0.10 / month** |\n\n\u003e 💡 **Total cost for light usage:** under **$1 per month** for ~100 scans.  \n\u003e Even at moderate scale (hundreds per week), expect costs under **$5–10/month**.  \n\u003e The bulk of your infrastructure — API, DB, Redis, and frontend — runs free on your own hardware.\n\n---\n\n## System Overview\n\nPixurebyte is composed of two parts:\n\n| Component | Description |\n|------------|-------------|\n| **Local Stack** | Django backend, Redis, PostgreSQL, and frontend (NextJS) served locally via Docker Compose |\n| **AWS Infrastructure** | S3 for media storage + ECS Fargate for ephemeral scan workers |\n\nThis hybrid model allows **fast local management** with **elastic remote compute**.\n\n---\n\n## Prerequisites\n\nBefore installing, make sure you have:\n\n- [Terraform](https://developer.hashicorp.com/terraform/downloads) ≥ **1.6.0**\n- [AWS CLI](https://aws.amazon.com/cli/) (configured with credentials)\n- [Docker](https://docs.docker.com/get-docker/)\n- [OpenSSL](https://www.openssl.org/) (used for generating Django secret keys)\n\n---\n\n## Installation\n\n## Installing Terraform (Linux)\n\nTo run Pixurebytes infrastructure components, you’ll need [Terraform](https://developer.hashicorp.com/terraform/tutorials/aws-get-started/install-cli). Here's how to install it on a Debian-based Linux system (e.g. Ubuntu):\n\n**1. Update and install prerequisites**\n```bash\nsudo apt-get update -y \u0026\u0026 sudo apt-get install -y gnupg software-properties-common\n```\n**2. Install the HashiCorp GPG Key**\n```bash\nwget -O- https://apt.releases.hashicorp.com/gpg | \\\ngpg --dearmor | \\\nsudo tee /usr/share/keyrings/hashicorp-archive-keyring.gpg \u003e /dev/null\n```\n**3. Add the official HashiCorp repository to your linux system.**\n```bash\necho \"deb [arch=$(dpkg --print-architecture) signed-by=/usr/share/keyrings/hashicorp-archive-keyring.gpg] \\\nhttps://apt.releases.hashicorp.com $(grep -oP '(?\u003c=UBUNTU_CODENAME=).*' /etc/os-release || lsb_release -cs) main\" | \\\nsudo tee /etc/apt/sources.list.d/hashicorp.list\n```\n**4. Download the package information**\n```bash\nsudo apt update -y\n```\n**5. Install Terraform**\n```bash\nsudo apt-get install -y terraform\n```\n## AWS Credentials Setup (Root User)\n\nTo allow Terraform to authenticate with AWS, you need to provide your **Access Key ID** and **Secret Access Key**. Here's how to obtain them from your AWS Root Account (IAM user is also sufficient):\n\n---\n\n### 1. Sign in to AWS\n\nGo to [https://aws.amazon.com/console/](https://aws.amazon.com/console/) and log in as the **root user** (email + password). Feel free to use an IAM user instead as long as the permissions are correct.\n\n---\n\n### 2. Create Access Keys (for root)\n\n1. Navigate to **My Security Credentials** (top-right dropdown → _“My Security Credentials”_).  \n2. Scroll down to the **Access keys** section.  \n3. Click **Create access key**.  \n4. **Download** or **copy** the credentials safely:\n   - `AWS_ACCESS_KEY_ID`\n   - `AWS_SECRET_ACCESS_KEY`\n\n\u003e ⚠️ You will only see the secret key **once**. Store it securely.\n\n---\n\n### 3. Configure the environment for Terraform\n\nYou can pass the credentials via environment variables:\n\n```bash\nexport AWS_ACCESS_KEY_ID=\"your-access-key-id\"\nexport AWS_SECRET_ACCESS_KEY=\"your-secret-access-key\"\nexport AWS_DEFAULT_REGION=\"us-east-2\"\n```\n\n### 4. Clone the Repository\n```bash\ngit clone https://github.com/fish-not-phish/pixurebyte.git\ncd pixurebyte\n```\n\n### 5. Deploy AWS Infrastructure\n```\ncd terraform\n./setup.sh\n```\n\nYou’ll be prompted for your custom domain and other configuration options.\nThe script automatically provisions:\n\n- S3 bucket for screenshots and raw HTML\n- ECS task definition for scan workers\n- CloudFront CDN (so your S3 bucket remains non-public)\n\nOnce finished, your AWS resources are ready for use.\n\n### 6. Launch the Local Environment\n```\ncd ..\ndocker compose up -d\n```\n\nThis brings up:\n\n- Django backend (localhost:8000)\n- Next.js frontend (localhost:3000)\n- Redis + Postgres containers\n\n## Teardown\n\n### Stop Local Services\n```\ndocker compose down\n```\n\nTo remove all stored data:\n```\ndocker compose down -v\n```\n\n### Destroy AWS Infrastructure\n```\ncd terraform\n./destroy.sh\n```\n\nThis cleanly tears down all provisioned AWS resources (ECS tasks, S3 bucket, etc).\n\n## Disclaimer\n\u003e [!CAUTION]\n\u003e This library is provided for educational and research purposes only. By using this library, you agree to comply with local and international data scraping and privacy laws. The authors and contributors are not responsible for any misuse of this software. Always respect the terms of service of websites and robots.txt files.","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Ffish-not-phish%2Fpixurebyte","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Ffish-not-phish%2Fpixurebyte","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Ffish-not-phish%2Fpixurebyte/lists"}