{"id":50732886,"url":"https://github.com/md-rejoyan-islam/scrape-server","last_synced_at":"2026-06-10T10:31:58.410Z","repository":{"id":363572194,"uuid":"1174140850","full_name":"md-rejoyan-islam/scrape-server","owner":"md-rejoyan-islam","description":"Scraping Server is a self-hostable web scraping REST API built with Bun, Express, and TypeScript. It uses a real Chrome browser powered by puppeteer-real-browser and stealth techniques to bypass Cloudflare Turnstile and common bot protections. The API extracts structured page data including metadata, product details, links, images, headings.","archived":false,"fork":false,"pushed_at":"2026-06-09T13:48:17.000Z","size":138,"stargazers_count":0,"open_issues_count":0,"forks_count":0,"subscribers_count":0,"default_branch":"main","last_synced_at":"2026-06-09T14:06:12.710Z","etag":null,"topics":["bun","cheerio","docker","express","puppeteer","swagger","typescript","zod"],"latest_commit_sha":null,"homepage":"http://scrape-server.rejoyan.me","language":"TypeScript","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":null,"status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/md-rejoyan-islam.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":null,"code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null,"zenodo":null,"notice":null,"maintainers":null,"copyright":null,"agents":null,"dco":null,"cla":null}},"created_at":"2026-03-06T05:35:20.000Z","updated_at":"2026-06-09T13:53:54.000Z","dependencies_parsed_at":null,"dependency_job_id":null,"html_url":"https://github.com/md-rejoyan-islam/scrape-server","commit_stats":null,"previous_names":["md-rejoyan-islam/scrape-server"],"tags_count":null,"template":false,"template_full_name":null,"purl":"pkg:github/md-rejoyan-islam/scrape-server","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/md-rejoyan-islam%2Fscrape-server","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/md-rejoyan-islam%2Fscrape-server/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/md-rejoyan-islam%2Fscrape-server/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/md-rejoyan-islam%2Fscrape-server/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/md-rejoyan-islam","download_url":"https://codeload.github.com/md-rejoyan-islam/scrape-server/tar.gz/refs/heads/main","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/md-rejoyan-islam%2Fscrape-server/sbom","scorecard":null,"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":286080680,"owners_count":34149132,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2026-05-26T15:22:16.424Z","status":"online","status_checked_at":"2026-06-10T02:00:07.152Z","response_time":89,"last_error":null,"robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":true,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["bun","cheerio","docker","express","puppeteer","swagger","typescript","zod"],"created_at":"2026-06-10T10:31:58.288Z","updated_at":"2026-06-10T10:31:58.400Z","avatar_url":"https://github.com/md-rejoyan-islam.png","language":"TypeScript","funding_links":[],"categories":[],"sub_categories":[],"readme":"\u003cdiv align=\"center\"\u003e\n\n# 🕷️ Scraping Server\n\n**A production-grade web scraping API with anti-bot bypass, structured product extraction, and OpenAPI docs.**\n\n[![Bun](https://img.shields.io/badge/Bun-1.3+-000?logo=bun\u0026logoColor=fbf0df)](https://bun.sh/)\n[![TypeScript](https://img.shields.io/badge/TypeScript-5.9-3178C6?logo=typescript\u0026logoColor=white)](https://www.typescriptlang.org/)\n[![Express](https://img.shields.io/badge/Express-5.x-000?logo=express\u0026logoColor=white)](https://expressjs.com/)\n[![Puppeteer](https://img.shields.io/badge/Puppeteer-24.x-40B5A4?logo=puppeteer\u0026logoColor=white)](https://pptr.dev/)\n[![Docker](https://img.shields.io/badge/Docker-ready-2496ED?logo=docker\u0026logoColor=white)](https://www.docker.com/)\n[![License](https://img.shields.io/badge/license-Proprietary-red.svg)](#-license)\n\n[![Live Demo](https://img.shields.io/badge/▶_Live_Demo-scrape--server.rejoyan.me-5b9dff?logoColor=white)](https://scrape-server.rejoyan.me)\n\n🌐 **Live instance:** [**scrape-server.rejoyan.me**](https://scrape-server.rejoyan.me) · [Swagger docs](https://scrape-server.rejoyan.me/api-docs)\n\n[Quick start](#-quick-start) · [API reference](#-api-reference) · [Swagger UI](#-interactive-api-docs-swagger) · [Configuration](#%EF%B8%8F-configuration) · [Docker notes](#-docker-notes) · [License](#-license)\n\n\u003c/div\u003e\n\n---\n\n## ✨ Highlights\n\n|                                |                                                                                                                                                 |\n| ------------------------------ | ----------------------------------------------------------------------------------------------------------------------------------------------- |\n| 🛡️ **Anti-bot bypass**         | Cloudflare Turnstile \u0026 generic challenges via [`puppeteer-real-browser`](https://www.npmjs.com/package/puppeteer-real-browser) + stealth plugin |\n| 🛒 **Structured product data** | Normalized output from JSON-LD, microdata, OpenGraph — including variants, prices, stock                                                        |\n| 🧩 **Pluggable extractors**    | `links`, `images`, `headings`, `text`, `prices`, `tables` — opt in per request                                                                  |\n| 📜 **Readability \u0026 Markdown**  | Clean article HTML and Markdown output via Mozilla Readability + Turndown                                                                       |\n| 📸 **Screenshots**             | Base64-encoded PNG of the rendered page                                                                                                         |\n| ⚡ **Three execution modes**   | Synchronous, async (job-based), and parallel batch (up to 10 URLs)                                                                              |\n| 📚 **Swagger UI**              | Interactive OpenAPI 3 docs at `/api-docs`                                                                                                       |\n| 🎯 **Field projection**        | `?fields=a,b,c` to trim responses                                                                                                               |\n| 🐳 **Docker-native**           | One-command bring-up with bundled Xvfb for headful Chrome                                                                                       |\n| ✅ **Type-safe inputs**        | Zod-validated request bodies                                                                                                                    |\n\n---\n\n## 🚀 Quick start\n\n### 🐳 Option A — Docker (recommended)\n\n```bash\ndocker compose up --build\n```\n\nThat's it. After ~2 minutes (first build):\n\n|                     |                                  |\n| ------------------- | -------------------------------- |\n| 🖥️ **Web UI**       | http://localhost:8090            |\n| 📖 **Swagger docs** | http://localhost:8090/api-docs   |\n| 📡 **API base**     | http://localhost:8090/api        |\n| ❤️ **Health**       | http://localhost:8090/api/health |\n\nStop the stack:\n\n```bash\ndocker compose down\n```\n\n### 💻 Option B — Local with Bun\n\n**Prerequisites:** [Bun](https://bun.sh/) `\u003e= 1.3` and a modern Chrome/Chromium (Puppeteer downloads one on first install).\n\n```bash\nbun install\nbun run dev       # watch mode — auto-reload on changes\n```\n\nBuild \u0026 run the compiled output:\n\n```bash\nbun run build     # tsc → dist/\nbun run start     # bun dist/src/server.js\n```\n\n---\n\n## 📖 Interactive API docs (Swagger)\n\nThe full OpenAPI 3 specification lives at [`docs/swagger.yaml`](docs/swagger.yaml) and is **rendered as interactive Swagger UI** once the server is running:\n\n\u003e 👉 **http://localhost:8090/api-docs**\n\nFrom the Swagger UI you can:\n\n- 🔍 Browse every endpoint with full request/response schemas\n- 🧪 **Try requests live** with the built-in \"Try it out\" button\n- 📥 Inspect example payloads and response shapes inline\n- 📤 Export/download the raw spec for codegen or client SDK generation\n\nThe static UI playground at **http://localhost:8090/** is a friendlier sandbox for non-engineers.\n\n\u003e 💡 **For client SDKs:** point your favorite OpenAPI generator (e.g. [`openapi-typescript`](https://www.npmjs.com/package/openapi-typescript), [`openapi-generator-cli`](https://openapi-generator.tech/)) at `http://localhost:8090/api-docs/swagger.json` to auto-generate a fully-typed client.\n\n---\n\n## ⚙️ Configuration\n\nAll settings are environment variables. Locally, drop them in a `.env` at the project root (auto-loaded). For Docker, set them under `environment:` in [`docker-compose.yml`](docker-compose.yml).\n\n| Variable             | Default | Description                                                                            |\n| -------------------- | ------- | -------------------------------------------------------------------------------------- |\n| `PORT`               | `8090`  | HTTP port the server listens on.                                                       |\n| `HEADLESS`           | `true`  | Run Chrome headless. _(Informational — `puppeteer-real-browser` is always headful.)_   |\n| `BOT_BYPASS_ENABLED` | `true`  | Reserved flag for future bot-bypass tuning.                                            |\n| `CHROME_PATH`        | auto    | Explicit path to Chrome/Chromium (set inside the container).                           |\n| `PROXY_URL`          | _none_  | Outbound HTTP proxy: `http://[user:pass@]host:port`. Used by Chrome for every request. |\n\n**Example `.env`:**\n\n```env\nPORT=8090\nPROXY_URL=http://user:pass@proxy.example.com:8080\n```\n\n\u003e 🔒 `.env` is **excluded** from the Docker build (see [`.dockerignore`](.dockerignore)). Container env must go in `docker-compose.yml`.\n\n---\n\n## 🔌 API reference\n\nBase URL: `http://localhost:8090/api` · Full schemas: **[Swagger UI](http://localhost:8090/api-docs)**\n\n\u003ctable\u003e\n\u003ctr\u003e\u003cth\u003eMethod\u003c/th\u003e\u003cth\u003ePath\u003c/th\u003e\u003cth\u003ePurpose\u003c/th\u003e\u003c/tr\u003e\n\u003ctr\u003e\u003ctd\u003e\u003ccode\u003ePOST\u003c/code\u003e\u003c/td\u003e\u003ctd\u003e\u003ca href=\"#post-apiscrape--synchronous-scrape\"\u003e\u003ccode\u003e/api/scrape\u003c/code\u003e\u003c/a\u003e\u003c/td\u003e\u003ctd\u003eSynchronous scrape\u003c/td\u003e\u003c/tr\u003e\n\u003ctr\u003e\u003ctd\u003e\u003ccode\u003ePOST\u003c/code\u003e\u003c/td\u003e\u003ctd\u003e\u003ca href=\"#post-apiscrapeasync--fire-and-forget\"\u003e\u003ccode\u003e/api/scrape/async\u003c/code\u003e\u003c/a\u003e\u003c/td\u003e\u003ctd\u003eFire-and-forget — returns a \u003ccode\u003ejobId\u003c/code\u003e\u003c/td\u003e\u003c/tr\u003e\n\u003ctr\u003e\u003ctd\u003e\u003ccode\u003ePOST\u003c/code\u003e\u003c/td\u003e\u003ctd\u003e\u003ca href=\"#post-apiscrapebatch--multi-url-batch\"\u003e\u003ccode\u003e/api/scrape/batch\u003c/code\u003e\u003c/a\u003e\u003c/td\u003e\u003ctd\u003eBatch up to 10 URLs in parallel\u003c/td\u003e\u003c/tr\u003e\n\u003ctr\u003e\u003ctd\u003e\u003ccode\u003eGET\u003c/code\u003e\u003c/td\u003e\u003ctd\u003e\u003ca href=\"#get-apijobsjobid--job-status--result\"\u003e\u003ccode\u003e/api/jobs/:jobId\u003c/code\u003e\u003c/a\u003e\u003c/td\u003e\u003ctd\u003ePoll a job's status \u0026 result\u003c/td\u003e\u003c/tr\u003e\n\u003ctr\u003e\u003ctd\u003e\u003ccode\u003eGET\u003c/code\u003e\u003c/td\u003e\u003ctd\u003e\u003ca href=\"#get-apijobs--list-all-jobs\"\u003e\u003ccode\u003e/api/jobs\u003c/code\u003e\u003c/a\u003e\u003c/td\u003e\u003ctd\u003eList all jobs (metadata only)\u003c/td\u003e\u003c/tr\u003e\n\u003ctr\u003e\u003ctd\u003e\u003ccode\u003eGET\u003c/code\u003e\u003c/td\u003e\u003ctd\u003e\u003ca href=\"#get-apihealth--health-check\"\u003e\u003ccode\u003e/api/health\u003c/code\u003e\u003c/a\u003e\u003c/td\u003e\u003ctd\u003eServer health \u0026 uptime\u003c/td\u003e\u003c/tr\u003e\n\u003c/table\u003e\n\n---\n\n### `POST /api/scrape` — synchronous scrape\n\nScrapes a URL and returns the full result in the same response. Best for ad-hoc requests and testing.\n\n**Request body:**\n\n| Field        | Type                | Required | Default                                                  | Description                                           |\n| ------------ | ------------------- | -------- | -------------------------------------------------------- | ----------------------------------------------------- |\n| `url`        | `string` (URL)      | ✅       | —                                                        | The page to scrape.                                   |\n| `waitFor`    | `number` (0..60000) | ❌       | `3000`                                                   | Extra ms to wait after page load (JS-rendered pages). |\n| `extractors` | `ExtractorName[]`   | ❌       | `[\"links\",\"images\",\"headings\",\"text\",\"prices\",\"tables\"]` | Which extractors to run.                              |\n| `fullHtml`   | `boolean`           | ❌       | `false`                                                  | Include raw post-render HTML under `fullHtml`.        |\n| `screenshot` | `boolean`           | ❌       | `false`                                                  | Include base64 PNG under `screenshotUrl`.             |\n\n**Query:** `?fields=a,b,c` — projection of top-level fields.\n\n**Example:**\n\n```bash\ncurl -X POST http://localhost:8090/api/scrape \\\n  -H \"Content-Type: application/json\" \\\n  -d '{\n    \"url\": \"https://example.com\",\n    \"waitFor\": 2000,\n    \"extractors\": [\"headings\", \"links\"]\n  }'\n```\n\n\u003cdetails\u003e\n\u003csummary\u003e\u003cb\u003eResponse shape\u003c/b\u003e\u003c/summary\u003e\n\n```json\n{\n  \"success\": true,\n  \"data\": {\n    \"url\": \"https://example.com/\",\n    \"crawl\": {\n      \"loadedUrl\": \"https://example.com/\",\n      \"loadedTime\": \"2026-05-11T12:34:56.000Z\",\n      \"referrerUrl\": \"https://example.com\",\n      \"httpStatusCode\": 200,\n      \"depth\": 0,\n      \"contentType\": \"text/html\"\n    },\n    \"metadata\": {\n      \"title\": \"...\",\n      \"description\": \"...\",\n      \"openGraph\": {},\n      \"jsonLd\": []\n    },\n    \"html\": \"\u003creadable article html\u003e\",\n    \"markdown\": \"# Title\\n\\n...\",\n    \"screenshotUrl\": null,\n    \"timeTaken\": \"3.42s\",\n    \"headings\": { \"h1\": [\"...\"], \"h2\": [\"...\"] },\n    \"links\": { \"total\": 12, \"items\": [] },\n    \"product\": { \"productTitle\": \"...\", \"variants\": [], \"priceTry\": null },\n    \"networkSummary\": { \"totalRequests\": 1, \"byType\": { \"document\": 1 } }\n  }\n}\n```\n\n\u003c/details\u003e\n\n---\n\n### `POST /api/scrape/async` — fire-and-forget\n\nSame body as `/api/scrape`. Returns immediately with a `jobId`; poll [`/api/jobs/:jobId`](#get-apijobsjobid--job-status--result) for results.\n\n```json\n{\n  \"success\": true,\n  \"jobId\": \"550e8400-e29b-41d4-a716-446655440000\",\n  \"message\": \"Scraping started. Poll /api/jobs/:jobId for results.\"\n}\n```\n\n---\n\n### `POST /api/scrape/batch` — multi-URL batch\n\nScrape up to **10 URLs in parallel**. Same options as `/api/scrape` but with a `urls` array.\n\n**Request:**\n\n```json\n{\n  \"urls\": [\"https://a.com\", \"https://b.com\"],\n  \"extractors\": [\"headings\"]\n}\n```\n\n**Response (immediate):**\n\n```json\n{\n  \"success\": true,\n  \"batchId\": \"...\",\n  \"jobIds\": [\"...\", \"...\"],\n  \"message\": \"Batch scraping started.\"\n}\n```\n\nPoll each `jobId` independently.\n\n---\n\n### `GET /api/jobs/:jobId` — job status \u0026 result\n\n| Field         | When           | Description                        |\n| ------------- | -------------- | ---------------------------------- |\n| `status`      | always         | `running` · `completed` · `failed` |\n| `data`        | on `completed` | Full `ScrapeResult` (see above).   |\n| `error`       | on `failed`    | Error message string.              |\n| `createdAt`   | always         | ISO timestamp.                     |\n| `completedAt` | when done      | ISO timestamp.                     |\n\nSupports `?fields=` to project the inner `data`.\n\n---\n\n### `GET /api/jobs` — list all jobs\n\nReturns metadata for every job in the in-memory store (without payloads).\n\n\u003e ⚠️ The job store is **in-memory** and resets on container restart. For production, swap [`src/services/job.service.ts`](src/services/job.service.ts) for Redis or Postgres.\n\n---\n\n### `GET /api/health` — health check\n\n```json\n{ \"status\": \"ok\", \"uptime\": 123.45 }\n```\n\n---\n\n## 🧩 Project layout\n\n```\n.\n├── 🐳 Dockerfile                    # Bun + puppeteer base image, Xvfb installed\n├── 🐳 docker-compose.yml            # Single-service compose (port 8090)\n├── 🐳 docker-entrypoint.sh          # Starts Xvfb, then execs CMD\n├── 📦 package.json\n├── 🔒 bun.lock\n├── ⚙️  tsconfig.json\n├── 📁 docs/\n│   └── 📜 swagger.yaml              # OpenAPI 3 spec → served at /api-docs\n├── 📁 public/\n│   └── 🌐 index.html                # Static UI → served at /\n└── 📁 src/\n    ├── server.ts                    # HTTP entrypoint\n    ├── app.ts                       # Express app, middleware, static \u0026 swagger\n    ├── config/index.ts              # Env vars, constants\n    ├── controllers/                 # Request handlers\n    ├── middlewares/validate.ts      # Zod body validator\n    ├── routes/                      # Thin route layer per resource\n    ├── services/\n    │   ├── job.service.ts           # In-memory job store\n    │   └── scraper.service.ts       # Timeout wrapper around scrapePage\n    ├── utils/\n    │   ├── pick-fields.ts           # ?fields= projection helper\n    │   └── scraper.ts               # All puppeteer + extraction logic\n    ├── validators/scrape.validator.ts  # Zod schemas\n    └── types/index.ts               # Domain \u0026 DTO types\n```\n\n---\n\n## 🛠️ Development\n\n### Scripts\n\n| Script          | What it does                                            |\n| --------------- | ------------------------------------------------------- |\n| `bun run dev`   | Run `src/server.ts` with `bun --watch` (no build step). |\n| `bun run build` | Compile TypeScript → `dist/`.                           |\n| `bun run start` | Run the compiled output (`bun dist/src/server.js`).     |\n\n### Type-check without emitting\n\n```bash\nbunx tsc --noEmit\n```\n\n### Adding a new extractor\n\n1. Add the name to `ExtractorName` in [`src/types/index.ts`](src/types/index.ts).\n2. Add it to the enum in [`src/validators/scrape.validator.ts`](src/validators/scrape.validator.ts).\n3. Implement `extractFoo($, baseUrl)` in [`src/utils/scraper.ts`](src/utils/scraper.ts).\n4. Wire it into the `scrapePage` block (`if (extractors.includes(\"foo\"))`).\n5. Add the response field to `ScrapeResult` in `types/index.ts`.\n6. Update [`docs/swagger.yaml`](docs/swagger.yaml) so Swagger UI reflects the new shape.\n\n---\n\n## 🐳 Docker notes\n\n### Why an entrypoint script?\n\n`puppeteer-real-browser` runs Chrome **headful** (`headless: false`) so anti-bot detection works — that requires a real X display. The image ships with `xvfb`, and [`docker-entrypoint.sh`](docker-entrypoint.sh) starts `Xvfb :99` before exec'ing the server, then sets `DISPLAY=:99` so Chrome attaches.\n\n### Tuning shared memory\n\nThe default `shm_size: \"2gb\"` is sized for a single Chrome. If you raise the batch limit or run many scrapes in parallel, bump it to `4gb`+ to prevent Chrome crashes.\n\n### Using a proxy\n\n```yaml\nenvironment:\n  - PROXY_URL=http://user:pass@proxy-server:8080\n```\n\nCredentials are URL-decoded before being passed to Chrome.\n\n---\n\n## ⚠️ Limitations \u0026 production notes\n\n- **In-memory job store** — jobs are lost on restart. Swap [`src/services/job.service.ts`](src/services/job.service.ts) for Redis or Postgres before going live.\n- **Batch cap is 10 URLs** — see `BATCH_MAX_URLS` in [`src/config/index.ts`](src/config/index.ts).\n- **Scrape timeout is 240s** — see `SCRAPE_TIMEOUT_MS` in [`src/config/index.ts`](src/config/index.ts).\n- **No authentication** — anyone with network access can scrape arbitrary URLs. Put it behind a reverse proxy with auth/rate-limits before exposing publicly.\n\n---\n\n## 📄 License\n\n**© 2026 Rejoyan Islam. All Rights Reserved.**\n\nThis project is **proprietary and source-available** — it is **not** open source. The code is published for reference and evaluation only. You may **not** copy, reproduce, modify, distribute, host, deploy, or create derivative works from any part of it without prior **written permission** from the author. See the full [`LICENSE`](LICENSE) for the exact terms.\n\nFor licensing or permission inquiries: **rejoyanislam0014@gmail.com**\n\n\u003cdiv align=\"center\"\u003e\n\nMade with 🕷️ + ☕ by [Rejoyan Islam](mailto:rejoyanislam0014@gmail.com) — built on [Bun](https://bun.sh/), [Express](https://expressjs.com/), and [Puppeteer](https://pptr.dev/).\n\n\u003c/div\u003e\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fmd-rejoyan-islam%2Fscrape-server","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fmd-rejoyan-islam%2Fscrape-server","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fmd-rejoyan-islam%2Fscrape-server/lists"}