https://github.com/md-rejoyan-islam/scrape-server
Scraping Server is a self-hostable web scraping REST API built with Bun, Express, and TypeScript. It uses a real Chrome browser powered by puppeteer-real-browser and stealth techniques to bypass Cloudflare Turnstile and common bot protections. The API extracts structured page data including metadata, product details, links, images, and headings.
- Host: GitHub
- URL: https://github.com/md-rejoyan-islam/scrape-server
- Owner: md-rejoyan-islam
- Created: 2026-03-06T05:35:20.000Z (4 months ago)
- Default Branch: main
- Last Pushed: 2026-06-09T13:48:17.000Z (12 days ago)
- Last Synced: 2026-06-09T14:06:12.710Z (12 days ago)
- Topics: bun, cheerio, docker, express, puppeteer, swagger, typescript, zod
- Language: TypeScript
- Homepage: http://scrape-server.rejoyan.me
- Size: 135 KB
- Stars: 0
- Watchers: 0
- Forks: 0
- Open Issues: 0
Metadata Files:
- Readme: README.md
README
# 🕷️ Scraping Server
**A production-grade web scraping API with anti-bot bypass, structured product extraction, and OpenAPI docs.**
🌐 **Live instance:** [**scrape-server.rejoyan.me**](https://scrape-server.rejoyan.me) · [Swagger docs](https://scrape-server.rejoyan.me/api-docs)
[Quick start](#-quick-start) · [API reference](#-api-reference) · [Swagger UI](#-interactive-api-docs-swagger) · [Configuration](#%EF%B8%8F-configuration) · [Docker notes](#-docker-notes) · [License](#-license)
---
## ✨ Highlights
| | |
| ------------------------------ | ----------------------------------------------------------------------------------------------------------------------------------------------- |
| 🛡️ **Anti-bot bypass** | Cloudflare Turnstile & generic challenges via [`puppeteer-real-browser`](https://www.npmjs.com/package/puppeteer-real-browser) + stealth plugin |
| 🔍 **Structured product data** | Normalized output from JSON-LD, microdata, and OpenGraph, including variants, prices, and stock |
| 🧩 **Pluggable extractors** | `links`, `images`, `headings`, `text`, `prices`, `tables`; opt in per request |
| 📖 **Readability & Markdown** | Clean article HTML and Markdown output via Mozilla Readability + Turndown |
| 📸 **Screenshots** | Base64-encoded PNG of the rendered page |
| ⚡ **Three execution modes** | Synchronous, async (job-based), and parallel batch (up to 10 URLs) |
| 📚 **Swagger UI** | Interactive OpenAPI 3 docs at `/api-docs` |
| 🎯 **Field projection** | `?fields=a,b,c` to trim responses |
| 🐳 **Docker-native** | One-command bring-up with bundled Xvfb for headful Chrome |
| ✅ **Type-safe inputs** | Zod-validated request bodies |
---
## 🚀 Quick start
### 🐳 Option A: Docker (recommended)
```bash
docker compose up --build
```
That's it. After the first build (about 2 minutes), the stack exposes:
| | |
| ------------------- | -------------------------------- |
| 🖥️ **Web UI** | http://localhost:8090 |
| 📖 **Swagger docs** | http://localhost:8090/api-docs |
| 📡 **API base** | http://localhost:8090/api |
| ❤️ **Health** | http://localhost:8090/api/health |
Stop the stack:
```bash
docker compose down
```
### 💻 Option B: Local with Bun
**Prerequisites:** [Bun](https://bun.sh/) `>= 1.3` and a modern Chrome/Chromium (Puppeteer downloads one on first install).
```bash
bun install
bun run dev # watch mode: auto-reload on changes
```
Build & run the compiled output:
```bash
bun run build # tsc → dist/
bun run start # bun dist/src/server.js
```
---
## 📖 Interactive API docs (Swagger)
The full OpenAPI 3 specification lives at [`docs/swagger.yaml`](docs/swagger.yaml) and is **rendered as interactive Swagger UI** once the server is running:
> 👉 **http://localhost:8090/api-docs**
From the Swagger UI you can:
- 🔍 Browse every endpoint with full request/response schemas
- 🧪 **Try requests live** with the built-in "Try it out" button
- 📥 Inspect example payloads and response shapes inline
- 📤 Export/download the raw spec for codegen or client SDK generation
The static UI playground at **http://localhost:8090/** is a friendlier sandbox for non-engineers.
> 💡 **For client SDKs:** point your favorite OpenAPI generator (e.g. [`openapi-typescript`](https://www.npmjs.com/package/openapi-typescript), [`openapi-generator-cli`](https://openapi-generator.tech/)) at `http://localhost:8090/api-docs/swagger.json` to auto-generate a fully-typed client.
---
## ⚙️ Configuration
All settings are environment variables. Locally, drop them in a `.env` at the project root (auto-loaded). For Docker, set them under `environment:` in [`docker-compose.yml`](docker-compose.yml).
| Variable | Default | Description |
| -------------------- | ------- | -------------------------------------------------------------------------------------- |
| `PORT` | `8090` | HTTP port the server listens on. |
| `HEADLESS`           | `true`  | Run Chrome headless. _(Informational: `puppeteer-real-browser` is always headful.)_    |
| `BOT_BYPASS_ENABLED` | `true` | Reserved flag for future bot-bypass tuning. |
| `CHROME_PATH` | auto | Explicit path to Chrome/Chromium (set inside the container). |
| `PROXY_URL` | _none_ | Outbound HTTP proxy: `http://[user:pass@]host:port`. Used by Chrome for every request. |
**Example `.env`:**
```env
PORT=8090
PROXY_URL=http://user:pass@proxy.example.com:8080
```
> 🔒 `.env` is **excluded** from the Docker build (see [`.dockerignore`](.dockerignore)). Container env must go in `docker-compose.yml`.
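
As a rough illustration of how the variables documented in this section could map to typed settings, the sketch below parses an environment object and applies the documented defaults. `AppConfig`, `parseBool`, and `loadConfig` are hypothetical names chosen for this example; they are not taken from `src/config/index.ts`, which may parse and validate differently.

```typescript
// Hypothetical typed view of the documented environment variables.
interface AppConfig {
  port: number;
  headless: boolean;
  botBypassEnabled: boolean;
  chromePath?: string; // undefined means "auto-detect"
  proxyUrl?: string; // undefined means "no proxy"
}

// An unset or empty variable falls back to the given default;
// otherwise only the literal string "false" (any case) disables the flag.
function parseBool(value: string | undefined, fallback: boolean): boolean {
  if (value === undefined || value === "") return fallback;
  return value.toLowerCase() !== "false";
}

function loadConfig(env: Record<string, string | undefined>): AppConfig {
  const port = Number(env.PORT ?? "8090");
  if (!Number.isInteger(port) || port < 1 || port > 65535) {
    throw new Error(`Invalid PORT: ${env.PORT}`);
  }
  return {
    port,
    headless: parseBool(env.HEADLESS, true),
    botBypassEnabled: parseBool(env.BOT_BYPASS_ENABLED, true),
    chromePath: env.CHROME_PATH || undefined,
    proxyUrl: env.PROXY_URL || undefined,
  };
}
```

Passing the environment in as an argument (for example `loadConfig(process.env)`) keeps the function easy to exercise with plain objects in tests.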
---
## 📚 API reference
Base URL: `http://localhost:8090/api` · Full schemas: **[Swagger UI](http://localhost:8090/api-docs)**

| Method | Path                | Purpose                            |
| ------ | ------------------- | ---------------------------------- |
| `POST` | `/api/scrape`       | Synchronous scrape                 |
| `POST` | `/api/scrape/async` | Fire-and-forget; returns a `jobId` |
| `POST` | `/api/scrape/batch` | Batch up to 10 URLs in parallel    |
| `GET`  | `/api/jobs/:jobId`  | Poll a job's status & result       |
| `GET`  | `/api/jobs`         | List all jobs (metadata only)      |
| `GET`  | `/api/health`       | Server health & uptime             |

---
### `POST /api/scrape` – synchronous scrape
Scrapes a URL and returns the full result in the same response. Best for ad-hoc requests and testing.
**Request body:**
| Field        | Type                | Required | Default                                                  | Description                                           |
| ------------ | ------------------- | -------- | -------------------------------------------------------- | ----------------------------------------------------- |
| `url`        | `string` (URL)      | ✅       | _none_                                                   | The page to scrape.                                   |
| `waitFor`    | `number` (0..60000) | ❌       | `3000`                                                   | Extra ms to wait after page load (JS-rendered pages). |
| `extractors` | `ExtractorName[]`   | ❌       | `["links","images","headings","text","prices","tables"]` | Which extractors to run.                              |
| `fullHtml`   | `boolean`           | ❌       | `false`                                                  | Include raw post-render HTML under `fullHtml`.        |
| `screenshot` | `boolean`           | ❌       | `false`                                                  | Include base64 PNG under `screenshotUrl`.             |

**Query:** `?fields=a,b,c` returns only the listed top-level fields.
**Example:**
```bash
curl -X POST http://localhost:8090/api/scrape \
-H "Content-Type: application/json" \
-d '{
"url": "https://example.com",
"waitFor": 2000,
"extractors": ["headings", "links"]
}'
```
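
The same request can be sketched in TypeScript. The example below is a minimal client under stated assumptions: `API_BASE` points at a local instance, and `ScrapeOptions`, `buildScrapeRequest`, and `scrape` are hypothetical names introduced here rather than anything exported by the project. The range check on `waitFor` mirrors the request-body table above; the server performs its own validation regardless.

```typescript
type ExtractorName = "links" | "images" | "headings" | "text" | "prices" | "tables";

interface ScrapeOptions {
  url: string;
  waitFor?: number; // 0..60000 ms; the server defaults to 3000
  extractors?: ExtractorName[];
  fullHtml?: boolean;
  screenshot?: boolean;
}

// Assumed base URL of a local instance; adjust for your deployment.
const API_BASE = "http://localhost:8090/api";

// Builds the endpoint URL and fetch options for POST /api/scrape.
// `fields` maps to the documented `?fields=a,b,c` query parameter.
function buildScrapeRequest(options: ScrapeOptions, fields: string[] = []) {
  new URL(options.url); // throws a TypeError if `url` is not an absolute URL
  if (options.waitFor !== undefined && (options.waitFor < 0 || options.waitFor > 60000)) {
    throw new RangeError("waitFor must be between 0 and 60000 ms");
  }
  const endpoint = new URL(`${API_BASE}/scrape`);
  if (fields.length > 0) endpoint.searchParams.set("fields", fields.join(","));
  return {
    endpoint: endpoint.toString(),
    init: {
      method: "POST",
      headers: { "Content-Type": "application/json" },
      body: JSON.stringify(options),
    },
  };
}

// Thin wrapper that sends the request with the runtime's global fetch.
async function scrape(options: ScrapeOptions, fields: string[] = []): Promise<unknown> {
  const { endpoint, init } = buildScrapeRequest(options, fields);
  const response = await fetch(endpoint, init);
  if (!response.ok) throw new Error(`Scrape failed with HTTP ${response.status}`);
  return response.json();
}
```

Keeping request construction separate from the network call makes the validation logic testable without a running server.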
**Response shape:**
```json
{
"success": true,
"data": {
"url": "https://example.com/",
"crawl": {
"loadedUrl": "https://example.com/",
"loadedTime": "2026-05-11T12:34:56.000Z",
"referrerUrl": "https://example.com",
"httpStatusCode": 200,
"depth": 0,
"contentType": "text/html"
},
"metadata": {
"title": "...",
"description": "...",
"openGraph": {},
"jsonLd": []
},
"html": "",
"markdown": "# Title\n\n...",
"screenshotUrl": null,
"timeTaken": "3.42s",
"headings": { "h1": ["..."], "h2": ["..."] },
"links": { "total": 12, "items": [] },
"product": { "productTitle": "...", "variants": [], "priceTry": null },
"networkSummary": { "totalRequests": 1, "byType": { "document": 1 } }
}
}
```
---
### `POST /api/scrape/async` – fire-and-forget
Same body as `/api/scrape`. Returns immediately with a `jobId`; poll [`/api/jobs/:jobId`](#get-apijobsjobid--job-status--result) for results.
```json
{
"success": true,
"jobId": "550e8400-e29b-41d4-a716-446655440000",
"message": "Scraping started. Poll /api/jobs/:jobId for results."
}
```
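
A polling loop for the returned `jobId` might look like the sketch below. It assumes the job fields documented for `GET /api/jobs/:jobId` (`status`, `data`, `error`) sit at the top level of the JSON body, which this README does not state explicitly. `settle`, `sleep`, and `waitForJob` are hypothetical helper names, and the interval and attempt limits are arbitrary starting points.

```typescript
type JobStatus = "running" | "completed" | "failed";

interface JobResponse {
  status: JobStatus;
  data?: unknown; // present once status is "completed"
  error?: string; // present once status is "failed"
}

// Interprets one poll result: keep waiting, hand back the data, or throw.
function settle(job: JobResponse): { done: false } | { done: true; data: unknown } {
  if (job.status === "running") return { done: false };
  if (job.status === "failed") throw new Error(job.error ?? "Scrape job failed");
  return { done: true, data: job.data };
}

const sleep = (ms: number) => new Promise<void>((resolve) => setTimeout(resolve, ms));

// Polls GET /api/jobs/:jobId until the job leaves the "running" state
// or the attempt budget is exhausted.
async function waitForJob(
  jobId: string,
  baseUrl = "http://localhost:8090/api",
  intervalMs = 2000,
  maxAttempts = 150,
): Promise<unknown> {
  for (let attempt = 0; attempt < maxAttempts; attempt++) {
    const response = await fetch(`${baseUrl}/jobs/${encodeURIComponent(jobId)}`);
    if (!response.ok) throw new Error(`Job lookup failed with HTTP ${response.status}`);
    const result = settle((await response.json()) as JobResponse);
    if (result.done) return result.data;
    await sleep(intervalMs);
  }
  throw new Error(`Job ${jobId} did not finish after ${maxAttempts} polls`);
}
```

With the defaults shown, the loop gives up after roughly 300 seconds, which is slightly longer than the 240-second scrape timeout the server documents.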
---
### `POST /api/scrape/batch` – multi-URL batch
Scrape up to **10 URLs in parallel**. Same options as `/api/scrape` but with a `urls` array.
**Request:**
```json
{
"urls": ["https://a.com", "https://b.com"],
"extractors": ["headings"]
}
```
**Response (immediate):**
```json
{
"success": true,
"batchId": "...",
"jobIds": ["...", "..."],
"message": "Batch scraping started."
}
```
Poll each `jobId` independently.
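
For lists longer than the 10-URL cap, one option is to split the input client-side and submit several batches. The sketch below shows that idea; `chunkUrls` and `submitAll` are hypothetical helpers, the local `BATCH_MAX_URLS` constant only mirrors the documented limit, and the default base URL assumes a local instance.

```typescript
// Mirrors the documented cap of 10 URLs per batch request.
const BATCH_MAX_URLS = 10;

// Splits a URL list into consecutive groups of at most `size` items.
function chunkUrls(urls: string[], size: number = BATCH_MAX_URLS): string[][] {
  if (!Number.isInteger(size) || size < 1) {
    throw new RangeError("size must be a positive integer");
  }
  const chunks: string[][] = [];
  for (let i = 0; i < urls.length; i += size) {
    chunks.push(urls.slice(i, i + size));
  }
  return chunks;
}

interface BatchResponse {
  success: boolean;
  batchId: string;
  jobIds: string[];
  message: string;
}

// Submits each chunk to POST /api/scrape/batch in turn and collects every job ID.
async function submitAll(
  urls: string[],
  extractors: string[],
  baseUrl = "http://localhost:8090/api",
): Promise<string[]> {
  const jobIds: string[] = [];
  for (const chunk of chunkUrls(urls)) {
    const response = await fetch(`${baseUrl}/scrape/batch`, {
      method: "POST",
      headers: { "Content-Type": "application/json" },
      body: JSON.stringify({ urls: chunk, extractors }),
    });
    if (!response.ok) throw new Error(`Batch failed with HTTP ${response.status}`);
    const body = (await response.json()) as BatchResponse;
    jobIds.push(...body.jobIds);
  }
  return jobIds;
}
```

Submitting chunks sequentially rather than all at once keeps the number of concurrent Chrome pages on the server bounded by a single batch.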
---
### `GET /api/jobs/:jobId` – job status & result
| Field | When | Description |
| ------------- | -------------- | ---------------------------------- |
| `status`      | always         | `running` · `completed` · `failed` |
| `data` | on `completed` | Full `ScrapeResult` (see above). |
| `error` | on `failed` | Error message string. |
| `createdAt` | always | ISO timestamp. |
| `completedAt` | when done | ISO timestamp. |
Supports `?fields=` to project the inner `data`.
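
To make the projection concrete, here is one way a top-level field filter could work. This is an illustrative sketch, not the contents of `src/utils/pick-fields.ts`: `parseFields` and `pickFields` are hypothetical names, and silently ignoring unknown field names is an assumption about the behavior.

```typescript
// Parses "a, b,,c" into ["a", "b", "c"]; an absent or empty value yields [].
function parseFields(raw: string | undefined): string[] {
  if (!raw) return [];
  return raw
    .split(",")
    .map((field) => field.trim())
    .filter((field) => field.length > 0);
}

// Returns a shallow copy containing only the requested top-level keys.
// With no fields requested, the source object is returned unchanged.
function pickFields(
  source: Record<string, unknown>,
  fields: string[],
): Record<string, unknown> {
  if (fields.length === 0) return source;
  const picked: Record<string, unknown> = {};
  for (const key of fields) {
    if (Object.prototype.hasOwnProperty.call(source, key)) {
      picked[key] = source[key];
    }
  }
  return picked;
}
```

For example, `pickFields(result, parseFields("url,markdown"))` keeps only `url` and `markdown`; nested objects such as `links` are copied by reference, not filtered further.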
---
### `GET /api/jobs` – list all jobs
Returns metadata for every job in the in-memory store (without payloads).
> ⚠️ The job store is **in-memory** and resets on container restart. For production, swap [`src/services/job.service.ts`](src/services/job.service.ts) for Redis or Postgres.
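
One way to prepare for that swap is to hide the store behind a small asynchronous interface, so an in-memory map and a Redis- or Postgres-backed implementation are interchangeable. The sketch below is a hypothetical design, not the current shape of `job.service.ts`; `Job`, `JobStore`, `InMemoryJobStore`, and `toMetadata` are names invented for this example.

```typescript
type JobStatus = "running" | "completed" | "failed";

interface Job {
  id: string;
  status: JobStatus;
  createdAt: string; // ISO timestamp
  completedAt?: string; // ISO timestamp, set when the job finishes
  data?: unknown;
  error?: string;
}

type JobMetadata = Omit<Job, "data">;

// Drops the (potentially large) payload, as the list endpoint does.
function toMetadata(job: Job): JobMetadata {
  const { data: _data, ...metadata } = job;
  return metadata;
}

// Storage contract; methods are async so a network-backed store fits too.
interface JobStore {
  create(id: string): Promise<Job>;
  complete(id: string, data: unknown): Promise<void>;
  fail(id: string, error: string): Promise<void>;
  get(id: string): Promise<Job | undefined>;
  list(): Promise<JobMetadata[]>;
}

class InMemoryJobStore implements JobStore {
  private jobs = new Map<string, Job>();

  async create(id: string): Promise<Job> {
    const job: Job = { id, status: "running", createdAt: new Date().toISOString() };
    this.jobs.set(id, job);
    return job;
  }

  async complete(id: string, data: unknown): Promise<void> {
    this.finish(id, { status: "completed", data });
  }

  async fail(id: string, error: string): Promise<void> {
    this.finish(id, { status: "failed", error });
  }

  async get(id: string): Promise<Job | undefined> {
    return this.jobs.get(id);
  }

  async list(): Promise<JobMetadata[]> {
    return Array.from(this.jobs.values(), toMetadata);
  }

  // Applies the final state and stamps the completion time.
  private finish(id: string, patch: Partial<Job>): void {
    const job = this.jobs.get(id);
    if (!job) throw new Error(`Unknown job: ${id}`);
    this.jobs.set(id, { ...job, ...patch, completedAt: new Date().toISOString() });
  }
}
```

Controllers that depend only on `JobStore` would not need to change when a persistent implementation replaces the in-memory one.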
---
### `GET /api/health` – health check
```json
{ "status": "ok", "uptime": 123.45 }
```
---
## 🧩 Project layout
```
.
├── 🐳 Dockerfile                # Bun + puppeteer base image, Xvfb installed
├── 🐳 docker-compose.yml        # Single-service compose (port 8090)
├── 🐳 docker-entrypoint.sh      # Starts Xvfb, then execs CMD
├── 📦 package.json
├── 🔒 bun.lock
├── ⚙️ tsconfig.json
├── 📁 docs/
│   └── 📖 swagger.yaml          # OpenAPI 3 spec, served at /api-docs
├── 📁 public/
│   └── 📄 index.html            # Static UI, served at /
└── 📁 src/
    ├── server.ts                # HTTP entrypoint
    ├── app.ts                   # Express app, middleware, static & swagger
    ├── config/index.ts          # Env vars, constants
    ├── controllers/             # Request handlers
    ├── middlewares/validate.ts  # Zod body validator
    ├── routes/                  # Thin route layer per resource
    ├── services/
    │   ├── job.service.ts       # In-memory job store
    │   └── scraper.service.ts   # Timeout wrapper around scrapePage
    ├── utils/
    │   ├── pick-fields.ts       # ?fields= projection helper
    │   └── scraper.ts           # All puppeteer + extraction logic
    ├── validators/scrape.validator.ts  # Zod schemas
    └── types/index.ts           # Domain & DTO types
```
---
## 🛠️ Development
### Scripts
| Script | What it does |
| --------------- | ------------------------------------------------------- |
| `bun run dev` | Run `src/server.ts` with `bun --watch` (no build step). |
| `bun run build` | Compile TypeScript to `dist/`.                          |
| `bun run start` | Run the compiled output (`bun dist/src/server.js`). |
### Type-check without emitting
```bash
bunx tsc --noEmit
```
### Adding a new extractor
1. Add the name to `ExtractorName` in [`src/types/index.ts`](src/types/index.ts).
2. Add it to the enum in [`src/validators/scrape.validator.ts`](src/validators/scrape.validator.ts).
3. Implement `extractFoo($, baseUrl)` in [`src/utils/scraper.ts`](src/utils/scraper.ts).
4. Wire it into the `scrapePage` block (`if (extractors.includes("foo"))`).
5. Add the response field to `ScrapeResult` in `types/index.ts`.
6. Update [`docs/swagger.yaml`](docs/swagger.yaml) so Swagger UI reflects the new shape.
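
The shape of steps 1, 3, 4, and 5 can be sketched as follows, using a made-up `emails` extractor. To stay dependency-free the extractor here takes plain page text, whereas the project's extractors receive a Cheerio handle `$` and a base URL; `applyExtractors` is likewise a hypothetical stand-in for the relevant part of `scrapePage`, and `ScrapeResult` is trimmed to two fields.

```typescript
// Step 1: extend the union of extractor names with the new entry.
type ExtractorName =
  | "links" | "images" | "headings" | "text" | "prices" | "tables"
  | "emails";

// Step 5: add the new optional field to the result type (trimmed here).
interface ScrapeResult {
  url: string;
  emails?: string[];
}

// Step 3: the extractor itself. It collects e-mail-like strings,
// lowercases them, and removes duplicates while keeping first-seen order.
function extractEmails(text: string): string[] {
  const matches = text.match(/[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}/g) ?? [];
  return Array.from(new Set(matches.map((match) => match.toLowerCase())));
}

// Step 4: run the extractor only when the request opted in.
function applyExtractors(
  url: string,
  text: string,
  extractors: ExtractorName[],
): ScrapeResult {
  const result: ScrapeResult = { url };
  if (extractors.includes("emails")) {
    result.emails = extractEmails(text);
  }
  return result;
}
```

Steps 2 and 6 (the Zod enum and the OpenAPI spec) are not shown; they only need the new name added alongside the existing ones.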
---
## 🐳 Docker notes
### Why an entrypoint script?
`puppeteer-real-browser` runs Chrome **headful** (`headless: false`) so that the anti-bot bypass works, and headful Chrome requires a real X display. The image ships with `xvfb`, and [`docker-entrypoint.sh`](docker-entrypoint.sh) starts `Xvfb :99` and sets `DISPLAY=:99` before exec'ing the server, so Chrome can attach to that display.
### Tuning shared memory
The default `shm_size: "2gb"` is sized for a single Chrome. If you raise the batch limit or run many scrapes in parallel, bump it to `4gb`+ to prevent Chrome crashes.
### Using a proxy
```yaml
environment:
- PROXY_URL=http://user:pass@proxy-server:8080
```
Credentials are URL-decoded before being passed to Chrome.
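
The decoding step matters because the URL parser keeps user and password percent-encoded. The sketch below shows one way to split a `PROXY_URL` value into a server address and decoded credentials; `ProxySettings` and `parseProxyUrl` are hypothetical names, and how the project actually hands these values to Chrome is not shown in this README. In plain Puppeteer setups, the server part typically goes to Chrome's `--proxy-server` flag and the credentials to `page.authenticate()`.

```typescript
interface ProxySettings {
  server: string; // scheme://host[:port], without credentials
  username?: string;
  password?: string;
}

function parseProxyUrl(proxyUrl: string): ProxySettings {
  const parsed = new URL(proxyUrl); // throws a TypeError on malformed input
  const settings: ProxySettings = {
    server: `${parsed.protocol}//${parsed.host}`,
  };
  // URL exposes userinfo in percent-encoded form, so decode before use.
  if (parsed.username) settings.username = decodeURIComponent(parsed.username);
  if (parsed.password) settings.password = decodeURIComponent(parsed.password);
  return settings;
}
```

A password containing reserved characters such as `@` or `:` must be percent-encoded in `PROXY_URL` (for example `p%40ss`), otherwise the URL cannot be parsed unambiguously.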
---
## ⚠️ Limitations & production notes
- **In-memory job store:** jobs are lost on restart. Swap [`src/services/job.service.ts`](src/services/job.service.ts) for Redis or Postgres before going live.
- **Batch cap is 10 URLs:** see `BATCH_MAX_URLS` in [`src/config/index.ts`](src/config/index.ts).
- **Scrape timeout is 240s:** see `SCRAPE_TIMEOUT_MS` in [`src/config/index.ts`](src/config/index.ts).
- **No authentication:** anyone with network access can scrape arbitrary URLs. Put it behind a reverse proxy with auth/rate-limits before exposing publicly.
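
For readers wondering what a timeout wrapper of the kind noted above involves, here is a generic sketch based on `Promise.race`. It is not the code in `src/services/scraper.service.ts`; `withTimeout` is a hypothetical helper, and the local `SCRAPE_TIMEOUT_MS` constant only mirrors the documented 240-second value.

```typescript
// Mirrors the documented scrape timeout of 240 seconds.
const SCRAPE_TIMEOUT_MS = 240_000;

// Rejects if `work` has not settled within `ms`; the timer is always cleared.
function withTimeout<T>(work: Promise<T>, ms: number = SCRAPE_TIMEOUT_MS): Promise<T> {
  let timer: ReturnType<typeof setTimeout> | undefined;
  const timeout = new Promise<never>((_, reject) => {
    timer = setTimeout(() => reject(new Error(`Timed out after ${ms} ms`)), ms);
  });
  return Promise.race([work, timeout]).finally(() => {
    if (timer !== undefined) clearTimeout(timer);
  });
}
```

Note that racing against a timer only stops the caller from waiting; it does not cancel the underlying work, so a real scraper would still need to close the page or browser when the timeout fires.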
---
## 📄 License
**© 2026 Rejoyan Islam. All Rights Reserved.**
This project is **proprietary and source-available**; it is **not** open source. The code is published for reference and evaluation only. You may **not** copy, reproduce, modify, distribute, host, deploy, or create derivative works from any part of it without prior **written permission** from the author. See the full [`LICENSE`](LICENSE) for the exact terms.
For licensing or permission inquiries: **rejoyanislam0014@gmail.com**
Made with 🕷️ + ☕ by [Rejoyan Islam](mailto:rejoyanislam0014@gmail.com). Built on [Bun](https://bun.sh/), [Express](https://expressjs.com/), and [Puppeteer](https://pptr.dev/).