https://github.com/md-rejoyan-islam/scrape-server

Scraping Server is a self-hostable web scraping REST API built with Bun, Express, and TypeScript. It uses a real Chrome browser powered by puppeteer-real-browser and stealth techniques to bypass Cloudflare Turnstile and common bot protections. The API extracts structured page data including metadata, product details, links, images, and headings.
bun cheerio docker express puppeteer swagger typescript zod

# 🕷️ Scraping Server

**A production-grade web scraping API with anti-bot bypass, structured product extraction, and OpenAPI docs.**

[![Bun](https://img.shields.io/badge/Bun-1.3+-000?logo=bun&logoColor=fbf0df)](https://bun.sh/)
[![TypeScript](https://img.shields.io/badge/TypeScript-5.9-3178C6?logo=typescript&logoColor=white)](https://www.typescriptlang.org/)
[![Express](https://img.shields.io/badge/Express-5.x-000?logo=express&logoColor=white)](https://expressjs.com/)
[![Puppeteer](https://img.shields.io/badge/Puppeteer-24.x-40B5A4?logo=puppeteer&logoColor=white)](https://pptr.dev/)
[![Docker](https://img.shields.io/badge/Docker-ready-2496ED?logo=docker&logoColor=white)](https://www.docker.com/)
[![License](https://img.shields.io/badge/license-Proprietary-red.svg)](#-license)

[![Live Demo](https://img.shields.io/badge/▶_Live_Demo-scrape--server.rejoyan.me-5b9dff?logoColor=white)](https://scrape-server.rejoyan.me)

🌐 **Live instance:** [**scrape-server.rejoyan.me**](https://scrape-server.rejoyan.me) · [Swagger docs](https://scrape-server.rejoyan.me/api-docs)

[Quick start](#-quick-start) · [API reference](#-api-reference) · [Swagger UI](#-interactive-api-docs-swagger) · [Configuration](#%EF%B8%8F-configuration) · [Docker notes](#-docker-notes) · [License](#-license)

---

## ✨ Highlights

| | |
| ------------------------------ | ----------------------------------------------------------------------------------------------------------------------------------------------- |
| 🛡️ **Anti-bot bypass** | Cloudflare Turnstile & generic challenges via [`puppeteer-real-browser`](https://www.npmjs.com/package/puppeteer-real-browser) + stealth plugin |
| 🛒 **Structured product data** | Normalized output from JSON-LD, microdata, and OpenGraph, including variants, prices, and stock |
| 🧩 **Pluggable extractors** | `links`, `images`, `headings`, `text`, `prices`, `tables`; opt in per request |
| 📜 **Readability & Markdown** | Clean article HTML and Markdown output via Mozilla Readability + Turndown |
| 📸 **Screenshots** | Base64-encoded PNG of the rendered page |
| ⚡ **Three execution modes** | Synchronous, async (job-based), and parallel batch (up to 10 URLs) |
| 📚 **Swagger UI** | Interactive OpenAPI 3 docs at `/api-docs` |
| 🎯 **Field projection** | `?fields=a,b,c` to trim responses |
| 🐳 **Docker-native** | One-command bring-up with bundled Xvfb for headful Chrome |
| ✅ **Type-safe inputs** | Zod-validated request bodies |

---

## 🚀 Quick start

### 🐳 Option A – Docker (recommended)

```bash
docker compose up --build
```

That's it. After about 2 minutes on the first build, these are available:

| | |
| ------------------- | -------------------------------- |
| 🖥️ **Web UI** | http://localhost:8090 |
| 📖 **Swagger docs** | http://localhost:8090/api-docs |
| 📡 **API base** | http://localhost:8090/api |
| ❤️ **Health** | http://localhost:8090/api/health |

Stop the stack:

```bash
docker compose down
```

### 💻 Option B – Local with Bun

**Prerequisites:** [Bun](https://bun.sh/) `>= 1.3` and a modern Chrome/Chromium (Puppeteer downloads one on first install).

```bash
bun install
bun run dev # watch mode: auto-reload on changes
```

Build & run the compiled output:

```bash
bun run build # tsc → dist/
bun run start # bun dist/src/server.js
```

---

## 📖 Interactive API docs (Swagger)

The full OpenAPI 3 specification lives at [`docs/swagger.yaml`](docs/swagger.yaml) and is **rendered as interactive Swagger UI** once the server is running:

> 👉 **http://localhost:8090/api-docs**

From the Swagger UI you can:

- 🔍 Browse every endpoint with full request/response schemas
- 🧪 **Try requests live** with the built-in "Try it out" button
- 📥 Inspect example payloads and response shapes inline
- 📤 Export/download the raw spec for codegen or client SDK generation

The static UI playground at **http://localhost:8090/** is a friendlier sandbox for non-engineers.

> 💡 **For client SDKs:** point your favorite OpenAPI generator (e.g. [`openapi-typescript`](https://www.npmjs.com/package/openapi-typescript), [`openapi-generator-cli`](https://openapi-generator.tech/)) at `http://localhost:8090/api-docs/swagger.json` to auto-generate a fully-typed client.

---

## ⚙️ Configuration

All settings are environment variables. Locally, drop them in a `.env` at the project root (auto-loaded). For Docker, set them under `environment:` in [`docker-compose.yml`](docker-compose.yml).

| Variable | Default | Description |
| -------------------- | ------- | -------------------------------------------------------------------------------------- |
| `PORT` | `8090` | HTTP port the server listens on. |
| `HEADLESS` | `true` | Run Chrome headless. _(Informational only: `puppeteer-real-browser` always runs headful.)_ |
| `BOT_BYPASS_ENABLED` | `true` | Reserved flag for future bot-bypass tuning. |
| `CHROME_PATH` | auto | Explicit path to Chrome/Chromium (set inside the container). |
| `PROXY_URL` | _none_ | Outbound HTTP proxy: `http://[user:pass@]host:port`. Used by Chrome for every request. |

**Example `.env`:**

```env
PORT=8090
PROXY_URL=http://user:pass@proxy.example.com:8080
```
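
The variables above could be read into a typed object along these lines. This is a minimal sketch, not the contents of `src/config/index.ts`: the names `AppConfig`, `parseBool`, and `loadConfig` are hypothetical, and treating only the string `false` as false is an assumption.

```typescript
// Hypothetical typed view of the documented environment variables.
interface AppConfig {
  port: number;
  headless: boolean;
  botBypassEnabled: boolean;
  chromePath?: string;
  proxyUrl?: string;
}

type Env = Record<string, string | undefined>;

// Assumption: anything other than "false" (case-insensitive) counts as true,
// and an unset or empty variable keeps its documented default.
function parseBool(value: string | undefined, fallback: boolean): boolean {
  if (value === undefined || value.trim() === "") return fallback;
  return value.trim().toLowerCase() !== "false";
}

function loadConfig(env: Env): AppConfig {
  const rawPort = env.PORT?.trim() || "8090"; // documented default
  const port = Number(rawPort);
  if (!Number.isInteger(port) || port < 1 || port > 65535) {
    throw new Error(`Invalid PORT: ${rawPort}`);
  }
  return {
    port,
    headless: parseBool(env.HEADLESS, true),
    botBypassEnabled: parseBool(env.BOT_BYPASS_ENABLED, true),
    chromePath: env.CHROME_PATH || undefined, // "auto" when unset
    proxyUrl: env.PROXY_URL || undefined,
  };
}
```

In a Bun or Node process this would be called as `loadConfig(process.env)`. Taking the environment as a parameter keeps the function easy to test.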

> 🔒 `.env` is **excluded** from the Docker build (see [`.dockerignore`](.dockerignore)). Container env must go in `docker-compose.yml`.

---

## 🔌 API reference

Base URL: `http://localhost:8090/api` · Full schemas: **[Swagger UI](http://localhost:8090/api-docs)**

| Method | Path | Purpose |
| ------ | ------------------- | ------------------------------------ |
| `POST` | `/api/scrape` | Synchronous scrape |
| `POST` | `/api/scrape/async` | Fire-and-forget; returns a `jobId` |
| `POST` | `/api/scrape/batch` | Batch up to 10 URLs in parallel |
| `GET` | `/api/jobs/:jobId` | Poll a job's status & result |
| `GET` | `/api/jobs` | List all jobs (metadata only) |
| `GET` | `/api/health` | Server health & uptime |

---

### `POST /api/scrape` – synchronous scrape

Scrapes a URL and returns the full result in the same response. Best for ad-hoc requests and testing.

**Request body:**

| Field | Type | Required | Default | Description |
| ------------ | ------------------- | -------- | -------------------------------------------------------- | ----------------------------------------------------- |
| `url` | `string` (URL) | ✅ | _none_ | The page to scrape. |
| `waitFor` | `number` (0..60000) | ❌ | `3000` | Extra ms to wait after page load (JS-rendered pages). |
| `extractors` | `ExtractorName[]` | ❌ | `["links","images","headings","text","prices","tables"]` | Which extractors to run. |
| `fullHtml` | `boolean` | ❌ | `false` | Include raw post-render HTML under `fullHtml`. |
| `screenshot` | `boolean` | ❌ | `false` | Include base64 PNG under `screenshotUrl`. |

**Query:** `?fields=a,b,c` trims the response to the listed top-level fields.
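
To make the projection concrete, here is a minimal sketch of what a `?fields=` helper might do. It is not the code in `src/utils/pick-fields.ts`. The function name, the handling of unknown field names (silently ignored), and the fallback to the full object when the parameter is empty are all assumptions.

```typescript
// Hypothetical projection helper: keeps only the requested top-level keys.
function pickFields<T extends Record<string, unknown>>(
  data: T,
  fieldsParam: string | undefined,
): Partial<T> {
  if (!fieldsParam) return data; // no ?fields= means no projection
  const wanted = fieldsParam
    .split(",")
    .map((field) => field.trim())
    .filter((field) => field.length > 0);
  if (wanted.length === 0) return data;

  const out: Record<string, unknown> = {};
  for (const key of wanted) {
    // Assumption: names that are not present on the result are skipped.
    if (Object.prototype.hasOwnProperty.call(data, key)) {
      out[key] = data[key];
    }
  }
  return out as Partial<T>;
}

// Example: keep only `url` and `markdown` from a scrape result.
const trimmed = pickFields(
  { url: "https://example.com/", markdown: "# Title", html: "" },
  "url,markdown",
);
```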

**Example:**

```bash
curl -X POST http://localhost:8090/api/scrape \
  -H "Content-Type: application/json" \
  -d '{
    "url": "https://example.com",
    "waitFor": 2000,
    "extractors": ["headings", "links"]
  }'
```

**Response shape:**

```json
{
  "success": true,
  "data": {
    "url": "https://example.com/",
    "crawl": {
      "loadedUrl": "https://example.com/",
      "loadedTime": "2026-05-11T12:34:56.000Z",
      "referrerUrl": "https://example.com",
      "httpStatusCode": 200,
      "depth": 0,
      "contentType": "text/html"
    },
    "metadata": {
      "title": "...",
      "description": "...",
      "openGraph": {},
      "jsonLd": []
    },
    "html": "",
    "markdown": "# Title\n\n...",
    "screenshotUrl": null,
    "timeTaken": "3.42s",
    "headings": { "h1": ["..."], "h2": ["..."] },
    "links": { "total": 12, "items": [] },
    "product": { "productTitle": "...", "variants": [], "priceTry": null },
    "networkSummary": { "totalRequests": 1, "byType": { "document": 1 } }
  }
}
```
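
For callers who prefer TypeScript over `curl`, the request could be wrapped roughly as follows. This is a sketch under assumptions: `ScrapeRequest` mirrors the request-body table, while `FetchLike`, `buildScrapeCall`, and `scrape` are hypothetical names, and the response is left as `unknown` rather than guessing at the full `ScrapeResult` type.

```typescript
type ExtractorName = "links" | "images" | "headings" | "text" | "prices" | "tables";

// Mirrors the documented request body.
interface ScrapeRequest {
  url: string;
  waitFor?: number; // 0..60000 ms
  extractors?: ExtractorName[];
  fullHtml?: boolean;
  screenshot?: boolean;
}

// Minimal structural types so the sketch does not depend on DOM or Node typings.
interface HttpInit {
  method: string;
  headers: Record<string, string>;
  body: string;
}
interface HttpResponse {
  ok: boolean;
  status: number;
  json(): Promise<unknown>;
}
type FetchLike = (url: string, init: HttpInit) => Promise<HttpResponse>;

// Builds the URL and fetch options for POST /api/scrape, with optional ?fields=.
function buildScrapeCall(
  baseUrl: string,
  req: ScrapeRequest,
  fields: string[] = [],
): { url: string; init: HttpInit } {
  const query = fields.length > 0 ? `?fields=${encodeURIComponent(fields.join(","))}` : "";
  return {
    url: `${baseUrl.replace(/\/+$/, "")}/scrape${query}`,
    init: {
      method: "POST",
      headers: { "Content-Type": "application/json" },
      body: JSON.stringify(req),
    },
  };
}

// Sends the request with an injected fetch implementation and returns the parsed JSON.
async function scrape(
  fetchImpl: FetchLike,
  baseUrl: string,
  req: ScrapeRequest,
  fields: string[] = [],
): Promise<unknown> {
  const { url, init } = buildScrapeCall(baseUrl, req, fields);
  const res = await fetchImpl(url, init);
  if (!res.ok) throw new Error(`Scrape failed with HTTP ${res.status}`);
  return res.json();
}
```

Under Bun or Node 18+, the global `fetch` can be passed as `fetchImpl`, for example `scrape(fetch, "http://localhost:8090/api", { url: "https://example.com" })`.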

---

### `POST /api/scrape/async` – fire-and-forget

Same body as `/api/scrape`. Returns immediately with a `jobId`; poll [`/api/jobs/:jobId`](#get-apijobsjobid--job-status--result) for results.

```json
{
  "success": true,
  "jobId": "550e8400-e29b-41d4-a716-446655440000",
  "message": "Scraping started. Poll /api/jobs/:jobId for results."
}
```

---

### `POST /api/scrape/batch` – multi-URL batch

Scrape up to **10 URLs in parallel**. Same options as `/api/scrape` but with a `urls` array.

**Request:**

```json
{
  "urls": ["https://a.com", "https://b.com"],
  "extractors": ["headings"]
}
```

**Response (immediate):**

```json
{
  "success": true,
  "batchId": "...",
  "jobIds": ["...", "..."],
  "message": "Batch scraping started."
}
```

Poll each `jobId` independently.
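
A client can check a batch against the documented 10-URL cap before sending it. The sketch below is a plain TypeScript stand-in, not the project's Zod schema. The `checkBatchUrls` name, the non-empty requirement, and the restriction to `http`/`https` are assumptions.

```typescript
const BATCH_MAX_URLS = 10; // mirrors the documented cap

type BatchCheck = { ok: true; urls: string[] } | { ok: false; error: string };

// Assumption: only absolute http(s) URLs are acceptable.
function isHttpUrl(value: string): boolean {
  try {
    const parsed = new URL(value);
    return parsed.protocol === "http:" || parsed.protocol === "https:";
  } catch {
    return false; // new URL() throws on unparseable input
  }
}

function checkBatchUrls(urls: unknown): BatchCheck {
  if (!Array.isArray(urls) || urls.length === 0) {
    return { ok: false, error: "urls must be a non-empty array" };
  }
  if (urls.length > BATCH_MAX_URLS) {
    return { ok: false, error: `at most ${BATCH_MAX_URLS} URLs per batch` };
  }
  const badIndex = urls.findIndex((u) => typeof u !== "string" || !isHttpUrl(u));
  if (badIndex !== -1) {
    return { ok: false, error: `invalid URL at index ${badIndex}` };
  }
  return { ok: true, urls: urls as string[] };
}
```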

---

### `GET /api/jobs/:jobId` – job status & result

| Field | When | Description |
| ------------- | -------------- | ---------------------------------- |
| `status` | always | `running` · `completed` · `failed` |
| `data` | on `completed` | Full `ScrapeResult` (see above). |
| `error` | on `failed` | Error message string. |
| `createdAt` | always | ISO timestamp. |
| `completedAt` | when done | ISO timestamp. |

Supports `?fields=` to project the inner `data`.
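
A polling loop for the async and batch endpoints might look like the sketch below. It assumes the fields in the table sit at the top level of the JSON response. `JobResponse`, `isFinished`, and `waitForJob` are hypothetical names, and the 2-second interval and 150-attempt ceiling are arbitrary defaults.

```typescript
type JobStatus = "running" | "completed" | "failed";

// Assumed response shape, based on the field table above.
interface JobResponse {
  status: JobStatus;
  data?: unknown;
  error?: string;
  createdAt: string;
  completedAt?: string;
}

function isFinished(job: JobResponse): boolean {
  return job.status === "completed" || job.status === "failed";
}

// `getJob` is injected: for example, a function that GETs /api/jobs/:jobId
// and parses the JSON body.
async function waitForJob(
  getJob: (jobId: string) => Promise<JobResponse>,
  jobId: string,
  intervalMs = 2000,
  maxAttempts = 150,
): Promise<JobResponse> {
  for (let attempt = 0; attempt < maxAttempts; attempt++) {
    const job = await getJob(jobId);
    if (job.status === "failed") throw new Error(job.error ?? "job failed");
    if (job.status === "completed") return job;
    // Still running: wait before asking again.
    await new Promise<void>((resolve) => setTimeout(resolve, intervalMs));
  }
  throw new Error(`Job ${jobId} still running after ${maxAttempts} polls`);
}
```

For a batch, the same function can be run over every returned `jobId`, for instance with `Promise.allSettled`.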

---

### `GET /api/jobs` – list all jobs

Returns metadata for every job in the in-memory store (without payloads).

> ⚠️ The job store is **in-memory** and resets on container restart. For production, swap [`src/services/job.service.ts`](src/services/job.service.ts) for Redis or Postgres.
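
One way to prepare for that swap is to put the store behind a small async interface. The sketch below is not the contents of `job.service.ts`: `JobStore` and `InMemoryJobStore` are hypothetical names, and the method set is an assumption based on the endpoints described here.

```typescript
type JobStatus = "running" | "completed" | "failed";

interface Job {
  id: string;
  status: JobStatus;
  data?: unknown;
  error?: string;
  createdAt: string;
  completedAt?: string;
}

// Async on purpose, so a Redis- or Postgres-backed class could implement it.
interface JobStore {
  create(id: string): Promise<Job>;
  complete(id: string, data: unknown): Promise<void>;
  fail(id: string, error: string): Promise<void>;
  get(id: string): Promise<Job | undefined>;
  list(): Promise<Omit<Job, "data">[]>;
}

class InMemoryJobStore implements JobStore {
  private jobs = new Map<string, Job>();

  async create(id: string): Promise<Job> {
    const job: Job = { id, status: "running", createdAt: new Date().toISOString() };
    this.jobs.set(id, job);
    return job;
  }

  async complete(id: string, data: unknown): Promise<void> {
    this.finish(id, { status: "completed", data });
  }

  async fail(id: string, error: string): Promise<void> {
    this.finish(id, { status: "failed", error });
  }

  async get(id: string): Promise<Job | undefined> {
    return this.jobs.get(id);
  }

  // Metadata only: the potentially large `data` payload is dropped.
  async list(): Promise<Omit<Job, "data">[]> {
    return Array.from(this.jobs.values()).map(({ data: _data, ...meta }) => meta);
  }

  private finish(id: string, patch: Partial<Job>): void {
    const job = this.jobs.get(id);
    if (!job) throw new Error(`Unknown job: ${id}`);
    this.jobs.set(id, { ...job, ...patch, completedAt: new Date().toISOString() });
  }
}
```

Controllers that depend only on `JobStore` would not need to change when the backing storage does.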

---

### `GET /api/health` – health check

```json
{ "status": "ok", "uptime": 123.45 }
```

---

## 🧩 Project layout

```
.
├── 🐳 Dockerfile                        # Bun + puppeteer base image, Xvfb installed
├── 🐳 docker-compose.yml                # Single-service compose (port 8090)
├── 🐳 docker-entrypoint.sh              # Starts Xvfb, then execs CMD
├── 📦 package.json
├── 🔒 bun.lock
├── ⚙️ tsconfig.json
├── 📁 docs/
│   └── 📜 swagger.yaml                  # OpenAPI 3 spec → served at /api-docs
├── 📁 public/
│   └── 🌐 index.html                    # Static UI → served at /
└── 📁 src/
    ├── server.ts                        # HTTP entrypoint
    ├── app.ts                           # Express app, middleware, static & swagger
    ├── config/index.ts                  # Env vars, constants
    ├── controllers/                     # Request handlers
    ├── middlewares/validate.ts          # Zod body validator
    ├── routes/                          # Thin route layer per resource
    ├── services/
    │   ├── job.service.ts               # In-memory job store
    │   └── scraper.service.ts           # Timeout wrapper around scrapePage
    ├── utils/
    │   ├── pick-fields.ts               # ?fields= projection helper
    │   └── scraper.ts                   # All puppeteer + extraction logic
    ├── validators/scrape.validator.ts   # Zod schemas
    └── types/index.ts                   # Domain & DTO types
```

---

## 🛠️ Development

### Scripts

| Script | What it does |
| --------------- | ------------------------------------------------------- |
| `bun run dev` | Run `src/server.ts` with `bun --watch` (no build step). |
| `bun run build` | Compile TypeScript → `dist/`. |
| `bun run start` | Run the compiled output (`bun dist/src/server.js`). |

### Type-check without emitting

```bash
bunx tsc --noEmit
```

### Adding a new extractor

1. Add the name to `ExtractorName` in [`src/types/index.ts`](src/types/index.ts).
2. Add it to the enum in [`src/validators/scrape.validator.ts`](src/validators/scrape.validator.ts).
3. Implement `extractFoo($, baseUrl)` in [`src/utils/scraper.ts`](src/utils/scraper.ts).
4. Wire it into the `scrapePage` block (`if (extractors.includes("foo"))`).
5. Add the response field to `ScrapeResult` in `types/index.ts`.
6. Update [`docs/swagger.yaml`](docs/swagger.yaml) so Swagger UI reflects the new shape.
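
The steps above can be sketched end to end with a made-up `emails` extractor. Everything here is illustrative: the project's extractors receive a Cheerio handle (`$`) and a `baseUrl`, whereas this dependency-free version takes the raw HTML string and uses a regular expression. `extractEmails` and `runExtractors` are hypothetical names.

```typescript
// Step 1: extend the union with the new name ("emails" is a made-up example).
type ExtractorName =
  | "links" | "images" | "headings" | "text" | "prices" | "tables"
  | "emails";

// Step 5: add the matching optional field to the result type (heavily trimmed here).
interface ScrapeResult {
  url: string;
  emails?: string[];
}

// Step 3: the extractor itself. A regex over raw HTML stands in for Cheerio,
// so this only finds addresses inside mailto: links.
function extractEmails(html: string): string[] {
  const found = new Set<string>();
  const pattern = /href=["']mailto:([^"'?]+)/gi;
  let match: RegExpExecArray | null;
  while ((match = pattern.exec(html)) !== null) {
    found.add(match[1].toLowerCase()); // lower-case so duplicates collapse
  }
  return Array.from(found);
}

// Step 4: opt-in wiring, so the extractor runs only when requested.
function runExtractors(url: string, html: string, extractors: ExtractorName[]): ScrapeResult {
  const result: ScrapeResult = { url };
  if (extractors.includes("emails")) {
    result.emails = extractEmails(html);
  }
  return result;
}
```

Steps 2 and 6 (the Zod enum and `docs/swagger.yaml`) are configuration changes and are not shown.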

---

## 🐳 Docker notes

### Why an entrypoint script?

`puppeteer-real-browser` runs Chrome **headful** (`headless: false`) so anti-bot detection works, and that requires a real X display. The image ships with `xvfb`, and [`docker-entrypoint.sh`](docker-entrypoint.sh) starts `Xvfb :99` and sets `DISPLAY=:99` before exec'ing the server, so Chrome attaches to that display.

### Tuning shared memory

The default `shm_size: "2gb"` is sized for a single Chrome instance. If you raise the batch limit or run many scrapes in parallel, bump it to `4gb` or more to prevent Chrome crashes.

### Using a proxy

```yaml
environment:
- PROXY_URL=http://user:pass@proxy-server:8080
```

Credentials are URL-decoded before being passed to Chrome.
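
The decoding step can be illustrated with the standard `URL` class. This is a sketch only: `ProxySettings` and `parseProxyUrl` are hypothetical names, and how the project actually hands the values to Chrome is not shown here.

```typescript
interface ProxySettings {
  server: string; // scheme://host:port, without credentials
  username?: string;
  password?: string;
}

// Splits a PROXY_URL into the server address and URL-decoded credentials.
function parseProxyUrl(proxyUrl: string): ProxySettings {
  const parsed = new URL(proxyUrl); // throws on malformed input
  const settings: ProxySettings = { server: `${parsed.protocol}//${parsed.host}` };
  // URL keeps credentials percent-encoded, so decode them explicitly.
  if (parsed.username) settings.username = decodeURIComponent(parsed.username);
  if (parsed.password) settings.password = decodeURIComponent(parsed.password);
  return settings;
}

// A password containing "@" must be written as %40 inside the URL.
const proxy = parseProxyUrl("http://user:p%40ss@proxy-server:8080");
```

Here `proxy.password` is `p@ss` and `proxy.server` is `http://proxy-server:8080`.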

---

## ⚠️ Limitations & production notes

- **In-memory job store:** jobs are lost on restart. Swap [`src/services/job.service.ts`](src/services/job.service.ts) for Redis or Postgres before going live.
- **Batch cap is 10 URLs:** see `BATCH_MAX_URLS` in [`src/config/index.ts`](src/config/index.ts).
- **Scrape timeout is 240s:** see `SCRAPE_TIMEOUT_MS` in [`src/config/index.ts`](src/config/index.ts).
- **No authentication:** anyone with network access can scrape arbitrary URLs. Put it behind a reverse proxy with auth/rate-limits before exposing publicly.
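
For readers curious how a scrape timeout like the one above is commonly enforced, a generic `Promise.race` wrapper is sketched below. It is not the code in `src/services/scraper.service.ts`. `withTimeout` is a hypothetical helper, and the constant simply restates the documented 240s.

```typescript
const SCRAPE_TIMEOUT_MS = 240_000; // the documented 240s

// Rejects if `work` has not settled within `ms` milliseconds.
// Note: this does not cancel the underlying work; a real scraper would
// still need to close the page or browser when the timeout fires.
function withTimeout<T>(work: Promise<T>, ms: number, label = "operation"): Promise<T> {
  let timer: ReturnType<typeof setTimeout> | undefined;
  const timeout = new Promise<never>((_, reject) => {
    timer = setTimeout(() => reject(new Error(`${label} timed out after ${ms} ms`)), ms);
  });
  return Promise.race([work, timeout]).finally(() => {
    // Always clear the timer so it cannot keep the process alive.
    if (timer !== undefined) clearTimeout(timer);
  });
}
```

A call site would look like `withTimeout(scrapePage(url), SCRAPE_TIMEOUT_MS, "scrape")`, where the exact signature of `scrapePage` is assumed.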

---

## 📄 License

**© 2026 Rejoyan Islam. All Rights Reserved.**

This project is **proprietary and source-available**; it is **not** open source. The code is published for reference and evaluation only. You may **not** copy, reproduce, modify, distribute, host, deploy, or create derivative works from any part of it without prior **written permission** from the author. See the full [`LICENSE`](LICENSE) for the exact terms.

For licensing or permission inquiries: **rejoyanislam0014@gmail.com**

Made with 🕷️ + ☕ by [Rejoyan Islam](mailto:rejoyanislam0014@gmail.com), built on [Bun](https://bun.sh/), [Express](https://expressjs.com/), and [Puppeteer](https://pptr.dev/).