https://github.com/northys/unstoppable

High-performance, extensible web scraper for e-commerce product data

crawlee e-commerce product-data typescript web-scraping

Host: GitHub
URL: https://github.com/northys/unstoppable
Owner: northys
Created: 2025-07-30T18:33:13.000Z (10 months ago)
Default Branch: main
Last Pushed: 2025-07-30T18:48:55.000Z (10 months ago)
Last Synced: 2025-07-30T21:08:40.506Z (10 months ago)
Topics: crawlee, e-commerce, product-data, typescript, web-scraping
Language: TypeScript
Size: 102 KB
Stars: 0
Watchers: 0
Forks: 0
Open Issues: 0
Metadata Files:
- Readme: README.md

# Unstoppable - Product Data Scraper

A high-performance, extensible web scraper built with TypeScript and Crawlee for extracting product data from e-commerce websites.

## Features

- **Extensible Architecture**: Easy to add new website scrapers by extending the base scraper class

- **Parallel Processing**: Leverages Crawlee's concurrent crawling capabilities

- **TypeScript**: Full type safety and better development experience

- **Test-Driven**: Comprehensive test suite with HTML snapshots

- **Data Export**: Export scraped data in JSON or CSV formats

- **Proxy Support**: Built-in proxy rotation support

## Project Structure

```
unstoppable/
├── src/
│   ├── crawlers/        # Crawler implementations
│   ├── scrapers/        # Website-specific scrapers
│   ├── models/          # Data models
│   ├── utils/           # Utility functions
│   └── tests/           # Test files and fixtures
├── dist/                # Compiled JavaScript output
└── storage/             # Crawlee data storage
```

## Installation

```bash
npm install
```

## Usage

### Product Scraping

#### Development

```bash
npm start
# or
npm run start:dev
```

#### Production

```bash
npm run build
npm run start:prod
```

### Category Extraction

Extract categories from e-commerce websites:

```bash
# Extract all categories from Thomann
npm run extract-categories

# Extract from specific URL
npm run extract-categories -- --url https://www.thomann.de/de/index.html

# Export as CSV
npm run extract-categories -- --format csv

# Build category tree structure
npm run extract-categories -- --tree

# Verbose logging
npm run extract-categories -- --verbose
```

### Testing

```bash
npm test
```

### Linting and Type Checking

```bash
npm run lint
npm run typecheck
```

## Adding a New Website Scraper

1. Create a new scraper class extending `BaseScraper`:

```typescript
import { BaseScraper } from './BaseScraper.js';

export class NewWebsiteScraper extends BaseScraper {
  constructor() {
    super({
      name: 'NewWebsite',
      baseUrl: 'https://example.com',
    });
  }

  // Implement required methods.
  // `unknown` is a placeholder: use the return types declared by BaseScraper.
  async scrapeProductList(url: string): Promise<unknown> { /* ... */ }
  async scrapeProductDetail(url: string): Promise<unknown> { /* ... */ }
  async getCategoryUrls(): Promise<unknown> { /* ... */ }
  isProductUrl(url: string): boolean { /* ... */ }
  isCategoryUrl(url: string): boolean { /* ... */ }
}
```

2. Use the scraper with ProductCrawler:

```typescript
import { ProductCrawler } from './crawlers/ProductCrawler.js';
import { NewWebsiteScraper } from './scrapers/NewWebsiteScraper.js';

const scraper = new NewWebsiteScraper();
const crawler = new ProductCrawler({ scraper });

await crawler.run();
```
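
The `isProductUrl` and `isCategoryUrl` predicates decide how a discovered link is treated. The following is a minimal sketch for an imaginary shop, assuming product pages live under `/product/<slug>` and category pages under `/category/<slug>`. The path patterns and the `pathOf` helper are illustrative assumptions, not code from this repository; in a real scraper the two functions would be methods of the class:

```typescript
// Hypothetical URL patterns for an imaginary shop; adapt them to the target site.
const PRODUCT_PATH = /^\/product\/[^/]+\/?$/;
const CATEGORY_PATH = /^\/category\/[^/]+(\/[^/]+)*\/?$/;

// Returns the path of an absolute URL, or null when the string does not parse.
function pathOf(url: string): string | null {
  try {
    return new URL(url).pathname;
  } catch {
    return null;
  }
}

function isProductUrl(url: string): boolean {
  const path = pathOf(url);
  return path !== null && PRODUCT_PATH.test(path);
}

function isCategoryUrl(url: string): boolean {
  const path = pathOf(url);
  return path !== null && CATEGORY_PATH.test(path);
}
```

Matching on the parsed path rather than the raw string keeps query strings and fragments from affecting the classification.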

## Data Models

### Product Model

Products are scraped with the following structure:

```typescript
interface Product {
  id: string;
  sku?: string;
  name: string;
  brand?: string;
  category: string[];
  price: {
    currency: string;
    amount: number;
    originalAmount?: number;
    discount?: number;
  };
  availability: {
    inStock: boolean;
    stockLevel?: number;
    deliveryTime?: string;
  };
  images: string[];
  description?: string;
  specifications?: Record<string, string>; // key and value types assumed
  ratings?: {
    average: number;
    count: number;
  };
  url: string;
  scrapedAt: Date;
  source: string;
}
```
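
Because the product structure is nested, exporting it to CSV requires flattening. The sketch below shows one possible way to do that; it is not the project's exporter. `ProductRow` is a reduced copy of the fields above, and the column order, the `" > "` category separator, and the helper names are all assumptions made for illustration:

```typescript
// Reduced copy of the Product fields used in this sketch.
interface ProductRow {
  id: string;
  name: string;
  category: string[];
  price: { currency: string; amount: number };
  availability: { inStock: boolean };
  url: string;
  scrapedAt: Date;
}

// Quote a field when it contains a comma, a double quote, or a line break,
// doubling any embedded quotes (RFC 4180 style).
function csvField(value: string): string {
  return /[",\r\n]/.test(value) ? `"${value.replace(/"/g, '""')}"` : value;
}

// Hypothetical column layout: nested fields are flattened into separate
// columns and the category path is joined with " > ".
function toCsvRow(p: ProductRow): string {
  return [
    p.id,
    p.name,
    p.category.join(' > '),
    p.price.currency,
    p.price.amount.toFixed(2),
    String(p.availability.inStock),
    p.url,
    p.scrapedAt.toISOString(),
  ].map(csvField).join(',');
}

// Made-up sample data, not scraped output.
const row = toCsvRow({
  id: 'demo-1',
  name: 'Guitar, "Deluxe" edition',
  category: ['Guitars', 'Electric'],
  price: { currency: 'EUR', amount: 499 },
  availability: { inStock: true },
  url: 'https://example.com/product/demo-1',
  scrapedAt: new Date('2025-01-01T00:00:00.000Z'),
});
console.log(row);
```

Note that `scrapedAt` is a `Date`, so it has to be serialized explicitly (here as an ISO 8601 string) for both CSV and JSON output.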

### Category Model

Categories are extracted with the following structure:

```typescript
interface Category {
  name: string;
  url: string;
  code: string;
  parentCategory?: string;
  level: number;
  productCount?: number;
  subcategories?: Category[];
  scrapedAt: Date;
  source: string;
}
```
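
The `parentCategory` and `subcategories` fields allow the same data to be held either as a flat list or as a tree. The sketch below shows one way a flat list could be nested. It assumes that `parentCategory` holds the parent's `code`, which the model above does not confirm, and `buildCategoryTree` is a hypothetical helper rather than the logic behind the `--tree` flag:

```typescript
// Reduced copy of the Category fields needed for nesting.
interface CategoryNode {
  name: string;
  code: string;
  parentCategory?: string; // assumed to reference the parent's `code`
  level: number;
  subcategories: CategoryNode[];
}

type FlatCategory = Omit<CategoryNode, 'subcategories'>;

function buildCategoryTree(flat: FlatCategory[]): CategoryNode[] {
  // Index every category by code, each with an empty child list.
  const byCode = new Map<string, CategoryNode>();
  flat.forEach((c) => byCode.set(c.code, { ...c, subcategories: [] }));

  // Attach each node to its parent; nodes without a known parent become roots.
  const roots: CategoryNode[] = [];
  byCode.forEach((node) => {
    const parent = node.parentCategory ? byCode.get(node.parentCategory) : undefined;
    if (parent) {
      parent.subcategories.push(node);
    } else {
      roots.push(node);
    }
  });
  return roots;
}

// Made-up sample data.
const tree = buildCategoryTree([
  { name: 'Guitars', code: 'GT', level: 0 },
  { name: 'Electric', code: 'GT-EL', parentCategory: 'GT', level: 1 },
  { name: 'Acoustic', code: 'GT-AC', parentCategory: 'GT', level: 1 },
]);
console.log(JSON.stringify(tree, null, 2));
```

Categories whose parent is missing from the input are kept as roots instead of being dropped, so a partial crawl still produces usable output.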

## Configuration

Configure the crawler behavior through `ProductCrawler` options:

- `maxRequestsPerCrawl`: Limit the number of requests

- `proxyUrls`: Array of proxy URLs for rotation

- `startUrls`: Override default category URLs
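
As a rough illustration of the options above, the sketch below defines a local stand-in type and a small sanity check. `CrawlerOptionsSketch`, the field types, the sample values, and `findInvalidUrls` are assumptions for this example; the real option type is defined by the project and may differ:

```typescript
// Local stand-in for the documented options; the field types are assumed.
interface CrawlerOptionsSketch {
  maxRequestsPerCrawl?: number;
  proxyUrls?: string[];
  startUrls?: string[];
}

// Hypothetical helper: returns every configured URL that is not a valid
// absolute http(s) URL, so mistakes surface before a crawl starts.
function findInvalidUrls(opts: CrawlerOptionsSketch): string[] {
  const all = [...(opts.proxyUrls ?? []), ...(opts.startUrls ?? [])];
  return all.filter((candidate) => {
    try {
      const { protocol } = new URL(candidate);
      return protocol !== 'http:' && protocol !== 'https:';
    } catch {
      return true; // not parseable as a URL at all
    }
  });
}

// Placeholder values, not real endpoints.
const options: CrawlerOptionsSketch = {
  maxRequestsPerCrawl: 200,
  proxyUrls: ['http://proxy-1.example.com:8000', 'http://proxy-2.example.com:8000'],
  startUrls: ['https://example.com/category/guitars'],
};

console.log(findInvalidUrls(options)); // prints []

// In the project these would presumably be passed next to the scraper,
// for example: new ProductCrawler({ scraper, ...options })
```

Setting `maxRequestsPerCrawl` to a small value is a convenient way to try out a new scraper without crawling an entire site.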

## License

ISC
