https://github.com/northys/unstoppable
High-performance, extensible web scraper for e-commerce product data
crawlee e-commerce product-data typescript web-scraping
Last synced: 10 months ago
- Host: GitHub
- URL: https://github.com/northys/unstoppable
- Owner: northys
- Created: 2025-07-30T18:33:13.000Z (10 months ago)
- Default Branch: main
- Last Pushed: 2025-07-30T18:48:55.000Z (10 months ago)
- Last Synced: 2025-07-30T21:08:40.506Z (10 months ago)
- Topics: crawlee, e-commerce, product-data, typescript, web-scraping
- Language: TypeScript
- Size: 102 KB
- Stars: 0
- Watchers: 0
- Forks: 0
- Open Issues: 0
Metadata Files:
- Readme: README.md
# Unstoppable - Product Data Scraper
A high-performance, extensible web scraper built with TypeScript and Crawlee for extracting product data from e-commerce websites.
## Features
- **Extensible Architecture**: Easy to add new website scrapers by extending the base scraper class
- **Parallel Processing**: Leverages Crawlee's concurrent crawling capabilities
- **TypeScript**: Full type safety and better development experience
- **Test-Driven**: Comprehensive test suite with HTML snapshots
- **Data Export**: Export scraped data in JSON or CSV formats
- **Proxy Support**: Built-in proxy rotation
## Project Structure
```
unstoppable/
├── src/
│ ├── crawlers/ # Crawler implementations
│ ├── scrapers/ # Website-specific scrapers
│ ├── models/ # Data models
│ ├── utils/ # Utility functions
│ └── tests/ # Test files and fixtures
├── dist/ # Compiled JavaScript output
└── storage/ # Crawlee data storage
```
## Installation
```bash
npm install
```
## Usage
### Product Scraping
#### Development
```bash
npm start
# or
npm run start:dev
```
#### Production
```bash
npm run build
npm run start:prod
```
### Category Extraction
Extract categories from e-commerce websites:
```bash
# Extract all categories from Thomann
npm run extract-categories
# Extract from specific URL
npm run extract-categories -- --url https://www.thomann.de/de/index.html
# Export as CSV
npm run extract-categories -- --format csv
# Build category tree structure
npm run extract-categories -- --tree
# Verbose logging
npm run extract-categories -- --verbose
```
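As a rough illustration of what a `--format csv` export might involve, the sketch below flattens category records into CSV text. The `CategoryRow` type and the `escapeCsvField` and `toCsv` functions are hypothetical and not taken from the project; the real column set and escaping rules may differ.

```typescript
// Hypothetical, simplified row shape; the project's Category model has more fields.
interface CategoryRow {
  name: string;
  url: string;
  code: string;
  level: number;
}

// Quote a field if it contains a comma, a double quote, or a line break (RFC 4180 style).
function escapeCsvField(value: string | number): string {
  const text = String(value);
  return /[",\r\n]/.test(text) ? `"${text.replace(/"/g, '""')}"` : text;
}

// Serialize rows to CSV text with a header line.
function toCsv(rows: CategoryRow[]): string {
  const columns: (keyof CategoryRow)[] = ['name', 'url', 'code', 'level'];
  const lines = [columns.join(',')];
  for (const row of rows) {
    lines.push(columns.map((column) => escapeCsvField(row[column])).join(','));
  }
  return lines.join('\n');
}
```

Quoting only the fields that need it keeps the output readable while still surviving category names that contain commas.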
### Testing
```bash
npm test
```
### Linting and Type Checking
```bash
npm run lint
npm run typecheck
```
## Adding a New Website Scraper
1. Create a new scraper class extending `BaseScraper`:
```typescript
import { BaseScraper } from './BaseScraper.js';

export class NewWebsiteScraper extends BaseScraper {
  constructor() {
    super({
      name: 'NewWebsite',
      baseUrl: 'https://example.com',
    });
  }

  // Implement required methods
  async scrapeProductList(url: string): Promise { /* ... */ }
  async scrapeProductDetail(url: string): Promise { /* ... */ }
  async getCategoryUrls(): Promise { /* ... */ }
  isProductUrl(url: string): boolean { /* ... */ }
  isCategoryUrl(url: string): boolean { /* ... */ }
}
```
2. Use the scraper with `ProductCrawler`:
```typescript
import { ProductCrawler } from './crawlers/ProductCrawler.js';
import { NewWebsiteScraper } from './scrapers/NewWebsiteScraper.js';

const scraper = new NewWebsiteScraper();
const crawler = new ProductCrawler({ scraper });
await crawler.run();
```
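The `isProductUrl` and `isCategoryUrl` predicates tell product pages apart from category pages. A minimal sketch of how they might be written is shown below as standalone functions. The path patterns (`/product/...` and `/category/...`), the `example.com` host, and the `sitePath` helper are invented for illustration and would need to match the target site's real URL scheme.

```typescript
// Hypothetical base URL and path patterns, for illustration only.
const BASE_URL = 'https://example.com';
const PRODUCT_PATH = /^\/product\/[\w-]+$/;
const CATEGORY_PATH = /^\/category\/[\w-]+(\/[\w-]+)*$/;

// Return the pathname of a URL only if it belongs to the target site.
// Relative URLs are resolved against BASE_URL.
function sitePath(url: string): string | null {
  try {
    const parsed = new URL(url, BASE_URL);
    return parsed.origin === BASE_URL ? parsed.pathname : null;
  } catch {
    return null; // Malformed URLs never match.
  }
}

function isProductUrl(url: string): boolean {
  const path = sitePath(url);
  return path !== null && PRODUCT_PATH.test(path);
}

function isCategoryUrl(url: string): boolean {
  const path = sitePath(url);
  return path !== null && CATEGORY_PATH.test(path);
}
```

Checking the origin before the path keeps links to other hosts from being mistaken for product or category pages.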
## Data Models
### Product Model
Products are scraped with the following structure:
```typescript
interface Product {
  id: string;
  sku?: string;
  name: string;
  brand?: string;
  category: string[];
  price: {
    currency: string;
    amount: number;
    originalAmount?: number;
    discount?: number;
  };
  availability: {
    inStock: boolean;
    stockLevel?: number;
    deliveryTime?: string;
  };
  images: string[];
  description?: string;
  specifications?: Record;
  ratings?: {
    average: number;
    count: number;
  };
  url: string;
  scrapedAt: Date;
  source: string;
}
```
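To make the `price` fields concrete, the sketch below derives `discount` from `amount` and `originalAmount`. It assumes `discount` is a whole-number percentage of the original price, which the README does not state; the `Price` type name and the `buildPrice` helper are hypothetical.

```typescript
// Mirrors the `price` field of the Product model above.
interface Price {
  currency: string;
  amount: number;
  originalAmount?: number;
  discount?: number;
}

// Hypothetical helper: fills in `discount` as a percentage of the original price.
// This is an assumption; the project may store an absolute amount instead.
function buildPrice(currency: string, amount: number, originalAmount?: number): Price {
  if (originalAmount === undefined || originalAmount <= amount) {
    return { currency, amount }; // No reduction to report.
  }
  const discount = Math.round(((originalAmount - amount) / originalAmount) * 100);
  return { currency, amount, originalAmount, discount };
}
```

For example, `buildPrice('EUR', 80, 100)` would carry both the original amount and a discount of 20, while `buildPrice('EUR', 100)` leaves the optional fields unset.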
### Category Model
Categories are extracted with the following structure:
```typescript
interface Category {
  name: string;
  url: string;
  code: string;
  parentCategory?: string;
  level: number;
  productCount?: number;
  subcategories?: Category[];
  scrapedAt: Date;
  source: string;
}
```
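The `parentCategory` and `subcategories` fields suggest that a flat list of categories can be nested into a tree, as the `--tree` flag does. The sketch below shows one way this might work. It assumes `parentCategory` stores the parent's `code`, which the README does not confirm, and `buildCategoryTree` is a hypothetical name.

```typescript
// Subset of the Category model above that the tree builder needs.
interface Category {
  name: string;
  code: string;
  parentCategory?: string;
  level: number;
  subcategories?: Category[];
}

// Nest a flat list into a tree.
// Assumes `parentCategory` holds the parent's `code` (not confirmed by the README).
function buildCategoryTree(flat: Category[]): Category[] {
  // Copy each node so the input array is not mutated.
  const byCode = new Map<string, Category>();
  for (const category of flat) {
    byCode.set(category.code, { ...category, subcategories: [] });
  }

  const roots: Category[] = [];
  byCode.forEach((node) => {
    const parent = node.parentCategory ? byCode.get(node.parentCategory) : undefined;
    if (parent) {
      const children = parent.subcategories ?? (parent.subcategories = []);
      children.push(node);
    } else {
      roots.push(node); // Top-level or orphaned categories become roots.
    }
  });
  return roots;
}
```

Indexing nodes by `code` first means parents and children can appear in any order in the input.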
## Configuration
Configure crawler behavior through `ProductCrawler` options:
- `maxRequestsPerCrawl`: Limit the number of requests
- `proxyUrls`: Array of proxy URLs for rotation
- `startUrls`: Override default category URLs
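The three options might be combined and validated as in the sketch below. The `CrawlerOptions` type and the `resolveOptions` helper are assumptions for illustration, and the real `ProductCrawler` constructor may accept a different shape or apply different defaults.

```typescript
// Hypothetical option shape based on the three options listed above.
interface CrawlerOptions {
  maxRequestsPerCrawl?: number;
  proxyUrls?: string[];
  startUrls?: string[];
}

// Hypothetical helper: validates the options and applies fallbacks.
function resolveOptions(options: CrawlerOptions, defaultStartUrls: string[]) {
  const { maxRequestsPerCrawl, proxyUrls = [], startUrls } = options;
  if (
    maxRequestsPerCrawl !== undefined &&
    (!Number.isInteger(maxRequestsPerCrawl) || maxRequestsPerCrawl <= 0)
  ) {
    throw new RangeError('maxRequestsPerCrawl must be a positive integer');
  }
  return {
    maxRequestsPerCrawl, // Left undefined when no limit was given.
    proxyUrls, // An empty list stands for "no proxy rotation" in this sketch.
    // A non-empty startUrls list overrides the default category URLs.
    startUrls: startUrls && startUrls.length > 0 ? startUrls : defaultStartUrls,
  };
}
```

A call such as `resolveOptions({ maxRequestsPerCrawl: 50 }, categoryUrls)` would keep the default category URLs and cap the crawl at 50 requests.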
## License
ISC