LLM format scraper, Python Firecrawl alternative.

Repository: https://github.com/jolovicdev/nexthread

# Nexthread 2.0 API Documentation

## Overview
Nexthread is a high-performance web scraping and crawling API that provides:
- Structured content extraction for AI and LLM processing
- Configurable content formats (markdown, HTML, text)
- Advanced web crawling with depth and domain filtering
- URL discovery and site mapping
- Comprehensive job management and history tracking

## API Endpoints

### Scraping Endpoints

#### 1. Start a Scraping Job
**Endpoint:** `POST /api/scrape/async`

**Example Request:**
```bash
curl -X POST "http://localhost:8000/api/scrape/async" \
-H "Content-Type: application/json" \
-d '{
"url": "https://quotes.toscrape.com/tag/humor/",
"formats": ["markdown", "text"],
"page_options": {
"extract_main_content": true,
"clean_markdown": true,
"exclude_tags": ["script", "style", "nav", "footer"]
}
}'
```
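
The async endpoint returns a job rather than the scraped content directly, so a client typically submits the request and then polls the job-status endpoint described under Job Management. Below is a minimal Python sketch of that flow; it assumes the POST response exposes the job id in an `id` field (mirroring the job-status example later in this document), which may not match the actual response shape.

```python
import time

import requests

BASE = "http://localhost:8000/api"

# Submit the scrape job (same payload as the curl example above).
resp = requests.post(f"{BASE}/scrape/async", json={
    "url": "https://quotes.toscrape.com/tag/humor/",
    "formats": ["markdown", "text"],
    "page_options": {"extract_main_content": True, "clean_markdown": True},
})
resp.raise_for_status()
job_id = resp.json()["id"]  # assumption: the POST response carries the job id as "id"

# Poll the job until it finishes (see "Job Management" below).
while True:
    job = requests.get(f"{BASE}/jobs/{job_id}").json()
    if job["status"] in ("completed", "failed"):
        break
    time.sleep(1)

print(job["status"])
```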

#### 2. Start Batch Scraping
**Endpoint:** `POST /api/scrape/batch`

**Example Request:**
```bash
curl -X POST "http://localhost:8000/api/scrape/batch" \
-H "Content-Type: application/json" \
-d '{
"urls": [
"https://quotes.toscrape.com/tag/humor/",
"https://quotes.toscrape.com/tag/life/"
],
"formats": ["markdown"],
"page_options": {
"extract_main_content": true,
"clean_markdown": true
}
}'
```
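
As with the single-URL endpoint, batch scraping runs asynchronously. A hedged Python sketch of submitting a batch and then listing recent jobs via the history endpoint (the response shapes are assumptions here, so the example just prints the raw JSON):

```python
import requests

BASE = "http://localhost:8000/api"

# Submit a batch of URLs to scrape.
batch = requests.post(f"{BASE}/scrape/batch", json={
    "urls": [
        "https://quotes.toscrape.com/tag/humor/",
        "https://quotes.toscrape.com/tag/life/",
    ],
    "formats": ["markdown"],
    "page_options": {"extract_main_content": True, "clean_markdown": True},
})
print(batch.json())

# List recent jobs to watch the batch items appear (see "Job Management").
print(requests.get(f"{BASE}/history", params={"limit": 5}).json())
```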

### Crawling Endpoints

#### 1. Start a Crawl Job
**Endpoint:** `POST /api/crawl/async`

**Example Request:**
```bash
curl -X POST "http://localhost:8000/api/crawl/async" \
-H "Content-Type: application/json" \
-d '{
"url": "https://quotes.toscrape.com",
"options": {
"max_depth": 2,
"max_pages": 10,
"formats": ["markdown"],
"exclude_paths": ["/login/*"],
"include_only_paths": ["/tag/*", "/author/*"],
"page_options": {
"extract_main_content": true,
"clean_markdown": true
}
}
}'
```
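
The `exclude_paths` and `include_only_paths` options take glob-style patterns matched against URL paths. Purely as an illustration (not Nexthread's actual matching code), the sketch below shows how the patterns from the request above would filter discovered URLs using simple fnmatch-style matching:

```python
from fnmatch import fnmatch
from urllib.parse import urlparse

include_only = ["/tag/*", "/author/*"]
exclude = ["/login/*"]

def allowed(url: str) -> bool:
    """Keep a URL only if it matches an include pattern and no exclude pattern."""
    path = urlparse(url).path
    if any(fnmatch(path, pat) for pat in exclude):
        return False
    return any(fnmatch(path, pat) for pat in include_only)

for u in [
    "https://quotes.toscrape.com/tag/humor/",
    "https://quotes.toscrape.com/author/Mark-Twain/",
    "https://quotes.toscrape.com/login/next",
    "https://quotes.toscrape.com/page/2/",
]:
    print(u, "->", "crawl" if allowed(u) else "skip")
```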

### URL Mapping Endpoints

#### 1. Map Website URLs
**Endpoint:** `POST /api/map`

**Example Request:**
```bash
curl -X POST "http://localhost:8000/api/map" \
-H "Content-Type: application/json" \
-d '{
"url": "https://quotes.toscrape.com",
"search": "humor",
"options": {
"exclude_paths": ["/login/*"],
"include_only_paths": ["/tag/*", "/author/*"],
"include_subdomains": false,
"allow_backwards": false
}
}'
```

### Job Management

#### 1. Check Job Status
**Endpoint:** `GET /api/jobs/{job_id}`

**Example Response:**
```json
{
  "id": "550e8400-e29b-41d4-a716-446655440000",
  "operation": "scrape",
  "status": "completed",
  "url": "https://quotes.toscrape.com/tag/humor/",
  "created_at": "2025-02-15T01:54:43.260053",
  "completed_at": "2025-02-15T01:54:44.012756",
  "result": {
    "metadata_content": {
      "title": "Quotes to Scrape",
      "language": "en"
    },
    "content": {
      "markdown": "..."
    }
  }
}
```

#### 2. View Job History
**Endpoint:** `GET /api/history`

**Parameters:**
- `limit`: Number of results (default: 20)
- `offset`: Pagination offset (default: 0)
- `status`: Filter by status
- `days`: Filter by last N days
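
For example, to fetch the 10 most recent failed jobs from the last 7 days (assuming the filters are passed as ordinary query-string parameters):

```python
import requests

resp = requests.get(
    "http://localhost:8000/api/history",
    params={"limit": 10, "status": "failed", "days": 7},
)
print(resp.json())
```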

## Configuration Options

### Page Options
| Option | Type | Default | Description |
|---------------------|---------|---------|--------------------------------------------------|
| extract_main_content | boolean | true | Extract main content using readability algorithm |
| clean_markdown | boolean | true | Apply additional markdown cleaning rules |
| include_links | boolean | false | Include all links in the results |
| structured_json | boolean | false | Structure content as JSON where possible |
| use_browser | boolean | false | Use headless browser for JavaScript rendering |
| wait_for | integer | null | Milliseconds to wait after page load |
| exclude_tags | array | [] | HTML tags/selectors to exclude |
| max_retries | integer | 3 | Maximum retry attempts for failed requests |
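
A sketch combining several of these options in one scrape request; the values are illustrative, and `use_browser`/`wait_for` are only worth enabling for pages that need JavaScript rendering:

```python
import requests

resp = requests.post("http://localhost:8000/api/scrape/async", json={
    "url": "https://quotes.toscrape.com/js/",  # JavaScript-rendered variant of the demo site
    "formats": ["markdown"],
    "page_options": {
        "extract_main_content": True,
        "clean_markdown": True,
        "include_links": True,
        "use_browser": True,   # render the page in a headless browser
        "wait_for": 2000,      # wait 2 s after page load before extracting
        "exclude_tags": ["script", "style", "nav", "footer"],
        "max_retries": 2,
    },
})
print(resp.json())
```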

### Markdown Cleaning
When `clean_markdown: true` is set, the following rules are applied:
- Remove redundant whitespace
- Normalize header levels
- Clean list formatting
- Remove HTML comments
- Normalize line endings
- Add proper spacing around headers and lists
- Remove duplicate blank lines
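
Purely as an illustration of the kind of normalization these rules describe (not the actual implementation), a few of them can be expressed as simple text transforms:

```python
import re

def tidy(markdown: str) -> str:
    # Normalize line endings.
    text = markdown.replace("\r\n", "\n")
    # Remove HTML comments.
    text = re.sub(r"<!--.*?-->", "", text, flags=re.DOTALL)
    # Strip redundant trailing whitespace on each line.
    text = re.sub(r"[ \t]+$", "", text, flags=re.MULTILINE)
    # Collapse runs of blank lines into a single blank line.
    text = re.sub(r"\n{3,}", "\n\n", text)
    return text.strip() + "\n"

print(tidy("# Title\r\n\r\n\r\n<!-- nav -->\r\nSome text   \r\n"))
```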

## Self-Hosting Guide

### Using Docker

1. Clone the repository:
```bash
git clone https://github.com/jolovicdev/Nexthread.git
cd Nexthread
```

2. Create a `.env` file with your settings:
```bash
# Database configuration
DATABASE_URL=sqlite:///./data/scraper.db

# API settings
MAX_CONCURRENT_JOBS=4
RATE_LIMIT=100

# Scraping settings
DEFAULT_TIMEOUT=30
```

3. Build and run with Docker:
```bash
# Build the image (this may take a few minutes to install dependencies)
docker build -t nt-scraper .

# Create data directory for SQLite database
mkdir -p data && chmod 777 data

# Run the container
docker run -d \
  --name nt-scraper \
  -p 8000:8000 \
  -v $(pwd)/data:/app/data \
  --restart unless-stopped \
  nt-scraper

# Check logs
docker logs -f nt-scraper
```

Note: The Docker image includes all necessary dependencies for headless browser support.
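
Once the container reports as running, you can sanity-check it by requesting the docs page (the same `/api/docs` route noted under Manual Installation; adjust host and port if you mapped them differently):

```python
import requests

resp = requests.get("http://localhost:8000/api/docs", timeout=10)
print(resp.status_code)  # expect 200 if the service is up
```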

### Manual Installation

1. Install dependencies:
```bash
pip install -r requirements.txt
```

2. Install Playwright browser:
```bash
playwright install chromium
```

3. Start the server:
```bash
uvicorn main:app --reload
```

The API will be available at http://localhost:8000/api, and the documentation at http://localhost:8000/api/docs.

Note: this project is in early development. I haven't read Firecrawl's full source code to see how they do things; I just saw their https://www.firecrawl.dev/app endpoints and made a wild guess. Many features aren't implemented yet.