https://github.com/himashaherath/webextract
https://github.com/himashaherath/webextract
ai json langchain llm scraper web webscraper webscraping
Last synced: about 2 months ago
JSON representation
- Host: GitHub
- URL: https://github.com/himashaherath/webextract
- Owner: HimashaHerath
- License: mit
- Created: 2025-06-11T16:13:32.000Z (about 1 year ago)
- Default Branch: main
- Last Pushed: 2025-06-11T16:16:08.000Z (about 1 year ago)
- Last Synced: 2025-06-11T19:09:21.840Z (about 1 year ago)
- Topics: ai, json, langchain, llm, scraper, web, webscraper, webscraping
- Language: Python
- Homepage:
- Size: 21.5 KB
- Stars: 0
- Watchers: 0
- Forks: 0
- Open Issues: 0
-
Metadata Files:
- Readme: README.md
- License: LICENSE
Awesome Lists containing this project
README
# ๐ค LLM WebExtract
> **AI-Powered Web Content Extraction** - Turn any website into structured data using Large Language Models
[](https://badge.fury.io/py/llm-webextract)
[](https://www.python.org/downloads/)
[](https://opensource.org/licenses/MIT)
Ever wanted to extract meaningful information from websites but got tired of parsing HTML and dealing with messy data? **LLM WebExtract** combines modern web scraping with Large Language Models to intelligently extract structured information from any webpage.
## ๐ฏ What Does This Do?
Instead of writing complex parsing rules for every website, this tool:
1. **๐ Scrapes webpages** using Playwright (handles modern JavaScript sites)
2. **๐ง Feeds content to AI** (local via Ollama, or cloud via OpenAI/Anthropic)
3. **๐ Returns structured data** - topics, entities, summaries, key facts, and more
Think of it as having an AI assistant that reads web pages and summarizes them for you.
## โญ Key Features
- **๐ Multi-Provider Support**: Works with Ollama (local), OpenAI, and Anthropic
- **๐ Modern Web Scraping**: Handles JavaScript-heavy sites with Playwright
- **๐ Pre-built Profiles**: Ready configurations for news, research, e-commerce
- **๐ก๏ธ Robust Error Handling**: Specific exceptions for different failure types
- **โก Batch Processing**: Extract from multiple URLs concurrently
- **๐๏ธ Flexible Configuration**: Environment variables, custom prompts, schemas
- **๐พ Smart Caching**: Avoid re-processing the same URLs
## ๐ Quick Start
### Installation
```bash
# Basic installation
pip install llm-webextract
playwright install chromium
# With cloud providers
pip install llm-webextract[openai] # For GPT models
pip install llm-webextract[anthropic] # For Claude models
pip install llm-webextract[all] # Everything
```
### 30-Second Example
```bash
# Command line (requires local Ollama)
llm-webextract extract "https://news.ycombinator.com"
# Test your setup
llm-webextract test
```
```python
# Python - Local Ollama
import webextract
result = webextract.quick_extract("https://techcrunch.com")
print(f"Summary: {result.get_summary()}")
print(f"Topics: {result.get_topics()}")
# Or use the dedicated Ollama function
result = webextract.extract_with_ollama("https://techcrunch.com", model="llama3.2")
```
## ๐ ๏ธ Configuration & Usage
### Provider Setup
#### ๐ Local with Ollama (Free & Private)
```python
from webextract import WebExtractor, ConfigBuilder, extract_with_ollama
# Using ConfigBuilder
extractor = WebExtractor(
ConfigBuilder()
.with_ollama("llama3.2") # or any model you have
.build()
)
result = extractor.extract("https://example.com")
# Quick one-liner
result = extract_with_ollama("https://example.com", model="llama3.2")
```
#### โ๏ธ OpenAI GPT
```python
from webextract import extract_with_openai
# Quick one-liner
result = extract_with_openai("https://example.com", api_key="sk-...", model="gpt-4o-mini")
# Using ConfigBuilder
extractor = WebExtractor(
ConfigBuilder()
.with_openai(api_key="sk-...", model="gpt-4o-mini")
.build()
)
```
#### ๐ง Anthropic Claude
```python
from webextract import extract_with_anthropic
# Quick one-liner
result = extract_with_anthropic("https://example.com", api_key="sk-ant-...", model="claude-3-5-sonnet-20241022")
# Using ConfigBuilder
extractor = WebExtractor(
ConfigBuilder()
.with_anthropic(api_key="sk-ant-...", model="claude-3-5-sonnet-20241022")
.build()
)
```
### Pre-built Profiles
```python
from webextract import ConfigProfiles, WebExtractor
# Optimized for different content types
news_extractor = WebExtractor(ConfigProfiles.news_scraping())
research_extractor = WebExtractor(ConfigProfiles.research_papers())
shop_extractor = WebExtractor(ConfigProfiles.ecommerce())
```
### Environment Variables
Set defaults to avoid repeating configuration:
```bash
export WEBEXTRACT_LLM_PROVIDER="openai"
export WEBEXTRACT_MODEL="gpt-4o-mini"
export WEBEXTRACT_API_KEY="sk-your-key"
export WEBEXTRACT_MAX_CONTENT="8000"
export WEBEXTRACT_REQUEST_TIMEOUT="45"
```
## ๐ What You Get Back
The AI analyzes content and returns structured data:
```json
{
"summary": "Article discusses the latest developments in AI technology...",
"topics": ["artificial intelligence", "machine learning", "tech industry"],
"entities": {
"people": ["Sam Altman", "Satya Nadella"],
"organizations": ["OpenAI", "Microsoft", "Google"],
"locations": ["San Francisco", "Silicon Valley"]
},
"sentiment": "positive",
"key_facts": [
"New model shows 40% improvement in reasoning",
"Beta testing starts next month",
"Open source version planned for 2024"
],
"category": "technology",
"important_dates": ["2024-03-15", "Q2 2024"],
"statistics": ["40% improvement", "$10B investment"],
"confidence": 0.89
}
```
## ๐ง Advanced Usage
### Custom Extraction Schema
```python
schema = {
"product_name": "Extract the main product name",
"price": "Extract the current price",
"rating": "Extract average rating (number only)",
"reviews_count": "Extract total number of reviews",
"key_features": "List main product features"
}
result = extractor.extract_with_custom_schema(
"https://amazon.com/product/...",
schema
)
```
### Batch Processing
```python
urls = [
"https://techcrunch.com/article1",
"https://venturebeat.com/article2",
"https://theverge.com/article3"
]
results = extractor.extract_batch(urls, max_workers=3)
for result in results:
if result and result.is_successful:
print(f"{result.url}: {result.get_summary()}")
```
### Error Handling
```python
from webextract import (
WebExtractor,
ExtractionError,
ScrapingError,
LLMError,
AuthenticationError
)
try:
result = extractor.extract("https://problematic-site.com")
except AuthenticationError:
print("Invalid API key")
except ScrapingError as e:
print(f"Failed to scrape website: {e}")
except LLMError as e:
print(f"AI processing failed: {e}")
except ExtractionError as e:
print(f"General extraction error: {e}")
```
### Custom Prompts
```python
config = (ConfigBuilder()
.with_openai("sk-...", "gpt-4")
.with_custom_prompt("""
Focus on extracting:
1. Financial metrics and numbers
2. Company performance indicators
3. Market trends and predictions
4. Executive quotes and statements
""")
.build())
```
## ๐๏ธ How It Works
```mermaid
graph LR
A[URL] --> B[Playwright Scraper]
B --> C[Content Cleaning]
C --> D[LLM Processing]
D --> E[Structured Data]
B --> F[JavaScript Handling]
C --> G[Ad/Nav Removal]
D --> H[JSON Validation]
E --> I[Confidence Scoring]
```
1. **Modern Web Scraping**: Playwright handles JavaScript, SPAs, and modern websites
2. **Intelligent Content Processing**: Removes ads, navigation, focuses on main content
3. **AI Analysis**: Your chosen LLM extracts structured information
4. **Quality Assurance**: Validates output format and calculates confidence scores
## ๐ก๏ธ Requirements
- **Python 3.8+**
- **One of:**
- **Ollama** running locally (free, private)
- **OpenAI API key** (paid, powerful)
- **Anthropic API key** (paid, great reasoning)
### Installing Ollama (Recommended for beginners)
```bash
# Install Ollama
curl -fsSL https://ollama.ai/install.sh | sh
# Pull a model
ollama pull llama3.2
# Start the service
ollama serve
```
## ๐ฏ Use Cases
- **๐ฐ News Monitoring**: Extract key information from news articles
- **๐ฌ Research**: Process academic papers and technical documents
- **๐ E-commerce**: Monitor product prices, reviews, specifications
- **๐ Market Research**: Analyze competitor websites and industry trends
- **๐ Content Curation**: Summarize and categorize web content
- **๐ค AI Training**: Generate structured datasets from web content
## ๐งช Testing Your Setup
```bash
# Test connection and model availability
llm-webextract test
# Test with a specific URL
llm-webextract extract "https://example.com" --format pretty
# Check available providers
python -c "
from webextract.core.llm_factory import get_available_providers
import json
print(json.dumps(get_available_providers(), indent=2))
"
```
## ๐ค Contributing
We welcome contributions! Here's how to get started:
### For Contributors
- ๐ Read our [Development Guide](DEVELOPMENT.md) for commit conventions and processes
- ๐ Report bugs by opening an issue with detailed reproduction steps
- ๐ก Suggest features through GitHub discussions
- ๐ง Submit PRs following our coding standards
### Quick Start for Development
```bash
# Fork and clone
git clone https://github.com/HimashaHerath/webextract.git
cd webextract
# Install in development mode
pip install -e ".[dev]"
# Run tests and quality checks
python -m pytest
python -m black --check .
python -m flake8 --config .flake8
```
## ๐ Troubleshooting
### Common Issues
**"Model not available"**
```bash
# Check if Ollama is running
curl http://localhost:11434/api/tags
# Pull the model if missing
ollama pull llama3.2
```
**"Connection refused"**
- Ensure Ollama is running: `ollama serve`
- Check firewall settings
- Verify the base URL in configuration
**"Rate limit exceeded"**
- Add delays between requests
- Use batch processing with lower concurrency
- Check your API plan limits
**"Content too short"**
- Site might be blocking scrapers
- Try different user agents
- Check if site requires JavaScript (we handle this)
## ๐ License
MIT License - feel free to use this in your projects!
## ๐ Acknowledgments
Built with these amazing tools:
- [Ollama](https://ollama.ai/) - Local LLM inference
- [Playwright](https://playwright.dev/) - Modern web scraping
- [Beautiful Soup](https://www.crummy.com/software/BeautifulSoup/) - HTML parsing
- [Pydantic](https://pydantic.dev/) - Data validation
- [Typer](https://typer.tiangolo.com/) - CLI framework
## ๐ Support
- **๐ซ Email**: [himasha626@gmail.com](mailto:himasha626@gmail.com)
- **๐ Issues**: [GitHub Issues](https://github.com/HimashaHerath/webextract/issues)
- **๐ฌ Discussions**: [GitHub Discussions](https://github.com/HimashaHerath/webextract/discussions)
---
**Got questions?** Open an issue - I'm happy to help!
**Find this useful?** Give it a โญ - it really helps!