{"id":28719925,"url":"https://github.com/himashaherath/webextract","last_synced_at":"2026-05-08T15:44:06.640Z","repository":{"id":298554070,"uuid":"1000338400","full_name":"HimashaHerath/webextract","owner":"HimashaHerath","description":null,"archived":false,"fork":false,"pushed_at":"2025-06-11T16:16:08.000Z","size":22,"stargazers_count":0,"open_issues_count":0,"forks_count":0,"subscribers_count":0,"default_branch":"main","last_synced_at":"2025-06-11T19:09:21.840Z","etag":null,"topics":["ai","json","langchain","llm","scraper","web","webscraper","webscraping"],"latest_commit_sha":null,"homepage":"","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/HimashaHerath.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null,"zenodo":null}},"created_at":"2025-06-11T16:13:32.000Z","updated_at":"2025-06-11T16:17:21.000Z","dependencies_parsed_at":"2025-06-11T19:09:25.723Z","dependency_job_id":"57299285-4176-40f3-b81b-1ed295544bae","html_url":"https://github.com/HimashaHerath/webextract","commit_stats":null,"previous_names":["himashaherath/webextract"],"tags_count":0,"template":false,"template_full_name":null,"purl":"pkg:github/HimashaHerath/webextract","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/HimashaHerath%2Fwebextract","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/HimashaHerath%2Fwebextract/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/HimashaHerath%2Fwebextract/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/HimashaH
erath%2Fwebextract/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/HimashaHerath","download_url":"https://codeload.github.com/HimashaHerath/webextract/tar.gz/refs/heads/main","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/HimashaHerath%2Fwebextract/sbom","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":259929973,"owners_count":22933536,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["ai","json","langchain","llm","scraper","web","webscraper","webscraping"],"created_at":"2025-06-15T06:06:09.368Z","updated_at":"2026-05-08T15:44:06.623Z","avatar_url":"https://github.com/HimashaHerath.png","language":"Python","funding_links":[],"categories":[],"sub_categories":[],"readme":"# 🤖 LLM WebExtract\n\n\u003e **AI-Powered Web Content Extraction** - Turn any website into structured data using Large Language Models\n\n[![PyPI version](https://badge.fury.io/py/llm-webextract.svg)](https://badge.fury.io/py/llm-webextract)\n[![Python 3.8+](https://img.shields.io/badge/python-3.8+-blue.svg)](https://www.python.org/downloads/)\n[![License: MIT](https://img.shields.io/badge/License-MIT-yellow.svg)](https://opensource.org/licenses/MIT)\n\nEver wanted to extract meaningful information from websites but got tired of parsing HTML and dealing with messy data? **LLM WebExtract** combines modern web scraping with Large Language Models to intelligently extract structured information from any webpage.\n\n## 🎯 What Does This Do?\n\nInstead of writing complex parsing rules for every website, this tool:\n\n1. 
**🌐 Scrapes webpages** using Playwright (handles modern JavaScript sites)\n2. **🧠 Feeds content to AI** (local via Ollama, or cloud via OpenAI/Anthropic)\n3. **📊 Returns structured data** - topics, entities, summaries, key facts, and more\n\nThink of it as having an AI assistant that reads web pages and summarizes them for you.\n\n## ⭐ Key Features\n\n- **🔄 Multi-Provider Support**: Works with Ollama (local), OpenAI, and Anthropic\n- **🚀 Modern Web Scraping**: Handles JavaScript-heavy sites with Playwright\n- **📋 Pre-built Profiles**: Ready configurations for news, research, e-commerce\n- **🛡️ Robust Error Handling**: Specific exceptions for different failure types\n- **⚡ Batch Processing**: Extract from multiple URLs concurrently\n- **🎛️ Flexible Configuration**: Environment variables, custom prompts, schemas\n- **💾 Smart Caching**: Avoid re-processing the same URLs\n\n## 🚀 Quick Start\n\n### Installation\n\n```bash\n# Basic installation\npip install llm-webextract\nplaywright install chromium\n\n# With cloud providers\npip install llm-webextract[openai]     # For GPT models\npip install llm-webextract[anthropic]  # For Claude models\npip install llm-webextract[all]        # Everything\n```\n\n### 30-Second Example\n\n```bash\n# Command line (requires local Ollama)\nllm-webextract extract \"https://news.ycombinator.com\"\n\n# Test your setup\nllm-webextract test\n```\n\n```python\n# Python - Local Ollama\nimport webextract\n\nresult = webextract.quick_extract(\"https://techcrunch.com\")\nprint(f\"Summary: {result.get_summary()}\")\nprint(f\"Topics: {result.get_topics()}\")\n\n# Or use the dedicated Ollama function\nresult = webextract.extract_with_ollama(\"https://techcrunch.com\", model=\"llama3.2\")\n```\n\n## 🛠️ Configuration \u0026 Usage\n\n### Provider Setup\n\n#### 🏠 Local with Ollama (Free \u0026 Private)\n```python\nfrom webextract import WebExtractor, ConfigBuilder, extract_with_ollama\n\n# Using ConfigBuilder\nextractor = WebExtractor(\n    
ConfigBuilder()\n    .with_ollama(\"llama3.2\")  # or any model you have\n    .build()\n)\n\nresult = extractor.extract(\"https://example.com\")\n\n# Quick one-liner\nresult = extract_with_ollama(\"https://example.com\", model=\"llama3.2\")\n```\n\n#### ☁️ OpenAI GPT\n```python\nfrom webextract import WebExtractor, ConfigBuilder, extract_with_openai\n\n# Quick one-liner\nresult = extract_with_openai(\"https://example.com\", api_key=\"sk-...\", model=\"gpt-4o-mini\")\n\n# Using ConfigBuilder\nextractor = WebExtractor(\n    ConfigBuilder()\n    .with_openai(api_key=\"sk-...\", model=\"gpt-4o-mini\")\n    .build()\n)\n```\n\n#### 🧠 Anthropic Claude\n```python\nfrom webextract import WebExtractor, ConfigBuilder, extract_with_anthropic\n\n# Quick one-liner\nresult = extract_with_anthropic(\"https://example.com\", api_key=\"sk-ant-...\", model=\"claude-3-5-sonnet-20241022\")\n\n# Using ConfigBuilder\nextractor = WebExtractor(\n    ConfigBuilder()\n    .with_anthropic(api_key=\"sk-ant-...\", model=\"claude-3-5-sonnet-20241022\")\n    .build()\n)\n```\n\n### Pre-built Profiles\n\n```python\nfrom webextract import ConfigProfiles, WebExtractor\n\n# Optimized for different content types\nnews_extractor = WebExtractor(ConfigProfiles.news_scraping())\nresearch_extractor = WebExtractor(ConfigProfiles.research_papers())\nshop_extractor = WebExtractor(ConfigProfiles.ecommerce())\n```\n\n### Environment Variables\n\nSet defaults to avoid repeating configuration:\n\n```bash\nexport WEBEXTRACT_LLM_PROVIDER=\"openai\"\nexport WEBEXTRACT_MODEL=\"gpt-4o-mini\"\nexport WEBEXTRACT_API_KEY=\"sk-your-key\"\nexport WEBEXTRACT_MAX_CONTENT=\"8000\"\nexport WEBEXTRACT_REQUEST_TIMEOUT=\"45\"\n```\n\n## 📊 What You Get Back\n\nThe AI analyzes content and returns structured data:\n\n```json\n{\n  \"summary\": \"Article discusses the latest developments in AI technology...\",\n  \"topics\": [\"artificial intelligence\", \"machine learning\", \"tech industry\"],\n  \"entities\": {\n    \"people\": [\"Sam Altman\", \"Satya Nadella\"],\n    \"organizations\": 
[\"OpenAI\", \"Microsoft\", \"Google\"],\n    \"locations\": [\"San Francisco\", \"Silicon Valley\"]\n  },\n  \"sentiment\": \"positive\",\n  \"key_facts\": [\n    \"New model shows 40% improvement in reasoning\",\n    \"Beta testing starts next month\",\n    \"Open source version planned for 2024\"\n  ],\n  \"category\": \"technology\",\n  \"important_dates\": [\"2024-03-15\", \"Q2 2024\"],\n  \"statistics\": [\"40% improvement\", \"$10B investment\"],\n  \"confidence\": 0.89\n}\n```\n\n## 🔧 Advanced Usage\n\n### Custom Extraction Schema\n\n```python\nschema = {\n    \"product_name\": \"Extract the main product name\",\n    \"price\": \"Extract the current price\",\n    \"rating\": \"Extract average rating (number only)\",\n    \"reviews_count\": \"Extract total number of reviews\",\n    \"key_features\": \"List main product features\"\n}\n\nresult = extractor.extract_with_custom_schema(\n    \"https://amazon.com/product/...\",\n    schema\n)\n```\n\n### Batch Processing\n\n```python\nurls = [\n    \"https://techcrunch.com/article1\",\n    \"https://venturebeat.com/article2\",\n    \"https://theverge.com/article3\"\n]\n\nresults = extractor.extract_batch(urls, max_workers=3)\nfor result in results:\n    if result and result.is_successful:\n        print(f\"{result.url}: {result.get_summary()}\")\n```\n\n### Error Handling\n\n```python\nfrom webextract import (\n    WebExtractor,\n    ExtractionError,\n    ScrapingError,\n    LLMError,\n    AuthenticationError\n)\n\ntry:\n    result = extractor.extract(\"https://problematic-site.com\")\nexcept AuthenticationError:\n    print(\"Invalid API key\")\nexcept ScrapingError as e:\n    print(f\"Failed to scrape website: {e}\")\nexcept LLMError as e:\n    print(f\"AI processing failed: {e}\")\nexcept ExtractionError as e:\n    print(f\"General extraction error: {e}\")\n```\n\n### Custom Prompts\n\n```python\nconfig = (ConfigBuilder()\n    .with_openai(\"sk-...\", \"gpt-4\")\n    .with_custom_prompt(\"\"\"\n        Focus on 
extracting:\n        1. Financial metrics and numbers\n        2. Company performance indicators\n        3. Market trends and predictions\n        4. Executive quotes and statements\n    \"\"\")\n    .build())\n```\n\n## 🏗️ How It Works\n\n```mermaid\ngraph LR\n    A[URL] --\u003e B[Playwright Scraper]\n    B --\u003e C[Content Cleaning]\n    C --\u003e D[LLM Processing]\n    D --\u003e E[Structured Data]\n\n    B --\u003e F[JavaScript Handling]\n    C --\u003e G[Ad/Nav Removal]\n    D --\u003e H[JSON Validation]\n    E --\u003e I[Confidence Scoring]\n```\n\n1. **Modern Web Scraping**: Playwright handles JavaScript, SPAs, and modern websites\n2. **Intelligent Content Processing**: Removes ads, navigation, focuses on main content\n3. **AI Analysis**: Your chosen LLM extracts structured information\n4. **Quality Assurance**: Validates output format and calculates confidence scores\n\n## 🛡️ Requirements\n\n- **Python 3.8+**\n- **One of:**\n  - **Ollama** running locally (free, private)\n  - **OpenAI API key** (paid, powerful)\n  - **Anthropic API key** (paid, great reasoning)\n\n### Installing Ollama (Recommended for beginners)\n\n```bash\n# Install Ollama\ncurl -fsSL https://ollama.ai/install.sh | sh\n\n# Pull a model\nollama pull llama3.2\n\n# Start the service\nollama serve\n```\n\n## 🎯 Use Cases\n\n- **📰 News Monitoring**: Extract key information from news articles\n- **🔬 Research**: Process academic papers and technical documents\n- **🛒 E-commerce**: Monitor product prices, reviews, specifications\n- **📈 Market Research**: Analyze competitor websites and industry trends\n- **📋 Content Curation**: Summarize and categorize web content\n- **🤖 AI Training**: Generate structured datasets from web content\n\n## 🧪 Testing Your Setup\n\n```bash\n# Test connection and model availability\nllm-webextract test\n\n# Test with a specific URL\nllm-webextract extract \"https://example.com\" --format pretty\n\n# Check available providers\npython -c \"\nfrom 
webextract.core.llm_factory import get_available_providers\nimport json\nprint(json.dumps(get_available_providers(), indent=2))\n\"\n```\n\n## 🤝 Contributing\n\nWe welcome contributions! Here's how to get started:\n\n### For Contributors\n\n- 📖 Read our [Development Guide](DEVELOPMENT.md) for commit conventions and processes\n- 🐛 Report bugs by opening an issue with detailed reproduction steps\n- 💡 Suggest features through GitHub discussions\n- 🔧 Submit PRs following our coding standards\n\n### Quick Start for Development\n\n```bash\n# Fork and clone\ngit clone https://github.com/HimashaHerath/webextract.git\ncd webextract\n\n# Install in development mode\npip install -e \".[dev]\"\n\n# Run tests and quality checks\npython -m pytest\npython -m black --check .\npython -m flake8 --config .flake8\n```\n\n## 🔍 Troubleshooting\n\n### Common Issues\n\n**\"Model not available\"**\n```bash\n# Check if Ollama is running\ncurl http://localhost:11434/api/tags\n\n# Pull the model if missing\nollama pull llama3.2\n```\n\n**\"Connection refused\"**\n- Ensure Ollama is running: `ollama serve`\n- Check firewall settings\n- Verify the base URL in configuration\n\n**\"Rate limit exceeded\"**\n- Add delays between requests\n- Use batch processing with lower concurrency\n- Check your API plan limits\n\n**\"Content too short\"**\n- Site might be blocking scrapers\n- Try different user agents\n- Check if site requires JavaScript (we handle this)\n\n## 📄 License\n\nMIT License - feel free to use this in your projects!\n\n## 🙏 Acknowledgments\n\nBuilt with these amazing tools:\n- [Ollama](https://ollama.ai/) - Local LLM inference\n- [Playwright](https://playwright.dev/) - Modern web scraping\n- [Beautiful Soup](https://www.crummy.com/software/BeautifulSoup/) - HTML parsing\n- [Pydantic](https://pydantic.dev/) - Data validation\n- [Typer](https://typer.tiangolo.com/) - CLI framework\n\n## 📞 Support\n\n- **📫 Email**: [himasha626@gmail.com](mailto:himasha626@gmail.com)\n- **🐛 Issues**: 
[GitHub Issues](https://github.com/HimashaHerath/webextract/issues)\n- **💬 Discussions**: [GitHub Discussions](https://github.com/HimashaHerath/webextract/discussions)\n\n---\n\n**Got questions?** Open an issue - I'm happy to help!\n**Find this useful?** Give it a ⭐ - it really helps!\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fhimashaherath%2Fwebextract","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fhimashaherath%2Fwebextract","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fhimashaherath%2Fwebextract/lists"}