{"id":43013557,"url":"https://github.com/atasoglu/websense","last_synced_at":"2026-01-31T05:39:56.472Z","repository":{"id":334975777,"uuid":"1143608049","full_name":"atasoglu/websense","owner":"atasoglu","description":"A modular AI-powered web scraper for data pipelines.","archived":false,"fork":false,"pushed_at":"2026-01-27T21:02:49.000Z","size":98,"stargazers_count":0,"open_issues_count":0,"forks_count":0,"subscribers_count":0,"default_branch":"main","last_synced_at":"2026-01-28T07:26:54.718Z","etag":null,"topics":["ai","automation","crawler","data-extraction","llm","parsing","scraper","structured-output","web-scraping"],"latest_commit_sha":null,"homepage":"https://pypi.org/project/websense","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/atasoglu.png","metadata":{"files":{"readme":"README.md","changelog":"CHANGELOG.md","contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null,"zenodo":null,"notice":null,"maintainers":null,"copyright":null,"agents":null,"dco":null,"cla":null}},"created_at":"2026-01-27T19:40:01.000Z","updated_at":"2026-01-27T21:02:52.000Z","dependencies_parsed_at":null,"dependency_job_id":null,"html_url":"https://github.com/atasoglu/websense","commit_stats":null,"previous_names":["atasoglu/websense"],"tags_count":2,"template":false,"template_full_name":null,"purl":"pkg:github/atasoglu/websense","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/atasoglu%2Fwebsense","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/atasoglu%2Fwebsense/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/atasoglu%2Fwebsense/rel
eases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/atasoglu%2Fwebsense/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/atasoglu","download_url":"https://codeload.github.com/atasoglu/websense/tar.gz/refs/heads/main","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/atasoglu%2Fwebsense/sbom","scorecard":null,"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":286080680,"owners_count":28930571,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2026-01-31T04:05:25.756Z","status":"ssl_error","status_checked_at":"2026-01-31T04:02:35.005Z","response_time":128,"last_error":"SSL_read: unexpected eof while reading","robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":false,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["ai","automation","crawler","data-extraction","llm","parsing","scraper","structured-output","web-scraping"],"created_at":"2026-01-31T05:39:55.866Z","updated_at":"2026-01-31T05:39:56.460Z","avatar_url":"https://github.com/atasoglu.png","language":"Python","funding_links":[],"categories":[],"sub_categories":[],"readme":"# WebSense\n\n[![CI](https://github.com/atasoglu/websense/actions/workflows/test.yml/badge.svg)](https://github.com/atasoglu/websense/actions/workflows/test.yml)\n[![PyPI version](https://img.shields.io/pypi/v/websense)](https://pypi.org/project/websense/)\n[![Python 3.10+](https://img.shields.io/badge/python-3.10%2B-blue)](https://www.python.org/downloads/)\n[![License: 
MIT](https://img.shields.io/badge/License-MIT-yellow.svg)](https://opensource.org/licenses/MIT)\n\n\u003e **\"Making sense of the web.\"**\n\nWebSense is a Python library that transforms raw websites into structured, meaningful data. It leverages AI through the [ask2api](https://github.com/atasoglu/ask2api) library to semantically understand page content, allowing you to extract complex data structures without writing brittle CSS selectors or XPath expressions.\n\n## Features\n\n- **Semantic Understanding**: Uses LLMs to interpret content meaning, not just match patterns\n- **Resilient**: Adapts to layout changes—if the meaning is there, WebSense finds it\n- **Minimalist API**: Extract data in 3 lines of code\n- **Auto-Cleaning**: Intelligent noise removal filters focus on meaningful content\n- **Flexible Schemas**: Use JSON schemas or provide examples for schema inference\n- **Web Search Integration**: Search the web and scrape top results in one go\n- **Multi-Source Consolidation**: Aggregate information from multiple websites into one structured result\n- **Modular Design**: Fetch, search, clean, and parse stages can be customized independently\n\n## Installation\n\n```bash\npip install websense\n```\n\nFor development:\n\n```bash\ngit clone https://github.com/atasoglu/websense.git\ncd websense\npip install -e \".[dev]\"\n```\n\n## Quick Start\n\nExtract data with just an example:\n\n```python\nfrom websense import Scraper\n\nscraper = Scraper()\n\ndata = scraper.scrape(\n    \"https://github.com/atasoglu/ask2api\",\n    example={\n        \"project_name\": \"string\",\n        \"description\": \"string\",\n        \"stars\": 0,\n        \"is_active\": True\n    }\n)\n\nprint(data)\n```\n\nYou can provide a strict JSON schema for validation:\n\n```python\nschema = {\n    \"type\": \"object\",\n    \"properties\": {\n        \"title\": {\"type\": \"string\"},\n        \"price\": {\"type\": \"number\"},\n        \"in_stock\": {\"type\": \"boolean\"}\n    },\n    
\"required\": [\"title\", \"price\"]\n}\n\ndata = scraper.scrape(\"https://example.com/product\", schema=schema)\n```\n\nSpecify a different language model for extraction:\n\n```python\nscraper = Scraper(model=\"gpt-4\")\n```\n\n### Web Search \u0026 Consolidation\n\nSearch the web and consolidate information from the top 3 results:\n\n```python\ndata = scraper.search_and_scrape(\n    \"latest news about SpaceX Starship\",\n    max_results=3,\n    example={\n        \"status\": \"string\",\n        \"last_launch\": \"string\",\n        \"summary\": \"brief overview\"\n    }\n)\n```\n\nWebSense intelligently crawls multiple sources and uses an LLM-based \"judge\" to synthesize the most accurate data from all sources.\n\n## CLI Usage\n\nWebSense provides a command-line interface for quick data extraction:\n\n```bash\n# Extract structured data from a webpage\nwebsense scrape https://example.com --example schema.json --verbose\n\n# Search the web and consolidate top 3 results\nwebsense search-scrape \"Nvidia stock performance 2024\" --top-k 3 --example '{\"price\": \"str\"}'\n\n# Search only (returns titles and URLs)\nwebsense search \"query\" --verbose\n\n# Get cleaned content only\nwebsense content https://example.com --output content.md\n```\n\nAvailable options for the `scrape` command:\n\n| Option | Description |\n|--------|-------------|\n| `--model, -m` | LLM model name |\n| `--schema, -s` | JSON schema (file path or raw JSON string) |\n| `--example, -e` | JSON example (file path or raw JSON string) |\n| `--output, -o` | Output file path |\n| `--timeout, -t` | Request timeout (default: 10) |\n| `--retries, -r` | Retry attempts (default: 3) |\n| `--verbose, -v` | Enable verbose output |\n\n**Pro Tip**: You can pass raw JSON strings directly to the CLI:\n\n```bash\nwebsense scrape https://example.com -e '{\"title\": \"string\"}'\n```\n\n## How It Works\n\nWebSense follows a three-stage pipeline:\n\n1. 
**Fetch** (`fetcher.py`): Downloads the webpage\n2. **Clean** (`cleaner.py`): Removes noise and extracts meaningful text\n3. **Parse** (`parser.py`): Uses AI to extract structured data based on your schema/example\n\n## Contributing\n\nContributions are welcome! Please:\n\n1. Fork the repository\n2. Create a feature branch (`git checkout -b feature/my-feature`)\n3. Commit changes (`git commit -m 'Add my feature'`)\n4. Push to the branch (`git push origin feature/my-feature`)\n5. Open a Pull Request\n\n## License\n\nMIT\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fatasoglu%2Fwebsense","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fatasoglu%2Fwebsense","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fatasoglu%2Fwebsense/lists"}