{"id":29253507,"url":"https://github.com/maxonary/simple-crawler","last_synced_at":"2026-06-20T08:31:04.153Z","repository":{"id":301766238,"uuid":"1010233391","full_name":"maxonary/simple-crawler","owner":"maxonary","description":"Streamlit Webscraper","archived":false,"fork":false,"pushed_at":"2025-06-28T22:36:03.000Z","size":25,"stargazers_count":0,"open_issues_count":0,"forks_count":0,"subscribers_count":0,"default_branch":"main","last_synced_at":"2025-09-04T10:14:08.801Z","etag":null,"topics":["crawler","streamlit","webscraping"],"latest_commit_sha":null,"homepage":"https://simple-crawler.streamlit.app","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":null,"status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/maxonary.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":null,"code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null,"zenodo":null,"notice":null,"maintainers":null,"copyright":null,"agents":null,"dco":null,"cla":null}},"created_at":"2025-06-28T16:27:20.000Z","updated_at":"2025-06-28T22:36:07.000Z","dependencies_parsed_at":"2025-06-28T18:26:40.758Z","dependency_job_id":"1dc8011e-572c-402c-b8de-1a353197363e","html_url":"https://github.com/maxonary/simple-crawler","commit_stats":null,"previous_names":["maxonary/simple-crawler"],"tags_count":0,"template":false,"template_full_name":null,"purl":"pkg:github/maxonary/simple-crawler","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/maxonary%2Fsimple-crawler","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/maxonary%2Fsimple-crawler/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/maxonary%2Fsimple-crawler/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/maxonary%2Fsimple-crawler/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/maxonary","download_url":"https://codeload.github.com/maxonary/simple-crawler/tar.gz/refs/heads/main","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/maxonary%2Fsimple-crawler/sbom","scorecard":null,"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":286080680,"owners_count":34563535,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2026-05-26T15:22:16.424Z","status":"online","status_checked_at":"2026-06-20T02:00:06.407Z","response_time":98,"last_error":null,"robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":true,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["crawler","streamlit","webscraping"],"created_at":"2025-07-04T02:03:05.665Z","updated_at":"2026-06-20T08:31:04.130Z","avatar_url":"https://github.com/maxonary.png","language":"Python","funding_links":[],"categories":[],"sub_categories":[],"readme":"# Simple Web Crawler\n\nA lightweight web crawler with a beautiful Streamlit frontend that allows you to crawl multiple URLs and extract clean body content along with all discovered links.\n\n## Features\n\n- 🕷️ **Simple URL Input**: Enter single URLs or multiple URLs at once\n- 📄 **Clean Body Content**: Extract main content without scripts, styles, and navigation\n- 🔗 **Link Discovery**: Find all internal and external links on each page\n- 📊 **Smart Content Extraction**: Extracts clean main body content by removing scripts, styles, navigation, headers, and footers\n- 📥 **Export Results**: Download crawl results as JSON or individual text files\n- 🎨 **Beautiful UI**: Modern Streamlit interface with real-time statistics\n- ⚡ **Fast \u0026 Efficient**: Built with requests and BeautifulSoup for optimal performance\n- **Auto-Crawl Links**: Automatically crawl discovered links with bulk selection options\n- **Response Details**: Shows HTTP status codes, content types, encoding, and content length\n- **Export Options**: Download individual content or export all results as JSON\n- **User-Friendly Interface**: Clean Streamlit interface with expandable sections and metrics\n- **Error Handling**: Graceful handling of failed requests and invalid URLs\n- **Dual Crawl Modes**: Choose between \"Body Only\" (clean text content) or \"Full Page\" (complete HTML)\n- **LLM-Optimized Exports**: Multiple export formats specifically designed for LLM consumption\n\n## Installation\n\n1. **Clone the repository:**\n   ```bash\n   git clone \u003cyour-repo-url\u003e\n   cd simple-crawler\n   ```\n\n2. **Create a virtual environment (recommended):**\n   ```bash\n   python -m venv .venv\n   source .venv/bin/activate  # On Windows: .venv\\Scripts\\activate\n   ```\n\n3. **Install dependencies:**\n   ```bash\n   pip install -r requirements.txt\n   ```\n\n## Usage\n\n1. **Start the Application**:\n   ```bash\n   streamlit run app.py\n   ```\n\n2. **Select Crawl Mode**: Choose between \"Body Only\" (clean text) or \"Full Page\" (complete HTML) in the sidebar\n\n3. **Enter URLs**: Choose between single URL or multiple URLs input method\n\n4. **Crawl Initial Pages**: Click \"Start Crawling\" to analyze the pages\n\n5. **Auto-Crawl Discovered Links**: \n   - Use \"Select All Internal\" to crawl all internal links\n   - Use \"Select All External\" to crawl all external links\n   - Or manually enter specific URLs to crawl\n   - Click \"Crawl Selected Links\" to automatically crawl them\n\n6. **View Results**: \n   - Expand each result to see content and links\n   - Copy links from the text areas\n   - Download individual content or export all results\n\n7. **Export Data**: Use the export button to download all results as JSON\n\n## What the Crawler Extracts\n\nFor each successfully crawled URL, you'll get:\n\n### Content Information\n- **Main Body Content**: Clean text content from the main content area\n- **Content Length**: Total number of characters\n- **Response Status**: HTTP status code\n- **Content Type**: MIME type of the response\n- **Encoding**: Character encoding used\n\n### Link Discovery\n- **Internal Links**: All links pointing to the same domain\n- **External Links**: All links pointing to other domains\n- **Total Links**: Complete count of all discovered links\n- **Link Lists**: Expandable sections showing the actual URLs\n\n### Error Information\n- **Detailed error messages** for failed crawls\n- **Network timeout handling**\n- **Graceful fallbacks** for parsing issues\n\n## Smart Content Extraction\n\nThe crawler intelligently extracts content by:\n\n1. **Removing unwanted elements**: scripts, styles, navigation, headers, footers\n2. **Targeting main content areas**: looks for `\u003cmain\u003e`, `\u003carticle\u003e`, `.content`, etc.\n3. **Falling back gracefully**: uses body content if no specific content area is found\n4. **Cleaning up text**: removes extra whitespace and formats nicely\n\n## Link Discovery Features\n\nThe crawler discovers and categorizes all links:\n\n- **Internal Links**: Links to the same domain (useful for site mapping)\n- **External Links**: Links to other domains (useful for backlink analysis)\n- **Duplicate Removal**: Automatically removes duplicate links\n- **URL Normalization**: Converts relative URLs to absolute URLs\n- **Smart Filtering**: Skips javascript:, mailto:, tel:, and other non-HTTP links\n\n## Example Usage\n\n### Input URLs:\n```\nexample.com\nhttps://github.com\nhttps://docs.python.org\n```\n\n### Sample Output:\n```json\n{\n  \"url\": \"https://example.com\",\n  \"status_code\": 200,\n  \"content\": \"Example Domain This domain is for use in illustrative examples...\",\n  \"content_type\": \"text/html; charset=UTF-8\",\n  \"encoding\": \"UTF-8\",\n  \"content_length\": 1234,\n  \"links\": {\n    \"internal\": [\"https://example.com/page1\", \"https://example.com/page2\"],\n    \"external\": [\"https://www.iana.org/domains/example\"],\n    \"all\": [\"https://example.com/page1\", \"https://example.com/page2\", \"https://www.iana.org/domains/example\"]\n  },\n  \"internal_links_count\": 2,\n  \"external_links_count\": 1,\n  \"total_links_count\": 3,\n  \"success\": true\n}\n```\n\n## Use Cases\n\n- **Content Analysis**: Extract clean text from web pages for analysis\n- **Site Mapping**: Discover all pages on a website through internal links\n- **Link Research**: Analyze external links and backlinks\n- **SEO Analysis**: Understand internal linking patterns\n- **Content Monitoring**: Track changes in web page content\n- **Data Collection**: Gather text content from multiple sources\n\n## Technical Details\n\n- **Backend**: Python with requests and BeautifulSoup\n- **Frontend**: Streamlit for the web interface\n- **Content Parsing**: HTML parser (built into Python, no external dependencies)\n- **Link Processing**: URL normalization and categorization\n- **Rate Limiting**: 1-second delay between requests to be respectful to servers\n\n## Important Notes\n\n⚠️ **Please be respectful when crawling websites:**\n- Check the website's `robots.txt` file\n- Don't overwhelm servers with too many requests\n- Consider the website's terms of service\n- The crawler includes a 1-second delay between requests by default\n\n## Requirements\n\n- Python 3.7+\n- See `requirements.txt` for specific package versions\n\n## Troubleshooting\n\n### Common Issues\n\n1. **Dependency Installation Fails**: \n   - Make sure you're using a virtual environment\n   - Try updating pip: `pip install --upgrade pip`\n\n2. **Streamlit Not Starting**:\n   - Check if port 8501 is available\n   - Try a different port: `streamlit run app.py --server.port 8502`\n\n3. **Crawling Fails**:\n   - Check your internet connection\n   - Some sites may block automated requests\n   - Try with different URLs\n\n## License\n\nThis project is open source and available under the MIT License.\n\n## LLM Integration\n\nThe crawler includes specialized export options optimized for Large Language Model consumption:\n\n### Export Formats for LLMs\n\n1. **🤖 LLM Text Export**: Clean, structured text format with metadata\n2. **📝 LLM Markdown Export**: Markdown-formatted content for better LLM parsing\n3. **🔧 Structured JSON Export**: API-ready JSON with cleaned content and metadata\n\n### Best Practices for LLM Usage\n\n- **Content Length**: Most LLMs work best with 4K-8K tokens per context\n- **Mode Selection**: Use \"Body Only\" for analysis tasks, \"Full Page\" for web scraping\n- **Content Cleaning**: Automatically removes scripts, styles, and navigation elements\n- **Link Limiting**: Includes only the most relevant links to prevent context overflow\n- **Metadata Preservation**: Maintains URL, status, and content type information\n\n### LLM Utilities\n\nThe `llm_utils.py` module provides additional utilities:\n- Content cleaning and optimization\n- Prompt context generation\n- Structured data creation\n- Best practices documentation\n\nClick \"📚 Show LLM Best Practices\" in the sidebar for detailed guidelines. ","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fmaxonary%2Fsimple-crawler","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fmaxonary%2Fsimple-crawler","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fmaxonary%2Fsimple-crawler/lists"}