{"id":22808191,"url":"https://github.com/simonpierreboucher/crawler","last_synced_at":"2025-03-30T20:53:27.521Z","repository":{"id":262877591,"uuid":"888046091","full_name":"simonpierreboucher/Crawler","owner":"simonpierreboucher","description":"A robust, modular web crawler built in Python for extracting and saving content from websites. This crawler is specifically designed to extract text content from both HTML and PDF files, saving them in a structured format with metadata.","archived":false,"fork":false,"pushed_at":"2024-11-18T18:44:04.000Z","size":90,"stargazers_count":0,"open_issues_count":0,"forks_count":0,"subscribers_count":1,"default_branch":"main","last_synced_at":"2025-02-06T00:23:39.203Z","etag":null,"topics":["concurrent-crawling","content-extraction","data-collection","data-extraction-pipeline","data-preservation-and-recovery","data-scraping","error-handling","html-parsing","http-requests","metadata-storage","modular-design","pdf-text-extraction","python-crawler","rate-limiting","structured-data-storage","text-processing","url-normalization","web-crawling","yaml-configuration"],"latest_commit_sha":null,"homepage":"","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":null,"status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/simonpierreboucher.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":null,"code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2024-11-13T18:07:35.000Z","updated_at":"2024-11-18T18:44:07.000Z","dependencies_parsed_at":"2025-03-10T18:00:44.783Z","dependency_job_id":null,"html_url":"https://github.com/simonpierreboucher/Crawler","commit_stats":null,"previous_names":["simonpierreboucher/crawler"],"tags_co
unt":0,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/simonpierreboucher%2FCrawler","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/simonpierreboucher%2FCrawler/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/simonpierreboucher%2FCrawler/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/simonpierreboucher%2FCrawler/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/simonpierreboucher","download_url":"https://codeload.github.com/simonpierreboucher/Crawler/tar.gz/refs/heads/main","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":246379378,"owners_count":20767696,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["concurrent-crawling","content-extraction","data-collection","data-extraction-pipeline","data-preservation-and-recovery","data-scraping","error-handling","html-parsing","http-requests","metadata-storage","modular-design","pdf-text-extraction","python-crawler","rate-limiting","structured-data-storage","text-processing","url-normalization","web-crawling","yaml-configuration"],"created_at":"2024-12-12T11:08:25.828Z","updated_at":"2025-03-30T20:53:27.487Z","avatar_url":"https://github.com/simonpierreboucher.png","language":"Python","funding_links":[],"categories":[],"sub_categories":[],"readme":"# Web Crawler for Text Extraction\n[![License: MIT](https://img.shields.io/badge/License-MIT-yellow.svg)](https://opensource.org/licenses/MIT)\n[![Python 
Version](https://img.shields.io/badge/python-3.7%2B-blue.svg)](https://www.python.org/downloads/)\n[![GitHub Issues](https://img.shields.io/github/issues/simonpierreboucher/llm-generate-function)](https://github.com/simonpierreboucher/llm-generate-function/issues)\n[![GitHub Forks](https://img.shields.io/github/forks/simonpierreboucher/llm-generate-function)](https://github.com/simonpierreboucher/llm-generate-function/network)\n[![GitHub Stars](https://img.shields.io/github/stars/simonpierreboucher/llm-generate-function)](https://github.com/simonpierreboucher/llm-generate-function/stargazers)\n\nA robust, modular web crawler built in Python for extracting and saving content from websites. This crawler is specifically designed to extract text content from both HTML and PDF files, saving them in a structured format with metadata.\n\n## Features\n\n- Extracts and saves text content from HTML and PDF files\n- Adds metadata (URL and timestamp) to each saved file\n- Concurrent crawling with configurable workers\n- Robust error handling and detailed logging\n- Configurable through YAML files\n- URL sanitization and normalization\n- State preservation and recovery\n- Rate limiting and polite crawling\n- Command-line interface\n- Saves images (PNG, JPG, JPEG) in the `image` folder with their respective formats\n- Displays ASCII art (\"POWERED\", \"BY\", \"M-LAI\") every 5 steps during the crawl\n\n## Installation\n\n1. Clone the repository:\n   ```bash\n   git clone https://github.com/simonpierreboucher/Crawler.git\n   cd Crawler\n   ```\n\n2. Create and activate a virtual environment:\n   ```bash\n   # On Unix/MacOS\n   python3 -m venv venv\n   source venv/bin/activate\n\n   # On Windows\n   python -m venv venv\n   .\\venv\\Scripts\\activate\n   ```\n\n3. 
Install the package and dependencies:\n   ```bash\n   pip install -e .\n   ```\n\n## Output Format\n\nEach extracted page is saved as a text file with the following format:\n\n```text\nURL: https://www.example.com/page\nTimestamp: 2024-11-12 23:45:12\n====================================================================================================\n\n[Extracted content from the page]\n\n====================================================================================================\nEnd of content from: https://www.example.com/page\n```\n\nImages are saved in the `image` folder with the respective format (PNG, JPG, JPEG).\n\n## Configuration\n\nConfigure the crawler through `config/settings.yaml`:\n\n```yaml\ndomain:\n  name: \"www.example.com\"\n  start_url: \"https://www.example.com\"\n\ntimeouts:\n  connect: 10\n  read: 30\n  max_retries: 3\n  max_redirects: 5\n\ncrawler:\n  max_workers: 5\n  max_queue_size: 10000\n  chunk_size: 8192\n  delay_min: 1\n  delay_max: 3\n\nfiles:\n  max_length: 200\n  max_url_length: 2000\n  max_log_size: 10485760  # 10MB\n  max_log_backups: 5\n\nexcluded:\n  extensions:\n    - \".jpg\"\n    - \".jpeg\"\n    - \".png\"\n    - \".gif\"\n    - \".css\"\n    - \".js\"\n    - \".ico\"\n    - \".xml\"\n  \n  patterns:\n    - \"login\"\n    - \"logout\"\n    - \"signin\"\n    - \"signup\"\n```\n\n## Usage\n\n### Basic Usage\n\n```bash\npython run.py\n```\n\n### With Custom Configuration\n\n```bash\npython run.py --config path/to/config.yaml --output path/to/output\n```\n\n### Resume Previous Crawl\n\n```bash\npython run.py --resume\n```\n\n### Command-line Options\n\n- `--config, -c`: Path to configuration file (default: config/settings.yaml)\n- `--output, -o`: Output directory for crawled content (default: text)\n- `--resume, -r`: Resume from previous crawl state\n\n## Project Structure\n\n```\ncrawler/\n│\n├── config/\n│   ├── __init__.py\n│   └── settings.yaml\n│\n├── src/\n│   ├── __init__.py\n│   ├── constants.py\n│   ├── 
session.py\n│   ├── extractors.py\n│   ├── processors.py\n│   ├── crawler.py\n│   └── utils.py\n│\n├── requirements.txt\n├── setup.py\n└── run.py\n```\n\n## Dependencies\n\n- requests\u003e=2.31.0\n- beautifulsoup4\u003e=4.12.2\n- PyPDF2\u003e=3.0.1\n- fake-useragent\u003e=1.1.1\n- tldextract\u003e=5.0.1\n- urllib3\u003e=2.0.7\n- pyyaml\u003e=6.0.1\n- click\u003e=8.1.7\n\n## Error Handling\n\nThe crawler includes:\n- Automatic retries for failed requests\n- Detailed logging of all errors\n- Graceful shutdown on interruption\n- State preservation on errors\n\n## Contributing\n\n1. Fork the repository\n2. Create your feature branch (`git checkout -b feature/AmazingFeature`)\n3. Commit your changes (`git commit -m 'Add some AmazingFeature'`)\n4. Push to the branch (`git push origin feature/AmazingFeature`)\n5. Open a Pull Request\n\n## License\n\nThis project is licensed under the MIT License - see the LICENSE file for details.\n\n## Authors\n\n- **Simon-Pierre Boucher** - *Initial work* - [GitHub](https://github.com/simonpierreboucher)\n\n## Version History\n\n* 0.2\n    * Added metadata to saved files\n    * Improved error handling\n    * Enhanced logging system\n    * Added ASCII art display at regular intervals (every 5 steps)\n\n* 0.1\n    * Initial Release\n    * Basic functionality with HTML and PDF support\n    * Configurable crawling parameters\n\n## Contact\n\nProject Link: [https://github.com/simonpierreboucher/Crawler](https://github.com/simonpierreboucher/Crawler)\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fsimonpierreboucher%2Fcrawler","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fsimonpierreboucher%2Fcrawler","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fsimonpierreboucher%2Fcrawler/lists"}