Web Crawler and GitHub Documentation Crawler
- Host: GitHub
- URL: https://github.com/anlaki-py/web-crawler
- Owner: anlaki-py
- License: mit
- Created: 2024-12-14T13:33:55.000Z (6 months ago)
- Default Branch: main
- Last Pushed: 2025-04-03T21:49:50.000Z (3 months ago)
- Last Synced: 2025-04-03T22:32:55.546Z (3 months ago)
- Topics: crawler, github-api, github-crawler, web-crawler, web-crawler-python
- Language: Python
- Homepage:
- Size: 1.25 MB
- Stars: 0
- Watchers: 1
- Forks: 0
- Open Issues: 0
Metadata Files:
- Readme: README.md
- License: LICENSE
README
# Web Crawler and GitHub Documentation Crawler
This repository contains two Python scripts for crawling web pages and GitHub repositories to extract and store relevant content. Below is a brief overview of each script's capabilities.
## 1. web_crawler.py
### Overview
The `web_crawler.py` script is designed to crawl a specified website, extract page content, and save the data in JSON format. It respects `robots.txt` rules and allows customization of crawl depth and chunk size.

### Features
- **Domain-Specific Crawling**: Crawls only the specified domain and path.
- **Robots.txt Compliance**: Respects the rules defined in the website's `robots.txt` file.
- **Chunked Output**: Saves crawled data in JSON chunks for easier processing.
- **Customizable Depth**: Allows setting a maximum crawl depth.
- **Exclusion Rules**: Excludes URLs with specific patterns (e.g., login pages, static assets). A minimal sketch of the resulting crawl loop follows this list.
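The sketch below illustrates how these features fit together, assuming `requests` for fetching and BeautifulSoup for parsing; the exclusion patterns and function names are illustrative, not the script's actual code.

```python
# Minimal sketch of a robots.txt-aware, domain-limited, depth-limited crawl loop.
# Assumptions: requests + BeautifulSoup, example exclusion patterns.
import urllib.robotparser
from urllib.parse import urljoin, urlparse

import requests
from bs4 import BeautifulSoup

EXCLUDE = ("/login", ".png", ".jpg", ".css", ".js")  # illustrative exclusion patterns

def crawl(start_url, max_depth=2):
    parsed = urlparse(start_url)

    # Robots.txt compliance: fetch and honor the site's crawl rules.
    robots = urllib.robotparser.RobotFileParser(f"{parsed.scheme}://{parsed.netloc}/robots.txt")
    robots.read()

    seen, queue, pages = set(), [(start_url, 0)], []
    while queue:
        url, depth = queue.pop(0)
        if url in seen or depth > max_depth or not robots.can_fetch("*", url):
            continue
        if any(pattern in url for pattern in EXCLUDE):  # exclusion rules
            continue
        seen.add(url)

        html = requests.get(url, timeout=10).text
        soup = BeautifulSoup(html, "html.parser")
        pages.append({
            "url": url,
            "title": soup.title.string if soup.title else "",
            "text": soup.get_text(" ", strip=True),
        })

        # Domain-specific crawling: only follow links on the same host.
        for link in soup.find_all("a", href=True):
            next_url = urljoin(url, link["href"])
            if urlparse(next_url).netloc == parsed.netloc:
                queue.append((next_url, depth + 1))
    return pages
```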
### Usage

1. Run the script and provide the website URL.
2. Optionally, set the chunk size and maximum crawl depth.
3. The script will save the crawled data as JSON chunks in the `web_crawled_data` directory; an example of loading them follows below.
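Because the output is chunked JSON, downstream code can merge the chunks back into a single collection. The chunk file naming and record layout below are assumptions; adjust them to match the files the script actually writes.

```python
# Hedged example: load every JSON chunk in web_crawled_data/ and merge the records.
import json
from pathlib import Path

pages = []
for chunk_file in sorted(Path("web_crawled_data").glob("*.json")):
    with chunk_file.open(encoding="utf-8") as f:
        pages.extend(json.load(f))  # assumes each chunk holds a list of page records

print(f"Loaded {len(pages)} crawled pages")
```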
---

## 2. gitHub_docs_crawler.py
### Overview
The `gitHub_docs_crawler.py` script is designed to crawl a GitHub repository, specifically targeting documentation files (e.g., `.md`, `.txt`, `.html`). It extracts file content and metadata, saving the data in JSON format.

### Features
- **GitHub API Integration**: Uses the GitHub API to fetch repository contents.
- **File Type Filtering**: Targets specific file extensions (e.g., `.md`, `.html`).
- **Rate Limit Handling**: Automatically pauses when GitHub API rate limits are approached.
- **Chunked Output**: Saves crawled data in JSON chunks for easier processing.
- **Customizable Depth**: Allows setting a maximum directory recursion depth. A minimal sketch of this approach follows this list.
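The sketch below shows the general approach using the public GitHub REST contents API via `requests`. The endpoint and rate-limit headers are standard GitHub API; the function itself is illustrative rather than the script's actual code.

```python
# Minimal sketch: list a repository tree via the GitHub contents API, keep
# documentation files, and pause when the rate limit is nearly exhausted.
import time
import requests

DOC_EXTENSIONS = (".md", ".txt", ".html")
API = "https://api.github.com"

def crawl_docs(owner, repo, path="", token=None, depth=0, max_depth=3):
    headers = {"Accept": "application/vnd.github+json"}
    if token:
        headers["Authorization"] = f"Bearer {token}"
    resp = requests.get(f"{API}/repos/{owner}/{repo}/contents/{path}", headers=headers)

    # Rate limit handling: sleep until the reset time when few requests remain.
    remaining = int(resp.headers.get("X-RateLimit-Remaining", 1))
    if remaining <= 1:
        reset = int(resp.headers.get("X-RateLimit-Reset", time.time() + 60))
        time.sleep(max(reset - time.time(), 0) + 1)

    resp.raise_for_status()
    docs = []
    for item in resp.json():
        # File type filtering: keep only documentation extensions.
        if item["type"] == "file" and item["name"].lower().endswith(DOC_EXTENSIONS):
            docs.append({"path": item["path"], "download_url": item["download_url"]})
        # Customizable depth: recurse into subdirectories up to max_depth.
        elif item["type"] == "dir" and depth < max_depth:
            docs.extend(crawl_docs(owner, repo, item["path"], token, depth + 1, max_depth))
    return docs

if __name__ == "__main__":
    for doc in crawl_docs("anlaki-py", "web-crawler"):
        print(doc["path"])
```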
### Usage

1. Run the script and provide the GitHub repository URL.
2. Optionally, provide a GitHub token for authenticated requests (see the sketch after this list).
3. The script will save the crawled data in the `github_api_crawled_data` directory.
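For steps 1 and 2, the example below shows one way a repository URL and an optional token might be supplied; the variable names and the `GITHUB_TOKEN` environment variable are illustrative, and the real script's prompts or arguments may differ.

```python
# Hedged example: split a GitHub repository URL into owner/repo and read an
# optional token from the environment (illustrative, not the script's actual interface).
import os
from urllib.parse import urlparse

def parse_repo_url(url):
    # "https://github.com/anlaki-py/web-crawler" -> ("anlaki-py", "web-crawler")
    owner, repo = urlparse(url).path.strip("/").split("/")[:2]
    return owner, repo

owner, repo = parse_repo_url("https://github.com/anlaki-py/web-crawler")
token = os.environ.get("GITHUB_TOKEN")  # optional; unauthenticated requests get a lower rate limit
print(owner, repo, "authenticated" if token else "unauthenticated")
```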
---

## Credits
- The GitHub Documentation Crawler is inspired by the original work by [rsain/GitHub-Crawler](https://github.com/rsain/GitHub-Crawler).