# Web Crawler and GitHub Documentation Crawler

This repository contains two Python scripts for crawling web pages and GitHub repositories to extract and store relevant content. Below is a brief overview of each script's capabilities.

## 1. web_crawler.py

### Overview
The `web_crawler.py` script is designed to crawl a specified website, extract page content, and save the data in JSON format. It respects `robots.txt` rules and allows customization of crawl depth and chunk size.
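The snippet below is a minimal, illustrative sketch of this approach (it is not the script itself): it restricts the crawl to the start URL's domain and path, checks `robots.txt` before each fetch, and collects page text. The package choices (`requests`, `beautifulsoup4`), the user-agent string, and the function name are assumptions made for the example.

```python
# Minimal sketch of a domain-restricted, robots.txt-aware crawl.
# Not the actual web_crawler.py; names and libraries are assumptions.
from urllib import robotparser
from urllib.parse import urljoin, urlparse

import requests
from bs4 import BeautifulSoup

USER_AGENT = "web-crawler-example"  # hypothetical user-agent string


def crawl(start_url: str, max_depth: int = 2) -> dict:
    """Breadth-first crawl limited to the start URL's domain and path."""
    parsed = urlparse(start_url)
    robots = robotparser.RobotFileParser(f"{parsed.scheme}://{parsed.netloc}/robots.txt")
    robots.read()

    scope = f"{parsed.scheme}://{parsed.netloc}{parsed.path}"
    pages, seen, frontier = {}, {start_url}, [(start_url, 0)]
    while frontier:
        url, depth = frontier.pop(0)
        if depth > max_depth or not robots.can_fetch(USER_AGENT, url):
            continue  # skip pages beyond the depth limit or disallowed by robots.txt
        response = requests.get(url, headers={"User-Agent": USER_AGENT}, timeout=10)
        soup = BeautifulSoup(response.text, "html.parser")
        pages[url] = soup.get_text(separator=" ", strip=True)

        for link in soup.find_all("a", href=True):
            absolute = urljoin(url, link["href"])
            # Stay inside the original domain and path; skip already-queued URLs.
            if absolute.startswith(scope) and absolute not in seen:
                seen.add(absolute)
                frontier.append((absolute, depth + 1))
    return pages
```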

### Features
- **Domain-Specific Crawling**: Crawls only the specified domain and path.
- **Robots.txt Compliance**: Respects the rules defined in the website's `robots.txt` file.
- **Chunked Output**: Saves crawled data in JSON chunks for easier processing.
- **Customizable Depth**: Allows setting a maximum crawl depth.
- **Exclusion Rules**: Excludes URLs with specific patterns (e.g., login pages, static assets).

### Usage
1. Run the script and provide the website URL.
2. Optionally, set the chunk size and maximum crawl depth.
3. The script will save the crawled data as JSON chunks in the `web_crawled_data` directory (see the sketch below).
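
As a rough illustration of the chunked output step, the sketch below writes crawled pages into numbered JSON files under `web_crawled_data`; the chunk naming and record layout are assumptions, not necessarily the script's exact format.

```python
# Hedged sketch of chunked JSON output; file names and record fields are assumed.
import json
from pathlib import Path


def save_in_chunks(pages: dict, chunk_size: int = 50, out_dir: str = "web_crawled_data") -> None:
    """Write crawled pages as numbered JSON chunk files of at most `chunk_size` records each."""
    output = Path(out_dir)
    output.mkdir(parents=True, exist_ok=True)

    records = [{"url": url, "content": text} for url, text in pages.items()]
    for start in range(0, len(records), chunk_size):
        chunk_path = output / f"chunk_{start // chunk_size + 1}.json"
        with chunk_path.open("w", encoding="utf-8") as handle:
            json.dump(records[start:start + chunk_size], handle, ensure_ascii=False, indent=2)
```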

---

## 2. gitHub_docs_crawler.py

### Overview
The `gitHub_docs_crawler.py` script is designed to crawl a GitHub repository, specifically targeting documentation files (e.g., `.md`, `.txt`, `.html`). It extracts file content and metadata, saving the data in JSON format.
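For orientation, here is a hedged sketch of how such a crawl can be built on GitHub's public contents endpoint (`GET /repos/{owner}/{repo}/contents/{path}`). The endpoint and response fields are GitHub's documented API; the helper name and extension filter are illustrative assumptions, not the script's actual code.

```python
# Sketch of recursively listing documentation files via GitHub's contents API.
# The endpoint is real; the helper name and filter set are assumptions.
import requests

DOC_EXTENSIONS = (".md", ".txt", ".html")  # assumed documentation file types


def list_doc_files(owner: str, repo: str, path: str = "", token: str | None = None) -> list[dict]:
    """Recursively collect documentation files and their download URLs."""
    headers = {"Accept": "application/vnd.github+json"}
    if token:
        headers["Authorization"] = f"Bearer {token}"

    url = f"https://api.github.com/repos/{owner}/{repo}/contents/{path}"
    response = requests.get(url, headers=headers, timeout=10)
    response.raise_for_status()

    files = []
    for entry in response.json():
        if entry["type"] == "dir":
            files.extend(list_doc_files(owner, repo, entry["path"], token))
        elif entry["name"].endswith(DOC_EXTENSIONS):
            files.append({"path": entry["path"], "download_url": entry["download_url"]})
    return files
```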

### Features
- **GitHub API Integration**: Uses the GitHub API to fetch repository contents.
- **File Type Filtering**: Targets specific file extensions (e.g., `.md`, `.html`).
- **Rate Limit Handling**: Automatically pauses when GitHub API rate limits are approached.
- **Chunked Output**: Saves crawled data in JSON chunks for easier processing.
- **Customizable Depth**: Allows setting a maximum directory recursion depth.

### Usage
1. Run the script and provide the GitHub repository URL.
2. Optionally, provide a GitHub token for authenticated requests and higher rate limits (see the rate-limit sketch below).
3. The script will save the crawled data in the `github_api_crawled_data` directory.
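
To show how token-authenticated requests and rate-limit pauses can fit together, here is a minimal sketch built on GitHub's documented `X-RateLimit-Remaining` and `X-RateLimit-Reset` response headers; the pause-and-retry logic is an assumption about how the script behaves, not a copy of it.

```python
# Sketch of rate-limit handling; the headers are GitHub's documented ones,
# the pause-and-retry behaviour is an assumption for illustration.
import time

import requests


def get_with_rate_limit(url: str, token: str | None = None) -> requests.Response:
    """GET a GitHub API URL, pausing until the rate-limit window resets when it is exhausted."""
    headers = {"Accept": "application/vnd.github+json"}
    if token:
        headers["Authorization"] = f"Bearer {token}"

    response = requests.get(url, headers=headers, timeout=10)
    if response.headers.get("X-RateLimit-Remaining") == "0":
        reset_at = int(response.headers.get("X-RateLimit-Reset", "0"))
        time.sleep(max(reset_at - int(time.time()), 0) + 1)  # wait for the window to reset
        response = requests.get(url, headers=headers, timeout=10)
    return response
```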

---

## Credits
- The GitHub Documentation Crawler is inspired by the original [rsain/GitHub-Crawler](https://github.com/rsain/GitHub-Crawler).