https://github.com/zebbern/regex-crawler
Regex Web Crawler that searches each page it crawls against custom regexes to find the information you're looking for!
- Host: GitHub
- URL: https://github.com/zebbern/regex-crawler
- Owner: zebbern
- License: mit
- Created: 2025-02-12T09:02:25.000Z (4 months ago)
- Default Branch: main
- Last Pushed: 2025-02-26T20:13:06.000Z (4 months ago)
- Last Synced: 2025-04-13T16:51:09.347Z (2 months ago)
- Topics: bug-bounty, bugbounty, crawler, information-gathering, information-retrieval, osint, osint-tool, pentest, python, regex, regex-engine, regex-match, regex-pattern, regex-tool, toolkit, tools, website
- Language: Python
- Homepage:
- Size: 39.1 KB
- Stars: 3
- Watchers: 1
- Forks: 0
- Open Issues: 0
Metadata Files:
- Readme: README.md
- License: LICENSE
## README

# Regex Web Crawler


**An advanced web crawler built for bug bounty hunters!**
**The tool recursively crawls a target website, performs regex-based content searches, and saves the results in structured YAML files.**
**Includes optional security analysis for reconnaissance.**

---
### `Features:`
- Validate URLs before crawling to prevent errors.
- Extract all internal links recursively up to a specified depth.
- Perform regex-based searches on each page's content using a user-defined regex list.
- Optionally enable advanced security checks, such as scanning HTTP headers and HTML comments for potential leaks.
- Store all crawled URLs and results in structured YAML for easy analysis (see the sketch below).
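
For orientation, the crawl-and-match core might look roughly like the sketch below. This is a minimal illustration, not the tool's actual code: it assumes `requests` and `beautifulsoup4` (both in the requirements), restricts recursion to same-host links, and flags HTML comments only when a pattern matches them.

```python
import re
import yaml
import requests
from urllib.parse import urljoin, urlparse
from bs4 import BeautifulSoup, Comment

def crawl(url, patterns, depth, seen=None, results=None):
    """Recursively crawl same-host links, matching each page against the regex list."""
    seen = set() if seen is None else seen
    results = {} if results is None else results
    if depth < 0 or url in seen:
        return results
    seen.add(url)
    try:
        resp = requests.get(url, timeout=10)
    except requests.RequestException:
        return results  # skip unreachable pages instead of crashing
    # Regex-based content search on the raw page body
    hits = [m for p in patterns for m in p.findall(resp.text)]
    soup = BeautifulSoup(resp.text, "html.parser")
    # "Advanced" check: HTML comments sometimes leak credentials or internal notes
    for c in soup.find_all(string=lambda t: isinstance(t, Comment)):
        if any(p.search(c) for p in patterns):
            hits.append(f"comment: {c.strip()}")
    if hits:
        results[url] = hits
    # Follow internal links only, one level deeper per recursion
    host = urlparse(url).netloc
    for a in soup.find_all("a", href=True):
        link = urljoin(url, a["href"])
        if urlparse(link).netloc == host:
            crawl(link, patterns, depth - 1, seen, results)
    return results

if __name__ == "__main__":
    patterns = [re.compile(line.strip())
                for line in open("regex_patterns.txt") if line.strip()]
    found = crawl("https://example.com", patterns, depth=1)
    with open("results.yaml", "w") as f:
        yaml.safe_dump(found, f)  # structured YAML output, as the tool produces
```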
---
### `How To Run:`
**Step 1: Configure the `config.yaml` file to set up the target URL and crawling options.**
**Step 2: Run the Python script and let it crawl the target website while extracting valuable information.**
**Step 3: Review the structured results saved in `results.yaml`.**

## Requirements:
```txt
requests
beautifulsoup4
pyyaml
```
Install the required dependencies with:
```
pip install -r requirements.txt
```

## Usage:
1. Set up your configuration in `config.yaml`:
```yaml
base_url: "https://example.com"
crawl_depth: 1
advanced: true
regex_file: "regex_patterns.txt"
output_file: "results.yaml"
```
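
If you want to read the same settings from your own scripts, `pyyaml` (already in the requirements) loads them in a few lines. A sketch, assuming the key names shown above:

```python
import yaml

# Load crawl settings; key names follow the example config above
with open("config.yaml") as f:
    cfg = yaml.safe_load(f)

base_url = cfg["base_url"]             # target site to crawl
crawl_depth = cfg["crawl_depth"]       # how many link levels to follow
advanced = cfg.get("advanced", False)  # enable header/comment checks
regex_file = cfg.get("regex_file", "regex_patterns.txt")
output_file = cfg.get("output_file", "results.yaml")
```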
2. Create or edit your regex patterns in `regex_patterns.txt` (one per line):
```txt
(?i)password\s*[:=]\s*['"][^'"]+['"]
(?i)secret\s*[:=]\s*['"][^'"]+['"]
```
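
Each line is compiled as an independent Python regex; the inline `(?i)` flag makes a pattern case-insensitive. A quick sketch of how such a file can be loaded and tested (the `#`-comment handling is an assumption, not necessarily the tool's behaviour):

```python
import re

# Compile one pattern per line, skipping blanks and '#' comment lines
with open("regex_patterns.txt") as f:
    patterns = [re.compile(line.strip()) for line in f
                if line.strip() and not line.startswith("#")]

sample = 'db_password = "hunter2"'
for p in patterns:
    for match in p.findall(sample):
        print(f"{p.pattern!r} matched: {match}")
```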
3. Run the script:
```bash
python para.py
```

## Contribute:
Feel free to suggest improvements or contribute by visiting [https://github.com/zebbern/regex-crawler](https://github.com/zebbern/regex-crawler).
> [!WARNING]
> This tool is intended for ethical hacking and bug bounty purposes only. Unauthorized scanning of third-party websites is illegal and unethical. Always obtain explicit permission before testing any target.