https://github.com/oceanside-chess/email-scraper
Scrape emails from a website using recursive crawling and anti-obfuscation techniques, then validate every address before saving it to a file.
bot email-extraction email-extractor email-scraper email-validation go go-package golang spider web-crawler web-scraper web-scraping web-scraping-software website-scraper
- Host: GitHub
- URL: https://github.com/oceanside-chess/email-scraper
- Owner: oceanside-chess
- License: agpl-3.0
- Created: 2024-12-22T06:37:16.000Z (over 1 year ago)
- Default Branch: main
- Last Pushed: 2025-03-29T23:34:55.000Z (about 1 year ago)
- Last Synced: 2026-01-02T10:32:42.402Z (5 months ago)
- Topics: bot, email-extraction, email-extractor, email-scraper, email-validation, go, go-package, golang, spider, web-crawler, web-scraper, web-scraping, web-scraping-software, website-scraper
- Language: Go
- Homepage:
- Size: 25.4 KB
- Stars: 4
- Watchers: 1
- Forks: 2
- Open Issues: 0
Metadata Files:
- Readme: README.md
- License: LICENSE
README
# Email Scraper
This project aims to defeat as many email obfuscation methods as possible, providing a single bot that can crawl the web and harvest emails. It handles both common and uncommon obfuscation methods, including Cloudflare email protection, ROT ciphers, HTML entity encoding, RTL (right-to-left) text, JavaScript-based obfuscation, SVG-encoded emails, hex and Unicode escapes, addresses embedded in `object` and `iframe` elements, JavaScript hrefs, addresses split by comments, Base64 encoding, basic AJAX and API request obfuscation, and text-based obfuscation. Support for more methods is planned.
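As an illustration of two of the methods listed above (HTML entity encoding and text-based obfuscation), here is a minimal sketch using only the Go standard library. It is not taken from this project's source: the function names, the bracketed `[at]`/`(dot)` patterns, and the email regular expression are all assumptions made for the example.

```go
package main

import (
	"fmt"
	"html"
	"regexp"
)

// Hypothetical patterns: bracketed "at"/"dot" tokens and a deliberately
// simple email matcher. A real scraper would need broader patterns.
var (
	atPattern    = regexp.MustCompile(`(?i)\s*[\[\(\{]\s*at\s*[\]\)\}]\s*`)
	dotPattern   = regexp.MustCompile(`(?i)\s*[\[\(\{]\s*dot\s*[\]\)\}]\s*`)
	emailPattern = regexp.MustCompile(`[A-Za-z0-9._%+\-]+@[A-Za-z0-9.\-]+\.[A-Za-z]{2,}`)
)

// normalize decodes HTML entities, then rewrites bracketed "at" and "dot"
// tokens (with any surrounding whitespace) to "@" and ".".
func normalize(s string) string {
	s = html.UnescapeString(s)
	s = atPattern.ReplaceAllString(s, "@")
	s = dotPattern.ReplaceAllString(s, ".")
	return s
}

// extractEmails returns every address found in the normalized text.
func extractEmails(s string) []string {
	return emailPattern.FindAllString(normalize(s), -1)
}

func main() {
	page := "Contact: jane [at] example (dot) com or &#105;&#110;&#102;&#111;&#64;example.org"
	fmt.Println(extractEmails(page)) // [jane@example.com info@example.org]
}
```

Normalizing first and matching second keeps each decoder small and lets further decoders be chained in the same way.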
## Features
- **Email Extraction**: Scrapes email addresses from HTML content.
- **Obfuscation Handling**: Decodes obfuscated emails, including JavaScript-based methods.
- **Depth-based Crawling**: Crawls through websites up to a specified depth, staying within the domain or subdirectories.
- **Email Validation**: Validates email addresses against known standards and checks DNS records for each domain.
- **Logging**: Outputs logs to a file for debugging and analysis.
## Installation
1. Ensure Go is installed on your system. [Download Go](https://golang.org/dl/).
2. Clone the repository or download the source code.
```bash
git clone https://github.com/oceanside-chess/email-scraper.git
cd email-scraper
```
3. Install dependencies:
```bash
go mod tidy
```
4. Compile the project:
```bash
go build -o run
```
### Command-Line Arguments
- `URL` (required): The URL where the crawl starts.
- `-v`, `--verbose`: Enable verbose logging.
- `--disable-cookies`: Disable cookies during requests.
- `--log`: Log output to the specified file.
- `-o`, `--output`: Output file to save scraped emails (default: `emails.txt`).
- `--skip-validation`: Skip the email validation.
- `--user-agent`: Custom User-Agent string for requests.
- `--max-depth`: Set the maximum crawling depth (default: 3).
- `--domain-mode`: Set crawling domain mode:
  - `1`: Stay within the current site (default).
  - `2`: Explore subdirectories.
  - `3`: Unrestricted.
### Example
To run the crawler with verbose output, skip email validation, and save emails to a file:
```bash
./run https://example.com --verbose --skip-validation --output emails.txt
```
## Functionality Breakdown
### Email Extraction
- Extracts:
  - Plain email addresses found in the page content.
  - Obfuscated emails (such as Cloudflare `data-cfemail` attributes).
  - Emails encoded in SVG images.
  - Emails obfuscated in JavaScript.
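For the `data-cfemail` case, the commonly documented Cloudflare scheme stores a hex string whose first byte is an XOR key and whose remaining bytes are the address XORed with that key. The following is a minimal sketch of that scheme, not this project's code; `decodeCFEmail` and `encodeCFEmail` are hypothetical names, and the encoder exists only to make the example round-trip.

```go
package main

import (
	"encoding/hex"
	"errors"
	"fmt"
)

// decodeCFEmail decodes a data-cfemail attribute value: the first byte is
// the XOR key, and every following byte is one character XORed with it.
func decodeCFEmail(encoded string) (string, error) {
	raw, err := hex.DecodeString(encoded)
	if err != nil {
		return "", err
	}
	if len(raw) < 2 {
		return "", errors.New("cfemail value too short")
	}
	key := raw[0]
	out := make([]byte, len(raw)-1)
	for i, b := range raw[1:] {
		out[i] = b ^ key
	}
	return string(out), nil
}

// encodeCFEmail is the inverse operation, included only for the demo.
func encodeCFEmail(email string, key byte) string {
	raw := make([]byte, 0, len(email)+1)
	raw = append(raw, key)
	for i := 0; i < len(email); i++ {
		raw = append(raw, email[i]^key)
	}
	return hex.EncodeToString(raw)
}

func main() {
	encoded := encodeCFEmail("info@example.com", 0x5a)
	decoded, err := decodeCFEmail(encoded)
	fmt.Println(encoded, decoded, err)
}
```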
### Depth-based Crawling
The crawler follows links recursively, so it can reach pages several links away from the start URL. The `--max-depth` flag sets how many levels deep it goes (default: 3).
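A depth limit of this kind can be sketched as follows. This example uses an iterative breadth-first queue rather than recursion, counts the start page as depth 0, and takes a hypothetical `linkFetcher` function in place of real HTTP requests. All three choices are assumptions for illustration and may differ from how the project counts depth or fetches pages.

```go
package main

import "fmt"

// linkFetcher returns the outgoing links of a page. It stands in for the
// HTTP fetch and HTML parsing steps. (Hypothetical helper type.)
type linkFetcher func(url string) []string

// queued pairs a URL with the depth at which it was discovered.
type queued struct {
	url   string
	depth int
}

// crawl visits pages breadth-first from start and never follows links
// found on pages that are already maxDepth levels away from it.
func crawl(start string, maxDepth int, fetch linkFetcher) []string {
	visited := map[string]bool{start: true}
	queue := []queued{{start, 0}}
	var order []string
	for len(queue) > 0 {
		cur := queue[0]
		queue = queue[1:]
		order = append(order, cur.url)
		if cur.depth >= maxDepth {
			continue // depth limit reached: do not expand this page
		}
		for _, link := range fetch(cur.url) {
			if !visited[link] {
				visited[link] = true
				queue = append(queue, queued{link, cur.depth + 1})
			}
		}
	}
	return order
}

func main() {
	// A tiny in-memory site used instead of real requests.
	site := map[string][]string{
		"/":    {"/a", "/b"},
		"/a":   {"/a/1", "/"},
		"/a/1": {"/a/1/deep"},
	}
	fetch := func(u string) []string { return site[u] }
	fmt.Println(crawl("/", 2, fetch)) // [/ /a /b /a/1]
}
```

The `visited` set prevents loops when pages link back to each other, which matters more as the depth limit grows.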
### Logging
Logs are generated for important actions, errors, and other debugging information. You can specify a log file using the `--log` flag.
## TODO
- Add OCR support.
- Capture redirects to `mailto`.
- Support CSS pseudo-element encoding.
- Remove non-visible HTML elements.
## License
This project is licensed under the AGPL-3.0 License - see the [LICENSE](LICENSE) file for details.