https://github.com/nobrainghost/golamv2

Lightweight Web Crawler for Emails,Keywords,Deadlinks,Dead Domains written in Go. Suitable for low resource environments
https://github.com/nobrainghost/golamv2
golang webcrawler webcrawling
Last synced: about 1 year ago
JSON representation
Lightweight Web Crawler for Emails,Keywords,Deadlinks,Dead Domains written in Go. Suitable for low resource environments
Host: GitHub
URL: https://github.com/nobrainghost/golamv2
Owner: nobrainghost
Created: 2025-06-09T15:21:44.000Z (about 1 year ago)
Default Branch: main
Last Pushed: 2025-06-09T17:02:12.000Z (about 1 year ago)
Last Synced: 2025-06-09T18:20:43.067Z (about 1 year ago)
Topics: golang, webcrawler, webcrawling
Language: Go
Homepage:
Size: 10.5 MB
Stars: 2
Watchers: 0
Forks: 0
Open Issues: 0
Metadata Files:
- Readme: README.md
Awesome Lists containing this project

README

          # GolamV2 - A LightWeight Web Crawler for Emails/Keywords/Dead Backlinks/Dead Domains

GolamV2 is a high-performance, low-memory web crawler designed for maximum throughput in resource-constrained environments. It supports multiple hunting modes including email extraction, keyword searching, and dead link detection. It is a rewrite of the Python Version Gollum Spyder [here](https://github.com/nobrainghost/Keyword-Web-Crawler). Includes a Custom Interactive CLI Explore for its BadgerDB database

## Features

- **Multi-Purpose Crawling**: Email hunting, keyword searching, dead link detection

- **Memory Efficiency**: Can run with decent through put on low resource environments

- **Robots.txt Compliant**: Respects robots.txt and crawl delays

- **Real-time Dashboard**: Web-based monitoring interface

- **Interactive CLI Explorer**: Comprehensive data exploration and analysis tool

- **Clean Architecture**: Modular, maintainable codebase

- **Efficient Storage**: BadgerDB for persistent storage

- **Bloom Filter**: Memory-efficient duplicate URL detection

- **Priority Queue**: Smart URL queuing with database fallback

## Architecture

### Core Components

1. **URL Queue**: Priority-based queue (100k URLs limit) with automatic database refilling and spilling

2. **Bloom Filter**: To dedupe

3. **Storage Layer**: BadgerDB for persistent URL and result storage

4. **Worker Pool**: Configurable concurrent workers

5. **Content Extractor**:

6. **Robots Checker**: Compliant robots.txt parsing and enforcement. Also parses sitemaps

7. **Dashboard**: Real-time web interface for monitoring

## Installation

```bash

# Clone the repository

git clone https://github.com/nobrainghost/golamv2

cd GolamV2

# Install dependencies

go mod tidy

# Build the application

go build -o golamv2 main.go

```

## Usage

### Basic Email Hunting

```bash

./golamv2 --email --url https://example.com --workers 25

```

### Keyword Searching

```bash

./golamv2 --keywords "password,login,admin" --url https://example.com --workers 30

```

### Dead Link Detection

```bash

./golamv2 --domains --url https://example.com --workers 20

```

### All-in-One Mode

```bash

./golamv2 --email --domains --keywords "smeagol,ring" --url https://example.com --workers 40

```

### Data Exploration

```bash

# Explore crawl data interactively

./golamv2 explore

# Explore with custom data directory

./golamv2 explore --data /path/to/data

```

### Advanced Options

```bash

./golamv2 \

  --email \

  --url https://example.com \

  --workers 50 \

  --memory 400 \

  --depth 5 \

  --dashboard 8080

```

## Command Line Options

| Flag | Description | Default |

|------|-------------|---------|

| `--email` | Hunt for email addresses | false |

| `--domains` | Hunt for dead URLs and domains | false |

| `--keywords` | Hunt for specific keywords (comma-separated) | [] |

| `--url` | Starting URL to crawl (required) | - |

| `--workers` | Maximum number of concurrent workers | 50 |

| `--memory` | Maximum memory usage in MB | 500 |

| `--depth` | Maximum crawling depth | 5 |

| `--dashboard` | Dashboard port | 8080 |

## Dashboard

Access the real-time dashboard at `http://localhost:8080` (or your specified port). The paths /db currently dont work

### Dashboard Features

- **Real-time Metrics**: Live updates via a WebSocket

- **Performance Monitoring**: URLs/second, memory usage, uptime

- **Queue Status**: URLs in queue, database, active workers

- **Findings Summary**: Emails, keywords, dead links found

- **Success Rate**: Error tracking and success percentage

## CLI Data Explorer

GolamV2 includes an interactive CLI tool for exploring and analyzing crawl data stored in its BadgerDB databases.

### Starting the Explorer

```bash

# Use default data directory (golamv2_data)

./golamv2 explore

# Specify custom data directory

./golamv2 explore --data /path/to/data

# With output file for exports

./golamv2 explore --output results.json

```

### Available Commands

| Command | Description | Example |

|---------|-------------|---------|

| `help` | Show all available commands | `help` |

| `stats` | Display database statistics | `stats` |

| `urls [limit]` | List URLs (default: 10) | `urls 20` |

| `results [limit]` | List crawl results (default: 10) | `results 50` |

| `search ` | Search in results content | `search "admin panel"` |

| `emails [limit]` | Show found emails | `emails 25` |

| `keywords [limit]` | Show found keywords | `keywords 15` |

| `deadlinks [limit]` | Show dead links found | `deadlinks 30` |

| `export ` | Export data to JSON | `export emails` |

| `raw ` | Show raw data for specific key | `raw url:example.com` |

| `analyze` | Detailed data analysis | `analyze` |

| `timeline` | Show crawling timeline | `timeline` |

| `domains` | Show domain statistics | `domains` |

| `clear` | Clear terminal screen | `clear` |

| `quit/exit` | Exit explorer | `quit` |

### Explorer Features

#### Data Search and Filtering

- Full-text search across all results

- Search in titles, content, emails, and keywords

- Filter by status, domain, or content type

#### Export Capabilities

- Export URLs, results, emails, or keywords to JSON

- Configurable output files

- Data formatting for further analysis

##NOTE : NOT FULLY TESTED

#### Advanced Analysis

- Domain-based statistics and analysis

- Timeline visualization of crawling activity

- Success rate analysis by domain

- Error categorization and reporting

- Performance metrics and trends

### Example Explorer Session

```bash

$ ./golamv2 explore

🕸️  GolamV2 Data Explorer

========================

Interactive tool to explore crawl data

Data path: golamv2_data

golamv2> stats

 Database Statistics

=====================

URLs in database:      2,767

Results in database:   37,635

Emails found:          118,613

Keywords found:        1,258

Dead links found:      20,422

Errors encountered:    226

Success rate:          99.4%

golamv2> search "login"

 Search Results for "login":

=============================

Found 45 results containing "login"

- example.com/admin - Admin Login Portal

- test.org/user - User Login Page

...

golamv2> export emails

 Exporting emails...

Exported 118,613 emails to emails_export.json

golamv2> quit

Goodbye! [waveEmoji]

```

### Command Line Options

| Flag | Short | Description | Default |

|------|-------|-------------|---------|

| `--data` | `-d` | Path to GolamV2 data directory | `golamv2_data` |

| `--output` | `-o` | Output file for exports | (none) |

## Database Storage

### URL Database (`urls/`)

- Stores pending URLs for crawling

- Automatic queue refilling when memory queue is <40% full

- Optimized for fast retrieval and batch operations

### Results Database (`finds_*`)

- Stores crawling results based on mode:

  - `finds_email`: Email hunting results

  - `finds_keywords`: Keyword search results  

  - `finds_domains`: Dead link detection results

  - `finds`: All-mode results

## Performance Optimization

### Memory Management

- **Bloom Filter**: 10M URL capacity, 1% false positive rate

- **Priority Queue**: 100k URL limit with smart refilling

- **BadgerDB**: Tuned for low memory - can increase to suit your environment

- **HTTP Responses**: 10MB size limit to prevent memory exhaustion

### Throughput Optimization

- **Worker Pool**: Configurable concurrent processing

- **Rate Limiting**: Respectful crawling (10 req/sec default)

- **Batch Operations**: Efficient database operations

- **Connection Pooling**: Reused HTTP connections

### Robots.txt Compliance

- **Automatic Parsing**: Fetches and caches robots.txt

- **Crawl Delays**: Respects specified delays

- **Sitemap Discovery**: Extracts sitemap URLs for better crawling

- **User-Agent Specific**: Follows rules for GolamV2-Crawler/1.0

## Configuration

### Environment Variables

```bash

export GOLAMV2_DB_PATH="./golamv2_data"

export GOLAMV2_USER_AGENT="GolamV2-Crawler/1.0"

export GOLAMV2_RATE_LIMIT="10"

```

### Memory Allocation

- **70%**: URL storage and processing

- **30%**: Results storage and caching

## Troubleshooting

### Common Issues

1. **Memory Usage Too High**

   - Reduce `--workers` count

   - Lower `--memory` limit

   - Reduce crawling `--depth`

2. **Slow Performance**

   - Increase `--workers` count

   - Check network connectivity

   - Monitor robots.txt delays

3. **Database Issues**

   - Ensure sufficient disk space

   - Check file permissions

   - Restart application for corruption

### Performance Tuning

1. **For High-Memory Systems**

   ```bash

   ./golamv2 --workers 100 --memory 800 --url https://example.com

   ```

2. **For Low-Memory Systems**

   ```bash

   ./golamv2 --workers 20 --memory 200 --url https://example.com

   ```

3. **For Maximum Throughput**

   ```bash

   ./golamv2 --workers 200 --memory 400 --depth 3 --url https://example.com

   ```

## License

MIT License - see LICENSE file for details.

## Support

For issues, questions, or contributions, please use the GitHub issue tracker or mailto:golam@benar.me
ecosyste.ms

Data

Tools

Indexes

Applications

Experiments

Awesome

https://github.com/nobrainghost/golamv2

Awesome Lists containing this project

README