{"id":28546242,"url":"https://github.com/nobrainghost/golamv2","last_synced_at":"2025-06-16T05:01:58.468Z","repository":{"id":298154230,"uuid":"998996978","full_name":"nobrainghost/golamv2","owner":"nobrainghost","description":"Lightweight Web Crawler for Emails,Keywords,Deadlinks,Dead Domains written in Go. Suitable for low resource environments","archived":false,"fork":false,"pushed_at":"2025-06-09T17:02:12.000Z","size":10999,"stargazers_count":2,"open_issues_count":0,"forks_count":0,"subscribers_count":0,"default_branch":"main","last_synced_at":"2025-06-09T18:20:43.067Z","etag":null,"topics":["golang","webcrawler","webcrawling"],"latest_commit_sha":null,"homepage":"","language":"Go","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":null,"status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/nobrainghost.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":null,"code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null,"zenodo":null}},"created_at":"2025-06-09T15:21:44.000Z","updated_at":"2025-06-09T17:03:39.000Z","dependencies_parsed_at":"2025-06-09T18:33:15.422Z","dependency_job_id":null,"html_url":"https://github.com/nobrainghost/golamv2","commit_stats":null,"previous_names":["nobrainghost/golamv2"],"tags_count":0,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/nobrainghost%2Fgolamv2","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/nobrainghost%2Fgolamv2/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/nobrainghost%2Fgolamv2/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/nobrainghost%2Fgolamv2/manifests","owner_url
":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/nobrainghost","download_url":"https://codeload.github.com/nobrainghost/golamv2/tar.gz/refs/heads/main","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/nobrainghost%2Fgolamv2/sbom","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":258971326,"owners_count":22786066,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["golang","webcrawler","webcrawling"],"created_at":"2025-06-09T23:09:08.719Z","updated_at":"2025-06-14T03:01:33.015Z","avatar_url":"https://github.com/nobrainghost.png","language":"Go","funding_links":[],"categories":[],"sub_categories":[],"readme":"# GolamV2 - A LightWeight Web Crawler for Emails/Keywords/Dead Backlinks/Dead Domains\n\nGolamV2 is a high-performance, low-memory web crawler designed for maximum throughput in resource-constrained environments. It supports multiple hunting modes including email extraction, keyword searching, and dead link detection. It is a rewrite of the Python Version Gollum Spyder [here](https://github.com/nobrainghost/Keyword-Web-Crawler). 
It includes a custom interactive CLI explorer for its BadgerDB database.\n\n## Features\n\n- **Multi-Purpose Crawling**: Email hunting, keyword searching, dead link detection\n- **Memory Efficiency**: Runs with decent throughput in low-resource environments\n- **Robots.txt Compliant**: Respects robots.txt and crawl delays\n- **Real-time Dashboard**: Web-based monitoring interface\n- **Interactive CLI Explorer**: Comprehensive data exploration and analysis tool\n- **Clean Architecture**: Modular, maintainable codebase\n- **Efficient Storage**: BadgerDB for persistent storage\n- **Bloom Filter**: Memory-efficient duplicate URL detection\n- **Priority Queue**: Smart URL queuing with database fallback\n\n## Architecture\n\n### Core Components\n\n1. **URL Queue**: Priority-based queue (100k URL limit) with automatic spilling to and refilling from the database\n2. **Bloom Filter**: Deduplicates URLs\n3. **Storage Layer**: BadgerDB for persistent URL and result storage\n4. **Worker Pool**: Configurable concurrent workers\n5. **Content Extractor**:\n6. **Robots Checker**: Compliant robots.txt parsing and enforcement. Also parses sitemaps\n7. 
**Dashboard**: Real-time web interface for monitoring\n\n## Installation\n\n```bash\n# Clone the repository\ngit clone https://github.com/nobrainghost/golamv2\ncd golamv2\n\n# Install dependencies\ngo mod tidy\n\n# Build the application\ngo build -o golamv2 main.go\n```\n\n## Usage\n\n### Basic Email Hunting\n```bash\n./golamv2 --email --url https://example.com --workers 25\n```\n\n### Keyword Searching\n```bash\n./golamv2 --keywords \"password,login,admin\" --url https://example.com --workers 30\n```\n\n### Dead Link Detection\n```bash\n./golamv2 --domains --url https://example.com --workers 20\n```\n\n### All-in-One Mode\n```bash\n./golamv2 --email --domains --keywords \"smeagol,ring\" --url https://example.com --workers 40\n```\n\n### Data Exploration\n```bash\n# Explore crawl data interactively\n./golamv2 explore\n\n# Explore with custom data directory\n./golamv2 explore --data /path/to/data\n```\n\n### Advanced Options\n```bash\n./golamv2 \\\n  --email \\\n  --url https://example.com \\\n  --workers 50 \\\n  --memory 400 \\\n  --depth 5 \\\n  --dashboard 8080\n```\n\n## Command Line Options\n\n| Flag | Description | Default |\n|------|-------------|---------|\n| `--email` | Hunt for email addresses | false |\n| `--domains` | Hunt for dead URLs and domains | false |\n| `--keywords` | Hunt for specific keywords (comma-separated) | [] |\n| `--url` | Starting URL to crawl (required) | - |\n| `--workers` | Maximum number of concurrent workers | 50 |\n| `--memory` | Maximum memory usage in MB | 500 |\n| `--depth` | Maximum crawling depth | 5 |\n| `--dashboard` | Dashboard port | 8080 |\n\n## Dashboard\n\nAccess the real-time dashboard at `http://localhost:8080` (or your specified port). 
The `/db` paths currently don't work.\n\n### Dashboard Features\n\n- **Real-time Metrics**: Live updates via WebSocket\n- **Performance Monitoring**: URLs/second, memory usage, uptime\n- **Queue Status**: URLs in queue, database, active workers\n- **Findings Summary**: Emails, keywords, dead links found\n- **Success Rate**: Error tracking and success percentage\n\n\n## CLI Data Explorer\n\nGolamV2 includes an interactive CLI tool for exploring and analyzing crawl data stored in its BadgerDB databases.\n\n### Starting the Explorer\n\n```bash\n# Use default data directory (golamv2_data)\n./golamv2 explore\n\n# Specify custom data directory\n./golamv2 explore --data /path/to/data\n\n# With output file for exports\n./golamv2 explore --output results.json\n```\n\n### Available Commands\n\n| Command | Description | Example |\n|---------|-------------|---------|\n| `help` | Show all available commands | `help` |\n| `stats` | Display database statistics | `stats` |\n| `urls [limit]` | List URLs (default: 10) | `urls 20` |\n| `results [limit]` | List crawl results (default: 10) | `results 50` |\n| `search \u003cterm\u003e` | Search in results content | `search \"admin panel\"` |\n| `emails [limit]` | Show found emails | `emails 25` |\n| `keywords [limit]` | Show found keywords | `keywords 15` |\n| `deadlinks [limit]` | Show dead links found | `deadlinks 30` |\n| `export \u003ctype\u003e` | Export data to JSON | `export emails` |\n| `raw \u003ckey\u003e` | Show raw data for specific key | `raw url:example.com` |\n| `analyze` | Detailed data analysis | `analyze` |\n| `timeline` | Show crawling timeline | `timeline` |\n| `domains` | Show domain statistics | `domains` |\n| `clear` | Clear terminal screen | `clear` |\n| `quit/exit` | Exit explorer | `quit` |\n\n### Explorer Features\n\n#### Data Search and Filtering\n- Full-text search across all results\n- Search in titles, content, emails, and keywords\n- Filter by status, domain, or content type\n\n#### Export Capabilities\n- 
Export URLs, results, emails, or keywords to JSON\n- Configurable output files\n- Data formatting for further analysis\n\n**Note:** Export is not fully tested.\n\n#### Advanced Analysis\n- Domain-based statistics and analysis\n- Timeline visualization of crawling activity\n- Success rate analysis by domain\n- Error categorization and reporting\n- Performance metrics and trends\n\n### Example Explorer Session\n\n```bash\n$ ./golamv2 explore\n🕸️  GolamV2 Data Explorer\n========================\nInteractive tool to explore crawl data\nData path: golamv2_data\n\ngolamv2\u003e stats\n Database Statistics\n=====================\nURLs in database:      2,767\nResults in database:   37,635\nEmails found:          118,613\nKeywords found:        1,258\nDead links found:      20,422\nErrors encountered:    226\nSuccess rate:          99.4%\n\ngolamv2\u003e search \"login\"\n Search Results for \"login\":\n=============================\nFound 45 results containing \"login\"\n- example.com/admin - Admin Login Portal\n- test.org/user - User Login Page\n...\n\ngolamv2\u003e export emails\n Exporting emails...\nExported 118,613 emails to emails_export.json\n\ngolamv2\u003e quit\nGoodbye! 
👋\n```\n\n### Command Line Options\n\n| Flag | Short | Description | Default |\n|------|-------|-------------|---------|\n| `--data` | `-d` | Path to GolamV2 data directory | `golamv2_data` |\n| `--output` | `-o` | Output file for exports | (none) |\n\n## Database Storage\n\n### URL Database (`urls/`)\n- Stores pending URLs for crawling\n- Automatic queue refilling when the in-memory queue is \u003c40% full\n- Optimized for fast retrieval and batch operations\n\n### Results Database (`finds_*`)\n- Stores crawling results based on mode:\n  - `finds_email`: Email hunting results\n  - `finds_keywords`: Keyword search results\n  - `finds_domains`: Dead link detection results\n  - `finds`: All-mode results\n\n## Performance Optimization\n\n### Memory Management\n- **Bloom Filter**: 10M URL capacity, 1% false positive rate\n- **Priority Queue**: 100k URL limit with smart refilling\n- **BadgerDB**: Tuned for low memory; the limits can be raised to suit your environment\n- **HTTP Responses**: 10MB size limit to prevent memory exhaustion\n\n### Throughput Optimization\n- **Worker Pool**: Configurable concurrent processing\n- **Rate Limiting**: Respectful crawling (10 req/sec default)\n- **Batch Operations**: Efficient database operations\n- **Connection Pooling**: Reused HTTP connections\n\n### Robots.txt Compliance\n- **Automatic Parsing**: Fetches and caches robots.txt\n- **Crawl Delays**: Respects specified delays\n- **Sitemap Discovery**: Extracts sitemap URLs for better crawling\n- **User-Agent Specific**: Follows rules for GolamV2-Crawler/1.0\n\n## Configuration\n\n### Environment Variables\n```bash\nexport GOLAMV2_DB_PATH=\"./golamv2_data\"\nexport GOLAMV2_USER_AGENT=\"GolamV2-Crawler/1.0\"\nexport GOLAMV2_RATE_LIMIT=\"10\"\n```\n\n### Memory Allocation\n- **70%**: URL storage and processing\n- **30%**: Results storage and caching\n\n## Troubleshooting\n\n### Common Issues\n\n1. 
**Memory Usage Too High**\n   - Reduce `--workers` count\n   - Lower `--memory` limit\n   - Reduce crawling `--depth`\n\n2. **Slow Performance**\n   - Increase `--workers` count\n   - Check network connectivity\n   - Monitor robots.txt delays\n\n3. **Database Issues**\n   - Ensure sufficient disk space\n   - Check file permissions\n   - Restart the application if the database appears corrupted\n\n### Performance Tuning\n\n1. **For High-Memory Systems**\n   ```bash\n   ./golamv2 --workers 100 --memory 800 --url https://example.com\n   ```\n\n2. **For Low-Memory Systems**\n   ```bash\n   ./golamv2 --workers 20 --memory 200 --url https://example.com\n   ```\n\n3. **For Maximum Throughput**\n   ```bash\n   ./golamv2 --workers 200 --memory 400 --depth 3 --url https://example.com\n   ```\n\n## License\n\nMIT License - see LICENSE file for details.\n\n## Support\n\nFor issues, questions, or contributions, please use the GitHub issue tracker or email golam@benar.me.\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fnobrainghost%2Fgolamv2","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fnobrainghost%2Fgolamv2","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fnobrainghost%2Fgolamv2/lists"}