{"id":28205237,"url":"https://github.com/cripterhack/business-address-scrapper","last_synced_at":"2025-06-14T12:30:42.593Z","repository":{"id":275060502,"uuid":"922013317","full_name":"CripterHack/business-address-scrapper","owner":"CripterHack","description":"Python+Scrapy - Distributed scraping system with cache for business information extraction.","archived":false,"fork":false,"pushed_at":"2025-02-11T02:02:27.000Z","size":272,"stargazers_count":0,"open_issues_count":0,"forks_count":0,"subscribers_count":1,"default_branch":"master","last_synced_at":"2025-05-17T08:13:20.920Z","etag":null,"topics":["cuda","ollama","postgresql","python","redis","scraper","scraping","scrapy","tesseract"],"latest_commit_sha":null,"homepage":"","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/CripterHack.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2025-01-25T04:53:32.000Z","updated_at":"2025-03-25T08:23:22.000Z","dependencies_parsed_at":"2025-01-31T00:24:56.107Z","dependency_job_id":"c4c48bfa-dad2-4f80-8566-569527d29e08","html_url":"https://github.com/CripterHack/business-address-scrapper","commit_stats":null,"previous_names":["cripterhack/business-address-scrapper"],"tags_count":0,"template":false,"template_full_name":null,"purl":"pkg:github/CripterHack/business-address-scrapper","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/CripterHack%2Fbusiness-address-scrapper","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/CripterHack%2Fbusiness-address-scrapper/tags","releases_url":"h
ttps://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/CripterHack%2Fbusiness-address-scrapper/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/CripterHack%2Fbusiness-address-scrapper/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/CripterHack","download_url":"https://codeload.github.com/CripterHack/business-address-scrapper/tar.gz/refs/heads/master","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/CripterHack%2Fbusiness-address-scrapper/sbom","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":259816131,"owners_count":22915815,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["cuda","ollama","postgresql","python","redis","scraper","scraping","scrapy","tesseract"],"created_at":"2025-05-17T08:14:01.349Z","updated_at":"2025-06-14T12:30:42.577Z","avatar_url":"https://github.com/CripterHack.png","language":"Python","funding_links":[],"categories":[],"sub_categories":[],"readme":"# Business Address Scraper\n\nDistributed scraping system with cache for business information extraction.\n\n## Main Features\n\n### Base System\n- Scalable distributed architecture\n- Configurable multi-threaded processing\n- Advanced logging system with customizable levels\n- Intelligent error handling and automatic recovery\n- Efficient system resource management\n\n### Distributed Cache\n- Support for multiple backends (Redis, Memcached)\n- Configurable compression and encryption\n- Policy-based automatic cleanup system\n- Configurable replication and consistency\n- 
Intelligent memory and space management\n\n### Alert System\n- Real-time monitoring of critical events\n- Configurable severity levels\n- Detailed alert history with metadata\n- Detection and grouping of duplicate alerts\n- Integration with metrics system\n\n### Metrics and Monitoring\n- Automatic system metrics collection\n- Performance and resource monitoring\n- Detailed operation statistics\n- Configurable log rotation system\n- Standard format metrics export\n\n### Security\n- Configurable authentication system\n- Protection against brute force attacks\n- Token and session management\n- Sensitive data encryption\n- Configurable access policies\n\n### Advanced Processing\n- OCR Integration (Tesseract)\n- AI capabilities with LLaMA model\n- Parallel data processing\n- Configurable extraction pipeline\n- Data validation and cleaning\n\n### Resource Management\n- Automatic temporary resource cleanup\n- Configurable backup management\n- CPU and memory usage control\n- Disk space monitoring\n- Automatic failure recovery\n\n## Project Structure\n\n```\nscraper/\n├── __init__.py\n├── alerts/\n│   ├── __init__.py\n│   ├── manager.py\n│   ├── handlers.py\n│   └── metrics.py\n├── cache/\n│   ├── __init__.py\n│   ├── distributed.py\n│   ├── cleaner.py\n│   ├── compression.py\n│   ├── encryption.py\n│   └── priority.py\n├── core/\n│   ├── __init__.py\n│   ├── config.py\n│   ├── logging.py\n│   ├── metrics.py\n│   └── utils.py\n├── db/\n│   ├── __init__.py\n│   ├── models.py\n│   ├── session.py\n│   └── operations.py\n├── extractors/\n│   ├── __init__.py\n│   ├── base.py\n│   ├── text.py\n│   ├── ocr.py\n│   └── ai.py\n├── monitor/\n│   ├── __init__.py\n│   ├── system.py\n│   ├── resources.py\n│   └── alerts.py\n├── security/\n│   ├── __init__.py\n│   ├── auth.py\n│   ├── encryption.py\n│   └── tokens.py\n└── utils/\n    ├── __init__.py\n    ├── validation.py\n    ├── formatting.py\n    └── helpers.py\n\nconfig/\n├── logging.yaml\n├── cache.yaml\n├── alerts.yaml\n├── 
metrics.yaml\n└── security.yaml\n\ntests/\n├── unit/\n├── integration/\n└── performance/\n\ndocs/\n├── api/\n├── setup/\n└── examples/\n```\n\n### Distributed Cache System\n\n- **Authentication**: Role and token-based access control\n- **Compression**: Automatic compression based on data type and size\n- **Encryption**: Transparent sensitive data encryption\n- **Events**: Pub/sub system for monitoring and reaction\n- **Partitioning**: Consistent data distribution\n- **Replication**: Redundant copies for high availability\n- **Circuit Breakers**: Protection against cascade failures\n- **Cleanup**: Automatic data aging management\n- **Error Handling**: Unified system with:\n  - Detailed logging\n  - Error metrics\n  - Automatic notifications\n  - Intelligent recovery\n- **Resource Management**:\n  - Automatic connection closure\n  - Resource cleanup\n  - Context managers\n  - Lifecycle management\n- **Statistics**:\n  - Node performance\n  - Resource usage\n  - Operations by type\n  - Temporal analysis\n\n### Event System\n\nThe system uses a centralized event manager to monitor and react to different situations:\n\n#### Event Types\n\n- **Critical** (High Priority):\n  - Errors\n  - Node failures\n  - Recovery/migration failures\n  \n- **Operational** (Medium Priority):\n  - Warnings\n  - Migrations\n  - Rebalancing\n  - Backups/Restorations\n  \n- **Informational** (Low Priority):\n  - GET/SET operations\n  - Informational logs\n  - Metrics\n\n### Alert System\n\n- **Configuration**:\n  - Customizable thresholds by alert type\n  - Configurable severity levels\n  - Related alert grouping\n  - Configurable duplication windows\n  \n- **Monitoring**:\n  - Detailed alert history\n  - Severity statistics\n  - Filtering and search\n  - Alert metrics\n  - Automatic history cleanup\n  \n- **Notifications**:\n  - System event integration\n  - Similar alert aggregation\n  - Alert storm prevention\n  - Duplicate detection\n  - Silence windows\n\n- **Resource Management**:\n  - 
Automatic periodic cleanup\n  - Memory management\n  - Context managers\n  - Orderly shutdown\n\n- **Statistics**:\n  - Period summaries\n  - Severity distribution\n  - Trend analysis\n  - Duplication metrics\n  - Cleanup efficiency\n\n### Monitoring System\n\n- **Real-time Metrics**:\n  - Operation latency\n  - Success/error rates\n  - Resource usage\n  - Node statistics\n  - Access patterns\n  \n- **Configurable Alerts**:\n  - Dynamic thresholds\n  - Event correlation\n  - Trend analysis\n  \n- **Reports**:\n  - Historical performance\n  - Error analysis\n  - Resource usage\n  - Access patterns\n  - Periodic summaries\n\n## Installation\n\n### Prerequisites\n- Python 3.8+\n- Redis 6.0+ or Memcached 1.6+\n- PostgreSQL 12+ (optional)\n- Tesseract 4.1+ (optional for OCR)\n- CUDA 11.0+ (optional for AI)\n\n### Basic Installation\n```bash\n# Create virtual environment\npython -m venv venv\nsource venv/bin/activate  # Linux/Mac\n.\\venv\\Scripts\\activate   # Windows\n\n# Install dependencies\npip install -r requirements.txt\n\n# Initial setup\npython setup.py install\n```\n\n### Installation with Optional Features\n```bash\n# OCR\npip install -r requirements-ocr.txt\n\n# AI\npip install -r requirements-ai.txt\n\n# Database\npip install -r requirements-db.txt\n```\n\n## Configuration\n\n### Basic Configuration\n1. Copy example files:\n```bash\n# Copy each example file to its .yaml name (cp cannot rename through a glob)\nfor f in config/*.yaml.example; do cp \"$f\" \"${f%.example}\"; done\n```\n\n2. Configure environment variables:\n```bash\ncp .env.example .env\n# Edit .env with your values\n```\n\n### Advanced Configuration\n\n#### Cache\n1. Choose backend (Redis/Memcached)\n2. Configure parameters in `config/cache.yaml`\n3. Adjust related environment variables\n\n#### Alert System\n1. Define severity levels\n2. Configure thresholds in `config/alerts.yaml`\n3. Set notification policies\n\n#### Metrics\n1. Enable metrics collection\n2. Configure intervals in `config/metrics.yaml`\n3. Define log rotation policies\n\n#### Security\n1. Generate encryption keys\n2. 
Configure policies in `config/security.yaml`\n3. Set authentication parameters\n\n## Usage\n\n### Start the System\n```bash\n# Start the web interface\nstreamlit run app.py\n\n# Run the scraper only\npython run_scraper.py\n```\n\n### Monitoring\n```bash\n# View real-time metrics\npython -m scraper.monitor metrics\n\n# View system status\npython -m scraper.monitor status\n\n# View active alerts\npython -m scraper.monitor alerts\n```\n\n### Maintenance\n```bash\n# Clean cache\npython -m scraper.cache clean\n\n# Rotate logs\npython -m scraper.utils rotate-logs\n\n# Data backup\npython -m scraper.utils backup\n```\n\n## Tests\n\n### Run Tests\n```bash\n# Unit tests\npython -m pytest tests/unit\n\n# Integration tests\npython -m pytest tests/integration\n\n# Performance tests\npython -m pytest tests/performance\n\n# All tests with coverage\npython -m pytest --cov=scraper tests/\n```\n\n### Specific Tests\n```bash\n# Cache system tests\npython -m pytest tests/unit/test_cache.py\n\n# Alert system tests\npython -m pytest tests/unit/test_alerts.py\n\n# Cache performance tests\npython -m pytest tests/performance/test_cache_performance.py\n```\n\n### Code Analysis\n```bash\n# Static analysis\nflake8 scraper\n\n# Type checking\nmypy scraper\n\n# Code formatting\nblack scraper\n```\n\n## Contributing\n\n### Contribution Guide\n\n1. Fork the repository\n2. Create a branch for your feature: `git checkout -b feature/feature-name`\n3. Implement your changes following style guides\n4. Ensure all tests pass\n5. Update documentation if necessary\n6. Create a pull request\n\n### Code Standards\n\n- Follow PEP 8 for Python code style\n- Document all functions and classes with docstrings\n- Maintain test coverage \u003e 80%\n- Use type hints in all functions\n- Maintain cyclomatic complexity \u003c 10\n\n### Development Flow\n\n1. Create issue describing the change\n2. Discuss implementation in the issue\n3. Implement changes in a branch\n4. Run complete test suite\n5. 
Create pull request\n6. Code review and approval\n7. Merge to main\n\n### Report Bugs\n\n- Use GitHub's issue system\n- Include steps to reproduce\n- Attach relevant logs\n- Specify system version\n- Describe expected vs actual behavior\n\n## License\n\nThis project is licensed under the MIT License - see the [LICENSE](LICENSE) file for details.\n\n## Contact and Support\n\n### Communication Channels\n- **GitHub Issues**: For bug reports and feature requests\n- **Discussions**: For general questions and discussions\n- **Wiki**: For extended documentation and guides\n\n### Additional Resources\n- [API Documentation](docs/api/README.md)\n- [Development Guide](docs/development.md)\n- [Usage Examples](docs/examples/README.md)\n- [Troubleshooting Guide](docs/troubleshooting.md)\n\n### Maintainers\n- Keep code updated\n- Review pull requests\n- Respond to issues\n- Update documentation\n\n---\n**Note**: This project is in active development. Contributions are welcome.\n\n## Independent Simple Scraper Execution\n\n### Minimum Requirements for Simple Scraper\n- Python 3.8+\n- Google Chrome Browser\n- Git\n\n### Basic Installation (Windows/Linux/Mac)\n\n1. Clone the repository:\n```bash\ngit clone \u003crepository-url\u003e\ncd business-address-scrapper\n```\n\n2. Create and activate virtual environment:\n\nWindows:\n```powershell\npython -m venv venv\n.\\venv\\Scripts\\activate\n```\n\nLinux/Mac:\n```bash\npython -m venv venv\nsource venv/bin/activate\n```\n\n3. Install basic dependencies:\n```bash\npip install -r requirements.txt\n```\n\n4. Configure environment variables:\n```bash\n# Windows\ncopy .env.example .env\n\n# Linux/Mac\ncp .env.example .env\n```\n\n### Using the Simple Scraper\n\n1. Prepare input CSV file with business names in the first column\n\n2. Run the scraper:\n```bash\npython simple_scraper.py input.csv output.csv\n```\n\n3. 
Additional options:\n```bash\npython simple_scraper.py --input input.csv --output output.csv --retries 3 --wait 5\n```\n\n### Simple Scraper Configuration\n\nThe scraper can run in two modes:\n1. **Local Mode**: Uses local Chrome and webdriver-manager\n2. **Container Mode**: Uses pre-configured Chrome and ChromeDriver\n\nTo configure the mode:\n1. Edit `.env`:\n```env\n# Execution mode\nEXECUTION_ENV=local  # or 'container'\n\n# Browser settings\nCHROME_BINARY_PATH=  # Leave empty for local\nCHROME_DRIVER_PATH=  # Leave empty for local\nHEADLESS_MODE=false  # true/false\n```\n\n### Simple Scraper Troubleshooting\n\n1. Chrome/ChromeDriver Issues:\n   - Ensure Chrome is installed\n   - Update Chrome to latest version\n   - Clear browser cache/cookies\n\n2. Permission Issues:\n   - Verify write permissions in output directory\n   - Run with appropriate privileges\n\n3. Resource Issues:\n   - Increase system memory allocation\n   - Adjust scraping delays in .env\n\n4. Simple Scraper Logs:\n```bash\n# View recent logs\ntail -f logs/scraper.log\n```","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fcripterhack%2Fbusiness-address-scrapper","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fcripterhack%2Fbusiness-address-scrapper","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fcripterhack%2Fbusiness-address-scrapper/lists"}