An open API service indexing awesome lists of open source software.

https://github.com/wzdnzd/harvester

Intelligent data acquisition framework for GitHub and web sources
https://github.com/wzdnzd/harvester

ai anthropic deepseek gemini openai qwen

Last synced: about 2 months ago
JSON representation

Intelligent data acquisition framework for GitHub and web sources

Awesome Lists containing this project

README

          

# Harvester - Universal Data Acquisition Framework

**πŸ“– [δΈ­ζ–‡ζ–‡ζ‘£](README.zh-CN.md) | English | πŸ”— [More Tools](https://github.com/wzdnzd/ai-collector)**

A universal, adaptive data acquisition framework designed for comprehensive information acquisition from multiple sources including GitHub, network mapping platforms (FOFA, Shodan), and arbitrary web endpoints. While the current implementation focuses on AI service provider key discovery as a practical example, the framework is architected for extensibility to support diverse data acquisition scenarios.

---

⭐⭐⭐ **If this project helps you, please give it a star!** Your support motivates us to keep improving and adding new features.

---

## Table of Contents

- [Key Features](#key-features)
- [Quick Start](#quick-start)
- [Architecture](#architecture)
- [Directory Structure](#directory-structure)
- [Troubleshooting](#troubleshooting)
- [Contributing](#contributing)

## Project Goals

The system aims to build a **universal data acquisition framework** primarily targeting:

- **GitHub**: Code repositories, issues, commits, and API endpoints
- **Network Mapping Platforms**:
- [FOFA](https://fofa.info) - Cyberspace mapping and asset discovery
- [Shodan](https://www.shodan.io/) - Internet-connected device search engine
- **Arbitrary Web Endpoints**: Custom APIs, web services, and data sources
- **Extensible Architecture**: Plugin-based system for easy integration of new data sources

## Current Data Source Support

| Data Source | Status | Description |
| ----------- | ------------- | --------------------------------------- |
| GitHub API | βœ… Implemented | Full API integration with rate limiting |
| GitHub Web | βœ… Implemented | Web scraping with intelligent parsing |
| FOFA | 🚧 Planned | Cyberspace asset discovery integration |
| Shodan | 🚧 Planned | IoT and network device enumeration |
| Custom APIs | 🚧 Planned | Generic REST/GraphQL API adapter |

## Architecture

### Layered Architecture

```mermaid
graph TB
%% Entry Layer
subgraph Entry["Entry Layer"]
CLI["CLI Interface
(main.py)"]
App["Application Core
(main.py)"]
end

%% Management Layer
subgraph Management["Management Layer"]
TaskMgr["Task Manager
(manager/task.py)"]
Pipeline["Pipeline Manager
(manager/pipeline.py)"]
WorkerMgr["Worker Manager
(manager/worker.py)"]
QueueMgr["Queue Manager
(manager/queue.py)"]
StatusMgr["Status Manager
(manager/status.py)"]
Shutdown["Shutdown Coordinator
(manager/shutdown.py)"]
end

%% Processing Layer
subgraph Processing["Processing Layer"]
StageBase["Stage Framework
(stage/base.py)"]
StageImpl["Stage Implementations
(stage/definition.py)"]
StageReg["Stage Registry
(stage/registry.py)"]
StageFactory["Stage Factory
(stage/factory.py)"]
StageResolver["Dependency Resolver
(stage/resolver.py)"]
end

%% Service Layer
subgraph Service["Service Layer"]
SearchSvc["Search Service
(search/client.py)"]
SearchProviders["Search Providers
(search/provider/)"]
RefineSvc["Query Refinement
(refine/)"]
RefineEngine["Refine Engine
(refine/engine.py)"]
RefineOptimizer["Query Optimizer
(refine/optimizer.py)"]
end

%% Core Domain Layer
subgraph Core["Core Domain Layer"]
Models["Domain Models & Tasks
(core/models.py)"]
Types["Type System
(core/types.py)"]
Enums["Enumerations
(core/enums.py)"]
Metrics["Metrics
(core/metrics.py)"]
Auth["Authentication
(core/auth.py)"]
end

%% Infrastructure Layer
subgraph Infrastructure["Infrastructure Layer"]
Config["Configuration
(config/)"]
Tools["Tools & Utilities
(tools/)"]
Constants["Constants
(constant/)"]
Storage["Storage & Persistence
(storage/)"]
end

%% State Management Layer
subgraph StateLayer["State Management Layer"]
StateCollector["State Collector
(state/collector.py)"]
StateDisplay["Display Engine
(state/display.py)"]
StateBuilder["Status Builder
(state/builder.py)"]
StateModels["State Models
(state/models.py)"]
StateMonitor["State Monitor
(state/monitor.py)"]
StateEnums["State Enums
(state/enums.py)"]
StateTypes["State Types
(state/types.py)"]
end

%% External Systems
subgraph External["External Systems"]
GitHub["GitHub
(API + Web)"]
AIServices["AI Service
Providers"]
FileSystem["File System
(Local Storage)"]
end

%% Dependencies (Top-down)
Entry --> Management
Management --> Processing
Processing --> Service
Service --> Core

%% Infrastructure dependencies
Entry -.-> Infrastructure
Management -.-> Infrastructure
Processing -.-> Infrastructure
Service -.-> Infrastructure
Core -.-> Infrastructure

%% State management dependencies
Entry -.-> StateLayer
Management -.-> StateLayer

%% External dependencies
Service --> External
Infrastructure --> External
```

### System Architecture Overview

```mermaid
graph TB
%% User Interface Layer
subgraph UserLayer["User Interface Layer"]
User[User]
CLI[Command Line Interface]
ConfigMgmt[Configuration Management]
end

%% Application Management Layer
subgraph AppLayer["Application Management Layer"]
MainApp[Main Application]
TaskManager[Task Manager]
StatusManager[Status Manager]
ResourceManager[Resource Manager]
ShutdownManager[Shutdown Manager]
end

%% Core Pipeline Engine
subgraph PipelineCore["Pipeline Engine"]
%% Stage Management System
subgraph StageSystem["Stage Management System"]
StageRegistry[Stage Registry]
DependencyResolver[Dependency Resolver]
StageFactory[Stage Factory]
end

%% Queue Management System
subgraph QueueSystem["Queue Management System"]
QueueManager[Queue Manager]
WorkerManager[Worker Manager]
MonitoringSystem[System Monitor]
end

%% Processing Stages
subgraph ProcessingStages["Processing Stages"]
SearchStage[Search Stage]
GatherStage[Gather Stage]
CheckStage[Check Stage]
InspectStage[Inspect Stage]
end
end

%% Search Provider Ecosystem
subgraph ProviderEcosystem["Search Provider Ecosystem"]
ProviderRegistry[Provider Registry]
BaseProvider[Base Provider]
OpenAIProvider[OpenAI-like Provider]
CustomProviders[Custom Providers]
end

%% Advanced Processing Engines
subgraph ProcessingEngines["Processing Engines"]
SearchClient[Search Client]

%% Query Optimization Engine
subgraph QueryOptimizer["Query Optimization Engine"]
RefineEngine[Refine Engine]
RegexParser[Regex Parser]
SplittabilityAnalyzer[Splittability Analyzer]
EnumerationOptimizer[Enumeration Optimizer]
QueryGenerator[Query Generator]
OptimizationStrategies[Optimization Strategies]

%% Internal Flow
RefineEngine --> RegexParser
RegexParser --> SplittabilityAnalyzer
SplittabilityAnalyzer --> EnumerationOptimizer
EnumerationOptimizer --> OptimizationStrategies
OptimizationStrategies --> QueryGenerator
end

ValidationEngine[API Key Validation]
RecoveryEngine[Task Recovery]
end

%% State & Data Management
subgraph StateManagement["State & Data Management"]
StateCollector[State Collector]
DisplayEngine[Display Engine]
StatusBuilder[Status Builder]
StateMonitor[State Monitor]
PersistenceLayer[Persistence Layer]
SnapshotManager[Snapshot Manager]
ResultManager[Result Manager]
end

%% Infrastructure Services
subgraph Infrastructure["Infrastructure Services"]
RateLimiting[Rate Limiting]
CredentialMgmt[Credential Management]
AgentRotation[User Agent Rotation]
LoggingSystem[Logging System]
RetryFramework[Retry Framework]
ResourcePool[Resource Pool]
end

%% External Systems
subgraph External["External Systems"]
GitHubAPI[GitHub API]
GitHubWeb[GitHub Web Interface]
AIServiceAPIs[AI Service APIs]
FileSystem[Local File System]
end

%% User Interactions
User --> CLI
User --> ConfigMgmt
CLI --> MainApp
ConfigMgmt --> MainApp

%% Application Flow
MainApp --> TaskManager
MainApp --> StatusManager
MainApp --> ResourceManager
MainApp --> ShutdownManager
TaskManager --> StageRegistry
TaskManager --> QueueManager

%% Stage Management Flow
StageRegistry --> DependencyResolver
StageRegistry --> StageFactory
DependencyResolver --> ProcessingStages
StageFactory --> ProcessingStages

%% Queue Management Flow
QueueManager --> WorkerManager
QueueManager --> MonitoringSystem
WorkerManager --> ProcessingStages

%% Stage Dependencies (Pipeline)
SearchStage --> GatherStage
GatherStage --> CheckStage
CheckStage --> InspectStage

%% Processing Engine Integration
SearchStage --> SearchClient
SearchStage --> QueryOptimizer
CheckStage --> ValidationEngine
ProcessingStages --> RecoveryEngine

%% Provider Integration
SearchClient --> ProviderRegistry
ProviderRegistry --> BaseProvider
BaseProvider --> OpenAIProvider
BaseProvider --> CustomProviders

%% State Management Integration
ProcessingStages --> StateCollector
QueueManager --> StateCollector
StateCollector --> DisplayEngine
StateCollector --> StatusBuilder
StateMonitor --> DisplayEngine
ProcessingStages --> PersistenceLayer
PersistenceLayer --> SnapshotManager
PersistenceLayer --> ResultManager

%% Infrastructure Integration
SearchClient -.-> RateLimiting
ResourceManager -.-> CredentialMgmt
ResourceManager -.-> AgentRotation
MainApp -.-> LoggingSystem
ProcessingStages -.-> RetryFramework
Infrastructure -.-> ResourcePool

%% External Connections
SearchClient --> GitHubAPI
SearchClient --> GitHubWeb
ValidationEngine --> AIServiceAPIs
PersistenceLayer --> FileSystem

%% Styling
classDef userClass fill:#e3f2fd,stroke:#1976d2,stroke-width:2px
classDef appClass fill:#f3e5f5,stroke:#7b1fa2,stroke-width:2px
classDef coreClass fill:#e8f5e8,stroke:#388e3c,stroke-width:3px
classDef providerClass fill:#fff3e0,stroke:#f57c00,stroke-width:2px
classDef engineClass fill:#fce4ec,stroke:#c2185b,stroke-width:2px
classDef stateClass fill:#f1f8e9,stroke:#689f38,stroke-width:2px
classDef infraClass fill:#f5f5f5,stroke:#616161,stroke-width:2px
classDef externalClass fill:#ffebee,stroke:#d32f2f,stroke-width:2px

class User,CLI,ConfigMgmt userClass
class MainApp,TaskManager,StatusManager,ResourceManager,ShutdownManager appClass
class StageRegistry,DependencyResolver,StageFactory,QueueManager,WorkerManager,MonitoringSystem,SearchStage,GatherStage,CheckStage,InspectStage coreClass
class ProviderRegistry,BaseProvider,OpenAIProvider,CustomProviders providerClass
class SearchClient,QueryOptimizer,ValidationEngine,RecoveryEngine engineClass
class StateCollector,StateMonitor,DisplayEngine,StatusBuilder,PersistenceLayer,SnapshotManager,ResultManager stateClass
class RateLimiting,CredentialMgmt,AgentRotation,LoggingSystem,RetryFramework,ResourcePool infraClass
class GitHubAPI,GitHubWeb,AIServiceAPIs,FileSystem externalClass
```

The project follows a layered architecture with the following core components:

### Multi-Stage Processing Flow

```mermaid
sequenceDiagram
participant CLI as CLI
participant App as Application
participant TM as TaskManager
participant Pipeline as Pipeline
participant Search as SearchStage
participant Gather as GatherStage
participant Check as CheckStage
participant Inspect as InspectStage
participant Storage as Storage
participant Monitor as StatusManager

%% Initialization Phase
CLI->>App: 1. Start Application
App->>App: 2. Load Configuration
App->>TM: 3. Create TaskManager
TM->>TM: 4. Initialize Providers
TM->>Pipeline: 5. Create Pipeline
Pipeline->>Search: 6. Register SearchStage
Pipeline->>Gather: 7. Register GatherStage
Pipeline->>Check: 8. Register CheckStage
Pipeline->>Inspect: 9. Register InspectStage
App->>Monitor: 10. Start Status Manager

%% Processing Phase
loop Multi-Stage Processing
TM->>Search: 11. Submit Search Tasks
Search->>Search: 12. Query GitHub with Optimization
Search->>Gather: 13. Forward Search Results

Gather->>Gather: 14. Acquire Detailed Information
Gather->>Check: 15. Forward Extracted Keys

Check->>Check: 16. Validate API Keys
Check->>Inspect: 17. Forward Valid Keys

Inspect->>Inspect: 18. Inspect API Capabilities
Inspect->>Storage: 19. Save Results

Pipeline->>Monitor: 20. Update Status
Monitor->>App: 21. Display Progress
end

%% Recovery and Persistence
loop Background Operations
Storage->>Storage: Auto-save Results
Storage->>Storage: Create Snapshots
Pipeline->>Pipeline: Task Recovery
Monitor->>Monitor: Collect Metrics
end

%% Completion Phase
Pipeline->>Pipeline: 22. Check Completion
Pipeline->>Storage: 23. Final Persistence
Pipeline->>Monitor: 24. Final Status Report
App->>TM: 25. Graceful Shutdown
TM->>Storage: 26. Save State
```

## Architecture Layers

### 1. **Presentation Layer**
- **CLI Interface** (`main.py`): Command-line entry point with argument parsing and application lifecycle
- **Configuration System** (`config/`): YAML-based configuration management with validation and schemas

### 2. **Application Layer**
- **Application Core** (`main.py`): Main application lifecycle and orchestration
- **Task Management** (`manager/task.py`): Provider coordination and task distribution
- **Resource Coordination** (`tools/coordinator.py`): Global resource management and coordination
- **Shutdown Management** (`manager/shutdown.py`): Graceful shutdown coordination
- **Status Management** (`manager/status.py`): Application status management and coordination
- **Worker Management** (`manager/worker.py`): Worker thread management and scaling
- **Queue Management** (`manager/queue.py`): Multi-queue coordination and management

### 3. **Business Service Layer**
- **Pipeline Engine** (`manager/pipeline.py`): Multi-stage processing orchestration with DAG execution
- **Stage System** (`stage/`): Pluggable processing stages with dependency resolution and factory pattern
- **Search Service** (`search/`): GitHub code search with provider abstraction and optimization
- **Query Refinement** (`refine/`): Intelligent query optimization with strategy pattern and mathematical foundations

### 4. **Domain Layer**
- **Core Models & Tasks** (`core/models.py`): Business domain objects, data structures, and task definitions
- **Type System** (`core/types.py`): Interface definitions and contracts
- **Business Enums** (`core/enums.py`): Domain enumerations and constants
- **Metrics & Analytics** (`core/metrics.py`): Performance measurement and KPI tracking
- **Authentication** (`core/auth.py`): Authentication and authorization logic
- **Custom Exceptions** (`core/exceptions.py`): Domain-specific exception handling
- **Custom Exceptions** (`core/exceptions.py`): Domain-specific exception handling

### 5. **Infrastructure Layer**
- **Storage & Persistence** (`storage/`): Result storage, recovery, and snapshot management
- **Atomic Operations** (`storage/atomic.py`): Atomic file operations with fsync
- **Result Management** (`storage/persistence.py`): Multi-format result persistence
- **Task Recovery** (`storage/recovery.py`): Task recovery mechanisms
- **Shard Management** (`storage/shard.py`): NDJSON shard management with rotation
- **Snapshot Management** (`storage/snapshot.py`): Backup and restore functionality
- **Tools & Utilities** (`tools/`): Infrastructure tools and utilities
- **Logging System** (`tools/logger.py`): Structured logging with API key redaction
- **Rate Limiting** (`tools/ratelimit.py`): Adaptive rate control with token bucket algorithm
- **Load Balancing** (`tools/balancer.py`): Resource distribution strategies
- **Credential Management** (`tools/credential.py`): Secure credential rotation and management
- **Agent Management** (`tools/agent.py`): User-agent rotation for web scraping
- **Pattern Matching** (`tools/patterns.py`): Pattern matching utilities and helpers
- **Retry Framework** (`tools/retry.py`): Unified retry mechanisms with backoff strategies
- **Resource Pooling** (`tools/resources.py`): Resource pool management and optimization

### 6. **State Management Layer**
- **State Collection** (`state/collector.py`): System metrics gathering and aggregation
- **Display Engine** (`state/display.py`): User-friendly progress visualization and formatting
- **Status Builder** (`state/builder.py`): Status data construction and transformation
- **State Models** (`state/models.py`): Monitoring data structures and metrics
- **State Monitoring** (`state/monitor.py`): Real-time state monitoring and tracking
- **State Enumerations** (`state/enums.py`): State-related enumerations and constants
- **State Types** (`state/types.py`): State type definitions and interfaces

## Processing Stages

The system implements a **4-stage pipeline** for comprehensive data acquisition and validation:

1. **Search Stage** (`stage/definition.py:SearchStage`):
- Intelligent GitHub code search with advanced query optimization
- Multi-provider search support (API + Web)
- Query refinement using mathematical optimization algorithms
- Rate-limited search execution with adaptive throttling

2. **Gather Stage** (`stage/definition.py:GatherStage`):
- Detailed information acquisition from search results
- Content extraction and parsing
- Pattern matching for key identification
- Structured data collection and normalization

3. **Check Stage** (`stage/definition.py:CheckStage`):
- API key validation against actual service endpoints
- Authentication verification and capability testing
- Service availability and response validation
- Error handling and retry mechanisms

4. **Inspect Stage** (`stage/definition.py:InspectStage`):
- API capability inspection for validated keys
- Model enumeration and feature detection
- Service limits and quota analysis
- Comprehensive capability profiling

## Advanced Query Optimization Engine

The system features a sophisticated **Query Optimization Engine** with mathematical foundations:

### Core Components

1. **Regex Parser**
- Advanced regex pattern parsing with support for complex syntax
- Handles escaped characters, character classes, and quantifiers
- Converts patterns into analyzable segment structures

2. **Splittability Analyzer**
- Mathematical analysis of pattern divisibility
- Recursive depth limiting for safety
- Value threshold analysis for optimization feasibility
- Resource cost estimation for performance control

3. **Enumeration Optimizer**
- Intelligent enumeration strategy selection
- Multi-dimensional optimization (depth, breadth, value)
- Combinatorial analysis for optimal segment selection
- Topological sorting for dependency resolution

4. **Query Generator**
- Generates optimized query variants from enumeration strategies
- Supports configurable enumeration depth
- Produces mathematically optimal query distributions
- Maintains query semantic equivalence

### Optimization Algorithms

- **Mathematical Modeling**: Uses mathematical principles to analyze regex patterns
- **Enumeration Strategy**: Intelligent selection of optimal enumeration depth and combinations
- **Resource Management**: Prevents resource exhaustion through intelligent limiting
- **Performance Optimization**: Singleton pattern ensures optimal memory usage

## Supported Data Sources & Use Cases

### πŸ” Current Implementation (AI Service Discovery)
- **OpenAI and compatible interfaces**
- **Anthropic Claude**
- **Azure OpenAI**
- **Google Gemini**
- **AWS Bedrock**
- **GooeyAI**
- **Stability AI**
- **百度文心一言**
- **ζ™Ίθ°±AI**
- **Custom providers**

### 🌐 Planned Data Sources
- **[FOFA](https://fofa.info)**: Cyberspace asset discovery and network mapping
- **[Shodan](https://www.shodan.io/)**: Internet-connected device enumeration
- **Custom REST APIs**: Generic API integration framework
- **GraphQL Endpoints**: Flexible query-based data acquisition
- **Web Scraping**: JavaScript-rendered content and dynamic sites
- **Database Connectors**: Direct database query capabilities

### πŸ“Š Potential Use Cases
- **Data Mining**: Large-scale information extraction and analysis

## Key Features

### 🌐 Universal Data Acquisition
- **Multi-Source Support**: GitHub, FOFA, Shodan, and custom endpoints
- **Adaptive Query Engine**: Intelligent optimization for different data sources
- **Protocol Agnostic**: REST, GraphQL, WebSocket, and web scraping support
- **Rate Limiting**: Per-source intelligent rate control and quota management

### πŸ—οΈ Advanced Architecture
- **Dynamic Stage System**: Configurable processing pipelines with DAG execution
- **Plugin Architecture**: Extensible framework for custom data sources and processors
- **Dependency Resolution**: Automatic stage ordering and dependency management
- **Handler Registration**: Pluggable processors for flexible data transformation

### ⚑ High Performance
- **Asynchronous Processing**: Multi-threaded task execution with intelligent queuing
- **Adaptive Load Balancing**: Dynamic resource allocation based on workload
- **Query Optimization**: Mathematical modeling for optimal search strategies
- **Resource Monitoring**: Real-time performance tracking and bottleneck detection

### πŸ›‘οΈ Enterprise Ready
- **Fault Tolerance**: Comprehensive error handling, retry mechanisms, and recovery
- **State Persistence**: Queue state recovery and graceful shutdown capabilities
- **Security**: Credential management, API key redaction, and secure storage
- **Monitoring**: Real-time analytics, alerting, and performance visualization

## System Requirements

### **Dependencies**
- **Python**: 3.10+
- **Libraries**: `PyYAML`
- **Optional**: `uvloop` (Linux/macOS performance boost)
- **Development**: `pytest`, `black`, `mypy` (for contributors)

## Quick Start

> πŸ“š For comprehensive documentation, tutorials, and advanced usage guides, please visit [DeepWiki](https://deepwiki.com/wzdnzd/harvester)

1. **Installation**
```bash
git clone https://github.com/wzdnzd/harvester.git
cd harvester
pip install -r requirements.txt
```

2. **Configuration**

> Choose one of the following methods to create your configuration

**Method 1: Generate default configuration**
```bash
python main.py --create-config
```

**Method 2: Copy from examples**
```bash
# For basic configuration
cp examples/config-simple.yaml config.yaml

# For full configuration with all options
cp examples/config-full.yaml config.yaml
```

Edit the configuration file:
- Set your Github session token or API key
- Configure provider search patterns
- Adjust rate limits and thread counts

### Configuration Guide

The system provides two configuration templates:

1. **Basic Configuration** - Suitable for quick start:
```yaml
# Global application settings
global:
workspace: "./data" # Working directory
github_credentials:
sessions:
- "your_github_session_here" # GitHub session token
strategy: "round_robin" # Load balancing strategy

# Pipeline stage configuration
pipeline:
threads:
search: 1 # Search threads (keep low)
gather: 4 # Acquisition threads
check: 2 # Validation threads
inspect: 1 # API capability inspection threads

# System monitoring settings
monitoring:
update_interval: 2.0 # Monitoring update interval
error_threshold: 0.1 # Error rate threshold

# Data persistence configuration
persistence:
auto_restore: true # Auto restore state on startup
shutdown_timeout: 30 # Shutdown timeout in seconds

# Global rate limiting configuration
ratelimits:
github_web:
base_rate: 0.5 # Base rate in requests per second
burst_limit: 2 # Maximum burst size
adaptive: true # Enable adaptive rate limiting

# Provider task configurations
tasks:
- name: "openai" # Provider name
enabled: true # Enable/disable provider
provider_type: "openai"
use_api: false # Use GitHub API for searching

# Pipeline stage settings
stages:
search: true # Enable search stage
gather: true # Enable acquisition stage
check: true # Enable validation stage
inspect: true # Enable API capability inspection

# Pattern matching configuration
patterns:
key_pattern: "sk(?:-proj)?-[a-zA-Z0-9]{20}T3BlbkFJ[a-zA-Z0-9]{20}"

# Search conditions
conditions:
- query: '"T3BlbkFJ"'
```

2. **Full Configuration** - Includes all advanced options:
- `display`: Display and monitoring settings
- `global`: Global system configuration
- `pipeline`: Pipeline stage configuration
- `monitoring`: System monitoring parameters
- `persistence`: Data persistence settings
- `worker`: Worker pool configuration
- `ratelimits`: Rate limiting settings
- `tasks`: Provider task configurations

### Advanced Task Configuration

> πŸ“‹ **For complete configuration examples, please refer to:**
> - [`examples/config-full.yaml`](examples/config-full.yaml) - Comprehensive configuration with all available options
> - [`examples/config-simple.yaml`](examples/config-simple.yaml) - Basic configuration for quick start

The `tasks` section is the core of the configuration, defining what providers to search and how to process them. Refer to the basic configuration example above for a complete tasks configuration.

#### Key Configuration Options

- **`name`**: Unique identifier for the task
- **`provider_type`**: Determines validation method (`openai`, `openai_like`, `anthropic`, `gemini`, etc.)
- **`api`**: API endpoint configuration for key validation
- **`patterns.key_pattern`**: Regex pattern to identify valid API keys
- **`conditions`**: Search queries to find potential keys
- **`stages`**: Enable/disable specific processing stages
- **`extras.directory`**: Custom output directory for results

3. **Running**
```bash
python main.py # Use default config
python main.py -c custom.yaml # Use custom config
python main.py --validate # Validate config
python main.py --log-level DEBUG # Enable debug logging
```

## Directory Structure

```
harvester/
β”œβ”€β”€ config/ # Configuration management
β”‚ β”œβ”€β”€ accessor.py # Configuration access utilities
β”‚ β”œβ”€β”€ defaults.py # Default configuration values
β”‚ β”œβ”€β”€ loader.py # Configuration loading
β”‚ β”œβ”€β”€ schemas.py # Configuration schemas
β”‚ β”œβ”€β”€ validator.py # Configuration validation
β”‚ └── __init__.py # Package initialization
β”œβ”€β”€ constant/ # System constants
β”‚ β”œβ”€β”€ monitoring.py # Monitoring constants
β”‚ β”œβ”€β”€ runtime.py # Runtime constants
β”‚ β”œβ”€β”€ search.py # Search constants
β”‚ β”œβ”€β”€ system.py # System constants
β”‚ └── __init__.py # Package initialization
β”œβ”€β”€ core/ # Core domain models
β”‚ β”œβ”€β”€ auth.py # Authentication
β”‚ β”œβ”€β”€ enums.py # System enumerations
β”‚ β”œβ”€β”€ exceptions.py # Custom exceptions
β”‚ β”œβ”€β”€ metrics.py # Performance metrics
β”‚ β”œβ”€β”€ models.py # Core data models & task definitions
β”‚ β”œβ”€β”€ types.py # Core type definitions
β”‚ └── __init__.py # Package initialization
β”œβ”€β”€ examples/ # Configuration examples
β”‚ β”œβ”€β”€ config-full.yaml # Complete configuration template
β”‚ └── config-simple.yaml # Basic configuration template
β”œβ”€β”€ manager/ # Task and resource management
β”‚ β”œβ”€β”€ base.py # Base management classes
β”‚ β”œβ”€β”€ pipeline.py # Pipeline management
β”‚ β”œβ”€β”€ queue.py # Queue management
β”‚ β”œβ”€β”€ shutdown.py # Shutdown coordination
β”‚ β”œβ”€β”€ status.py # Status management
β”‚ β”œβ”€β”€ task.py # Task management
β”‚ β”œβ”€β”€ worker.py # Worker thread management
β”‚ └── __init__.py # Package initialization
β”œβ”€β”€ refine/ # Query optimization
β”‚ β”œβ”€β”€ config.py # Refine configuration
β”‚ β”œβ”€β”€ engine.py # Optimization engine
β”‚ β”œβ”€β”€ generator.py # Query generation
β”‚ β”œβ”€β”€ optimizer.py # Query optimization
β”‚ β”œβ”€β”€ parser.py # Query parsing
β”‚ β”œβ”€β”€ segment.py # Pattern segmentation
β”‚ β”œβ”€β”€ splittability.py # Splittability analysis
β”‚ β”œβ”€β”€ strategies.py # Optimization strategies
β”‚ β”œβ”€β”€ types.py # Refine type definitions
β”‚ └── __init__.py # Package initialization
β”œβ”€β”€ search/ # Search engines
β”‚ β”œβ”€β”€ client.py # Search client
β”‚ β”œβ”€β”€ provider/ # Provider implementations
β”‚ β”‚ β”œβ”€β”€ anthropic.py # Anthropic provider
β”‚ β”‚ β”œβ”€β”€ azure.py # Azure OpenAI provider
β”‚ β”‚ β”œβ”€β”€ base.py # Base provider class
β”‚ β”‚ β”œβ”€β”€ bedrock.py # AWS Bedrock provider
β”‚ β”‚ β”œβ”€β”€ doubao.py # ByteDance Doubao provider
β”‚ β”‚ β”œβ”€β”€ gemini.py # Google Gemini provider
β”‚ β”‚ β”œβ”€β”€ gooeyai.py # GooeyAI provider
β”‚ β”‚ β”œβ”€β”€ openai.py # OpenAI provider
β”‚ β”‚ β”œβ”€β”€ openai_like.py # OpenAI-compatible provider
β”‚ β”‚ β”œβ”€β”€ qianfan.py # Baidu Qianfan provider
β”‚ β”‚ β”œβ”€β”€ registry.py # Provider registry
β”‚ β”‚ β”œβ”€β”€ stabilityai.py # Stability AI provider
β”‚ β”‚ β”œβ”€β”€ vertex.py # Google Vertex AI provider
β”‚ β”‚ └── __init__.py # Package initialization
β”‚ └── __init__.py # Package initialization
β”œβ”€β”€ stage/ # Pipeline stages
β”‚ β”œβ”€β”€ base.py # Base stage classes
β”‚ β”œβ”€β”€ definition.py # Stage implementations
β”‚ β”œβ”€β”€ factory.py # Stage factory
β”‚ β”œβ”€β”€ registry.py # Stage registry
β”‚ β”œβ”€β”€ resolver.py # Dependency resolver
β”‚ └── __init__.py # Package initialization
β”œβ”€β”€ state/ # State management
β”‚ β”œβ”€β”€ builder.py # Status builder
β”‚ β”œβ”€β”€ collector.py # State collection
β”‚ β”œβ”€β”€ display.py # Display engine
β”‚ β”œβ”€β”€ enums.py # State enumerations
β”‚ β”œβ”€β”€ models.py # State data models
β”‚ β”œβ”€β”€ monitor.py # State monitoring
β”‚ β”œβ”€β”€ types.py # State type definitions
β”‚ └── __init__.py # Package initialization
β”œβ”€β”€ storage/ # Storage and persistence
β”‚ β”œβ”€β”€ atomic.py # Atomic file operations
β”‚ β”œβ”€β”€ persistence.py # Result persistence
β”‚ β”œβ”€β”€ recovery.py # Task recovery
β”‚ β”œβ”€β”€ shard.py # NDJSON shard management
β”‚ β”œβ”€β”€ snapshot.py # Snapshot management
β”‚ └── __init__.py # Package initialization
β”œβ”€β”€ tools/ # Tools and utilities
β”‚ β”œβ”€β”€ agent.py # User agent management
β”‚ β”œβ”€β”€ balancer.py # Load balancing
β”‚ β”œβ”€β”€ coordinator.py # Resource coordination
β”‚ β”œβ”€β”€ credential.py # Credential management
β”‚ β”œβ”€β”€ logger.py # Logging system
β”‚ β”œβ”€β”€ patterns.py # Pattern matching utilities
β”‚ β”œβ”€β”€ ratelimit.py # Rate limiting
β”‚ β”œβ”€β”€ resources.py # Resource pooling
β”‚ β”œβ”€β”€ retry.py # Retry framework
β”‚ β”œβ”€β”€ utils.py # General utilities
β”‚ └── __init__.py # Package initialization
β”œβ”€β”€ .dockerignore # Docker ignore rules
β”œβ”€β”€ .gitignore # Git ignore rules
β”œβ”€β”€ Dockerfile # Docker container configuration
β”œβ”€β”€ entrypoint.sh # Docker entrypoint script
β”œβ”€β”€ LICENSE # License file
β”œβ”€β”€ main.py # Entry point and application core
β”œβ”€β”€ README.md # English documentation
β”œβ”€β”€ README.zh-CN.md # Chinese documentation
β”œβ”€β”€ requirements.txt # Python dependencies
└── __init__.py # Root package initialization
```

## Advanced Features

1. **Real-time Monitoring**
- Task status tracking
- Performance metrics collection
- Resource usage monitoring
- Alert system

2. **Configuration Flexibility**
- Multi-provider configuration
- Custom search patterns
- Adjustable performance parameters
- Dynamic resource allocation

3. **Extensibility**
- Plugin-style providers
- Custom pipeline stages
- Configurable monitoring system
- Flexible recovery strategies

## Troubleshooting

### **Common Issues**

#### **1. Installation Problems**
```bash
# Issue: pip install fails
# Solution: Upgrade pip and use virtual environment
python -m pip install --upgrade pip
python -m venv venv

# Linux/macOS
source venv/bin/activate

# Windows
venv\Scripts\activate

pip install -r requirements.txt
```

#### **2. Configuration Errors**
```bash
# Issue: Configuration validation fails
# Solution: Validate configuration file
python main.py --validate

# Issue: Missing configuration file
# Solution: Create from example
cp examples/config-simple.yaml config.yaml
```

#### **3. Rate Limiting Issues**
```bash
# Issue: Too many API requests
# Solution: Adjust rate limits in config
rate_limits:
github_api:
base_rate: 0.1 # Reduce rate
adaptive: true # Enable adaptive limiting
```

#### **4. Memory Issues**
```bash
# Issue: High memory usage
# Solution: Reduce batch sizes and thread counts
pipeline:
threads:
search: 1
gather: 2 # Reduce from default
persistence:
batch_size: 25 # Reduce from default 50
```

#### **5. Network Connectivity**
```bash
# Issue: Connection timeouts
# Solution: Increase timeout values
api:
timeout: 60 # Increase from default 30
retries: 5 # Increase retry attempts
```

### **Debug Mode**
```bash
# Enable debug logging
python main.py --log-level DEBUG

# Save debug output to file
python main.py --log-level DEBUG > debug.log 2>&1
```

## Security Considerations

### **Credential Management**
- **Never commit credentials** to version control
- **Use environment variables** for sensitive configuration
- **Rotate credentials regularly** to minimize exposure risk
- **Implement least privilege** access for API keys

### **Data Protection**
```yaml
# Example: Secure credential configuration
global:
github_credentials:
sessions:
- "${GITHUB_SESSION_1}" # Use environment variables
- "${GITHUB_SESSION_2}"
tokens:
- "${GITHUB_TOKEN_1}"
```

### **Privacy Considerations**
- **Respect robots.txt** and website terms of service
- **Implement rate limiting** to avoid overwhelming target services
- **Log redaction** automatically removes sensitive data from logs
- **Data retention policies** should comply with applicable regulations

### **Compliance Guidelines**
- **Review legal requirements** before using in production
- **Obtain necessary permissions** for data collection
- **Implement data anonymization** where required
- **Document data processing** activities for compliance

## Important Notes

1. **Limitations**
- Respect Github API usage limits
- Configure rate limits appropriately
- Mind memory usage
- Handle sensitive data carefully

2. **Best Practices**
- Use appropriate thread counts
- Backup results regularly
- Monitor error rates
- Handle alerts promptly

## TODO & Roadmap

### πŸ—οΈ Core Architecture Improvements

#### Data Source Abstraction
- [ ] **Abstract Data Source Interface**: Create a unified interface for all data sources
- [ ] Define `DataSourceProvider` base class with standard methods (`search`, `gather`, `validate`)
- [ ] Implement adapter pattern for different API formats (REST, GraphQL, WebSocket)
- [ ] Add configuration schema for data source registration
- [ ] Support dynamic data source loading and hot-swapping

#### Stage System Enhancement
- [ ] **Flexible Stage Definition**: Move beyond the current 4-stage limitation
- [ ] Create `StageDefinition` configuration format (YAML/JSON)
- [ ] Implement dynamic stage loading from configuration files
- [ ] Add stage composition and conditional execution
- [ ] Support user-defined stage workflows and DAG customization

#### Handler/Processor Registration System
- [ ] **Pluggable Processing Architecture**: Replace fixed function calls with configurable handlers
- [ ] Implement `HandlerRegistry` for stage-specific processors
- [ ] Create `ProcessorInterface` with standardized input/output contracts
- [ ] Add handler discovery mechanism (annotation-based or configuration-driven)
- [ ] Support middleware chains for request/response processing

### 🌐 Data Source Integrations

#### Network Mapping Platforms
- [ ] **FOFA Integration**
- [ ] Implement FOFA API client with authentication
- [ ] Add FOFA-specific query optimization

- [ ] **Shodan Integration**
- [ ] Support data querying and extraction from Shodan

#### Generic Web Sources
- [ ] **Universal Web Scraper**
- [ ] Build configurable web scraping engine
- [ ] Add support for JavaScript-rendered content (Selenium/Playwright)
- [ ] Implement anti-bot detection bypass mechanisms
- [ ] Create content extraction rule engine

### πŸ”§ Framework Enhancements

#### Configuration & Extensibility
- [ ] **Plugin System**
- [ ] Design plugin architecture with lifecycle management
- [ ] Create plugin marketplace and discovery mechanism
- [ ] Add plugin sandboxing and security validation
- [ ] Implement plugin dependency resolution

#### Performance & Scalability
- [ ] **Distributed Processing**
- [ ] Add support for distributed task execution (Celery/RQ)
- [ ] Implement horizontal scaling with load balancing
- [ ] Create cluster management and node discovery
- [ ] Add distributed state synchronization

#### Security
- [ ] **Enhanced Security Features**
- [ ] Implement credential encryption and secure storage
- [ ] Create rate limiting policies per data source

### πŸ“Š Monitoring & Analytics

#### Advanced Monitoring
- [ ] **Real-time Analytics Dashboard**
- [ ] Build web-based monitoring interface
- [ ] Add real-time metrics visualization
- [ ] Implement alerting and notification system
- [ ] Create performance profiling and bottleneck analysis

### πŸš€ Advanced Features

#### API & Integration
- [ ] **RESTful API Server**
- [ ] Build comprehensive REST API for external integration
- [ ] Implement webhook support for real-time notifications
- [ ] Create SDK libraries for popular programming languages

## Contributing

Contributions are welcome! Before submitting a pull request, please ensure:

1. Tests are updated
2. Code follows style guidelines
3. Documentation is added where necessary
4. All tests pass

### Priority Areas for Contributors

- πŸ”₯ **High Priority**: Data source abstraction and FOFA/Shodan integration
- πŸ”₯ **High Priority**: Stage system flexibility and handler registration
- πŸ”₯ **High Priority**: Plugin architecture and extensibility framework
- πŸ”₯ **Medium Priority**: Performance optimization and distributed processing
- πŸ”₯ **Medium Priority**: Web-based monitoring dashboard

## License

This project is licensed under the Creative Commons Attribution-NonCommercial 4.0 International License (CC BY-NC 4.0). See the [LICENSE](LICENSE) file for details.

## Disclaimer

**⚠️ IMPORTANT NOTICE**

This project is developed **solely for educational and technical research purposes**. Users should exercise caution and responsibility when using this software.

**Key Points:**
- This software is intended for learning, research, and educational use only
- Users must comply with all applicable laws and regulations in their jurisdiction
- Users are responsible for ensuring their usage complies with the terms of service of any third-party platforms or APIs
- **The project authors do not recommend, encourage, or endorse the use of this software for illegally obtaining others' API keys or credentials**
- The project authors assume **no responsibility** for any disputes, legal issues, or damages arising from the use of this software
- Commercial use is strictly prohibited without explicit written permission
- Users should respect the intellectual property rights and privacy of others

**By using this software, you acknowledge that you have read, understood, and agree to these terms. Use at your own risk.**

## Contact

For questions or other inquiries during usage, please contact the project maintainers through GitHub Issues.