{"id":32691373,"url":"https://github.com/0xfave/block-data-fetcher","last_synced_at":"2026-05-12T23:33:56.528Z","repository":{"id":320362839,"uuid":"1081354615","full_name":"0xfave/Block-Data-Fetcher","owner":"0xfave","description":"A minimal ETL (Extract, Transform, Load) pipeline written in Rust for fetching and processing block data from the Solana blockchain.","archived":false,"fork":false,"pushed_at":"2025-10-23T12:14:02.000Z","size":70,"stargazers_count":0,"open_issues_count":0,"forks_count":0,"subscribers_count":0,"default_branch":"main","last_synced_at":"2025-10-23T12:14:13.563Z","etag":null,"topics":["data-engineering","data-extraction","data-pipeline","data-pipeline-building","rpc","rust","solana","web3"],"latest_commit_sha":null,"homepage":"","language":"Rust","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/0xfave.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":".github/FUNDING.yml","license":"LICENSE.md","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null,"zenodo":null,"notice":null,"maintainers":null,"copyright":null,"agents":null,"dco":null,"cla":null},"funding":{"custom":"https://3cities.xyz/#/pay?c=CAESFAKY9DMuOFdjE4Wzl2YyUFipPiSfIgICATICCAJaFURvbmF0aW9uIHRvIFBhdWwgQmVyZw","github":"PaulRBerg"}},"created_at":"2025-10-22T17:02:41.000Z","updated_at":"2025-10-23T12:14:06.000Z","dependencies_parsed_at":"2025-10-23T12:24:39.097Z","dependency_job_id":null,"html_url":"https://github.com/0xfave/Block-Data-Fetcher","commit_stats":null,"previous_names":["0xfave/block-data-fetcher"],"tags_count":null,"template":false,"template_full_name":"PaulRBerg/rust-template","purl":"pkg:github/0xfave/Block-Data-Fetcher","reposito
ry_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/0xfave%2FBlock-Data-Fetcher","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/0xfave%2FBlock-Data-Fetcher/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/0xfave%2FBlock-Data-Fetcher/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/0xfave%2FBlock-Data-Fetcher/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/0xfave","download_url":"https://codeload.github.com/0xfave/Block-Data-Fetcher/tar.gz/refs/heads/main","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/0xfave%2FBlock-Data-Fetcher/sbom","scorecard":null,"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":282158203,"owners_count":26623961,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","status":"online","status_checked_at":"2025-11-01T02:00:06.759Z","response_time":61,"last_error":null,"robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":true,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["data-engineering","data-extraction","data-pipeline","data-pipeline-building","rpc","rust","solana","web3"],"created_at":"2025-11-01T15:01:04.688Z","updated_at":"2025-11-01T15:02:46.863Z","avatar_url":"https://github.com/0xfave.png","language":"Rust","funding_links":["https://3cities.xyz/#/pay?c=CAESFAKY9DMuOFdjE4Wzl2YyUFipPiSfIgICATICCAJaFURvbmF0aW9uIHRvIFBhdWwgQmVyZw","https://github.com/sponsors/PaulRBerg"],"categories":[],"sub_categorie
s":[],"readme":"# Solana Block Data Fetcher 🚀\n\nA robust ETL (Extract, Transform, Load) pipeline for fetching Solana blockchain data, classifying transactions, and storing them in PostgreSQL. Built with Rust for performance and reliability.\n\n## 🌟 Features\n\n- **Complete ETL Pipeline**: Extract blocks from Solana RPC → Transform/classify transactions → Load into PostgreSQL\n- **Transaction Classification**: Automatically identifies SOL transfers, SPL token transfers, DEX swaps, NFT operations, and program interactions\n- **Robust Error Handling**: Retry logic with exponential backoff for network failures\n- **Batch Processing**: Efficient batch insertion with database transactions for atomicity\n- **Configurable CLI**: Full command-line interface for all parameters\n- **Real-time Statistics**: Progress tracking, throughput metrics, and success rates\n- **Idempotent Operations**: UPSERT logic allows re-processing blocks without duplicates\n\n## 📋 Prerequisites\n\n- Rust 1.70+ (2021 edition)\n- PostgreSQL 14+\n- Helius RPC API key (or any Solana RPC endpoint)\n\n## 🚀 Quick Start\n\n### 1. Setup Database\n\n```bash\n# Create database and user\nsudo -u postgres psql\nCREATE DATABASE solana_block_data;\nCREATE USER solana_user WITH PASSWORD 'solana_pass';\nGRANT ALL PRIVILEGES ON DATABASE solana_block_data TO solana_user;\n\\c solana_block_data\nGRANT ALL ON SCHEMA public TO solana_user;\n\\q\n```\n\n### 2. Configure Environment\n\nCreate a `.env` file:\n\n```env\nHELIUS_RPC_URL=https://mainnet.helius-rpc.com/?api-key=YOUR_API_KEY\nDATABASE_URL=postgresql://solana_user:solana_pass@localhost:5432/solana_block_data\nRUST_LOG=info\n```\n\n### 3. 
Build and Run\n\n```bash\n# Build release version\ncargo build --release\n\n# Run with default settings (10 recent blocks)\n./target/release/block-data-fetcher\n\n# Run with custom parameters\n./target/release/block-data-fetcher --num-blocks 50 --batch-size 25\n```\n\n## 📖 Usage\n\n### Basic Commands\n\n```bash\n# Process 20 recent blocks\n./block-data-fetcher --num-blocks 20\n\n# Process specific slot range\n./block-data-fetcher --start-slot 375000000 --end-slot 375000100\n\n# Use custom RPC endpoint\n./block-data-fetcher --rpc-url https://api.mainnet-beta.solana.com --num-blocks 10\n\n# Continuous mode (keep processing latest blocks)\n./block-data-fetcher --continuous --interval 30\n```\n\n### CLI Options\n\n| Option | Description | Default |\n|--------|-------------|---------|\n| `-s, --start-slot \u003cSLOT\u003e` | Starting slot number | latest - 30 |\n| `-e, --end-slot \u003cSLOT\u003e` | Ending slot number | latest - 20 |\n| `-n, --num-blocks \u003cCOUNT\u003e` | Number of blocks to fetch | - |\n| `-r, --rpc-url \u003cURL\u003e` | RPC endpoint URL | From .env |\n| `-d, --database-url \u003cURL\u003e` | Database connection URL | From .env |\n| `-b, --batch-size \u003cSIZE\u003e` | Batch size for processing | 10 |\n| `--max-retries \u003cCOUNT\u003e` | Maximum retry attempts | 3 |\n| `--retry-delay \u003cSECONDS\u003e` | Retry delay in seconds | 2 |\n| `-c, --continuous` | Enable continuous mode | false |\n| `--interval \u003cSECONDS\u003e` | Poll interval for continuous mode | 10 |\n| `-h, --help` | Print help information | - |\n| `-V, --version` | Print version information | - |\n\n### Complete Examples\n\n#### Backfill Historical Data\n```bash\n./block-data-fetcher \\\n  --start-slot 375000000 \\\n  --end-slot 375010000 \\\n  --batch-size 50 \\\n  --max-retries 5\n```\n\n#### Monitor Latest Blocks\n```bash\n./block-data-fetcher \\\n  --num-blocks 10 \\\n  --continuous \\\n  --interval 20\n```\n\n#### Custom RPC with Performance 
Tuning\n```bash\n./block-data-fetcher \\\n  --rpc-url https://api.mainnet-beta.solana.com \\\n  --num-blocks 100 \\\n  --batch-size 25 \\\n  --max-retries 5 \\\n  --retry-delay 3\n```\n\n#### Process Recent Blocks Only\n```bash\n./block-data-fetcher --num-blocks 5\n```\n\n#### Override Database Connection\n```bash\n./block-data-fetcher \\\n  --database-url postgresql://user:pass@localhost:5432/mydb \\\n  --num-blocks 20\n```\n\n### Environment Variables\n\nThe following environment variables are used when CLI arguments are not provided:\n\n- `HELIUS_RPC_URL` - Solana RPC endpoint\n- `DATABASE_URL` - PostgreSQL connection string\n- `RUST_LOG` - Logging level (info, debug, trace)\n\nThese are loaded from a `.env` file if present.\n\n## 🏗️ Architecture\n\n### High-Level System Overview\n\n```mermaid\ngraph TB\n    subgraph \"External Services\"\n        A[Solana RPC\u003cbr/\u003eHelius API]\n        B[(PostgreSQL\u003cbr/\u003eDatabase)]\n    end\n    \n    subgraph \"CLI Layer\"\n        C[Command Line\u003cbr/\u003eInterface]\n    end\n    \n    subgraph \"Pipeline Orchestration\"\n        D[Pipeline\u003cbr/\u003eCoordinator]\n        E[Statistics\u003cbr/\u003eTracker]\n        F[Error Handler\u003cbr/\u003e\u0026 Retry Logic]\n    end\n    \n    subgraph \"ETL Pipeline\"\n        G[Extract\u003cbr/\u003eRPC Client]\n        H[Transform\u003cbr/\u003eClassifier]\n        I[Load\u003cbr/\u003eDatabase Writer]\n    end\n    \n    C --\u003e|Config| D\n    D --\u003e|Fetch Blocks| G\n    G --\u003e|Block Data| A\n    A --\u003e|Response| G\n    G --\u003e|Raw Data| H\n    H --\u003e|Classified| I\n    I --\u003e|Batch Insert| B\n    D --\u003e|Monitor| E\n    D --\u003e|Handle Errors| F\n    F --\u003e|Retry| G\n    E --\u003e|Report| C\n    \n    style A fill:#e1f5ff\n    style B fill:#e1f5ff\n    style D fill:#fff4e1\n    style G fill:#e8f5e9\n    style H fill:#e8f5e9\n    style I fill:#e8f5e9\n```\n\n### Data Flow Sequence\n\n```mermaid\nsequenceDiagram\n    
participant CLI\n    participant Pipeline\n    participant RPC as RPC Client\n    participant Helius as Helius API\n    participant Classifier\n    participant DB as PostgreSQL\n    \n    CLI-\u003e\u003ePipeline: Start (slot range)\n    \n    loop For each block\n        Pipeline-\u003e\u003eRPC: fetch_block(slot)\n        RPC-\u003e\u003eHelius: getBlock(slot, maxSupportedVersion)\n        Helius--\u003e\u003eRPC: Block + Transactions\n        RPC--\u003e\u003ePipeline: RawBlockData\n        \n        Pipeline-\u003e\u003eClassifier: classify_transactions(block)\n        \n        loop For each transaction\n            Classifier-\u003e\u003eClassifier: analyze_instructions()\n            Classifier-\u003e\u003eClassifier: determine_type()\n        end\n        \n        Classifier--\u003e\u003ePipeline: ClassifiedData\n        \n        Pipeline-\u003e\u003eDB: batch_insert(classified_data)\n        \n        alt Insert Success\n            DB--\u003e\u003ePipeline: OK (rows inserted)\n            Pipeline-\u003e\u003ePipeline: update_stats(success)\n        else Insert Failure\n            DB--\u003e\u003ePipeline: Error\n            Pipeline-\u003e\u003ePipeline: retry_logic()\n            Pipeline-\u003e\u003eDB: batch_insert(retry)\n        end\n    end\n    \n    Pipeline--\u003e\u003eCLI: Statistics Report\n```\n\n### Database Schema (ERD)\n\n```mermaid\nerDiagram\n    BLOCKS ||--o{ TRANSACTIONS : contains\n    TRANSACTIONS ||--o{ INSTRUCTIONS : has\n    TRANSACTIONS ||--o{ ACCOUNTS : involves\n    INSTRUCTIONS }o--|| PROGRAM_REGISTRY : references\n    \n    BLOCKS {\n        bigint slot PK\n        text blockhash UK\n        bigint parent_slot\n        timestamptz block_time\n        bigint block_height\n        jsonb raw_data\n        timestamptz created_at\n    }\n    \n    TRANSACTIONS {\n        text signature PK\n        bigint slot FK\n        int index_in_block\n        text fee_payer\n        bigint fee\n        text status\n        text 
transaction_type\n        jsonb raw_data\n        timestamptz created_at\n    }\n    \n    INSTRUCTIONS {\n        bigserial id PK\n        text transaction_signature FK\n        int instruction_index\n        text program_id\n        text instruction_type\n        jsonb data\n        timestamptz created_at\n    }\n    \n    ACCOUNTS {\n        bigserial id PK\n        text transaction_signature FK\n        text pubkey\n        bool is_signer\n        bool is_writable\n        bigint pre_balance\n        bigint post_balance\n        timestamptz created_at\n    }\n    \n    PROGRAM_REGISTRY {\n        text program_id PK\n        text program_name\n        text program_type\n        text description\n        timestamptz created_at\n    }\n```\n\n### Transaction Classification Logic\n\n```mermaid\nflowchart TD\n    A[Transaction] --\u003e B{Has Instructions?}\n    B --\u003e|No| Z[Unknown Type]\n    B --\u003e|Yes| C[Analyze Instructions]\n    \n    C --\u003e D{System Program\u003cbr/\u003eTransfer?}\n    D --\u003e|Yes| E[SOL Transfer]\n    \n    D --\u003e|No| F{Token Program\u003cbr/\u003eTransfer?}\n    F --\u003e|Yes| G[Token Transfer]\n    \n    F --\u003e|No| H{DEX Program?}\n    H --\u003e|Yes| I[DEX Swap]\n    \n    H --\u003e|No| J{NFT Operation?}\n    J --\u003e|Yes| K[NFT Operation]\n    \n    J --\u003e|No| L{Known Program?}\n    L --\u003e|Yes| M[Program Interaction]\n    L --\u003e|No| Z\n    \n    style E fill:#4caf50\n    style G fill:#2196f3\n    style I fill:#ff9800\n    style K fill:#9c27b0\n    style M fill:#00bcd4\n    style Z fill:#9e9e9e\n```\n\n### Pipeline Stages\n\n1. **Extract**: Fetch blocks from Solana RPC with rate limiting and error handling\n2. **Transform**: Classify transactions based on program IDs and instruction data\n3. 
**Load**: Batch insert into PostgreSQL with atomic transactions\n\n### Database Schema\n\nThe system uses 6 tables with proper relationships:\n\n- **`blocks`**: Block metadata (slot, blockhash, timestamp, parent relationships)\n- **`transactions`**: Transaction details with classification labels, linked to blocks\n- **`instructions`**: Individual instruction data, linked to transactions\n- **`accounts`**: Account states (pre/post balances, signer status)\n- **`program_registry`**: Known Solana programs for classification\n- **Indexes**: Optimized for common queries on slots, signatures, and program IDs\n\n### Transaction Classification\n\nAutomatically identifies:\n- 💸 **SOL Transfers**: Native SOL transfers via System Program\n- 🪙 **SPL Token Transfers**: Token transfers via Token Program\n- 🔄 **DEX Swaps**: Interactions with Raydium, Orca, Jupiter, etc.\n- 🖼️ **NFT Operations**: NFT mints and transfers\n- ⚙️ **Program Interactions**: Other program invocations (Drift, Kamino, etc.)\n- ❓ **Unknown**: Unclassified transactions\n\n### Key Design Decisions\n\n1. **Batch Processing**: Process blocks in configurable batches (default: 10) to balance memory vs. throughput\n2. **UPSERT Strategy**: Use PostgreSQL UPSERT to enable idempotent re-processing without duplicates\n3. **Classification at ETL Time**: Pre-compute transaction types for faster queries\n4. **Exponential Backoff**: Handle transient failures (network, rate limits) with configurable retry logic\n5. 
**JSONB Storage**: Structured schema + JSONB for raw data provides queryability + flexibility\n\n## 📊 Performance\n\nTypical performance metrics from real-world testing:\n- **Throughput**: 200-300 transactions/second (including classification)\n- **Batch Processing**: 10 blocks (~12,000 transactions) in ~25 seconds\n- **Success Rate**: 99-100% with retry logic enabled\n- **Database**: 28,597+ transactions processed across 26+ blocks\n- **Idempotency**: UPSERT operations allow safe re-processing without duplicates\n\n## 🛠️ Technology Stack\n\n- **Language**: Rust 2021 edition\n- **Async Runtime**: Tokio for concurrent operations\n- **Database**: PostgreSQL 16 with sqlx 0.8 (compile-time checked queries)\n- **Blockchain**: Solana SDK v2.0 + Helius RPC integration\n- **CLI**: clap v4.5 with derive macros\n- **Error Handling**: anyhow + thiserror for comprehensive error context\n- **Logging**: tracing for structured logging\n\n## 🎓 Key Learnings\n\n### Solana Data Structure\n- JsonParsed encoding simplifies instruction parsing\n- Account keys array determines signer/fee payer roles\n- Program IDs provide reliable classification basis\n\n### Database Design\n- UPSERT enables idempotent operations\n- JSONB storage provides flexibility for evolving schemas\n- Foreign keys ensure referential integrity\n- GIN indexes on JSONB optimize query performance\n\n### Rust Patterns\n- `Arc\u003cT\u003e` for sharing non-Clone types across async tasks\n- Database transactions ensure atomicity\n- sqlx compile-time checking catches SQL errors at build time\n- clap derive macros create elegant, type-safe CLIs\n\n### Error Handling\n- Exponential backoff handles transient network failures\n- Detailed error context tracks failure stages\n- Continue-on-error pattern processes remaining blocks\n\n## 💡 Future Enhancements\n\n### Enhanced Classification\n- More program-specific parsers (MEV, governance, staking)\n- NFT metadata extraction and enrichment\n- DeFi protocol-specific 
analysis\n\n### Data Analysis\n- SQL views for common analytics queries\n- Transaction volume and pattern analytics\n- Fee analysis and trends over time\n\n### API Layer\n- REST API for querying stored data\n- WebSocket for real-time transaction updates\n- GraphQL for flexible data queries\n\n### Visualization\n- Web dashboard for metrics\n- Transaction flow diagrams\n- Volume, fee, and program interaction charts\n\n### Optimization\n- Parallel block fetching with async concurrency\n- True bulk INSERT using PostgreSQL UNNEST\n- In-memory caching for program registry\n\n## 🛠️ Development\n\n### Code Quality\n\n```bash\n# Format code\ncargo fmt --all\n\n# Run linter\ncargo clippy -- --deny warnings\n\n# Run full checks\njust full-check\n\n# Run tests\ncargo test\n```\n\n### Project Structure\n\n```\nsrc/\n├── cli.rs           # Command-line interface\n├── db/              # Database connection and migrations\n├── etl/             # ETL pipeline modules\n│   ├── extract.rs   # Block fetching from RPC\n│   ├── transform.rs # Transaction classification\n│   ├── load.rs      # Database insertion\n│   └── parsers/     # Instruction parsers\n├── models.rs        # Data models\n├── pipeline.rs      # Pipeline orchestration\n└── rpc/             # RPC client wrapper\n\nmigrations/          # Database migrations\ndocs/                # Documentation\n```\n\n## Project Summary\n\n### 🎯 Project Goal\nBuild a production-quality Rust ETL pipeline to extract Solana blockchain data, classify transactions into meaningful categories, and store them in PostgreSQL for analysis and querying.\n\n### ✅ Success Criteria - All Met!\n\n✅ Extract blocks from Solana mainnet via RPC  \n✅ Classify transactions into 6 meaningful categories  \n✅ Store data in structured PostgreSQL schema  \n✅ Handle errors gracefully with retry logic  \n✅ Process arbitrary block ranges efficiently  \n✅ Provide CLI for flexible configuration  \n✅ Track and report real-time statistics  \n✅ Maintain high success 
rate (99%+)  \n✅ Achieve good throughput (200+ txs/sec)  \n✅ Create comprehensive documentation  \n\n### 🏆 What This Project Demonstrates\n\n- **Production ETL Pipelines**: Building robust data pipelines in Rust with proper error handling\n- **Blockchain Data Processing**: Working with Solana's complex transaction structure and RPC APIs\n- **Database Design**: Effective schema design with proper relationships, indexes, and JSONB flexibility\n- **Async Rust**: Leveraging Tokio for concurrent operations and efficient I/O\n- **CLI Design**: Creating user-friendly command-line interfaces with clap\n- **Error Resilience**: Implementing retry logic, exponential backoff, and graceful degradation\n\n**Status**: ✅ **Feature Complete**  \n**Date**: October 23, 2025  \n**Author**: 0xfave\n\n## 🤝 Contributing\n\nThis is a personal side project, but suggestions and improvements are welcome!\n\n## 📝 License\n\nThis project is licensed under MIT.\n\n## 🙏 Acknowledgments\n\n- Built with [Solana SDK](https://github.com/solana-labs/solana)\n- Uses [Helius RPC](https://www.helius.dev/) for enhanced Solana data access\n- Inspired by the need for structured Solana transaction data\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2F0xfave%2Fblock-data-fetcher","html_url":"https://awesome.ecosyste.ms/projects/github.com%2F0xfave%2Fblock-data-fetcher","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2F0xfave%2Fblock-data-fetcher/lists"}