{"id":28546180,"url":"https://github.com/abitofhelp/multistage_pipeline_fanout","last_synced_at":"2025-10-11T18:07:30.259Z","repository":{"id":297602643,"uuid":"997303300","full_name":"abitofhelp/multistage_pipeline_fanout","owner":"abitofhelp","description":"A high-performance data processing pipeline implementation in Go that provides efficient file processing with parallel compression and encryption.","archived":false,"fork":false,"pushed_at":"2025-06-07T09:15:43.000Z","size":3112,"stargazers_count":1,"open_issues_count":0,"forks_count":0,"subscribers_count":0,"default_branch":"main","last_synced_at":"2025-07-07T06:42:59.873Z","etag":null,"topics":["checksum","compression","concurrent","encryption","go","golang","parallel","sha256"],"latest_commit_sha":null,"homepage":"","language":"Go","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/abitofhelp.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null,"zenodo":null}},"created_at":"2025-06-06T09:49:32.000Z","updated_at":"2025-06-07T09:15:46.000Z","dependencies_parsed_at":"2025-06-06T11:32:36.526Z","dependency_job_id":null,"html_url":"https://github.com/abitofhelp/multistage_pipeline_fanout","commit_stats":null,"previous_names":["abitofhelp/multistage_pipeline_fanout"],"tags_count":0,"template":false,"template_full_name":null,"purl":"pkg:github/abitofhelp/multistage_pipeline_fanout","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/abitofhelp%2Fmultistage_pipeline_fanout","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/abitofhelp%2Fmultistage_pipeline_fanou
t/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/abitofhelp%2Fmultistage_pipeline_fanout/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/abitofhelp%2Fmultistage_pipeline_fanout/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/abitofhelp","download_url":"https://codeload.github.com/abitofhelp/multistage_pipeline_fanout/tar.gz/refs/heads/main","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/abitofhelp%2Fmultistage_pipeline_fanout/sbom","scorecard":null,"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":279008296,"owners_count":26084427,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","status":"online","status_checked_at":"2025-10-11T02:00:06.511Z","response_time":55,"last_error":null,"robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":true,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["checksum","compression","concurrent","encryption","go","golang","parallel","sha256"],"created_at":"2025-06-09T23:09:04.013Z","updated_at":"2025-10-11T18:07:30.224Z","avatar_url":"https://github.com/abitofhelp.png","language":"Go","funding_links":[],"categories":[],"sub_categories":[],"readme":"# multistage_pipeline_fanout\n\n[![Go Version](https://img.shields.io/badge/Go-1.24-blue.svg)](https://golang.org/doc/go1.24)\n[![License](https://img.shields.io/badge/License-MIT-green.svg)](LICENSE)\n[![Test 
Coverage](https://img.shields.io/badge/Coverage-59.1%25-yellow.svg)](coverage.html)\n\nA high-performance data processing pipeline implementation in Go that provides efficient file processing with parallel compression and encryption.\n\n## Overview\n\nmultistage_pipeline_fanout implements a multi-stage processing pipeline with concurrent execution of compression and encryption operations for improved performance. The pipeline:\n\n- Reads data from an input file in configurable chunks\n- Compresses the data using parallel workers\n- Encrypts the compressed data using parallel workers\n- Writes the processed data to an output file\n- Calculates SHA256 checksums on the fly for the input and output files to ensure data integrity\n- Collects detailed statistics about the processing\n\nThe application is designed with a focus on performance, reliability, and proper resource management, including graceful shutdown handling.\n\n## Project Structure\n\nThe project is organized into two main package hierarchies:\n\n### `/pkg` - Core Functionality\n\nThe `/pkg` directory contains packages that implement core, reusable functionality:\n\n- `/pkg/compression` - Provides core compression algorithms and utilities\n- `/pkg/dataprocessor` - Provides generic data processing with context awareness\n- `/pkg/encryption` - Provides core encryption algorithms and utilities\n- `/pkg/errors` - Custom error types and error handling utilities\n- `/pkg/logger` - Logging utilities\n- `/pkg/stats` - Statistics collection and reporting\n- `/pkg/utils` - General utility functions\n\n### `/pkg/pipeline` - Pipeline Integration\n\nThe `/pkg/pipeline` directory contains packages that integrate the core functionality into a processing pipeline:\n\n- `/pkg/pipeline/compressor` - Pipeline stage that uses the core compression functionality\n- `/pkg/pipeline/encryptor` - Pipeline stage that uses the core encryption functionality\n- `/pkg/pipeline/processor` - Generic pipeline stage processor\n- 
`/pkg/pipeline/reader` - Pipeline stage for reading input data\n- `/pkg/pipeline/writer` - Pipeline stage for writing output data\n- `/pkg/pipeline/options` - Configuration options for the pipeline\n\n## Why Similar Package Names Are Not Redundant\n\nThe packages in `/pkg` and `/pkg/pipeline` with similar names (e.g., `compression` vs `compressor`, `encryption` vs `encryptor`) serve different purposes and are not redundant:\n\n1. **Core Packages (`/pkg`)**: \n   - Implement the fundamental algorithms and utilities\n   - Are context-aware but not pipeline-specific\n   - Can be used independently outside the pipeline\n   - Focus on the core functionality (compression, encryption, etc.)\n\n2. **Pipeline Packages (`/pkg/pipeline`)**: \n   - Integrate the core functionality into the pipeline architecture\n   - Handle pipeline-specific concerns like channel communication\n   - Manage concurrency, error handling, and statistics within the pipeline\n   - Act as adapters between the core functionality and the pipeline framework\n\nThis separation allows for:\n- Better code organization and maintainability\n- Reuse of core functionality in different contexts\n- Independent testing of core algorithms and pipeline integration\n- Clearer separation of concerns\n\n## Usage\n\n### Prerequisites\n\n- Go 1.24 or later\n- Make\n\n### Building and Running\n\nThe project includes a comprehensive Makefile that provides various commands for building, testing, and running the application.\n\n#### Basic Commands\n\n```bash\n# Build the application\nmake build\n\n# Run the application with default input and output files\nmake run\n\n# Run all tests\nmake test\n\n# Run short tests (faster)\nmake test-short\n\n# Run unit tests only (excluding integration tests)\nmake test-unit\n\n# Run integration tests\nmake test-integration\n\n# Run tests for a specific package\nmake test-package PKG=./pkg/compression\n\n# Run tests with race detection\nmake test-race\n\n# Run tests with coverage 
analysis\nmake coverage\n\n# Clean build artifacts\nmake clean\n\n# Format code\nmake fmt\n\n# Run linter\nmake lint\n\n# Run security check\nmake sec\n\n# Run vulnerability scanning\nmake vuln\n\n# Check Go version compatibility\nmake check-go-version\n\n# Generate CHANGELOG\nmake changelog\n\n# Docker operations\nmake docker-build  # Build Docker image\nmake docker-run    # Run in Docker container\nmake docker        # Build and run in Docker\n\n# Generate and serve documentation\nmake doc\n```\n\n#### Advanced Commands\n\n```bash\n# Install dependencies including linting and security tools\nmake deps\n\n# Update dependencies\nmake update-deps\n\n# Run tests with a specific tag\nmake test-tag TAG=unit\n\n# Run tests for CI environments\nmake test-ci\n\n# Generate mocks for testing\nmake mocks\n\n# Install the binary to GOPATH/bin\nmake install\n\n# Run vulnerability scanning\nmake vuln\n\n# Check Go version compatibility\nmake check-go-version\n\n# Generate CHANGELOG\nmake changelog\n\n# Build Docker image\nmake docker-build\n\n# Run in Docker container\nmake docker-run\n\n# Show all available commands\nmake help\n```\n\n### Running the Application Directly\n\nAfter building, you can run the application directly:\n\n```bash\n./build/multistage_pipeline_fanout \u003cinput_file_path\u003e \u003coutput_file_path\u003e\n```\n\nThe application requires two command-line arguments:\n1. `input_file_path`: Path to the file to be processed\n2. 
`output_file_path`: Path where the processed data will be written\n\nThe application processes the input file through a pipeline that includes compression and encryption, then writes the result to the output file.\n\n### Running with Docker\n\nThe project includes Docker support for containerized execution:\n\n```bash\n# Build the Docker image\nmake docker-build\n\n# Run the application in a Docker container\nmake docker-run\n\n# Or do both in one command\nmake docker\n```\n\nYou can also use Docker commands directly:\n\n```bash\n# Build the image\ndocker build -t multistage_pipeline_fanout:latest .\n\n# Run the container\ndocker run --rm -v $(pwd)/input_file.txt:/app/input.txt -v $(pwd):/app/output multistage_pipeline_fanout:latest input.txt /app/output/output.bin\n```\n\n### Configuration\n\nThe pipeline behavior can be configured through the `options.DefaultPipelineOptions()` function in the `pkg/pipeline/options/options.go` file. Key configurable parameters include:\n\n- `ChunkSize`: Size of data chunks read from the input file (default: 32KB)\n- `CompressorCount`: Number of parallel compression workers (default: 4)\n- `EncryptorCount`: Number of parallel encryption workers (default: 4)\n- `ChannelBufferSize`: Size of the channel buffers between pipeline stages (default: 16)\n\nThese options can be modified programmatically if you're using the pipeline as a library.\n\n### Project Architecture\n\nThe application follows a modular architecture with clear separation of concerns:\n\n1. **Entry Point**: The application entry point is in `cmd/main.go`, which parses command-line arguments and initializes the pipeline.\n\n2. **Pipeline**: The core pipeline implementation in `pkg/pipeline/pipeline.go` orchestrates the data flow through multiple stages.\n\n3. 
**Pipeline Stages**:\n   - Reader (`pkg/pipeline/reader`): Reads data from the input file in chunks\n   - Compressor (`pkg/pipeline/compressor`): Compresses data using the Brotli algorithm\n   - Encryptor (`pkg/pipeline/encryptor`): Encrypts data using AES-GCM\n   - Writer (`pkg/pipeline/writer`): Writes processed data to the output file\n\n4. **Core Functionality**: Implemented in separate packages for reusability:\n   - Compression (`pkg/compression`): Compression algorithms and utilities\n   - Encryption (`pkg/encryption`): Encryption algorithms and utilities\n   - Statistics (`pkg/stats`): Collection and reporting of processing statistics\n   - Logging (`pkg/logger`): Structured logging using Zap\n\n### Development Workflow\n\nA typical development workflow might look like:\n\n1. Make changes to the code\n2. Format the code: `make fmt`\n3. Run the linter: `make lint`\n4. Run tests: `make test`\n5. Build the application: `make build`\n6. Run the application: `make run` or `./build/multistage_pipeline_fanout \u003cinput\u003e \u003coutput\u003e`\n\n## Performance\n\nmultistage_pipeline_fanout is designed for high performance with parallel processing:\n\n- Multiple compression workers process chunks concurrently\n- Multiple encryption workers process compressed chunks concurrently\n- Buffered channels prevent pipeline stalls\n- Efficient memory management with controlled chunk sizes\n- Optimized Brotli compression\n\nPerformance can be tuned by adjusting the configuration parameters in `options.DefaultPipelineOptions()`.\n\n## Examples\n\n### Basic File Processing\n\n```bash\n# Build the application\nmake build\n\n# Process a 10MB JSON Lines file\n./build/multistage_pipeline_fanout input_10mb.jsonl output.bin\n\n# Processing statistics are displayed automatically after completion\n# No need to run additional commands to view statistics\n```\n\n### Using as a Library\n\n```go\npackage main\n\nimport (\n    \"context\"\n\n    
\"github.com/abitofhelp/multistage_pipeline_fanout/pkg/logger\"\n    \"github.com/abitofhelp/multistage_pipeline_fanout/pkg/pipeline\"\n)\n\nfunc main() {\n    // Initialize logger\n    log := logger.InitLogger()\n    defer func() { logger.SafeSync(log) }()\n\n    // Create context\n    ctx := context.Background()\n\n    // Process file\n    stats, err := pipeline.ProcessFile(ctx, log, \"input.txt\", \"output.bin\")\n    if err != nil {\n        log.Fatal(\"Failed to process file\", err)\n    }\n\n    // Use stats as needed\n    log.Info(\"Processing complete\", \n        \"inputBytes\", stats.InputBytes.Load(),\n        \"outputBytes\", stats.OutputBytes.Load(),\n        \"compressionRatio\", float64(stats.InputBytes.Load())/float64(stats.OutputBytes.Load()),\n    )\n}\n```\n\n## Error Handling and Reliability\n\nmultistage_pipeline_fanout implements robust error handling and reliability features:\n\n### Custom Error Types\n\nThe application uses a custom error handling system (`pkg/errors`) that provides:\n\n- Categorized error types (I/O errors, timeout errors, cancellation errors, etc.)\n- Rich error context including stage, operation, time, and data size\n- Error aggregation for collecting multiple errors\n- Helper functions for error type checking\n\n### Signal Handling\n\nThe application implements advanced signal handling for graceful shutdown:\n\n- Handles SIGINT, SIGTERM, SIGHUP, and SIGQUIT\n- Implements a two-phase shutdown (graceful on first signal, forced on second)\n- Includes a 30-second timeout for graceful shutdown\n- Properly cleans up resources during shutdown\n\n### Context Propagation\n\nAll operations are context-aware, allowing for:\n\n- Cancellation propagation throughout the pipeline\n- Timeout handling at all stages\n- Proper resource cleanup on cancellation\n\n## Troubleshooting\n\n### Common Issues\n\n1. **\"Failed to open input file\"**: Ensure the input file exists and has proper read permissions.\n\n2. 
**\"Failed to create output file\"**: Ensure the output directory exists and has proper write permissions.\n\n3. **\"Context deadline exceeded\"**: The processing took longer than the context timeout. For large files, consider using a context with a longer timeout.\n\n4. **\"Out of memory\"**: If processing very large files, try reducing the chunk size in the options to lower memory usage.\n\n5. **\"Pipeline stage blocked\"**: A pipeline stage is waiting too long to send data to the next stage. This could indicate a bottleneck in the pipeline. Try adjusting the number of workers or chunk size.\n\n6. **\"Operation canceled\"**: The processing was canceled, either by a signal (Ctrl+C) or programmatically. This is a normal part of graceful shutdown.\n\n### Performance Issues\n\nIf you're experiencing performance issues:\n\n1. Try adjusting the number of compressor and encryptor workers based on your system's capabilities\n2. Experiment with different chunk sizes\n3. Ensure your storage devices have sufficient I/O performance\n4. 
Run with `GODEBUG=gctrace=1` to monitor garbage collection overhead\n\n## Dependencies\n\nThis project relies on the following key dependencies:\n\n- [github.com/andybalholm/brotli](https://github.com/andybalholm/brotli) - Brotli compression algorithm implementation\n- [github.com/google/tink/go](https://github.com/google/tink/go) - Cryptographic API providing secure implementations of common cryptographic primitives\n- [github.com/dustin/go-humanize](https://github.com/dustin/go-humanize) - Formatters for units to human-friendly sizes\n- [go.uber.org/zap](https://github.com/uber-go/zap) - Structured, leveled logging\n- [github.com/stretchr/testify](https://github.com/stretchr/testify) - Testing toolkit\n\nFor a complete list of dependencies, see the [go.mod](go.mod) file.\n\n## License\n\nThis project is licensed under the MIT License - see the [LICENSE](LICENSE) file for details.\n\nCopyright (c) 2023 A Bit of Help, Inc.\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fabitofhelp%2Fmultistage_pipeline_fanout","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fabitofhelp%2Fmultistage_pipeline_fanout","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fabitofhelp%2Fmultistage_pipeline_fanout/lists"}