{"id":22025904,"url":"https://github.com/adi3g/collector","last_synced_at":"2026-04-20T13:09:24.016Z","repository":{"id":257307030,"uuid":"850784038","full_name":"Adi3g/collector","owner":"Adi3g","description":"A flexible Python library for collecting, transforming, and unifying data from diverse sources into a standardized format using customizable configurations.","archived":false,"fork":false,"pushed_at":"2024-10-06T21:17:52.000Z","size":109,"stargazers_count":1,"open_issues_count":0,"forks_count":0,"subscribers_count":1,"default_branch":"main","last_synced_at":"2025-03-23T11:13:01.584Z","etag":null,"topics":["api","big-data","database","python","transformer"],"latest_commit_sha":null,"homepage":"","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/Adi3g.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2024-09-01T19:06:48.000Z","updated_at":"2024-10-06T21:17:56.000Z","dependencies_parsed_at":"2024-11-30T07:21:06.006Z","dependency_job_id":"f8ebe251-67bd-41b9-bde0-87977e0b1913","html_url":"https://github.com/Adi3g/collector","commit_stats":null,"previous_names":["adi3g/collector"],"tags_count":0,"template":false,"template_full_name":null,"purl":"pkg:github/Adi3g/collector","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/Adi3g%2Fcollector","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/Adi3g%2Fcollector/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/Adi3g%2Fcollector/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/Adi3g%2Fcollector/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/Adi3g","download_url":"https://codeload.github.com/Adi3g/collector/tar.gz/refs/heads/main","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/Adi3g%2Fcollector/sbom","scorecard":null,"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":286080680,"owners_count":32048483,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2026-04-20T11:35:06.609Z","status":"ssl_error","status_checked_at":"2026-04-20T11:34:48.899Z","response_time":94,"last_error":"SSL_connect returned=1 errno=0 peeraddr=140.82.121.6:443 state=error: unexpected eof while reading","robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":false,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["api","big-data","database","python","transformer"],"created_at":"2024-11-30T07:20:37.980Z","updated_at":"2026-04-20T13:09:24.001Z","avatar_url":"https://github.com/Adi3g.png","language":"Python","funding_links":[],"categories":[],"sub_categories":[],"readme":"# The Collector\n\nCollector is a Python library designed to collect data from various sources such as databases, big data files, cloud storage, APIs, and more, and transform the data into a unified output structure. This flexible and extensible tool allows you to define data collection and transformation rules using a custom configuration file format (`.col`), making data integration tasks streamlined and maintainable.\n\n## Table of Contents\n\n- [Features](#features)\n- [Getting Started](#getting-started)\n- [Configuration File (.col)](#configuration-file-col)\n- [Connectors](#connectors)\n- [Transformations](#transformations)\n- [Output Formats](#output-formats)\n- [Examples](#examples)\n- [Contributing](#contributing)\n- [License](#license)\n\n## Features\n\n- **Multiple Data Sources**: Supports SQL databases, cloud storage (AWS S3, Google Cloud Storage, Azure Blob), CSV files, APIs, JSON, Parquet, and more.\n- **Flexible Transformation Rules**: Apply type conversions, renaming, formatting, and custom transformations.\n- **Unified Output**: Output data in various formats such as CSV, JSON, and Parquet with custom options.\n- **Modular Configuration**: Use `.col` files to define data sources, transformations, and outputs, with support for imports to reuse configurations.\n- **Data Collection Modes**: Choose between **parallel** and **sequential** data collection modes for improved performance.\n- **Extensible Architecture**: Easily add new connectors and transformations to expand functionality.\n\n## Getting Started\n\nFollow these steps to get started with Collector:\n\n1. **Install Dependencies**: Install required dependencies by running:\n   ```bash\n   pip install -r requirements.txt\n   ```\n\n2. **Define a Configuration File (.col)**: Create a `.col` file that specifies your data sources, transformation rules, and output configuration.\n\n3. **Run the Collector**: Use the provided script to run the collector with your configuration file:\n   ```bash\n   python scripts/run_collector.py \u003cyour_col_file.col\u003e\n   ```\n\n## Configuration File (.col)\n\nThe `.col` file is the heart of Collector, allowing you to define how data should be collected, transformed, and output. Below is a basic example of a `.col` file:\n\n```plaintext\nVERSION 1.0\n\n# Optional: Set Collection Mode (default is 'sequence')\nCOLLECT_MODE parallel  # Can be 'parallel' or 'sequence'\n\n# Define Data Sources\nSOURCE sales_db TYPE sql {\n    HOST \"localhost\"\n    PORT 5432\n    USERNAME \"user\"\n    PASSWORD \"pass\"\n    DATABASE \"sales\"\n    QUERY \"SELECT * FROM sales_data\"\n}\n\n# Define Transformations\nTRANSFORM unified_sales FROM sales_db {\n    FIELD sale_date TYPE date FORMAT \"%Y-%m-%d\"\n    FIELD amount TYPE float DEFAULT 0.0\n}\n\n# Define Output\nOUTPUT unified_data TYPE parquet {\n    PATH \"/output/unified_sales.parquet\"\n    OPTIONS {\n        COMPRESSION \"gzip\"\n    }\n}\n\n```\n\n### Collect Mode\n\n- **`parallel`**: Data from all sources is collected concurrently, speeding up the process for large datasets or slower APIs.\n- **`sequence`** (default): Data is collected sequentially, one source at a time.\n\n## Connectors\n\nCollector includes connectors for various data sources:\n\n- **SQL Connector**: Connect to SQL databases like MySQL, PostgreSQL, etc.\n- **CSV Connector**: Read data from CSV files with customizable options.\n- **API Connector**: Fetch data from RESTful APIs using GET, POST, and other methods.\n- **Parquet Connector**: Read data from Parquet files with compression options.\n- **MongoDB Connector**: Fetch data from MongoDB collections.\n- **Cloud Storage Connectors**:\n  - **AWS S3**: Fetch data from Amazon S3 buckets.\n  - **Google Cloud Storage**: Fetch data from Google Cloud Storage buckets.\n  - **Azure Blob Storage**: Fetch data from Azure Blob containers.\n\n## Transformations\n\nDefine transformation rules in your `.col` file to:\n\n- Convert data types (e.g., string to date, int to float).\n- Rename fields.\n- Apply conditional transformations.\n- Set default values.\n\n### Example Transformation\n\n```plaintext\nTRANSFORM unified_sales FROM sales_db {\n    FIELD sale_date TYPE date FORMAT \"%Y-%m-%d\"\n    FIELD amount TYPE float DEFAULT 0.0\n}\n```\n\n## Output Formats\n\nCollector supports various output formats:\n\n- **CSV**: Output data to CSV files with customizable delimiters and headers.\n- **JSON**: Save data as JSON with options for pretty printing.\n- **Parquet**: Export data to Parquet files with optional compression.\n\n## Examples\n\nCheck out the `examples/` directory for sample `.col` files demonstrating different configurations:\n\n- `basic_example.col`: A simple example using SQL and CSV sources.\n- `advanced_example.col`: An advanced configuration with multiple data sources and transformations.\n- `parallel_example.col`: Demonstrates parallel data collection from multiple sources.\n- `shared_sources.col`: Demonstrates importing shared data sources across configurations.\n\n## Contributing\n\nWe welcome contributions to improve Collector! To contribute:\n\n1. Fork the repository.\n2. Create a new branch for your feature or bug fix.\n3. Commit your changes and push to your fork.\n4. Open a pull request with a detailed description of your changes.\n\nPlease ensure that your code follows the project's coding standards and includes appropriate tests.\n\n## License\n\nThis project is licensed under the MIT License - see the [LICENSE](LICENSE) file for details.\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fadi3g%2Fcollector","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fadi3g%2Fcollector","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fadi3g%2Fcollector/lists"}