https://github.com/adi3g/collector
A flexible Python library for collecting, transforming, and unifying data from diverse sources into a standardized format using customizable configurations.
- Host: GitHub
- URL: https://github.com/adi3g/collector
- Owner: Adi3g
- License: mit
- Created: 2024-09-01T19:06:48.000Z (almost 2 years ago)
- Default Branch: main
- Last Pushed: 2024-10-06T21:17:52.000Z (over 1 year ago)
- Last Synced: 2025-03-23T11:13:01.584Z (over 1 year ago)
- Topics: api, big-data, database, python, transformer
- Language: Python
- Homepage:
- Size: 106 KB
- Stars: 1
- Watchers: 1
- Forks: 0
- Open Issues: 0
Metadata Files:
- Readme: README.md
- License: LICENSE
# The Collector
Collector is a Python library designed to collect data from various sources such as databases, big data files, cloud storage, APIs, and more, and transform the data into a unified output structure. This flexible and extensible tool allows you to define data collection and transformation rules using a custom configuration file format (`.col`), making data integration tasks streamlined and maintainable.
## Table of Contents
- [Features](#features)
- [Getting Started](#getting-started)
- [Configuration File (.col)](#configuration-file-col)
- [Connectors](#connectors)
- [Transformations](#transformations)
- [Output Formats](#output-formats)
- [Examples](#examples)
- [Contributing](#contributing)
- [License](#license)
## Features
- **Multiple Data Sources**: Supports SQL databases, cloud storage (AWS S3, Google Cloud Storage, Azure Blob), CSV files, APIs, JSON, Parquet, and more.
- **Flexible Transformation Rules**: Apply type conversions, renaming, formatting, and custom transformations.
- **Unified Output**: Output data in various formats such as CSV, JSON, and Parquet with custom options.
- **Modular Configuration**: Use `.col` files to define data sources, transformations, and outputs, with support for imports to reuse configurations.
- **Data Collection Modes**: Choose between **parallel** and **sequence** collection modes. Parallel mode collects from all sources concurrently, which can shorten total run time when sources are slow.
- **Extensible Architecture**: Easily add new connectors and transformations to expand functionality.
## Getting Started
Follow these steps to get started with Collector:
1. **Install Dependencies**: Install required dependencies by running:
   ```bash
   pip install -r requirements.txt
   ```
2. **Define a Configuration File (.col)**: Create a `.col` file that specifies your data sources, transformation rules, and output configuration.
3. **Run the Collector**: Use the provided script to run the collector with your configuration file:
   ```bash
   python scripts/run_collector.py
   ```
## Configuration File (.col)
The `.col` file is the heart of Collector, allowing you to define how data should be collected, transformed, and output. Below is a basic example of a `.col` file:
```plaintext
VERSION 1.0

# Optional: Set Collection Mode (default is 'sequence')
COLLECT_MODE parallel # Can be 'parallel' or 'sequence'

# Define Data Sources
SOURCE sales_db TYPE sql {
    HOST "localhost"
    PORT 5432
    USERNAME "user"
    PASSWORD "pass"
    DATABASE "sales"
    QUERY "SELECT * FROM sales_data"
}

# Define Transformations
TRANSFORM unified_sales FROM sales_db {
    FIELD sale_date TYPE date FORMAT "%Y-%m-%d"
    FIELD amount TYPE float DEFAULT 0.0
}

# Define Output
OUTPUT unified_data TYPE parquet {
    PATH "/output/unified_sales.parquet"
    OPTIONS {
        COMPRESSION "gzip"
    }
}
```
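To make the block structure of the example concrete, here is a minimal Python sketch that reads `.col`-style text into nested lists and dictionaries. It is not Collector's parser: the `parse_col` function and its output shape are hypothetical, and the grammar is inferred only from the example above (one directive per line, `#` comments, `{` ending a block header, `}` on its own line).

```python
import shlex


def parse_col(text):
    """Parse .col-style text into a list of directives and nested blocks."""
    root = []
    stack = [root]  # stack[-1] is the body currently being filled
    for line in text.splitlines():
        # shlex handles quoted strings and drops '#' comments.
        tokens = shlex.split(line, comments=True)
        if not tokens:
            continue
        if tokens == ["}"]:
            if len(stack) == 1:
                raise ValueError("unbalanced '}'")
            stack.pop()
        elif tokens[-1] == "{":
            block = {"header": tokens[:-1], "body": []}
            stack[-1].append(block)
            stack.append(block["body"])
        else:
            stack[-1].append(tokens)
    if len(stack) != 1:
        raise ValueError("unclosed block")
    return root


sample = '''
VERSION 1.0
COLLECT_MODE parallel # Can be 'parallel' or 'sequence'
SOURCE sales_db TYPE sql {
    PORT 5432
    QUERY "SELECT * FROM sales_data"
}
'''
parsed = parse_col(sample)
```

Each plain directive becomes a list of tokens, and each block becomes a dictionary with a `header` (the tokens before `{`) and a `body` that may itself contain nested blocks such as `OPTIONS`.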
### Collect Mode
- **`parallel`**: Data from all sources is collected concurrently, speeding up the process for large datasets or slower APIs.
- **`sequence`** (default): Data is collected sequentially, one source at a time.
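
The difference between the two modes can be sketched in a few lines of Python. This is not Collector's implementation: the `collect` function and the one-callable-per-source shape are assumptions made for illustration, and a thread pool stands in for whatever concurrency mechanism the library actually uses.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Dict, List

# Hypothetical type: a fetcher is any zero-argument callable returning rows.
Fetcher = Callable[[], List[dict]]


def collect(sources: Dict[str, Fetcher], mode: str = "sequence") -> Dict[str, List[dict]]:
    """Run every fetcher and return its rows keyed by source name."""
    if mode == "sequence":
        # One source at a time, in declaration order.
        return {name: fetch() for name, fetch in sources.items()}
    if mode == "parallel":
        # Threads suit I/O-bound work such as database queries or API calls.
        with ThreadPoolExecutor(max_workers=len(sources) or 1) as pool:
            futures = {name: pool.submit(fetch) for name, fetch in sources.items()}
            # result() re-raises any exception raised in the worker thread.
            return {name: future.result() for name, future in futures.items()}
    raise ValueError(f"unknown COLLECT_MODE: {mode!r}")


# Stand-in sources; real fetchers would query a database or call an API.
sources = {
    "sales_db": lambda: [{"amount": "12.5"}],
    "orders_api": lambda: [{"order_id": 1}, {"order_id": 2}],
}
results = collect(sources, mode="parallel")
```

Both modes return the same data; only the wall-clock time differs, and the gain from `parallel` depends on how much of each source's time is spent waiting on I/O.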
## Connectors
Collector includes connectors for various data sources:
- **SQL Connector**: Connect to SQL databases such as MySQL and PostgreSQL.
- **CSV Connector**: Read data from CSV files with customizable options.
- **API Connector**: Fetch data from RESTful APIs using GET, POST, and other methods.
- **Parquet Connector**: Read data from Parquet files with compression options.
- **MongoDB Connector**: Fetch data from MongoDB collections.
- **Cloud Storage Connectors**:
  - **AWS S3**: Fetch data from Amazon S3 buckets.
  - **Google Cloud Storage**: Fetch data from Google Cloud Storage buckets.
  - **Azure Blob Storage**: Fetch data from Azure Blob containers.
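
One common way to organize connectors like these is a shared base class plus a registry keyed by source type. The sketch below illustrates that pattern only. `Connector`, `register`, `CONNECTORS`, and `CsvConnector` are hypothetical names, not Collector's actual classes, and the `text` option exists purely to keep the example self-contained.

```python
from abc import ABC, abstractmethod
import csv
import io
from typing import Dict, Iterator, Type


class Connector(ABC):
    """Hypothetical base class: one subclass per SOURCE type."""

    def __init__(self, **options):
        self.options = options

    @abstractmethod
    def fetch(self) -> Iterator[dict]:
        """Yield records as plain dictionaries."""


# Maps a SOURCE type name (e.g. "csv") to the class that handles it.
CONNECTORS: Dict[str, Type[Connector]] = {}


def register(type_name: str):
    """Class decorator that adds a connector class to the registry."""
    def decorator(cls):
        CONNECTORS[type_name] = cls
        return cls
    return decorator


@register("csv")
class CsvConnector(Connector):
    def fetch(self) -> Iterator[dict]:
        # 'text' is an illustrative option; a real connector would open a path.
        reader = csv.DictReader(
            io.StringIO(self.options["text"]),
            delimiter=self.options.get("delimiter", ","),
        )
        yield from reader


connector = CONNECTORS["csv"](text="id,amount\n1,9.99\n")
rows = list(connector.fetch())
```

With this layout, adding a new source type means writing one subclass and decorating it, without touching the code that looks connectors up by name.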
## Transformations
Define transformation rules in your `.col` file to:
- Convert data types (e.g., string to date, int to float).
- Rename fields.
- Apply conditional transformations.
- Set default values.
### Example Transformation
```plaintext
TRANSFORM unified_sales FROM sales_db {
    FIELD sale_date TYPE date FORMAT "%Y-%m-%d"
    FIELD amount TYPE float DEFAULT 0.0
}
```
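To show what the two `FIELD` rules above amount to for a single record, here is a rough Python equivalent. The `RULES` list and the `convert` and `transform` functions are hypothetical stand-ins rather than Collector's API. The sketch also assumes that fields without a rule are dropped and that empty values fall back to the default, neither of which the README specifies.

```python
from datetime import datetime
from typing import Any, Dict, List

# Hypothetical in-memory form of the FIELD lines in the TRANSFORM block.
RULES = [
    {"field": "sale_date", "type": "date", "format": "%Y-%m-%d"},
    {"field": "amount", "type": "float", "default": 0.0},
]


def convert(value: Any, rule: Dict[str, Any]) -> Any:
    """Apply one rule to one raw value."""
    if value is None or value == "":
        # Missing or empty: use DEFAULT if the rule has one, else None.
        return rule.get("default")
    if rule["type"] == "date":
        return datetime.strptime(value, rule["format"]).date()
    if rule["type"] == "float":
        return float(value)
    raise ValueError(f"unsupported type: {rule['type']!r}")


def transform(record: Dict[str, Any], rules: List[Dict[str, Any]]) -> Dict[str, Any]:
    """Build the output record from the fields named in the rules."""
    return {r["field"]: convert(record.get(r["field"]), r) for r in rules}


raw = {"sale_date": "2024-09-01", "amount": "", "note": "ignored"}
unified = transform(raw, RULES)
```

Here `unified["sale_date"]` is a `datetime.date` and `unified["amount"]` is the default `0.0`, because the raw amount was empty.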
## Output Formats
Collector supports various output formats:
- **CSV**: Output data to CSV files with customizable delimiters and headers.
- **JSON**: Save data as JSON with options for pretty printing.
- **Parquet**: Export data to Parquet files with optional compression.
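
The CSV and JSON options mentioned above map naturally onto Python's standard library. The sketch below assumes hypothetical `write_csv` and `write_json` helpers and option names (`delimiter`, `header`, `pretty`) and does not reflect Collector's own writer interface. Parquet is left out because it needs a third-party library such as `pyarrow`.

```python
import csv
import io
import json
from typing import Dict, List, TextIO


def write_csv(rows: List[Dict], out: TextIO, delimiter: str = ",", header: bool = True) -> None:
    """Write rows as CSV, taking column order from the first row."""
    if not rows:
        return
    writer = csv.DictWriter(
        out, fieldnames=list(rows[0]), delimiter=delimiter, lineterminator="\n"
    )
    if header:
        writer.writeheader()
    writer.writerows(rows)


def write_json(rows: List[Dict], out: TextIO, pretty: bool = False) -> None:
    """Write rows as a JSON array, optionally indented for readability."""
    json.dump(rows, out, indent=2 if pretty else None)


rows = [{"id": 1, "amount": 9.5}, {"id": 2, "amount": 0.0}]

csv_buffer = io.StringIO()
write_csv(rows, csv_buffer, delimiter=";")

json_buffer = io.StringIO()
write_json(rows, json_buffer, pretty=True)
```

Writing to any text stream, rather than to a path, keeps the helpers easy to test. In real use the stream would typically come from `open(path, "w", newline="")`.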
## Examples
Check out the `examples/` directory for sample `.col` files demonstrating different configurations:
- `basic_example.col`: A simple example using SQL and CSV sources.
- `advanced_example.col`: An advanced configuration with multiple data sources and transformations.
- `parallel_example.col`: Demonstrates parallel data collection from multiple sources.
- `shared_sources.col`: Demonstrates importing shared data sources across configurations.
## Contributing
We welcome contributions to improve Collector! To contribute:
1. Fork the repository.
2. Create a new branch for your feature or bug fix.
3. Commit your changes and push to your fork.
4. Open a pull request with a detailed description of your changes.
Please ensure that your code follows the project's coding standards and includes appropriate tests.
## License
This project is licensed under the MIT License - see the [LICENSE](LICENSE) file for details.