https://github.com/anuveyatsu/cloudflare-data-fabric
Cloudflare Data Fabric: Use Cloudflare's global infrastructure to build a flexible, resilient framework for data solutions.
cloudflare data data-lake fabric lakehouse mesh
- Host: GitHub
- URL: https://github.com/anuveyatsu/cloudflare-data-fabric
- Owner: anuveyatsu
- License: mit
- Created: 2024-11-10T06:41:19.000Z (3 months ago)
- Default Branch: main
- Last Pushed: 2024-11-10T06:54:00.000Z (3 months ago)
- Last Synced: 2025-01-08T17:22:47.679Z (about 1 month ago)
- Topics: cloudflare, data, data-lake, fabric, lakehouse, mesh
- Homepage:
- Size: 3.91 KB
- Stars: 1
- Watchers: 1
- Forks: 0
- Open Issues: 0
Metadata Files:
- Readme: README.md
- License: LICENSE
# Cloudflare Data Fabric
**Cloudflare Data Fabric** is an open-source framework designed to help teams build scalable data solutions, such as data lakes, data lakehouses, and data meshes, using Cloudflare’s suite of products. This framework offers a data fabric approach that combines storage, compute, and security to create a seamless data layer accessible across distributed environments.
## Key Features
- **Centralized Data Storage**: Utilizes [Cloudflare R2](https://www.cloudflare.com/products/r2/) as the primary storage layer for data objects, offering cost-effective, serverless data storage.
- **Serverless Data Processing**: Enables real-time data transformations and processing using [Cloudflare Workers](https://developers.cloudflare.com/workers/).
- **Traditional ETL Workflows**: Supports traditional, long-running ETL tasks through GitHub Actions with a self-hosted runner on Hetzner servers, providing powerful compute resources for data extraction, transformation, and loading at a low cost.
- **Authentication & Authorization**: Implements fine-grained access control with [Cloudflare Access](https://www.cloudflare.com/products/zero-trust/) for secure data access and policy enforcement.
- **Edge Caching and Optimization**: Leverages Cloudflare’s global CDN and edge network for efficient data delivery.
- **Multi-Purpose Data Buckets**: Provides separate public and private buckets for open and restricted data access, with role-based permissions.
## Use Cases
- **Data Lakes and Lakehouses**: Store raw and processed data for analysis, training, and model serving.
- **Data Mesh Architecture**: Decentralize data storage across domains with independent data governance.
- **Real-time Data Transformation**: Process and transform data on the fly as it’s accessed.
## Getting Started
### Prerequisites
To use Cloudflare Data Fabric, ensure you have:
- A [Cloudflare account](https://dash.cloudflare.com/sign-up) with access to R2 and Workers.
- Git installed locally for cloning and managing the repository.
- Node.js installed (for running the setup script).
### Installation
1. **Clone the Repository**:
```bash
git clone https://github.com/anuveyatsu/cloudflare-data-fabric.git
cd cloudflare-data-fabric
```
2. **Configure Cloudflare Services**:
- Set up Cloudflare R2 buckets (public and private) in your Cloudflare account.
- Enable Cloudflare Workers for your account.
3. **Install Dependencies**:
Run the following command to install required dependencies:
```bash
npm install
```
4. **Configure Environment Variables**:
Create a `.env` file with your Cloudflare account details:
```env
CLOUDFLARE_ACCOUNT_ID=your_account_id
CLOUDFLARE_API_TOKEN=your_api_token
```
### ETL Process with GitHub Actions and Hetzner
For traditional ETL tasks, we use GitHub Actions with a self-hosted runner on Hetzner servers, providing powerful resources at an affordable cost. Here’s how to set this up:
1. **Set Up Hetzner Runner**:
- Deploy a virtual server on Hetzner with sufficient CPU, memory, and storage for ETL workloads.
- Install the GitHub Actions runner on your Hetzner server by following [GitHub’s self-hosted runner setup](https://docs.github.com/en/actions/hosting-your-own-runners/about-self-hosted-runners).
2. **Runner Configuration**:
- Configure your runner to scale according to ETL job requirements, allowing for efficient resource usage.
- Use environment variables or secrets to securely manage credentials for data sources and destinations.
3. **ETL Workflow**:
- Define ETL workflows using GitHub Actions. Schedule them with cron syntax for regular, automated runs:
```yaml
on:
  schedule:
    - cron: '0 0 * * *' # Run daily at midnight UTC
```
- Use GitHub’s workflow visualization to monitor each run and set up notifications for errors or job completions.
4. **Data Access and Security**:
- Securely access data sources from the runner, potentially using VPN or SSH tunneling if needed.
- Store all sensitive information (API keys, credentials) in GitHub secrets for secure access during workflows.
### Example Code
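The real-time transformation use case could look something like the Worker-style handler below, which reads a CSV object from R2 and returns it as JSON. This is a minimal sketch, not code taken from the project: the `PUBLIC_BUCKET` binding name and the `csvToRecords` helper are hypothetical, and the parser assumes a simple CSV with a header row and no quoted or escaped commas.

```javascript
// Hypothetical helper: turn simple CSV text into an array of plain objects.
// Assumes a header row and no quoted or escaped commas.
function csvToRecords(csvText) {
  const lines = csvText.split(/\r?\n/).filter((line) => line.trim() !== "");
  if (lines.length === 0) return [];
  const headers = lines[0].split(",").map((header) => header.trim());
  return lines.slice(1).map((line) => {
    const cells = line.split(",");
    const record = {};
    headers.forEach((header, i) => {
      // Missing trailing cells become empty strings.
      record[header] = (cells[i] ?? "").trim();
    });
    return record;
  });
}

// Worker-style handler. PUBLIC_BUCKET is an assumed R2 binding name that
// would have to be declared in the Worker's configuration.
const worker = {
  async fetch(request, env) {
    // Use the URL path (minus the leading slash) as the object key.
    const key = new URL(request.url).pathname.slice(1);
    const object = await env.PUBLIC_BUCKET.get(key);
    if (object === null) {
      return new Response("Not found", { status: 404 });
    }
    const records = csvToRecords(await object.text());
    return new Response(JSON.stringify(records), {
      headers: { "Content-Type": "application/json" },
    });
  },
};

// In a Worker module this object would be the default export:
// export default worker;
```

Keeping the parsing in a plain function means it can be tested without a Worker runtime. For anything beyond simple files, a real CSV parser would be the safer choice.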
```javascript
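// Example: Access a file in the public R2 bucket
const fetchData = async () => {
  const response = await fetch("https://data-cloudflare.example.com/myfile.csv");
  // fetch only rejects on network errors, so check the HTTP status explicitly
  if (!response.ok) {
    throw new Error(`Request failed with status ${response.status}`);
  }
  const data = await response.text();
  console.log(data);
};

fetchData().catch((error) => console.error(error));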
```
## Contributing
We welcome contributions! Please see [CONTRIBUTING.md](CONTRIBUTING.md) for guidelines on contributing to the Cloudflare Data Fabric project.
## License
This project is licensed under the MIT License - see the [LICENSE](LICENSE) file for details.