https://github.com/GitDataAI/jzfs

A Git-like version control file system for data lineage & data collaboration.

aiops data-collaboration data-lake data-lake-management data-lineage data-mesh data-product data-version-control data-versioning datalake dataops digital-twins enterprise-datahub federated-learning git-filesystem git-for-data jiaozifs mlops version-controlled-filesystem


# JZFS
A version control file system for data lineage & data collaboration.







----
JZFS is an industry-leading **Data-Centric Version Control** file system that helps ensure Responsible AI Engineering by improving **Data Versioning**, **Provenance**, and **Reproducibility**.

Note:
* The name JZFS pays tribute to the world's earliest paper money: [Song Dynasty JiaoZi](https://en.wikipedia.org/wiki/Jiaozi_(currency)).
* JZFS is yet another implementation of [IPFS (InterPlanetary File System)](https://ipfs.tech/), as it will be compatible with the IPFS [implementation requirements](https://specs.ipfs.tech/architecture/principles/#ipfs-implementation-requirements).
* JZFS is a file system for data versioning at scale. Although it is built for machine learning, it has a wide range of use scenarios (see A Universe of Uses) and can be integrated seamlessly into your data stack.

Data-centric AI is the practice of programmatically iterating and collaborating on the data used to build AI systems. Machine learning pioneer Andrew Ng [argues that focusing on the quality of data fueling AI systems will help unlock its full power](https://youtu.be/TU6u_T-s68Y).

----
### Features

In production systems with machine learning components, updates and experiments are frequent. Updates to models (data products) may be released every day or every few minutes, and different users may see the results of different models as part of A/B experiments or canary releases.

* **Version Everything**: Data scientists are often criticized for being less disciplined about versioning their experiments (versioning of data, pipelines, code, and models), especially when using computational notebooks.
* **Track Data Provenance**: This applies to all processing steps in an AI/ML pipeline, including data collection/acquisition, data merging, data cleaning, feature extraction, learning, or deployment.
* **Reproducibility**: A final question of AI/ML that is often relevant for debugging, audits, and also science more broadly is to what degree data, models, and decisions can be reproduced.

----
### Getting Started

#### Requirements

1. To build JZFS, you need a working installation of [Go 1.22.0 or higher](https://golang.org/dl/).
2. JZFS uses PostgreSQL to store its runtime data. To install it, see the [PostgreSQL installation guide](https://www.postgresql.org/docs/current/installation.html).
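
Before running `make build`, it can help to confirm that the installed Go toolchain meets the minimum version. The following is a minimal sketch, not part of JZFS: the helper names `parse_go_version` and `version_ge` are hypothetical, and it assumes `go version` prints its usual `go version go1.x.y os/arch` line.

```bash
#!/usr/bin/env bash

# Extracts the numeric version (e.g. "1.22.3") from a line such as
# "go version go1.22.3 linux/amd64". Prints nothing if the line does not match.
parse_go_version() {
  sed -nE 's/^go version go([0-9]+(\.[0-9]+)*).*/\1/p' <<<"$1"
}

# Returns 0 when dotted version $1 is greater than or equal to dotted version $2.
# Only the first three numeric components are compared; missing ones count as 0.
version_ge() {
  local IFS=.
  local -a have=($1) want=($2)
  local i h w
  for i in 0 1 2; do
    h=${have[i]:-0}
    w=${want[i]:-0}
    if (( h > w )); then return 0; fi
    if (( h < w )); then return 1; fi
  done
  return 0
}

# Run the check only when a Go toolchain is on the PATH.
if command -v go >/dev/null 2>&1; then
  installed=$(parse_go_version "$(go version)")
  if version_ge "$installed" 1.22.0; then
    echo "Go $installed is new enough to build JZFS"
  else
    echo "Go $installed is too old; JZFS needs 1.22.0 or higher" >&2
  fi
fi
```

The comparison is done numerically per component rather than as a string, so `1.9` is correctly treated as older than `1.22`.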

#### Build and Run

1. Clone and build
```bash
git clone https://github.com/GitDataAI/jzfs.git
cd jzfs
make build
```

After following the steps above, you should see an executable file named `jzfs`.

2. Initialize the program and run the daemon
```bash
# Quote the URL so the shell does not treat "?" as a glob character.
./jzfs init --db "postgres://:@localhost:5432/jiaozifs?sslmode=disable"
./jzfs daemon
```
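
The `--db` value is a PostgreSQL connection URL. As a small sketch, the helper below (hypothetical name `build_db_url`, not part of JZFS) assembles a URL of the same shape from separate values, so that credentials can come from environment variables instead of being typed into the command. The user name and password shown are placeholder assumptions, and the defaults for host, port, and database simply mirror the command above. The sketch does not percent-encode its arguments, so it assumes the user name and password contain no characters that are reserved in URLs.

```bash
#!/usr/bin/env bash

# Prints a PostgreSQL connection URL in the shape passed to `jzfs init --db`.
# Arguments: user, password, then optional host, port, and database name.
build_db_url() {
  local user=$1 password=$2
  local host=${3:-localhost} port=${4:-5432} db=${5:-jiaozifs}
  printf 'postgres://%s:%s@%s:%s/%s?sslmode=disable\n' \
    "$user" "$password" "$host" "$port" "$db"
}

# Example with placeholder credentials (assumptions, not real accounts).
build_db_url "jzfs_user" "change-me"
# prints: postgres://jzfs_user:change-me@localhost:5432/jiaozifs?sslmode=disable

# Possible use with the init command, assuming PGUSER and PGPASSWORD are exported:
#   ./jzfs init --db "$(build_db_url "$PGUSER" "$PGPASSWORD")"
```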

#### Run with Docker

```bash
docker run -v :/app -p 34913:34913 gitdatateam/jzfs:latest --db "postgres://:@192.168.1.16:5432/jiaozifs?sslmode=disable" --bs_path /app/data --listen http://0.0.0.0:34913 --config /app/config.toml
```
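
After the container starts, a quick way to see whether the service is accepting connections is to probe the published port. The sketch below is a hypothetical helper (the name `wait_for_port` is not part of JZFS), and it relies on bash's `/dev/tcp` redirection, which some bash builds disable. It only checks that a TCP port is open, not that JZFS itself is healthy.

```bash
#!/usr/bin/env bash

# Returns 0 as soon as a TCP connection to host:port succeeds,
# or 1 after the given number of failed attempts (one second apart).
wait_for_port() {
  local host=$1 port=$2 attempts=${3:-30}
  local i
  for (( i = 0; i < attempts; i++ )); do
    # The subshell opens the socket on file descriptor 3 and closes it on exit.
    if (exec 3<>"/dev/tcp/$host/$port") 2>/dev/null; then
      return 0
    fi
    sleep 1
  done
  return 1
}

# Example for the port published by the docker command above:
#   wait_for_port 127.0.0.1 34913 30 && echo "port 34913 is accepting connections"
```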
#### Cloud

[Try without installing](https://console.gitdata.ai)

Note: when you create a new repository in the JZFS Console, provide a storage config for the IPFS backend, for example:

```json
{"type":"ipfs","ipfs":{"url":"/dns/kubo-service.ipfs.svc.cluster.local/tcp/5001"}}
```
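
To point a repository at a different IPFS node, the `url` field is presumably the part to change. As a sketch, the hypothetical helper `ipfs_storage_config` below prints a config of the same shape for a given API multiaddr. It assumes the address contains no characters that need JSON escaping, and it does not cover any other fields the Console may accept.

```bash
#!/usr/bin/env bash

# Prints a storage config with the same structure as the example above,
# using the given multiaddr of an IPFS node's API endpoint as the "url" value.
ipfs_storage_config() {
  local api_addr=$1
  printf '{"type":"ipfs","ipfs":{"url":"%s"}}\n' "$api_addr"
}

# Reproduces the config shown above.
ipfs_storage_config "/dns/kubo-service.ipfs.svc.cluster.local/tcp/5001"
```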

#### Examples
Build an AI/ML pipeline on top of JZFS: [Face detection and recognition inference pipeline](https://colab.research.google.com/drive/1wsv-KMxTdsCLZ64eLq4W1MTfspid-vv6?usp=sharing)

----
### Documentation

[Official Documentation](https://docs.gitdata.ai)

----
### Users and Partners

* [Lighthouse Permanent Storage](https://www.lighthouse.storage/)
* [MesoReef DAO: Decentralized Science for Regenerating](https://linktr.ee/mesoreefdao)
* [LunCo](https://www.lunco.space/)
* [Artizen Fund](https://artizen.fund/)
* [HaAI Labs](https://haai.info/)

----
### Contributors





----
### License

Dual-licensed under [MIT](https://github.com/GitDataAI/jzfs/blob/main/LICENSE-MIT) + [Apache 2.0](https://github.com/GitDataAI/jzfs/blob/main/LICENSE-APACHE)