https://github.com/DataJourneyHQ/DataJourney
Open-source Data Management Framework
allthingsopen dagster data-engineering flask gha holoviews intake llm mito open-source panel pytest vale
- Host: GitHub
- URL: https://github.com/DataJourneyHQ/DataJourney
- Owner: sayantikabanik
- License: cc0-1.0
- Created: 2022-03-02T14:02:56.000Z (over 4 years ago)
- Default Branch: main
- Last Pushed: 2024-10-25T03:21:04.000Z (over 1 year ago)
- Last Synced: 2024-10-25T04:21:19.181Z (over 1 year ago)
- Topics: allthingsopen, dagster, data-engineering, flask, gha, holoviews, intake, llm, mito, open-source, panel, pytest, vale
- Language: HTML
- Homepage: https://sayantikabanik.github.io/DataJourney/
- Size: 80.1 MB
- Stars: 13
- Watchers: 2
- Forks: 2
- Open Issues: 6
Metadata Files:
- Readme: README.md
- License: LICENSE
- Code of conduct: CODE_OF_CONDUCT.md
README
Recipient: GitHub Secure Open Source Fund
Sponsor DataJourneyHQ • Official announcement
### DataJourney
#### Short version
Design-first open-source data management toolkit. It simplifies data workflows with modular, reproducible solutions.
#### Long version
DataJourney demonstrates how organizations can effectively manage and utilize data by harnessing the power of open-source technologies. It's designed to help navigate the complex landscape of data tools, offering a structured approach to building **scalable** and **reproducible** data workflows.
Built on open-source principles, the framework guides users through essential steps: from **identifying** goals and **selecting tools** to **testing** and **customising** workflows. With its flexible, modular design, DataJourney can be tailored to individual needs, making it an invaluable toolkit for data professionals.
### Hold on, looking to contribute?
Head over to the [wiki](https://github.com/DataJourneyHQ/DataJourney/wiki/Contribute-to-DataJourney) and let's make it happen together. We don't bite :)
### Design Philosophy (LEGO)
Layers can be added or removed like LEGO bricks, held together by open source.
Each layer has its own built-in strength of communication:
- P0 (Base): Static home(s) to keep it together `(GitHub)`
- P1 (Tooling): Tooling, strings `(Powered by open source)`
- P2 (Maintenance + Monitoring): Env, automations `(Pixi + GHA)`
- P3 (Abstraction): Layer(s), CLI/task manager for users to interact with `(Pixi)`

### Current workflows covered
{✨ = Experimental, ✅ = Implemented}

| Status | Workflow Description |
|--------|----------------------|
| ✅ | `Python Packaging framework` design principles |
| ✅ | `GitHub actions` configured |
| ✅ | `Vale.sh` configured at PR level |
| ✅ | `Pre-commit hooks` configured for code linting/formatting |
| ✅ | `Hello world` LLM design example based on [LangChain](https://python.langchain.com/) |
| ✅ | `Environment` management via [pixi](https://prefix.dev/) |
| ✅ | `Reading data` from online sources using [intake](https://github.com/intake/intake) |
| ✅ | `Data pipeline` built using [Dagster](https://github.com/dagster-io/dagster) |
| ✅ | `Custom dashboard` using [holoviews](https://holoviews.org/gallery/index.html) + [panel](https://panel.holoviz.org/reference/index.html) |
| ✅ | `Exploratory data analysis` (EDA) using [mito](https://www.trymito.io/) |
| ✅ | `Web UI` built on [Flask](https://flask.palletsprojects.com/en/3.0.x/) |
| ✅ | `Web UI` re-done and expanded with [FastHTML](https://docs.fastht.ml/) |
| ✅ | `GenAI examples` to analyse data using [GitHub AI models](https://docs.github.com/en/github-models/prototyping-with-ai-models) |
| ✅ | `Query engine` for LLM application using [Chromadb](https://docs.trychroma.com/docs/overview/introduction) |
| ✅ | `RAG` powered by `langchain`, `chromadb` & `GitHub AI models` |
| ✅ | `Prompt enhancer` powered by `gpt-oss-120b` |
### Quickly getting started with DataJourney
- Fork the repository
- Generate and add a `GITHUB_TOKEN` (instructions [here](https://docs.github.com/en/authentication/keeping-your-account-and-data-secure/managing-your-personal-access-tokens#creating-a-personal-access-token-classic))
  - This is an additional requirement for running the LLM-based workflows
- Switch directory: `cd DataJourney`
- Install pixi: [prefix.dev](https://prefix.dev/)
- Activate the environment: `pixi shell`
- Install the DJ framework locally: `pixi run DJ_package`
- List all the tasks: `pixi run DJ_list`
- Execute a specific task from the list: `pixi run <task_name>`
- Execute a specific task with additional logs: `pixi run -v <task_name>`
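
Before running the steps above, it can help to confirm that the two prerequisites are in place. The following is a minimal sketch, not part of the repository: `dj_preflight` is a hypothetical helper name, and it only checks that `GITHUB_TOKEN` is non-empty and that a `pixi` executable is on `PATH`. It does not validate the token itself; the project's own `GIT_TOKEN_CHECK` task is described as covering that.

```shell
# Hypothetical helper (an assumption, not shipped with DataJourney).
# Reports whether the quick-start prerequisites appear to be present.
dj_preflight() {
  status=0

  # The LLM-based workflows need a token in the environment.
  if [ -z "${GITHUB_TOKEN:-}" ]; then
    echo "GITHUB_TOKEN is not set"
    status=1
  else
    echo "GITHUB_TOKEN is set"
  fi

  # Every task is launched through pixi, so it must be installed.
  if command -v pixi >/dev/null 2>&1; then
    echo "pixi found"
  else
    echo "pixi not found"
    status=1
  fi

  return "$status"
}
```

After sourcing the snippet, a call such as `dj_preflight && pixi run DJ_list` would list the tasks only when both checks pass.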
### Active `tasks` under DJ
| **Task Name** | **Description** |
|------------------------------|--------------------------------------------------------------------------------------------------------------|
| `GIT_TOKEN_CHECK` | Verifies the availability and validity of the Git authentication token. |
| `DJ_package` | Prepares and builds the Python package for the DataJourney project. |
| `DJ_pre_commit` | Runs pre-commit hooks to ensure code quality and adherence to standards. |
| `DJ_dagster` | Sets up and runs a Dagster workflow for orchestration in the project. |
| `DJ_fasthtml_app`            | Runs the FastHTML-based web application.                                                                     |
| `DJ_flask_app` | Configures and runs a Flask-based application for data services. |
| `DJ_mito_app` | Launches the Mito application for interactive data analysis in notebooks. |
| `DJ_panel_app` | Executes a Panel dashboard app for data visualization and analytics. |
| `DJ_llm_analysis` | Performs analysis using large language models (LLMs) on project data. |
| `DJ_hello_world_langchain` | Sets up a basic LangChain app as a "Hello World" example for LLMs. |
| `DJ_spanish_eng_translation` | Performs Spanish to English translation with Deepseek-R1 (`NOTE`: this task takes about 30 seconds to run)  |
| `DJ_sync_dataset_trees` | Downloads and synchronizes the `trees.csv` dataset into the project structure. |
| `DJ_chromadb_gen_embedding` | Query engine for LLM applications |
| `DJ_RAG_without_memory` | End-to-end Retrieval-Augmented Generation (RAG) pipeline |
| `DJ_prompt_enhancer`         | Demonstrates how to design a simple prompt enhancer using gpt-oss-120b                                       |
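
Because task names are long and case-sensitive, a thin wrapper that checks the name against the table above before calling `pixi run` can catch typos early. This is a sketch under stated assumptions: `dj_run`, `DJ_TASKS`, and the `DJ_DRY_RUN` variable are hypothetical names introduced for illustration, and the task list is copied by hand from the table rather than read from the project's pixi configuration.

```shell
# Hypothetical wrapper (an assumption, not shipped with DataJourney).
# Task names are copied by hand from the table above; DJ_list is not
# listed there and is therefore not included.
DJ_TASKS="GIT_TOKEN_CHECK DJ_package DJ_pre_commit DJ_dagster \
DJ_fasthtml_app DJ_flask_app DJ_mito_app DJ_panel_app DJ_llm_analysis \
DJ_hello_world_langchain DJ_spanish_eng_translation DJ_sync_dataset_trees \
DJ_chromadb_gen_embedding DJ_RAG_without_memory DJ_prompt_enhancer"

dj_run() {
  task="${1:-}"
  for known in $DJ_TASKS; do
    if [ "$known" = "$task" ]; then
      if [ -n "${DJ_DRY_RUN:-}" ]; then
        # Dry run: print the command instead of executing it.
        echo "pixi run $task"
      else
        pixi run "$task"
      fi
      return
    fi
  done
  echo "unknown task: $task" >&2
  return 2
}
```

For example, `DJ_DRY_RUN=1 dj_run DJ_dagster` prints the command that would be run, while `dj_run DJ_dagstr` fails with an "unknown task" message instead of invoking pixi.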
### About pre-commit hooks and activating them
As the name suggests, pre-commit hooks lint and format the code against PEP standards before each commit. [More details](https://pre-commit.com/)
```shell
pixi run DJ_pre_commit
```
### Executing LLM script: Generate stock price recommendations
```shell
pixi run DJ_llm_analysis
```
### Execute pre-configured Dagster pipeline
```shell
pixi run DJ_dagster
```

### Panel app
```shell
pixi run DJ_panel_app
```
*NOTE:*
The generated dashboard is exported to HTML and saved as [stock_price_twilio_dashboard](analytics_framework%2Fdashboard%2Fstock_price_twilio_dashboard.html).

### Mito
To explore further, visit [trymito.io](https://docs.trymito.io/).
```shell
pixi run DJ_mito_app
```
### Display all data sources present via web UI
```shell
# Run FastHTML app
pixi run DJ_fasthtml_app
```
