{"id":24192642,"url":"https://github.com/DataJourneyHQ/DataJourney","last_synced_at":"2025-09-21T16:31:34.588Z","repository":{"id":65851306,"uuid":"465329805","full_name":"sayantikabanik/DataJourney","owner":"sayantikabanik","description":"Open-source Data Management Framework","archived":false,"fork":false,"pushed_at":"2024-10-25T03:21:04.000Z","size":84006,"stargazers_count":13,"open_issues_count":6,"forks_count":2,"subscribers_count":2,"default_branch":"main","last_synced_at":"2024-10-25T04:21:19.181Z","etag":null,"topics":["allthingsopen","dagster","data-engineering","flask","gha","holoviews","intake","llm","mito","open-source","panel","pytest","vale"],"latest_commit_sha":null,"homepage":"https://sayantikabanik.github.io/DataJourney/","language":"HTML","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"cc0-1.0","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/sayantikabanik.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":"CODE_OF_CONDUCT.md","threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2022-03-02T14:02:56.000Z","updated_at":"2024-10-25T03:21:08.000Z","dependencies_parsed_at":"2024-01-16T01:21:18.796Z","dependency_job_id":"65aaf079-7812-4818-9d75-b6e49584cb09","html_url":"https://github.com/sayantikabanik/DataJourney","commit_stats":{"total_commits":82,"total_committers":2,"mean_commits":41.0,"dds":"0.060975609756097615","last_synced_commit":"7b533fa13a3ea704e87d99f9424443a6958ec65f"},"previous_names":[],"tags_count":0,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/sayantikabanik%2FDataJourney","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositori
es/sayantikabanik%2FDataJourney/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/sayantikabanik%2FDataJourney/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/sayantikabanik%2FDataJourney/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/sayantikabanik","download_url":"https://codeload.github.com/sayantikabanik/DataJourney/tar.gz/refs/heads/main","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":233770290,"owners_count":18727553,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["allthingsopen","dagster","data-engineering","flask","gha","holoviews","intake","llm","mito","open-source","panel","pytest","vale"],"created_at":"2025-01-13T16:06:17.697Z","updated_at":"2025-09-21T16:31:34.575Z","avatar_url":"https://github.com/sayantikabanik.png","language":"HTML","funding_links":["https://github.com/sponsors/DataJourneyHQ"],"categories":[],"sub_categories":[],"readme":"\u003ch1 align=\"center\"\u003e\n\n[![DataJourney Stats](https://img.shields.io/badge/DataJourney-Visitors-orange)](https://datajourneyhq.github.io/DataJourney/)\\\n[![License](https://img.shields.io/badge/License-Apache%202.0-blue.svg)](https://www.apache.org/licenses/LICENSE-2.0)\n[![OpenSSF Best Practices](https://www.bestpractices.dev/projects/11135/badge)](https://www.bestpractices.dev/projects/11135)\n[![Code of 
Conduct](https://img.shields.io/badge/Code_of_Conduct-Contributor%20Covenant-blue)](https://www.contributor-covenant.org/version/2/0/code_of_conduct/)\\\n[![CI](https://github.com/sayantikabanik/DataJourney/actions/workflows/CI.yml/badge.svg)](https://github.com/sayantikabanik/DataJourney/actions/workflows/CI.yml)\n[![github-repo-stats](https://github.com/sayantikabanik/DataJourney/actions/workflows/github-repo-stats.yml/badge.svg)](https://github.com/sayantikabanik/DataJourney/actions/workflows/github-repo-stats.yml)\n[![Deploy DataJourney Stats](https://github.com/sayantikabanik/DataJourney/actions/workflows/static.yml/badge.svg)](https://github.com/sayantikabanik/DataJourney/actions/workflows/static.yml)\n[![Lint prose](https://github.com/sayantikabanik/DataJourney/actions/workflows/review.yml/badge.svg)](https://github.com/sayantikabanik/DataJourney/actions/workflows/review.yml)\n[![Monitor GitHub API Rate Limit](https://github.com/sayantikabanik/DataJourney/actions/workflows/rate-limit-monitor.yml/badge.svg)](https://github.com/sayantikabanik/DataJourney/actions/workflows/rate-limit-monitor.yml)\n\n\u003c/h1\u003e\n\n\u003c!-- Funding spotlight --\u003e\n\u003cp align=\"center\"\u003e\n  \u003cb\u003eRecipient: GitHub Secure Open Source Fund\u003c/b\u003e\u003cbr/\u003e\n  \u003ca href=\"https://github.com/sponsors/DataJourneyHQ\"\u003e\u003cb\u003e💖 Sponsor DataJourneyHQ\u003c/b\u003e\u003c/a\u003e\n  \u0026nbsp;•\u0026nbsp;\n  \u003ca href=\"https://github.blog/open-source/maintainers/securing-the-supply-chain-at-scale-starting-with-71-important-open-source-projects/\"\u003e\u003cb\u003e🥁Official announcement\u003c/b\u003e\u003c/a\u003e\n\u003c/p\u003e\n\n\u003cp align=\"center\"\u003e\n  \u003cimg src=\"./assets/DataJourney_logo_svg/dj_darkmode.svg\" alt=\"DJ rocks\" style=\"width:500px; height:600px;\"\u003e\n\u003c/p\u003e\n\n### 🚌 DataJourney\n\n#### 🪶Short version\n\nDesign-first Open Source Data Management Toolkit. 
Simplifies data workflows with modular, reproducible solutions.\n\n#### 🌲Long version\n\nDataJourney demonstrates how organizations can effectively manage and utilize data by harnessing the power of open-source technologies. It's designed to help navigate the complex landscape of data tools, offering a structured approach to building **scalable** and **reproducible** data workflows.\n\nBuilt on open-source principles, the framework guides users through essential steps: from **identifying** goals and **selecting tools** to **testing** and **customising** workflows. With its flexible, modular design, DataJourney can be tailored to individual needs, making it an invaluable toolkit for data professionals.\n\n### 🚦 Hold on, looking to contribute?\n\nHead over to the [wiki](https://github.com/DataJourneyHQ/DataJourney/wiki/Contribute-to-DataJourney) and let's make it happen together. We don't bite :)\n\n\n### 🧱 Design Philosophy (LEGO)\nBuilt with additive and subtractive capabilities, glued together with open source.\nEach layer has a certain strength of communication built in.\n\n- P0 (Base): Static home(s) to keep it together `(GitHub)`\n- P1 (Tooling): Tooling, strings `(Powered by open source)`\n- P2 (Maintenance + Monitoring): Env, automations `(Pixi + GHA)`\n- P3 (Abstraction): Layer(s), CLI/task manager for users to interact with `(Pixi)`\n\n\n![DJ Design](assets/design/dj_vision.png)\n\n### 🛠 Current workflows covered\n{✨= Experimental,\n✅ = Implemented}\n\n| Status | Workflow Description                                                                                                                |\n|--------|-------------------------------------------------------------------------------------------------------------------------------------|\n| ✅     | `Python Packaging framework` design principles                                                                                      |\n| ✅     | `GitHub Actions` configured                                                           
                                              |\n| ✅     | `Vale.sh` configured at PR level                                                                                                    |\n| ✅     | `Pre-commit hooks` configured for code linting/formatting                                                                           |\n| ✅     | `Hello world` LLM design example based on [LangChain](https://python.langchain.com/)                                                |\n| ✅     | `Environment` management via [pixi](https://prefix.dev/)                                                                              |\n| ✅     | `Reading data` from online sources using [intake](https://github.com/intake/intake)                                                   |\n| ✅     | `Data pipeline` built using [Dagster](https://github.com/dagster-io/dagster)                                                        |\n| ✅     | `Custom dashboard` using [holoviews](https://holoviews.org/gallery/index.html) + [panel](https://panel.holoviz.org/reference/index.html) |\n| ✅     | `Exploratory data analysis` (EDA) using [mito](https://www.trymito.io/)                                                               |\n| ✅     | `Web UI` built on [Flask](https://flask.palletsprojects.com/en/3.0.x/)                                                                |\n| ✅     | `Web UI` re-done and expanded with [FastHTML](https://docs.fastht.ml/)                                                                |\n| ✅     | `GenAI examples` to analyse data using [GitHub AI models](https://docs.github.com/en/github-models/prototyping-with-ai-models)    |\n| ✅     | `Query engine` for LLM application using [Chromadb](https://docs.trychroma.com/docs/overview/introduction)                            |\n| ✅     | `RAG` powered by `langchain`, `chromadb` \u0026 `GitHub AI models` |\n| ✅     | `Prompt enhancer` powered by `gpt-oss-120b`|\n\n\n### ☕️ Quickly getting started with DataJourney\n\n- 
Fork the repository\n- Generate and add a `GITHUB_TOKEN` (instructions [here](https://docs.github.com/en/authentication/keeping-your-account-and-data-secure/managing-your-personal-access-tokens#creating-a-personal-access-token-classic))\n  - Required to run the LLM-based workflows\n- Switch directory: `cd DataJourney`\n- Download pixi: [prefix.dev](https://prefix.dev/)\n- Activate env: `pixi shell`\n- Install the DJ framework locally: `pixi run DJ_package`\n- List all the tasks: `pixi run DJ_list`\n- Execute a specific task from the list: `pixi run \u003cTASK_NAME\u003e`\n- Execute a specific task with additional logs: `pixi run -v \u003cTASK_NAME\u003e`\n\n### 🏃🏽‍♀️ Active `tasks` under DJ\n\n| **Task Name**                | **Description**                                                                                              |\n|------------------------------|--------------------------------------------------------------------------------------------------------------|\n| `GIT_TOKEN_CHECK`            | Verifies the availability and validity of the Git authentication token.                                      |\n| `DJ_package`                 | Prepares and builds the Python package for the DataJourney project.                                          |\n| `DJ_pre_commit`              | Runs pre-commit hooks to ensure code quality and adherence to standards.                                     |\n| `DJ_dagster`                 | Sets up and runs a Dagster workflow for orchestration in the project.                                        |\n| `DJ_fasthtml_app`            | Runs the FastHTML web application.                                                                           |\n| `DJ_flask_app`               | Configures and runs a Flask-based application for data services.                                             |\n| `DJ_mito_app`                | Launches the Mito application for interactive data analysis in notebooks.                           
         |\n| `DJ_panel_app`               | Executes a Panel dashboard app for data visualization and analytics.                                         |\n| `DJ_llm_analysis`            | Performs analysis using large language models (LLMs) on project data.                                        |\n| `DJ_hello_world_langchain`   | Sets up a basic LangChain app as a \"Hello World\" example for LLMs.                                           |\n| `DJ_spanish_eng_translation` | Performs Spanish-to-English translation with DeepSeek-R1 (`NOTE`: takes about 30 seconds to run)             |\n| `DJ_sync_dataset_trees`      | Downloads and synchronizes the `trees.csv` dataset into the project structure.                               |\n| `DJ_chromadb_gen_embedding`  | Query engine for LLM applications                                                                            |\n| `DJ_RAG_without_memory`      | End-to-end Retrieval-Augmented Generation (RAG) pipeline                                                     |\n| `DJ_prompt_enhancer`         | Demonstrates a simple prompt enhancer built with gpt-oss-120b |\n\n\n### 🔌 About pre-commit hooks and activating them\nAs the name suggests, pre-commit hooks format the code to PEP standards before each commit. 
[More details](https://pre-commit.com/)\n\n```shell\npixi run DJ_pre_commit\n```\n\n### 🦭 Executing LLM script: Generate stock price recommendations\n\n```shell\npixi run DJ_llm_analysis\n```\n\n### 🪼 Execute pre-configured Dagster pipeline\n\n```shell\npixi run DJ_dagster\n```\n![Dagit UI output](assets/pipeline/dagster_ui.png)\n\n### 🐙 Panel app\n```shell\npixi run DJ_panel_app\n```\n\n*NOTE:*\nThe dashboard generated is exported into HTML format and saved as [stock_price_twilio_dashboard](analytics_framework%2Fdashboard%2Fstock_price_twilio_dashboard.html)\n\n![Panel app output](assets/dashboard/panel_app_stock.png)\n\n### 🐵 Mito\n\nTo explore further visit [trymito.io](https://docs.trymito.io/)\n```shell\npixi run DJ_mito_app\n```\n\n[//]: # (![mito output]\u0026#40;assets/pipeline/mito_graph.png \"Graph generated via mitosheet\"\u0026#41; ![mito output operation]\u0026#40;assets/pipeline/mito_operations.png \"Operations performed via mitosheet\"\u0026#41;)\n\n\u003cdiv style=\"display: flex; justify-content: space-between;\"\u003e\n    \u003cimg src=\"assets/pipeline/mito_graph.png\" alt=\"mito_output\" width=\"400\"/\u003e\n    \u003cimg src=\"assets/pipeline/mito_operations.png\" alt=\"mito_output\" width=\"400\"/\u003e\n\u003c/div\u003e\n\n### 🦋 Display all data sources present via web UI\n\n```shell\n# Run FastHTML app\npixi run DJ_fasthtml_app\n```\n![data_sources_fasthtml.png](assets/pipeline/data_sources_fasthtml.png)\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2FDataJourneyHQ%2FDataJourney","html_url":"https://awesome.ecosyste.ms/projects/github.com%2FDataJourneyHQ%2FDataJourney","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2FDataJourneyHQ%2FDataJourney/lists"}