Ecosyste.ms: Awesome
An open API service indexing awesome lists of open source software.
https://github.com/javitorres/datalakeStudio
Python+VueJS application to load, explore, combine, transform and deliver data
Last synced: about 2 months ago
- Host: GitHub
- URL: https://github.com/javitorres/datalakeStudio
- Owner: javitorres
- License: gpl-3.0
- Created: 2023-05-13T17:52:17.000Z (over 1 year ago)
- Default Branch: main
- Last Pushed: 2024-09-09T06:01:48.000Z (4 months ago)
- Last Synced: 2024-09-09T07:25:45.961Z (4 months ago)
- Language: Vue
- Size: 9.41 MB
- Stars: 67
- Watchers: 5
- Forks: 4
- Open Issues: 0
Metadata Files:
- Readme: README.md
- License: LICENSE
Awesome Lists containing this project
- awesome-duckdb - DatalakeStudio - Load, explore, transform your datasets and expose them via API. Integration with external APIs, S3, PostgreSQL and ChatGPT. (Tools Powered by DuckDB)
README
# Datalake Studio
Datalake Studio is an enhanced data exploration and management tool.
## Key Features of Datalake Studio:
- Quick for big data: Datalake Studio is built on top of DuckDB, a high-performance, embedded SQL OLAP database management system. DuckDB is designed to handle large datasets, making it well suited to data exploration and analysis.
- See your data: Plot your data automatically or view it on a map (points, H3 aggregations, etc.).
- Versatile data loading options: Upload data from several sources: directly from your local computer, via a URL, or from an Amazon S3 bucket. Data can also be downloaded directly from PostgreSQL databases, which is useful for database administrators and data analysts.
- Several data formats: Datalake Studio is compatible with CSV, TSV, Parquet and Shapefile formats, so you can load data without tedious conversions.
- ChatGPT integration with SQL assistants: Users with ChatGPT credentials can use SQL assistants. These assistants have contextual knowledge of your tables and fields, making data manipulation and query formulation more intuitive and efficient.
- Enhancement through remote APIs: Enrich your data by integrating information from remote APIs.
- API exposure for data sharing: After completing data transformations, users can expose their data through APIs. This makes sharing and collaboration easy, so Datalake Studio is not just a tool for data exploration but also a platform for data distribution.
![image](https://github.com/javitorres/datalakeStudio/assets/4235424/786276af-5d2e-43a5-9f14-e56e7456e3ea)
# Project build with Docker
```
docker-compose up --build
```
Open http://localhost:8080/ in your browser.
## If you don't want to use compose
```
docker build -t datalakestudioserver .
docker run --name datalakestudioserver -p 8000:8000 datalakestudioserver
docker build -t datalakestudiofront .
docker run --name datalakestudiofront -p 8080:8080 datalakestudiofront
```
# Project build without Docker
## Server
Inside the server folder, run:
```
pip3 install -r requirements.txt
python3 server.py
```
If you want to use venv:
```
python3 -m venv venv
source venv/bin/activate
```
Exit venv:
```
deactivate
```
## Client
Inside the client folder of the project, run these commands to build the Vue UI project:
```
npm install
npm run dev -- --port 8080
```
Open http://localhost:8080/ in your browser.
# Configuration files
## Server
Inside the server folder, create a file named config.yml. Example:
```
port: 8000
database: "data/datalakeStudio.db"
```
And another file named secrets.yml with these properties:
```
# Optional for DuckDB to work with S3, if not defined, user aws credentials will be loaded through the AWS Default Credentials Provider Chain
s3_access_key_id: "YOUR_S3_ACCESS_KEY_ID"
s3_secret_access_key: "YOUR_S3_SECRET_ACCESS_KEY"
# For OpenAI
openai_organization: "YOUR_OPENAI_ORGANIZATION"
openai_api_key: "YOUR_OPENAI_API_KEY"
# For API search
api_domain: "YOUR_API_DOMAIN"
api_context: "YOUR_API_CONTEXT"
# Database connections
pgpass_file: "YOUR_PG_PASS_FILE"
# Mapbox
mapbox_access_token: "YOUR_MAPBOX_ACCESS_TOKEN"
```
Also, docker-compose will get the credentials in .aws for AWS access.
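As an illustration of how these two files could be read, here is a minimal sketch that parses flat `key: value` lines like the examples above and checks whether explicit S3 keys are present. It is not the project's actual loader and not a general YAML parser (a real server would more likely use a YAML library); the function names `parse_flat_config` and `uses_default_aws_chain` are hypothetical.

```python
def parse_flat_config(text):
    """Parse flat `key: value` lines. Illustration only, NOT a YAML parser."""
    settings = {}
    for raw_line in text.splitlines():
        line = raw_line.strip()
        # Skip blank lines and full-line comments.
        if not line or line.startswith("#"):
            continue
        key, sep, value = line.partition(":")
        if not sep:
            raise ValueError(f"expected 'key: value', got: {raw_line!r}")
        value = value.strip()
        # Remove one pair of matching surrounding quotes, if present.
        if len(value) >= 2 and value[0] == value[-1] and value[0] in "\"'":
            value = value[1:-1]
        elif value.isdigit():
            value = int(value)
        settings[key.strip()] = value
    return settings


def uses_default_aws_chain(secrets):
    """True when no explicit S3 keys are set, so the AWS defaults would apply."""
    return not (secrets.get("s3_access_key_id") and secrets.get("s3_secret_access_key"))


# The config.yml example from this README.
CONFIG_TEXT = """
port: 8000
database: "data/datalakeStudio.db"
"""

config = parse_flat_config(CONFIG_TEXT)
print(config)
print(uses_default_aws_chain({}))
```

The second function mirrors the comment in secrets.yml: when the S3 keys are left out, credentials are expected to come from the AWS default provider chain instead.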
If you want to use a remote database, copy your pgpass file to the server folder. A pgpass file has the following format:
```
hostname:port:database:username:password
```
# Usage
## Load data
You can load data from the local filesystem, from any URL or from S3.
Try loading this example: https://raw.githubusercontent.com/javitorres/GenericCross/main/public/data/iris.csv
![image](https://github.com/javitorres/datalakeStudio/assets/4235424/6954818b-94f6-4438-b7b7-012f42edeb63)
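To make the three source kinds concrete, the sketch below shows one way a loader might tell a local path, a URL and an S3 location apart, and guess the file format from the extension. This is an assumption about how such a dispatch could look, not Datalake Studio's own code; `describe_source` and the extension table are hypothetical.

```python
from pathlib import PurePosixPath
from urllib.parse import urlparse

# Hypothetical mapping from file extension to the formats this README lists.
FORMATS = {".csv": "CSV", ".tsv": "TSV", ".parquet": "Parquet", ".shp": "Shapefile"}


def describe_source(location):
    """Return (kind, format) where kind is 'local', 'url' or 's3'."""
    parsed = urlparse(location)
    scheme = parsed.scheme.lower()
    if scheme == "s3":
        kind = "s3"
    elif scheme in ("http", "https"):
        kind = "url"
    else:
        # Anything without a recognised scheme is treated as a local path.
        kind = "local"
    # For remote locations, use only the path so query strings are ignored.
    path = parsed.path if kind != "local" else location
    suffix = PurePosixPath(path).suffix.lower()
    return kind, FORMATS.get(suffix)


example = "https://raw.githubusercontent.com/javitorres/GenericCross/main/public/data/iris.csv"
print(describe_source(example))
```

Unknown extensions yield `None` for the format, which leaves the decision to the caller rather than guessing.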
## Table explorer
Inspect loaded data. Export data to CSV or Parquet.
![image](https://github.com/javitorres/datalakeStudio/assets/4235424/5625c1e9-a399-4089-acd1-73381174089c)
Get a data profile:
![image](https://github.com/javitorres/datalakeStudio/assets/4235424/959a1fae-2740-488e-b9ac-5e3c8079e8dd)
Or use crossfilter to play with your data:
![image](https://github.com/javitorres/datalakeStudio/assets/4235424/392f50f8-6d8d-4a4f-a1fa-9e70c7fc652b)
If your data has spatial info, you can see it on a map:
![Screenshot from 2024-08-19 18-09-32](https://github.com/user-attachments/assets/cc91394c-1f4a-4b3b-9065-983b6efd3764)
## Query panel
Query your data and generate new tables. Save or load your queries. Use ChatGPT to create new queries.
![image](https://github.com/javitorres/datalakeStudio/assets/4235424/13de8f41-e002-4f2a-811b-a64a3fdeca19)
# Load data from APIs
Enrich your datasets calling external APIs
![image](https://github.com/javitorres/datalakeStudio/assets/4235424/8a81495b-0e40-4829-af9e-f1081f871bb9)
New table:
![image](https://github.com/javitorres/datalakeStudio/assets/4235424/d367ddfa-089d-4670-8277-0693899b50cd)
# Load data from remote databases
Explore your external databases and load data into Datalake Studio for local analysis
![image](https://github.com/javitorres/datalakeStudio/assets/4235424/948a0165-a908-43ce-b195-cdd17839f45e)
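Remote connections rely on the pgpass file described in the configuration section. As a rough sketch of that `hostname:port:database:username:password` format, the parser below reads entries using only the standard library. It follows the PostgreSQL convention that a literal `:` or `\` inside a field is escaped with a backslash; the names `PgPassEntry` and `parse_pgpass` and the sample values are hypothetical, and this is not how the project itself reads the file.

```python
from typing import NamedTuple


class PgPassEntry(NamedTuple):
    hostname: str
    port: str  # kept as text because pgpass allows "*" as a wildcard
    database: str
    username: str
    password: str


def split_pgpass_line(line):
    """Split on ':' while honouring backslash escapes (\\: and \\\\)."""
    fields, current = [], []
    chars = iter(line)
    for ch in chars:
        if ch == "\\":
            # Take the next character literally; keep a trailing backslash as is.
            current.append(next(chars, "\\"))
        elif ch == ":":
            fields.append("".join(current))
            current = []
        else:
            current.append(ch)
    fields.append("".join(current))
    return fields


def parse_pgpass(text):
    entries = []
    for line in text.splitlines():
        # Skip blank lines and comment lines.
        if not line.strip() or line.lstrip().startswith("#"):
            continue
        fields = split_pgpass_line(line)
        if len(fields) != 5:
            raise ValueError(f"expected 5 fields, got {len(fields)}: {line!r}")
        entries.append(PgPassEntry(*fields))
    return entries


# Made-up sample entry; the password contains an escaped colon.
sample = "db.example.com:5432:sales:analyst:s3cr\\:et\n"
entries = parse_pgpass(sample)
print(entries[0].hostname, entries[0].database)
```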
# Expose your data via API
Publish endpoints serving your data with parametrized queries:
![image](https://github.com/javitorres/datalakeStudio/assets/4235424/34537cf8-c59c-4167-940c-3c07a71e2cc5)
Keep track of published endpoints:
![image](https://github.com/javitorres/datalakeStudio/assets/4235424/32ee7182-228c-4130-8ce7-482e464c3c0d)
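The idea of an endpoint backed by a parametrized query can be sketched as follows. To keep the example runnable without extra dependencies it uses Python's built-in `sqlite3` as a stand-in for DuckDB, and the table, column names, `ENDPOINT_SQL` and `run_endpoint` are all hypothetical. The point is only that request parameters are bound to placeholders rather than pasted into the SQL text.

```python
import sqlite3

# In-memory stand-in database with a small, made-up table.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE iris (species TEXT, petal_length REAL)")
conn.executemany(
    "INSERT INTO iris VALUES (?, ?)",
    [("setosa", 1.4), ("setosa", 1.3), ("virginica", 5.5)],
)

# The query an endpoint might publish; :species is filled in per request.
ENDPOINT_SQL = (
    "SELECT species, petal_length FROM iris "
    "WHERE species = :species ORDER BY petal_length"
)


def run_endpoint(connection, sql, request_params):
    """Run a parametrized query and return rows as JSON-friendly dicts."""
    cursor = connection.execute(sql, request_params)
    columns = [description[0] for description in cursor.description]
    return [dict(zip(columns, row)) for row in cursor.fetchall()]


rows = run_endpoint(conn, ENDPOINT_SQL, {"species": "setosa"})
print(rows)
```

Because the value is bound as a parameter, a request such as `{"species": "setosa' OR '1'='1"}` is compared as a plain string and matches no rows.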
# Explore your S3 buckets
Browse your S3 buckets and write descriptions
![image](https://github.com/javitorres/datalakeStudio/assets/4235424/cd63c467-7cee-4fdc-8c9d-705372e8387e)
Preview files or load them into Datalake Studio
![image](https://github.com/javitorres/datalakeStudio/assets/4235424/16c1f44a-52a0-4593-9c42-4f687fe315b1)
# Talk to ChatGPT
Talk to explore your data (experimental)
![image](https://github.com/javitorres/datalakeStudio/assets/4235424/e3913bb0-5741-4cac-b702-ad30f37d5fa5)