{"id":13563012,"url":"https://github.com/dataux/dataux","last_synced_at":"2025-04-07T06:10:48.843Z","repository":{"id":25109376,"uuid":"28530729","full_name":"dataux/dataux","owner":"dataux","description":"Federated mysql compatible proxy to elasticsearch, mongo, cassandra, big-table, google datastore","archived":false,"fork":false,"pushed_at":"2022-05-23T23:52:12.000Z","size":5098,"stargazers_count":323,"open_issues_count":24,"forks_count":45,"subscribers_count":16,"default_branch":"master","last_synced_at":"2025-03-31T05:03:50.061Z","etag":null,"topics":["database","elasticsearch","go","golang","google-datastore","mongo","mysql-protocol","query-engine","sql","sql-query"],"latest_commit_sha":null,"homepage":"","language":"Go","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/dataux.png","metadata":{"files":{"readme":"README.md","changelog":"CHANGELOG.md","contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null}},"created_at":"2014-12-27T06:54:00.000Z","updated_at":"2025-03-24T10:09:27.000Z","dependencies_parsed_at":"2022-08-23T19:40:43.296Z","dependency_job_id":null,"html_url":"https://github.com/dataux/dataux","commit_stats":null,"previous_names":[],"tags_count":9,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/dataux%2Fdataux","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/dataux%2Fdataux/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/dataux%2Fdataux/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/dataux%2Fdataux/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/dataux","download_url":"https://codeload.github.com/data
ux/dataux/tar.gz/refs/heads/master","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":247601448,"owners_count":20964864,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["database","elasticsearch","go","golang","google-datastore","mongo","mysql-protocol","query-engine","sql","sql-query"],"created_at":"2024-08-01T13:01:14.244Z","updated_at":"2025-04-07T06:10:48.822Z","avatar_url":"https://github.com/dataux.png","language":"Go","readme":"\n## SQL Query Proxy to Elasticsearch, Mongo, Kubernetes, BigTable, etc.\n\nUnify disparate data sources and files into a single federated\nview of your data, and query it with SQL without copying it into a data warehouse.\n\n\nMySQL-compatible federated query engine to Elasticsearch, Mongo,\nGoogle Datastore, Cassandra, Google BigTable, Kubernetes, and file-based sources.\nThis query engine hosts a MySQL protocol listener,\nwhich rewrites SQL queries to native queries (elasticsearch, mongo, cassandra, kubernetes-rest-api, bigtable).\nIt works by implementing a full relational algebra distributed execution engine\nto run SQL queries and polyfill missing features\nfrom underlying sources.  
So, a backend key-value store such as Cassandra\ncan now have complete `WHERE` clause support as well as aggregate functions, etc.\n\nMost similar to [prestodb](http://prestodb.io/), but in Go, and focused on\nmaking it easy to add custom data sources as well as REST API sources.\n\n## Storage Sources\n\n* [Google Big Table](https://github.com/dataux/dataux/tree/master/backends/bigtable) SQL against [Bigtable](https://cloud.google.com/bigtable/).\n* [Elasticsearch](https://github.com/dataux/dataux/tree/master/backends/elasticsearch) Simplify access to Elasticsearch.\n* [Mongo](https://github.com/dataux/dataux/tree/master/backends/mongo) Translate SQL into Mongo queries.\n* [Google Cloud Storage / (csv, json files)](https://github.com/dataux/dataux/tree/master/backends/files) An example of a REST API backend (the list of files), where the file contents themselves are also tables.\n* [Cassandra](https://github.com/dataux/dataux/tree/master/backends/cassandra) SQL against Cassandra. Adds SQL features that are missing.\n* [Lytics](https://github.com/dataux/dataux/tree/master/backends/lytics) SQL against the [Lytics REST APIs](https://www.getlytics.com).\n* [Kubernetes](https://github.com/dataux/dataux/tree/master/backends/_kube) An example of a REST API backend.\n* [Google Big Query](https://github.com/dataux/dataux/tree/master/backends/bigquery) MySQL against the world's best analytics data warehouse, [BigQuery](https://cloud.google.com/bigquery/).\n* [Google Datastore](https://github.com/dataux/dataux/tree/master/backends/datastore) MySQL against [Datastore](https://cloud.google.com/datastore/).\n\n\n## Features\n\n* *Distributed* Run queries across multiple servers.\n* *Hackable Sources* Very easy to add a new source for your custom data: files, JSON, CSV, storage.\n* *Hackable Functions* Add custom Go functions to extend the SQL language.\n* *Joins* Get join functionality between heterogeneous sources.\n* *Frontends* Currently only the MySQL protocol is supported, but RethinkDB (for 
real-time APIs) is planned; frontends are pluggable.\n* *Backends* Elasticsearch, Google Datastore, Mongo, Cassandra, BigTable, and Kubernetes are currently implemented. CSV, JSON files, and custom formats (protobuf) are in progress.\n\n## Status\n* NOT production-ready. Currently supporting a few non-critical use-cases (ad-hoc queries, a support tool) in production.\n\n\n## Try it Out\nIn these examples we will:\n1. Create a CSV `database` of baseball data from http://seanlahman.com/baseball-archive/statistics/\n2. Connect to Google BigQuery public datasets (you will need a project, but the free quota will probably keep it free).\n\n\n```sh\n# download files to local /tmp\nmkdir -p /tmp/baseball\ncd /tmp/baseball\ncurl -Ls http://seanlahman.com/files/database/baseballdatabank-2017.1.zip \u003e bball.zip\nunzip bball.zip\n\nmv baseball*/core/*.csv .\nrm bball.zip\nrm -rf baseballdatabank-*\n\n# run a docker container locally\ndocker run -e \"LOGGING=debug\" --rm -it -p 4000:4000 \\\n  -v /tmp/baseball:/tmp/baseball \\\n  gcr.io/dataux-io/dataux:latest\n```\nIn another console, connect with the MySQL client (`mysql -h 127.0.0.1 -P 4000`), then:\n```sql\n-- Now create a new Source\nCREATE source baseball WITH {\n  \"type\":\"cloudstore\",\n  \"schema\":\"baseball\",\n  \"settings\" : {\n     \"type\": \"localfs\",\n     \"format\": \"csv\",\n     \"path\": \"baseball/\",\n     \"localpath\": \"/tmp\"\n  }\n};\n\nshow databases;\n\nuse baseball;\n\nshow tables;\n\ndescribe appearances;\n\nselect count(*) from appearances;\n\nselect * from appearances limit 10;\n```\n\nBig Query Example\n------------------------------\n\n```sh\n# assuming you are running locally; if you are in Google Cloud or Google Container Engine,\n# you don't need the credentials or the volume mount\ndocker run -e \"GOOGLE_APPLICATION_CREDENTIALS=/.config/gcloud/application_default_credentials.json\" \\\n  -e \"LOGGING=debug\" \\\n  --rm -it \\\n  -p 4000:4000 
\\n  -v ~/.config/gcloud:/.config/gcloud \\\n  gcr.io/dataux-io/dataux:latest\n\n# now that dataux is running, use the mysql client to connect\nmysql -h 127.0.0.1 -P 4000\n```\nNow run some queries:\n```sql\n-- add a bigquery datasource\nCREATE source `datauxtest` WITH {\n    \"type\":\"bigquery\",\n    \"schema\":\"bqsf_bikes\",\n    \"table_aliases\" : {\n       \"bikeshare_stations\" : \"bigquery-public-data:san_francisco.bikeshare_stations\"\n    },\n    \"settings\" : {\n      \"billing_project\" : \"your-google-cloud-project\",\n      \"data_project\" : \"bigquery-public-data\",\n      \"dataset\" : \"san_francisco\"\n    }\n};\n\nuse bqsf_bikes;\n\nshow tables;\n\ndescribe film_locations;\n\nselect * from film_locations limit 10;\n```\n\n\n**Hacking**\n\nFor now, the goal is to allow this to be used as a library, so the\n`vendor` directory is not checked in. Use Docker containers or `dep` for now.\n\n```sh\n# run dep ensure\ndep ensure -v\n```\n\nRelated Projects, Database Proxies \u0026 Multi-Data QL\n-------------------------------------------------------\n* ***Data-Accessibility*** Making it easier to query, access, share, and use data. Protocol shifting (for accessibility). 
Sharing/Replication between db types.\n* ***Scalability/Sharding*** Implementing sharding and connection sharing.\n\nName | Scaling | Ease of Access (SQL, etc.) | Comments\n---- | ------- | ----------------------------- | ---------\n***[Vitess](https://github.com/youtube/vitess)***                          | Y |   | for scaling (sharding), very mature\n***[twemproxy](https://github.com/twitter/twemproxy)***                    | Y |   | for scaling memcache\n***[Couchbase N1QL](https://github.com/couchbaselabs/query)***             | Y | Y | SQL interface to Couchbase k/v (and full-text index)\n***[prestodb](http://prestodb.io/)***                                      |   | Y | query front end to multiple backends, distributed\n***[cratedb](https://crate.io/)***                                         | Y | Y | all-in-one db, not a proxy, SQL to ES\n***[codis](https://github.com/wandoulabs/codis)***                         | Y |   | for scaling redis\n***[MariaDB MaxScale](https://github.com/mariadb-corporation/MaxScale)***  | Y |   | for scaling MySQL/MariaDB (sharding), mature\n***[Netflix Dynomite](https://github.com/Netflix/dynomite)***              | Y |   | not really SQL, just multi-store k/v\n***[redishappy](https://github.com/mdevilliers/redishappy)***              | Y |   | for scaling redis, haproxy\n***[mixer](https://github.com/siddontang/mixer)***                         | Y |   | simple MySQL sharding\n\nWe use more and more databases, flat files, message queues, etc.\nFor databases, the primary reader/writer is fine, but secondary readers\n(such as when investigating ad-hoc issues) mean we might be accessing\nand learning many different query languages.
\n\nCredit to [mixer](https://github.com/siddontang/mixer); the MySQL connection pieces were derived from it (mixer was itself forked from Vitess).\n\nInspiration/Other works\n--------------------------\n* https://github.com/linkedin/databus\n* [ql.io](http://www.ebaytechblog.com/2011/11/30/announcing-ql-io/), [yql](https://developer.yahoo.com/yql/)\n* [dockersql](https://github.com/crosbymichael/dockersql), [q -python](http://harelba.github.io/q/), [textql](https://github.com/dinedal/textql), [GitQL/GitQL](https://github.com/gitql/gitql), [GitQL](https://github.com/cloudson/gitql)\n\n\n\u003e In Internet architectures, data systems are typically categorized\n\u003e into source-of-truth systems that serve as primary stores \n\u003e for the user-generated writes, and derived data stores or \n\u003e indexes which serve reads and other complex queries. The data \n\u003e in these secondary stores is often derived from the primary data \n\u003e through custom transformations, sometimes involving complex processing \n\u003e driven by business logic. Similarly data in caching tiers is derived \n\u003e from reads against the primary data store, but needs to get \n\u003e invalidated or refreshed when the primary data gets mutated. \n\u003e A fundamental requirement emerging from these kinds of data \n\u003e architectures is the need to reliably capture, \n\u003e flow and process primary data changes.\n\nfrom [Databus](https://github.com/linkedin/databus)\n\n\nBuilding\n--------------------------\nI plan on getting `vendor` checked in soon so the build will work. However,\nI am currently trying to figure out how to organize packages to allow use as both a library\nand a daemon. 
(See how minimal main.go is, to encourage your own builtins and data sources.)\n\n```sh\n# for just docker\n\n# ensure /vendor has correct versions\ndep ensure -update\n\n# build binary\n./.build\n\n# build the docker image\ndocker build -t gcr.io/dataux-io/dataux:v0.15.1 .\n```","funding_links":[],"categories":["Go"],"sub_categories":[],"project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fdataux%2Fdataux","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fdataux%2Fdataux","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fdataux%2Fdataux/lists"}