https://github.com/dataux/dataux

Federated mysql compatible proxy to elasticsearch, mongo, cassandra, big-table, google datastore

Host: GitHub
URL: https://github.com/dataux/dataux
Owner: dataux
License: mit
Created: 2014-12-27T06:54:00.000Z (over 10 years ago)
Default Branch: master
Last Pushed: 2022-05-23T23:52:12.000Z (about 3 years ago)
Last Synced: 2025-03-31T05:03:50.061Z (2 months ago)
Topics: database, elasticsearch, go, golang, google-datastore, mongo, mysql-protocol, query-engine, sql, sql-query
Language: Go
Homepage:
Size: 4.86 MB
Stars: 323
Watchers: 16
Forks: 45
Open Issues: 24
Metadata Files:
- Readme: README.md
- Changelog: CHANGELOG.md
- License: LICENSE

## SQL Query Proxy to Elasticsearch, Mongo, Kubernetes, BigTable, etc.

Unify disparate data sources and files into a single federated view of your data, and query it with SQL without copying it into a data warehouse.

A MySQL-compatible federated query engine for Elasticsearch, Mongo, Google Datastore, Cassandra, Google BigTable, Kubernetes, and file-based sources. The query engine hosts a MySQL protocol listener that rewrites SQL queries into each backend's native form (elasticsearch, mongo, cassandra, kubernetes-rest-api, bigtable). It works by implementing a full relational-algebra distributed execution engine that runs SQL queries and poly-fills features missing from the underlying sources. So a key-value backend such as Cassandra can now have complete `WHERE` clause support as well as aggregate functions.
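The poly-fill idea can be sketched in a few lines of Go. This is an illustration only, not dataux's actual code: `KVSource`, `Filter`, `Count`, and the sample data are hypothetical names invented for this sketch. It assumes a backend that can only scan rows, and shows the engine applying the `WHERE` predicate and the `COUNT(*)` aggregate itself.

```go
package main

import (
	"fmt"
	"sort"
)

// Row is one record returned by a backend scan.
type Row map[string]interface{}

// KVSource is a hypothetical stand-in for a key-value backend: it can scan
// rows by key but has no native filtering or aggregation.
type KVSource struct {
	rows map[string]Row
}

// Scan returns every row in key order. It is the only read operation this
// toy backend supports.
func (s *KVSource) Scan() []Row {
	keys := make([]string, 0, len(s.rows))
	for k := range s.rows {
		keys = append(keys, k)
	}
	sort.Strings(keys)
	out := make([]Row, 0, len(keys))
	for _, k := range keys {
		out = append(out, s.rows[k])
	}
	return out
}

// Predicate is a WHERE clause compiled to a Go function.
type Predicate func(Row) bool

// Filter evaluates the WHERE clause in the engine, because the backend cannot.
func Filter(rows []Row, where Predicate) []Row {
	out := make([]Row, 0, len(rows))
	for _, r := range rows {
		if where(r) {
			out = append(out, r)
		}
	}
	return out
}

// Count poly-fills COUNT(*) over whatever rows survived the filter.
func Count(rows []Row) int { return len(rows) }

// samplePlayers builds a small in-memory table (made-up data).
func samplePlayers() *KVSource {
	return &KVSource{rows: map[string]Row{
		"p1": {"id": "p1", "games": 162},
		"p2": {"id": "p2", "games": 42},
		"p3": {"id": "p3", "games": 120},
	}}
}

func main() {
	// Roughly: SELECT count(*) FROM players WHERE games > 100
	matched := Filter(samplePlayers().Scan(), func(r Row) bool {
		g, ok := r["games"].(int)
		return ok && g > 100
	})
	fmt.Println("count:", Count(matched))
}
```

A real engine would push as much of the predicate as possible down to the backend and only poly-fill the remainder; the sketch scans everything to keep the idea visible.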

Most similar to [prestodb](http://prestodb.io/), but written in Go and focused on making it easy to add custom data sources, including REST API sources.

## Storage Sources

* [Google Big Table](https://github.com/dataux/dataux/tree/master/backends/bigtable) SQL against [Bigtable](https://cloud.google.com/bigtable/).

* [Elasticsearch](https://github.com/dataux/dataux/tree/master/backends/elasticsearch) Simplify access to Elasticsearch.

* [Mongo](https://github.com/dataux/dataux/tree/master/backends/mongo) Translate SQL into Mongo queries.

* [Google Cloud Storage / (csv, json files)](https://github.com/dataux/dataux/tree/master/backends/files) An example of a REST API backend (the list of files); the file contents themselves are also exposed as tables.

* [Cassandra](https://github.com/dataux/dataux/tree/master/backends/cassandra) SQL against Cassandra. Adds SQL features that Cassandra lacks.

* [Lytics](https://github.com/dataux/dataux/tree/master/backends/lytics) SQL against the [Lytics REST APIs](https://www.getlytics.com).

* [Kubernetes](https://github.com/dataux/dataux/tree/master/backends/_kube) An example of a REST API backend.

* [Google Big Query](https://github.com/dataux/dataux/tree/master/backends/bigquery) MySQL against the world's best analytics data warehouse, [BigQuery](https://cloud.google.com/bigquery/).

* [Google Datastore](https://github.com/dataux/dataux/tree/master/backends/datastore) MySQL against [Datastore](https://cloud.google.com/datastore/).

## Features

* *Distributed* Run queries across multiple servers.

* *Hackable Sources* Very easy to add a new Source for your custom data, files, json, csv, storage.

* *Hackable Functions* Add custom Go functions to extend the SQL language.

* *Joins* Get join functionality between heterogeneous sources.

* *Frontends* Currently only the MySQL protocol is supported, but RethinkDB (for a real-time API) is planned; frontends are pluggable.

* *Backends* Elasticsearch, Google Datastore, Mongo, Cassandra, BigTable, and Kubernetes are currently implemented. CSV, JSON files, and custom formats (protobuf) are in progress.
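To make the *Joins* feature concrete, here is a minimal sketch of an in-memory hash join over rows that are assumed to have already been scanned from two different backends. The `Row` type, the `HashJoin` function, and the sample data are hypothetical and are not taken from dataux; the source does not say which join algorithm dataux uses, so a plain hash join is used here for illustration.

```go
package main

import "fmt"

// Row is a flat record; string values keep the sketch short.
type Row map[string]string

// HashJoin performs an inner equi-join of left and right rows on
// left[leftKey] == right[rightKey]. Output columns are prefixed with
// "l." and "r." so same-named columns from the two sources do not collide.
func HashJoin(left, right []Row, leftKey, rightKey string) []Row {
	// Build phase: index the right-hand rows by their join key.
	index := make(map[string][]Row)
	for _, r := range right {
		k, ok := r[rightKey]
		if !ok {
			continue // rows without the key cannot match
		}
		index[k] = append(index[k], r)
	}
	// Probe phase: look up each left-hand row in the index.
	var out []Row
	for _, l := range left {
		k, ok := l[leftKey]
		if !ok {
			continue
		}
		for _, r := range index[k] {
			merged := make(Row, len(l)+len(r))
			for c, v := range l {
				merged["l."+c] = v
			}
			for c, v := range r {
				merged["r."+c] = v
			}
			out = append(out, merged)
		}
	}
	return out
}

func main() {
	// Pretend users came from one backend and orders from another (made-up data).
	users := []Row{{"user_id": "1", "name": "ada"}, {"user_id": "2", "name": "bob"}}
	orders := []Row{{"uid": "1", "item": "book"}, {"uid": "1", "item": "pen"}, {"uid": "3", "item": "ink"}}
	for _, row := range HashJoin(users, orders, "user_id", "uid") {
		fmt.Println(row["l.name"], row["r.item"])
	}
}
```

Because the join runs in the engine rather than in either backend, neither source needs to know about the other; the cost is that both sides must be read into the engine first.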

## Status

* NOT Production ready.  Currently supporting a few non-critical use-cases (ad-hoc queries, support tool) in production.

## Try it Out

There are two examples:

1. Create a CSV `database` of baseball data from http://seanlahman.com/baseball-archive/statistics/

2. Connect to Google BigQuery public datasets (you will need a project, but the free quota will probably keep it free).

```sh
# download files to local /tmp
mkdir -p /tmp/baseball
cd /tmp/baseball
curl -Ls http://seanlahman.com/files/database/baseballdatabank-2017.1.zip > bball.zip
unzip bball.zip
mv baseball*/core/*.csv .
rm bball.zip
rm -rf baseballdatabank-*

# run a docker container locally
docker run -e "LOGGING=debug" --rm -it -p 4000:4000 \
  -v /tmp/baseball:/tmp/baseball \
  gcr.io/dataux-io/dataux:latest
```

In another console, open the MySQL client:

```sql
-- From a shell, connect to the docker container you just started:
--   mysql -h 127.0.0.1 -P 4000

-- Now create a new Source
CREATE source baseball WITH {
  "type":"cloudstore",
  "schema":"baseball",
  "settings" : {
     "type": "localfs",
     "format": "csv",
     "path": "baseball/",
     "localpath": "/tmp"
  }
};

show databases;

use baseball;

show tables;

describe appearances;

select count(*) from appearances;

select * from appearances limit 10;
```

Big Query Example
------------------------------

```sh
# assuming you are running locally; if you are instead in Google Cloud or Google Container Engine,
# you don't need the credentials or volume mount
docker run -e "GOOGLE_APPLICATION_CREDENTIALS=/.config/gcloud/application_default_credentials.json" \
  -e "LOGGING=debug" \
  --rm -it \
  -p 4000:4000 \
  -v ~/.config/gcloud:/.config/gcloud \
  gcr.io/dataux-io/dataux:latest

# now that dataux is running use mysql-client to connect
mysql -h 127.0.0.1 -P 4000
```

Now run some queries:

```sql
-- add a bigquery datasource
CREATE source `datauxtest` WITH {
    "type":"bigquery",
    "schema":"bqsf_bikes",
    "table_aliases" : {
       "bikeshare_stations" : "bigquery-public-data:san_francisco.bikeshare_stations"
    },
    "settings" : {
      "billing_project" : "your-google-cloud-project",
      "data_project" : "bigquery-public-data",
      "dataset" : "san_francisco"
    }
};

use bqsf_bikes;

show tables;

describe film_locations;

select * from film_locations limit 10;
```

**Hacking**

For now, the goal is to allow this to be used as a library, so the `vendor` directory is not checked in. Use docker containers or `dep` for now.

```sh
# run dep ensure
dep ensure -v
```

Related Projects, Database Proxies & Multi-Data QL
-------------------------------------------------------

* ***Data-Accessibility*** Making it easier to query, access, share, and use data. Protocol shifting (for accessibility). Sharing/replication between db types.

* ***Scalability/Sharding*** Implement sharding, connection sharing.

Name | Scaling | Ease Of Access (sql, etc) | Comments
---- | ------- | ----------------------------- | ---------
***[Vitess](https://github.com/youtube/vitess)***                          | Y |   | for scaling (sharding), very mature
***[twemproxy](https://github.com/twitter/twemproxy)***                    | Y |   | for scaling memcache
***[Couchbase N1QL](https://github.com/couchbaselabs/query)***             | Y | Y | sql interface to couchbase k/v (and full-text-index)
***[prestodb](http://prestodb.io/)***                                      |   | Y | query front end to multiple backends, distributed
***[cratedb](https://crate.io/)***                                         | Y | Y | all-in-one db, not a proxy, sql to es
***[codis](https://github.com/wandoulabs/codis)***                         | Y |   | for scaling redis
***[MariaDB MaxScale](https://github.com/mariadb-corporation/MaxScale)***  | Y |   | for scaling mysql/mariadb (sharding), mature
***[Netflix Dynomite](https://github.com/Netflix/dynomite)***              | Y |   | not really sql, just multi-store k/v
***[redishappy](https://github.com/mdevilliers/redishappy)***              | Y |   | for scaling redis, haproxy
***[mixer](https://github.com/siddontang/mixer)***                         | Y |   | simple mysql sharding

We use more and more databases, flat files, message queues, etc. For databases, the primary reader/writer is fine, but secondary readers, such as people investigating ad-hoc issues, may end up accessing and learning many different query languages.

Credit to [mixer](https://github.com/siddontang/mixer): the MySQL connection pieces are derived from it (mixer was itself forked from vitess).

Inspiration/Other works
--------------------------

* https://github.com/linkedin/databus

* [ql.io](http://www.ebaytechblog.com/2011/11/30/announcing-ql-io/), [yql](https://developer.yahoo.com/yql/)

* [dockersql](https://github.com/crosbymichael/dockersql), [q -python](http://harelba.github.io/q/), [textql](https://github.com/dinedal/textql), [GitQL/GitQL](https://github.com/gitql/gitql), [GitQL](https://github.com/cloudson/gitql)

> In Internet architectures, data systems are typically categorized
> into source-of-truth systems that serve as primary stores
> for the user-generated writes, and derived data stores or
> indexes which serve reads and other complex queries. The data
> in these secondary stores is often derived from the primary data
> through custom transformations, sometimes involving complex processing
> driven by business logic. Similarly data in caching tiers is derived
> from reads against the primary data store, but needs to get
> invalidated or refreshed when the primary data gets mutated.
> A fundamental requirement emerging from these kinds of data
> architectures is the need to reliably capture,
> flow and process primary data changes.

from [Databus](https://github.com/linkedin/databus)

Building
--------------------------

I plan on getting `vendor` checked in soon so the build will work. However, I am still figuring out how to organize the packages so the project can be used both as a library and as a daemon. (See how minimal main.go is; this is meant to encourage your own builtins and datasources.)

```sh
# for just docker

# ensure /vendor has correct versions
dep ensure -update

# build binary
./.build

# build docker
docker build -t gcr.io/dataux-io/dataux:v0.15.1 .
```
