https://github.com/dataux/dataux

Federated mysql compatible proxy to elasticsearch, mongo, cassandra, big-table, google datastore
https://github.com/dataux/dataux

database elasticsearch go golang google-datastore mongo mysql-protocol query-engine sql sql-query

## SQL Query Proxy to Elasticsearch, Mongo, Kubernetes, BigTable, etc.

Unify disparate data sources and files into a single federated
view of your data, and query it with SQL without copying it into a data warehouse.

MySQL-compatible federated query engine for Elasticsearch, Mongo,
Google Datastore, Cassandra, Google BigTable, Kubernetes, and file-based sources.
The query engine hosts a MySQL protocol listener
that rewrites SQL queries into native requests (elasticsearch, mongo, cassandra, kubernetes-rest-api, bigtable).
It works by implementing a full relational-algebra distributed execution engine
that runs SQL queries and poly-fills features missing
from the underlying sources. So a key-value backend such as Cassandra
can now have complete `WHERE` clause support as well as aggregate functions.
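
The poly-fill idea can be illustrated with a small, self-contained Go sketch. It is not taken from the dataux code base: `Row`, `Source`, `kvStore`, `Filter`, and `CountBy` are hypothetical names, and the backend is an in-memory stand-in that can only scan. The engine layer then applies the `WHERE` predicate and the aggregate itself.

```go
package main

import "fmt"

// Row is a single record returned by a backend: column name -> value.
type Row map[string]any

// Source is a hypothetical minimal backend. It can only scan all rows,
// much like a key-value store with no native WHERE or COUNT support.
type Source interface {
	Scan() []Row
}

// kvStore is an in-memory stand-in for such a backend.
type kvStore struct{ rows []Row }

func (s kvStore) Scan() []Row { return s.rows }

// Filter applies a WHERE-style predicate in the engine, after the scan.
func Filter(rows []Row, pred func(Row) bool) []Row {
	var out []Row
	for _, r := range rows {
		if pred(r) {
			out = append(out, r)
		}
	}
	return out
}

// CountBy implements a GROUP BY col / COUNT(*) aggregate in the engine.
func CountBy(rows []Row, col string) map[any]int {
	counts := make(map[any]int)
	for _, r := range rows {
		counts[r[col]]++
	}
	return counts
}

func main() {
	// Made-up sample rows, for illustration only.
	var src Source = kvStore{rows: []Row{
		{"team": "BOS", "year": 2016},
		{"team": "BOS", "year": 2015},
		{"team": "NYA", "year": 2016},
	}}
	// Equivalent in spirit to:
	//   SELECT team, count(*) FROM t WHERE year = 2016 GROUP BY team
	recent := Filter(src.Scan(), func(r Row) bool { return r["year"] == 2016 })
	fmt.Println(CountBy(recent, "team"))
}
```

A real engine would push down whatever the backend can evaluate natively and only poly-fill the remainder; this sketch always filters after a full scan to keep the idea visible.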

Most similar to [prestodb](http://prestodb.io/), but written in Go and focused on
making it easy to add custom data sources as well as REST API sources.

## Storage Sources

* [Google Big Table](https://github.com/dataux/dataux/tree/master/backends/bigtable) SQL against [Bigtable](https://cloud.google.com/bigtable/).
* [Elasticsearch](https://github.com/dataux/dataux/tree/master/backends/elasticsearch) Simplify access to Elasticsearch.
* [Mongo](https://github.com/dataux/dataux/tree/master/backends/mongo) Translate SQL into Mongo queries.
* [Google Cloud Storage / (csv, json files)](https://github.com/dataux/dataux/tree/master/backends/files) An example of a REST API backend (the list of files); the file contents themselves are also tables.
* [Cassandra](https://github.com/dataux/dataux/tree/master/backends/cassandra) SQL against Cassandra. Adds SQL features that are missing.
* [Lytics](https://github.com/dataux/dataux/tree/master/backends/lytics) SQL against the [Lytics REST APIs](https://www.getlytics.com).
* [Kubernetes](https://github.com/dataux/dataux/tree/master/backends/_kube) An example of a REST API backend.
* [Google Big Query](https://github.com/dataux/dataux/tree/master/backends/bigquery) MySQL against the world's best analytics data warehouse, [BigQuery](https://cloud.google.com/bigquery/).
* [Google Datastore](https://github.com/dataux/dataux/tree/master/backends/datastore) MySQL against [Datastore](https://cloud.google.com/datastore/).

## Features

* *Distributed* Run queries across multiple servers.
* *Hackable Sources* Very easy to add a new source for your custom data, files, JSON, CSV, storage.
* *Hackable Functions* Add custom Go functions to extend the SQL language.
* *Joins* Get join functionality between heterogeneous sources.
* *Frontends* Currently only the MySQL protocol is supported; RethinkDB (for a real-time API) is planned. Frontends are pluggable.
* *Backends* Elasticsearch, Google Datastore, Mongo, Cassandra, BigTable, and Kubernetes are currently implemented. CSV, JSON files, and custom formats (protobuf) are in progress.
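
As a rough illustration of joining rows from heterogeneous sources, the sketch below performs an in-memory hash join in Go. It assumes both sources have already been read into a common row shape; `Row` and `HashJoin` are hypothetical names, and the actual dataux join strategy may differ.

```go
package main

import "fmt"

// Row is a record normalized to column name -> value, whatever its origin.
type Row map[string]string

// HashJoin performs an inner equi-join of left and right on the given
// columns: it builds a hash index over right, then probes it with left.
func HashJoin(left, right []Row, leftCol, rightCol string) []Row {
	index := make(map[string][]Row)
	for _, r := range right {
		index[r[rightCol]] = append(index[r[rightCol]], r)
	}
	var out []Row
	for _, l := range left {
		for _, r := range index[l[leftCol]] {
			merged := Row{}
			for k, v := range l {
				merged[k] = v
			}
			for k, v := range r {
				merged[k] = v // on a column-name clash the right side wins
			}
			out = append(out, merged)
		}
	}
	return out
}

func main() {
	// Imagine users came from a document store and orders from a CSV file.
	// Both sets of rows are made up for illustration.
	users := []Row{{"user_id": "1", "name": "ada"}, {"user_id": "2", "name": "bob"}}
	orders := []Row{{"uid": "1", "item": "book"}, {"uid": "1", "item": "pen"}, {"uid": "3", "item": "cup"}}
	for _, row := range HashJoin(users, orders, "user_id", "uid") {
		fmt.Println(row["name"], row["item"])
	}
}
```

Because neither backend needs to know about the other, the join can live entirely in the query engine once each source can produce rows.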

## Status
* NOT production ready. Currently supports a few non-critical use cases (ad-hoc queries, support tools) in production.

## Try it Out
The examples below:
1. Create a CSV `database` of baseball data from http://seanlahman.com/baseball-archive/statistics/
2. Connect to Google BigQuery public datasets (you will need a project, but the free quota will probably cover it).

```sh
# download files to local /tmp
mkdir -p /tmp/baseball
cd /tmp/baseball
curl -Ls http://seanlahman.com/files/database/baseballdatabank-2017.1.zip > bball.zip
unzip bball.zip

mv baseball*/core/*.csv .
rm bball.zip
rm -rf baseballdatabank-*

# run a docker container locally
docker run -e "LOGGING=debug" --rm -it -p 4000:4000 \
-v /tmp/baseball:/tmp/baseball \
gcr.io/dataux-io/dataux:latest

```
In another console, open the MySQL client:
```sql
-- first, from a shell, connect to the docker container you just started:
--   mysql -h 127.0.0.1 -P 4000

-- Now create a new Source
CREATE source baseball WITH {
"type":"cloudstore",
"schema":"baseball",
"settings" : {
"type": "localfs",
"format": "csv",
"path": "baseball/",
"localpath": "/tmp"
}
};

show databases;

use baseball;

show tables;

describe appearances;

select count(*) from appearances;

select * from appearances limit 10;

```
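
If you generate sources from code rather than typing them, the `CREATE source ... WITH {json}` statement above can be assembled with the standard library. This is a hedged sketch: `SourceConfig` and `CreateSourceSQL` are hypothetical helpers that mirror only the fields shown in the example, and the source name is not escaped.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// SourceConfig mirrors the JSON body of the CREATE source statement above.
// Only the fields used in that example are modeled here.
type SourceConfig struct {
	Type     string            `json:"type"`
	Schema   string            `json:"schema"`
	Settings map[string]string `json:"settings"`
}

// CreateSourceSQL renders a CREATE source statement for the given name.
// The name is inserted as-is, so callers must pass a trusted identifier.
func CreateSourceSQL(name string, cfg SourceConfig) (string, error) {
	body, err := json.MarshalIndent(cfg, "", "  ")
	if err != nil {
		return "", err
	}
	return fmt.Sprintf("CREATE source %s WITH %s;", name, body), nil
}

func main() {
	stmt, err := CreateSourceSQL("baseball", SourceConfig{
		Type:   "cloudstore",
		Schema: "baseball",
		Settings: map[string]string{
			"type":      "localfs",
			"format":    "csv",
			"path":      "baseball/",
			"localpath": "/tmp",
		},
	})
	if err != nil {
		panic(err)
	}
	fmt.Println(stmt)
}
```

The printed statement can be pasted into the MySQL client session or sent over any MySQL-protocol connection.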

BigQuery Example
------------------------------

```sh

# assuming you are running locally; if you are instead in Google Cloud or Google Container Engine,
# you don't need the credentials or the volume mount
docker run -e "GOOGLE_APPLICATION_CREDENTIALS=/.config/gcloud/application_default_credentials.json" \
-e "LOGGING=debug" \
--rm -it \
-p 4000:4000 \
-v ~/.config/gcloud:/.config/gcloud \
gcr.io/dataux-io/dataux:latest

# now that dataux is running use mysql-client to connect
mysql -h 127.0.0.1 -P 4000
```
Now run some queries:
```sql
-- add a bigquery datasource
CREATE source `datauxtest` WITH {
"type":"bigquery",
"schema":"bqsf_bikes",
"table_aliases" : {
"bikeshare_stations" : "bigquery-public-data:san_francisco.bikeshare_stations"
},
"settings" : {
"billing_project" : "your-google-cloud-project",
"data_project" : "bigquery-public-data",
"dataset" : "san_francisco"
}
};

use bqsf_bikes;

show tables;

describe film_locations;

select * from film_locations limit 10;

```

**Hacking**

For now, the goal is to allow this to be used as a library, so the
`vendor` directory is not checked in. Use docker containers or `dep` for now.

```sh
# run dep ensure
dep ensure -v

```

Related Projects, Database Proxies & Multi-Data QL
-------------------------------------------------------
* ***Data Accessibility*** Making it easier to query, access, share, and use data. Protocol shifting (for accessibility). Sharing/replication between db types.
* ***Scalability/Sharding*** Implement sharding, connection sharing.

Name | Scaling | Ease Of Access (sql, etc) | Comments
---- | ------- | ----------------------------- | ---------
***[Vitess](https://github.com/youtube/vitess)*** | Y | | for scaling (sharding), very mature
***[twemproxy](https://github.com/twitter/twemproxy)*** | Y | | for scaling memcache
***[Couchbase N1QL](https://github.com/couchbaselabs/query)*** | Y | Y | sql interface to couchbase k/v (and full-text-index)
***[prestodb](http://prestodb.io/)*** | | Y | query front end to multiple backends, distributed
***[cratedb](https://crate.io/)*** | Y | Y | all-in-one db, not a proxy, sql to es
***[codis](https://github.com/wandoulabs/codis)*** | Y | | for scaling redis
***[MariaDB MaxScale](https://github.com/mariadb-corporation/MaxScale)*** | Y | | for scaling mysql/mariadb (sharding), mature
***[Netflix Dynomite](https://github.com/Netflix/dynomite)*** | Y | | not really sql, just multi-store k/v
***[redishappy](https://github.com/mdevilliers/redishappy)*** | Y | | for scaling redis, haproxy
***[mixer](https://github.com/siddontang/mixer)*** | Y | | simple mysql sharding

We use more and more databases, flat files, message queues, etc.
For databases, the primary reader/writer is fine, but secondary uses
such as investigating ad-hoc issues mean we might be accessing
and learning many different query languages.

Credit to [mixer](https://github.com/siddontang/mixer): the MySQL connection pieces were derived from it (and it was in turn forked from vitess).

Inspiration/Other works
--------------------------
* https://github.com/linkedin/databus
* [ql.io](http://www.ebaytechblog.com/2011/11/30/announcing-ql-io/), [yql](https://developer.yahoo.com/yql/)
* [dockersql](https://github.com/crosbymichael/dockersql), [q -python](http://harelba.github.io/q/), [textql](https://github.com/dinedal/textql), [GitQL/GitQL](https://github.com/gitql/gitql), [GitQL](https://github.com/cloudson/gitql)

> In Internet architectures, data systems are typically categorized
> into source-of-truth systems that serve as primary stores
> for the user-generated writes, and derived data stores or
> indexes which serve reads and other complex queries. The data
> in these secondary stores is often derived from the primary data
> through custom transformations, sometimes involving complex processing
> driven by business logic. Similarly data in caching tiers is derived
> from reads against the primary data store, but needs to get
> invalidated or refreshed when the primary data gets mutated.
> A fundamental requirement emerging from these kinds of data
> architectures is the need to reliably capture,
> flow and process primary data changes.

from [Databus](https://github.com/linkedin/databus)

Building
--------------------------
I plan on getting the `vendor` directory checked in soon so the build will work. However,
I am currently trying to figure out how to organize packages to allow use as both a library
and a daemon (see how minimal main.go is, to encourage your own builtins and datasources).

```sh

# for just docker

# ensure /vendor has correct versions
dep ensure -update

# build binary
./.build

# build docker

docker build -t gcr.io/dataux-io/dataux:v0.15.1 .

```