https://github.com/druid-io/pydruid

A Python connector for Druid
Host: GitHub
URL: https://github.com/druid-io/pydruid
Owner: druid-io
License: other
Created: 2013-08-28T00:52:10.000Z (almost 13 years ago)
Default Branch: master
Last Pushed: 2025-09-11T01:52:41.000Z (10 months ago)
Last Synced: 2025-10-20T16:05:55.033Z (8 months ago)
Language: Python
Homepage:
Size: 576 KB
Stars: 519
Watchers: 53
Forks: 203
Open Issues: 90
Metadata Files:
- Readme: README.md
- Changelog: CHANGELOG.md
- License: LICENSE

# pydruid

pydruid exposes a simple API to create, execute, and analyze [Druid](http://druid.io/) queries. pydruid can parse query results into [Pandas](http://pandas.pydata.org/) DataFrame objects for subsequent data analysis -- this offers a tight integration between [Druid](http://druid.io/), the [SciPy](http://www.scipy.org/stackspec.html) stack (for scientific computing) and [scikit-learn](http://scikit-learn.org/stable/) (for machine learning). pydruid can export query results into TSV or JSON for further processing with your favorite tool, e.g., R, Julia, Matlab, Excel. It provides both synchronous and asynchronous clients.

Additionally, pydruid implements the [Python DB API 2.0](https://www.python.org/dev/peps/pep-0249/) and a [SQLAlchemy dialect](http://docs.sqlalchemy.org/en/latest/dialects/), and provides a command line interface to interact with Druid.

To install:

```bash
pip install pydruid

# or, if you intend to use the asynchronous client
pip install pydruid[async]

# or, if you intend to export query results into pandas
pip install pydruid[pandas]

# or, if you intend to do both (no space after the comma)
pip install pydruid[async,pandas]

# or, if you want to use the SQLAlchemy engine
pip install pydruid[sqlalchemy]

# or, if you want to use the CLI
pip install pydruid[cli]
```

Documentation: https://pythonhosted.org/pydruid/.

# examples

The following examples show how to execute and analyze the results of three types of queries: timeseries, topN, and groupby. We will use these queries to ask simple questions about Twitter's public data set.

## timeseries

What was the average tweet length, per day, surrounding the 2014 Sochi Olympics?

```python
from matplotlib import pyplot as plt

from pydruid.client import PyDruid
from pydruid.utils.aggregators import doublesum
from pydruid.utils.filters import Dimension
from pydruid.utils.postaggregator import Field

# druid_url_goes_here is a placeholder for the URL of your Druid broker
query = PyDruid(druid_url_goes_here, 'druid/v2')

ts = query.timeseries(
    datasource='twitterstream',
    granularity='day',
    intervals='2014-02-02/p4w',
    aggregations={'length': doublesum('tweet_length'), 'count': doublesum('count')},
    post_aggregations={'avg_tweet_length': (Field('length') / Field('count'))},
    filter=Dimension('first_hashtag') == 'sochi2014'
)

df = query.export_pandas()

# keep only the date part of each ISO 8601 timestamp
df['timestamp'] = df['timestamp'].map(lambda x: x.split('T')[0])

df.plot(x='timestamp', y='avg_tweet_length', ylim=(80, 140), rot=20,
        title='Sochi 2014')
plt.ylabel('avg tweet length (chars)')
plt.show()
```

![alt text](https://github.com/metamx/pydruid/raw/master/docs/figures/avg_tweet_length.png "Avg. tweet length")
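
For orientation, the sketch below writes out by hand roughly what the timeseries call above corresponds to in Druid's native JSON query format. It is an approximation for illustration only: the field names follow Druid's native query format, but the exact body pydruid sends is an assumption and may differ in detail.

```python
import json

# Hand-written approximation of a Druid native timeseries query.
# Not generated by pydruid; the exact request body may differ.
native_query = {
    "queryType": "timeseries",
    "dataSource": "twitterstream",
    "granularity": "day",
    "intervals": ["2014-02-02/p4w"],
    "filter": {
        "type": "selector",
        "dimension": "first_hashtag",
        "value": "sochi2014",
    },
    "aggregations": [
        {"type": "doubleSum", "name": "length", "fieldName": "tweet_length"},
        {"type": "doubleSum", "name": "count", "fieldName": "count"},
    ],
    "postAggregations": [
        {
            # avg_tweet_length = length / count
            "type": "arithmetic",
            "name": "avg_tweet_length",
            "fn": "/",
            "fields": [
                {"type": "fieldAccess", "fieldName": "length"},
                {"type": "fieldAccess", "fieldName": "count"},
            ],
        }
    ],
}

print(json.dumps(native_query, indent=2))
```

Seeing the JSON side by side with the Python call can help when comparing a pydruid query against the Druid documentation or against a query logged by the broker.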

## topN

Who were the top ten mentions (@user_name) during the 2014 Oscars?

```python
top = query.topn(
    datasource='twitterstream',
    granularity='all',
    intervals='2014-03-03/p1d',  # utc time of 2014 oscars
    aggregations={'count': doublesum('count')},
    dimension='user_mention_name',
    filter=(Dimension('user_lang') == 'en') & (Dimension('first_hashtag') == 'oscars') &
           (Dimension('user_time_zone') == 'Pacific Time (US & Canada)') &
           ~(Dimension('user_mention_name') == 'No Mention'),
    metric='count',
    threshold=10
)

df = query.export_pandas()
print(df)

# output:
#    count                 timestamp user_mention_name
# 0   1303  2014-03-03T00:00:00.000Z      TheEllenShow
# 1     44  2014-03-03T00:00:00.000Z        TheAcademy
# 2     21  2014-03-03T00:00:00.000Z               MTV
# 3     21  2014-03-03T00:00:00.000Z         peoplemag
# 4     17  2014-03-03T00:00:00.000Z               THR
# 5     16  2014-03-03T00:00:00.000Z      ItsQueenElsa
# 6     16  2014-03-03T00:00:00.000Z           eonline
# 7     15  2014-03-03T00:00:00.000Z       PerezHilton
# 8     14  2014-03-03T00:00:00.000Z     realjohngreen
# 9     12  2014-03-03T00:00:00.000Z       KevinSpacey
```

## groupby

What does the social network of users replying to other users look like?

```python
import cairo  # igraph's plot() needs the Cairo bindings installed
from igraph import Graph, plot
from pandas import concat

group = query.groupby(
    datasource='twitterstream',
    granularity='hour',
    intervals='2013-10-04/pt12h',
    dimensions=["user_name", "reply_to_name"],
    filter=(~(Dimension("reply_to_name") == "Not A Reply")) &
           (Dimension("user_location") == "California"),
    aggregations={"count": doublesum("count")}
)

df = query.export_pandas()

# map names to categorical variables with a lookup table
names = concat([df['user_name'], df['reply_to_name']]).unique()
nameLookup = dict([pair[::-1] for pair in enumerate(names)])
df['user_name_lookup'] = df['user_name'].map(nameLookup.get)
df['reply_to_name_lookup'] = df['reply_to_name'].map(nameLookup.get)

# create the graph with igraph
g = Graph(len(names), directed=False)
# each (user, reply_to) pair becomes one edge; list() is needed on Python 3
vertices = list(zip(df['user_name_lookup'], df['reply_to_name_lookup']))
g.vs["name"] = names
g.add_edges(vertices)
layout = g.layout_fruchterman_reingold()
plot(g, "tweets.png", layout=layout, vertex_size=2, bbox=(400, 400), margin=25, edge_width=1, vertex_color="blue")
```

![alt text](https://github.com/metamx/pydruid/raw/master/docs/figures/twitter_graph.png "Social Network")

# asynchronous client

```pydruid.async_client.AsyncPyDruid``` implements an asynchronous client. To achieve that, it uses an asynchronous HTTP client from the ```Tornado``` framework. The asynchronous client is suitable for use with async frameworks such as Tornado and provides much better performance at scale. It lets you serve multiple requests at the same time, without blocking while Druid executes your queries.

## example

```python
from tornado import gen

from pydruid.async_client import AsyncPyDruid
from pydruid.utils.aggregators import doublesum
from pydruid.utils.filters import Dimension

# url_to_druid_broker is a placeholder for the URL of your Druid broker
client = AsyncPyDruid(url_to_druid_broker, 'druid/v2')

@gen.coroutine
def your_asynchronous_method_serving_top10_mentions_for_day(day):
    top_mentions = yield client.topn(
        datasource='twitterstream',
        granularity='all',
        intervals="%s/p1d" % (day, ),
        aggregations={'count': doublesum('count')},
        dimension='user_mention_name',
        filter=(Dimension('user_lang') == 'en') & (Dimension('first_hashtag') == 'oscars') &
               (Dimension('user_time_zone') == 'Pacific Time (US & Canada)') &
               ~(Dimension('user_mention_name') == 'No Mention'),
        metric='count',
        threshold=10)

    # asynchronously return results
    # in Python 3 this can simply be `return top_mentions`
    raise gen.Return(top_mentions)
```
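
To illustrate why not blocking matters, here is a minimal sketch that uses only the standard library's `asyncio` as a stand-in for Tornado. `fake_druid_query` is a hypothetical placeholder that sleeps instead of talking to Druid; it is not part of pydruid.

```python
import asyncio
import time


async def fake_druid_query(name, seconds):
    # Hypothetical stand-in for an awaitable query call:
    # it sleeps instead of performing an HTTP request.
    await asyncio.sleep(seconds)
    return name


async def main():
    start = time.perf_counter()
    # Both "queries" are in flight at the same time, so the total time
    # is close to the slowest one rather than the sum of both.
    results = await asyncio.gather(
        fake_druid_query("top_mentions", 0.3),
        fake_druid_query("tweet_counts", 0.3),
    )
    elapsed = time.perf_counter() - start
    return results, elapsed


results, elapsed = asyncio.run(main())
print(results, round(elapsed, 2))
```

Run sequentially, the two calls would take at least 0.6 seconds; run concurrently, they finish in roughly the time of one.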

# thetaSketches

Theta sketch post aggregators are built slightly differently from normal post aggregators, as they have different operators.

Note: you must have the ```druid-datasketches``` extension loaded into your Druid cluster in order to use these.

See the [Druid datasketches](http://druid.io/docs/latest/development/extensions-core/datasketches-aggregators.html) documentation for details.

```python
from pydruid.client import PyDruid
from pydruid.utils import aggregators
from pydruid.utils import filters
from pydruid.utils import postaggregator

query = PyDruid(url_to_druid_broker, 'druid/v2')

ts = query.groupby(
    datasource='test_datasource',
    granularity='all',
    intervals='2016-09-01/P1M',
    filter=filters.Dimension('product').in_(['product_A', 'product_B']),
    aggregations={
        'product_A_users': aggregators.filtered(
            filters.Dimension('product') == 'product_A',
            aggregators.thetasketch('user_id')
        ),
        'product_B_users': aggregators.filtered(
            filters.Dimension('product') == 'product_B',
            aggregators.thetasketch('user_id')
        )
    },
    post_aggregations={
        'both_A_and_B': postaggregator.ThetaSketchEstimate(
            postaggregator.ThetaSketch('product_A_users') & postaggregator.ThetaSketch('product_B_users')
        )
    }
)
```
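
The `&` in the post aggregation above intersects the two sketches. As a rough mental model, the sketch below computes the same quantity exactly with plain Python sets on hypothetical user ids. Theta sketches return an approximate estimate of this count rather than the exact value.

```python
# Hypothetical user ids per product; in Druid these would be theta sketches
# built from the user_id column, not exact sets.
product_a_users = {"u1", "u2", "u3", "u4"}
product_b_users = {"u3", "u4", "u5"}

# Exact counterpart of ThetaSketch('product_A_users') & ThetaSketch('product_B_users')
both_a_and_b = len(product_a_users & product_b_users)

# Users seen for at least one of the two products
either = len(product_a_users | product_b_users)

print(both_a_and_b, either)
```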

# DB API

```python
from pydruid.db import connect

conn = connect(host='localhost', port=8082, path='/druid/v2/sql/', scheme='http')
curs = conn.cursor()
curs.execute("""
    SELECT place,
           CAST(REGEXP_EXTRACT(place, '(.*),', 1) AS FLOAT) AS lat,
           CAST(REGEXP_EXTRACT(place, ',(.*)', 1) AS FLOAT) AS lon
      FROM places
     LIMIT 10
""")
for row in curs:
    print(row)
```
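
Because the interface follows the Python DB API 2.0, the connect / cursor / execute / iterate pattern is the same as with other drivers. The sketch below shows that pattern with the standard library's `sqlite3` as a stand-in, so it runs without a Druid cluster. The table and rows are made up for illustration, and the `CREATE TABLE` / `INSERT` statements exist only to give the stand-in something to query.

```python
import sqlite3

# sqlite3 is used here only as a stand-in DB API 2.0 driver.
conn = sqlite3.connect(":memory:")
curs = conn.cursor()

# Made-up sample data for the stand-in database.
curs.execute("CREATE TABLE places (place TEXT)")
curs.executemany(
    "INSERT INTO places VALUES (?)",
    [("37.77,-122.42",), ("40.71,-74.01",)],
)

curs.execute("SELECT place FROM places ORDER BY place LIMIT 10")

# cursor.description holds one entry per result column; the name comes first.
column_names = [col[0] for col in curs.description]
rows = curs.fetchall()

print(column_names)
for row in rows:
    print(row)

conn.close()
```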

# SQLAlchemy

```python
from sqlalchemy import *
from sqlalchemy.engine import create_engine
from sqlalchemy.schema import *

engine = create_engine('druid://localhost:8082/druid/v2/sql/')  # uses HTTP by default :(
# engine = create_engine('druid+http://localhost:8082/druid/v2/sql/')
# engine = create_engine('druid+https://localhost:8082/druid/v2/sql/')

places = Table('places', MetaData(bind=engine), autoload=True)
print(select([func.count('*')], from_obj=places).scalar())
```

## Column headers

In version 0.13.0, Druid SQL added support for including the column names in the response, which can be requested via the "header" field in the request. This ensures that the cursor description is defined (a requirement for SQLAlchemy query statements) regardless of whether the result set contains any rows. Historically this was problematic for result sets that contained no rows, as one could not infer the expected column names.

The header can be enabled via the SQLAlchemy URI by using a query parameter, i.e.,

```python
engine = create_engine('druid://localhost:8082/druid/v2/sql?header=true')
```

Note that the current default is `false` to ensure backwards compatibility, but it should be set to `true` for Druid versions >= 0.13.0.
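
As an illustration of how the URI parameter relates to the request field, the sketch below parses `header=true` from the URI and builds a JSON request body containing a "header" field. `header_from_uri` and `build_sql_payload` are hypothetical helpers written for this example; they are not pydruid's implementation.

```python
import json
from urllib.parse import parse_qs, urlsplit


def header_from_uri(uri):
    # Hypothetical helper: read the "header" query parameter, defaulting to false.
    params = parse_qs(urlsplit(uri).query)
    return params.get("header", ["false"])[0].lower() == "true"


def build_sql_payload(query, header=False):
    # Hypothetical helper: JSON request body with the "query" and "header" fields.
    return json.dumps({"query": query, "header": header})


uri = "druid://localhost:8082/druid/v2/sql?header=true"
body = build_sql_payload("SELECT COUNT(*) AS cnt FROM places", header_from_uri(uri))
print(body)
```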

# Command line

```bash
$ pydruid http://localhost:8082/druid/v2/sql/
> SELECT COUNT(*) AS cnt FROM places
  cnt
-----
12345
> SELECT TABLE_NAME FROM INFORMATION_SCHEMA.TABLES;
TABLE_NAME
----------
test_table
COLUMNS
SCHEMATA
TABLES
> BYE;
GoodBye!
```

# Contributing

Contributions are welcome. We like to use `black` and `flake8`.

```bash
pip install -r requirements-dev.txt  # installs useful dev deps
pre-commit install  # installs useful commit hooks
```