Ecosyste.ms: Awesome
An open API service indexing awesome lists of open source software.
https://github.com/neighborhoods/honeycomb
- Host: GitHub
- URL: https://github.com/neighborhoods/honeycomb
- Owner: neighborhoods
- Created: 2020-02-07T20:22:51.000Z (almost 5 years ago)
- Default Branch: master
- Last Pushed: 2021-09-03T19:20:29.000Z (over 3 years ago)
- Last Synced: 2024-11-09T09:49:04.087Z (2 months ago)
- Language: Python
- Size: 721 KB
- Stars: 0
- Watchers: 16
- Forks: 0
- Open Issues: 2
Metadata Files:
- Readme: README.md
- Changelog: CHANGELOG.md
README
# honeycomb #
`honeycomb` serves as an abstraction around connections to a Hadoop/Hive-based
data lake and other databases, built with extensibility to other services in
mind. It handles connections, type conversions, a degree of query generation,
and more for users, in order to greatly reduce the degree of technical
expertise and Python knowledge required to interact with the NHDS data lake
and other data repositories. In doing this, we aim to increase data
availability throughout the organization, and facilitate any individual
member feeling empowered to interact with data.

## Usage
Much of the functionality of `honeycomb` is completely under-the-hood, and as
a result the average use case is very straightforward.

### General
1. Supported engines are currently:
* Runs against data lake:
1. Hive - Queries are slightly slower than Presto, but Hive effectively has
no limit on query size and has advanced recovery/reliability functionality.
This is the suggested engine for most queries.
Must provide a schema in queries.
2. Presto - Runs queries more quickly than other engines, but has query size
limitations and returns all columns as strings. Useful for quick, ad-hoc
queries, but Hive is recommended over Presto in almost all situations.
Must provide a schema in queries.
* Does not run against data lake:
1. Google BigQuery - Runs against data stored in Google BigQuery. Currently,
the only data stored there is clickstream data and cost information on
our GA/GBQ accounts. Must provide a project ID, dataset, and table name
in queries. NOTE: BigQuery uses different SQL syntax than the lake, and this
must be reflected in your queries.
```
from honeycomb.bigquery import run_gbq_query

query = "SELECT * from `places-clickstream`.9317736.ga_sessions_20200728 LIMIT 5"
df = run_gbq_query(query, project_id='places-clickstream')
```
2. Salesforce - Runs against data stored in the Salesforce Object Manager.
NOTE: Salesforce uses SOQL, which has different syntax than the lake.
```
from honeycomb import salesforce as hcsf

query = "SELECT Id FROM Lead LIMIT 5"
df = hcsf.run_sf_query(query)

df_star = hcsf.sf_select_star(object_name='Lead',
                              where_clause='WHERE CreatedDate <= 2020-08-01T00:00-00:00',
                              limit_clause='LIMIT 5')
```

### Running Queries
The `run_lake_query` function allows for running queries through either the
`presto` or `hive` querying engines.

```
import honeycomb as hc

df0 = hc.run_lake_query('SELECT COUNT(*) FROM experimental.test_table',
                        engine='presto')
df1 = hc.run_lake_query('SELECT COUNT(*) FROM experimental.test_table',
                        engine='hive')
```

### Table Creation
`honeycomb` only supports table creation using `hive` as the engine. To create
a table in the data lake, all that is required is a dataframe and a table
name; by default, the table is written to the 'experimental' zone.

`create_table_from_df` will infer the file type to create the table with based
on the extension of the provided `filename`. If `filename` is not provided or
does not contain an extension, it will default to Parquet.

NOTE: If your dataframe contains a high degree of nesting, generating
DDL for its table can be very time-consuming. It may be faster to generate the
table with a subset of the dataframe's rows, and append the remaining rows
after the fact.

An empty directory is expected for a table's underlying storage.
If files are already present in the S3 path designated to store files
for the new table, table creation will fail.

Additional parameters (a fuller example using several of these follows the
basic example below):
1. `schema`: The schema to write to. Most users will only have access to write
to the 'experimental' zone.
2. `dtypes`: A dictionary from column names to data types. `honeycomb` automatically
converts Python/pandas data types into Hive-compliant data types, but this parameter
allows for manual override of this. For example, `pd.DateTime`s are normally converted to
`hive` datetimes, but if specified in this parameter, it could be saved as a string instead.
3. `path`: Where the files that this table references should be stored in S3. It defaults
to the table's name, and under most circumstances should be kept this way.
4. `filename`: The name under which to store the file containing the DataFrame.
If writing to the experimental zone, this is optional, and a name based on a
timestamp will be generated. However, when writing to any other lake zone, a
specified name is required, in order to ensure that the table's underlying
filesystem will be navigable in the future.
5. `table_comment`: A plaintext description of what data the table contains
and what its purpose is. When writing to the experimental zone, this is optional,
but with any other zone it is required.
6. `column_comments`: A dictionary from column names to a plaintext description
of what the column contains. This is optional when writing to the experimental zone,
but with any other zone it is required.
7. `timezones`: A dictionary from datetime columns to the timezone they
represent. If the column is timezone-naive, it will have the
timezone added to its metadata, leaving the times themselves
unmodified. If the column is timezone-aware and is in a different
timezone than the one that is specified, the column's timezone
will be converted, modifying the original times.
8. `copy_df`: Whether the operations performed on df should be performed on the
original or a copy. Keep in mind that if this is set to False,
the original df passed in will be modified as well. This is twice as
memory-efficient, but may be undesirable if the df is needed again later.
9. `partitioned_by`: Dictionary or list of tuples containing a partition name
and type. Cannot be a vanilla dictionary if using Python version < 3.6,
since plain dicts do not preserve ordering in those versions.
10. `partition_values`: Required if 'partitioned_by' is used and
'auto_upload_df' is True. List of tuples containing partition name and
value to store the dataframe under.
11. `overwrite`: If a table already exists with the specified name, determines
whether to drop that table and completely remove its underlying files, or to
raise an error. Otherwise, this parameter does nothing.
12. `auto_upload_df`: Whether the df that the table's structure will be based
off of should be automatically uploaded to the table, or just used to generate
and execute the DDL.
```
import pandas as pd
import honeycomb as hc

df = pd.DataFrame({'col1': [1, 2, 3], 'col2': [4, 5, 6]})
hc.create_table_from_df(df, table_name='test_table')
```
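As a rough sketch of how several of the parameters above could be combined
(the table name, file name, comments, Hive type strings, and partition values
below are all illustrative, not prescribed by `honeycomb`):

```
import pandas as pd
import honeycomb as hc

df = pd.DataFrame({
    'col1': [1, 2, 3],
    'col2': [4, 5, 6],
    'created_at': pd.to_datetime(['2020-07-01', '2020-07-02', '2020-07-03'])
})

hc.create_table_from_df(
    df,
    table_name='test_table_partitioned',    # illustrative name
    schema='experimental',                   # most users can only write here
    filename='test_table_partitioned.pq',    # extension implies Parquet
    table_comment='Example table demonstrating create_table_from_df options',
    column_comments={
        'col1': 'First example column',
        'col2': 'Second example column',
        'created_at': 'Record creation timestamp'
    },
    dtypes={'col2': 'string'},                  # override the inferred type
    timezones={'created_at': 'UTC'},            # label naive datetimes as UTC
    partitioned_by=[('load_date', 'string')],   # partition name and type
    partition_values=[('load_date', '2020-07-28')],
    overwrite=True,
    auto_upload_df=True
)
```

Here, `overwrite=True` means that if a table with this name already exists,
it will be dropped and its underlying files removed, as described above.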
### Table Appending
`honeycomb` only supports using `hive` as the engine for table appending.
To append a DataFrame to a table, all that is needed is the table name.
Schema can optionally be provided and defaults to 'experimental'.

```
import pandas as pd
import honeycomb as hc

df = pd.DataFrame({'col1': [7, 8, 9], 'col2': [10, 11, 12]})
hc.append_table(df, table_name='test_table')
```

As with `create_table_from_df`, a filename can be provided as well. If one is not,
a name is generated based on a timestamp. However, if writing anywhere other than
the experimental zone, a specified filename is required.
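For example (a sketch; the `filename` and `schema` keyword names are assumed
to mirror the parameters described for `create_table_from_df`, and the values
are purely illustrative):

```
import pandas as pd
import honeycomb as hc

df = pd.DataFrame({'col1': [13, 14], 'col2': [15, 16]})

# Append with an explicit filename, as would be required when writing
# anywhere other than the experimental zone (keyword names assumed to
# mirror create_table_from_df).
hc.append_table(df, table_name='test_table',
                schema='experimental',
                filename='test_table_batch_2.pq')
```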
### Table Describing
`honeycomb` can be used to obtain information on tables in the lake, such
as column names and dtypes, and, if `include_metadata` is set to True,
a table description and column comments.

```
import honeycomb as hc

hc.describe_table('test_table', schema='curated', include_metadata=True)
```

### Session-Level Configuration
`honeycomb` outputs certain messages to the screen to help interactive users
maintain awareness of what is being performed behind-the-scenes. If this
is not desirable (as may be the case for notebooks, pipelines, usage of
`honeycomb` within other packages, etc.), all non-logging output can be
disabled with `hc.set_option('verbose', False)`. This also sets the
corresponding `verbose` option in `rivet` to False.
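For example, at the top of a notebook or pipeline script:

```
import honeycomb as hc

# Silence honeycomb's informational screen output for this session;
# this also sets the corresponding 'verbose' option in rivet to False.
hc.set_option('verbose', False)
```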