https://github.com/drgfreeman/dynamo-pandas


[![unit-tests-linux](https://github.com/drgfreeman/dynamo-pandas/actions/workflows/checks.yml/badge.svg)](https://github.com/DrGFreeman/dynamo-pandas/actions/workflows/checks.yml)
[![Documentation Status](https://readthedocs.org/projects/dynamo-pandas/badge/?version=latest)](https://dynamo-pandas.readthedocs.io/en/latest/?badge=latest)

# dynamo-pandas
Make working with pandas data and AWS DynamoDB easy.

## Motivation
This package aims at making the transfer of data between pandas dataframes and DynamoDB as simple as possible. To meet this goal, the package offers two key features:
1. Automatic conversion of pandas data types to DynamoDB supported data types.
1. A simple, high-level interface to *put* data from a dataframe into a DynamoDB table and *get* all or selected items from a table into a dataframe.

## Documentation

The project's documentation is available at https://dynamo-pandas.readthedocs.io/.

## Requirements
* `python>=3.8`
* `pandas>=1.2`
* `boto3`

## Installation

```
python -m pip install dynamo-pandas
```

This installs the package and its dependencies except for `boto3`, which is not installed by default to avoid an unnecessary installation when building Lambda layers.

To include `boto3` as part of the installation, add the `boto3` "extra" this way:

```
python -m pip install dynamo-pandas[boto3]
```

## Example Usage

Consider the pandas DataFrame below.

```python
>>> print(players_df)

      player_id            last_play        play_time  rating  bonus_points
0    player_one  2021-01-18 22:47:23  2 days 17:41:55     4.3             3
1    player_two  2021-01-19 19:07:54  0 days 22:07:34     3.8             1
2  player_three  2021-01-21 10:22:43  1 days 14:01:19     2.5             4
3   player_four  2021-01-22 13:51:12  0 days 03:45:49     4.8          <NA>
```

The columns of the dataframe use different data types, some of which are not natively supported by DynamoDB, like numpy.datetime64, timedelta64 and pandas' nullable integers.

```python
>>> players_df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4 entries, 0 to 3
Data columns (total 5 columns):
 #   Column        Non-Null Count  Dtype
---  ------        --------------  -----
 0   player_id     4 non-null      object
 1   last_play     4 non-null      datetime64[ns]
 2   play_time     4 non-null      timedelta64[ns]
 3   rating        4 non-null      float64
 4   bonus_points  3 non-null      Int8
dtypes: Int8(1), datetime64[ns](1), float64(1), object(1), timedelta64[ns](1)
memory usage: 264.0+ bytes
```

Storing the rows of this dataframe to DynamoDB requires multiple data type conversions.
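As a rough illustration of what such conversions involve, the sketch below converts a single row of plain Python values into types that boto3 accepts. It is not the package's implementation: `to_dynamodb_value` is a hypothetical helper, and the string formats chosen for dates and durations are assumptions. The one firm constraint it reflects is that boto3 refuses `float` values and expects `decimal.Decimal` instead.

```python
from datetime import datetime, timedelta
from decimal import Decimal


def to_dynamodb_value(value):
    """Convert one Python value to a boto3-friendly type (hypothetical helper)."""
    if value is None:
        return None  # boto3 stores None as a DynamoDB NULL
    if isinstance(value, bool):
        return value  # checked first because bool is a subclass of int
    if isinstance(value, float):
        # boto3 rejects float; going through str avoids binary rounding noise.
        return Decimal(str(value))
    if isinstance(value, datetime):
        return value.isoformat(sep=" ")  # assumed string format
    if isinstance(value, timedelta):
        return str(value)  # assumed string format
    return value  # str and int pass through unchanged


row = {
    "player_id": "player_one",
    "last_play": datetime(2021, 1, 18, 22, 47, 23),
    "play_time": timedelta(days=2, hours=17, minutes=41, seconds=55),
    "rating": 4.3,
    "bonus_points": 3,
}
item = {name: to_dynamodb_value(value) for name, value in row.items()}
print(item["rating"])     # 4.3 (as a Decimal)
print(item["last_play"])  # 2021-01-18 22:47:23
```

Pandas-specific values such as `pd.NA` would need additional handling that this sketch leaves out.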

```python
>>> from dynamo_pandas import put_df, get_df, keys
```

The `put_df` function adds or updates the rows of a dataframe in the specified table, taking care of the required type conversions. The table must already exist and the dataframe must contain the primary key column(s).

```python
>>> put_df(players_df, table="players")
```
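Since `put_df` expects the table to exist, the sketch below shows one possible way to create a matching `players` table with boto3 beforehand. The helper names are hypothetical, and the choice of on-demand billing and of `player_id` as a string partition key are assumptions made for this example.

```python
def players_table_definition(table_name="players"):
    """Build keyword arguments for DynamoDB's CreateTable (hypothetical helper)."""
    return {
        "TableName": table_name,
        # player_id is the partition (HASH) key, stored as a string ("S").
        "KeySchema": [{"AttributeName": "player_id", "KeyType": "HASH"}],
        "AttributeDefinitions": [
            {"AttributeName": "player_id", "AttributeType": "S"}
        ],
        # On-demand billing avoids having to choose provisioned throughput.
        "BillingMode": "PAY_PER_REQUEST",
    }


def create_players_table(client, table_name="players"):
    """Create the table and block until it is ready for writes."""
    client.create_table(**players_table_definition(table_name))
    client.get_waiter("table_exists").wait(TableName=table_name)


# Usage (needs boto3 and AWS credentials):
#   import boto3
#   create_players_table(boto3.client("dynamodb"))
```

Only the key attribute is declared; DynamoDB does not require a schema for the other columns.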

The `get_df` function retrieves the items matching the specified key(s) from the table into a dataframe.

```python
>>> df = get_df(table="players", keys=[{"player_id": "player_three"}, {"player_id": "player_one"}])
>>> print(df)

   bonus_points     player_id            last_play  rating        play_time
0             4  player_three  2021-01-21 10:22:43     2.5  1 days 14:01:19
1             3    player_one  2021-01-18 22:47:23     4.3  2 days 17:41:55
```

In the case where only a partition key is used, the `keys` function simplifies the generation of the keys list.

```python
>>> df = get_df(table="players", keys=keys(player_id=["player_two", "player_four"]))
>>> print(df)

   bonus_points     player_id            last_play  rating        play_time
0           1.0    player_two  2021-01-19 19:07:54     3.8  0 days 22:07:34
1           NaN   player_four  2021-01-22 13:51:12     4.8  0 days 03:45:49
```
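To make the shape of the generated list concrete, here is a minimal stand-in that builds the same kind of structure as the explicit key list passed to `get_df` earlier. `make_keys` is a hypothetical name used for illustration and is not the package's `keys` function.

```python
def make_keys(**kwargs):
    """Expand one keyword of values into a list of single-attribute key dicts."""
    if len(kwargs) != 1:
        raise ValueError("expected exactly one partition key attribute")
    name, values = next(iter(kwargs.items()))
    return [{name: value} for value in values]


print(make_keys(player_id=["player_two", "player_four"]))
# [{'player_id': 'player_two'}, {'player_id': 'player_four'}]
```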

The data types returned by the `get_df` function are basic types and no automatic type conversion is attempted.

```python
>>> df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2 entries, 0 to 1
Data columns (total 5 columns):
 #   Column        Non-Null Count  Dtype
---  ------        --------------  -----
 0   bonus_points  1 non-null      float64
 1   player_id     2 non-null      object
 2   last_play     2 non-null      object
 3   rating        2 non-null      float64
 4   play_time     2 non-null      object
dtypes: float64(2), object(3)
memory usage: 208.0+ bytes
```

The `dtype` parameter of the `get_df` function allows specifying the desired data types.

```python
>>> df = get_df(
...     table="players",
...     keys=keys(player_id=["player_two", "player_four"]),
...     dtype={
...         "bonus_points": "Int8",
...         "last_play": "datetime64[ns, UTC]",
...         "play_time": "timedelta64[ns]"  # See note below.
...     }
... )
>>> df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2 entries, 0 to 1
Data columns (total 5 columns):
 #   Column        Non-Null Count  Dtype
---  ------        --------------  -----
 0   bonus_points  1 non-null      Int8
 1   player_id     2 non-null      object
 2   last_play     2 non-null      datetime64[ns, UTC]
 3   rating        2 non-null      float64
 4   play_time     2 non-null      timedelta64[ns]
dtypes: Int8(1), datetime64[ns, UTC](1), float64(1), object(1), timedelta64[ns](1)
memory usage: 196.0+ bytes
```

**Note**: Due to a known bug in pandas versions < 1.5, timedelta strings cannot be converted back to Timedelta type via this parameter (ref. https://github.com/pandas-dev/pandas/issues/38509). If using pandas < 1.5, use the pandas.to_timedelta function instead:

```python
>>> import pandas as pd
>>> df.play_time = pd.to_timedelta(df.play_time)
>>> df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2 entries, 0 to 1
Data columns (total 5 columns):
 #   Column        Non-Null Count  Dtype
---  ------        --------------  -----
 0   bonus_points  1 non-null      Int8
 1   player_id     2 non-null      object
 2   last_play     2 non-null      datetime64[ns, UTC]
 3   rating        2 non-null      float64
 4   play_time     2 non-null      timedelta64[ns]
dtypes: Int8(1), datetime64[ns, UTC](1), float64(1), object(1), timedelta64[ns](1)
memory usage: 196.0+ bytes

Omitting the `keys` parameter performs a scan of the table and returns all the items.

```python
>>> df = get_df(table="players")
>>> print(df)

   bonus_points     player_id            last_play  rating        play_time
0           4.0  player_three  2021-01-21 10:22:43     2.5  1 days 14:01:19
1           NaN   player_four  2021-01-22 13:51:12     4.8  0 days 03:45:49
2           3.0    player_one  2021-01-18 22:47:23     4.3  2 days 17:41:55
3           1.0    player_two  2021-01-19 19:07:54     3.8  0 days 22:07:34
```
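For background on what a full-table scan involves, DynamoDB returns scan results in pages and signals that more data remains through a `LastEvaluatedKey` entry in the response. The sketch below follows that pagination with a boto3 `Table` resource. `scan_all` is a hypothetical helper written for this example; it shows the underlying DynamoDB mechanism and makes no claim about how `get_df` is implemented internally.

```python
def scan_all(table):
    """Collect every item from a boto3 Table resource, following pagination."""
    items = []
    kwargs = {}
    while True:
        page = table.scan(**kwargs)
        items.extend(page.get("Items", []))
        last_key = page.get("LastEvaluatedKey")
        if last_key is None:
            return items  # no further pages
        # Resume the next request where the previous page stopped.
        kwargs["ExclusiveStartKey"] = last_key


# Usage (needs boto3 and AWS credentials):
#   import boto3
#   items = scan_all(boto3.resource("dynamodb").Table("players"))
```

A scan reads the entire table, so on large tables retrieving specific keys is usually cheaper.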

## License

Released under the terms of the [MIT License](LICENSE).