https://github.com/drgfreeman/dynamo-pandas
Make working with pandas data and AWS DynamoDB easy
- Host: GitHub
- URL: https://github.com/drgfreeman/dynamo-pandas
- Owner: DrGFreeman
- License: mit
- Created: 2021-03-07T03:44:03.000Z (almost 4 years ago)
- Default Branch: main
- Last Pushed: 2024-04-01T02:47:05.000Z (10 months ago)
- Last Synced: 2024-10-04T13:17:10.911Z (4 months ago)
- Topics: aws, aws-dynamodb, boto3, database, dataframe, deserialization, dynamo-pandas, dynamodb, interface, pandas, serialization
- Language: Python
- Homepage: https://dynamo-pandas.readthedocs.io/en/stable/
- Size: 162 KB
- Stars: 21
- Watchers: 4
- Forks: 6
- Open Issues: 4
Metadata Files:
- Readme: README.md
- Changelog: CHANGELOG.md
- License: LICENSE
[![unit-tests-linux](https://github.com/drgfreeman/dynamo-pandas/actions/workflows/checks.yml/badge.svg)](https://github.com/DrGFreeman/dynamo-pandas/actions/workflows/checks.yml)
[![Documentation Status](https://readthedocs.org/projects/dynamo-pandas/badge/?version=latest)](https://dynamo-pandas.readthedocs.io/en/latest/?badge=latest)

# dynamo-pandas

Make working with pandas data and AWS DynamoDB easy.

## Motivation

This package aims at making the transfer of data between pandas dataframes and DynamoDB as simple as possible. To meet this goal, the package offers two key features:
1. Automatic conversion of pandas data types to DynamoDB supported data types.
1. A simple, high level interface to *put* data from a dataframe into a DynamoDB table and *get* all or selected items from a table into a dataframe.

## Documentation

The project's documentation is available at https://dynamo-pandas.readthedocs.io/.
## Requirements
* `python>=3.8`
* `pandas>=1.2`
* `boto3`

## Installation

```
python -m pip install dynamo-pandas
```

This will install the package and its dependencies except for `boto3`, which is not installed by default to avoid unnecessary installation when building Lambda layers.
To include `boto3` as part of the installation, add the `boto3` "extra" this way:
```
python -m pip install dynamo-pandas[boto3]
```

## Example Usage

Consider the pandas DataFrame below.
```python
>>> print(players_df)
      player_id           last_play       play_time  rating  bonus_points
0    player_one 2021-01-18 22:47:23 2 days 17:41:55     4.3             3
1    player_two 2021-01-19 19:07:54 0 days 22:07:34     3.8             1
2  player_three 2021-01-21 10:22:43 1 days 14:01:19     2.5             4
3   player_four 2021-01-22 13:51:12 0 days 03:45:49     4.8          <NA>
```

The columns of the dataframe use different data types, some of which are not natively supported by DynamoDB, like `numpy.datetime64`, `timedelta64` and pandas' nullable integers.
```python
>>> players_df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4 entries, 0 to 3
Data columns (total 5 columns):
 #   Column        Non-Null Count  Dtype
---  ------        --------------  -----
 0   player_id     4 non-null      object
 1   last_play     4 non-null      datetime64[ns]
 2   play_time     4 non-null      timedelta64[ns]
 3   rating        4 non-null      float64
 4   bonus_points  3 non-null      Int8
dtypes: Int8(1), datetime64[ns](1), float64(1), object(1), timedelta64[ns](1)
memory usage: 264.0+ bytes
```

Storing the rows of this dataframe to DynamoDB requires multiple data type conversions.
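
To give an idea of what these conversions involve, here is a minimal sketch of a per-value converter. It is not the package's implementation: the helper name `to_dynamodb_value` is hypothetical, and the choice of storing timestamps and timedeltas as strings is an assumption based on the values shown later in this README. The one hard constraint it reflects is that boto3 rejects Python floats and expects `Decimal` for non-integer numbers.

```python
import numbers
from decimal import Decimal

import pandas as pd


def to_dynamodb_value(value):
    """Convert one pandas scalar to a type boto3 can serialize (illustrative only)."""
    if pd.isna(value):
        # Covers NaN, NaT and pd.NA; stored as a DynamoDB null.
        return None
    if isinstance(value, (pd.Timestamp, pd.Timedelta)):
        # Assumption: date/time values are stored as their string form.
        return str(value)
    if isinstance(value, float):
        # boto3 does not accept float; go through str to avoid binary noise.
        return Decimal(str(value))
    if isinstance(value, numbers.Integral) and not isinstance(value, bool):
        # Normalize numpy integers (e.g. numpy.int8) to plain int.
        return int(value)
    return value


row = {
    "player_id": "player_one",
    "last_play": pd.Timestamp("2021-01-18 22:47:23"),
    "play_time": pd.Timedelta("2 days 17:41:55"),
    "rating": 4.3,
    "bonus_points": pd.NA,
}
item = {name: to_dynamodb_value(value) for name, value in row.items()}
```

With `put_df`, none of this has to be written by hand.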
```python
>>> from dynamo_pandas import put_df, get_df, keys
```

The `put_df` function adds or updates the rows of a dataframe into the specified table, taking care of the required type conversions. The table must already exist and the dataframe must contain the primary key column(s).
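
Since the table has to exist beforehand, it could be created with boto3 along the following lines. This is a sketch under assumptions: it treats `player_id` as a string partition key with no sort key (which matches how keys are used in the examples of this README), and on-demand billing is an arbitrary choice for illustration. The function names are hypothetical.

```python
def players_table_spec(table_name="players"):
    """Build the arguments for DynamoDB's CreateTable call."""
    return {
        "TableName": table_name,
        # Assumption: player_id is the partition (HASH) key, no sort key.
        "KeySchema": [{"AttributeName": "player_id", "KeyType": "HASH"}],
        "AttributeDefinitions": [
            {"AttributeName": "player_id", "AttributeType": "S"}
        ],
        # Assumption: on-demand billing, so no throughput settings are needed.
        "BillingMode": "PAY_PER_REQUEST",
    }


def create_players_table(table_name="players"):
    """Create the table and block until it is ready (needs AWS credentials)."""
    import boto3  # imported lazily so the spec can be built without boto3

    client = boto3.client("dynamodb")
    client.create_table(**players_table_spec(table_name))
    client.get_waiter("table_exists").wait(TableName=table_name)
```

Only the key attributes are declared; DynamoDB does not require a schema for the remaining columns.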
```python
>>> put_df(players_df, table="players")
```

The `get_df` function retrieves the items matching the specified key(s) from the table into a dataframe.
```python
>>> df = get_df(table="players", keys=[{"player_id": "player_three"}, {"player_id": "player_one"}])
>>> print(df)
   bonus_points     player_id            last_play  rating        play_time
0             4  player_three  2021-01-21 10:22:43     2.5  1 days 14:01:19
1             3    player_one  2021-01-18 22:47:23     4.3  2 days 17:41:55
```

In the case where only a partition key is used, the `keys` function simplifies the generation of the keys list.
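
Judging from the explicit list of key dictionaries passed to `get_df` earlier, the result is presumably one single-attribute dictionary per value. The plain-Python stand-in below illustrates that shape; `partition_keys` is a hypothetical name and is not the package's `keys` function.

```python
def partition_keys(**kwargs):
    """Expand one keyword of values into a list of single-attribute key dicts."""
    if len(kwargs) != 1:
        raise ValueError("expected exactly one partition key attribute")
    name, values = next(iter(kwargs.items()))
    return [{name: value} for value in values]


key_list = partition_keys(player_id=["player_two", "player_four"])
# key_list == [{"player_id": "player_two"}, {"player_id": "player_four"}]
```

The package's own `keys` function is used the same way: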
```python
>>> df = get_df(table="players", keys=keys(player_id=["player_two", "player_four"]))
>>> print(df)
   bonus_points     player_id            last_play  rating        play_time
0           1.0    player_two  2021-01-19 19:07:54     3.8  0 days 22:07:34
1           NaN   player_four  2021-01-22 13:51:12     4.8  0 days 03:45:49
```

The data types returned by the `get_df` function are basic types and no automatic type conversion is attempted.
```python
>>> df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2 entries, 0 to 1
Data columns (total 5 columns):
 #   Column        Non-Null Count  Dtype
---  ------        --------------  -----
 0   bonus_points  1 non-null      float64
 1   player_id     2 non-null      object
 2   last_play     2 non-null      object
 3   rating        2 non-null      float64
 4   play_time     2 non-null      object
dtypes: float64(2), object(3)
memory usage: 208.0+ bytes
```

The `dtype` parameter of the `get_df` function allows specifying the desired data types.
```python
>>> df = get_df(
...     table="players",
...     keys=keys(player_id=["player_two", "player_four"]),
...     dtype={
...         "bonus_points": "Int8",
...         "last_play": "datetime64[ns, UTC]",
...         "play_time": "timedelta64[ns]"  # See note below.
...     }
... )
>>> df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2 entries, 0 to 1
Data columns (total 5 columns):
 #   Column        Non-Null Count  Dtype
---  ------        --------------  -----
 0   bonus_points  1 non-null      Int8
 1   player_id     2 non-null      object
 2   last_play     2 non-null      datetime64[ns, UTC]
 3   rating        2 non-null      float64
 4   play_time     2 non-null      timedelta64[ns]
dtypes: Int8(1), datetime64[ns, UTC](1), float64(1), object(1), timedelta64[ns](1)
memory usage: 196.0+ bytes
```

**Note**: Due to a known bug in pandas versions < 1.5, timedelta strings cannot be converted back to Timedelta type via this parameter (ref. https://github.com/pandas-dev/pandas/issues/38509). If using pandas < 1.5, use the `pandas.to_timedelta` function instead:
```python
>>> import pandas as pd
>>> df.play_time = pd.to_timedelta(df.play_time)
>>> df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2 entries, 0 to 1
Data columns (total 5 columns):
 #   Column        Non-Null Count  Dtype
---  ------        --------------  -----
 0   bonus_points  1 non-null      Int8
 1   player_id     2 non-null      object
 2   last_play     2 non-null      datetime64[ns, UTC]
 3   rating        2 non-null      float64
 4   play_time     2 non-null      timedelta64[ns]
dtypes: Int8(1), datetime64[ns, UTC](1), float64(1), object(1), timedelta64[ns](1)
memory usage: 196.0+ bytes
```

Omitting the `keys` parameter performs a scan of the table and returns all the items.
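
For context, a single DynamoDB `Scan` request returns at most 1 MB of data and signals remaining items through `LastEvaluatedKey`, so reading a whole table means following that key until it is absent. The sketch below shows that loop against a boto3 `Table`-like object. Whether `get_df` does exactly this internally is an assumption, and `scan_all` is a hypothetical helper name.

```python
def scan_all(table, **scan_kwargs):
    """Collect every item from a boto3 Table-like object, following pagination."""
    items = []
    while True:
        page = table.scan(**scan_kwargs)
        items.extend(page.get("Items", []))
        last_key = page.get("LastEvaluatedKey")
        if last_key is None:
            # No more pages: the scan is complete.
            return items
        # Resume the next request where the previous page stopped.
        scan_kwargs["ExclusiveStartKey"] = last_key
```

With boto3 installed and credentials configured, it could be called as `scan_all(boto3.resource("dynamodb").Table("players"))`. A full scan reads every item, so its cost grows with the table size. With `get_df`, the same result is a single call: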
```python
>>> df = get_df(table="players")
>>> print(df)
   bonus_points     player_id            last_play  rating        play_time
0           4.0  player_three  2021-01-21 10:22:43     2.5  1 days 14:01:19
1           NaN   player_four  2021-01-22 13:51:12     4.8  0 days 03:45:49
2           3.0    player_one  2021-01-18 22:47:23     4.3  2 days 17:41:55
3           1.0    player_two  2021-01-19 19:07:54     3.8  0 days 22:07:34
```

## License

Released under the terms of the [MIT License](LICENSE).