https://github.com/robhowley/s3-streaming

stream and (de)serialize s3 streams

Topics: aws, file-io, s3, stream-processing

[![License: MIT](https://img.shields.io/badge/License-MIT-yellow.svg)](https://opensource.org/licenses/MIT)

# s3-streaming: handling (big) S3 files like regular files
Storing, retrieving and using files in S3 is a routine activity, so it should be easy. It should also ...
* stream the data
* expose an API that feels like regular Python file I/O
* handle some of the deserialization and compression stuff, because why not

## Install

```bash
pip install s3-streaming
```

## Streaming S3 objects like regular files

### The basics
Opening and reading S3 objects is similar to regular Python I/O. The only difference is that you need to provide a
`boto3.session.Session` instance to handle bucket access.

```python
import boto3
from s3streaming import s3_open

with s3_open('s3://bucket/key', boto_session=boto3.session.Session()) as f:
    for next_line in f:
        print(next_line)
```
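
If your AWS credentials live in a named profile or a non-default region, you can configure the `boto3` session before handing it to `s3_open`. This is a minimal sketch; the profile and region names below are placeholders for whatever your environment uses.

```python
import boto3
from s3streaming import s3_open

# Placeholder profile/region; substitute your own configuration.
session = boto3.session.Session(profile_name='my-profile', region_name='us-east-1')

with s3_open('s3://bucket/key', boto_session=session) as f:
    for next_line in f:
        print(next_line)
```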

### Injecting deserialization and compression handling in stream
Consider a file that is `gzip` compressed and contains lines of `json`. There's some boilerplate in dealing with that,
but why bother? Just handle that in stream.

```python
import boto3
from s3streaming import s3_open, deserialize, compression

reader_settings = dict(
    boto_session=boto3.session.Session(),
    deserializer=deserialize.json_lines,
    compression=compression.gzip,
)

with s3_open('s3://bucket/key.gzip', **reader_settings) as f:
    for next_line in f:
        print(next_line.keys())    # because the file was decompressed ...
        print(next_line.values())  # ... and the json is now a loaded dict!
```

Other `deserialize` options include
* `csv`
* `csv_as_dict`
* `tsv`
* `tsv_as_dict`
* `string`
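
As a quick sketch of using one of these, the snippet below reads a hypothetical uncompressed CSV object with `csv_as_dict`, which presumably yields one dict per row; the bucket and key are placeholders.

```python
import boto3
from s3streaming import s3_open, deserialize

# Hypothetical uncompressed CSV object, so no compression setting is passed.
with s3_open('s3://bucket/data.csv',
             boto_session=boto3.session.Session(),
             deserializer=deserialize.csv_as_dict) as f:
    for row in f:  # each row is presumably parsed into a dict
        print(row)
```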