Ecosyste.ms: Awesome
An open API service indexing awesome lists of open source software.
https://github.com/e2fyi/iotoolz
A unified IO interface to access both local and remote resources (e.g. http/https, s3, etc.).
- Host: GitHub
- URL: https://github.com/e2fyi/iotoolz
- Owner: e2fyi
- License: apache-2.0
- Created: 2020-10-12T15:11:41.000Z (over 4 years ago)
- Default Branch: master
- Last Pushed: 2022-12-08T12:44:33.000Z (about 2 years ago)
- Last Synced: 2024-11-07T00:06:47.288Z (4 months ago)
- Language: Python
- Size: 286 KB
- Stars: 3
- Watchers: 3
- Forks: 2
- Open Issues: 5
Metadata Files:
- Readme: README.md
- Changelog: CHANGELOG.md
- License: LICENSE
Awesome Lists containing this project
README
# iotoolz
[PyPI version](https://badge.fury.io/py/iotoolz)
[Build Status](https://travis-ci.com/github/e2fyi/iotoolz)
[Coverage Status](https://coveralls.io/github/e2fyi/iotoolz?branch=master)
[Documentation Status](https://iotoolz.readthedocs.io/en/latest/?badge=latest)
[Code style: black](https://github.com/psf/black)
[Downloads](https://pepy.tech/project/iotoolz)
[FOSSA Status](https://app.fossa.com/projects/git%2Bgithub.com%2Fe2fyi%2Fiotoolz?ref=badge_shield)

`iotoolz` is an improvement over `e2fyi-utils` and is inspired partly by `toolz`.
`iotoolz` is a library that provides a consistent developer experience for interacting with any IO resource. It provides an abstract class `iotoolz.AbcStream` which mimics python's native `open` very closely (with some additional parameters and methods, such as `save`).

API documentation can be found at [https://iotoolz.readthedocs.io/en/latest/](https://iotoolz.readthedocs.io/en/latest/). Change logs are available in [CHANGELOG.md](https://github.com/e2fyi/iotoolz/blob/master/CHANGELOG.md).
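As a minimal sketch of that `open`-like interface (the file path and payload below are illustrative), a stream can be read and written like any file object, with the extra `save` method flushing the buffer back to the destination:

```py
from iotoolz import open_stream

# read like the built-in open
with open_stream("path/to/notes.txt", mode="r") as stream:
    text = stream.read()

# write, then flush the buffer back to the destination with `save`
with open_stream("path/to/notes.txt", mode="w") as stream:
    stream.write("hello world")
    stream.save()
```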
> - Python 3.6.1 and above
> - Licensed under [Apache-2.0](./LICENSE).

## Supported streams
Currently, the following streams are supported (the URI scheme determines which stream class is used, as sketched after this list):
- `iotoolz.FileStream`: wrapper over built-in `open` function (`file://`)
- `iotoolz.TempStream`: in-memory stream that will rollover to disk (`tmp://`, `temp://`)
- `iotoolz.HttpStream`: http or https stream implemented with `requests` (`http://`, `https://`)
- `iotoolz.extensions.S3Stream`: s3 stream implemented with `boto3` (`s3://`, `s3a://`, `s3n://`)
- `iotoolz.extensions.MinioStream`: s3 stream implemented with `minio` (`minio://`)
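For instance, opening a `s3://` URI should resolve to `S3Stream` (a minimal sketch, assuming the `boto3` extra is installed and credentials are configured; the bucket and key are illustrative):

```py
from iotoolz import Stream
from iotoolz.extensions import S3Stream

# the s3:// scheme resolves to the boto3-backed S3Stream
stream = Stream("s3://bucket/key.txt")
assert isinstance(stream, S3Stream)
```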
## Installation

```bash
# install the default packages only (most lightweight)
pip install iotoolz

# install dependencies for a specific extension
pip install iotoolz[boto3]

# install all the extras
pip install iotoolz[all]
```

Available extras:
- `all`: All the optional dependencies
- `boto3`: `boto3` for `iotoolz.extensions.S3Stream`
- `minio`: `minio` for `iotoolz.extensions.MinioStream`
- `chardet`: install the LGPL `chardet` for guessing encoding. Otherwise `charset_normalizer` will be used.

## Quickstart
The helper object `iotoolz.streams.stream_factory` is a default singleton of `iotoolz.streams.Streams` provided to support most of the common use cases.

`iotoolz.streams.open_stream` (alias `iotoolz.streams.Stream`) is a utility method provided by the singleton helper to create a stream object. This method accepts the same arguments as python's `open` method with the following additional parameters (see the sketch after this list):

- `data`: optional str or bytes that will be passed into the stream
- `fileobj`: optional file-like object which will be copied into the stream
- `content_type`: optional mime type information to describe the stream (e.g. application/json)
- `inmem_size`: how much memory to allocate to the stream before rolling over to the local file system. Defaults to no limit (which may result in a MemoryError for very large streams).
- `schema_kwargs`: optional mapping of schemas to their default kwargs.
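A minimal sketch of these additional parameters (the `tmp://` URI and JSON payload are illustrative):

```py
from iotoolz import open_stream

# seed the stream with an initial payload and a mime type hint,
# and roll the buffer over to disk once it grows past ~1 MB
with open_stream(
    "tmp://example.json",
    mode="r",
    data='{"hello": "world"}',
    content_type="application/json",
    inmem_size=10**6,
) as stream:
    print(stream.read())
```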
### Basic Setup

```py
import boto3

from iotoolz.streams import (
    set_schema_kwargs,
    set_buffer_rollover_size,
)

# set params to pass to the Stream obj handling https
# i.e. HttpStream (implemented with requests)
set_schema_kwargs(
    "https",
    verify=False,  # do not verify ssl cert
    use_post=True,  # use POST instead of PUT when writing to https
)

# use a custom client for S3Stream (via boto3)
set_schema_kwargs(
    "s3",
    client=boto3.client(
        "s3",
        aws_access_key_id=ACCESS_KEY,
        aws_secret_access_key=SECRET_KEY,
        aws_session_token=SESSION_TOKEN,
    ),
)

# use custom credentials for MinioStream
set_schema_kwargs(
    "minio",
    access_key=ACCESS_KEY,
    secret_key=SECRET_KEY,
    secure=True,
)

# buffer will rollover to disk if the data is more than 100 MB
# (by default everything is kept in-memory - may result in MemoryError)
set_buffer_rollover_size(10**8)
```

### Opening streams
You can open any stream just like python's built-in `open` method.
```py
import pandas as pd

from iotoolz import open_stream
# print line by line some data from a https endpoint
# and do not verify the ssl cert of the https endpoint
with open_stream(
    "https://foo/bar/data.txt",
    mode="r",
    schema_kwargs={"https": {"verify": False}}
) as stream:
    for line in stream:
        print(line)

# POST some binary content to a http endpoint (default is PUT)
with open_stream("https://foo.bar/api/data", "wb", use_post=True) as stream:
    stream.write(b"hello world")

# copy a local file to s3
with open_stream("path/to/data.csv", "r") as csv_source, \
        open_stream("s3://bucket/foobar.txt?StorageClass=STANDARD", "w") as s3_sink:
    # pipe content in csv_source to s3_sink
    csv_source.pipe(s3_sink)

# load a pandas dataframe from a s3 fileobj
with open_stream("s3://bucket/foobar.csv", "r") as csv:
    df = pd.read_csv(csv)
```
## TempStream
`TempStream` is a stream that can function like a virtual file system in memory.
```py
import gc

from iotoolz import Stream, exists, glob, iter_dir

# this stream can be garbage collected
Stream("tmp://foo/bar/data.txt", data="foobar")

# True if not gc yet, False if already gc
exists("tmp://foo/bar/data.txt")

# force gc
gc.collect()

# will not exist anymore
exists("tmp://foo/bar/data.txt")

# create temp streams with strong refs (hence they will not be gc)
s1 = Stream("tmp://foo/bar/data.txt", data="foobar")
s2 = Stream("tmp://foo/example.txt", data="...")

# returns s1 and s2
iter_dir("tmp://foo/")

# returns s1 only
glob("tmp://foo/bar/*.txt")
```

## Stream-like operations
`Stream` is an alias of `open_stream`; both methods return a concrete `AbcStream` object. You can treat the object as both a "file-like" and "stream-like" object - i.e. you can read, write, seek, flush, and close the object.

> NOTE
>
> By default, the underlying buffer is in-memory. You can enable rollover to disk by
> passing the `inmem_size` arg to the method, or update the default `inmem_size` value
> with the `iotoolz.streams.set_buffer_rollover_size` method.

```py
from iotoolz import open_stream, Stream, set_buffer_rollover_size

# `Stream` is an alias of `open_stream`
assert open_stream == Stream

# rollover to disk if data is over 100 MB
set_buffer_rollover_size(10**8)

# you can overwrite the default kwargs here also
stream = Stream(
    "path/to/data.txt",
    mode="rw",  # you can both read and write to a stream
)
# stream is lazily evaluated, nothing will be buffered until you call some methods
# that require the data
data = stream.read()
# will attempt to provide encoding and content_type if not provided when opening the stream
print(stream.encoding)
print(stream.content_type)
# stream has the same interface as an IO object - i.e. you can seek, flush, close, etc
stream.seek(5)  # go to offset 5 from start of buffer
stream.write("replace with this text")
stream.seek(0, whence=2)  # go to end of buffer
stream.write("additional text after original eof")  # continue writing to the end of the buffer
stream.save()  # flush and save the entire buffer to the same dst location
stream.close()  # close the stream
```

## Path-like operations
`exists`, `mkdir`, `iter_dir` and `glob` are path-like methods that are available on the
stream object. These methods mimic their equivalents in `pathlib.Path` where appropriate.

| method                       | supported streams                          | desc                                                                                        |
| ---------------------------- | ------------------------------------------ | ------------------------------------------------------------------------------------------- |
| `stats`                      | All Streams                                | return the `StreamInfo` for an existing resource                                             |
| `unlink`, `delete`, `remove` | All Streams                                | delete and remove the stream (except for `TempStream`, where the buffer is cleared instead)  |
| `exists`                     | All Streams                                | check if a stream points to an existing resource                                             |
| `mkdir`                      | `FileStream`                               | create a directory                                                                           |
| `rmdir`                      | `FileStream`, `TempStream`, and `S3Stream` | remove recursively everything in the directory                                               |
| `iter_dir`                   | `FileStream`, `TempStream`, and `S3Stream` | iterate through the streams in the directory                                                 |
| `glob`                       | `FileStream`, `TempStream`, and `S3Stream` | iterate through the streams in the directory that match a pattern                            |

```py
from iotoolz import Stream, mkdir, iter_dir, glob, exists, stats, unlink, rmdir

# similar to 'mkdir -p'
mkdir("path/to/folder", parents=True, exist_ok=True)
Stream("path/to/folder").mkdir(parents=True, exist_ok=True)

# list objects in an s3 bucket
iter_dir("s3://bucket/prefix/")
for stream in Stream("s3://bucket/prefix/").iter_dir():
    print(stream.uri)

# find s3 objects with a specific pattern
glob("s3://bucket/prefix/*txt")
for stream in Stream("s3://bucket/prefix/").glob("*.txt"):
    print(stream.uri)

# exists
exists("s3://bucket/prefix/foo.txt")

# stats
info = stats("s3://bucket/prefix/foo.txt")
print(info.name)
print(info.content_type)
print(info.encoding)
print(info.last_modified)
print(info.etag)
print(info.extras)

# delete resource
unlink("s3://bucket/prefix/foo.txt")

# rm all keys with the prefix
rmdir("s3://bucket/prefix/")
```
## Piping streams
`pipe` is a method to push data to a sink (similar to NodeJS streams, except there is no
watermark or buffering).

```py
from iotoolz.streams import open_stream

local_file = open_stream(
    "path/to/google.html", content_type="text/html", mode="w"
)
temp_file = open_stream(
    "tmp://google.html", content_type="text/html", mode="wb"
)

# when source is closed, all sinks will be closed also
with open_stream("https://google.com") as source:
    # writes to a temp file then to a local file in sequence
    source.pipe(temp_file).pipe(local_file)

local_file2 = open_stream(
    "path/to/google1.html", content_type="text/html", mode="w"
)
local_file3 = open_stream(
    "path/to/google2.html", content_type="text/html", mode="w"
)

# when source is closed, all sinks will be closed also
with open_stream("tmp://foo_src", mode="w") as source:
    # writes in a fan-out manner
    source.pipe(local_file2)
    source.pipe(local_file3)

    source.write("hello world")
```

> TODO: support transform streams so that pipe can be more useful
## License
[FOSSA Status](https://app.fossa.com/projects/git%2Bgithub.com%2Fe2fyi%2Fiotoolz?ref=badge_large)