https://github.com/eyecan-ai/pipelime-python
A swiss army knife for data processing!
- Host: GitHub
- URL: https://github.com/eyecan-ai/pipelime-python
- Owner: eyecan-ai
- License: other
- Created: 2022-03-31T09:49:57.000Z (almost 4 years ago)
- Default Branch: main
- Last Pushed: 2026-01-13T09:33:29.000Z (5 days ago)
- Last Synced: 2026-01-13T11:53:27.510Z (5 days ago)
- Topics: ai, dataops, dataset, deeplearning, mlops, python
- Language: Python
- Homepage: https://pipelime-python.readthedocs.io/
- Size: 5.1 MB
- Stars: 19
- Watchers: 4
- Forks: 0
- Open Issues: 1
Metadata Files:
- Readme: README.md
- License: LICENSE
# 🍋 `pipelime`
[Documentation](https://pipelime-python.readthedocs.io/en/latest/?badge=latest)
[PyPI package](https://badge.fury.io/py/pipelime-python)

*If life gives you lemons, use `pipelime`.*
Welcome to **pipelime**, a swiss army knife for data processing!
`pipelime` is a full-fledged **framework** for **data science**: read your datasets,
manipulate them and write them back to disk.
Then build up your **dataflow** with Piper and manage the configuration with Choixe.
Finally, **embed** your custom commands into the `pipelime` workspace, where they act as both dataflow nodes and an advanced command line interface.
Maybe too much for you? No worries, `pipelime` is **modular** and you can just take out what you need:
- **data processing scripts**: use the powerful `SamplesSequence` and create your own data processing pipelines, with a simple and intuitive API. Parallelization works out-of-the-box and, moreover, you can easily serialize your pipelines to yaml/json. Integrations with popular frameworks, e.g., [pytorch](https://pytorch.org/), are also provided.
- **easy dataflow**: `Piper` can manage and execute directed acyclic graphs (DAGs), reporting progress through sockets or custom callbacks.
- **configuration management**: `Choixe` is a simple and intuitive mini scripting language designed to ease the creation of configuration files with the help of variables, symbol importing, for loops, switch statements, parameter sweeps and more.
- **command line interface**: `pipelime` can remove all the boilerplate code needed to create a beautiful CLI for your scripts and packages. You focus on *what matters* and we provide input parsing, advanced interfaces for complex arguments, automatic help generation and configuration management. Also, any `PipelimeCommand` can be used as a node in a dataflow for free!
- **pydantic tools**: most of the classes in `pipelime` derive from [`pydantic.BaseModel`](https://docs.pydantic.dev/), so we have built some useful tools to, e.g., inspect their structure, auto-generate human-friendly documentation and more (including a TUI to help you write input data to [deserialize](https://docs.pydantic.dev/usage/models/#helper-functions) any pydantic model).
---
## Installation
Install `pipelime` using pip:
```
pip install pipelime-python
```
To be able to *draw* the dataflow graphs, you need the `draw` variant:
```
pip install pipelime-python[draw]
```
> **Warning**
>
> The `draw` variant needs `Graphviz` installed on your system.
> On Linux Ubuntu/Debian, you can install it with:
>
> ```
> sudo apt-get install graphviz graphviz-dev
> ```
>
> Alternatively, you can use `conda`:
>
> ```
> conda install --channel conda-forge pygraphviz
> ```
>
> Please see the full options at https://github.com/pygraphviz/pygraphviz/blob/main/INSTALL.txt
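
Before trying to draw a graph, it can help to check that the prerequisites are visible from Python. The snippet below is a minimal sketch using only the standard library: the helper name `draw_prerequisites` is hypothetical, and it assumes Graphviz exposes its usual `dot` executable on the `PATH` and that the Python binding is importable as `pygraphviz`.

```python
import importlib.util
import shutil


def draw_prerequisites() -> dict:
    """Report whether the pieces needed for graph drawing can be found."""
    return {
        # Graphviz ships the `dot` executable; None means it is not on PATH
        "graphviz_dot": shutil.which("dot") is not None,
        # find_spec returns None when the module cannot be imported
        "pygraphviz": importlib.util.find_spec("pygraphviz") is not None,
    }


status = draw_prerequisites()
for name, found in status.items():
    print(f"{name}: {'found' if found else 'missing'}")
```

If either entry is reported as missing, the installation commands above are the place to start.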
## Basic Usage
### Underfolder Format
The **Underfolder** format is the preferred `pipelime` dataset format, i.e., a flexible way to
model and store a generic dataset through the **filesystem**.

An Underfolder **dataset** is a collection of samples. A **sample** is a collection of items.
An **item** is a unitary block of data, e.g., a multi-channel image, a python object,
a dictionary and more.
Any valid underfolder dataset must contain a subfolder named `data` with samples
and items. Also, *global shared* items can be stored in the root folder.
Items are named using the following naming convention: `$ID_$ITEM.$EXT`

Where:
* `$ID` is the sample index, which must be a unique integer for each sample.
* `$ITEM` is the item name.
* `$EXT` is the item extension.
We currently support many common file formats and others can be added by users:
* `.png`, `.jpeg/.jpg/.jfif/.jpe`, `.bmp` for images
* `.tiff/.tif` for multi-page images and multi-dimensional numpy arrays
* `.yaml/.yml`, `.json` and `.toml/.tml` for metadata
* `.txt` for numpy 2D matrix notation
* `.npy` for general numpy arrays
* `.pkl/.pickle` for picklable python objects
* `.bin` for generic binary data
Root files follow the same convention but they lack the sample identifier part, i.e., `$ITEM.$EXT`.
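
To make the layout concrete, here is a small standard-library sketch that builds a toy Underfolder in a temporary directory and groups its files by sample. It does not use `pipelime` itself: the `scan_underfolder` helper and the file names are hypothetical, and the regular expression assumes the sample index and the item name are joined by an underscore.

```python
import re
import tempfile
from collections import defaultdict
from pathlib import Path

# Assumed pattern: digits, an underscore, the item name, then the extension
SAMPLE_FILE = re.compile(r"^(?P<id>\d+)_(?P<item>.+?)\.(?P<ext>[^.]+)$")


def scan_underfolder(root: Path):
    """Group files under `data` by sample index; collect root files as shared items."""
    samples = defaultdict(dict)
    for path in sorted((root / "data").iterdir()):
        match = SAMPLE_FILE.match(path.name)
        if match:
            samples[int(match["id"])][match["item"]] = path.name
    # Root files carry no sample identifier: $ITEM.$EXT
    shared = {p.stem: p.name for p in sorted(root.iterdir()) if p.is_file()}
    return dict(samples), shared


with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    (root / "data").mkdir()
    # Hypothetical file names, chosen only for illustration
    for name in ["000_image.png", "000_label.txt", "001_image.png", "001_label.txt"]:
        (root / "data" / name).touch()
    (root / "categories.yaml").touch()

    samples, shared = scan_underfolder(root)

print(samples)  # two samples, each with an "image" and a "label" item
print(shared)   # one shared item named "categories"
```

In practice you would let `pipelime` read the folder for you; the sketch only shows how samples, items and shared root items map onto file names.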
### Reading an Underfolder Dataset
`pipelime` provides an intuitive interface to read, manipulate and write Underfolder datasets.
No complex signatures, weird object iterators, or boilerplate code: you just need a `SamplesSequence`.
```python
from pipelime.sequences import SamplesSequence
# Read an underfolder dataset with a single line of code
dataset = SamplesSequence.from_underfolder('tests/sample_data/datasets/underfolder_minimnist')
# A dataset behaves like a Sequence
print(len(dataset)) # the number of samples
sample = dataset[4] # get the fifth sample
# A sample is a mapping
print(len(sample)) # the number of items
print(set(sample.keys())) # the items' keys
# An item is an object wrapping the actual data
image_item = sample["image"] # get the "image" item from the sample
print(type(image_item))  # the item class wrapping the image
image = image_item()  # actually loads the data from disk (may have been on the cloud as well)
print(type(image))  # the type of the loaded data
```
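
The idea that an item only wraps the data, and that calling it performs the actual load, can be illustrated with a toy class. This is a sketch of the pattern rather than `pipelime`'s implementation: `LazyJsonItem`, its `reads` counter and the file name are hypothetical, and a plain `dict` stands in for a sample.

```python
import json
import tempfile
from pathlib import Path


class LazyJsonItem:
    """Toy stand-in for an item: holds a path and reads it only when called."""

    def __init__(self, path: Path):
        self._path = path
        self.reads = 0  # how many times the file has actually been read

    def __call__(self):
        self.reads += 1
        return json.loads(self._path.read_text())


with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "000_metadata.json"
    path.write_text(json.dumps({"label": 7}))

    sample = {"metadata": LazyJsonItem(path)}  # a sample maps keys to items
    item = sample["metadata"]                  # no disk access yet
    reads_before = item.reads
    data = item()                              # the file is read here

print(reads_before, item.reads)
print(data)
```

Deferring the read keeps indexing and filtering cheap, since the data is touched only when it is actually needed.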
### Writing an Underfolder Dataset
You can **write** a dataset by calling the associated operation:
```python
# Attach a "write" operation to the dataset
dataset = dataset.to_underfolder('/tmp/my_output_dataset')
# Now run over all the samples
dataset.run()
# Alternatively, spawn multiple processes if needed
dataset.run(num_workers=4)
```