https://github.com/clarifai/clarifai-python-datautils
Extract Transform and Load unstructured data into the Clarifai's AI platform
https://github.com/clarifai/clarifai-python-datautils
dataanalysis dataengineering ingestion ingestion-pipeline unstructured-data unstructured-data-analysis unstructured-image unstructured-text
Last synced: about 1 month ago
JSON representation
Extract Transform and Load unstructured data into the Clarifai's AI platform
- Host: GitHub
- URL: https://github.com/clarifai/clarifai-python-datautils
- Owner: Clarifai
- License: apache-2.0
- Created: 2023-09-21T16:17:14.000Z (about 2 years ago)
- Default Branch: main
- Last Pushed: 2025-01-31T11:44:39.000Z (8 months ago)
- Last Synced: 2025-03-27T04:51:17.272Z (7 months ago)
- Topics: dataanalysis, dataengineering, ingestion, ingestion-pipeline, unstructured-data, unstructured-data-analysis, unstructured-image, unstructured-text
- Language: Python
- Homepage:
- Size: 1.02 MB
- Stars: 6
- Watchers: 6
- Forks: 0
- Open Issues: 2
-
Metadata Files:
- Readme: README.md
- Changelog: CHANGELOG.md
- License: LICENSE
Awesome Lists containing this project
README

# Clarifai Python Data Utils
[](https://discord.gg/M32V7a7a)
[](https://pypi.org/project/clarifai-datautils)This is a collection of utilities for handling various types of multimedia data. Enhance your experience by seamlessly integrating these utilities with the Clarifai Python SDK. This powerful combination empowers you to address both visual and textual use cases effortlessly through the capabilities of Artificial Intelligence. Unlock new possibilities and elevate your projects with the synergy of versatile data utilities and the robust features offered by the [Clarifai Python SDK](https://github.com/Clarifai/clarifai-python). Explore the fusion of these tools to amplify the intelligence in your applications! 🌐🚀
[Website](https://www.clarifai.com/) | [Schedule Demo](https://www.clarifai.com/company/schedule-demo) | [Signup for a Free Account](https://clarifai.com/signup) | [API Docs](https://docs.clarifai.com/) | [Clarifai Community](https://clarifai.com/explore) | [Python SDK Docs](https://docs.clarifai.com/python-sdk/api-reference) | [Examples](https://github.com/Clarifai/examples) | [Colab Notebooks](https://github.com/Clarifai/colab-notebooks) | [Discord](https://discord.gg/XAPE3Vtg)
---
## Table Of Contents* **[Installation](#installation)**
* **[Getting Started](#getting-started)**
* **[Features](#features)**
* [Image Utils](#image-utils)
* [Data Ingestion Pipeline](#ingestion-pipeline)
* **[Usage](#usage)**
* **[Examples](#more-examples)**## Installation
Install from PyPi:
```bash
pip install clarifai-datautils
```Install from Source:
```bash
git clone https://github.com/Clarifai/clarifai-python-datautils
cd clarifai-python-datautils
python3 -m venv env
source env/bin/activate
pip3 install -r requirements.txt
```## Getting started
Quick intro to Image Annotation Conversion feature
```python
from clarifai_datautils.image import ImageAnnotationsannotated_dataset = ImageAnnotations.import_from(path= 'folder_path', format= 'annotation_format')
```## Features
### Image Utils
- #### Annotation Loader
- Load various annotated image datasets and export to clarifai Platform
- Convert from one annotation format to other supported annotation formats### Data Ingestion Pipeline
- Easy to use pipelines to load data from files and ingest into clarifai platfrom.
- Load text files(pdf, doc, etc..) , transform, chunk and upload to the Clarifai Platform## Usage
### Image Annotation Loader#### Setup
To use Image Annotation Loader, please install the extra libs required for `annotations````python
from clarifai_datautils.image import ImageAnnotations
#import from folder
coco_dataset = ImageAnnotations.import_from(path='folder_path',format= 'coco_detection')#Using clarifai SDK to upload to Clarifai Platform
#export CLARIFAI_PAT={your personal access token} # set PAT as env variable
from clarifai.client.dataset import Dataset
dataset = Dataset(user_id="user_id", app_id="app_id", dataset_id="dataset_id")
dataset.upload_dataset(dataloader=coco_dataset.dataloader)#info about loaded dataset
coco_dataset.get_info()#exporting to other formats
coco_dataset.export_to('voc_detection')
```### Data Ingestion Pipelines
#### Setup
To use Data Ingestion Pipeline, please run
```python
pip install -r requirements-dev.txt
``````python
from clarifai_datautils.text import Pipeline, PDFPartition
from clarifai_datautils.text.pipeline.cleaners import Clean_extra_whitespace# Define the pipeline
pipeline = Pipeline(
name='pipeline-1',
transformations=[
PDFPartition(chunking_strategy = "by_title",max_characters = 1024),
Clean_extra_whitespace()
]
)# Using SDK to upload
from clarifai.client import Dataset
dataset = Dataset(dataset_url)
dataset.upload_dataset(pipeline.run(files = file_path, loader = True))```
## More Examples
See many more code examples in this [repo](https://github.com/Clarifai/examples).