{"id":26122936,"url":"https://github.com/clarifai/clarifai-python-datautils","last_synced_at":"2025-10-18T14:23:26.710Z","repository":{"id":275135141,"uuid":"694755675","full_name":"Clarifai/clarifai-python-datautils","owner":"Clarifai","description":"Extract Transform and Load unstructured data into the Clarifai's AI platform","archived":false,"fork":false,"pushed_at":"2025-01-31T11:44:39.000Z","size":1074,"stargazers_count":6,"open_issues_count":2,"forks_count":0,"subscribers_count":6,"default_branch":"main","last_synced_at":"2025-03-27T04:51:17.272Z","etag":null,"topics":["dataanalysis","dataengineering","ingestion","ingestion-pipeline","unstructured-data","unstructured-data-analysis","unstructured-image","unstructured-text"],"latest_commit_sha":null,"homepage":"","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"apache-2.0","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/Clarifai.png","metadata":{"files":{"readme":"README.md","changelog":"CHANGELOG.md","contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2023-09-21T16:17:14.000Z","updated_at":"2025-02-26T18:09:36.000Z","dependencies_parsed_at":"2025-01-31T12:42:31.571Z","dependency_job_id":null,"html_url":"https://github.com/Clarifai/clarifai-python-datautils","commit_stats":null,"previous_names":["clarifai/clarifai-python-datautils"],"tags_count":7,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/Clarifai%2Fclarifai-python-datautils","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/Clarifai%2Fclarifai-python-datautils/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/re
positories/Clarifai%2Fclarifai-python-datautils/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/Clarifai%2Fclarifai-python-datautils/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/Clarifai","download_url":"https://codeload.github.com/Clarifai/clarifai-python-datautils/tar.gz/refs/heads/main","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":248483041,"owners_count":21111419,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["dataanalysis","dataengineering","ingestion","ingestion-pipeline","unstructured-data","unstructured-data-analysis","unstructured-image","unstructured-text"],"created_at":"2025-03-10T15:13:34.204Z","updated_at":"2025-10-18T14:23:21.663Z","avatar_url":"https://github.com/Clarifai.png","language":"Python","readme":"![Clarifai logo](docs/logo.png)\n\n# Clarifai Python Data Utils\n\n\n[![Discord](https://img.shields.io/discord/1145701543228735582)](https://discord.gg/M32V7a7a)\n[![codecov](https://img.shields.io/pypi/dm/clarifai)](https://pypi.org/project/clarifai-datautils)\n\n\nThis is a collection of utilities for handling various types of multimedia data. Enhance your experience by seamlessly integrating these utilities with the Clarifai Python SDK. This powerful combination empowers you to address both visual and textual use cases effortlessly through the capabilities of Artificial Intelligence. 
Unlock new possibilities and elevate your projects with the synergy of versatile data utilities and the robust features offered by the [Clarifai Python SDK](https://github.com/Clarifai/clarifai-python). Explore the fusion of these tools to amplify the intelligence in your applications! 🌐🚀\n\n[Website](https://www.clarifai.com/) | [Schedule Demo](https://www.clarifai.com/company/schedule-demo) | [Sign up for a Free Account](https://clarifai.com/signup) | [API Docs](https://docs.clarifai.com/) | [Clarifai Community](https://clarifai.com/explore) | [Python SDK Docs](https://docs.clarifai.com/python-sdk/api-reference) | [Examples](https://github.com/Clarifai/examples) | [Colab Notebooks](https://github.com/Clarifai/colab-notebooks) | [Discord](https://discord.gg/XAPE3Vtg)\n\n---\n## Table Of Contents\n\n* **[Installation](#installation)**\n* **[Getting Started](#getting-started)**\n* **[Features](#features)**\n  * [Image Utils](#image-utils)\n  * [Data Ingestion Pipeline](#data-ingestion-pipeline)\n* **[Usage](#usage)**\n* **[Examples](#more-examples)**\n\n\n## Installation\n\n\nInstall from PyPI:\n\n```bash\npip install clarifai-datautils\n```\n\nInstall from Source:\n\n```bash\ngit clone https://github.com/Clarifai/clarifai-python-datautils\ncd clarifai-python-datautils\npython3 -m venv env\nsource env/bin/activate\npip3 install -r requirements.txt\n```\n\n\n## Getting started\n\nA quick introduction to the image annotation conversion feature:\n\n```python\nfrom clarifai_datautils.image import ImageAnnotations\n\nannotated_dataset = ImageAnnotations.import_from(path='folder_path', format='annotation_format')\n```\n\n## Features\n\n### Image Utils\n- #### Annotation Loader\n  - Load various annotated image datasets and export them to the Clarifai Platform\n  - Convert from one annotation format to other supported annotation formats\n\n### Data Ingestion Pipeline\n  - Easy-to-use pipelines to load data from files and ingest it into the Clarifai Platform.\n  - Load text files (pdf, doc, etc.)
, transform, chunk and upload to the Clarifai Platform\n\n## Usage\n### Image Annotation Loader\n\n#### Setup\nTo use the Image Annotation Loader, install the extra libraries required for `annotations`.\n\n```python\nfrom clarifai_datautils.image import ImageAnnotations\n\n# Import an annotated dataset from a folder\ncoco_dataset = ImageAnnotations.import_from(path='folder_path', format='coco_detection')\n\n# Use the Clarifai SDK to upload to the Clarifai Platform\n# export CLARIFAI_PAT={your personal access token}  # set PAT as env variable\nfrom clarifai.client.dataset import Dataset\ndataset = Dataset(user_id=\"user_id\", app_id=\"app_id\", dataset_id=\"dataset_id\")\ndataset.upload_dataset(dataloader=coco_dataset.dataloader)\n\n# Get info about the loaded dataset\ncoco_dataset.get_info()\n\n# Export to another supported format\ncoco_dataset.export_to('voc_detection')\n```\n\n\n### Data Ingestion Pipelines\n\n#### Setup\nTo use the Data Ingestion Pipeline, run:\n```bash\npip install -r requirements-dev.txt\n```\n\n\n```python\nfrom clarifai_datautils.text import Pipeline, PDFPartition\nfrom clarifai_datautils.text.pipeline.cleaners import Clean_extra_whitespace\n\n# Define the pipeline\npipeline = Pipeline(\n    name='pipeline-1',\n    transformations=[\n        PDFPartition(chunking_strategy=\"by_title\", max_characters=1024),\n        Clean_extra_whitespace()\n    ]\n)\n\n# Use the Clarifai SDK to upload\n# dataset_url and file_path are placeholders: set them to your dataset URL and input file path\nfrom clarifai.client import Dataset\ndataset = Dataset(dataset_url)\ndataset.upload_dataset(pipeline.run(files=file_path, loader=True))\n\n```\n\n\n## More Examples\n\nSee many more code examples in this 
[repo](https://github.com/Clarifai/examples).\n","funding_links":[],"categories":[],"sub_categories":[],"project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fclarifai%2Fclarifai-python-datautils","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fclarifai%2Fclarifai-python-datautils","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fclarifai%2Fclarifai-python-datautils/lists"}