{"id":13671200,"url":"https://github.com/eto-ai/rikai","last_synced_at":"2025-04-07T06:08:14.711Z","repository":{"id":36992309,"uuid":"327786049","full_name":"eto-ai/rikai","owner":"eto-ai","description":"Parquet-based ML data format optimized for working with unstructured data","archived":false,"fork":false,"pushed_at":"2023-01-05T05:29:47.000Z","size":16692,"stargazers_count":140,"open_issues_count":111,"forks_count":21,"subscribers_count":7,"default_branch":"main","last_synced_at":"2025-04-04T17:11:19.413Z","etag":null,"topics":["deep-learning","machine-learning","pytorch","spark","tensorflow"],"latest_commit_sha":null,"homepage":"https://rikai.readthedocs.io/en/latest/","language":"Jupyter Notebook","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"apache-2.0","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/eto-ai.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":"contributing.md","funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null}},"created_at":"2021-01-08T03:13:33.000Z","updated_at":"2025-02-18T13:10:34.000Z","dependencies_parsed_at":"2023-01-17T11:46:42.542Z","dependency_job_id":null,"html_url":"https://github.com/eto-ai/rikai","commit_stats":null,"previous_names":[],"tags_count":39,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/eto-ai%2Frikai","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/eto-ai%2Frikai/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/eto-ai%2Frikai/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/eto-ai%2Frikai/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/eto-ai","download_url":"https://codeload.github.com/eto-ai/rikai/tar.gz/re
fs/heads/main","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":247601448,"owners_count":20964864,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["deep-learning","machine-learning","pytorch","spark","tensorflow"],"created_at":"2024-08-02T09:01:02.639Z","updated_at":"2025-04-07T06:08:14.680Z","avatar_url":"https://github.com/eto-ai.png","language":"Jupyter Notebook","funding_links":[],"categories":["Jupyter Notebook"],"sub_categories":[],"readme":"![Apache License](https://img.shields.io/github/license/eto-ai/rikai?style=for-the-badge)\n[![Read The Doc](https://img.shields.io/readthedocs/rikai?style=for-the-badge)](https://rikai.readthedocs.io/)\n[![javadoc](https://javadoc.io/badge2/ai.eto/rikai_2.12/javadoc.svg?style=for-the-badge)](https://javadoc.io/doc/ai.eto/rikai_2.12)\n![Pypi version](https://img.shields.io/pypi/v/rikai?style=for-the-badge)\n![Github Action](https://img.shields.io/github/workflow/status/eto-ai/rikai/Python?style=for-the-badge)\n![stability-experimental](https://img.shields.io/badge/stability-experimental-orange.svg?style=for-the-badge)\n\n\nJoin the community:\n[![Join the chat at https://gitter.im/rikaidev/community](https://img.shields.io/badge/chat-on%20gitter-green?style=for-the-badge)](https://gitter.im/rikaidev/community?utm_source=badge\u0026utm_medium=badge\u0026utm_campaign=pr-badge\u0026utm_content=badge)\n\n\u003e :heavy_exclamation_mark: This repository is still experimental. 
No API compatibility is guaranteed.\n\n# Rikai\n\nRikai is a framework specifically designed for AI workflows focused around large-scale unstructured datasets\n(e.g., images, videos, sensor data (future), text (future), and more).\nThrough every stage of the AI modeling workflow,\nRikai strives to offer a great developer experience when working with real-world AI datasets.\n\nThe quality of an AI dataset can make or break an AI project, but tooling for AI data is sorely lacking in ergonomics.\nAs a result, practitioners must spend most of their time and effort wrestling with their data instead of innovating on the models and use cases.\nRikai alleviates the pain that AI practitioners experience on a daily basis dealing with the myriad of tedious data tasks,\nso they can focus again on model-building and problem solving.\n\nTo start trying Rikai right away, check out the [Quickstart Guide](https://rikai.readthedocs.io/en/latest/quickstart.html).\n\n## Main Features\n\n### Data format\n\nThe core of Rikai is a data format (\"rikai format\") based on [Apache Parquet](https://parquet.apache.org/).\nRikai augments Parquet with a rich collection of semantic types designed specifically for unstructured data and annotations.\n\n### Integrations\n\nRikai comes with an extensive set of I/O connectors. 
For ETL, Rikai can consume popular formats like ROS bags and COCO.\nFor analysis, it's easy to read Rikai data into pandas/Spark DataFrames (Rikai handles serde for the semantic types).\nAnd for training, Rikai allows direct creation of PyTorch/TensorFlow datasets without manual conversion.\n\n### SQL-ML Engine\n\nRikai extends Spark SQL with ML capabilities, allowing users to analyze Rikai datasets with their own models in SQL\n(\"Bring your own model\").\n\n### Visualization\n\nSemantic types come with carefully crafted visualizations, especially in Jupyter notebooks,\nto help you inspect your AI data without having to remember complicated raw image manipulations.\n\n## Roadmap\n1. Improved video support\n2. Text / sensors / geospatial support\n3. Versioning support built into the dataset\n4. Better Rikai UDT support\n5. Declarative annotation API (think vega-lite for annotating images/videos)\n6. Integrations into dbt and BI tools\n\n## Example\n\n```python\nfrom pyspark.sql import Row\nfrom pyspark.ml.linalg import DenseMatrix\nfrom rikai.types import Image, Box2d\nfrom rikai.numpy import wrap\nimport numpy as np\n\ndf = spark.createDataFrame(\n    [\n        {\n            \"id\": 1,\n            \"mat\": DenseMatrix(2, 2, range(4)),\n            \"image\": Image(\"s3://foo/bar/1.png\"),\n            \"annotations\": [\n                Row(\n                    label=\"cat\",\n                    mask=wrap(np.random.rand(256, 256)),\n                    bbox=Box2d(xmin=1.0, ymin=2.0, xmax=3.0, ymax=4.0),\n                )\n            ],\n        }\n    ]\n)\n\ndf.write.format(\"rikai\").save(\"s3://path/to/features\")\n```\n\nTrain on the dataset in `PyTorch`\n\n```python\nfrom torch.utils.data import DataLoader\nfrom torchvision import transforms as T\nfrom rikai.pytorch.vision import Dataset\n\ntransform = T.Compose([\n   T.Resize(640),\n   T.ToTensor(),\n   T.Normalize((0.485, 0.456, 0.406), (0.229, 0.224, 0.225))\n])\n\ndataset = 
Dataset(\n   \"s3://path/to/features\",\n   image_column=\"image\",\n   transform=transform\n)\nloader = DataLoader(\n    dataset,\n    batch_size=32,\n    num_workers=8,\n)\nfor batch in loader:\n    predicts = model(batch.to(\"cuda\"))\n```\n\nUsing an ML model in Spark SQL (**experimental**)\n\n```sql\nCREATE MODEL yolo5\nOPTIONS (min_confidence=0.3, device=\"gpu\", batch_size=32)\nUSING \"s3://bucket/to/yolo5_spec.yaml\";\n\nSELECT id, ML_PREDICT(yolo5, image) FROM my_dataset\nWHERE split = \"train\" LIMIT 100;\n```\n\nRikai can use MLflow as its model registry. This allows you to automatically pick up the latest\nmodel version if you're using the MLflow model registry. Here is a list of supported model flavors:\n+ PyTorch (pytorch)\n+ TensorFlow (tensorflow)\n+ Scikit-learn (sklearn)\n\n```sql\nCREATE MODEL yolo5\nOPTIONS (min_confidence=0.3, device=\"gpu\", batch_size=32)\nUSING \"mlflow:///yolo5_model/\";\n\nSELECT id, ML_PREDICT(yolo5, image) FROM my_dataset\nWHERE split = \"train\" LIMIT 100;\n```\n\nFor more details on the model spec, see the [SQL-ML documentation](https://rikai.readthedocs.io/en/latest/sqlml.html).\n\n## Getting Started\n\nCurrently Rikai is maintained for \u003ca name=\"VersionMatrix\"\u003e\u003c/a\u003eScala 2.12 and Python 3.7, 3.8, and 3.9.\n\nThere are multiple ways to install Rikai:\n\n1. Try it using the included [Dockerfile](#Docker).\n2. Install with pip (`pip install rikai`), with\n   [extras for gcp, pytorch/tf, and others](#Extras).\n3. 
Install from [source](#Source).\n\nNote: if you want to use Rikai with your own pyspark, please consult the\n[rikai documentation](https://rikai.readthedocs.io/en/latest/spark.html) for tips.\n\n### \u003ca name=\"Docker\"\u003e\u003c/a\u003eDocker\n\nThe included Dockerfile creates a standalone demo image with\nJupyter, PyTorch, Spark, and rikai preinstalled, along with notebooks for you\nto play with the capabilities of the rikai feature store.\n\nTo build and run the docker image from the current directory:\n```bash\n# Clone the repo\ngit clone git@github.com:eto-ai/rikai rikai\n# Build the docker image\ndocker build --tag rikai --network host .\n# Run the image\ndocker run -p 0.0.0.0:8888:8888/tcp rikai:latest jupyter lab --ip 0.0.0.0 --port 8888\n```\n\nIf successful, the console should then print out a clickable link to JupyterLab. You can also\nopen a browser tab and go to `localhost:8888`.\n\n### \u003ca name=\"Extras\"\u003e\u003c/a\u003eInstall from pypi\n\nThe base rikai library can be installed with just `pip install rikai`. Dependencies for supporting\npytorch (pytorch and torchvision) and jupyter (matplotlib and jupyterlab) are all part of\noptional extras. Many open-source datasets also use YouTube videos, so we've added pafy and\nyoutube-dl as optional extras as well.\n\nFor example, if you want to use pytorch in Jupyter to train models on rikai datasets in s3\ncontaining YouTube videos, you would run:\n\n`pip install rikai[pytorch,jupyter,youtube]`\n\nIf you're not sure what you need and don't mind installing some extra dependencies, you can\nsimply install everything:\n\n`pip install rikai[all]`\n\n### \u003ca name=\"Source\"\u003e\u003c/a\u003eInstall from source\n\nTo build from source, you'll need Python as well as Scala with sbt installed:\n\n```bash\n# Clone the repo\ngit clone git@github.com:eto-ai/rikai rikai\n# Build the jar\nsbt publishLocal\n# Install python package\ncd python\npip install -e . 
# pip install -e .[all] to install all optional extras (see \"Install from pypi\")\n```\n\n### Utilities\n\n[pre-commit](https://pre-commit.com/) can help keep your code format consistent with the repository.\nIt can trigger reformatting and other checks on your local machine before CI forces you to do it.\n\nTo use it, install and enable `pre-commit`:\n```bash\npip install pre-commit\npre-commit install #in your local development directory\n#pre-commit installed at .git/hooks/pre-commit\n```\nUninstalling it is just as easy:\n```\npre-commit uninstall\n```\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Feto-ai%2Frikai","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Feto-ai%2Frikai","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Feto-ai%2Frikai/lists"}