{"id":16779560,"url":"https://github.com/fzliu/radient","last_synced_at":"2025-04-06T15:12:33.759Z","repository":{"id":223108920,"uuid":"758722013","full_name":"fzliu/radient","owner":"fzliu","description":"Radient turns many data types (not just text) into vectors for similarity search, clustering, regression analysis, and more.","archived":false,"fork":false,"pushed_at":"2024-05-17T05:09:54.000Z","size":67,"stargazers_count":207,"open_issues_count":0,"forks_count":7,"subscribers_count":3,"default_branch":"main","last_synced_at":"2024-05-17T06:26:06.475Z","etag":null,"topics":["audio","embeddings","etl","fraud-detection","graphs","image-search","images","milvus","molecular-search","molecules","recommender-system","retrieval-augmented-generation","semantic-search","similarity-search","text","unstructured-data-etl","vector-database","vectors"],"latest_commit_sha":null,"homepage":"","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"bsd-2-clause","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/fzliu.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":"docs/supported_methods.md","governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2024-02-16T23:18:02.000Z","updated_at":"2024-05-30T06:15:16.319Z","dependencies_parsed_at":"2024-05-30T06:15:11.868Z","dependency_job_id":"d5d08028-0331-41e3-96ef-679d5282d182","html_url":"https://github.com/fzliu/radient","commit_stats":null,"previous_names":["fzliu/radient"],"tags_count":0,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/fzliu%2Fradient","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/fzliu%2F
radient/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/fzliu%2Fradient/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/fzliu%2Fradient/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/fzliu","download_url":"https://codeload.github.com/fzliu/radient/tar.gz/refs/heads/main","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":247500469,"owners_count":20948880,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["audio","embeddings","etl","fraud-detection","graphs","image-search","images","milvus","molecular-search","molecules","recommender-system","retrieval-augmented-generation","semantic-search","similarity-search","text","unstructured-data-etl","vector-database","vectors"],"created_at":"2024-10-13T07:30:34.293Z","updated_at":"2025-04-06T15:12:33.736Z","avatar_url":"https://github.com/fzliu.png","language":"Python","funding_links":[],"categories":["Python"],"sub_categories":[],"readme":"# Radient\n\nRadient is a developer-friendly, lightweight library for unstructured data ETL, i.e. turning audio, graphs, images, molecules, text, and other data types into embeddings. Radient supports simple vectorization as well as complex vector-centric workflows.\n\n```shell\n$ pip install radient\n```\n\nIf you find this project helpful or interesting, please consider giving it a star. 
:star:\n\n### Getting started\n\nBasic vectorization can be performed as follows:\n\n```python\nfrom radient import text_vectorizer\nvz = text_vectorizer()\nvz.vectorize(\"Hello, world!\")\n# Vector([-3.21440510e-02, -5.10351397e-02,  3.69579718e-02, ...])\n```\n\nThe above snippet vectorizes the string `\"Hello, world!\"` using a default model, namely `bge-small-en-v1.5` from `sentence-transformers`. If your Python environment does not contain the `sentence-transformers` library, Radient will prompt you for it:\n\n```python\nvz = text_vectorizer()\n# Vectorizer requires sentence-transformers. Install? [Y/n]\n```\n\nYou can type \"Y\" to have Radient install it for you automatically.\n\nEach vectorizer can take a `method` parameter along with optional keyword arguments, which are passed directly to the underlying vectorization library. For example, we can pick Mixedbread AI's `mxbai-embed-large-v1` model using the `sentence-transformers` library via:\n\n```python\nvz_mbai = text_vectorizer(method=\"sentence-transformers\", model_name_or_path=\"mixedbread-ai/mxbai-embed-large-v1\")\nvz_mbai.vectorize(\"Hello, world!\")\n# Vector([ 0.01729078,  0.04468533,  0.00055427, ...])\n```\n\n### More than just text\n\nWith Radient, you're not limited to text. Audio, graphs, images, and molecules can be vectorized as well:\n\n```python\nfrom pathlib import Path\nimport networkx as nx\nfrom radient import (\n    audio_vectorizer,\n    graph_vectorizer,\n    image_vectorizer,\n    molecule_vectorizer,\n)\navec = audio_vectorizer().vectorize(str(Path.home() / \"audio.wav\"))\ngvec = graph_vectorizer().vectorize(nx.karate_club_graph())\nivec = image_vectorizer().vectorize(str(Path.home() / \"image.jpg\"))\nmvec = molecule_vectorizer().vectorize(\"O=C=O\")\n```\n\nA partial list of methods and optional kwargs supported by each modality can be found [here](https://github.com/fzliu/radient/blob/main/docs/supported_methods.md).\n\nFor production use cases with large quantities of data, performance is key. 
Radient also provides an `accelerate` function to optimize vectorizers on the fly:\n\n```python\nimport numpy as np\nfrom radient import text_vectorizer\nvz = text_vectorizer()\nvec0 = vz.vectorize(\"Hello, world!\")\nvz.accelerate()\nvec1 = vz.vectorize(\"Hello, world!\")\nnp.allclose(vec0, vec1)\n# True\n```\n\nOn a 2.3 GHz Quad-Core Intel Core i7, the original vectorizer returns in ~32ms, while the accelerated vectorizer returns in ~17ms.\n\n### Building unstructured data ETL\n\nAside from running experiments, pure vectorization is not particularly useful. Mirroring structured data ETL pipelines, unstructured data ETL workloads often require a combination of four components: a data __source__ where unstructured data is stored, one or more __transform__ modules that perform data conversions and pre-processing, a __vectorizer__ which turns the data into semantically rich embeddings, and a __sink__ to persist the vectors once they have been computed.\n\nRadient provides a `Workflow` object specifically for building vector-centric ETL applications. With Workflows, you can combine any number of each of these components into a directed graph. 
For example, a workflow to continuously read text documents from Google Drive, vectorize them with [Voyage AI](https://www.voyageai.com/), and store the resulting vectors in Milvus might look like:\n\n```python\nfrom radient import make_operator\nfrom radient import Workflow\n\nextract = make_operator(\"source\", method=\"google-drive\", task_params={\"folder\": \"My Files\"})\ntransform = make_operator(\"transform\", method=\"read-text\", task_params={})\nvectorize = make_operator(\"vectorizer\", method=\"voyage-ai\", modality=\"text\", task_params={})\nload = make_operator(\"sink\", method=\"milvus\", task_params={\"operation\": \"insert\"})\n\nwf = (\n    Workflow()\n    .add(extract, name=\"extract\")\n    .add(transform, name=\"transform\")\n    .add(vectorize, name=\"vectorize\")\n    .add(load, name=\"load\")\n)\n```\n\nYou can use accelerated vectorizers and transforms in a Workflow by specifying `accelerate=True` for all supported operators.\n\n### Supported vectorizer engines\n\nRadient builds atop work from the broader ML community. Most vectorizers come from other libraries:\n\n- [ImageBind](https://imagebind.metademolab.com/)\n- [PyTorch Image Models](https://huggingface.co/timm)\n- [RDKit](https://rdkit.org)\n- [Sentence Transformers](https://sbert.net)\n- [scikit-learn](https://scikit-learn.org)\n- [TorchAudio](https://pytorch.org/audio)\n\nOn-the-fly model acceleration is done via [ONNX](https://onnx.ai).\n\nA massive thank you to all the creators and maintainers of these libraries.\n\n### Coming soon\u0026trade;\n\nA couple of features slated for the near term (hopefully):\n1) Sparse vector, binary vector, and multi-vector support\n2) Support for all relevant embedding models on Hugging Face\n\nLLM connectors _will not_ be a feature that Radient provides. Building context-aware systems around LLMs is a complex task, and not one that Radient intends to solve. 
Projects such as [Haystack](https://haystack.deepset.ai/) and [LlamaIndex](https://www.llamaindex.ai/) are two of the many great options to consider if you're looking to extract maximum RAG performance.\n\nA full write-up on Radient will come later, along with more sample applications, so stay tuned.\n\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Ffzliu%2Fradient","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Ffzliu%2Fradient","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Ffzliu%2Fradient/lists"}