{"id":26266812,"url":"https://github.com/anaregdesign/vectorize-openai","last_synced_at":"2025-03-14T04:13:55.035Z","repository":{"id":266581927,"uuid":"898534058","full_name":"anaregdesign/vectorize-openai","owner":"anaregdesign","description":"Tabular calculation with LLM, Spark UDF Builder","archived":false,"fork":false,"pushed_at":"2025-03-13T01:25:43.000Z","size":284,"stargazers_count":6,"open_issues_count":3,"forks_count":1,"subscribers_count":1,"default_branch":"main","last_synced_at":"2025-03-13T02:25:48.727Z","etag":null,"topics":["apache-spark","data-engineering","data-science","llm","machinelearning","openai","openai-api","pandas","python","spark-sql","spark-udf"],"latest_commit_sha":null,"homepage":"https://pypi.org/project/openaivec/","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/anaregdesign.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":"CONTRIBUTING.md","funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2024-12-04T15:09:42.000Z","updated_at":"2025-03-13T01:24:12.000Z","dependencies_parsed_at":"2024-12-05T02:27:42.842Z","dependency_job_id":"334ed5ad-8bf1-4f47-876d-30ec12a67dfe","html_url":"https://github.com/anaregdesign/vectorize-openai","commit_stats":null,"previous_names":["anaregdesign/vectorize-openai"],"tags_count":20,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/anaregdesign%2Fvectorize-openai","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/anaregdesign%2Fvectorize-openai/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/reposit
ories/anaregdesign%2Fvectorize-openai/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/anaregdesign%2Fvectorize-openai/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/anaregdesign","download_url":"https://codeload.github.com/anaregdesign/vectorize-openai/tar.gz/refs/heads/main","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":243521287,"owners_count":20304187,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["apache-spark","data-engineering","data-science","llm","machinelearning","openai","openai-api","pandas","python","spark-sql","spark-udf"],"created_at":"2025-03-14T04:13:54.414Z","updated_at":"2025-03-14T04:13:55.016Z","avatar_url":"https://github.com/anaregdesign.png","language":"Python","funding_links":[],"categories":[],"sub_categories":[],"readme":"# What is this?\nDefine a LLM based UDF for Apache Spark with simple code!\n\n```python\n# Model for structured output\nclass Fruit(pydantic.BaseModel):\n   name: str\n   color: str\n   taste: str\n\n# Prompt with example\nprompt = \"\"\"\n    return the color and taste of given fruit.\n\n    #example\n\n    ## input\n    apple\n\n    ## output\n    {{\n        \"name\": \"apple\",\n        \"color\": \"red\",\n        \"taste\": \"sweet\"\n    }}\n\"\"\"\n# Simple UDF builder in openaivec\nudf = UDFBuilder.of_azureopenai(...)\n\n# Register UDFs with structured output\nspark.udf.register(\"parse_fruit\", udf.completion(prompt, response_format=Fruit))\n\n# Use UDFs in Spark SQL\nspark.sql(\"SELECT name, parse_fruit(name) 
from dummy\").show(truncate=False)\n```\n\nThe following output is produced:\n```text\n+------+--------------------------------------------------------+\n|name  |fruit(name)                                             |\n+------+--------------------------------------------------------+\n|apple |{\"name\":\"apple\",\"color\":\"red\",\"taste\":\"sweet\"}          |\n|banana|{\"name\":\"banana\",\"color\":\"yellow\",\"taste\":\"sweet\"}      |\n|cherry|{\"name\":\"cherry\",\"color\":\"red\",\"taste\":\"sweet and tart\"}|\n+------+--------------------------------------------------------+\n```\n\n# Overview\n\nThis package provides a vectorized interface for the OpenAI API, enabling you to process multiple inputs with a single\nAPI call instead of sending requests one by one.\nThis approach helps reduce latency and simplifies your code.\n\nAdditionally, it integrates effortlessly with Pandas DataFrames and Apache Spark UDFs, making it easy to incorporate\ninto your data processing pipelines.\n\n## Features\n\n- Vectorized API requests for processing multiple inputs at once.\n- Seamless integration with Pandas DataFrames.\n- A UDF builder for Apache Spark.\n- Compatibility with multiple OpenAI clients, including Azure OpenAI.\n\n## Requirements\n\n- Python 3.10 or higher\n\n## Installation\n\nInstall the package with:\n\n```bash\npip install openaivec\n```\n\nIf you want to uninstall the package, you can do so with:\n\n```bash\npip uninstall openaivec\n```\n\n## Basic Usage\n\n```python\nimport os\nfrom openai import OpenAI\nfrom openaivec import VectorizedOpenAI\n\n\n# Initialize the vectorized client with your system message and parameters\nclient = VectorizedOpenAI(\n    client=OpenAI(...),\n    temperature=0.0,\n    top_p=1.0,\n    model_name=\"\u003cyour-model-name\u003e\",\n    system_message=\"Please answer only with 'xx family' and do not output anything else.\"\n)\n\nresult = client.predict([\"panda\", \"rabbit\", \"koala\"])\nprint(result)  # Expected output: 
['bear family', 'rabbit family', 'koala family']\n```\n\n## Using with Pandas DataFrame\n\n```python\nimport pandas as pd\n\ndf = pd.DataFrame({\"name\": [\"panda\", \"rabbit\", \"koala\"]})\n\ndf.assign(\n    kind=lambda df: client.predict(df.name)\n)\n```\n\nExample output:\n\n| name   | kind          |\n|--------|---------------|\n| panda  | bear family   |\n| rabbit | rabbit family |\n| koala  | koala family  |\n\n## Using with Apache Spark UDF\n\nBelow is an example showing how to create UDFs for Apache Spark using the provided `UDFBuilder`.\nThis configuration is intended for use with Azure OpenAI or OpenAI.\n\n```python\nfrom openaivec.spark import UDFBuilder\n\nudf = UDFBuilder.of_azureopenai(\n    api_key=\"\u003cyour-api-key\u003e\",\n    api_version=\"2024-10-21\",\n    endpoint=\"https://\u003cyour_resource_name\u003e.openai.azure.com\",\n    model_name=\"\u003cyour_deployment_name\u003e\"\n)\n\n# Register UDFs (e.g., to extract flavor or product type from product names)\nspark.udf.register(\"parse_taste\", udf.completion(\"\"\"\n- Extract flavor-related information from the product name. Return only the concise flavor name with no extra text.\n- Minimize unnecessary adjectives related to the flavor.\n    - Example:\n        - Hokkaido Milk → Milk\n        - Uji Matcha → Matcha\n\"\"\"))\n\n# Register UDFs (e.g., to extract product type from product names)\nspark.udf.register(\"parse_product\", udf.completion(\"\"\"\n- Extract the type of food from the product name. 
Return only the food category with no extra text.\n- Example output:\n    - Smoothie\n    - Milk Tea\n    - Protein Bar\n\"\"\"))\n```\n\nYou can then use the UDFs in your Spark SQL queries as follows:\n\n```sql\nSELECT id,\n       product_name,\n       parse_taste(product_name)   AS taste,\n       parse_product(product_name) AS product\nFROM product_names;\n```\n\nExample Output:\n\n| id            | product_name                         | taste     | product     |\n|---------------|--------------------------------------|-----------|-------------|\n| 4414732714624 | Cafe Mocha Smoothie (Trial Size)     | Mocha     | Smoothie    |\n| 4200162318339 | Dark Chocolate Tea (New Product)     | Chocolate | Tea         |\n| 4920122084098 | Cafe Mocha Protein Bar (Trial Size)  | Mocha     | Protein Bar |\n| 4468864478874 | Dark Chocolate Smoothie (On Sale)    | Chocolate | Smoothie    |\n| 4036242144725 | Uji Matcha Tea (New Product)         | Matcha    | Tea         |\n| 4847798245741 | Hokkaido Milk Tea (Trial Size)       | Milk      | Milk Tea    |\n| 4449574211957 | Dark Chocolate Smoothie (Trial Size) | Chocolate | Smoothie    |\n| 4127044426148 | Fruit Mix Tea (Trial Size)           | Fruit     | Tea         |\n| ...           | ...                                  | ...       | ...         
|\n\n## Building Prompts\n\nBuilding a prompt is a crucial step in using LLMs.\nIn particular, providing a few examples in a prompt can significantly improve an LLM’s performance,\na technique known as \"few-shot learning.\" Typically, a few-shot prompt consists of a purpose, cautions,\nand examples.\n\n`FewShotPromptBuilder` is a class that helps you build a few-shot learning prompt through a simple interface.\n\n### Basic Usage\n\n`FewShotPromptBuilder` requires only a purpose, cautions, and examples; the `build` method\nreturns the rendered prompt in XML format.\n\nHere is an example:\n\n```python\nfrom openaivec.prompt import FewShotPromptBuilder\n\nprompt: str = (\n    FewShotPromptBuilder()\n    .purpose(\"Return the smallest category that includes the given word\")\n    .caution(\"Never use proper nouns as categories\")\n    .example(\"Apple\", \"Fruit\")\n    .example(\"Car\", \"Vehicle\")\n    .example(\"Tokyo\", \"City\")\n    .example(\"Keiichi Sogabe\", \"Musician\")\n    .example(\"America\", \"Country\")\n    .build()\n)\nprint(prompt)\n```\n\nThe output will be:\n\n```xml\n\n\u003cPrompt\u003e\n    \u003cPurpose\u003eReturn the smallest category that includes the given word\u003c/Purpose\u003e\n    \u003cCautions\u003e\n        \u003cCaution\u003eNever use proper nouns as categories\u003c/Caution\u003e\n    \u003c/Cautions\u003e\n    \u003cExamples\u003e\n        \u003cExample\u003e\n            \u003cSource\u003eApple\u003c/Source\u003e\n            \u003cResult\u003eFruit\u003c/Result\u003e\n        \u003c/Example\u003e\n        \u003cExample\u003e\n            \u003cSource\u003eCar\u003c/Source\u003e\n            \u003cResult\u003eVehicle\u003c/Result\u003e\n        \u003c/Example\u003e\n        \u003cExample\u003e\n            \u003cSource\u003eTokyo\u003c/Source\u003e\n            \u003cResult\u003eCity\u003c/Result\u003e\n        \u003c/Example\u003e\n        \u003cExample\u003e\n            \u003cSource\u003eKeiichi Sogabe\u003c/Source\u003e\n   
         \u003cResult\u003eMusician\u003c/Result\u003e\n        \u003c/Example\u003e\n        \u003cExample\u003e\n            \u003cSource\u003eAmerica\u003c/Source\u003e\n            \u003cResult\u003eCountry\u003c/Result\u003e\n        \u003c/Example\u003e\n    \u003c/Examples\u003e\n\u003c/Prompt\u003e\n```\n\n### Improve with OpenAI\n\nFor most users, it can be challenging to write a prompt entirely free of contradictions, ambiguities, or\nredundancies.\n`FewShotPromptBuilder` provides an `improve` method to refine your prompt using OpenAI's API.\n\n`improve` method will try to eliminate contradictions, ambiguities, and redundancies in the prompt with OpenAI's API,\nand iterate the process up to `max_iter` times.\n\n```python\nfrom openai import OpenAI\nfrom openaivec.prompt import FewShotPromptBuilder\n\nclient = OpenAI(...)\nmodel_name = \"\u003cyour-model-name\u003e\"\nimproved_prompt: str = (\n    FewShotPromptBuilder()\n    .purpose(\"Return the smallest category that includes the given word\")\n    .caution(\"Never use proper nouns as categories\")\n    # Examples which has contradictions, ambiguities, or redundancies\n    .example(\"Apple\", \"Fruit\")\n    .example(\"Apple\", \"Technology\")\n    .example(\"Apple\", \"Company\")\n    .example(\"Apple\", \"Color\")\n    .example(\"Apple\", \"Animal\")\n    # improve the prompt with OpenAI's API, max_iter is number of iterations to improve the prompt.\n    .improve(client, model_name, max_iter=5)\n    .build()\n)\nprint(improved_prompt)\n```\n\nThen we will get the improved prompt with extra examples, improved purpose, and cautions:\n\n```xml\n\u003cPrompt\u003e\n    \u003cPurpose\u003eClassify a given word into its most relevant category by considering its context and potential meanings.\n        The input is a word accompanied by context, and the output is the appropriate category based on that context.\n        This is useful for disambiguating words with multiple meanings, ensuring accurate 
understanding and\n        categorization.\n    \u003c/Purpose\u003e\n    \u003cCautions\u003e\n        \u003cCaution\u003eEnsure the context of the word is clear to avoid incorrect categorization.\u003c/Caution\u003e\n        \u003cCaution\u003eBe aware of words with multiple meanings and provide the most relevant category.\u003c/Caution\u003e\n        \u003cCaution\u003eConsider the possibility of new or uncommon contexts that may not fit traditional categories.\u003c/Caution\u003e\n    \u003c/Cautions\u003e\n    \u003cExamples\u003e\n        \u003cExample\u003e\n            \u003cSource\u003eApple (as a fruit)\u003c/Source\u003e\n            \u003cResult\u003eFruit\u003c/Result\u003e\n        \u003c/Example\u003e\n        \u003cExample\u003e\n            \u003cSource\u003eApple (as a tech company)\u003c/Source\u003e\n            \u003cResult\u003eTechnology\u003c/Result\u003e\n        \u003c/Example\u003e\n        \u003cExample\u003e\n            \u003cSource\u003eJava (as a programming language)\u003c/Source\u003e\n            \u003cResult\u003eTechnology\u003c/Result\u003e\n        \u003c/Example\u003e\n        \u003cExample\u003e\n            \u003cSource\u003eJava (as an island)\u003c/Source\u003e\n            \u003cResult\u003eGeography\u003c/Result\u003e\n        \u003c/Example\u003e\n        \u003cExample\u003e\n            \u003cSource\u003eMercury (as a planet)\u003c/Source\u003e\n            \u003cResult\u003eAstronomy\u003c/Result\u003e\n        \u003c/Example\u003e\n        \u003cExample\u003e\n            \u003cSource\u003eMercury (as an element)\u003c/Source\u003e\n            \u003cResult\u003eChemistry\u003c/Result\u003e\n        \u003c/Example\u003e\n        \u003cExample\u003e\n            \u003cSource\u003eBark (as a sound made by a dog)\u003c/Source\u003e\n            \u003cResult\u003eAnimal Behavior\u003c/Result\u003e\n        \u003c/Example\u003e\n        \u003cExample\u003e\n            \u003cSource\u003eBark (as the outer covering of a 
tree)\u003c/Source\u003e\n            \u003cResult\u003eBotany\u003c/Result\u003e\n        \u003c/Example\u003e\n        \u003cExample\u003e\n            \u003cSource\u003eBass (as a type of fish)\u003c/Source\u003e\n            \u003cResult\u003eAquatic Life\u003c/Result\u003e\n        \u003c/Example\u003e\n        \u003cExample\u003e\n            \u003cSource\u003eBass (as a low-frequency sound)\u003c/Source\u003e\n            \u003cResult\u003eMusic\u003c/Result\u003e\n        \u003c/Example\u003e\n    \u003c/Examples\u003e\n\u003c/Prompt\u003e\n```\n\n## Using with Microsoft Fabric\n\n[Microsoft Fabric](https://www.microsoft.com/en-us/microsoft-fabric/) is a unified, cloud-based analytics platform that\nseamlessly integrates data engineering, warehousing, and business intelligence to simplify the journey from raw data to\nactionable insights.\n\nThis section provides instructions on how to integrate and use `vectorize-openai` within Microsoft Fabric. Follow these\nsteps:\n\n1. **Create an Environment in Microsoft Fabric:**\n    - In Microsoft Fabric, click on **New item** in your workspace.\n    - Select **Environment** to create a new environment for Apache Spark.\n    - Determine the environment name, eg. `openai-environment`.\n    - ![image](https://github.com/user-attachments/assets/bd1754ef-2f58-46b4-83ed-b335b64aaa1c)\n      *Figure: Creating a new Environment in Microsoft Fabric.*\n\n2. **Add `openaivec` to the Environment from Public Library**\n    - Once your environment is set up, go to the **Custom Library** section within that environment.\n    - Click on **Add from PyPI** and search for latest version of `openaivec`.\n    - Save and publish to reflect the changes.\n    - ![image](https://github.com/user-attachments/assets/7b6320db-d9d6-4b89-a49d-e55b1489d1ae)\n      *Figure: Add `openaivec` from PyPI to Public Library*\n\n3. 
**Use the Environment from a Notebook:**\n    - Open a notebook within Microsoft Fabric.\n    - Select the environment you created in the previous steps.\n    - ![image](https://github.com/user-attachments/assets/2457c078-1691-461b-b66e-accc3989e419)\n      *Figure: Using custom environment from a notebook.*\n    - In the notebook, import and use `openaivec.spark.UDFBuilder` as you normally would. For example:\n\n      ```python\n      from openaivec.spark import UDFBuilder\n \n      udf = UDFBuilder(\n          api_key=\"\u003cyour-api-key\u003e\",\n          api_version=\"2024-10-21\",\n          endpoint=\"https://\u003cyour-resource-name\u003e.openai.azure.com\",\n          model_name=\"\u003cyour-deployment-name\u003e\"\n      )\n      ```\n\nFollowing these steps allows you to successfully integrate and use `vectorize-openai` within Microsoft Fabric.\n\n## Contributing\n\nWe welcome contributions to this project! If you would like to contribute, please follow these guidelines:\n\n1. Fork the repository and create your branch from `main`.\n2. If you've added code that should be tested, add tests.\n3. Ensure the test suite passes.\n4. Make sure your code lints.\n\n### Installing Dependencies\n\nTo install the necessary dependencies for development, run:\n\n```bash\npoetry install --dev\n```\n\n### Code Formatting\n\nTo reformat the code, use the following command:\n\n```bash\npoetry run black ./openaivec\n```\n\n### Linting\n\nTo check for linting issues, use the following command:\n\n```bash\npoetry run flake8 ./openaivec\n```\n\n## Community\n\nJoin our Discord community for developers: https://discord.gg/vbb83Pgn\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fanaregdesign%2Fvectorize-openai","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fanaregdesign%2Fvectorize-openai","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fanaregdesign%2Fvectorize-openai/lists"}