{"id":25006349,"url":"https://github.com/lotus-data/lotus","last_synced_at":"2025-10-19T22:31:15.069Z","repository":{"id":248736980,"uuid":"829544540","full_name":"lotus-data/lotus","owner":"lotus-data","description":"Use LOTUS to process all of your datasets with LLMs and embeddings. Enjoy up to 1000x speedups with fast, accurate query processing, that's as simple as writing Pandas code","archived":false,"fork":false,"pushed_at":"2025-10-07T00:10:57.000Z","size":1940,"stargazers_count":1310,"open_issues_count":44,"forks_count":112,"subscribers_count":15,"default_branch":"main","last_synced_at":"2025-10-07T01:19:10.089Z","etag":null,"topics":["ai-data-processing","data","llm","llm-data-processing","llm-document-processing","pandas","python","semantic-operators","semantic-search","unstructured-data"],"latest_commit_sha":null,"homepage":"https://lotus-data.github.io","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"apache-2.0","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/lotus-data.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":"CONTRIBUTING.md","funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null,"zenodo":null,"notice":null,"maintainers":null,"copyright":null,"agents":null,"dco":null,"cla":null}},"created_at":"2024-07-16T16:39:06.000Z","updated_at":"2025-10-07T00:23:03.000Z","dependencies_parsed_at":"2024-11-04T19:29:45.936Z","dependency_job_id":"5bbb8c1d-890f-443c-a17d-cd163e1f078c","html_url":"https://github.com/lotus-data/lotus","commit_stats":null,"previous_names":["stanford-futuredata/lotus","tag-research/lotus","lotus-data/lotus","guestrin-lab/lotus"],"tags_count":3,"template":false,"template_full_name":null,"purl":"pkg:github/l
otus-data/lotus","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/lotus-data%2Flotus","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/lotus-data%2Flotus/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/lotus-data%2Flotus/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/lotus-data%2Flotus/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/lotus-data","download_url":"https://codeload.github.com/lotus-data/lotus/tar.gz/refs/heads/main","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/lotus-data%2Flotus/sbom","scorecard":null,"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":279927568,"owners_count":26245503,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","status":"online","status_checked_at":"2025-10-19T02:00:07.647Z","response_time":64,"last_error":null,"robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":true,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["ai-data-processing","data","llm","llm-data-processing","llm-document-processing","pandas","python","semantic-operators","semantic-search","unstructured-data"],"created_at":"2025-02-05T01:01:58.695Z","updated_at":"2025-10-19T22:31:15.064Z","avatar_url":"https://github.com/lotus-data.png","language":"Python","funding_links":[],"categories":["Python"],"sub_categories":[],"readme":"# LOTUS: LLM-Powered Data Processing Made Fast, Easy, and Robust\n\u003c!--- BADGES: 
START ---\u003e\n\u003c!--[![Colab Demo](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/drive/1OzoJXH13aOwNOIEemClxzNCNYnqSGxVl?usp=sharing)--\u003e\n[![Colab Demo](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/drive/1mP65YHHdD6mnZmC5-Uqm2uCXJ4-Kbkhu?usp=sharing)\n[![Arxiv](https://img.shields.io/badge/arXiv-2407.11418-B31B1B.svg)][#arxiv-paper-package]\n[![Slack](https://img.shields.io/badge/slack-lotus-purple.svg?logo=slack)][#slack]\n[![Documentation Status](https://readthedocs.org/projects/lotus-ai/badge/?version=latest)](https://lotus-ai.readthedocs.io/en/latest/?badge=latest)\n[![PyPI - Python Version](https://img.shields.io/pypi/pyversions/lotus-ai)][#pypi-package]\n[![PyPI](https://img.shields.io/pypi/v/lotus-ai)][#pypi-package]\n\n[#license-gh-package]: https://lbesson.mit-license.org/\n[#arxiv-paper-package]: https://arxiv.org/abs/2407.11418\n[#pypi-package]: https://pypi.org/project/lotus-ai/\n[#slack]: https://join.slack.com/t/lotus-fnm8919/shared_invite/zt-319k232lx-nEcLF~5w274dcQLmw2Wqyg\n\u003c!--- BADGES: END ---\u003e\n\nLOTUS is the framework that allows you to easily process your datasets, including unstructured and structured data, with LLMs. It provides an **intuitive Pandas-like API**, offers algorithms for **optimizing your programs for up to 1000x speedups**, and makes LLM-based data processing **robust with accuracy guarantees** with respect to high-quality reference algorithms.\n\nLOTUS stands for **L**LMs **O**ver **T**ext, **U**nstructured and **S**tructured Data, and it implements [**semantic operators**](https://arxiv.org/abs/2407.11418), which extend the core philosophy of relational operators—designed for declarative and robust _structured-data_ processing—to _unstructured-data_ processing with AI. 
Semantic operators are expressive, allowing you to easily capture all of your data-intensive AI programs, from simple RAG, to document extraction, image classification, LLM-judge evals, unstructured data analysis, and more.\n\nFor troubleshooting or feature requests, please raise an issue and we'll get to it promptly. To share feedback and applications you're working on, you can send us a message on our [community Slack](https://join.slack.com/t/lotus-fnm8919/shared_invite/zt-319k232lx-nEcLF~5w274dcQLmw2Wqyg), or send an email (lianapat@stanford.edu).\n\n# Installation\nFor the latest stable release:\n```\nconda create -n lotus python=3.10 -y\nconda activate lotus\npip install lotus-ai\n```\n\nFor the latest features, you can alternatively install as follows:\n```\nconda create -n lotus python=3.10 -y\nconda activate lotus\npip install git+https://github.com/lotus-data/lotus.git@main\n```\n\n\n## Running on Mac\nIf you are running on a Mac, please install FAISS via conda:\n\n### CPU-only version\n```\nconda install -c pytorch faiss-cpu=1.8.0\n```\n\n### GPU(+CPU) version\n```\nconda install -c pytorch -c nvidia faiss-gpu=1.8.0\n```\nFor more details, see [Installing FAISS via Conda](https://github.com/facebookresearch/faiss/blob/main/INSTALL.md#installing-faiss-via-conda).\n\n# Quickstart\nIf you're already familiar with Pandas, getting started will be a breeze! Below we provide a simple example program using the semantic join operator. The join, like many semantic operators, is specified by a **langex** (natural language expression), which the programmer uses to describe the operation. Each langex is parameterized by one or more table columns, denoted in curly braces. 
The join's langex serves as a predicate and is parameterized by a right and left join key.\n```python\nimport pandas as pd\nimport lotus\nfrom lotus.models import LM\n\n# configure the LM, and remember to export your API key\nlm = LM(model=\"gpt-4.1-nano\")\nlotus.settings.configure(lm=lm)\n\n# create dataframes with course names and skills\ncourses_data = {\n    \"Course Name\": [\n        \"History of the Atlantic World\",\n        \"Riemannian Geometry\",\n        \"Operating Systems\",\n        \"Food Science\",\n        \"Compilers\",\n        \"Intro to computer science\",\n    ]\n}\nskills_data = {\"Skill\": [\"Math\", \"Computer Science\"]}\ncourses_df = pd.DataFrame(courses_data)\nskills_df = pd.DataFrame(skills_data)\n\n# lotus sem join \nres = courses_df.sem_join(skills_df, \"Taking {Course Name} will help me learn {Skill}\")\nprint(res)\n\n# Print total LM usage\nlm.print_total_usage()\n```\n### Tutorials\n\nBelow are some short tutorials in Google Colab, to help you get started. We recommend starting with `[1] Introduction to Semantic Operators and LOTUS`, which will provide a broad overview of useful functionality to help you get started.\n\n\u003cdiv align=\"center\"\u003e\n\n| Tutorial                                           | Difficulty                                                      | Colab Link                                                                                                                                                                                                    |\n|----------------------------------------------------|-----------------------------------------------------------------|---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|\n| 1. 
Introduction to Semantic Operators and LOTUS             | ![](https://img.shields.io/badge/Level-Beginner-green.svg)      | [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/drive/1mP65YHHdD6mnZmC5-Uqm2uCXJ4-Kbkhu?usp=sharing)              |\n| 2. Failure Analysis Over Agent Traces                           | ![](https://img.shields.io/badge/Level-Intermediate-yellow.svg)      | [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/drive/1EJm9A8r_ShYxR0s218J70XhsopOgeT6k?usp=sharing)   |\n| 3. System Prompt Analysis with LOTUS | ![](https://img.shields.io/badge/Level-Intermediate-yellow.svg)      | [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/drive/1NSVQYOMp2GCre5ZRgvgs6BPGOa20ySMc?usp=sharing) |\n| 4. Processing Multimodal Datasets                             | ![](https://img.shields.io/badge/Level-Intermediate-yellow.svg) | [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/drive/18oaa12T6PrhHIYGw-L01gw1bDmTYaE_e)   |\n\u003c/div\u003e\n\n## Key Concept: The Semantic Operator Model\nLOTUS introduces the semantic operator programming model. Semantic operators are declarative transformations over one or more datasets, parameterized by a natural language expression, that can be implemented by a variety of AI-based algorithms. Semantic operators seamlessly extend the relational model, operating over tables that may contain traditional structured data as well as unstructured fields, such as free-form text. These modular language-based operators allow you to write AI-based pipelines with high-level logic, leaving optimizations to the query engine. Each operator can be implemented and optimized in multiple ways, opening a rich space for execution plans, similar to relational operators. 
To learn more about the semantic operator model, read the full [research paper](https://arxiv.org/abs/2407.11418).\n\nLOTUS offers a number of semantic operators in a Pandas-like API, some of which are described below. To learn more about the semantic operators provided in LOTUS, check out the full [documentation](https://lotus-ai.readthedocs.io/en/latest/), run the [colab tutorial](https://colab.research.google.com/drive/1mP65YHHdD6mnZmC5-Uqm2uCXJ4-Kbkhu?usp=sharing), or refer to these [examples](https://github.com/TAG-Research/lotus/tree/main/examples/op_examples).\n\n\n| Operator   | Description                                     |\n|------------|-------------------------------------------------|\n| sem_map      | Map each record using a natural language projection |\n| sem_filter   | Keep records that match the natural language predicate |\n| sem_extract  | Extract one or more attributes from each row        |\n| sem_agg      | Aggregate across all records (e.g. for summarization)             |\n| sem_topk     | Order the records by some natural language sorting criteria                 |\n| sem_join     | Join two datasets based on a natural language predicate       |\n| sem_sim_join | Join two DataFrames based on semantic similarity             |\n| sem_search   | Perform semantic search over a text column                |\n\n\n# Supported Models\nThere are 3 main model classes in LOTUS:\n- `LM`: The language model class.\n    - The `LM` class is built on top of the `LiteLLM` library, and supports any model that is supported by `LiteLLM`. See [this page](CONTRIBUTING.md) for examples of using models on `OpenAI`, `Ollama`, and `vLLM`. Any provider supported by `LiteLLM` should work. 
Check out [litellm's documentation](https://litellm.vercel.app) for more information.\n- `RM`: The retrieval model class.\n    - Any model from `SentenceTransformers` can be used with the `SentenceTransformersRM` class, by passing the model name to the `model` parameter (see [an example here](examples/op_examples/dedup.py)). Additionally, `LiteLLMRM` can be used with any model supported by `LiteLLM` (see [an example here](examples/op_examples/sim_join.py)).\n- `Reranker`: The reranker model class.\n    - Any `CrossEncoder` from `SentenceTransformers` can be used with the `CrossEncoderReranker` class, by passing the model name to the `model` parameter (see [an example here](examples/op_examples/search.py)).\n\n# Feature Requests and Contributing\n\nWe welcome contributions from the community! Whether you're reporting bugs, suggesting features, or contributing code, we have comprehensive templates and guidelines to help you get started.\n\n## Getting Started\n\nBefore contributing, please:\n\n1. **Read our [Contributing Guide](CONTRIBUTING.md)** - Comprehensive guidelines for contributors\n2. **Check existing issues** - Avoid duplicates by searching existing issues and pull requests\n3. **Join our community** - Connect with us on [Slack](https://join.slack.com/t/lotus-fnm8919/shared_invite/zt-319k232lx-nEcLF~5w274dcQLmw2Wqyg)\n\n\n## Development Setup\n\nFor development setup and detailed contribution guidelines, see our [Contributing Guide](CONTRIBUTING.md).\n\n## Community\n\n- **Slack**: [Join our community](https://join.slack.com/t/lotus-fnm8919/shared_invite/zt-319k232lx-nEcLF~5w274dcQLmw2Wqyg) \n- **Email**: lianapat@stanford.edu\n- **Discussions**: [GitHub Discussions](https://github.com/lotus-data/lotus/discussions)\n\nWe're excited to see what you build with LOTUS! 
🚀\n\n# References\nFor recent updates related to LOTUS, follow [@lianapatel_](https://x.com/lianapatel_) on X.\n\nIf you find LOTUS or semantic operators useful, we'd appreciate it if you could cite this work as follows:\n```bibtex\n@article{patel2025semanticoptimization,\n    title = {Semantic Operators and Their Optimization: Enabling LLM-Based Data Processing with Accuracy Guarantees in LOTUS},\n    author = {Patel, Liana and Jha, Siddharth and Pan, Melissa and Gupta, Harshit and Asawa, Parth and Guestrin, Carlos and Zaharia, Matei},\n    year = {2025},\n    journal = {Proc. VLDB Endow.},\n    url = {https://doi.org/10.14778/3749646.3749685},\n}\n@article{patel2024semanticoperators,\n      title={Semantic Operators: A Declarative Model for Rich, AI-based Analytics Over Text Data},\n      author={Liana Patel and Siddharth Jha and Parth Asawa and Melissa Pan and Carlos Guestrin and Matei Zaharia},\n      year={2024},\n      eprint={2407.11418},\n      url={https://arxiv.org/abs/2407.11418},\n}\n```\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Flotus-data%2Flotus","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Flotus-data%2Flotus","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Flotus-data%2Flotus/lists"}