https://github.com/ghomashudson/pyspark-batch-ai
Batch process Spark DataFrames with LLMs
https://github.com/ghomashudson/pyspark-batch-ai
ai batch batch-processing llms openai pyspark
Last synced: 16 days ago
JSON representation
Batch process Spark DataFrames with LLMs
- Host: GitHub
- URL: https://github.com/ghomashudson/pyspark-batch-ai
- Owner: ghomasHudson
- Created: 2025-01-08T15:18:42.000Z (over 1 year ago)
- Default Branch: main
- Last Pushed: 2025-01-10T12:50:41.000Z (over 1 year ago)
- Last Synced: 2025-09-19T22:29:39.222Z (8 months ago)
- Topics: ai, batch, batch-processing, llms, openai, pyspark
- Language: Python
- Homepage:
- Size: 22.5 KB
- Stars: 0
- Watchers: 2
- Forks: 0
- Open Issues: 0
-
Metadata Files:
- Readme: README.md
Awesome Lists containing this project
README
# Pyspark Batch AI
> Batch process Spark DataFrames with LLMs
Start with a table with a `prompt` column:
| Name | Age | prompt |
| -------- | --- | ------------ |
| A. Smith | 40 | What is 2+2? |
| B. Jones | 45 | What is 9*4? |
And end up with the same table with a `response` column:
| Name | Age | response |
| -------- | --- | ------------ |
| A. Smith | 40 | 4 |
| B. Jones | 45 | 36 |
Prompts are sent using [openAI's batch API](https://platform.openai.com/docs/guides/batch), so are optimized for processing large dataframes.
## Install
1. Clone the repo
2. `pip install .`
### TODO
pyspark-batch-ai can be installed via pip from [PyPI](https://pypi.org/project/pyspark-batch-ai/):
`pip install pyspark-batch-ai`
## How to use
See the [Examples](https://github.com/ghomasHudson/pyspark-batch-ai/tree/main/examples).
```python
import pandas as pd
from pyspark_batch_ai import process_dataframe
data = {'prompt': ['translate this to french: hello', 'summarize this text in one sentence.']}
df = pd.dataframe(data)
client = openai.client(api_key="sk-...")
result_df = process_dataframe(df, client)
result_df.show()
```