Ecosyste.ms: Awesome
An open API service indexing awesome lists of open source software.
https://github.com/hardbyte/qabot
CLI based natural language queries on local or remote data
- Host: GitHub
- URL: https://github.com/hardbyte/qabot
- Owner: hardbyte
- License: apache-2.0
- Created: 2023-03-03T21:18:39.000Z (almost 2 years ago)
- Default Branch: main
- Last Pushed: 2024-09-16T09:30:07.000Z (4 months ago)
- Last Synced: 2024-12-30T17:39:46.671Z (12 days ago)
- Language: Python
- Homepage:
- Size: 3.42 MB
- Stars: 241
- Watchers: 3
- Forks: 20
- Open Issues: 0
Metadata Files:
- Readme: README.md
- License: LICENSE
- Security: Security.md
Awesome Lists containing this project
- awesome-langchain-zh - QABot
- awesome-langchain - QABot
README
# qabot
Query local or remote files with natural language questions, powered by
OpenAI's `gpt` models and `duckdb` 🦆. Works with local and remote CSV and Parquet files.
## Installation
Install with [pipx](https://pypa.github.io/pipx/installation/), `pip`, etc.:
```
pipx install qabot
```

## Security Risks
This program gives an LLM access to your local and network-accessible files and allows it to execute arbitrary SQL queries; see `Security.md` for more information.
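To make the risk concrete: the generated SQL runs in an ordinary DuckDB session with your user's filesystem and network access, so (as a hypothetical illustration, not something qabot emits by default) a statement like the following would be executable:

```python
import duckdb

# DuckDB can read any file or URL the process can reach, so arbitrary generated SQL
# could do the same. Hypothetical path, for illustration only:
duckdb.sql("SELECT * FROM read_csv_auto('/home/me/private/salaries.csv')").show()
```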
## Command Line Usage
```bash
$ export OPENAI_API_KEY=sk-...
$ export QABOT_MODEL_NAME=gpt-4o
$ qabot -w -q "How many Hospitals are there located in Beijing"
Query: How many Hospitals are there located in Beijing
There are 39 hospitals located in Beijing.
Total tokens 1749 approximate cost in USD: 0.05562
```

## Python Usage
```python
from qabot import ask_wikidata, ask_file, ask_database

print(ask_wikidata("How many hospitals are there in New Zealand?"))
print(ask_file("How many men were aboard the titanic?", 'data/titanic.csv'))
print(ask_database("How many product images are there?", 'postgresql://user:password@localhost:5432/dbname'))
```

Output:
```text
There are 54 hospitals in New Zealand.
There were 577 male passengers on the Titanic.
There are 6,225 product images.
```

## Features
Works on local CSV and Excel files:
![](.github/local_csv_query.png)
and on remote CSV files:
```
$ qabot -f https://duckdb.org/data/holdings.csv -q "Tell me how many Apple holdings I currently have"
🦆 Creating local DuckDB database...
🦆 Loading data...
create view 'holdings' as select * from 'https://duckdb.org/data/holdings.csv';
🚀 Sending query to LLM
🧑 Tell me how many Apple holdings I currently have

🤖 You currently have 32.23 shares of Apple.
This information was obtained by summing up all the Apple ('APPL') shares in the holdings table.
SELECT SUM(shares) as total_shares FROM holdings WHERE ticker = 'APPL'
```

Even on (public) data stored in S3:
![](.github/external_s3_data.png)
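Whether the file lives on https or S3, qabot is ultimately issuing plain DuckDB SQL like the `create view` statement shown above. A rough equivalent in Python, assuming the `duckdb` package (recent DuckDB versions auto-load the httpfs extension for remote URLs):

```python
import duckdb

con = duckdb.connect()  # in-memory database, like qabot's local DuckDB database
# Older DuckDB releases may first need: con.execute("INSTALL httpfs; LOAD httpfs;")

# The same view-over-a-remote-CSV statement qabot generated above
con.execute("create view 'holdings' as select * from 'https://duckdb.org/data/holdings.csv';")

# ...and the aggregation the agent ran to answer the question
print(con.execute("SELECT SUM(shares) AS total_shares FROM holdings WHERE ticker = 'APPL'").fetchone())
```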
You can even load data from disk/URL via the natural language query:
> Load the file 'data/titanic.csv' into a table called 'raw_passengers'.
> Create a view of the raw passengers table for just the male passengers. What
> was the average fare for surviving male passengers?

```
~/Dev/qabot> qabot -q "Load the file 'data/titanic.csv' into a table called 'raw_passengers'. Create a view of the raw passengers table for just the male passengers. What was the average fare for surviving male passengers?" -v
🦆 Creating local DuckDB database...
🤖 Using model: gpt-4-1106-preview. Max LLM/function iterations before answer 20
🚀 Sending query to LLM
🧑 Load the file 'data/titanic.csv' into a table called 'raw_passengers'. Create a view of the raw passengers table for just the male passengers. What was the
average fare for surviving male passengers?
🤖 load_data
{'files': ['data/titanic.csv']}
🦆 Imported with SQL:
["create table 'titanic' as select * from 'data/titanic.csv';"]
🤖 execute_sql
{'query': "CREATE VIEW male_passengers AS SELECT * FROM titanic WHERE Sex = 'male';"}
🦆 No output
🤖 execute_sql
{'query': 'SELECT AVG(Fare) as average_fare FROM male_passengers WHERE Survived = 1;'}
🦆 average_fare
40.82148440366974
🦆 {"summary": "The average fare for surviving male passengers was approximately $40.82.", "detail": "The average fare for surviving male passengers was
calculated by creating a view called `male_passengers` to filter only the male passengers from the `titanic` table, and then running a query to calculate the
average fare for male passengers who survived. The calculated average fare is approximately $40.82.", "query": "CREATE VIEW male_passengers AS SELECT * FROM
titanic WHERE Sex = 'male';\nSELECT AVG(Fare) as average_fare FROM male_passengers WHERE Survived = 1;"}

🚀 Question:
🧑 Load the file 'data/titanic.csv' into a table called 'raw_passengers'. Create a view of the raw passengers table for just the male passengers. What was the
average fare for surviving male passengers?
🤖 The average fare for surviving male passengers was approximately $40.82.

The average fare for surviving male passengers was calculated by creating a view called `male_passengers` to filter only the male passengers from the `titanic`
table, and then running a query to calculate the average fare for male passengers who survived. The calculated average fare is approximately $40.82.

CREATE VIEW male_passengers AS SELECT * FROM titanic WHERE Sex = 'male';
SELECT AVG(Fare) as average_fare FROM male_passengers WHERE Survived = 1;
```
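The same multi-step question can also be asked from Python with the `ask_file` helper shown earlier (a sketch; the printed answer will be formatted differently from the CLI transcript above):

```python
from qabot import ask_file

# The full natural-language instruction from the transcript above
question = (
    "Load the file 'data/titanic.csv' into a table called 'raw_passengers'. "
    "Create a view of the raw passengers table for just the male passengers. "
    "What was the average fare for surviving male passengers?"
)
print(ask_file(question, "data/titanic.csv"))
```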
## Quickstart
You need to set the `OPENAI_API_KEY` environment variable to your OpenAI API key,
which you can get from [here](https://platform.openai.com/account/api-keys).

Install the `qabot` command line tool using pip/pipx:
```bash
$ pip install -U qabot
```

Then run the `qabot` command with either local files (`-f my-file.csv`) or `-w` to query Wikidata.
See all options with `qabot --help`
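Both `OPENAI_API_KEY` and `QABOT_MODEL_NAME` are ordinary environment variables, so the quickstart also works programmatically. A minimal sketch (the model name is just the example used above, and setting the variables before importing qabot is a cautious assumption about when they are read):

```python
import os

# qabot picks up its configuration from the environment (see the Quickstart above)
os.environ["OPENAI_API_KEY"] = "sk-..."    # replace with your real key
os.environ["QABOT_MODEL_NAME"] = "gpt-4o"  # optional; example model name from this README

from qabot import ask_wikidata

print(ask_wikidata("How many hospitals are there in New Zealand?"))
```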
## Examples
### Local CSV file/s
```bash
$ qabot -q "how many passengers survived by gender?" -f data/titanic.csv
🦆 Loading data from files...
Loading data/titanic.csv into table titanic...

Query: how many passengers survived by gender?
Result:
There were 233 female passengers and 109 male passengers who survived.

🚀 any further questions? [y/n] (y): y
🚀 Query: what was the largest family who did not survive?
Query: what was the largest family who did not survive?
Result:
The largest family who did not survive was the Sage family, with 8 members.

🚀 any further questions? [y/n] (y): n
```

## Query WikiData
Use the `-w` flag to query Wikidata. For best results, use a `gpt-4` or similar model.
```bash
$ export QABOT_MODEL_NAME=gpt-4
$ qabot -w -q "How many Hospitals are there located in Beijing"
```

## Intermediate steps and database queries
Use the `-v` flag to see the intermediate steps and database queries.
Sometimes it takes a long route to get to the answer, but it's interesting to see how it gets there.

```
qabot -f data/titanic.csv -q "how many passengers survived by gender?" -v
```

## Data accessed via http/s3
Use the `-f` flag to load data from a URL, e.g. a CSV file on S3:
```bash
$ qabot -f s3://covid19-lake/enigma-jhu-timeseries/csv/jhu_csse_covid_19_timeseries_merged.csv -q "how many confirmed cases of covid are there?" -v
🦆 Loading data from files...
create table jhu_csse_covid_19_timeseries_merged as select * from 's3://covid19-lake/enigma-jhu-timeseries/csv/jhu_csse_covid_19_timeseries_merged.csv';

Result:
264308334 confirmed cases
```

## Docker Usage
You can build and run the Docker image for `qabot` using the following instructions:
### Building the Docker Image
To build the Docker image, run the following command in the root directory of the repository:
```bash
docker build -t qabot .
```

### Running the Docker Image
To run the Docker image, use the following command:
```bash
docker run --rm -e OPENAI_API_KEY=your_openai_api_key ghcr.io/hardbyte/qabot -w -q "How many Hospitals are there located in Beijing"
```

Replace `your_openai_api_key` with your actual OpenAI API key.
## Ideas
- Streaming mode to output results as they come in
- Token limits and better reporting of costs
- Supervisor agent to assess whether a query is "safe" to run; it could ask for user confirmation before running anything that gets flagged
- Often we can zero-shot the question and get a single query out; perhaps try this before the MRKL chain
- Test each zero-shot agent individually
- Generate and pass back assumptions made to the user
- Add an optional "clarify" tool to the chain that asks the user to clarify the question
- Create a query checker tool that checks if the query looks valid and/or safe
- Inject AWS credentials into duckdb for access to private resources in S3 (see the sketch after this list)
- Automatic publishing to pypi e.g. using [trusted publishers](https://blog.pypi.org/posts/2023-04-20-introducing-trusted-publishers/)
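For the AWS-credentials idea above, DuckDB's `httpfs` extension already accepts S3 credentials through SQL settings, so the integration could look roughly like this hypothetical helper (not part of qabot today):

```python
import os

import duckdb


def connect_with_s3_credentials() -> "duckdb.DuckDBPyConnection":
    """Hypothetical helper: hand AWS credentials from the environment to DuckDB."""
    con = duckdb.connect()
    con.execute("INSTALL httpfs; LOAD httpfs;")
    # DuckDB's httpfs extension reads these settings when resolving s3:// URLs
    con.execute(f"SET s3_region='{os.environ.get('AWS_REGION', 'us-east-1')}';")
    con.execute(f"SET s3_access_key_id='{os.environ['AWS_ACCESS_KEY_ID']}';")
    con.execute(f"SET s3_secret_access_key='{os.environ['AWS_SECRET_ACCESS_KEY']}';")
    return con
```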