https://github.com/pinecone-io/pinecone-datasets
An open-source dataset library for pre-embedded datasets: create your own data catalog, or use Pinecone's public datasets.
- Host: GitHub
- URL: https://github.com/pinecone-io/pinecone-datasets
- Owner: pinecone-io
- Created: 2023-02-15T19:50:06.000Z (over 2 years ago)
- Default Branch: main
- Last Pushed: 2024-05-20T17:56:38.000Z (about 1 year ago)
- Last Synced: 2024-08-02T13:24:14.178Z (11 months ago)
- Topics: data, database, embeddings, vector
- Language: Python
- Homepage: https://pinecone-io.github.io/pinecone-datasets/
- Size: 490 KB
- Stars: 31
- Watchers: 3
- Forks: 14
- Open Issues: 11
- Metadata Files:
  - Readme: README.md
README
# Pinecone Datasets
## Install
```bash
pip install pinecone-datasets
```

### Loading public datasets
Pinecone hosts a public catalog of datasets. You can list and load them by name using the `list_datasets` and `load_dataset` functions, which use the default catalog endpoint (currently GCS).
```python
from pinecone_datasets import list_datasets, load_dataset

list_datasets()
# ["quora_all-MiniLM-L6-bm25", ... ]

dataset = load_dataset("quora_all-MiniLM-L6-bm25")
dataset.head()
# Prints
# ┌─────┬───────────────────────────┬─────────────────────────────────────┬───────────────────┬──────┐
# │ id ┆ values ┆ sparse_values ┆ metadata ┆ blob │
# │ ┆ ┆ ┆ ┆ │
# │ str ┆ list[f32] ┆ struct[2] ┆ struct[3] ┆ │
# ╞═════╪═══════════════════════════╪═════════════════════════════════════╪═══════════════════╪══════╡
# │ 0 ┆ [0.118014, -0.069717, ... ┆ {[470065541, 52922727, ... 22364... ┆ {2017,12,"other"} ┆ .... │
# │ ┆ 0.0060... ┆ ┆ ┆ │
# └─────┴───────────────────────────┴─────────────────────────────────────┴───────────────────┴──────┘
```

## Usage - Accessing data
Each dataset has three main attributes: `documents`, `queries`, and `metadata`. These are lazily loaded the first time they are accessed, so you may notice a delay while the underlying parquet files are downloaded.

Pinecone Datasets is built on top of pandas: `documents` and `queries` are lazy pandas DataFrames, so the full pandas API is available for working with the data. In addition, a few helper functions make common access patterns more convenient.
```python
import pandas as pd

from pinecone_datasets import load_dataset

dataset = load_dataset("quora_all-MiniLM-L6-bm25")
document_df: pd.DataFrame = dataset.documents
query_df: pd.DataFrame = dataset.queries
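
# Standard pandas operations apply from here; a minimal sketch
# (column names follow the schema shown by dataset.head() above):
print(document_df.shape)           # (number of documents, number of columns)
print(document_df["id"].head())    # first few document ids

# Dataset-level metadata lives on the `metadata` attribute, e.g. the dense
# model's dimension (used when creating an index below):
print(dataset.metadata.dense_model.dimension)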
```

## Usage - Iterating over documents
The `Dataset` class has helpers for iterating over your dataset. This is useful for upserting a dataset to an index, or for benchmarking.
```python
# List iterator: yields lists of N dicts, each with keys
# ("id", "values", "sparse_values", "metadata")
dataset.iter_documents(batch_size=n)

# Dict iterator: yields dicts with keys ("vector", "sparse_vector", "filter", "top_k")
dataset.iter_queries()
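
# A minimal consumption sketch: iterate over documents in batches of 100
# (an arbitrary batch size; `dataset` is the object loaded above)
for batch in dataset.iter_documents(batch_size=100):
    print(len(batch))  # each batch is a list of up to 100 document dicts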
```

### Upserting to Index
To upsert data to an index, first install the [Pinecone SDK](https://github.com/pinecone-io/pinecone-python-client).
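For example (the SDK has been published on PyPI as `pinecone-client` and, more recently, as `pinecone`; check the SDK repository for the current package name):

```bash
pip install pinecone-client
```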
```python
from pinecone import Pinecone, ServerlessSpec
from pinecone_datasets import load_dataset, list_datasets

# See what datasets are available
for ds in list_datasets():
    print(ds)

# Download embeddings data
dataset_name = "quora_all-MiniLM-L6-bm25"  # or any name printed above
dataset = load_dataset(dataset_name)

# Instantiate a Pinecone client using an API key from app.pinecone.io
pc = Pinecone(api_key='key')

# Create a Pinecone index
index_config = pc.create_index(
    name="demo-index",
    dimension=dataset.metadata.dense_model.dimension,
    spec=ServerlessSpec(cloud="aws", region="us-east-1"),
)

# Instantiate an index client
index = pc.Index(host=index_config.host)

# Upsert data from the dataset
index.upsert_from_dataframe(df=dataset.documents)
```
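
For finer control over batching, the iterator helper shown earlier can be combined with the SDK's `upsert` method. A minimal sketch, assuming the `index` and `dataset` objects from the example above (the batch size is an arbitrary choice):

```python
# Upsert the dataset in explicit batches of 100 document dicts
for batch in dataset.iter_documents(batch_size=100):
    index.upsert(vectors=batch)
```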