{"id":13678636,"url":"https://github.com/pinecone-io/pinecone-datasets","last_synced_at":"2025-04-29T15:32:25.308Z","repository":{"id":188267101,"uuid":"602242720","full_name":"pinecone-io/pinecone-datasets","owner":"pinecone-io","description":"An open-source dataset library for pre-embedded dataset: create your own data catalog, or use Pinecone's public datasets.","archived":false,"fork":false,"pushed_at":"2024-05-20T17:56:38.000Z","size":502,"stargazers_count":31,"open_issues_count":11,"forks_count":14,"subscribers_count":3,"default_branch":"main","last_synced_at":"2024-08-02T13:24:14.178Z","etag":null,"topics":["data","database","embeddings","vector"],"latest_commit_sha":null,"homepage":"https://pinecone-io.github.io/pinecone-datasets/","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":null,"status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/pinecone-io.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":null,"code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null}},"created_at":"2023-02-15T19:50:06.000Z","updated_at":"2024-07-11T20:05:05.000Z","dependencies_parsed_at":"2023-08-14T16:35:15.078Z","dependency_job_id":"7595adab-e384-4f55-8997-4dd4f4b89f63","html_url":"https://github.com/pinecone-io/pinecone-datasets","commit_stats":null,"previous_names":["pinecone-io/pinecone-datasets"],"tags_count":5,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/pinecone-io%2Fpinecone-datasets","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/pinecone-io%2Fpinecone-datasets/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/pinecone-io%2Fpinecone-datasets/releases","manifests_url"
:"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/pinecone-io%2Fpinecone-datasets/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/pinecone-io","download_url":"https://codeload.github.com/pinecone-io/pinecone-datasets/tar.gz/refs/heads/main","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":224178994,"owners_count":17268984,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["data","database","embeddings","vector"],"created_at":"2024-08-02T13:00:56.368Z","updated_at":"2025-04-29T15:32:25.297Z","avatar_url":"https://github.com/pinecone-io.png","language":"Python","funding_links":[],"categories":["Python"],"sub_categories":[],"readme":"# Pinecone Datasets\n\n## Installation\n\n```bash\npip install pinecone-datasets\n```\n\n### Loading public datasets\n\nPinecone hosts a public dataset catalog. You can list the available datasets with `list_datasets` and load one by name with `load_dataset`. Both functions use the default catalog endpoint (currently GCS).\n\n```python\nfrom pinecone_datasets import list_datasets, load_dataset\n\nlist_datasets()\n# [\"quora_all-MiniLM-L6-bm25\", ... 
]\n\ndataset = load_dataset(\"quora_all-MiniLM-L6-bm25\")\n\ndataset.head()\n\n# Prints\n# ┌─────┬───────────────────────────┬─────────────────────────────────────┬───────────────────┬──────┐\n# │ id  ┆ values                    ┆ sparse_values                       ┆ metadata          ┆ blob │\n# │     ┆                           ┆                                     ┆                   ┆      │\n# │ str ┆ list[f32]                 ┆ struct[2]                           ┆ struct[3]         ┆      │\n# ╞═════╪═══════════════════════════╪═════════════════════════════════════╪═══════════════════╪══════╡\n# │ 0   ┆ [0.118014, -0.069717, ... ┆ {[470065541, 52922727, ... 22364... ┆ {2017,12,\"other\"} ┆ .... │\n# │     ┆ 0.0060...                 ┆                                     ┆                   ┆      │\n# └─────┴───────────────────────────┴─────────────────────────────────────┴───────────────────┴──────┘\n```\n\n\n## Usage - Accessing data\n\nEach dataset has three main attributes: `documents`, `queries`, and `metadata`. They are lazily loaded the first time they are accessed, so you may notice a delay on first access while the underlying Parquet files are downloaded.\n\nPinecone Datasets is built on top of pandas. `documents` and `queries` are lazily loaded pandas dataframes, so you can use the full pandas API to access the data. In addition, we provide some helper functions to access the data in a more convenient way. \n\nAccess the documents and queries dataframes through the `documents` and `queries` properties. These properties are lazy and load the data only when first accessed. 
\n\n```python\nimport pandas as pd\nfrom pinecone_datasets import list_datasets, load_dataset\n\ndataset = load_dataset(\"quora_all-MiniLM-L6-bm25\")\n\ndocument_df: pd.DataFrame = dataset.documents\n\nquery_df: pd.DataFrame = dataset.queries\n```\n\n\n## Usage - Iterating over documents\n\nThe `Dataset` class has helpers for iterating over your dataset. This is useful for upserting a dataset to an index, or for benchmarking.\n\n```python\n\n# List iterator: yields lists of `batch_size` dicts with keys (\"id\", \"values\", \"sparse_values\", \"metadata\")\ndataset.iter_documents(batch_size=n) \n\n# Dict iterator: yields dicts with keys (\"vector\", \"sparse_vector\", \"filter\", \"top_k\")\ndataset.iter_queries()\n```\n\n### Upserting to Index\n\nTo upsert data to an index, install the [Pinecone SDK](https://github.com/pinecone-io/pinecone-python-client).\n\n```python\nfrom pinecone import Pinecone, ServerlessSpec\nfrom pinecone_datasets import load_dataset, list_datasets\n\n# See what datasets are available\nfor ds in list_datasets():\n    print(ds)\n\n# Download embeddings data (any name returned by list_datasets() works here)\ndataset_name = \"quora_all-MiniLM-L6-bm25\"\ndataset = load_dataset(dataset_name)\n\n# Instantiate a Pinecone client using API key from app.pinecone.io\npc = Pinecone(api_key='key')\n\n# Create a Pinecone index\nindex_config = pc.create_index(\n    name=\"demo-index\",\n    dimension=dataset.metadata.dense_model.dimension,\n    spec=ServerlessSpec(cloud=\"aws\", region=\"us-east-1\")\n)\n\n# Instantiate an index client\nindex = pc.Index(host=index_config.host)\n\n# Upsert data from the dataset\nindex.upsert_from_dataframe(df=dataset.documents)\n```\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fpinecone-io%2Fpinecone-datasets","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fpinecone-io%2Fpinecone-datasets","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fpinecone-io%2Fpinecone-datasets/lists"}