{"id":28406893,"url":"https://github.com/activeloopai/langchain-deeplake","last_synced_at":"2026-02-16T17:33:17.959Z","repository":{"id":279431305,"uuid":"922278871","full_name":"activeloopai/langchain-deeplake","owner":"activeloopai","description":null,"archived":false,"fork":false,"pushed_at":"2025-02-20T12:33:24.000Z","size":95,"stargazers_count":3,"open_issues_count":0,"forks_count":1,"subscribers_count":1,"default_branch":"main","last_synced_at":"2026-02-07T17:26:51.361Z","etag":null,"topics":[],"latest_commit_sha":null,"homepage":null,"language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/activeloopai.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2025-01-25T19:31:17.000Z","updated_at":"2025-08-15T00:42:32.000Z","dependencies_parsed_at":"2025-02-25T14:59:39.655Z","dependency_job_id":"afdabde4-e2dd-4116-8c9e-0759638c4144","html_url":"https://github.com/activeloopai/langchain-deeplake","commit_stats":null,"previous_names":["activeloopai/langchain-deeplake"],"tags_count":1,"template":false,"template_full_name":null,"purl":"pkg:github/activeloopai/langchain-deeplake","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/activeloopai%2Flangchain-deeplake","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/activeloopai%2Flangchain-deeplake/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/activeloopai%2Flangchain-deeplake/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/activeloopai%2Flangchain-deeplake/manif
ests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/activeloopai","download_url":"https://codeload.github.com/activeloopai/langchain-deeplake/tar.gz/refs/heads/main","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/activeloopai%2Flangchain-deeplake/sbom","scorecard":null,"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":286080680,"owners_count":29513989,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2026-02-16T09:05:14.864Z","status":"ssl_error","status_checked_at":"2026-02-16T08:55:59.364Z","response_time":115,"last_error":"SSL_read: unexpected eof while reading","robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":false,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":[],"created_at":"2025-06-01T23:10:37.976Z","updated_at":"2026-02-16T17:33:17.931Z","avatar_url":"https://github.com/activeloopai.png","language":"Python","funding_links":[],"categories":[],"sub_categories":[],"readme":"# langchain-deeplake\n\nThis package contains the LangChain integration with Deeplake\n\n## Installation\n\n```bash\npip install -U langchain-deeplake\n```\n\n## Usage\n\n```python\nfrom langchain_deeplake import DeeplakeVectorStore\n```\n\n\n## How to Use Deep Lake as a Vector Store in LangChain\nDeep Lake can be used as a VectorStore in [LangChain](https://github.com/langchain-ai/langchain) for building Apps that require filtering and vector search. 
In this tutorial, we will show how to create a Deep Lake Vector Store in LangChain and use it to build a Q\u0026A App about the [Twitter OSS recommendation algorithm](https://github.com/twitter/the-algorithm). This tutorial requires installation of the following libraries:\n\n```bash\npip install --upgrade --quiet  langchain-openai langchain-deeplake tiktoken\n```\n## Downloading and Preprocessing the Data\nFirst, let's import the necessary packages and make sure the Activeloop and OpenAI keys are in the environment variables `ACTIVELOOP_TOKEN` and `OPENAI_API_KEY`.\n\n\n\n\n```python\nimport os\nimport getpass\nfrom langchain_openai import OpenAIEmbeddings\nfrom langchain_deeplake.vectorstores import DeeplakeVectorStore\nfrom langchain_community.document_loaders import TextLoader\nfrom langchain_text_splitters import CharacterTextSplitter\nfrom langchain.chains import RetrievalQA\nfrom langchain_openai import ChatOpenAI\n```\n\nNext, we set the environment variables if they are not already defined:\n```python\nif \"OPENAI_API_KEY\" not in os.environ:\n    os.environ[\"OPENAI_API_KEY\"] = getpass.getpass(\"OpenAI API Key:\")\n\nif \"ACTIVELOOP_TOKEN\" not in os.environ:\n    os.environ[\"ACTIVELOOP_TOKEN\"] = getpass.getpass(\"activeloop token:\")\n```\n\nNext, let's clone the Twitter OSS recommendation algorithm:\n\n```bash\n!git clone https://github.com/twitter/the-algorithm\n```\n\nNext, let's load all the files from the repo into a list:\n\n\n```python\nrepo_path = '/the-algorithm'\n\ndocs = []\nfor dirpath, dirnames, filenames in os.walk(repo_path):\n    for file in filenames:\n        try:\n            loader = TextLoader(os.path.join(dirpath, file), encoding='utf-8')\n            docs.extend(loader.load_and_split())\n        except Exception as e:\n            print(e)\n            pass\n```\n\n## A note on chunking text files\n\nText files are typically split into chunks before creating embeddings. 
In general, more chunks increase the relevance of the data that is fed into the language model, since granular data can be selected with higher precision. However, since an embedding will be created for each chunk, more chunks also increase the computational cost.\n\n```python\ntext_splitter = CharacterTextSplitter(chunk_size=1000, chunk_overlap=0)\ntexts = text_splitter.split_documents(docs)\n```\n\n## Creating the Deep Lake Vector Store\n\nFirst, we specify a path for storing the Deep Lake dataset containing the embeddings and their metadata.\n\n```python\ndataset_path = 'al://\u003corg-id\u003e/twitter_algorithm'\n```\n\nNext, we specify an OpenAI embedding model for creating the embeddings, and create the VectorStore. This process creates an embedding for each element in the `texts` list and stores it in Deep Lake format at the specified path. \n\n```python\nembeddings = OpenAIEmbeddings()\n```\n\n\n```python\ndb = DeeplakeVectorStore.from_documents(dataset_path=dataset_path, embedding=embeddings, documents=texts, overwrite=True)\n```\n\nThe Deep Lake Vector Store has 4 columns: `documents`, `embeddings`, `ids`, and `metadata`.\n\n```python\ndb.dataset.summary()\n```\n\n```bash\nDataset length: 31305\nColumns:\n  documents : text\n  embeddings: embedding(1536, clustered)\n  ids       : text\n  metadata  : dict\n```\n\n## Use the Vector Store in a Q\u0026A App\n\nWe can now use the VectorStore in a Q\u0026A app, where the embeddings will be used to filter relevant documents (`texts`) that are fed into an LLM in order to answer a question.\n\nIf we were on another machine, we would load the existing Vector Store without recalculating the embeddings:\n\n```python\ndb = DeeplakeVectorStore(dataset_path=dataset_path, read_only=True, embedding_function=embeddings)\n\n```\n\nWe have to create a `retriever` object and specify the search parameters.\n\n```python\nretriever = db.as_retriever()\nretriever.search_kwargs['distance_metric'] = 
'cos'\nretriever.search_kwargs['k'] = 20\n```\n\nFinally, let's create a `RetrievalQA` chain in LangChain and run it:\n\n```python\nmodel = ChatOpenAI(model='gpt-3.5-turbo')\nqa = RetrievalQA.from_llm(model, retriever=retriever)\n```\n\n```python\nqa.run('What programming language is most of the SimClusters written in?')\n```\n\nThis returns:\n```\nMost of the SimClusters code is written in Scala as indicated by the packages such as `com.twitter.simclustersann.modules`, `com.twitter.simclusters_v2.scio.common`, `com.twitter.simclusters_v2.summingbird.storm`, and references to Scala-based GCP jobs.\n```\n\n\n## Accessing the Low Level Deep Lake API (Advanced)\nWhen using a Deep Lake Vector Store in LangChain, the underlying Vector Store and its low-level Deep Lake dataset can be accessed via:\n\n```python\n# LangChain Vector Store\ndb = DeeplakeVectorStore(dataset_path=dataset_path)\n\n# Deep Lake Dataset object\nds = db.dataset\n```\n\n## SelfQueryRetriever with Deep Lake\n\nDeep Lake supports the [SelfQueryRetriever](https://api.python.langchain.com/en/latest/retrievers/langchain.retrievers.self_query.base.SelfQueryRetriever.html) implementation in LangChain, which translates a user prompt into metadata filters.\n\n\n\u003eThis section of the tutorial requires installation of additional packages:\n\u003e   `pip install deeplake lark`\n\nFirst, let's create a Deep Lake Vector Store with relevant data using the documents below.\n\n```python\nfrom langchain_core.documents import Document\n\ndocs = [\n    Document(\n        page_content=\"A bunch of scientists bring back dinosaurs and mayhem breaks loose\",\n        metadata={\"year\": 1993, \"rating\": 7.7, \"genre\": \"science fiction\"},\n    ),\n    Document(\n        page_content=\"Leo DiCaprio gets lost in a dream within a dream within a dream within a ...\",\n        metadata={\"year\": 2010, \"director\": \"Christopher Nolan\", \"rating\": 8.2},\n    ),\n    Document(\n        page_content=\"A psychologist / 
detective gets lost in a series of dreams within dreams within dreams and Inception reused the idea\",\n        metadata={\"year\": 2006, \"director\": \"Satoshi Kon\", \"rating\": 8.6},\n    ),\n    Document(\n        page_content=\"A bunch of normal-sized women are supremely wholesome and some men pine after them\",\n        metadata={\"year\": 2019, \"director\": \"Greta Gerwig\", \"rating\": 8.3},\n    ),\n    Document(\n        page_content=\"Toys come alive and have a blast doing so\",\n        metadata={\"year\": 1995, \"genre\": \"animated\"},\n    ),\n    Document(\n        page_content=\"Three men walk into the Zone, three men walk out of the Zone\",\n        metadata={\n            \"year\": 1979,\n            \"rating\": 9.9,\n            \"director\": \"Andrei Tarkovsky\",\n            \"genre\": \"science fiction\",\n        },\n    ),\n]\n```\n\nSince this feature uses Deep Lake's [Tensor Query Language](https://docs.deeplake.ai/latest/advanced/tql/) under the hood, the Vector Store must be stored in or connected to Deep Lake, which requires [registration with Activeloop](https://app.activeloop.ai/levongh/home):\n\n```python\norg_id = \"\u003cYOUR_ORG_ID\u003e\"  # replace with your Activeloop organization ID\ndataset_path = f\"al://{org_id}/self_query\"\n\nvectorstore = DeeplakeVectorStore.from_documents(\n    docs, embeddings, dataset_path=dataset_path, overwrite=True,\n)\n```\n\nNext, let's instantiate our retriever by providing information about the metadata fields that our documents support and a short description of the document contents.\n\n```python\nfrom langchain_openai import OpenAI\nfrom langchain.retrievers.self_query.base import SelfQueryRetriever\nfrom langchain.chains.query_constructor.base import AttributeInfo\n\nmetadata_field_info = [\n    AttributeInfo(\n        name=\"genre\",\n        description=\"The genre of the movie\",\n        type=\"string or list[string]\",\n    ),\n    AttributeInfo(\n        name=\"year\",\n        description=\"The year 
the movie was released\",\n        type=\"integer\",\n    ),\n    AttributeInfo(\n        name=\"director\",\n        description=\"The name of the movie director\",\n        type=\"string\",\n    ),\n    AttributeInfo(\n        name=\"rating\", description=\"A 1-10 rating for the movie\", type=\"float\"\n    ),\n]\n\ndocument_content_description = \"Brief summary of a movie\"\nllm = OpenAI(temperature=0)\n\nretriever = SelfQueryRetriever.from_llm(\n    llm, vectorstore, document_content_description, metadata_field_info, verbose=True\n)\n```\n\nAnd now we can try actually using our retriever!\n\n```python\n# This example only specifies a relevant query\nretriever.get_relevant_documents(\"What are some movies about dinosaurs\")\n```\n\nOutput:\n```\n[Document(metadata={'genre': 'science fiction', 'rating': 7.7, 'year': 1993}, page_content='A bunch of scientists bring back dinosaurs and mayhem breaks loose'),\n Document(metadata={'genre': 'science fiction', 'rating': 7.7, 'year': 1993}, page_content='A bunch of scientists bring back dinosaurs and mayhem breaks loose'),\n Document(metadata={'genre': 'animated', 'year': 1995}, page_content='Toys come alive and have a blast doing so'),\n Document(metadata={'genre': 'animated', 'year': 1995}, page_content='Toys come alive and have a blast doing so')]\n```\n\nNow we can run a query to find movies that are rated above a certain threshold:\n\n```python\n# This example only specifies a filter\nretriever.get_relevant_documents(\"I want to watch a movie rated higher than 8.5\")\n```\n\nOutput:\n```\n[Document(metadata={'director': 'Satoshi Kon', 'rating': 8.6, 'year': 2006}, page_content='A psychologist / detective gets lost in a series of dreams within dreams within dreams and Inception reused the idea'),\n Document(metadata={'director': 'Andrei Tarkovsky', 'genre': 'science fiction', 'rating': 9.9, 'year': 1979}, page_content='Three men walk into the Zone, three men walk out of the Zone'),\n Document(metadata={'director': 'Satoshi 
Kon', 'rating': 8.6, 'year': 2006}, page_content='A psychologist / detective gets lost in a series of dreams within dreams within dreams and Inception reused the idea'),\n Document(metadata={'director': 'Andrei Tarkovsky', 'genre': 'science fiction', 'rating': 9.9, 'year': 1979}, page_content='Three men walk into the Zone, three men walk out of the Zone')]\n```\n\n\nCongrats! You just used the Deep Lake Vector Store in LangChain to create a Q\u0026A App! 🎉\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Factiveloopai%2Flangchain-deeplake","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Factiveloopai%2Flangchain-deeplake","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Factiveloopai%2Flangchain-deeplake/lists"}