{"id":21973334,"url":"https://github.com/rishisolanke/pdf_query_langchain","last_synced_at":"2026-04-17T15:31:24.601Z","repository":{"id":249849086,"uuid":"832735523","full_name":"rishisolanke/PDF_Query_Langchain","owner":"rishisolanke","description":"PDF Query LangChain is a tool that extracts and queries information from PDF documents using advanced language processing. Leveraging LangChain, OpenAI, and Cassandra, this app enables efficient, interactive querying of PDF content. Ideal for data analysis, research, and automated reporting, it simplifies detailed document analysis with ease.","archived":false,"fork":false,"pushed_at":"2024-07-23T16:23:08.000Z","size":5,"stargazers_count":0,"open_issues_count":0,"forks_count":0,"subscribers_count":1,"default_branch":"main","last_synced_at":"2025-03-22T23:25:00.680Z","etag":null,"topics":["artificial-intelligence","data-analysis","document-query","langchain","natural-language-processing","nlp","openai","pdf-analysis","pdf-extraction","python","research-tool"],"latest_commit_sha":null,"homepage":"","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":null,"status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/rishisolanke.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":null,"code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2024-07-23T16:06:19.000Z","updated_at":"2024-07-23T16:30:01.000Z","dependencies_parsed_at":"2024-07-23T19:24:29.523Z","dependency_job_id":null,"html_url":"https://github.com/rishisolanke/PDF_Query_Langchain","commit_stats":null,"previous_names":["rishisolanke/pdf_query_langchain"],"tags_count":0,"template":false,"template_full_name":null,"purl":"pkg:github/rishisolanke/PD
F_Query_Langchain","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/rishisolanke%2FPDF_Query_Langchain","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/rishisolanke%2FPDF_Query_Langchain/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/rishisolanke%2FPDF_Query_Langchain/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/rishisolanke%2FPDF_Query_Langchain/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/rishisolanke","download_url":"https://codeload.github.com/rishisolanke/PDF_Query_Langchain/tar.gz/refs/heads/main","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/rishisolanke%2FPDF_Query_Langchain/sbom","scorecard":null,"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":286080680,"owners_count":31934328,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2026-04-17T12:37:54.787Z","status":"ssl_error","status_checked_at":"2026-04-17T12:37:25.095Z","response_time":62,"last_error":"SSL_read: unexpected eof while 
reading","robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":false,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["artificial-intelligence","data-analysis","document-query","langchain","natural-language-processing","nlp","openai","pdf-analysis","pdf-extraction","python","research-tool"],"created_at":"2024-11-29T15:26:35.160Z","updated_at":"2026-04-17T15:31:24.581Z","avatar_url":"https://github.com/rishisolanke.png","language":"Python","funding_links":[],"categories":[],"sub_categories":[],"readme":"Overview\n\nPDF Query LangChain is a versatile tool designed to streamline the extraction and querying of information from PDF documents. 
Leveraging LangChain’s powerful language processing capabilities, OpenAI’s language models, and Cassandra’s vector store, this application provides an efficient, interactive way to query PDF content.\n\nFeatures\n\n    •\tPDF Text Extraction: Automatically extracts text from PDF files using PyPDF2 for easy processing.\n    •\tIntelligent Text Splitting: Splits extracted text into manageable chunks to stay within token limits and improve query accuracy.\n    •\tVector Store Integration: Uses Cassandra to create and manage a vector store for efficient text storage and retrieval.\n    •\tAdvanced Language Models: Integrates OpenAI’s language models for embedding and querying text data.\n    •\tInteractive Question-Answer Interface: Allows users to input queries and receive relevant answers from the PDF content in real time.\n    •\tRelevance-Based Document Retrieval: Displays the most relevant documents based on the query, along with their relevance scores.\n\nInstallation and Setup\n\nClone the Repository\n\n    git clone https://github.com/rishisolanke/PDF_Query_Langchain.git\n    cd PDF_Query_Langchain\n    \nInstall Dependencies\n\n    pip install -q cassio datasets langchain openai tiktoken\n    pip install pyarrow==14.0.1\n    pip install requests==2.28.2\n    pip check\n    pip install PyPDF2\n    \nLangChain and CassIO Components:\n\n    pip install langchain\n    pip install langchain-community\n    pip install cassio\n    \nImport the components and create the OpenAI clients:\n\n    from langchain.vectorstores.cassandra import Cassandra\n    from langchain.indexes.vectorstore import VectorStoreIndexWrapper\n    from langchain.llms import OpenAI\n    from langchain.embeddings import OpenAIEmbeddings\n    \n    # Assumption: llm and embedding, which later steps use, are created with an OpenAI API key\n    llm = OpenAI(openai_api_key=\"YOUR_OPENAI_API_KEY\")\n    embedding = OpenAIEmbeddings(openai_api_key=\"YOUR_OPENAI_API_KEY\")\n    \nInitialize Database Connection:\n\n    import cassio\n    cassio.init(token=\"YOUR_ASTRA_DB_APPLICATION_TOKEN\", database_id=\"YOUR_ASTRA_DB_ID\")\n    \nRead and Extract Text from PDF:\n    \n    from PyPDF2 import PdfReader\n    pdfreader = PdfReader('path_to_your_pdf.pdf')\n    raw_text = ''\n
    for page in pdfreader.pages:\n        content = page.extract_text()\n        # Skip pages that yield no extractable text\n        if content:\n            raw_text += content\n        \nText Splitting:\n\n    from langchain.text_splitter import CharacterTextSplitter\n    text_splitter = CharacterTextSplitter(\n        separator=\"\\n\",\n        chunk_size=800,\n        chunk_overlap=200,\n        length_function=len,\n    )\n    texts = text_splitter.split_text(raw_text)\n    \nCreate and Load Vector Store:\n\n    # session=None and keyspace=None fall back to the connection set up by cassio.init()\n    astra_vector_store = Cassandra(\n        embedding=embedding,\n        table_name=\"qa_mini_demo\",\n        session=None,\n        keyspace=None,\n    )\n    # Only the first 50 chunks are loaded\n    astra_vector_store.add_texts(texts[:50])\n    astra_vector_index = VectorStoreIndexWrapper(vectorstore=astra_vector_store)\n    \nUsage\n\nInteractive QA:\n\n    first_question = True\n    while True:\n        if first_question:\n            query_text = input(\"\\nEnter your question (or type 'quit' to exit): \").strip()\n        else:\n            query_text = input(\"\\nWhat's your next question (or type 'quit' to exit): \").strip()\n    \n        if query_text.lower() == 'quit':\n            break\n    \n        first_question = False\n    \n        print(\"\\nQUESTION: \\\"%s\\\"\" % query_text)\n        answer = astra_vector_index.query(query_text, llm=llm).strip()\n        print(\"\\nANSWER: \\\"%s\\\"\\n\" % answer)\n    \n        print(\"FIRST DOCUMENTS BY RELEVANCE:\")\n        # Each result is a (document, score) pair\n        for doc, score in astra_vector_store.similarity_search_with_score(query_text, k=4):\n            print(\"\\nScore: %.4f\\n%s\\n\" % (score, doc.page_content[:84]))\n            \nUsers enter questions at the prompt to extract specific information from the PDF content. 
The app answers each query using the vector store and the language model, then displays the most relevant document chunks with their relevance scores for additional context.\n\nApplications\n\n    •\tData Analysis: Extract and analyze specific data points from large PDF documents.\n    •\tResearch: Retrieve relevant information for academic or professional research.\n    •\tAutomated Reporting: Generate reports by querying specific sections of PDF documents.\n    •\tLegal and Compliance: Quickly find relevant legal clauses or compliance information within lengthy documents.\n\nThis application simplifies querying and extracting information from PDFs, making it useful for use cases that require detailed document analysis.\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Frishisolanke%2Fpdf_query_langchain","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Frishisolanke%2Fpdf_query_langchain","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Frishisolanke%2Fpdf_query_langchain/lists"}