{"id":19468994,"url":"https://github.com/leehanchung/llm-pdf-qa-workshop","last_synced_at":"2025-04-25T11:32:47.348Z","repository":{"id":178524932,"uuid":"658977827","full_name":"leehanchung/llm-pdf-qa-workshop","owner":"leehanchung","description":"Introduction to LLM App Development Workshop: PDF Q\u0026A App using OpenAI, Langchain, and Chainlit","archived":false,"fork":false,"pushed_at":"2023-11-26T07:45:42.000Z","size":2362,"stargazers_count":46,"open_issues_count":0,"forks_count":11,"subscribers_count":2,"default_branch":"main","last_synced_at":"2025-04-03T20:43:26.354Z","etag":null,"topics":["chainlit","chroma","codespaces","langchain","llm","openai"],"latest_commit_sha":null,"homepage":"","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"gpl-3.0","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/leehanchung.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2023-06-26T22:41:51.000Z","updated_at":"2025-03-09T14:07:51.000Z","dependencies_parsed_at":"2024-11-10T18:45:49.038Z","dependency_job_id":null,"html_url":"https://github.com/leehanchung/llm-pdf-qa-workshop","commit_stats":null,"previous_names":["leehanchung/llm-pdf-qa-workshop"],"tags_count":0,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/leehanchung%2Fllm-pdf-qa-workshop","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/leehanchung%2Fllm-pdf-qa-workshop/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/leehanchung%2Fllm-pdf-qa-workshop/releases","manifests_url":"https://repos.ecosyste.
ms/api/v1/hosts/GitHub/repositories/leehanchung%2Fllm-pdf-qa-workshop/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/leehanchung","download_url":"https://codeload.github.com/leehanchung/llm-pdf-qa-workshop/tar.gz/refs/heads/main","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":250808394,"owners_count":21490657,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["chainlit","chroma","codespaces","langchain","llm","openai"],"created_at":"2024-11-10T18:45:41.617Z","updated_at":"2025-04-25T11:32:46.956Z","avatar_url":"https://github.com/leehanchung.png","language":"Python","funding_links":[],"categories":[],"sub_categories":[],"readme":"# Workshop 1: PDF Q\u0026A❓ App using OpenAI, Langchain, and Chainlit\n\nThis repository contains an introductory workshop for learning LLM Application Development using Langchain, OpenAI, and Chainlit. The workshop walks through a simplified process of developing an LLM application that provides a question-answering interface to PDF documents. The prerequisite for the workshop is a basic working knowledge of Git, Linux, and Python. 
The workshop is organized as follows.\n\n| Lab | Learning Objective | Problem | Solution |\n| --- | ------------------ | ------- | :------: |\n| 1   | Basic chat with data LLM App  | 🐒 [PDF Q\u0026A Application](https://github.com/leehanchung/llm-pdf-qa-workshop/tree/lab1/begin) | ✅ [Solution](https://github.com/leehanchung/llm-pdf-qa-workshop/tree/lab1/end) |\n| 2   | Basic prompt engineering      | 🐒 [Improving Q\u0026A Factuality](https://github.com/leehanchung/llm-pdf-qa-workshop/tree/lab2/begin) | ✅ [Solution](https://github.com/leehanchung/llm-pdf-qa-workshop/tree/lab2/end) |\n\nTo run the fully functional application, please check out the [main](https://github.com/leehanchung/llm-pdf-qa-workshop/tree/main) branch and follow the [instructions to run the application](#run-the-application).\n\nThe app provides a chat interface that asks the user to upload a PDF document and then lets the user ask questions about that document. It uses OpenAI's API for the chat and embedding models, Langchain for the framework, and Chainlit as the fullstack interface.\n\nFor the purpose of the workshop, we are using [Gap Q1 2023 Earnings Release](samples/1Q23-EPR-with-Tables-FINAL.pdf) as the example PDF.\n\nThe completed application looks as follows:\n![PDF Q\u0026A App](assets/app.png)\n\n## 🧰 Stack\n\n- [Python](https://www.python.org/downloads/release/python-3100/)\n- [Langchain](https://python.langchain.com/docs/get_started/introduction.html)\n- [Chainlit](https://docs.chainlit.io/overview)\n- [Chroma](https://www.trychroma.com/)\n- [OpenAI](https://openai.com/)\n\n## 👉 Getting Started\n\nWe use [Python Poetry](https://python-poetry.org/) for managing virtual environments and we recommend using [pyenv](https://github.com/pyenv/pyenv) to manage Python versions. 
Alternatively, you can use [Mamba](https://mamba.readthedocs.io/en/latest/) for Python version management.\n\nInstall and start the Poetry shell as follows.\n```bash\npoetry install\npoetry shell\n```\n\nPlease create an `.env` file from `.env.sample` once the application is installed. Edit the `.env` file with your OpenAI org and OpenAI key.\n```bash\ncp .env.sample .env\n```\n\n### Run the Application\n\nRun the application by:\n```bash\nchainlit run app/app.py -w\n```\n\n## Lab 1: Basic chat with data LLM App\n\nThe quintessential LLM application is a chat-with-text application. This type of application uses a retrieval-augmented generation (RAG) design pattern, where the application first retrieves the relevant text from memory and then generates answers based on the retrieved text.\n\nFor our application, we will go through the following steps in the order of execution:\n\n1. User uploads a PDF file.\n2. App loads and decodes the PDF into plain text.\n3. App chunks the text into smaller documents. This is because embedding models have limited input size.\n4. App stores the embeddings in memory.\n5. User asks a question.\n6. App retrieves the relevant documents from memory and generates an answer based on the retrieved text.\n\nThe overall architecture is as follows:\n![init](assets/arch_init.png)\n\nPlease implement the missing pieces in the [application](app/app.py).\n\n### Lab 1: Solution\n\n2. We choose to use [langchain.document_loaders.PDFPlumberLoader](https://python.langchain.com/docs/modules/data_connection/document_loaders/how_to/pdf#using-pdfplumber) to load PDF files. It helps with PDF file metadata in the future. And we like Super Mario Brothers who are plumbers.\n3. We choose to use [langchain.text_splitter.RecursiveCharacterTextSplitter](https://python.langchain.com/docs/modules/data_connection/document_transformers/text_splitters/recursive_text_splitter) to chunk the text into smaller documents.\n4. 
Any in-memory vector store should be suitable for this application since we expect only a single PDF. Anything more is over-engineering. We choose to use [Chroma](https://python.langchain.com/docs/modules/data_connection/vectorstores/integrations/chroma).\n5. We use [langchain.chains.RetrievalQAWithSourcesChain](https://python.langchain.com/docs/modules/chains/popular/vector_db_qa#return-source-documents) since it returns the sources, which helps end users access the source documents.\n\nThe completed application has the following architecture:\n![final](assets/arch_final.png)\n\nNow we can [run the application](#run-the-application).\n\n## Lab 2: Basic prompt engineering\n\nPlaying around with our newly created application using the provided [sample PDF, Gap Q1 2023 Earnings Release](samples/1Q23-EPR-with-Tables-FINAL.pdf), we run into a hallucination problem when we ask the model the following long question to summarize the key results:\n\u003e ```What's the results for the reporter quarter? Please describe in the following order using bullet points - revenue, gross margin, opex, op margin, net income, and EPS. INclude both gaap and non-gaap numbers. Please also include quarter over quarter changes.```\n\nA whopping 36.6% operating margin for a retail business. That is 10x higher than Amazon's consolidated operating margins!!\n\n![hallucination](assets/hallucination.png)\n\nWe then asked the model a simpler and more bounded question:\n\u003e ```What's the op margin for the current quarter```\n\n![fact](assets/fact.png)\n\nAnd it finds the answer.\n\nPlease resolve this hallucination problem with prompt engineering.\n\n### Lab 2: Solution\n\nWe utilized Chainlit's Prompt Playground functionality to experiment with the prompts. First, we investigated the prompts that include the retrieved results. We found that the correct operating margin is included. 
So the model is having a difficult time generating summaries using the right context.\n\nWe found that if we remove the few-shot examples implemented by Langchain, `gpt-3.5-turbo-0613` is able to generate the right answer. However, for some reason, it decided to change the sources into bullet points with summaries. We then experimented further and \"fixed\" the sources prompt.\n\nTo implement the updated prompts in our application, we traced Langchain's Python source code. We found that `RetrievalQAWithSourcesChain` inherits from `BaseQAWithSourcesChain`, which has a class method `from_chain_type()` that uses [`load_qa_with_sources_chain`](https://github.com/hwchase17/langchain/blob/b0859c9b185fe897f3c8e2699835a669b2a2ba61/langchain/chains/qa_with_sources/base.py#L81) to create the chain. The function maps the keyword `stuff` to use [_load_stuff_chain](https://github.com/hwchase17/langchain/blob/b0859c9b185fe897f3c8e2699835a669b2a2ba61/langchain/chains/qa_with_sources/loading.py#L52). We then found that [_load_stuff_chain](https://github.com/hwchase17/langchain/blob/b0859c9b185fe897f3c8e2699835a669b2a2ba61/langchain/chains/qa_with_sources/loading.py#L52) takes a `prompt` variable and a `document_prompt` variable to create a [StuffDocumentsChain](https://github.com/hwchase17/langchain/blob/b0859c9b185fe897f3c8e2699835a669b2a2ba61/langchain/chains/combine_documents/stuff.py#L22) for doing the QA as a document summarization task.\n\nThe composition of the overall prompt is as follows:\n![Alt text](assets/stuff_chain.png)\n\nWe then extracted [the prompts into their own file](app/prompts.py) and implemented them there. 
We then initialize the `RetrievalQAWithSourcesChain` with our custom prompts!\n\n## LICENSE\n\nThis repository is open source under GPLv3. Please cite it as:\n```bibtex\n@misc{PDF_QA_App_Workshop,\n  author = {Lee, Hanchung},\n  title = {Workshop 1: PDF Q\u0026A App using OpenAI, Langchain, and Chainlit},\n  url = {https://github.com/leehanchung/llm-pdf-qa-workshop},\n  year = {2023},\n  month = {6},\n  howpublished = {Github Repo},\n}\n```\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fleehanchung%2Fllm-pdf-qa-workshop","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fleehanchung%2Fllm-pdf-qa-workshop","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fleehanchung%2Fllm-pdf-qa-workshop/lists"}