{"id":16277633,"url":"https://github.com/aronweiler/doctalk","last_synced_at":"2025-09-17T20:31:43.896Z","repository":{"id":175231851,"uuid":"653458057","full_name":"aronweiler/DocTalk","owner":"aronweiler","description":"This started out as a POC for chatting over my documents, but has turned into a whole framework for using LLMs.","archived":false,"fork":false,"pushed_at":"2023-08-10T05:21:31.000Z","size":10626,"stargazers_count":5,"open_issues_count":0,"forks_count":0,"subscribers_count":1,"default_branch":"main","last_synced_at":"2025-06-01T21:47:30.148Z","etag":null,"topics":["chatbot","chatgpt","chatgpt-api","documents","llm","llms","localllm","openai"],"latest_commit_sha":null,"homepage":"","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"gpl-3.0","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/aronweiler.png","metadata":{"files":{"readme":"readme.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2023-06-14T05:11:40.000Z","updated_at":"2025-03-18T17:46:42.000Z","dependencies_parsed_at":"2024-11-05T06:36:56.183Z","dependency_job_id":null,"html_url":"https://github.com/aronweiler/DocTalk","commit_stats":null,"previous_names":["aronweiler/doctalk"],"tags_count":0,"template":false,"template_full_name":null,"purl":"pkg:github/aronweiler/DocTalk","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/aronweiler%2FDocTalk","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/aronweiler%2FDocTalk/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/aronweiler%2FDocTalk/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/aronweiler%2FDocTalk/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/aronweiler","download_url":"https://codeload.github.com/aronweiler/DocTalk/tar.gz/refs/heads/main","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/aronweiler%2FDocTalk/sbom","scorecard":null,"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":275658692,"owners_count":25504776,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","status":"online","status_checked_at":"2025-09-17T02:00:09.119Z","response_time":84,"last_error":null,"robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":true,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["chatbot","chatgpt","chatgpt-api","documents","llm","llms","localllm","openai"],"created_at":"2024-10-10T18:55:48.284Z","updated_at":"2025-09-17T20:31:42.465Z","avatar_url":"https://github.com/aronweiler.png","language":"Python","funding_links":[],"categories":[],"sub_categories":[],"readme":"# DocTalk\nDocTalk is a project I'm working on to try to build my own LLM document chat.  \n\nI'm not 100% sure what I'm doing, but it's been great so far.  \n\n*See [Update Notes](#update-notes) for changes I am making after the initial commit to this project.*\n\nFeel free to play around and please for the love of science, give me some feedback.\n\nI legit have no idea if this is going to be useful or anything, but it's certainly teaching me python, and renewing my interest in [snake_case](https://en.wikipedia.org/wiki/Snake_case) variables.\n\n## Major Update July 9th, 2023:\nI pretty much gutted the project and moved a bunch of things around.  I implemented a different architecture, with the runners and what not. \n\nSome day soon I will fill the rest of this documentation in!\n\n## Basic Usage (python developers)\n1. To create the python env, and install requirements, run: [install.ps1](install.ps1)\n2. Set your `OPENAI_API_KEY` environment variable, if you are going to use OpenAI's API. See [.env.template](.env.template) for guidance.\n3. Load your documents using [ingest_documents.py](/src/ingest_documents.py)\n    - Options for running the document loader include:\n      - `--document_directory`: Directory from which to load documents\n      - `--database_name`: The name of the database where you'd like to store the loaded documents\n      - `--run_open_ai`: When set, this will force the use of the OpenAI LLM and embeddings.  Make sure you set your API key.\n      - `--split_documents`: If this is present, the loader will split loaded documents into smaller chunks\n      - `--split_chunks`: How big the chunk sizes should be\n      - `--split_overlap`: How much of an overlap there should be between chunks\n  \n4. Select a configuration file [from the configurations folder](configurations/), or create your own\n   - Currently there are a few supported AIs and runners- check the [run.py](src/run.py) for the supported types.\n5. Once you've loaded your documents, and selected a configuration file, run `run.py --config=\u003cpath to config file\u003e`\n\n## Usage (non-developers)\n*Coming Soon*\n\n## **Random notes** \nThe following is mostly copy/paste stuff I use (or used) frequently\n\n### **Creating env**\n``` shell\npython -m venv doctalk_venv\n```\n\n### **Fixing pip issues-- upgrading pip, clearing cache, reinstalling dependencies**\n``` shell\npython.exe -m pip install --upgrade pip\npip cache purge\npip --no-cache-dir install -r requirements.txt\n```\n### **Why isn't my llama-cpp working on my GPU?**\nProbably because you ran the `/requirements.txt` install before getting here.  Make sure to set these environment variables before installing llama-cpp next time.\n``` powershell\n$env:CMAKE_ARGS=\"-DLLAMA_CUBLAS=on\"      \n$env:FORCE_CMAKE=1\n$env:LLAMA_CUBLAS=1   \n```\nAnd this next one is for when you have to force a re-install of llama-cpp because you left the instructions for the GPU below the `/requirements.txt` install 🙄\n\n`pip install --no-cache-dir --force-reinstall llama-cpp-python`\n\n### **Random CUDA Memory Error**\nSometimes a random CUDA memory error will show up.  Use this:\n``` powershell\n$env:GGML_CUDA_NO_PINNED=1\n```\n\n## **TODO List**\n- langchain related (although I could do these manually if I want to spend the time learning it??):\n  - Add tool to allow LLM to google search and provide answers (google sign in)\n  - Add tool to allow the LLM to dynamically retrieve individual documents, vs. pre-processing a folder (e.g. from a website, or local folder)\n    - repurpose [scrape_pdfs.py](/scrape_pdfs.py)\n- Probably other things\n- Documentation??  lol\n\n## **Resources to look at**\n- [Question answering using embeddings](https://github.com/openai/openai-cookbook/blob/main/examples/Question_answering_using_embeddings.ipynb)\n- [Open LLM Leaderboard](https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard)\n- Best source for models: [TheBloke on HuggingFace](https://huggingface.co/TheBloke)\n- [LangChain Dev Blog](https://blog.langchain.dev/)\n\n# Update Notes\n\n- 6/15/2023: \n  - Started to rework the project to separate the local and hosted (OpenAI) LLM stuff.  There are different prompting techniques, and other stuff that I want to play with when it comes to local vs. hosted LLMs.\n  - Renamed run_llm.py to run_local_llm.py\n  - Added run_chain.py\n  - Updated some other random stuff\n\n- 6/20/2023\n  - Updated splitting in [document_loader.py](/src/document_loader.py) so that it splits on newlines before hitting the character max.    \n  - Added [install.ps1](install.ps1)\n  - Added support for top_k in non-local llms\n\n- 6/21/2023\n  - Added command line support for [run_chain.py](/src/run_chain.py) and [document_loader.py](/src/document_loader.py)\n  - Removed old unused code\n  - Collapsed the local and remote LLM access (using langchain) into one file [run_chain.py](src/run_chain.py)\n\n- 6/23/2023\n  - Added multi-document store querying capabilities using [run_react_agent.py](src/run_react_agent.py)\n  - Loading user defined tools using [tool_loader.py](src/tool_loader.py)\n  - Added an example tool configuration for my work-related stuff, [medical_device_config.json](/tool_configurations/medical_device_config.json)\n\n- 6/24/2023\n  - Updated ReAct agent to support self-ask, and call tools in a dynamic way: [run_react_agent.py](src/run_react_agent.py)\n\n- 7/9/2023\n  - Major refactor and reorganization\n  - Removed a bunch of unused old stuff\n  - Implemented selection of AI (QA chain for now) and runners\n  - Simplified document ingestion and running\n  - Added better support for API\n  - Started on getting Docker into the solution","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Faronweiler%2Fdoctalk","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Faronweiler%2Fdoctalk","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Faronweiler%2Fdoctalk/lists"}