{"id":20618190,"url":"https://github.com/tddschn/langchain-utils","last_synced_at":"2025-04-15T11:37:33.135Z","repository":{"id":153245298,"uuid":"625430564","full_name":"tddschn/langchain-utils","owner":"tddschn","description":"LangChain Utilities for prompt generation from documents, URLs, and arbitrary files - streamlining your interactive workflow with LLMs!","archived":false,"fork":false,"pushed_at":"2024-06-05T17:20:20.000Z","size":763,"stargazers_count":9,"open_issues_count":4,"forks_count":2,"subscribers_count":2,"default_branch":"master","last_synced_at":"2025-04-09T09:09:39.921Z","etag":null,"topics":["chatgpt","cli","langchain","llm","pandoc","prompt","python3","utility"],"latest_commit_sha":null,"homepage":"https://pypi.org/project/langchain-utils","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":null,"status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/tddschn.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":null,"code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2023-04-09T04:40:52.000Z","updated_at":"2024-12-16T07:24:34.000Z","dependencies_parsed_at":"2024-06-05T19:21:51.549Z","dependency_job_id":"b4ae3910-e486-4379-867e-ce1d80064ab2","html_url":"https://github.com/tddschn/langchain-utils","commit_stats":null,"previous_names":[],"tags_count":56,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/tddschn%2Flangchain-utils","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/tddschn%2Flangchain-utils/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/tddschn%2Flangchain-utils/releases","manifests_url":"h
ttps://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/tddschn%2Flangchain-utils/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/tddschn","download_url":"https://codeload.github.com/tddschn/langchain-utils/tar.gz/refs/heads/master","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":248706636,"owners_count":21148747,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["chatgpt","cli","langchain","llm","pandoc","prompt","python3","utility"],"created_at":"2024-11-16T12:07:31.544Z","updated_at":"2025-04-15T11:37:33.094Z","avatar_url":"https://github.com/tddschn.png","language":"Python","funding_links":[],"categories":[],"sub_categories":[],"readme":"# langchain-utils\n\nLangChain Utilities\n\n- [langchain-utils](#langchain-utils)\n  - [Prompt generation using LangChain document loaders](#prompt-generation-using-langchain-document-loaders)\n    - [Demos](#demos)\n    - [`pandocprompt`](#pandocprompt)\n    - [`urlprompt`](#urlprompt)\n    - [`pdfprompt`](#pdfprompt)\n    - [`ytprompt`](#ytprompt)\n    - [`textprompt`](#textprompt)\n    - [`htmlprompt`](#htmlprompt)\n  - [Installation](#installation)\n    - [pipx](#pipx)\n    - [pip](#pip)\n  - [Develop](#develop)\n\n## Prompt generation using LangChain document loaders\n\nDo you find yourself frequently copy-pasting text from the web / PDFs / other documents into ChatGPT?\n\nIf yes, these tools are for you!\n\nThe output is optimized for pasting manually into a chat interface (like ChatGPT), either in one go or in several parts to get around context length limits.\n\nBasically, the prompts 
generated look like this:\n\n```python\nREPLY_OK_IF_YOU_READ_TEMPLATE = '''\nBelow is {what}, reply \"OK\" if you read:\n\n\"\"\"\n{content}\n\"\"\"\n'''.strip()\n```\n\nYou can feed it directly to a chat interface like ChatGPT, and ask follow-up questions about it.\n\nSee [`prompts.py`](./langchain_utils/prompts.py) for other variations.\n\n### Demos\n\n- Loading `https://github.com/tddschn/langchain-utils` and copying it to the clipboard:\n\n\u003c!-- create a video tag with https://user-images.githubusercontent.com/45612704/231729153-341bd962-28cc-40a3-af8b-91e038ccaf6c.mp4 --\u003e\n\n\u003cvideo src=\"https://user-images.githubusercontent.com/45612704/231729153-341bd962-28cc-40a3-af8b-91e038ccaf6c.mp4\" controls width=\"100%\"\u003e\u003c/video\u003e\n\n- Loading 3 pages of a PDF file, opening each part for inspection before copying, and optionally merging the 3 pages into 2 prompts that stay within `gpt-3.5-turbo`'s context length limit, using LangChain's `TokenTextSplitter`:\n\n\u003c!-- for https://user-images.githubusercontent.com/45612704/231731553-63cf3cef-a210-4761-8ca3-dd47bedc3393.mp4 --\u003e\n\n\u003cvideo src=\"https://user-images.githubusercontent.com/45612704/231731553-63cf3cef-a210-4761-8ca3-dd47bedc3393.mp4\" controls width=\"100%\"\u003e\u003c/video\u003e\n\n### `pandocprompt`\n\n```\n$ pandocprompt --help\n\nusage: pandocprompt [-h] [-V] [-c] [-e] [-m model] [-S] [-s chunk_size]\n                    [-P PARTS [PARTS ...]] [-r] [-R]\n                    [--print-percentage-non-ascii] [-n] [--out OUT] [-C]\n                    [-w WHAT] [-M] [--from PANDOC_FROM_FORMAT]\n                    [--to PANDOC_TO_FORMAT]\n                    [PATH ...]\n\nGet prompts from arbitrary files. You need to have `pandoc` installed and in\n$PATH, it will be used to convert source files to desired (hopefully textual)\nformat. 
Common use cases: Getting prompts from EPub books or several TeX\nfiles.\n\npositional arguments:\n  PATH                  Paths to the text files, or stdin if not provided\n                        (default: None)\n\noptions:\n  -h, --help            show this help message and exit\n  -V, --version         show program's version number and exit\n  -c, --copy            Copy the prompt to clipboard (default: False)\n  -e, --edit            Edit the prompt and copy manually (default: False)\n  -m model, --model model\n                        Model to use. This only affects the chunk size. Use -S\n                        to disable splitting (infinite chunk size). (default:\n                        gpt-4-32k)\n  -S, --no-split        Do not split the prompt into multiple parts (use this\n                        if the model has a really large context size)\n                        (default: False)\n  -s chunk_size, --chunk-size chunk_size\n                        Chunk size when splitting transcript, also used to\n                        determine whether to split, defaults to 1/2 of the\n                        context length limit of the model (default: None)\n  -P PARTS [PARTS ...], --parts PARTS [PARTS ...]\n                        Parts to select in the processes list of Documents\n                        (default: None)\n  -r, --raw             Wraps the content in triple quotes with no extra text\n                        (default: False)\n  -R, --raw-no-quotes   Output the content only (default: False)\n  --print-percentage-non-ascii\n                        Print percentage of non-ascii characters (default:\n                        False)\n  -n, --dry-run         Dry run (default: False)\n  --out OUT             Output file (default: None)\n  -C, --from-clipboard  Load text from clipboard (default: False)\n  -w WHAT, --what WHAT  Initial knowledge you want to insert before the PDF\n                        content in the prompt (default: the content of a\n      
                  document)\n  -M, --merge           Merge contents of all pages before processing\n                        (default: False)\n  --from PANDOC_FROM_FORMAT\n                        The format that is passed to -f in pandoc (default:\n                        None)\n  --to PANDOC_TO_FORMAT\n                        The format that is passed to -t in pandoc. gfm-\n                        raw_html means GitHub Flavored Markdown with raw HTML\n                        stripped. (default: gfm-raw_html)\n\n```\n### `urlprompt`\n\n```\n$ urlprompt --help\n\nusage: urlprompt [-h] [-V] [-c] [-e] [-m model] [-S] [-s chunk_size]\n                 [-P PARTS [PARTS ...]] [-r] [-R]\n                 [--print-percentage-non-ascii] [-n] [--out OUT] [-w WHAT]\n                 [-M] [-j] [-g] [--github-path GITHUB_PATH]\n                 [--github-revision GITHUB_REVISION] [--substack]\n                 URL\n\nGet a prompt consisting the text content of a webpage\n\npositional arguments:\n  URL                   URL to the webpage\n\noptions:\n  -h, --help            show this help message and exit\n  -V, --version         show program's version number and exit\n  -c, --copy            Copy the prompt to clipboard (default: False)\n  -e, --edit            Edit the prompt and copy manually (default: False)\n  -m model, --model model\n                        Model to use. This only affects the chunk size. Use -S\n                        to disable splitting (infinite chunk size). 
(default:\n                        gpt-4-32k)\n  -S, --no-split        Do not split the prompt into multiple parts (use this\n                        if the model has a really large context size)\n                        (default: False)\n  -s chunk_size, --chunk-size chunk_size\n                        Chunk size when splitting transcript, also used to\n                        determine whether to split, defaults to 1/2 of the\n                        context length limit of the model (default: None)\n  -P PARTS [PARTS ...], --parts PARTS [PARTS ...]\n                        Parts to select in the processes list of Documents\n                        (default: None)\n  -r, --raw             Wraps the content in triple quotes with no extra text\n                        (default: False)\n  -R, --raw-no-quotes   Output the content only (default: False)\n  --print-percentage-non-ascii\n                        Print percentage of non-ascii characters (default:\n                        False)\n  -n, --dry-run         Dry run (default: False)\n  --out OUT             Output file (default: None)\n  -w WHAT, --what WHAT  Initial knowledge you want to insert before the PDF\n                        content in the prompt (default: the content of a\n                        webpage)\n  -M, --merge           Merge contents of all pages before processing\n                        (default: False)\n  -j, --javascript      Use JavaScript to render the page (default: False)\n  -g, --github          Load the raw file from a GitHub URL (default: False)\n  --github-path GITHUB_PATH\n                        Path to the GitHub file (default: README.md)\n  --github-revision GITHUB_REVISION\n                        Revision for the GitHub file (default: master)\n  --substack            Load from a Substack URL and convert it to Markdown\n                        (default: False)\n\n```\n### `pdfprompt`\n\n```\n$ pdfprompt --help\n\nusage: pdfprompt [-h] [-V] [-c] [-e] [-m model] [-S] [-s 
chunk_size]\n                 [-P PARTS [PARTS ...]] [-r] [-R]\n                 [--print-percentage-non-ascii] [-n] [--out OUT]\n                 [-p PAGES [PAGES ...]] [-l PAGE_SLICE] [-M] [-w WHAT] [-o]\n                 [-O] [-L OCR_LANGUAGE]\n                 PDF Path\n\nGet a prompt consisting the text content of a PDF file\n\npositional arguments:\n  PDF Path              Path to the PDF file\n\noptions:\n  -h, --help            show this help message and exit\n  -V, --version         show program's version number and exit\n  -c, --copy            Copy the prompt to clipboard (default: False)\n  -e, --edit            Edit the prompt and copy manually (default: False)\n  -m model, --model model\n                        Model to use. This only affects the chunk size. Use -S\n                        to disable splitting (infinite chunk size). (default:\n                        gpt-4-32k)\n  -S, --no-split        Do not split the prompt into multiple parts (use this\n                        if the model has a really large context size)\n                        (default: False)\n  -s chunk_size, --chunk-size chunk_size\n                        Chunk size when splitting transcript, also used to\n                        determine whether to split, defaults to 1/2 of the\n                        context length limit of the model (default: None)\n  -P PARTS [PARTS ...], --parts PARTS [PARTS ...]\n                        Parts to select in the processes list of Documents\n                        (default: None)\n  -r, --raw             Wraps the content in triple quotes with no extra text\n                        (default: False)\n  -R, --raw-no-quotes   Output the content only (default: False)\n  --print-percentage-non-ascii\n                        Print percentage of non-ascii characters (default:\n                        False)\n  -n, --dry-run         Dry run (default: False)\n  --out OUT             Output file (default: None)\n  -p PAGES [PAGES ...], --pages 
PAGES [PAGES ...]\n                        Only include specified page numbers (default: None)\n  -l PAGE_SLICE, --page-slice PAGE_SLICE\n                        Use Python slice syntax to select page numbers (e.g.\n                        1:3, 1:10:2, etc.) (default: None)\n  -M, --merge           Merge contents of all pages before processing\n                        (default: False)\n  -w WHAT, --what WHAT  Initial knowledge you want to insert before the PDF\n                        content in the prompt (default: the content of a PDF\n                        file)\n  -o, --fallback-ocr    Use OCR as fallback if no text detected on page,\n                        please set TESSDATA_PREFIX environment variable to the\n                        path of your tesseract data directory (default: False)\n  -O, --force-ocr       Force OCR on all pages (default: False)\n  -L OCR_LANGUAGE, --ocr-language OCR_LANGUAGE\n                        Language to use for Tesseract OCR (like eng, chi_sim,\n                        chi_tra, chi_tra_vert etc.)) (default: eng)\n\n```\n### `ytprompt`\n\n```\n$ ytprompt --help\n\nusage: ytprompt [-h] [-V] [-c] [-e] [-m model] [-S] [-s chunk_size]\n                [-P PARTS [PARTS ...]] [-r] [-R]\n                [--print-percentage-non-ascii] [-n] [--out OUT]\n                YouTube URL\n\nGet a prompt consisting Title and Transcript of a YouTube Video\n\npositional arguments:\n  YouTube URL           YouTube URL\n\noptions:\n  -h, --help            show this help message and exit\n  -V, --version         show program's version number and exit\n  -c, --copy            Copy the prompt to clipboard (default: False)\n  -e, --edit            Edit the prompt and copy manually (default: False)\n  -m model, --model model\n                        Model to use. This only affects the chunk size. Use -S\n                        to disable splitting (infinite chunk size). 
(default:\n                        gpt-4-32k)\n  -S, --no-split        Do not split the prompt into multiple parts (use this\n                        if the model has a really large context size)\n                        (default: False)\n  -s chunk_size, --chunk-size chunk_size\n                        Chunk size when splitting transcript, also used to\n                        determine whether to split, defaults to 1/2 of the\n                        context length limit of the model (default: None)\n  -P PARTS [PARTS ...], --parts PARTS [PARTS ...]\n                        Parts to select in the processes list of Documents\n                        (default: None)\n  -r, --raw             Wraps the content in triple quotes with no extra text\n                        (default: False)\n  -R, --raw-no-quotes   Output the content only (default: False)\n  --print-percentage-non-ascii\n                        Print percentage of non-ascii characters (default:\n                        False)\n  -n, --dry-run         Dry run (default: False)\n  --out OUT             Output file (default: None)\n\n```\n### `textprompt`\n\n```\n$ textprompt --help\n\nusage: textprompt [-h] [-V] [-c] [-e] [-m model] [-S] [-s chunk_size]\n                  [-P PARTS [PARTS ...]] [-r] [-R]\n                  [--print-percentage-non-ascii] [-n] [--out OUT] [-C]\n                  [-w WHAT] [-M]\n                  [PATH ...]\n\nGet a prompt from text files\n\npositional arguments:\n  PATH                  Paths to the text files, or stdin if not provided\n                        (default: None)\n\noptions:\n  -h, --help            show this help message and exit\n  -V, --version         show program's version number and exit\n  -c, --copy            Copy the prompt to clipboard (default: False)\n  -e, --edit            Edit the prompt and copy manually (default: False)\n  -m model, --model model\n                        Model to use. This only affects the chunk size. 
Use -S\n                        to disable splitting (infinite chunk size). (default:\n                        gpt-4-32k)\n  -S, --no-split        Do not split the prompt into multiple parts (use this\n                        if the model has a really large context size)\n                        (default: False)\n  -s chunk_size, --chunk-size chunk_size\n                        Chunk size when splitting transcript, also used to\n                        determine whether to split, defaults to 1/2 of the\n                        context length limit of the model (default: None)\n  -P PARTS [PARTS ...], --parts PARTS [PARTS ...]\n                        Parts to select in the processes list of Documents\n                        (default: None)\n  -r, --raw             Wraps the content in triple quotes with no extra text\n                        (default: False)\n  -R, --raw-no-quotes   Output the content only (default: False)\n  --print-percentage-non-ascii\n                        Print percentage of non-ascii characters (default:\n                        False)\n  -n, --dry-run         Dry run (default: False)\n  --out OUT             Output file (default: None)\n  -C, --from-clipboard  Load text from clipboard (default: False)\n  -w WHAT, --what WHAT  Initial knowledge you want to insert before the PDF\n                        content in the prompt (default: the content of a\n                        document)\n  -M, --merge           Merge contents of all pages before processing\n                        (default: False)\n\n```\n### `htmlprompt`\n\n```\n$ htmlprompt --help\n\nusage: htmlprompt [-h] [-V] [-c] [-e] [-m model] [-S] [-s chunk_size]\n                  [-P PARTS [PARTS ...]] [-r] [-R]\n                  [--print-percentage-non-ascii] [-n] [--out OUT] [-C]\n                  [-w WHAT] [-M]\n                  [PATH ...]\n\nGet a prompt from html files\n\npositional arguments:\n  PATH                  Paths to the html files, or stdin if not provided\n      
                  (default: None)\n\noptions:\n  -h, --help            show this help message and exit\n  -V, --version         show program's version number and exit\n  -c, --copy            Copy the prompt to clipboard (default: False)\n  -e, --edit            Edit the prompt and copy manually (default: False)\n  -m model, --model model\n                        Model to use. This only affects the chunk size. Use -S\n                        to disable splitting (infinite chunk size). (default:\n                        gpt-4-32k)\n  -S, --no-split        Do not split the prompt into multiple parts (use this\n                        if the model has a really large context size)\n                        (default: False)\n  -s chunk_size, --chunk-size chunk_size\n                        Chunk size when splitting transcript, also used to\n                        determine whether to split, defaults to 1/2 of the\n                        context length limit of the model (default: None)\n  -P PARTS [PARTS ...], --parts PARTS [PARTS ...]\n                        Parts to select in the processes list of Documents\n                        (default: None)\n  -r, --raw             Wraps the content in triple quotes with no extra text\n                        (default: False)\n  -R, --raw-no-quotes   Output the content only (default: False)\n  --print-percentage-non-ascii\n                        Print percentage of non-ascii characters (default:\n                        False)\n  -n, --dry-run         Dry run (default: False)\n  --out OUT             Output file (default: None)\n  -C, --from-clipboard  Load text from clipboard (default: False)\n  -w WHAT, --what WHAT  Initial knowledge you want to insert before the PDF\n                        content in the prompt (default: the text content of a\n                        html file)\n  -M, --merge           Merge contents of all pages before processing\n                        (default: False)\n\n```\n\n## Installation\n\n### 
pipx\n\nThis is the recommended installation method.\n\n```\n$ pipx install langchain-utils\n```\n\n### [pip](https://pypi.org/project/langchain-utils/)\n\n```\n$ pip install langchain-utils\n```\n\n## Develop\n\n```\n$ git clone https://github.com/tddschn/langchain-utils.git\n$ cd langchain-utils\n$ poetry install\n```","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Ftddschn%2Flangchain-utils","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Ftddschn%2Flangchain-utils","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Ftddschn%2Flangchain-utils/lists"}