# langchain-utils

LangChain Utilities

- [langchain-utils](#langchain-utils)
  - [Prompt generation using LangChain document loaders](#prompt-generation-using-langchain-document-loaders)
    - [Demos](#demos)
    - [`pandocprompt`](#pandocprompt)
    - [`urlprompt`](#urlprompt)
    - [`pdfprompt`](#pdfprompt)
    - [`ytprompt`](#ytprompt)
    - [`textprompt`](#textprompt)
    - [`htmlprompt`](#htmlprompt)
  - [Installation](#installation)
    - [pipx](#pipx)
    - [pip](#pip)
  - [Develop](#develop)

## Prompt generation using LangChain document loaders

Do you find yourself frequently copy-pasting text from the web, PDFs, or other documents into ChatGPT?

If so, these tools are for you!

They generate prompts optimized for manually feeding into a chat interface (like ChatGPT), either in one go or in multiple parts (to get around context length limits).

The generated prompts look like this:

```python
REPLY_OK_IF_YOU_READ_TEMPLATE = '''
Below is {what}, reply "OK" if you read:

"""
{content}
"""
'''.strip()
```

You can feed it directly to a chat interface like ChatGPT and ask follow-up questions about it.

See [`prompts.py`](./langchain_utils/prompts.py) for other variations.
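
Concretely, here is a minimal sketch of how the placeholders get filled, continuing from the template above (the `what` and `content` values are illustrative; `-w`/`--what` controls the former):

```python
# Illustrative sketch, not the library's exact code path.
# Assumes REPLY_OK_IF_YOU_READ_TEMPLATE from the snippet above.
what = 'the content of a webpage'   # what the -w/--what option sets
content = '...text extracted by a LangChain document loader...'

prompt = REPLY_OK_IF_YOU_READ_TEMPLATE.format(what=what, content=content)
print(prompt)  # paste into the chat, wait for "OK", then ask follow-ups
```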

### Demos

- Loading `https://github.com/tddschn/langchain-utils` and copying the generated prompt to the clipboard.

- Loading 3 pages of a PDF file, opening each part for inspection before copying, and optionally merging the 3 pages into 2 prompts that stay within `gpt-3.5-turbo`'s context length limit, using LangChain's `TokenTextSplitter` (sketched below).
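
For instance, the second demo corresponds to a command along these lines (the file name is illustrative; all flags are documented under `pdfprompt` below):

```
$ pdfprompt doc.pdf -p 1 2 3 -M -m gpt-3.5-turbo -e
```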

### `pandocprompt`

```
$ pandocprompt --help

usage: pandocprompt [-h] [-V] [-c] [-e] [-m model] [-S] [-s chunk_size]
                    [-P PARTS [PARTS ...]] [-r] [-R]
                    [--print-percentage-non-ascii] [-n] [--out OUT] [-C]
                    [-w WHAT] [-M] [--from PANDOC_FROM_FORMAT]
                    [--to PANDOC_TO_FORMAT]
                    [PATH ...]

Get prompts from arbitrary files. You need to have `pandoc` installed and in
$PATH; it will be used to convert source files to the desired (hopefully
textual) format. Common use cases: getting prompts from EPub books or several
TeX files.

positional arguments:
  PATH                  Paths to the text files, or stdin if not provided
                        (default: None)

options:
  -h, --help            show this help message and exit
  -V, --version         show program's version number and exit
  -c, --copy            Copy the prompt to clipboard (default: False)
  -e, --edit            Edit the prompt and copy manually (default: False)
  -m model, --model model
                        Model to use. This only affects the chunk size. Use -S
                        to disable splitting (infinite chunk size). (default:
                        gpt-4-32k)
  -S, --no-split        Do not split the prompt into multiple parts (use this
                        if the model has a really large context size)
                        (default: False)
  -s chunk_size, --chunk-size chunk_size
                        Chunk size when splitting transcript, also used to
                        determine whether to split, defaults to 1/2 of the
                        context length limit of the model (default: None)
  -P PARTS [PARTS ...], --parts PARTS [PARTS ...]
                        Parts to select in the processed list of Documents
                        (default: None)
  -r, --raw             Wraps the content in triple quotes with no extra text
                        (default: False)
  -R, --raw-no-quotes   Output the content only (default: False)
  --print-percentage-non-ascii
                        Print percentage of non-ascii characters (default:
                        False)
  -n, --dry-run         Dry run (default: False)
  --out OUT             Output file (default: None)
  -C, --from-clipboard  Load text from clipboard (default: False)
  -w WHAT, --what WHAT  Initial knowledge you want to insert before the
                        document content in the prompt (default: the content
                        of a document)
  -M, --merge           Merge contents of all pages before processing
                        (default: False)
  --from PANDOC_FROM_FORMAT
                        The format that is passed to -f in pandoc (default:
                        None)
  --to PANDOC_TO_FORMAT
                        The format that is passed to -t in pandoc. gfm-
                        raw_html means GitHub Flavored Markdown with raw HTML
                        stripped. (default: gfm-raw_html)

```
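
For example, to build prompts from an EPub book, or from a couple of TeX files (file names illustrative):

```
$ pandocprompt book.epub -c
$ pandocprompt ch1.tex ch2.tex --from latex -c
```
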
### `urlprompt`

```
$ urlprompt --help

usage: urlprompt [-h] [-V] [-c] [-e] [-m model] [-S] [-s chunk_size]
                 [-P PARTS [PARTS ...]] [-r] [-R]
                 [--print-percentage-non-ascii] [-n] [--out OUT] [-w WHAT]
                 [-M] [-j] [-g] [--github-path GITHUB_PATH]
                 [--github-revision GITHUB_REVISION] [--substack]
                 URL

Get a prompt consisting of the text content of a webpage

positional arguments:
  URL                   URL to the webpage

options:
  -h, --help            show this help message and exit
  -V, --version         show program's version number and exit
  -c, --copy            Copy the prompt to clipboard (default: False)
  -e, --edit            Edit the prompt and copy manually (default: False)
  -m model, --model model
                        Model to use. This only affects the chunk size. Use -S
                        to disable splitting (infinite chunk size). (default:
                        gpt-4-32k)
  -S, --no-split        Do not split the prompt into multiple parts (use this
                        if the model has a really large context size)
                        (default: False)
  -s chunk_size, --chunk-size chunk_size
                        Chunk size when splitting transcript, also used to
                        determine whether to split, defaults to 1/2 of the
                        context length limit of the model (default: None)
  -P PARTS [PARTS ...], --parts PARTS [PARTS ...]
                        Parts to select in the processed list of Documents
                        (default: None)
  -r, --raw             Wraps the content in triple quotes with no extra text
                        (default: False)
  -R, --raw-no-quotes   Output the content only (default: False)
  --print-percentage-non-ascii
                        Print percentage of non-ascii characters (default:
                        False)
  -n, --dry-run         Dry run (default: False)
  --out OUT             Output file (default: None)
  -w WHAT, --what WHAT  Initial knowledge you want to insert before the
                        webpage content in the prompt (default: the content
                        of a webpage)
  -M, --merge           Merge contents of all pages before processing
                        (default: False)
  -j, --javascript      Use JavaScript to render the page (default: False)
  -g, --github          Load the raw file from a GitHub URL (default: False)
  --github-path GITHUB_PATH
                        Path to the GitHub file (default: README.md)
  --github-revision GITHUB_REVISION
                        Revision for the GitHub file (default: master)
  --substack            Load from a Substack URL and convert it to Markdown
                        (default: False)

```
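
For example, to copy a prompt built from a webpage, or from this repo's `README.md` via the raw GitHub loader (`--github-path` defaults to `README.md`):

```
$ urlprompt https://github.com/tddschn/langchain-utils -c
$ urlprompt -g https://github.com/tddschn/langchain-utils -c
```
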
### `pdfprompt`

```
$ pdfprompt --help

usage: pdfprompt [-h] [-V] [-c] [-e] [-m model] [-S] [-s chunk_size]
                 [-P PARTS [PARTS ...]] [-r] [-R]
                 [--print-percentage-non-ascii] [-n] [--out OUT]
                 [-p PAGES [PAGES ...]] [-l PAGE_SLICE] [-M] [-w WHAT] [-o]
                 [-O] [-L OCR_LANGUAGE]
                 PDF Path

Get a prompt consisting of the text content of a PDF file

positional arguments:
  PDF Path              Path to the PDF file

options:
  -h, --help            show this help message and exit
  -V, --version         show program's version number and exit
  -c, --copy            Copy the prompt to clipboard (default: False)
  -e, --edit            Edit the prompt and copy manually (default: False)
  -m model, --model model
                        Model to use. This only affects the chunk size. Use -S
                        to disable splitting (infinite chunk size). (default:
                        gpt-4-32k)
  -S, --no-split        Do not split the prompt into multiple parts (use this
                        if the model has a really large context size)
                        (default: False)
  -s chunk_size, --chunk-size chunk_size
                        Chunk size when splitting transcript, also used to
                        determine whether to split, defaults to 1/2 of the
                        context length limit of the model (default: None)
  -P PARTS [PARTS ...], --parts PARTS [PARTS ...]
                        Parts to select in the processed list of Documents
                        (default: None)
  -r, --raw             Wraps the content in triple quotes with no extra text
                        (default: False)
  -R, --raw-no-quotes   Output the content only (default: False)
  --print-percentage-non-ascii
                        Print percentage of non-ascii characters (default:
                        False)
  -n, --dry-run         Dry run (default: False)
  --out OUT             Output file (default: None)
  -p PAGES [PAGES ...], --pages PAGES [PAGES ...]
                        Only include specified page numbers (default: None)
  -l PAGE_SLICE, --page-slice PAGE_SLICE
                        Use Python slice syntax to select page numbers (e.g.
                        1:3, 1:10:2, etc.) (default: None)
  -M, --merge           Merge contents of all pages before processing
                        (default: False)
  -w WHAT, --what WHAT  Initial knowledge you want to insert before the PDF
                        content in the prompt (default: the content of a PDF
                        file)
  -o, --fallback-ocr    Use OCR as fallback if no text detected on page,
                        please set TESSDATA_PREFIX environment variable to the
                        path of your tesseract data directory (default: False)
  -O, --force-ocr       Force OCR on all pages (default: False)
  -L OCR_LANGUAGE, --ocr-language OCR_LANGUAGE
                        Language to use for Tesseract OCR (like eng, chi_sim,
                        chi_tra, chi_tra_vert, etc.) (default: eng)

```
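
For example, to build a prompt from pages 1-3 of a possibly scanned PDF, falling back to Tesseract OCR on pages with no text layer (file name illustrative; `TESSDATA_PREFIX` must be set, per the `-o` help above):

```
$ pdfprompt paper.pdf -l 1:3 -o -c
```
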
### `ytprompt`

```
$ ytprompt --help

usage: ytprompt [-h] [-V] [-c] [-e] [-m model] [-S] [-s chunk_size]
                [-P PARTS [PARTS ...]] [-r] [-R]
                [--print-percentage-non-ascii] [-n] [--out OUT]
                YouTube URL

Get a prompt consisting of the Title and Transcript of a YouTube Video

positional arguments:
  YouTube URL           YouTube URL

options:
  -h, --help            show this help message and exit
  -V, --version         show program's version number and exit
  -c, --copy            Copy the prompt to clipboard (default: False)
  -e, --edit            Edit the prompt and copy manually (default: False)
  -m model, --model model
                        Model to use. This only affects the chunk size. Use -S
                        to disable splitting (infinite chunk size). (default:
                        gpt-4-32k)
  -S, --no-split        Do not split the prompt into multiple parts (use this
                        if the model has a really large context size)
                        (default: False)
  -s chunk_size, --chunk-size chunk_size
                        Chunk size when splitting transcript, also used to
                        determine whether to split, defaults to 1/2 of the
                        context length limit of the model (default: None)
  -P PARTS [PARTS ...], --parts PARTS [PARTS ...]
                        Parts to select in the processed list of Documents
                        (default: None)
  -r, --raw             Wraps the content in triple quotes with no extra text
                        (default: False)
  -R, --raw-no-quotes   Output the content only (default: False)
  --print-percentage-non-ascii
                        Print percentage of non-ascii characters (default:
                        False)
  -n, --dry-run         Dry run (default: False)
  --out OUT             Output file (default: None)

```
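
For example (placeholder URL), to fetch a video's title and transcript as one unsplit prompt and copy it:

```
$ ytprompt -S -c 'https://www.youtube.com/watch?v=VIDEO_ID'
```
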
### `textprompt`

```
$ textprompt --help

usage: textprompt [-h] [-V] [-c] [-e] [-m model] [-S] [-s chunk_size]
                  [-P PARTS [PARTS ...]] [-r] [-R]
                  [--print-percentage-non-ascii] [-n] [--out OUT] [-C]
                  [-w WHAT] [-M]
                  [PATH ...]

Get a prompt from text files

positional arguments:
  PATH                  Paths to the text files, or stdin if not provided
                        (default: None)

options:
  -h, --help            show this help message and exit
  -V, --version         show program's version number and exit
  -c, --copy            Copy the prompt to clipboard (default: False)
  -e, --edit            Edit the prompt and copy manually (default: False)
  -m model, --model model
                        Model to use. This only affects the chunk size. Use -S
                        to disable splitting (infinite chunk size). (default:
                        gpt-4-32k)
  -S, --no-split        Do not split the prompt into multiple parts (use this
                        if the model has a really large context size)
                        (default: False)
  -s chunk_size, --chunk-size chunk_size
                        Chunk size when splitting transcript, also used to
                        determine whether to split, defaults to 1/2 of the
                        context length limit of the model (default: None)
  -P PARTS [PARTS ...], --parts PARTS [PARTS ...]
                        Parts to select in the processed list of Documents
                        (default: None)
  -r, --raw             Wraps the content in triple quotes with no extra text
                        (default: False)
  -R, --raw-no-quotes   Output the content only (default: False)
  --print-percentage-non-ascii
                        Print percentage of non-ascii characters (default:
                        False)
  -n, --dry-run         Dry run (default: False)
  --out OUT             Output file (default: None)
  -C, --from-clipboard  Load text from clipboard (default: False)
  -w WHAT, --what WHAT  Initial knowledge you want to insert before the
                        document content in the prompt (default: the content
                        of a document)
  -M, --merge           Merge contents of all pages before processing
                        (default: False)

```
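
For example (file name illustrative); with no PATH given, it reads from stdin:

```
$ textprompt notes.txt -c
$ cat notes.txt | textprompt -c   # same thing, via stdin
```
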
### `htmlprompt`

```
$ htmlprompt --help

usage: htmlprompt [-h] [-V] [-c] [-e] [-m model] [-S] [-s chunk_size]
                  [-P PARTS [PARTS ...]] [-r] [-R]
                  [--print-percentage-non-ascii] [-n] [--out OUT] [-C]
                  [-w WHAT] [-M]
                  [PATH ...]

Get a prompt from HTML files

positional arguments:
  PATH                  Paths to the HTML files, or stdin if not provided
                        (default: None)

options:
  -h, --help            show this help message and exit
  -V, --version         show program's version number and exit
  -c, --copy            Copy the prompt to clipboard (default: False)
  -e, --edit            Edit the prompt and copy manually (default: False)
  -m model, --model model
                        Model to use. This only affects the chunk size. Use -S
                        to disable splitting (infinite chunk size). (default:
                        gpt-4-32k)
  -S, --no-split        Do not split the prompt into multiple parts (use this
                        if the model has a really large context size)
                        (default: False)
  -s chunk_size, --chunk-size chunk_size
                        Chunk size when splitting transcript, also used to
                        determine whether to split, defaults to 1/2 of the
                        context length limit of the model (default: None)
  -P PARTS [PARTS ...], --parts PARTS [PARTS ...]
                        Parts to select in the processed list of Documents
                        (default: None)
  -r, --raw             Wraps the content in triple quotes with no extra text
                        (default: False)
  -R, --raw-no-quotes   Output the content only (default: False)
  --print-percentage-non-ascii
                        Print percentage of non-ascii characters (default:
                        False)
  -n, --dry-run         Dry run (default: False)
  --out OUT             Output file (default: None)
  -C, --from-clipboard  Load text from clipboard (default: False)
  -w WHAT, --what WHAT  Initial knowledge you want to insert before the HTML
                        content in the prompt (default: the text content of
                        an HTML file)
  -M, --merge           Merge contents of all pages before processing
                        (default: False)

```
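
For example, to write just the text content of a saved page to a file, with no prompt wrapper (file names illustrative):

```
$ htmlprompt saved-page.html -R --out page.txt
```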

## Installation

### pipx

This is the recommended installation method.

```
$ pipx install langchain-utils
```

### [pip](https://pypi.org/project/langchain-utils/)

```
$ pip install langchain-utils
```

## Develop

```
$ git clone https://github.com/tddschn/langchain-utils.git
$ cd langchain-utils
$ poetry install
```