# langchain-utils

LangChain Utilities

- [langchain-utils](#langchain-utils)
  - [Prompt generation using LangChain document loaders](#prompt-generation-using-langchain-document-loaders)
    - [Demos](#demos)
    - [`pandocprompt`](#pandocprompt)
    - [`urlprompt`](#urlprompt)
    - [`pdfprompt`](#pdfprompt)
    - [`ytprompt`](#ytprompt)
    - [`textprompt`](#textprompt)
    - [`htmlprompt`](#htmlprompt)
  - [Installation](#installation)
    - [pipx](#pipx)
    - [pip](#pip)
  - [Develop](#develop)

## Prompt generation using LangChain document loaders

Do you find yourself frequently copy-pasting text from the web, PDFs, or other documents into ChatGPT?

If so, these tools are for you!

They generate prompts optimized for manually feeding into a chat interface (like ChatGPT), either in one go or in multiple parts (to get around context length limits).

The generated prompts look like this:

```python
REPLY_OK_IF_YOU_READ_TEMPLATE = '''
Below is {what}, reply "OK" if you read:

"""
{content}
"""
'''.strip()
```

You can feed it directly to a chat interface like ChatGPT and ask follow-up questions about it.

See [`prompts.py`](./langchain_utils/prompts.py) for other variations.
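
Concretely, here is a minimal sketch of how the placeholders get filled, continuing from the template above (the `what` and `content` values are illustrative; `-w`/`--what` controls the former):

```python
# Illustrative sketch, not the library's exact code path.
# Assumes REPLY_OK_IF_YOU_READ_TEMPLATE from the snippet above.
what = 'the content of a webpage'   # what the -w/--what option sets
content = '...text extracted by a LangChain document loader...'

prompt = REPLY_OK_IF_YOU_READ_TEMPLATE.format(what=what, content=content)
print(prompt)  # paste into the chat, wait for "OK", then ask follow-ups
```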

### Demos

- Loading `https://github.com/tddschn/langchain-utils` and copying the generated prompt to the clipboard.

- Loading 3 pages of a PDF file, opening each part for inspection before copying, and optionally merging the 3 pages into 2 prompts that stay within `gpt-3.5-turbo`'s context length limit, using LangChain's `TokenTextSplitter` (sketched below).
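
For instance, the second demo corresponds to a command along these lines (the file name is illustrative; all flags are documented under `pdfprompt` below):

```
$ pdfprompt doc.pdf -p 1 2 3 -M -m gpt-3.5-turbo -e
```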

### `pandocprompt`

```
$ pandocprompt --help

usage: pandocprompt [-h] [-V] [-c] [-e] [-m model] [-S] [-s chunk_size]
                    [-P PARTS [PARTS ...]] [-r] [-R]
                    [--print-percentage-non-ascii] [-n] [--out OUT] [-C]
                    [-w WHAT] [-M] [--from PANDOC_FROM_FORMAT]
                    [--to PANDOC_TO_FORMAT]
                    [PATH ...]

Get prompts from arbitrary files. You need to have `pandoc` installed and in
$PATH; it will be used to convert source files to the desired (hopefully
textual) format. Common use cases: getting prompts from EPub books or several
TeX files.

positional arguments:
  PATH                  Paths to the text files, or stdin if not provided
                        (default: None)

options:
  -h, --help            show this help message and exit
  -V, --version         show program's version number and exit
  -c, --copy            Copy the prompt to clipboard (default: False)
  -e, --edit            Edit the prompt and copy manually (default: False)
  -m model, --model model
                        Model to use. This only affects the chunk size. Use -S
                        to disable splitting (infinite chunk size). (default:
                        gpt-4-32k)
  -S, --no-split        Do not split the prompt into multiple parts (use this
                        if the model has a really large context size)
                        (default: False)
  -s chunk_size, --chunk-size chunk_size
                        Chunk size when splitting transcript, also used to
                        determine whether to split, defaults to 1/2 of the
                        context length limit of the model (default: None)
  -P PARTS [PARTS ...], --parts PARTS [PARTS ...]
                        Parts to select in the processed list of Documents
                        (default: None)
  -r, --raw             Wraps the content in triple quotes with no extra text
                        (default: False)
  -R, --raw-no-quotes   Output the content only (default: False)
  --print-percentage-non-ascii
                        Print percentage of non-ascii characters (default:
                        False)
  -n, --dry-run         Dry run (default: False)
  --out OUT             Output file (default: None)
  -C, --from-clipboard  Load text from clipboard (default: False)
  -w WHAT, --what WHAT  Initial knowledge you want to insert before the
                        document content in the prompt (default: the content
                        of a document)
  -M, --merge           Merge contents of all pages before processing
                        (default: False)
  --from PANDOC_FROM_FORMAT
                        The format that is passed to -f in pandoc (default:
                        None)
  --to PANDOC_TO_FORMAT
                        The format that is passed to -t in pandoc. gfm-
                        raw_html means GitHub Flavored Markdown with raw HTML
                        stripped. (default: gfm-raw_html)

```
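
For example, to build prompts from an EPub book, or from a couple of TeX files (file names illustrative):

```
$ pandocprompt book.epub -c
$ pandocprompt ch1.tex ch2.tex --from latex -c
```
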
### `urlprompt`

```
$ urlprompt --help

usage: urlprompt [-h] [-V] [-c] [-e] [-m model] [-S] [-s chunk_size]
                 [-P PARTS [PARTS ...]] [-r] [-R]
                 [--print-percentage-non-ascii] [-n] [--out OUT] [-w WHAT]
                 [-M] [-j] [-g] [--github-path GITHUB_PATH]
                 [--github-revision GITHUB_REVISION] [--substack]
                 URL

Get a prompt consisting of the text content of a webpage

positional arguments:
  URL                   URL to the webpage

options:
  -h, --help            show this help message and exit
  -V, --version         show program's version number and exit
  -c, --copy            Copy the prompt to clipboard (default: False)
  -e, --edit            Edit the prompt and copy manually (default: False)
  -m model, --model model
                        Model to use. This only affects the chunk size. Use -S
                        to disable splitting (infinite chunk size). (default:
                        gpt-4-32k)
  -S, --no-split        Do not split the prompt into multiple parts (use this
                        if the model has a really large context size)
                        (default: False)
  -s chunk_size, --chunk-size chunk_size
                        Chunk size when splitting transcript, also used to
                        determine whether to split, defaults to 1/2 of the
                        context length limit of the model (default: None)
  -P PARTS [PARTS ...], --parts PARTS [PARTS ...]
                        Parts to select in the processed list of Documents
                        (default: None)
  -r, --raw             Wraps the content in triple quotes with no extra text
                        (default: False)
  -R, --raw-no-quotes   Output the content only (default: False)
  --print-percentage-non-ascii
                        Print percentage of non-ascii characters (default:
                        False)
  -n, --dry-run         Dry run (default: False)
  --out OUT             Output file (default: None)
  -w WHAT, --what WHAT  Initial knowledge you want to insert before the
                        webpage content in the prompt (default: the content
                        of a webpage)
  -M, --merge           Merge contents of all pages before processing
                        (default: False)
  -j, --javascript      Use JavaScript to render the page (default: False)
  -g, --github          Load the raw file from a GitHub URL (default: False)
  --github-path GITHUB_PATH
                        Path to the GitHub file (default: README.md)
  --github-revision GITHUB_REVISION
                        Revision for the GitHub file (default: master)
  --substack            Load from a Substack URL and convert it to Markdown
                        (default: False)

```
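
For example, to copy a prompt built from a webpage, or from this repo's `README.md` via the raw GitHub loader (`--github-path` defaults to `README.md`):

```
$ urlprompt https://github.com/tddschn/langchain-utils -c
$ urlprompt -g https://github.com/tddschn/langchain-utils -c
```
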
### `pdfprompt`

```
$ pdfprompt --help

usage: pdfprompt [-h] [-V] [-c] [-e] [-m model] [-S] [-s chunk_size]
                 [-P PARTS [PARTS ...]] [-r] [-R]
                 [--print-percentage-non-ascii] [-n] [--out OUT]
                 [-p PAGES [PAGES ...]] [-l PAGE_SLICE] [-M] [-w WHAT] [-o]
                 [-O] [-L OCR_LANGUAGE]
                 PDF Path

Get a prompt consisting of the text content of a PDF file

positional arguments:
  PDF Path              Path to the PDF file

options:
  -h, --help            show this help message and exit
  -V, --version         show program's version number and exit
  -c, --copy            Copy the prompt to clipboard (default: False)
  -e, --edit            Edit the prompt and copy manually (default: False)
  -m model, --model model
                        Model to use. This only affects the chunk size. Use -S
                        to disable splitting (infinite chunk size). (default:
                        gpt-4-32k)
  -S, --no-split        Do not split the prompt into multiple parts (use this
                        if the model has a really large context size)
                        (default: False)
  -s chunk_size, --chunk-size chunk_size
                        Chunk size when splitting transcript, also used to
                        determine whether to split, defaults to 1/2 of the
                        context length limit of the model (default: None)
  -P PARTS [PARTS ...], --parts PARTS [PARTS ...]
                        Parts to select in the processed list of Documents
                        (default: None)
  -r, --raw             Wraps the content in triple quotes with no extra text
                        (default: False)
  -R, --raw-no-quotes   Output the content only (default: False)
  --print-percentage-non-ascii
                        Print percentage of non-ascii characters (default:
                        False)
  -n, --dry-run         Dry run (default: False)
  --out OUT             Output file (default: None)
  -p PAGES [PAGES ...], --pages PAGES [PAGES ...]
                        Only include specified page numbers (default: None)
  -l PAGE_SLICE, --page-slice PAGE_SLICE
                        Use Python slice syntax to select page numbers (e.g.
                        1:3, 1:10:2, etc.) (default: None)
  -M, --merge           Merge contents of all pages before processing
                        (default: False)
  -w WHAT, --what WHAT  Initial knowledge you want to insert before the PDF
                        content in the prompt (default: the content of a PDF
                        file)
  -o, --fallback-ocr    Use OCR as fallback if no text detected on page,
                        please set TESSDATA_PREFIX environment variable to the
                        path of your tesseract data directory (default: False)
  -O, --force-ocr       Force OCR on all pages (default: False)
  -L OCR_LANGUAGE, --ocr-language OCR_LANGUAGE
                        Language to use for Tesseract OCR (like eng, chi_sim,
                        chi_tra, chi_tra_vert, etc.) (default: eng)

```
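
For example, to build a prompt from pages 1-3 of a possibly scanned PDF, falling back to Tesseract OCR on pages with no text layer (file name illustrative; `TESSDATA_PREFIX` must be set, per the `-o` help above):

```
$ pdfprompt paper.pdf -l 1:3 -o -c
```
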
### `ytprompt`

```
$ ytprompt --help

usage: ytprompt [-h] [-V] [-c] [-e] [-m model] [-S] [-s chunk_size]
                [-P PARTS [PARTS ...]] [-r] [-R]
                [--print-percentage-non-ascii] [-n] [--out OUT]
                YouTube URL

Get a prompt consisting of the Title and Transcript of a YouTube Video

positional arguments:
  YouTube URL           YouTube URL

options:
  -h, --help            show this help message and exit
  -V, --version         show program's version number and exit
  -c, --copy            Copy the prompt to clipboard (default: False)
  -e, --edit            Edit the prompt and copy manually (default: False)
  -m model, --model model
                        Model to use. This only affects the chunk size. Use -S
                        to disable splitting (infinite chunk size). (default:
                        gpt-4-32k)
  -S, --no-split        Do not split the prompt into multiple parts (use this
                        if the model has a really large context size)
                        (default: False)
  -s chunk_size, --chunk-size chunk_size
                        Chunk size when splitting transcript, also used to
                        determine whether to split, defaults to 1/2 of the
                        context length limit of the model (default: None)
  -P PARTS [PARTS ...], --parts PARTS [PARTS ...]
                        Parts to select in the processed list of Documents
                        (default: None)
  -r, --raw             Wraps the content in triple quotes with no extra text
                        (default: False)
  -R, --raw-no-quotes   Output the content only (default: False)
  --print-percentage-non-ascii
                        Print percentage of non-ascii characters (default:
                        False)
  -n, --dry-run         Dry run (default: False)
  --out OUT             Output file (default: None)

```
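
For example (placeholder URL), to fetch a video's title and transcript as one unsplit prompt and copy it:

```
$ ytprompt -S -c 'https://www.youtube.com/watch?v=VIDEO_ID'
```
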
### `textprompt`

```
$ textprompt --help

usage: textprompt [-h] [-V] [-c] [-e] [-m model] [-S] [-s chunk_size]
                  [-P PARTS [PARTS ...]] [-r] [-R]
                  [--print-percentage-non-ascii] [-n] [--out OUT] [-C]
                  [-w WHAT] [-M]
                  [PATH ...]

Get a prompt from text files

positional arguments:
  PATH                  Paths to the text files, or stdin if not provided
                        (default: None)

options:
  -h, --help            show this help message and exit
  -V, --version         show program's version number and exit
  -c, --copy            Copy the prompt to clipboard (default: False)
  -e, --edit            Edit the prompt and copy manually (default: False)
  -m model, --model model
                        Model to use. This only affects the chunk size. Use -S
                        to disable splitting (infinite chunk size). (default:
                        gpt-4-32k)
  -S, --no-split        Do not split the prompt into multiple parts (use this
                        if the model has a really large context size)
                        (default: False)
  -s chunk_size, --chunk-size chunk_size
                        Chunk size when splitting transcript, also used to
                        determine whether to split, defaults to 1/2 of the
                        context length limit of the model (default: None)
  -P PARTS [PARTS ...], --parts PARTS [PARTS ...]
                        Parts to select in the processed list of Documents
                        (default: None)
  -r, --raw             Wraps the content in triple quotes with no extra text
                        (default: False)
  -R, --raw-no-quotes   Output the content only (default: False)
  --print-percentage-non-ascii
                        Print percentage of non-ascii characters (default:
                        False)
  -n, --dry-run         Dry run (default: False)
  --out OUT             Output file (default: None)
  -C, --from-clipboard  Load text from clipboard (default: False)
  -w WHAT, --what WHAT  Initial knowledge you want to insert before the
                        document content in the prompt (default: the content
                        of a document)
  -M, --merge           Merge contents of all pages before processing
                        (default: False)

```
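
For example (file name illustrative); with no PATH given, it reads from stdin:

```
$ textprompt notes.txt -c
$ cat notes.txt | textprompt -c   # same thing, via stdin
```
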
### `htmlprompt`

```
$ htmlprompt --help

usage: htmlprompt [-h] [-V] [-c] [-e] [-m model] [-S] [-s chunk_size]
                  [-P PARTS [PARTS ...]] [-r] [-R]
                  [--print-percentage-non-ascii] [-n] [--out OUT] [-C]
                  [-w WHAT] [-M]
                  [PATH ...]

Get a prompt from HTML files

positional arguments:
  PATH                  Paths to the HTML files, or stdin if not provided
                        (default: None)

options:
  -h, --help            show this help message and exit
  -V, --version         show program's version number and exit
  -c, --copy            Copy the prompt to clipboard (default: False)
  -e, --edit            Edit the prompt and copy manually (default: False)
  -m model, --model model
                        Model to use. This only affects the chunk size. Use -S
                        to disable splitting (infinite chunk size). (default:
                        gpt-4-32k)
  -S, --no-split        Do not split the prompt into multiple parts (use this
                        if the model has a really large context size)
                        (default: False)
  -s chunk_size, --chunk-size chunk_size
                        Chunk size when splitting transcript, also used to
                        determine whether to split, defaults to 1/2 of the
                        context length limit of the model (default: None)
  -P PARTS [PARTS ...], --parts PARTS [PARTS ...]
                        Parts to select in the processed list of Documents
                        (default: None)
  -r, --raw             Wraps the content in triple quotes with no extra text
                        (default: False)
  -R, --raw-no-quotes   Output the content only (default: False)
  --print-percentage-non-ascii
                        Print percentage of non-ascii characters (default:
                        False)
  -n, --dry-run         Dry run (default: False)
  --out OUT             Output file (default: None)
  -C, --from-clipboard  Load text from clipboard (default: False)
  -w WHAT, --what WHAT  Initial knowledge you want to insert before the HTML
                        content in the prompt (default: the text content of
                        an HTML file)
  -M, --merge           Merge contents of all pages before processing
                        (default: False)

```
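
For example, to write just the text content of a saved page to a file, with no prompt wrapper (file names illustrative):

```
$ htmlprompt saved-page.html -R --out page.txt
```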

## Installation

### pipx

This is the recommended installation method.

```
$ pipx install langchain-utils
```

### [pip](https://pypi.org/project/langchain-utils/)

```
$ pip install langchain-utils
```

## Develop

```
$ git clone https://github.com/tddschn/langchain-utils.git
$ cd langchain-utils
$ poetry install
```