https://github.com/mewmix/gh_llm_loader

clone GitHub repositories and prepare their data for ingestion for LLMs.

context data data-structures github llm llm-training python

# gh_llm_loader

`gh_llm_loader` is a Python package that clones GitHub repositories and prepares their contents for ingestion by LLMs.

## Features

- **Prepare Repository Data**: Compiles the contents of repositories into a single, clean file by excluding specified folders and files, or including only specified folders and files. This streamlined format is more accessible for LLM ingestion.
- **Flexible File Filtering**: Filter files based on extensions, filenames, or custom functions for maximum control over the included content.
- **CLI and Library Integration**: Run it from the command line with `gh-llm-loader`, or import it into your own Python scripts.
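To make the first two features concrete, here is a minimal sketch of what compiling a directory tree into a single file with folder, file, and function-based filtering can look like. It is an illustration only, not the package's actual implementation: the function name `compile_tree_sketch`, the `--- path ---` separator format, and the choice to skip unreadable files are all assumptions.

```python
import os
from typing import Callable, Iterable, Optional


def compile_tree_sketch(
    source_path: str,
    output_path: str,
    ignored_folders: Iterable[str] = (),
    ignored_files: Iterable[str] = (),
    file_filter: Optional[Callable[[str], bool]] = None,
) -> int:
    """Concatenate text files under source_path into output_path.

    Hypothetical illustration; returns the number of files written.
    """
    ignored_folders = set(ignored_folders)
    ignored_files = set(ignored_files)
    out_abs = os.path.abspath(output_path)
    written = 0
    with open(output_path, "w", encoding="utf-8") as out:
        for root, dirs, files in os.walk(source_path):
            # Prune ignored folders in place so os.walk never descends into them.
            dirs[:] = sorted(d for d in dirs if d not in ignored_folders)
            for name in sorted(files):
                if name in ignored_files:
                    continue
                if file_filter is not None and not file_filter(name):
                    continue
                path = os.path.join(root, name)
                if os.path.abspath(path) == out_abs:
                    continue  # never copy the output file into itself
                try:
                    with open(path, "r", encoding="utf-8") as src:
                        text = src.read()
                except (UnicodeDecodeError, OSError):
                    continue  # skip binary or unreadable files
                rel = os.path.relpath(path, source_path)
                # Assumed separator format: a header line with the relative path.
                out.write(f"--- {rel} ---\n{text}\n")
                written += 1
    return written
```

Labelling each chunk with its relative path is a design choice in this sketch: it lets a reader (or a model) tell where one file ends and the next begins.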

## Installation

To install `gh_llm_loader`, make sure Python and Git are installed on your system, then run the following commands:

```sh
git clone https://github.com/mewmix/gh_llm_loader
cd gh_llm_loader
pip install .
```

**Prerequisites:**
- Python 3.6 or newer
- Git installed on your system
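If you want to verify these prerequisites from a script before installing, a small check along the following lines may help. The helper name `check_prerequisites` is hypothetical and not part of `gh_llm_loader`; it relies only on the standard library.

```python
import shutil
import sys


def check_prerequisites(min_python=(3, 6)):
    """Return a list of human-readable problems; an empty list means all checks passed."""
    problems = []
    if sys.version_info < min_python:
        wanted = ".".join(str(part) for part in min_python)
        found = ".".join(str(part) for part in sys.version_info[:3])
        problems.append(f"Python {wanted} or newer is required, found {found}")
    # shutil.which returns None when the executable is not on PATH.
    if shutil.which("git") is None:
        problems.append("git was not found on PATH")
    return problems


if __name__ == "__main__":
    for problem in check_prerequisites():
        print(problem)
```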

## Usage

### Library Usage

`gh_llm_loader` can be easily integrated into Python scripts. Here's an example of how to use it:

```python
from gh_llm_loader import clone_and_prepare_repo

# Define the GitHub repository URL
git_url = "https://github.com/yourusername/yourrepository.git"

# Clone and prepare the repository, specifying folders and files to ignore or include
clone_and_prepare_repo(
    git_url,
    ignored_folders={'node_modules', '.git'},
    ignored_files={'README.md'},
    included_folders={'.teamcity'},
    file_filter=lambda f: f.endswith('.xml'),
)
```

This function clones the specified GitHub repository and compiles its files into a single file. It skips the ignored folders and files, and keeps only the files in the included folders that pass the filter.
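Because `file_filter` is an ordinary callable, a filter that is more involved than a one-line lambda can be built once and reused. The sketch below assumes the filter receives a bare filename such as `package.json` rather than a full path, which the examples in this README suggest but do not state outright. `make_file_filter` is a hypothetical helper, not part of the package.

```python
from typing import Callable, Iterable


def make_file_filter(
    extensions: Iterable[str] = (),
    names: Iterable[str] = (),
) -> Callable[[str], bool]:
    """Build a filter that accepts exact filenames or any of the given extensions."""
    exts = tuple(extensions)
    exact = set(names)

    def file_filter(filename: str) -> bool:
        # str.endswith accepts a tuple of suffixes; an empty tuple matches nothing.
        return filename in exact or filename.endswith(exts)

    return file_filter


js_filter = make_file_filter(extensions=[".js"], names=["package.json"])
print(js_filter("index.js"))      # True
print(js_filter("package.json"))  # True
print(js_filter("notes.txt"))     # False
```

The resulting function can be passed as `file_filter=js_filter` in place of the lambda.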

To compile a local folder that is not a GitHub repository, import and call the core function directly:

```python
from gh_llm_loader import compile_files_to_single_file

# Specify the path to your project directory
source_path = "/path/to/your/project"

# Define the name for the output file
output_filename = "project_compiled.txt"

# Specify any folders or files you want to ignore during compilation
ignored_folders = {'node_modules', '.git', 'build'}
ignored_files = {'README.md', 'LICENSE'}

# Compile the project files into a single file
compile_files_to_single_file(source_path, output_filename, ignored_folders, ignored_files)

print(f"Compilation complete. The output is saved in {output_filename}")
```

### Command-Line Interface (CLI)

For those preferring to use the command line, here are some examples:

a) Process a local folder instead of a GitHub repository:
```sh
gh-llm-loader --base-dir test
```

b) Include only Python files (from GitHub):
```sh
gh-llm-loader --git-url https://github.com/psf/requests --file-filter "lambda f: f.endswith('.py')"
```

c) Include only Markdown and text files (from GitHub):
```sh
gh-llm-loader --git-url https://github.com/tensorflow/models --file-filter "lambda f: f.endswith('.md') or f.endswith('.txt')"
```

d) Include only files with "test" in the filename (from GitHub):
```sh
gh-llm-loader --git-url https://github.com/django/django --file-filter "lambda f: 'test' in f"
```

e) Include only JavaScript files and the "package.json" file (from GitHub):
```sh
gh-llm-loader --git-url https://github.com/facebook/react --file-filter "lambda f: f.endswith('.js') or f == 'package.json'"
```

f) Include only files in the "src" and "docs" folders (from GitHub):
```sh
gh-llm-loader --git-url https://github.com/vuejs/vue --included-folders src docs
```

g) Exclude the "tests" folder and include only Python files (from GitHub):
```sh
gh-llm-loader --git-url https://github.com/pallets/flask --ignored-folders tests --file-filter "lambda f: f.endswith('.py')"
```
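
The `--file-filter` value in the examples above is a Python lambda written as a string. Before running it against a large repository, you may want to check which filenames a given expression accepts. The sketch below does that locally with `eval`; how the CLI itself parses the string is not documented here, so treat this as a preview aid only. `preview_filter` and the sample names are hypothetical.

```python
def preview_filter(expression, sample_names):
    """Evaluate a lambda expression string and return the sample names it accepts."""
    # eval runs arbitrary code: only use it on expressions you wrote yourself.
    file_filter = eval(expression)
    if not callable(file_filter):
        raise TypeError("expression did not evaluate to a callable")
    return [name for name in sample_names if file_filter(name)]


samples = ["app.py", "test_app.py", "README.md", "notes.txt", "package.json"]

print(preview_filter("lambda f: f.endswith('.md') or f.endswith('.txt')", samples))
# ['README.md', 'notes.txt']

print(preview_filter("lambda f: 'test' in f", samples))
# ['test_app.py']
```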

## Contributing

Contributions to `gh_llm_loader` are highly encouraged and appreciated.

## License

`gh_llm_loader` is made available under the MIT License. For more details, see the LICENSE file in the repository.