https://github.com/mewmix/gh_llm_loader
clone GitHub repositories and prepare their data for ingestion for LLMs.
- Host: GitHub
- URL: https://github.com/mewmix/gh_llm_loader
- Owner: mewmix
- License: other
- Created: 2024-02-23T23:30:17.000Z (11 months ago)
- Default Branch: main
- Last Pushed: 2024-11-25T20:24:17.000Z (2 months ago)
- Last Synced: 2024-11-25T21:27:55.839Z (2 months ago)
- Topics: context, data, data-structures, github, llm, llm-training, python
- Language: Python
- Homepage:
- Size: 19.5 KB
- Stars: 9
- Watchers: 2
- Forks: 3
- Open Issues: 0
Metadata Files:
- Readme: README.md
- License: LICENSE
# gh_llm_loader
`gh_llm_loader` is a Python package that clones GitHub repositories and prepares their contents for ingestion by LLMs.
## Features
- **Prepare Repository Data**: Compiles the contents of repositories into a single, clean file by excluding specified folders and files, or including only specified folders and files. This streamlined format is more accessible for LLM ingestion.
- **Flexible File Filtering**: Filter files based on extensions, filenames, or custom functions for maximum control over the included content.
- **CLI and Library Integration**: Flexibility for various use cases and workflows.
## Installation
To install `gh_llm_loader`, make sure you have Python installed on your system, then run the following command:
```sh
git clone https://github.com/mewmix/gh_llm_loader
cd gh_llm_loader
pip install .
```
**Prerequisites:**
- Python 3.6 or newer
- Git installed on your system
## Usage
### Library Usage
`gh_llm_loader` can be easily integrated into Python scripts. Here's an example of how to use it:
```python
from gh_llm_loader import clone_and_prepare_repo

# Define the GitHub repository URL
git_url = "https://github.com/yourusername/yourrepository.git"

# Clone and prepare the repository, specifying folders and files to ignore or include
clone_and_prepare_repo(git_url, ignored_folders={'node_modules', '.git'}, ignored_files={'README.md'}, included_folders={'.teamcity'}, file_filter=lambda f: f.endswith('.xml'))
```
This function clones the specified GitHub repository and compiles its files into a single file. It skips the ignored folders and files and keeps only the files in the included folders that match the filter.
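The selection rules described above can be sketched in plain Python. This is an illustrative sketch, not the package's actual implementation: the helper names `iter_selected_files` and `compile_selected`, the order in which the rules are applied, and the `--- path ---` header written before each file are all assumptions.

```python
import os
from typing import Callable, Iterable, Iterator, Optional


def iter_selected_files(
    root: str,
    ignored_folders: Iterable[str] = (),
    ignored_files: Iterable[str] = (),
    included_folders: Iterable[str] = (),
    file_filter: Optional[Callable[[str], bool]] = None,
) -> Iterator[str]:
    """Yield file paths under `root` that pass every rule (assumed semantics)."""
    ignored_folders = set(ignored_folders)
    ignored_files = set(ignored_files)
    included_folders = set(included_folders)

    for dirpath, dirnames, filenames in os.walk(root):
        # Prune ignored folders in place so os.walk never descends into them.
        dirnames[:] = sorted(d for d in dirnames if d not in ignored_folders)

        # Path components of this directory relative to the root.
        rel_dir = os.path.relpath(dirpath, root)
        parts = [] if rel_dir == os.curdir else rel_dir.split(os.sep)

        # With an include list, keep only files that sit under a listed folder.
        if included_folders and not included_folders.intersection(parts):
            continue

        for name in sorted(filenames):
            if name in ignored_files:
                continue
            if file_filter is not None and not file_filter(name):
                continue
            yield os.path.join(dirpath, name)


def compile_selected(root: str, output_path: str, **rules) -> int:
    """Concatenate the selected files into one text file; return the file count."""
    output_abs = os.path.abspath(output_path)
    count = 0
    with open(output_path, "w", encoding="utf-8") as out:
        for path in iter_selected_files(root, **rules):
            # Never read the output file back into itself.
            if os.path.abspath(path) == output_abs:
                continue
            # Hypothetical header so each file's origin stays visible.
            out.write(f"--- {os.path.relpath(path, root)} ---\n")
            with open(path, encoding="utf-8", errors="replace") as src:
                out.write(src.read())
            out.write("\n")
            count += 1
    return count
```

Calling `compile_selected("repo", "out.txt", ignored_folders={"node_modules", ".git"}, ignored_files={"README.md"}, included_folders={".teamcity"}, file_filter=lambda f: f.endswith(".xml"))` mirrors the arguments of the example above. Under these assumed rules, it keeps only `.xml` files that sit somewhere beneath a `.teamcity` folder.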
If you only want to compile a local (non-GitHub) folder with the same methods, the core function can be imported and used directly:
```python
from gh_llm_loader import compile_files_to_single_file

# Specify the path to your project directory
source_path = "/path/to/your/project"

# Define the name for the output file
output_filename = "project_compiled.txt"

# Specify any folders or files you want to ignore during compilation
ignored_folders = {'node_modules', '.git', 'build'}
ignored_files = {'README.md', 'LICENSE'}

# Compile the project files into a single file
compile_files_to_single_file(source_path, output_filename, ignored_folders, ignored_files)

print(f"Compilation complete. The output is saved in {output_filename}")
```
### Command-Line Interface (CLI)
For those preferring to use the command line, here are some examples:
a) Use only a local folder, not GitHub:
```sh
gh-llm-loader --base-dir test
```
b) Include only Python files (with GitHub):
```sh
gh-llm-loader --git-url https://github.com/psf/requests --file-filter "lambda f: f.endswith('.py')"
```
c) Include only Markdown and text files (with GitHub):
```sh
gh-llm-loader --git-url https://github.com/tensorflow/models --file-filter "lambda f: f.endswith('.md') or f.endswith('.txt')"
```
d) Include only files with "test" in the filename (with GitHub):
```sh
gh-llm-loader --git-url https://github.com/django/django --file-filter "lambda f: 'test' in f"
```
e) Include only JavaScript files and the "package.json" file (with GitHub):
```sh
gh-llm-loader --git-url https://github.com/facebook/react --file-filter "lambda f: f.endswith('.js') or f == 'package.json'"
```
f) Include only files in the "src" and "docs" folders (with GitHub):
```sh
gh-llm-loader --git-url https://github.com/vuejs/vue --included-folders src docs
```
g) Exclude the "tests" folder and include only Python files (with GitHub):
```sh
gh-llm-loader --git-url https://github.com/pallets/flask --ignored-folders tests --file-filter "lambda f: f.endswith('.py')"
```
## Contributing
Contributions to `gh_llm_loader` are highly encouraged and appreciated.
## License
`gh_llm_loader` is made available under the MIT License. For more details, see the LICENSE file in the repository.