{"id":24142144,"url":"https://github.com/mewmix/gh_llm_loader","last_synced_at":"2025-09-19T11:31:08.959Z","repository":{"id":224147754,"uuid":"762508931","full_name":"mewmix/gh_llm_loader","owner":"mewmix","description":"clone GitHub repositories and prepare their data for ingestion for LLMs.","archived":false,"fork":false,"pushed_at":"2024-11-25T20:24:17.000Z","size":20,"stargazers_count":9,"open_issues_count":0,"forks_count":3,"subscribers_count":2,"default_branch":"main","last_synced_at":"2024-11-25T21:27:55.839Z","etag":null,"topics":["context","data","data-structures","github","llm","llm-training","python"],"latest_commit_sha":null,"homepage":"","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"other","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/mewmix.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2024-02-23T23:30:17.000Z","updated_at":"2024-11-25T20:24:21.000Z","dependencies_parsed_at":"2024-02-24T04:25:09.061Z","dependency_job_id":"3c9b9e95-4d6b-4049-9b86-8bc975325010","html_url":"https://github.com/mewmix/gh_llm_loader","commit_stats":null,"previous_names":["mewmix/gh_llm_loader"],"tags_count":0,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/mewmix%2Fgh_llm_loader","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/mewmix%2Fgh_llm_loader/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/mewmix%2Fgh_llm_loader/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/mewmix%2Fgh_llm_loader/manifests","ow
ner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/mewmix","download_url":"https://codeload.github.com/mewmix/gh_llm_loader/tar.gz/refs/heads/main","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":233566904,"owners_count":18695290,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["context","data","data-structures","github","llm","llm-training","python"],"created_at":"2025-01-12T04:56:01.934Z","updated_at":"2025-09-19T11:31:03.673Z","avatar_url":"https://github.com/mewmix.png","language":"Python","funding_links":[],"categories":[],"sub_categories":[],"readme":"\n\n# gh_llm_loader\n\n`gh_llm_loader` is a package designed to clone GitHub repositories and prepare their data for ingestion for LLMs.\n\n## Features\n\n- **Prepare Repository Data**: Compiles the contents of repositories into a single, clean file by excluding specified folders and files, or including only specified folders and files. 
This streamlined format is more accessible for LLM ingestion.\n- **Flexible File Filtering**: Filter files based on extensions, filenames, or custom functions for maximum control over the included content.\n- **CLI and Library Integration**: Use it from the command line or as a Python library, to fit different workflows.\n\n## Installation\n\nTo install `gh_llm_loader`, make sure you have Python installed on your system, then run the following commands:\n\n```sh\ngit clone https://github.com/mewmix/gh_llm_loader\ncd gh_llm_loader\npip install .\n```\n\n**Prerequisites:**\n- Python 3.6 or newer\n- Git installed on your system\n\n## Usage\n\n### Library Usage\n\n`gh_llm_loader` can be integrated into Python scripts. Here's an example of how to use it:\n\n```python\nfrom gh_llm_loader import clone_and_prepare_repo\n\n# Define the GitHub repository URL\ngit_url = \"https://github.com/yourusername/yourrepository.git\"\n\n# Clone and prepare the repository, specifying folders and files to ignore or include\nclone_and_prepare_repo(\n    git_url,\n    ignored_folders={'node_modules', '.git'},\n    ignored_files={'README.md'},\n    included_folders={'.teamcity'},\n    file_filter=lambda f: f.endswith('.xml'),\n)\n```\n\nThis function clones the specified GitHub repository and compiles its files into a single file, excluding the ignored folders and files and including only the specified folders and the files that match the filter criteria.\n\nTo process a local, non-GitHub folder with the same methods, import and call the core function directly:\n\n```python\nimport os\nfrom gh_llm_loader import compile_files_to_single_file\n\n# Specify the path to your project directory\nsource_path = \"/path/to/your/project\"\n\n# Define the name for the output file\noutput_filename = \"project_compiled.txt\"\n\n# Specify any folders or files you want to ignore during compilation\nignored_folders = {'node_modules', '.git', 'build'}\nignored_files = {'README.md', 'LICENSE'}\n\n# Compile the project files 
into a single file\ncompile_files_to_single_file(source_path, output_filename, ignored_folders, ignored_files)\n\nprint(f\"Compilation complete. The output is saved in {output_filename}\")\n```\n\n### Command-Line Interface (CLI)\n\nFor those who prefer the command line, here are some examples:\n\na) Process a local folder only, without GitHub:\n```sh\ngh-llm-loader --base-dir test\n```\n\nb) Include only Python files (with GitHub):\n```sh\ngh-llm-loader --git-url https://github.com/psf/requests --file-filter \"lambda f: f.endswith('.py')\"\n```\n\nc) Include only Markdown and text files (with GitHub):\n```sh\ngh-llm-loader --git-url https://github.com/tensorflow/models --file-filter \"lambda f: f.endswith('.md') or f.endswith('.txt')\"\n```\n\nd) Include only files with \"test\" in the filename (with GitHub):\n```sh\ngh-llm-loader --git-url https://github.com/django/django --file-filter \"lambda f: 'test' in f\"\n```\n\ne) Include only JavaScript files and the \"package.json\" file (with GitHub):\n```sh\ngh-llm-loader --git-url https://github.com/facebook/react --file-filter \"lambda f: f.endswith('.js') or f == 'package.json'\"\n```\n\nf) Include only files in the \"src\" and \"docs\" folders (with 
GitHub):\n```sh\ngh-llm-loader --git-url https://github.com/vuejs/vue --included-folders src docs\n```\n\ng) Exclude the \"tests\" folder and include only Python files (with GitHub):\n```sh\ngh-llm-loader --git-url https://github.com/pallets/flask --ignored-folders tests --file-filter \"lambda f: f.endswith('.py')\"\n```\n\n## Contributing\n\nContributions to `gh_llm_loader` are highly encouraged and appreciated.\n\n## License\n\n`gh_llm_loader` is made available under the MIT License. For more details, see the LICENSE file in the repository.\n\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fmewmix%2Fgh_llm_loader","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fmewmix%2Fgh_llm_loader","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fmewmix%2Fgh_llm_loader/lists"}