{"id":24990428,"url":"https://github.com/bessouat40/prefect-github-indexer","last_synced_at":"2026-04-20T05:33:07.188Z","repository":{"id":275314079,"uuid":"925324416","full_name":"Bessouat40/prefect-github-indexer","owner":"Bessouat40","description":"A Prefect pipeline that periodically scrapes one or more GitHub repositories, generates embeddings, and indexes them in ChromaDB.","archived":false,"fork":false,"pushed_at":"2025-02-01T15:54:51.000Z","size":9,"stargazers_count":0,"open_issues_count":0,"forks_count":0,"subscribers_count":1,"default_branch":"main","last_synced_at":"2025-03-29T12:46:57.801Z","etag":null,"topics":["automation","database","dataengineering","dataextraction","docker","docker-compose","prefect","python"],"latest_commit_sha":null,"homepage":"","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":null,"status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/Bessouat40.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":null,"code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2025-01-31T16:58:21.000Z","updated_at":"2025-02-01T15:54:54.000Z","dependencies_parsed_at":"2025-02-01T16:44:22.034Z","dependency_job_id":null,"html_url":"https://github.com/Bessouat40/prefect-github-indexer","commit_stats":null,"previous_names":["bessouat40/prefect-github-indexer"],"tags_count":0,"template":false,"template_full_name":null,"purl":"pkg:github/Bessouat40/prefect-github-indexer","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/Bessouat40%2Fprefect-github-indexer","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/Bessouat40%2Fprefect-github-indexer/tags","releases_url":"https://repos.ecosyst
e.ms/api/v1/hosts/GitHub/repositories/Bessouat40%2Fprefect-github-indexer/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/Bessouat40%2Fprefect-github-indexer/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/Bessouat40","download_url":"https://codeload.github.com/Bessouat40/prefect-github-indexer/tar.gz/refs/heads/main","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/Bessouat40%2Fprefect-github-indexer/sbom","scorecard":null,"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":286080680,"owners_count":32034647,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2026-04-20T00:18:06.643Z","status":"online","status_checked_at":"2026-04-20T02:00:06.527Z","response_time":94,"last_error":null,"robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":true,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["automation","database","dataengineering","dataextraction","docker","docker-compose","prefect","python"],"created_at":"2025-02-04T13:36:27.275Z","updated_at":"2026-04-20T05:33:07.173Z","avatar_url":"https://github.com/Bessouat40.png","language":"Python","funding_links":[],"categories":[],"sub_categories":[],"readme":"# Prefect GitHub Indexer\n\nA Prefect pipeline that periodically scrapes one or more GitHub repositories, generates embeddings, and indexes them in ChromaDB.\n\n## What is Prefect?\n\nPrefect is a modern workflow orchestration platform that helps you:\n\n- Write data workflows in Python using a straightforward, 
decorator-based API.\n- Schedule your workflows (flows) to run at specific times or intervals (daily, weekly, etc.).\n- Monitor flow executions in a web-based UI, with clear logs and task statuses.\n- Handle retries, concurrency limits, configuration of infrastructure (Docker, Kubernetes, etc.), and more.\n\nIn this project, Prefect orchestrates:\n\n- Scraping GitHub repositories.\n- Generating embeddings for code files.\n- Indexing these embeddings into ChromaDB.\n- Scheduling daily or periodic runs to keep your embeddings up-to-date.\n\n## Architecture Overview\n\nThis repository’s main flow is defined in flows/github-scrapper.py. Here’s how it all fits together:\n\n1. Flow: update_vector_store\n   The top-level orchestration function that processes multiple GitHub repos in parallel.\n2. Tasks: fetch_repo, ingest_repo_to_vector_store\n   Smaller, reusable steps that do the actual work (cloning a repo, ingesting into ChromaDB, etc.).\n3. Subflow: process_repo\n   Called for each repo. Manages the end-to-end flow of cloning, indexing, and cleanup for that repository.\n4. Scheduling: The flow can be scheduled to run at midnight every day (cron=\"0 0 \\* \\* \\*\").\n\nPrefect’s server and agent allow you to:\n\nServer (UI \u0026 API): See your flow runs, logs, and manage deployments from a nice dashboard.\nAgent (Worker): Listens for scheduled or triggered flow runs and executes them (in Docker containers or other infrastructure).\n\n## Installation\n\n1. Clone this repository.\n2. 
Install Prefect and any additional dependencies:\n\n```bash\npython -m pip install -U prefect\n```\n\n## Running Locally\n\nYou can test the flow directly by running:\n\n```bash\npython flows/github-scrapper.py\n```\n\nBy default, the flow then runs every day at midnight. Each run will:\n\n- Clone the specified repositories.\n- Ingest them into ChromaDB (in the local chroma_db folder).\n- Print logs about each step.\n\n## Prefect Server\n\nPrefect comes with a built-in API and UI (sometimes referred to as the Orion UI in older docs). You can start it locally by running:\n\n```bash\npython -m prefect server start\n```\n\nThen open [localhost:4200](http://localhost:4200/dashboard) in your browser. You’ll see:\n\n- A dashboard with Deployments (flows you’ve registered),\n- Flow Runs (individual executions of flows),\n- Logs for debugging, etc.\n\nRegistering your flow with the server allows you to schedule it from the UI, monitor runs, and scale out using agents on other machines or Docker containers.\n\n## Docker Deployment\n\nFor a more production-like setup, you can run everything via Docker Compose. This will spin up:\n\n- prefect-server (UI + API),\n- prefect-agent (the worker that executes flows),\n- A Docker volume for ChromaDB data (chroma_db).\n\n### Build and run\n\n```bash\ndocker-compose up -d --build\n```\n\nOnce the server is up, visit [localhost:4200](http://localhost:4200/dashboard).\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fbessouat40%2Fprefect-github-indexer","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fbessouat40%2Fprefect-github-indexer","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fbessouat40%2Fprefect-github-indexer/lists"}