{"id":23869163,"url":"https://github.com/kameshpoc/rag-data-pipeline","last_synced_at":"2026-05-10T16:03:35.709Z","repository":{"id":270733748,"uuid":"911280093","full_name":"kameshpoc/rag-data-pipeline","owner":"kameshpoc","description":"This repo is to demonstrate rag data processing pipeline using dataflow flex templates","archived":false,"fork":false,"pushed_at":"2025-01-02T17:49:07.000Z","size":23,"stargazers_count":0,"open_issues_count":0,"forks_count":0,"subscribers_count":1,"default_branch":"main","last_synced_at":"2025-02-22T17:47:50.836Z","etag":null,"topics":["cloud-storage-bucket","dataflow-job","embeddings","flex-template","gcp","gcp-dataflow","genai","pickle-file","rag","rag-pipeline"],"latest_commit_sha":null,"homepage":"","language":"Python","has_issues":false,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/kameshpoc.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":"CONTRIBUTING.md","funding":null,"license":null,"code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2025-01-02T16:43:39.000Z","updated_at":"2025-01-02T17:49:10.000Z","dependencies_parsed_at":"2025-01-02T18:39:23.148Z","dependency_job_id":"3df6f896-33a1-408e-9cd1-4b71b1b8d722","html_url":"https://github.com/kameshpoc/rag-data-pipeline","commit_stats":null,"previous_names":["kameshpoc/rag-data-pipeline"],"tags_count":0,"template":false,"template_full_name":null,"purl":"pkg:github/kameshpoc/rag-data-pipeline","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/kameshpoc%2Frag-data-pipeline","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/kameshpoc%2Frag-data-pipeline/tags","releases_url":"https://repos.ecosyste.ms/api/v
1/hosts/GitHub/repositories/kameshpoc%2Frag-data-pipeline/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/kameshpoc%2Frag-data-pipeline/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/kameshpoc","download_url":"https://codeload.github.com/kameshpoc/rag-data-pipeline/tar.gz/refs/heads/main","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/kameshpoc%2Frag-data-pipeline/sbom","scorecard":null,"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":270991950,"owners_count":24681304,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","status":"online","status_checked_at":"2025-08-18T02:00:08.743Z","response_time":89,"last_error":null,"robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":true,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["cloud-storage-bucket","dataflow-job","embeddings","flex-template","gcp","gcp-dataflow","genai","pickle-file","rag","rag-pipeline"],"created_at":"2025-01-03T12:16:15.778Z","updated_at":"2026-05-10T16:03:30.665Z","avatar_url":"https://github.com/kameshpoc.png","language":"Python","funding_links":[],"categories":[],"sub_categories":[],"readme":"# RAG Dataflow Flex Template: a pipeline with dependencies and a custom container image\n\nThis sample project illustrates the following Dataflow Python pipeline setup:\n\n-   The pipeline is a package that consists of\n    [multiple 
files](https://beam.apache.org/documentation/sdks/python-pipeline-dependencies/#multiple-file-dependencies).\n\n-   The pipeline has at least one dependency that is not provided in the default\n    Dataflow runtime environment.\n\n-   The workflow uses a\n    [custom container image](https://cloud.google.com/dataflow/docs/guides/using-custom-containers)\n    to preinstall dependencies and to define the pipeline runtime environment.\n\n-   The workflow uses a\n    [Dataflow Flex Template](https://cloud.google.com/dataflow/docs/concepts/dataflow-templates)\n    to control the pipeline submission environment.\n\n-   The runtime and submission environments use the same set of Python dependencies\n    and can be created in a reproducible manner.\n\n## What does it do exactly?\n\nTo illustrate this setup, we use a pipeline that does the following:\n\n1.  Lists PDFs from the input folder of a Google Cloud Storage bucket\n\n1.  Creates embeddings using the text-embedding-004 model from Google Vertex AI\n\n1.  Stores the embeddings in a pickle file in the output folder of the storage bucket\n\n1.  Moves the processed PDF files to the processed folder of the storage bucket\n\n## The structure of the repository\n\nThe pipeline package consists of the `src/my_package` directory, the\n`pyproject.toml` file and the `setup.py` file. The package defines the pipeline,\nthe pipeline dependencies, and the input parameters. You can define multiple\npipelines in the same package. The `my_package.launcher` module is used to\nsubmit the pipeline to a runner.\n\nThe `main.py` file provides a top-level entrypoint to trigger the pipeline\nlauncher from a launch environment.\n\nThe `Dockerfile` defines the runtime environment for the pipeline. It also\nconfigures the Flex Template, which lets you reuse the runtime image to build\nthe Flex Template.\n\nThe `requirements.txt` file defines all Python packages in the dependency chain\nof the pipeline package. 
Use it to create reproducible Python environments in\nthe Docker image.\n\nThe `metadata.json` file defines Flex Template parameters and their validation\nrules. It is optional.\n\n## Before you begin\n\n1.  Make sure to enable the Dataflow API \u0026 other related APIs, such as Cloud Build. Refer to the Dataflow documentation for details.\n\n1.  Clone the [`rag-data-pipeline` repository](https://github.com/kameshpoc/rag-data-pipeline)\n    and navigate to the code sample.\n\n    ```sh\n    git clone https://github.com/kameshpoc/rag-data-pipeline.git\n    ```\n\n## Create a Cloud Storage bucket\n\n```sh\nexport PROJECT=\"project-id\"\nexport BUCKET=\"your-bucket\"\nexport REGION=\"us-central1\"\ngsutil mb -p $PROJECT gs://$BUCKET\n```\n\n## Create an Artifact Registry repository\n\n```sh\nexport REPOSITORY=\"your-repository\"\n\ngcloud artifacts repositories create $REPOSITORY \\\n    --repository-format=docker \\\n    --location=$REGION \\\n    --project $PROJECT\n\ngcloud auth configure-docker $REGION-docker.pkg.dev\n```\n\n## Build a Docker image for the pipeline runtime environment\n\nUsing a\n[custom SDK container image](https://cloud.google.com/dataflow/docs/guides/using-custom-containers)\nallows flexible customizations of the runtime environment.\n\nThis example uses the custom container image both to preinstall all of the\npipeline dependencies before job submission and to create a reproducible runtime\nenvironment.\n\nTo illustrate customizations, a\n[custom base image](https://cloud.google.com/dataflow/docs/guides/build-container-image#use_a_custom_base_image)\nis used to build the SDK container image.\n\nThe Flex Template launcher is included in the SDK container image, which makes\nit possible to\n[use the SDK container image to build a Flex Template](https://cloud.google.com/dataflow/docs/guides/templates/configuring-flex-templates#use_custom_container_images).\n\n```sh\n# Use a unique tag to version the artifacts that are built.\nexport TAG=`date 
+%Y%m%d-%H%M%S`\nexport SDK_CONTAINER_IMAGE=\"$REGION-docker.pkg.dev/$PROJECT/$REPOSITORY/my_rag_base_image:$TAG\"\n\ngcloud builds submit . --tag $SDK_CONTAINER_IMAGE --project $PROJECT\n```\n\n## Build the Flex Template\n\nBuild the Flex Template\n[from the SDK container image](https://cloud.google.com/dataflow/docs/guides/templates/configuring-flex-templates#use_custom_container_images).\nUsing the runtime image as the Flex Template image reduces the number of Docker\nimages that need to be maintained. It also ensures that the pipeline uses the\nsame dependencies at submission and at runtime.\n\n```sh\nexport TEMPLATE_FILE=gs://$BUCKET/rag-data-processing-$TAG.json\n```\n\n```sh\ngcloud dataflow flex-template build $TEMPLATE_FILE  \\\n    --image $SDK_CONTAINER_IMAGE \\\n    --sdk-language \"PYTHON\" \\\n    --metadata-file=metadata.json \\\n    --project $PROJECT\n```\n\n## Run the template\n\nThis creates a Dataflow job.\n\n```sh\n# Create a Dataflow job using the Flex Template\ngcloud dataflow flex-template run \"rag-flex-`date +%Y%m%d-%H%M%S`\" \\\n    --template-file-gcs-location $TEMPLATE_FILE \\\n    --region $REGION \\\n    --staging-location \"gs://$BUCKET/staging\" \\\n    --parameters sdk_container_image=$SDK_CONTAINER_IMAGE \\\n    --project $PROJECT \\\n    --parameters inputFolder=\"gs://$BUCKET/input/\" \\\n    --parameters outputFolder=\"gs://$BUCKET/output/\" \\\n    --parameters processedFolder=\"gs://$BUCKET/processed/\"\n```\nAlternatively, use the `build-flex.sh` script to execute all of the above commands. Don't forget to make it executable:\n```bash\nchmod +x ./build-flex.sh\n```\nTo run the script, use the command below. 
Before running the script, update it with your specific input details.\n```bash\n./build-flex.sh\n```\n\nAfter the pipeline finishes, use the following command to inspect the output:\n\n```bash\ngsutil cat gs://$BUCKET/output*\n```\n\n## Optional: Update the dependencies in the requirements file and rebuild the Docker images\n\nThe top-level pipeline dependencies are defined in the `dependencies` section of\nthe `pyproject.toml` file.\n\nThe `requirements.txt` file pins all Python dependencies that must be installed\nin the Docker container image, including the transitive dependencies. Listing\nall packages produces reproducible Python environments every time the image is\nbuilt. Version control the `requirements.txt` file together with the rest of\nthe pipeline code.\n\nWhen the dependencies of your pipeline change or when you want to use the latest\navailable versions of packages in the pipeline's dependency chain, regenerate\nthe `requirements.txt` file:\n\n  ```bash\n  python3 -m pip install pip-tools\n  python3 -m piptools compile -o requirements.txt pyproject.toml\n  ```\n\nIf you base your custom container image on the standard Apache Beam base image,\nto reduce the image size and to give preference to the versions already\ninstalled in the Apache Beam base image, use a constraints file:\n\n```bash\nwget https://raw.githubusercontent.com/apache/beam/release-2.54.0/sdks/python/container/py311/base_image_requirements.txt\npython3 -m piptools compile --constraint=base_image_requirements.txt ./pyproject.toml\n```\n\nAlternatively, take the following steps:\n\n1.  Use an empty `requirements.txt` file.\n1.  Build the SDK container Docker image from the Dockerfile.\n1.  Collect the output of `pip freeze` at the last stage of the Docker build.\n1.  
Seed the `requirements.txt` file with that content.\n\nFor more information, see the Apache Beam\n[reproducible environments](https://beam.apache.org/documentation/sdks/python-pipeline-dependencies/#create-reproducible-environments)\ndocumentation.\n\n---\n## Repository: rag-data-pipeline\n\nThis repository is public for reference purposes only. Contributions in the form of issues, pull requests, or feature requests are not accepted. If you'd like to use or modify this project, please fork it.\n\n## License\n\nThis project is licensed under the MIT License - see the [LICENSE](LICENCE) file for details.\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fkameshpoc%2Frag-data-pipeline","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fkameshpoc%2Frag-data-pipeline","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fkameshpoc%2Frag-data-pipeline/lists"}