{"id":20102522,"url":"https://github.com/clickhouse/bedrock_rag","last_synced_at":"2025-05-06T08:30:46.512Z","repository":{"id":208462364,"uuid":"721690780","full_name":"ClickHouse/bedrock_rag","owner":"ClickHouse","description":"A simple RAG pipeline for Google Analytics with ClickHouse and Bedrock","archived":false,"fork":false,"pushed_at":"2024-07-05T23:35:55.000Z","size":171,"stargazers_count":7,"open_issues_count":2,"forks_count":0,"subscribers_count":3,"default_branch":"main","last_synced_at":"2025-04-09T10:11:49.478Z","etag":null,"topics":[],"latest_commit_sha":null,"homepage":null,"language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":null,"status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/ClickHouse.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":null,"code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2023-11-21T15:20:38.000Z","updated_at":"2025-01-27T00:01:34.000Z","dependencies_parsed_at":null,"dependency_job_id":"1adc17aa-03aa-4ca1-8f12-f35cfce905b7","html_url":"https://github.com/ClickHouse/bedrock_rag","commit_stats":null,"previous_names":["clickhouse/bedrock_rag"],"tags_count":0,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/ClickHouse%2Fbedrock_rag","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/ClickHouse%2Fbedrock_rag/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/ClickHouse%2Fbedrock_rag/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/ClickHouse%2Fbedrock_rag/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/ClickHouse","do
wnload_url":"https://codeload.github.com/ClickHouse/bedrock_rag/tar.gz/refs/heads/main","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":252648448,"owners_count":21782391,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":[],"created_at":"2024-11-13T17:31:32.576Z","updated_at":"2025-05-06T08:30:46.503Z","avatar_url":"https://github.com/ClickHouse.png","language":"Python","funding_links":[],"categories":[],"sub_categories":[],"readme":"\n# ClickHouse and AWS Bedrock - A simple RAG pipeline\n\nFiles supporting blog post [Building a RAG pipeline for enhanced Google Analytics with ClickHouse and Amazon Bedrock](https://clickhouse.com/blog/retrieval-augmented-generation-rag-with-clickhouse-bedrock).\n\nThis simple RAG flow uses ClickHouse and Bedrock APIs to convert Google Analytics questions into SQL queries.\n\nFiles:\n\n- [embed.py](./embed.py) - Simple Python UDF to generate an embedding using the `amazon.titan-embed-text-v1` model. Uses client from [bedrock.py](./bedrock.py).\n- [bedrock_function.xml](./bedrock_function.xml) - ClickHouse config for above UDF.\n- [questions.sql](./questions.sql) - Example questions seeded for the RAG flow.\n- [question_to_sql.py](./question_to_sql.py) - RAG test script. Implements the RAG pipeline.\n- [ga.sql](./ga.sql) - Schemas for Google Analytics and site data. See [Enhancing Google Analytics Data with ClickHouse](https://clickhouse.com/blog/enhancing-google-analytics-data-with-clickhouse) for more details.\n- [spider](./spider) - Simple Scrapy spider to generate site data. 
Specific to clickhouse.com but can be adapted.\n\nDependencies:\n\n- Python 3.10+\n- ClickHouse instance.\n- Bedrock account with access to the `amazon.titan-embed-text-v1` and `anthropic.claude-v2` models.\n\nInstall Python dependencies:\n\n```bash\npip install -r requirements.txt\n```\n\n## Running RAG Flow\n\nAssumes ClickHouse is listening on port 8123 (non-SSL).\n\n```bash\nexport CLICKHOUSE_HOST=\nexport CLICKHOUSE_USERNAME=\nexport CLICKHOUSE_PASSWORD=\n\n# optional AWS role and region\nexport AWS_ROLE=\nexport AWS_REGION=\n\npython question_to_sql.py --question \"What are the number of returning users per day for the month of October for doc pages?\"\n----------------------------------------------------------------------------------------------------\nquestion: What are the number of returning users per day for the month of October for doc pages?\n\nSELECT\n    event_date,\n    uniqExact(user_pseudo_id) AS returning_users\nFROM ga_daily\nWHERE event_name = 'session_start'\n    AND page_location LIKE '%/docs/%'\n    AND event_date BETWEEN '2022-10-01' AND '2022-10-31'\n    AND (ga_session_number \u003e 1 OR user_first_touch_timestamp \u003c event_date)\nGROUP BY event_date\nORDER BY event_date\n```\n\n## Example Questions\n\nExample questions from the [blog post](https://clickhouse.com/blog/retrieval-augmented-generation-rag-with-clickhouse-bedrock).\n\n1. \"What are the number of returning users per day for the month of October for doc pages?\"\n1. \"What are the number of new users for blogs about dictionaries over time?\"\n1. \"What are the total sessions since January 2023 by month for pages where the url contains '/docs/en'?\"\n1. \"What are the total page views over time?\"\n1. \"How many active users have visited blogs about codecs and compression techniques?\"\n1. \"What are the total users over time?\"\n1. \"What are the total users over time for pages about materialized views?\"\n1. \"What is the source of traffic over time?\"\n1. 
\"What are the total website sessions for pages about Snowflake?\"\n1. \"What are the average number views per blog post over time?\"\n1. \"What is the average number of views for doc pages for each returning user per day?\"\n1. \"How many users who visited the blog with the title 'Supercharging your large ClickHouse data loads - Tuning a large data load for speed?' were new?\"\n1. \"For each day from September 2003 how many blog posts were published?\"\n1. \"What was the ratio of new to returning users in October 2023?\"\n\n\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fclickhouse%2Fbedrock_rag","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fclickhouse%2Fbedrock_rag","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fclickhouse%2Fbedrock_rag/lists"}