{"id":28998607,"url":"https://github.com/luminati-io/rag-chatbot","last_synced_at":"2026-04-08T20:51:45.340Z","repository":{"id":283784021,"uuid":"919956475","full_name":"luminati-io/rag-chatbot","owner":"luminati-io","description":"A Python-based RAG chatbot leveraging GPT-4o and Bright Data's SERP API to deliver contextually rich and up-to-date AI responses using real-time search engine data.","archived":false,"fork":false,"pushed_at":"2025-02-11T11:18:48.000Z","size":1156,"stargazers_count":1,"open_issues_count":0,"forks_count":0,"subscribers_count":0,"default_branch":"main","last_synced_at":"2025-07-23T13:12:41.448Z","etag":null,"topics":["ai","api","beautifulsoup4","bright-data","chatbot","chatbots","chatgpt","html2text","json","playwright","python","rag","serp","serp-api"],"latest_commit_sha":null,"homepage":"https://brightdata.com/blog/web-data/build-a-rag-chatbot","language":null,"has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":null,"status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/luminati-io.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":null,"code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2025-01-21T10:08:07.000Z","updated_at":"2025-07-19T18:55:10.000Z","dependencies_parsed_at":"2025-03-22T07:02:08.382Z","dependency_job_id":"40483416-628e-4424-8459-7a468f3c1664","html_url":"https://github.com/luminati-io/rag-chatbot","commit_stats":null,"previous_names":["luminati-io/rag-chatbot"],"tags_count":0,"template":false,"template_full_name":null,"purl":"pkg:github/luminati-io/rag-chatbot","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/luminati-io%2Frag-chatbot","tags_url":"https://repos.ecosyste.ms/api/v1/hosts
/GitHub/repositories/luminati-io%2Frag-chatbot/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/luminati-io%2Frag-chatbot/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/luminati-io%2Frag-chatbot/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/luminati-io","download_url":"https://codeload.github.com/luminati-io/rag-chatbot/tar.gz/refs/heads/main","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/luminati-io%2Frag-chatbot/sbom","scorecard":null,"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":286080680,"owners_count":31573788,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2026-04-08T14:31:17.711Z","status":"ssl_error","status_checked_at":"2026-04-08T14:31:17.202Z","response_time":54,"last_error":"SSL_connect returned=1 errno=0 peeraddr=140.82.121.5:443 state=error: unexpected eof while reading","robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":false,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["ai","api","beautifulsoup4","bright-data","chatbot","chatbots","chatgpt","html2text","json","playwright","python","rag","serp","serp-api"],"created_at":"2025-06-25T07:09:21.687Z","updated_at":"2026-04-08T20:51:45.328Z","avatar_url":"https://github.com/luminati-io.png","language":null,"readme":"# Creating a RAG Chatbot With GPT-4o Using SERP Data\n\n[![Promo](https://media.brightdata.com/2025/08/SERP-API-50-off-GitHub-banner_1389_166.png)](https://brightdata.com/) \n\nThis guide explains how 
to build a Python RAG chatbot using GPT-4o and Bright Data’s SERP API for more accurate, context-rich AI responses.\n\n1. [Introduction](#creating-a-rag-chatbot-with-gpt-4o-using-serp-data)\n2. [What Is RAG?](#what-is-rag)\n3. [Why Feed AI Models With SERP Data](#why-feed-ai-models-with-serp-data)\n4. [RAG With SERP Data With GPT Models Using Python: Step-By-Step Tutorial](#rag-with-serp-data-with-gpt-models-using-python-step-by-step-tutorial)\n    1. [Step #1: Initialize a Python Project](#step-1-initialize-a-python-project)\n    2. [Step #2: Install the Required Libraries](#step-2-install-the-required-libraries)\n    3. [Step #3: Prepare Your Project](#step-3-prepare-your-project)\n    4. [Step #4: Configure SERP API](#step-4-configure-serp-api)\n    5. [Step #5: Implement the SERP Scraping Logic](#step-5-implement-the-serp-scraping-logic)\n    6. [Step #6: Extract Text from the SERP URLs](#step-6-extract-text-from-the-serp-urls)\n    7. [Step #7: Generate the RAG Prompt](#step-7-generate-the-rag-prompt)\n    8. [Step #8: Perform the GPT Request](#step-8-perform-the-gpt-request)\n    9. [Step #9: Create the Application UI](#step-9-create-the-application-ui)\n    10. [Step #10: Put It All Together](#step-10-put-it-all-together)\n    11. [Step #11: Test the Application](#step-11-test-the-application)\n5. [Conclusion](#conclusion)\n\n## What Is RAG?\n\nRAG, short for [Retrieval-Augmented Generation](https://brightdata.com/blog/what-is-retrieval-augmented-generation/), is an AI approach that combines information retrieval with text generation. In a RAG workflow, the application first retrieves relevant data from external sources—such as documents, web pages, or databases. Then, it passes that data to the AI model so that it can generate more contextually relevant responses.\n\nRAG enhances large language models (LLMs) like GPT by enabling them to access and reference up-to-date information beyond their original training data. 
This approach is key in scenarios where precise and context-specific information is needed, as it improves both the quality and accuracy of AI-generated responses.\n\n## Why Feed AI Models With SERP Data\n\nThe knowledge cutoff date for GPT-4o is [October 2023](https://computercity.com/artificial-intelligence/knowledge-cutoff-dates-llms), meaning it lacks access to events or information that came out after that time. However, [GPT-4o models](https://openai.com/index/hello-gpt-4o/) can pull in data from the Internet in real-time using Bing search integration. That helps them offer more up-to-date information and responses that are detailed, precise, and contextually rich.\n\n## RAG With SERP Data With GPT Models Using Python: Step-By-Step Tutorial\n\nThis tutorial guides you through building a RAG chatbot using OpenAI’s GPT models. The idea is to gather text from the top-performing pages on Google for a specific search query and use it as the context for a GPT request.\n\nThe biggest challenge is scraping SERP data. Most search engines come with advanced anti-bot solutions to prevent automated access to their pages. For detailed guidance, refer to our guide on [how to scrape Google in Python](https://brightdata.com/blog/web-data/scraping-google-with-python).\n\nTo simplify the scraping process, we will use [Bright Data’s SERP API](https://brightdata.com/products/serp-api).\n\nThis SERP scraper allows you to easily retrieve SERPs from Google, DuckDuckGo, Bing, Yandex, Baidu, and other search engines using simple HTTP requests.\n\nWe will then extract text data from the returned URLs using a [headless browser](https://brightdata.com/blog/web-data/best-headless-browsers). Then, we will use that information as the context for the GPT model in a RAG workflow. 
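\n\nAt a high level, the workflow boils down to two steps: retrieve context, then fold it into the prompt. Here is a minimal, self-contained sketch (the function names `retrieve_context` and `build_rag_prompt` are illustrative, not part of the project; the real app replaces the stubbed retrieval with SERP API scraping):

```python
def retrieve_context(query):
    # stub: in the full app, this returns text scraped
    # from the top Google results for `query`
    return ["Transformers One is a 2024 animated film."]

def build_rag_prompt(request, text_context):
    # join the retrieved snippets and prepend them to the user's request
    context_string = "\n\n--------\n\n".join(text_context)
    return (
        "Answer the request using only the context below.\n\n"
        f"Context:\n{context_string}\n\nRequest: {request}"
    )

prompt = build_rag_prompt(
    "When was Transformers One released?",
    retrieve_context("Transformers One"),
)
print(prompt)
```

The model never sees anything but this single prompt string, which is what makes the approach portable across LLMs.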
If you instead want to retrieve online data directly using AI, read our article on [web scraping with ChatGPT](https://brightdata.com/blog/web-data/web-scraping-with-chatgpt).\n\nAll the code in this guide is also available in a GitHub repository:\n\n```bash\ngit clone https://github.com/Tonel/rag_gpt_serp_scraping\n```\n\nFollow the instructions in the README.md file to install the project’s dependencies and launch the project.\n\nKeep in mind that the approach presented in this blog post can easily be adapted to any other search engine or LLM.\n\n\u003e **Note**:\\\n\u003e This guide refers to Unix and macOS. If you are a Windows user, you can still follow the tutorial by using the [Windows Subsystem for Linux (WSL)](https://learn.microsoft.com/en-us/windows/wsl/install).\n\n### Step #1: Initialize a Python Project\n\nMake sure you have Python 3 installed on your machine. Otherwise, [download and install it](https://www.python.org/downloads/).\n\nCreate a folder for your project and switch to it in the terminal:\n\n```bash\nmkdir rag_gpt_serp_scraping\n\ncd rag_gpt_serp_scraping\n```\n\nThe `rag_gpt_serp_scraping` folder will contain your Python RAG project.\n\nThen, load the project directory in your favorite Python IDE. [PyCharm Community Edition](https://www.jetbrains.com/pycharm/download/) or [Visual Studio Code with the Python extension](https://code.visualstudio.com/docs/languages/python) will do.\n\nInside rag\\_gpt\\_serp\\_scraping, add an empty app.py file. 
This will contain your scraping and RAG logic.\n\nNext, initialize a [Python virtual environment](https://docs.python.org/3/library/venv.html) in the project directory:\n\n```bash\npython3 -m venv env\n```\n\nActivate the virtual environment with the command below:\n\n```bash\nsource ./env/bin/activate\n```\n\n### Step #2: Install the Required Libraries\n\nThis Python RAG project will be using the following dependencies:\n\n*   [`python-dotenv`](https://pypi.org/project/python-dotenv/): It will be used to securely manage sensitive credentials, such as Bright Data credentials and OpenAI API keys.\n*   [`requests`](https://pypi.org/project/requests/): To perform HTTP requests to Bright Data’s SERP API.\n*   [`langchain-community`](https://pypi.org/project/langchain-community/): It will be used for retrieving text from the Google SERP pages and cleaning it to generate relevant content for RAG.\n*   [`openai`](https://pypi.org/project/openai/): It will be employed to interface with GPT models to generate natural language responses based on the given inputs and RAG context.\n*   [`streamlit`](https://pypi.org/project/streamlit/): It will come in handy for creating a UI where users can input their Google search queries and AI prompt, and view the results dynamically.\n\nInstall all the dependencies:\n\n```bash\npip install python-dotenv requests langchain-community openai streamlit\n```\n\nWe will use [AsyncChromiumLoader](https://python.langchain.com/docs/integrations/document_loaders/async_chromium/) from langchain-community, which requires the following dependencies:\n\n```bash\npip install --upgrade --quiet playwright beautifulsoup4 html2text\n```\n\nTo function properly, Playwright also requires you to install the browsers:\n\n```bash\nplaywright install\n```\n\n### Step #3: Prepare Your Project\n\nIn `app.py`, add the following imports:\n\n```python\nfrom dotenv import load_dotenv\n\nimport os\n\nimport requests\n\nfrom langchain_community.document_loaders import 
AsyncChromiumLoader\n\nfrom langchain_community.document_transformers import BeautifulSoupTransformer\n\nfrom openai import OpenAI\n\nimport streamlit as st\n```\n\nThen, create a `.env` file in your project folder to store all your credentials. Your project structure will now look like this:\n\n![Project structure](https://github.com/luminati-io/rag-chatbot/blob/main/Images/image-19.png)\n\nCall the function below in `app.py` to instruct `python-dotenv` to load the environment variables from `.env`:\n\n```python\nload_dotenv()\n```\n\nYou can now import environment variables from `.env` or the system with:\n\n```python\nos.environ.get(\"\u003cENV_NAME\u003e\")\n```\n\n### Step #4: Configure SERP API\n\nWe will use Bright Data’s SERP API to retrieve content from search engine results pages and use that in our Python RAG workflow. Specifically, we will extract text from the URLs of the web pages returned by the SERP API.\n\nTo set up SERP API, refer to the [official documentation](https://docs.brightdata.com/scraping-automation/serp-api/quickstart). Alternatively, follow the instructions below.\n\nIf you have not already created an account, [sign up for Bright Data](https://brightdata.com). 
Once logged in, navigate to your account dashboard:\n\n![Account main dashboard](https://github.com/luminati-io/rag-chatbot/blob/main/Images/image-18.png)\n\nThere, click the “Get proxy products” button.\n\nThat will bring you to the page below, where you have to click on the “SERP API” row:\n\n![Clicking on SERP API](https://github.com/luminati-io/rag-chatbot/blob/main/Images/image-17.png)\n\nOn the SERP API product page, toggle “Activate zone” to enable the product:\n\n![Activating the SERP zone](https://github.com/luminati-io/rag-chatbot/blob/main/Images/image-16.png)\n\nNow, copy the SERP API host, port, username, and password in the “Access parameters” section and add them to your `.env` file:\n\n```python\nBRIGHT_DATA_SERP_API_HOST=\"\u003cYOUR_HOST\u003e\"\n\nBRIGHT_DATA_SERP_API_PORT=\u003cYOUR_PORT\u003e\n\nBRIGHT_DATA_SERP_API_USERNAME=\"\u003cYOUR_USERNAME\u003e\"\n\nBRIGHT_DATA_SERP_API_PASSWORD=\"\u003cYOUR_PASSWORD\u003e\"\n```\n\nReplace the `\u003cYOUR_XXXX\u003e` placeholders with the values provided by Bright Data on the SERP API page.\n\nNote that the host in “Access parameters” has a format like this:\n\n```python\nbrd.superproxy.io:33335\n```\n\nSplit it as below:\n\n```python\nBRIGHT_DATA_SERP_API_HOST=\"brd.superproxy.io\"\n\nBRIGHT_DATA_SERP_API_PORT=33335\n```\n\n### Step #5: Implement the SERP Scraping Logic\n\nIn `app.py`, add the following function to retrieve the first `number_of_urls` URLs from a Google SERP page:\n\n```python\ndef get_google_serp_urls(query, number_of_urls=5):\n    # perform a Bright Data's SERP API request\n    # with JSON autoparsing\n    host = os.environ.get(\"BRIGHT_DATA_SERP_API_HOST\")\n    port = os.environ.get(\"BRIGHT_DATA_SERP_API_PORT\")\n    username = os.environ.get(\"BRIGHT_DATA_SERP_API_USERNAME\")\n    password = os.environ.get(\"BRIGHT_DATA_SERP_API_PASSWORD\")\n    proxy_url = f\"http://{username}:{password}@{host}:{port}\"\n    proxies = {\"http\": proxy_url, \"https\": proxy_url}\n    url = f\"https://www.google.com/search?q={query}\u0026brd_json=1\"\n    response = requests.get(url, proxies=proxies, verify=False)\n    # retrieve the parsed JSON response\n    response_data = response.json()\n    # extract a \"number_of_urls\" number of\n    # Google SERP URLs from the response\n    google_serp_urls = []\n    if \"organic\" in response_data:\n        for item in response_data[\"organic\"]:\n            if \"link\" in item:\n                google_serp_urls.append(item[\"link\"])\n    return google_serp_urls[:number_of_urls]\n```\n\nThis makes an HTTP GET request to SERP API with the search query specified in the `query` argument. The [`brd_json=1`](https://docs.brightdata.com/scraping-automation/serp-api/parsing-search-results) query parameter ensures that SERP API parses the results into JSON for you, in the format below:\n\n```json\n{\n\n\"general\": {\n\n\"search_engine\": \"google\",\n\n\"results_cnt\": 1980000000,\n\n\"search_time\": 0.57,\n\n\"language\": \"en\",\n\n\"mobile\": false,\n\n\"basic_view\": false,\n\n\"search_type\": \"text\",\n\n\"page_title\": \"pizza - Google Search\",\n\n\"code_version\": \"1.90\",\n\n\"timestamp\": \"2023-06-30T08:58:41.786Z\"\n\n},\n\n\"input\": {\n\n\"original_url\": \"https://www.google.com/search?q=pizza\u0026brd_json=1\",\n\n\"user_agent\": \"Mozilla/5.0 (Macintosh; Intel Mac OS X 10_12) AppleWebKit/608.2.11 (KHTML, like Gecko) Version/13.0.3 Safari/608.2.11\",\n\n\"request_id\": \"hl_1a1be908_i00lwqqxt1\"\n\n},\n\n\"organic\": [\n\n{\n\n\"link\": \"https://www.pizzahut.com/\",\n\n\"display_link\": \"https://www.pizzahut.com\",\n\n\"title\": \"Pizza Hut | Delivery \u0026 Carryout - No One OutPizzas The Hut!\",\n\n\"image\": \"omitted for brevity...\",\n\n\"image_alt\": \"pizza from www.pizzahut.com\",\n\n\"image_base64\": \"omitted for brevity...\",\n\n\"rank\": 1,\n\n\"global_rank\": 1\n\n},\n\n{\n\n\"link\": \"https://www.dominos.com/en/\",\n\n\"display_link\": \"https://www.dominos.com › ...\",\n\n\"title\": \"Domino's: Pizza Delivery \u0026 Carryout, 
Pasta, Chicken \u0026 More\",\n\n\"description\": \"Order pizza, pasta, sandwiches \u0026 more online for carryout or delivery from Domino's. View menu, find locations, track orders. Sign up for Domino's email ...\",\n\n\"image\": \"omitted for brevity...\",\n\n\"image_alt\": \"pizza from www.dominos.com\",\n\n\"image_base64\": \"omitted for brevity...\",\n\n\"rank\": 2,\n\n\"global_rank\": 3\n\n},\n\n// omitted for brevity...\n\n],\n\n// omitted for brevity...\n\n}\n```\n\nThe last few lines of the function retrieve each SERP URL from the resulting JSON data, select only the first `number_of_urls` URLs, and return them in a list.\n\n### Step #6: Extract Text from the SERP URLs\n\nDefine a function that extracts text from each of the SERP URLs:\n\n```python\n# Note: some websites may have dynamic content or anti-scraping measures that could prevent text extraction.\n# In such cases, consider using additional tools like Selenium.\ndef extract_text_from_urls(urls, number_of_words=600):\n    # instruct a headless Chrome instance to visit the provided URLs\n    # with the specified user-agent\n    loader = AsyncChromiumLoader(\n        urls,\n        user_agent=\"Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/129.0.0.0 Safari/537.36\",\n    )\n    html_documents = loader.load()\n    # process the extracted HTML documents to extract text from them\n    bs_transformer = BeautifulSoupTransformer()\n    docs_transformed = bs_transformer.transform_documents(\n        html_documents,\n        tags_to_extract=[\"p\", \"em\", \"li\", \"strong\", \"h1\", \"h2\"],\n        unwanted_tags=[\"a\"],\n        remove_comments=True,\n    )\n    # make sure each HTML text document contains at most\n    # number_of_words words\n    extracted_text_list = []\n    for doc_transformed in docs_transformed:\n        # split the text into words and join the first number_of_words\n        words = doc_transformed.page_content.split()[:number_of_words]\n        extracted_text = \" \".join(words)\n        # ignore empty text documents\n        if len(extracted_text) != 0:\n            extracted_text_list.append(extracted_text)\n    return extracted_text_list\n```\n\nThis function:\n\n1.  Loads web pages from the URLs passed as an argument using a headless Chrome browser instance.\n2.  Utilizes [BeautifulSoupTransformer](https://python.langchain.com/v0.2/api_reference/community/document_transformers/langchain_community.document_transformers.beautiful_soup_transformer.BeautifulSoupTransformer.html) to process the HTML of each page and extract text from specific tags (like `\u003cp\u003e`, `\u003ch1\u003e`, `\u003cstrong\u003e`, etc.), omitting unwanted tags (like `\u003ca\u003e`) and comments.\n3.  Limits the extracted text for each webpage to a number of words specified by the `number_of_words` argument.\n4.  Returns a list of the extracted text from each URL.\n\nWhile the `[\"p\", \"em\", \"li\", \"strong\", \"h1\", \"h2\"]` tags are enough to extract text from most web pages, in some specific scenarios, you may need to customize this list of HTML tags. Also, you might have to increase or decrease the target number of words for each text item.\n\nFor example, consider the [web page below](https://athomeinhollywood.com/2024/09/19/transformers-one-review/):\n\n![Transformers one review page](https://github.com/luminati-io/rag-chatbot/blob/main/Images/image-15.png)\n\nApplying that function to that page will result in this text array:\n\n```python\n[\"Lisa Johnson Mandell’s Transformers One review reveals the heretofore inconceivable: It’s one of the best animated films of the year! I never thought I’d see myself write this about a Transformers movie, but Transformers One is actually an exceptional film! 
...\"]\n```\n\nThe list of text items returned by `extract_text_from_urls()` represents the RAG context to feed to the OpenAI model.\n\n### Step #7: Generate the RAG Prompt\n\nDefine a function that transforms the AI prompt request and text context into the final RAG prompt:\n\n```python\ndef get_openai_prompt(request, text_context=[]):\n    # default prompt\n    prompt = request\n    # add the context to the prompt, if present\n    if len(text_context) != 0:\n        context_string = \"\\n\\n--------\\n\\n\".join(text_context)\n        prompt = f\"Answer the request using only the context below.\\n\\nContext:\\n{context_string}\\n\\nRequest: {request}\"\n    return prompt\n```\n\nPrompts returned by the previous function when a RAG context is specified have this format:\n\n```\nAnswer the request using only the context below.\n\nContext:\n\nBla bla bla...\n\n--------\n\nBla bla bla...\n\n--------\n\nBla bla bla...\n\nRequest: \u003cYOUR_REQUEST\u003e\n```\n\n### Step #8: Perform the GPT Request\n\nFirst, initialize the OpenAI client at the top of the `app.py` file:\n\n```python\nopenai_client = OpenAI(api_key=os.environ.get(\"OPENAI_API_KEY\"))\n```\n\nThis relies on the `OPENAI_API_KEY` environment variable, which you can define directly in your system’s environment or in the `.env` file:\n\n`OPENAI_API_KEY=\"\u003cYOUR_API_KEY\u003e\"`\n\nReplace `\u003cYOUR_API_KEY\u003e` with the value of your [OpenAI API key](https://platform.openai.com/api-keys). 
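\n\nAs an optional safeguard (not part of the original project), you can fail fast when a required variable is missing instead of sending requests with an empty key; `require_env` below is a hypothetical helper:

```python
import os

def require_env(name):
    # hypothetical helper: return the variable's value,
    # or fail fast with a clear message if it was not loaded
    value = os.environ.get(name)
    if not value:
        raise RuntimeError(f"{name} is not set; check your .env file")
    return value

# usage sketch: openai_client = OpenAI(api_key=require_env("OPENAI_API_KEY"))
```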
If you do not know how to get one, follow the [official guide](https://platform.openai.com/docs/quickstart).\n\nNext, write a function that uses the official OpenAI client to perform a request to the [GPT-4o mini](https://openai.com/index/gpt-4o-mini-advancing-cost-efficient-intelligence/) AI model:\n\n```python\ndef interrogate_openai(prompt, max_tokens=800):\n    # interrogate the OpenAI model with the given prompt\n    response = openai_client.chat.completions.create(\n        model=\"gpt-4o-mini\",\n        messages=[{\"role\": \"user\", \"content\": prompt}],\n        max_tokens=max_tokens,\n    )\n    return response.choices[0].message.content\n```\n\n\u003e **Note**:\\\n\u003e You can configure any other GPT model supported by the OpenAI API.\n\nIf called with a prompt returned by `get_openai_prompt()` that includes a specified text context, `interrogate_openai()` will perform retrieval-augmented generation as intended.\n\n### Step #9: Create the Application UI\n\nUse Streamlit to define a simple [form UI](https://docs.streamlit.io/develop/concepts/architecture/forms) where users can specify:\n\n1.  The Google search query to pass to the SERP API\n2.  
The AI prompt to send to GPT-4o mini\n\nTo do that, use this code:\n\n```python\nwith st.form(\"prompt_form\"):\n    # initialize the output results\n    result = \"\"\n    final_prompt = \"\"\n    # textarea for user to input their Google search query\n    google_search_query = st.text_area(\"Google Search:\", None)\n    # textarea for user to input their AI prompt\n    request = st.text_area(\"AI Prompt:\", None)\n    # button to submit the form\n    submitted = st.form_submit_button(\"Send\")\n    # if the form is submitted\n    if submitted:\n        # retrieve the Google SERP URLs from the given search query\n        google_serp_urls = get_google_serp_urls(google_search_query)\n        # extract the text from the respective HTML pages\n        extracted_text_list = extract_text_from_urls(google_serp_urls)\n        # generate the AI prompt using the extracted text as context\n        final_prompt = get_openai_prompt(request, extracted_text_list)\n        # interrogate an OpenAI model with the generated prompt\n        result = interrogate_openai(final_prompt)\n        # dropdown containing the generated prompt\n        final_prompt_expander = st.expander(\"AI Final Prompt:\")\n        final_prompt_expander.write(final_prompt)\n        # write the result from the OpenAI model\n        st.write(result)\n```\n\nThe Python RAG script is ready.\n\n### Step #10: Put It All Together\n\nYour `app.py` file should contain the following code:\n\n```python\nfrom dotenv import load_dotenv\nimport os\nimport requests\nfrom langchain_community.document_loaders import AsyncChromiumLoader\nfrom langchain_community.document_transformers import BeautifulSoupTransformer\nfrom openai import OpenAI\nimport streamlit as st\n\n# load the environment variables from the .env file\nload_dotenv()\n\n# initialize the OpenAI API client with your API key\nopenai_client = OpenAI(api_key=os.environ.get(\"OPENAI_API_KEY\"))\n\ndef get_google_serp_urls(query, number_of_urls=5):\n    # perform a Bright Data's SERP API request\n    # with JSON autoparsing\n    host = os.environ.get(\"BRIGHT_DATA_SERP_API_HOST\")\n    port = os.environ.get(\"BRIGHT_DATA_SERP_API_PORT\")\n    username = os.environ.get(\"BRIGHT_DATA_SERP_API_USERNAME\")\n    password = os.environ.get(\"BRIGHT_DATA_SERP_API_PASSWORD\")\n    proxy_url = f\"http://{username}:{password}@{host}:{port}\"\n    proxies = {\"http\": proxy_url, \"https\": proxy_url}\n    url = f\"https://www.google.com/search?q={query}\u0026brd_json=1\"\n    response = requests.get(url, proxies=proxies, verify=False)\n    # retrieve the parsed JSON response\n    response_data = response.json()\n    # extract a \"number_of_urls\" number of\n    # Google SERP URLs from the response\n    google_serp_urls = []\n    if \"organic\" in response_data:\n        for item in response_data[\"organic\"]:\n            if \"link\" in item:\n                google_serp_urls.append(item[\"link\"])\n    return google_serp_urls[:number_of_urls]\n\ndef extract_text_from_urls(urls, number_of_words=600):\n    # instruct a headless Chrome instance to visit the provided URLs\n    # with the specified user-agent\n    loader = AsyncChromiumLoader(\n        urls,\n        user_agent=\"Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/129.0.0.0 Safari/537.36\",\n    )\n    html_documents = loader.load()\n    # process the extracted HTML documents to extract text from them\n    bs_transformer = BeautifulSoupTransformer()\n    docs_transformed = bs_transformer.transform_documents(\n        html_documents,\n        tags_to_extract=[\"p\", \"em\", \"li\", \"strong\", \"h1\", \"h2\"],\n        unwanted_tags=[\"a\"],\n        remove_comments=True,\n    )\n    # make sure each HTML text document contains at most\n    # number_of_words words\n    extracted_text_list = []\n    for doc_transformed in docs_transformed:\n        # split the text into words and join the first number_of_words\n        words = doc_transformed.page_content.split()[:number_of_words]\n        extracted_text = \" \".join(words)\n        # ignore empty text documents\n        if len(extracted_text) != 0:\n            extracted_text_list.append(extracted_text)\n    return extracted_text_list\n\ndef get_openai_prompt(request, text_context=[]):\n    # default prompt\n    prompt = request\n    # add the context to the prompt, if present\n    if len(text_context) != 0:\n        context_string = \"\\n\\n--------\\n\\n\".join(text_context)\n        prompt = f\"Answer the request using only the context below.\\n\\nContext:\\n{context_string}\\n\\nRequest: {request}\"\n    return prompt\n\ndef interrogate_openai(prompt, max_tokens=800):\n    # interrogate the OpenAI model with the given prompt\n    response = openai_client.chat.completions.create(\n        model=\"gpt-4o-mini\",\n        messages=[{\"role\": \"user\", \"content\": prompt}],\n        max_tokens=max_tokens,\n    )\n    return response.choices[0].message.content\n\n# create a form in the Streamlit app for user input\nwith st.form(\"prompt_form\"):\n    # initialize the output results\n    result = \"\"\n    final_prompt = \"\"\n    # textarea for user to input their Google search query\n    google_search_query = st.text_area(\"Google Search:\", None)\n    # textarea for user to input their AI prompt\n    request = st.text_area(\"AI Prompt:\", None)\n    # button to submit the form\n    submitted = st.form_submit_button(\"Send\")\n    # if the form is submitted\n    if submitted:\n        # retrieve the Google SERP URLs from the given search query\n        google_serp_urls = get_google_serp_urls(google_search_query)\n        # extract the text from the respective HTML pages\n        extracted_text_list = extract_text_from_urls(google_serp_urls)\n        # generate the AI prompt using the extracted text as context\n        final_prompt = get_openai_prompt(request, extracted_text_list)\n        # interrogate an OpenAI model with the generated prompt\n        result = interrogate_openai(final_prompt)\n        # dropdown containing the generated prompt\n        final_prompt_expander = st.expander(\"AI Final Prompt\")\n        final_prompt_expander.write(final_prompt)\n        # write the result from the OpenAI model\n        st.write(result)\n```\n\n### Step #11: 
Test the Application\n\nLaunch your Python RAG application with:\n\n```bash\nstreamlit run app.py\n```\n\n\u003e **Note**:\\\n\u003e Streamlit is designed for lightweight applications. For production-grade deployments, consider frameworks like Flask or FastAPI.\n\nIn the terminal, you should see the following output:\n\n```\nYou can now view your Streamlit app in your browser.\n\nLocal URL: http://localhost:8501\n\nNetwork URL: http://172.27.134.248:8501\n```\n\nFollow the instructions and visit `http://localhost:8501` in the browser. Below is what you should see:\n\n![Streamlit app screenshot](https://github.com/luminati-io/rag-chatbot/blob/main/Images/image-14.png)\n\nTest the application by using a Google search query like the one below:\n\n```\nTransformers One review\n```\n\nAnd an AI prompt as follows:\n\n```\nWrite a review for the movie Transformers One\n```\n\nClick “Send” and wait while your application processes the request. After a few seconds, you should get a result like this:\n\n![App result screenshot](https://github.com/luminati-io/rag-chatbot/blob/main/Images/image-13.png)\n\nIf you expand the “AI Final Prompt” dropdown, you will see the complete prompt used by the application for RAG.\n\n## Conclusion\n\nThe major challenge in building a Python RAG chatbot is scraping search engines like Google:\n\n1. They frequently alter the structure of their SERP pages.\n2. They are protected by some of the most sophisticated anti-bot measures available.\n3. Retrieving large volumes of SERP data concurrently is complex and can be expensive.\n\n[Bright Data’s SERP API](https://brightdata.com/products/serp-api) helps you retrieve real-time SERP data from all major search engines with no effort. It also supports RAG and many other applications. 
Get your free trial now!\n","funding_links":[],"categories":[],"sub_categories":[],"project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fluminati-io%2Frag-chatbot","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fluminati-io%2Frag-chatbot","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fluminati-io%2Frag-chatbot/lists"}