{"id":26502790,"url":"https://github.com/tejas-130704/webscraperai","last_synced_at":"2025-07-08T04:04:34.767Z","repository":{"id":279019264,"uuid":"937496077","full_name":"tejas-130704/WebScraperAI","owner":"tejas-130704","description":"WebScraperAI is a powerful tool that enables users to perform question-answering on website content using web scraping and retrieval-augmented generation (RAG) with LlamaIndex. It supports multiple LLMs, including OpenAI GPT-3.5, GPT-4, Gemini Pro, Gemini Ultra, and DeepSeek.","archived":false,"fork":false,"pushed_at":"2025-02-25T04:21:31.000Z","size":8,"stargazers_count":0,"open_issues_count":0,"forks_count":0,"subscribers_count":1,"default_branch":"main","last_synced_at":"2025-03-20T18:46:15.693Z","etag":null,"topics":["ai","llms","open-source","python","rag-pipeline","streamlit","web-scraping","web-scraping-ai"],"latest_commit_sha":null,"homepage":"","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":null,"status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/tejas-130704.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":null,"code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2025-02-23T07:35:10.000Z","updated_at":"2025-02-25T04:21:35.000Z","dependencies_parsed_at":"2025-02-23T08:36:58.492Z","dependency_job_id":null,"html_url":"https://github.com/tejas-130704/WebScraperAI","commit_stats":null,"previous_names":["tejas-130704/webscraperai"],"tags_count":0,"template":false,"template_full_name":null,"purl":"pkg:github/tejas-130704/WebScraperAI","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/tejas-130704%2FWebScraperAI","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHu
b/repositories/tejas-130704%2FWebScraperAI/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/tejas-130704%2FWebScraperAI/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/tejas-130704%2FWebScraperAI/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/tejas-130704","download_url":"https://codeload.github.com/tejas-130704/WebScraperAI/tar.gz/refs/heads/main","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/tejas-130704%2FWebScraperAI/sbom","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":264192232,"owners_count":23570737,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["ai","llms","open-source","python","rag-pipeline","streamlit","web-scraping","web-scraping-ai"],"created_at":"2025-03-20T18:35:04.447Z","updated_at":"2025-07-08T04:04:34.743Z","avatar_url":"https://github.com/tejas-130704.png","language":"Python","funding_links":[],"categories":[],"sub_categories":[],"readme":"# WebScraperAI\n\n## Overview\nWebScraperAI is a web-based tool that allows users to perform question-answering on a given website URL. 
It supports multiple LLMs and has two modes of operation:\n\n1. **Page-Specific Q\u0026A**: Extracts information only from the given webpage.\n2. **Deep Analysis Q\u0026A**: Extracts information from the given page and all its linked pages (use cautiously, as it may take a long time for large websites).\n\nThe project uses **BeautifulSoup** for web scraping and a **RAG pipeline in LlamaIndex and HuggingFace** to enhance response accuracy. Supported LLMs include:\n\n- OpenAI GPT-3.5\n- OpenAI GPT-4\n- Gemini Pro\n- Gemini Ultra\n- DeepSeek\n- Groq\n\n## Preview\n![Screenshot 2025-02-23 130842](https://github.com/user-attachments/assets/23cbc98f-d875-43db-9d2f-9601d7e94ebb)\n\n\n![Screenshot 2025-02-23 132439](https://github.com/user-attachments/assets/cbd001ba-8ef5-4b04-9aaf-ddd67b875199)\n\n\n![Screenshot 2025-02-23 134013](https://github.com/user-attachments/assets/e6ea68dc-1b6b-4775-8510-6c65ec8b6c1f)\n\n\n![Screenshot 2025-02-25 092021](https://github.com/user-attachments/assets/7fdb742c-36f3-4cbc-a231-6571876edd31)\n\n\n![Screenshot 2025-02-25 092727](https://github.com/user-attachments/assets/f18e034f-77fe-4537-a535-fb3e9ea66526)\n\n\n![Screenshot 2025-02-25 093013](https://github.com/user-attachments/assets/1595c817-6e5f-49b6-962d-2273593c87bc)\n\n## Features\n- Extracts and analyzes website content for Q\u0026A.\n- Offers two modes: specific page analysis and deep analysis.\n- Supports multiple LLMs for flexibility.\n- Built with Streamlit for an interactive UI.\n\n## Installation\nFollow these steps to set up the project on your local machine:\n\n### 1. Clone the Repository\n```sh\ngit clone https://github.com/tejas-130704/WebScraperAI.git\ncd WebScraperAI\n```\n\n### 2. Create a Virtual Environment\n```sh\npython -m venv venv\n```\n\n### 3. Activate the Virtual Environment\n- **Windows:**\n  ```sh\n  venv\\Scripts\\activate\n  ```\n- **Mac/Linux:**\n  ```sh\n  source venv/bin/activate\n  ```\n\n### 4. 
Install Dependencies\n```sh\npip install -r requirements.txt\n```\n\n## Usage\n### 1. Run the Streamlit App\n```sh\nstreamlit run app.py\n```\n\n### 2. Enter Details\n- **Select Model**: Choose an LLM for processing.\n- **Enter API Key**: Provide the API key for the selected LLM.\n- **Enter Website URL**: Input the URL to analyze.\n- **Choose Deep Analysis (Optional)**: Check this box if you want to analyze linked pages.\n\n### 3. Click **Load Website \u0026 LLM** to start the process.\n- After processing, enter a question related to the webpage and click **Ask Question**.\n\n## Caution ⚠️\n- **Use Deep Analysis Only for Limited Scope Websites**: Avoid using it on large websites like Wikipedia, as the high number of linked pages may cause extreme delays or failures.\n- **Respect Website Policies**: Some sites may have anti-scraping policies. Always ensure compliance.\n- **API Limits**: LLM responses are subject to API limits and costs depending on the provider.\n\n## Future Enhancements\n- Implement caching to improve deep analysis speed.\n- Add support for multi-threaded scraping.\n- Introduce a ranking system for LLM performance comparison.\n\n## Contributing\nPull requests are welcome! If you find any issues, feel free to open an issue in the repository.\n\n\n\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Ftejas-130704%2Fwebscraperai","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Ftejas-130704%2Fwebscraperai","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Ftejas-130704%2Fwebscraperai/lists"}