{"id":21410660,"url":"https://github.com/srikarveluvali/scraperwizard","last_synced_at":"2026-01-03T10:16:36.392Z","repository":{"id":263004670,"uuid":"887928558","full_name":"SrikarVeluvali/ScraperWizard","owner":"SrikarVeluvali","description":"ScraperWizard is a full-stack application that automates web data extraction using custom search prompts and AI-powered processing. Users can upload datasets, define dynamic queries, and retrieve structured information seamlessly through an intuitive dashboard. Built with Flask, React, and integrations like ScraperAPI and Groq's LLM,.","archived":false,"fork":false,"pushed_at":"2024-11-16T15:26:25.000Z","size":217,"stargazers_count":1,"open_issues_count":0,"forks_count":0,"subscribers_count":1,"default_branch":"main","last_synced_at":"2025-01-23T04:41:53.187Z","etag":null,"topics":[],"latest_commit_sha":null,"homepage":"","language":"TypeScript","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":null,"status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/SrikarVeluvali.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":null,"code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2024-11-13T14:22:59.000Z","updated_at":"2024-11-26T12:34:31.000Z","dependencies_parsed_at":"2024-11-15T15:20:22.687Z","dependency_job_id":"8b37c857-ab3a-475a-bf62-b7d611ab05a8","html_url":"https://github.com/SrikarVeluvali/ScraperWizard","commit_stats":null,"previous_names":["srikarveluvali/scraperwizard"],"tags_count":0,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/SrikarVeluvali%2FScraperWizard","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/Srikar
Veluvali%2FScraperWizard/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/SrikarVeluvali%2FScraperWizard/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/SrikarVeluvali%2FScraperWizard/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/SrikarVeluvali","download_url":"https://codeload.github.com/SrikarVeluvali/ScraperWizard/tar.gz/refs/heads/main","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":243910814,"owners_count":20367545,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":[],"created_at":"2024-11-22T17:40:59.134Z","updated_at":"2026-01-03T10:16:36.347Z","avatar_url":"https://github.com/SrikarVeluvali.png","language":"TypeScript","funding_links":[],"categories":[],"sub_categories":[],"readme":"# ScraperWizard\n\nScraperWizard is an AI-powered application designed to automate information retrieval from the web based on user-defined prompts. This tool allows users to upload datasets, define search queries dynamically, and extract relevant information using advanced LLM capabilities. 
The extracted data can be displayed in a user-friendly dashboard and downloaded as structured files.\n\n## Loom Video\nThis video walks through a demo and covers a few other important points.\n[Loom Video](https://youtu.be/4v641dp8FMQ?si=lSfV-Ic-AkJPrIHE)\n\n## Key Features\n\n- **File Upload \u0026 Google Sheets Integration**:\n  - Upload CSV files or connect Google Sheets for data input.\n  - Select a primary column (e.g., company names) for the search query.\n  - Preview uploaded data within the dashboard.\n\n- **Dynamic Prompt Input**:\n  - Define custom search prompts using placeholders like `{entity}`.\n  - The placeholder is replaced with each entity from the selected column.\n\n- **Automated Web Search**:\n  - Perform searches using ScraperAPI or similar services.\n  - Handle rate limits and API constraints effectively.\n  - Collect and store search results (e.g., URLs, snippets).\n\n- **LLM Integration for Data Parsing**:\n  - Use Groq’s LLM or OpenAI’s GPT API to extract precise information from search results.\n  - Customize backend prompts for detailed extraction.\n\n- **Data Display \u0026 Download**:\n  - Visualize extracted data in a structured table format.\n  - Download results as CSV files or update the connected Google Sheet.\n\n## Setup Instructions\n\n### Prerequisites\n\n- Python 3.8+\n- API keys for ScraperAPI (or equivalent) and Groq, plus a Google Cloud OAuth client ID and a Google Cloud API key.\n- A Google Cloud account with access to the Google Sheets API.\n\n## Project Structure\n\n```\nAI Based Webscraper\n├── backend\n│   ├── results\n│   │   └── result_input.csv\n│   ├── uploads\n│   │   └── input.csv\n│   ├── .env               # Backend environment variables\n│   ├── .gitignore\n│   ├── app.py             # Backend server code\n│   ├── requirements.txt   # Python dependencies\n│   ├── Test.csv\n├── frontend\n│   ├── public\n│   │   ├── favicon.svg\n│   │   ├── index.html\n│   │   ├── logo192.png\n│   │   ├── logo512.png\n│   │   ├── manifest.json\n│   │   ├── 
robots.txt\n│   ├── src\n│   │   ├── components\n│   │   │   └── CSVProcessor.tsx  # Main data processor component\n│   │   ├── App.css\n│   │   ├── App.js\n│   │   ├── App.test.js\n│   │   ├── index.css\n│   │   ├── index.js\n│   │   ├── logo.svg\n│   │   ├── reportWebVitals.js\n│   │   ├── setupTests.js\n│   ├── .env                # Frontend environment variables\n│   ├── .gitignore\n│   ├── package-lock.json\n│   ├── package.json\n│   ├── postcss.config.js\n│   ├── README.md\n│   ├── tailwind.config.js\n├── README.md               # Main project readme\n```\n\n### Installation\n\n1. Navigate to the backend directory:\n   ```bash\n   cd backend\n   ```\n\n2. Create a virtual environment:\n   ```bash\n   python -m venv venv\n   source venv/bin/activate  # On Windows: venv\\Scripts\\activate\n   ```\n\n3. Install dependencies:\n   ```bash\n   pip install -r requirements.txt\n   ```\n\n4. Configure environment variables in `.env`:\n   ```plaintext\n   SCRAPER_API_KEY=\u003cScraper API Key\u003e\n   GROQ_API_KEY=\u003cGroq API Key\u003e\n   ```\n\n5. Start the server:\n   ```bash\n   python app.py\n   ```\n\nThe backend server will be available at `http://localhost:5000`.\n\n## Frontend Setup\n\n### Prerequisites\n\n- Node.js 16+\n\n### Installation\n\n1. Navigate to the frontend directory:\n   ```bash\n   cd frontend\n   ```\n\n2. Install dependencies:\n   ```bash\n   npm install\n   ```\n\n3. Configure environment variables in `.env`:\n   ```plaintext\n   REACT_APP_CLIENT_ID=\u003cGoogle Oauth Client ID\u003e\n   REACT_APP_API_KEY=\u003cGoogle Cloud API Key\u003e\n   ```\n\n4. Start the development server:\n   ```bash\n   npm start\n   ```\n\nThe frontend will be available at `http://localhost:3000`.\n\n## Usage Guide\n\n1. 
**Upload your data**:\n   - Upload a CSV file or connect to a Google Sheet.\n     - ![image](https://github.com/user-attachments/assets/cbf94e3c-b77f-4622-a80e-187906cfbf6f)\n   - Select the column containing entities for the search query.\n     - ![image](https://github.com/user-attachments/assets/d062875f-d280-4eb6-998d-a1e9ff46ae1b)\n\n2. **Define your prompt**:\n   - Input a query template like: \"Find the email address of {entity}.\"\n     - ![image](https://github.com/user-attachments/assets/494dc646-3166-413d-b1f2-78757c68f63f)\n   - The placeholder `{entity}` is replaced with the entity from each row.\n\n3. **Retrieve and process data**:\n   - ScraperWizard performs automated searches and processes results through the integrated LLM.\n\n4. **View and download results**:\n   - Extracted data is displayed in a table format.\n     - ![image](https://github.com/user-attachments/assets/5f0c5797-1332-4b31-a496-21602e415a86)\n   - Download the results as a CSV file.\n\n## Optional Features\n\n- Real-time Google Sheets updates with the extracted data.\n- Robust error handling for failed queries.\n\n## Technologies Used\n\n- **Backend**: Python, Flask\n- **Data Handling**: Pandas, Google Sheets API\n- **Search API**: ScraperAPI\n- **LLM API**: Groq\n- **Frontend**: ReactJS, Tailwind CSS\n\nMade by Srikar Veluvali.\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fsrikarveluvali%2Fscraperwizard","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fsrikarveluvali%2Fscraperwizard","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fsrikarveluvali%2Fscraperwizard/lists"}