{"id":23771645,"url":"https://github.com/arjuncodess/webcrawlai","last_synced_at":"2025-04-05T21:08:03.686Z","repository":{"id":269994344,"uuid":"909066340","full_name":"ArjunCodess/WebCrawlAI","owner":"ArjunCodess","description":null,"archived":false,"fork":false,"pushed_at":"2025-02-03T16:12:09.000Z","size":11,"stargazers_count":115,"open_issues_count":0,"forks_count":25,"subscribers_count":5,"default_branch":"main","last_synced_at":"2025-03-29T20:05:58.232Z","etag":null,"topics":["brightdata","gemini","python","selenium"],"latest_commit_sha":null,"homepage":"https://webcrawlai.onrender.com/","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":null,"status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/ArjunCodess.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":null,"code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2024-12-27T16:47:04.000Z","updated_at":"2025-03-02T02:34:00.000Z","dependencies_parsed_at":"2025-02-21T04:26:46.271Z","dependency_job_id":"49ce7ede-a05e-4e0e-8083-39d18bba4ecd","html_url":"https://github.com/ArjunCodess/WebCrawlAI","commit_stats":null,"previous_names":["arjuncodess/webcrawlai"],"tags_count":0,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/ArjunCodess%2FWebCrawlAI","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/ArjunCodess%2FWebCrawlAI/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/ArjunCodess%2FWebCrawlAI/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/ArjunCodess%2FWebCrawlAI/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/ArjunCodess","download_url":"https://codeload.github.com/ArjunCodess/WebCrawlAI/tar.gz/refs/heads/main","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":247399877,"owners_count":20932876,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["brightdata","gemini","python","selenium"],"created_at":"2025-01-01T04:20:39.886Z","updated_at":"2025-04-05T21:08:03.667Z","avatar_url":"https://github.com/ArjunCodess.png","language":"Python","funding_links":[],"categories":[],"sub_categories":[],"readme":"# WebCrawlAI: AI-Powered Web Scraper\n\nThis project implements a web scraping API that leverages the Gemini AI model to extract specific information from websites.  It provides a user-friendly interface for defining extraction criteria and handles dynamic content and CAPTCHAs using a scraping browser.  The API is deployed on Render and is designed for easy integration into various projects.\n\n## Features\n\n*   Scrapes data from websites, handling dynamic content and CAPTCHAs.\n*   Uses Gemini AI to precisely extract the requested information.\n*   Provides a clean JSON output of the extracted data.\n*   Includes a user-friendly web interface for easy interaction.\n*   Error handling and retry mechanisms for robust operation.\n*   Event tracking using GetAnalyzr for monitoring API usage.\n\n## Usage\n\n1.  **Access the Web Interface:** Visit [https://webcrawlai.onrender.com/](https://webcrawlai.onrender.com/)\n2.  **Enter the URL:** Input the website URL you want to scrape.\n3.  **Specify Extraction Prompt:** Provide a clear description of the data you need (e.g., \"Extract all product names and prices\").\n4.  **Click \"Extract Information\":** The API will process your request, and the results will be displayed.\n\n## Installation\n\nThis project is deployed as a web application. No local installation is required for usage.  However, if you wish to run the code locally, follow these steps:\n\n1.  **Clone the Repository:**\n    ```bash\n    git clone https://github.com/YOUR_USERNAME/WebCrawlAI.git\n    cd WebCrawlAI\n    ```\n2.  **Install Dependencies:**\n    ```bash\n    pip install -r requirements.txt\n    ```\n3.  **Set Environment Variables:** Create a `.env` file (refer to `.env.example`) and populate it with your `SBR_WEBDRIVER` (Bright Data Scraping Browser URL) and `GEMINI_API_KEY` (Google Gemini API Key).\n4.  **Run the Application:**\n    ```bash\n    python main.py\n    ```\n\n## Technologies Used\n\n*   **Flask (3.0.0):** Web framework for building the API.\n*   **BeautifulSoup (4.12.2):** HTML/XML parser for extracting data from web pages.\n*   **Selenium (4.16.0):** For automating browser interactions, handling dynamic content and CAPTCHAs.\n*   **lxml:** Fast and efficient XML and HTML processing library.\n*   **html5lib:**  For parsing HTML documents.\n*   **python-dotenv (1.0.0):** For managing environment variables.\n*   **google-generativeai (0.3.1):**  Integrates the Gemini AI model for data parsing and extraction.\n*   **axios:** JavaScript library for making HTTP requests (client-side).\n*   **marked:** JavaScript library for rendering Markdown (client-side).\n*   **Tailwind CSS:** Utility-first CSS framework for styling (client-side).\n*   **GetAnalyzr:** For event tracking and API usage monitoring.\n*   **Bright Data Scraping Browser:** Provides fully-managed, headless browsers for reliable web scraping.\n\n\n## API Documentation\n\n**Endpoint:** `/scrape-and-parse`\n\n**Method:** `POST`\n\n**Request Body (JSON):**\n\n```json\n{\n  \"url\": \"https://www.example.com\",\n  \"parse_description\": \"Extract all product names and prices\"\n}\n```\n\n**Response (JSON):**\n\n**Success:**\n\n```json\n{\n  \"success\": true,\n  \"result\": {\n    \"products\": [\n      {\"name\": \"Product A\", \"price\": \"$10\"},\n      {\"name\": \"Product B\", \"price\": \"$20\"}\n    ]\n  }\n}\n```\n\n**Error:**\n\n```json\n{\n  \"error\": \"An error occurred during scraping or parsing\"\n}\n```\n\n\n## Dependencies\n\nThe project dependencies are listed in `requirements.txt`.  Use `pip install -r requirements.txt` to install them.\n\n## Contributing\n\nContributions are welcome! Please open an issue or submit a pull request.\n\n## Testing\n\nNo formal testing framework is currently implemented.  Testing should be added as part of future development.\n\n\n*README.md was made with [Etchr](https://etchr.dev)*","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Farjuncodess%2Fwebcrawlai","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Farjuncodess%2Fwebcrawlai","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Farjuncodess%2Fwebcrawlai/lists"}